ConSIG: Consistent Identification of Gene/Protein Signature from OMIC Data

Start Analysis with ConSIG

Test Using Sample Data:

The proteomics benchmark dataset PXD005144 is from the PRoteomics IDEntifications (PRIDE) database and contains 66 samples from patients with pancreatic cancer and 36 samples from people with chronic pancreatitis (Saraswat M, et al. Cancer Med. 6(7): 1738-1751, 2017). This sample dataset is available for download.

The transcriptomics benchmark dataset GSE28702 is from the Gene Expression Omnibus database and contains 42 samples from responders to FOLFOX therapy and 41 samples from non-responders (Tsuji S, et al. Br J Cancer. 106(1): 126-132, 2012). This sample dataset is available for download.

Graphic Illustration of ConSIG Workflow

Cite ConSIG

F. C. Li, J. Y. Yin, M. K. Lu, Q. X. Yang, Z. Y. Zeng, B. Zhang, Z. R. Li, Y. Q. Qiu, H. B. Dai, Y. Z. Chen*, F. Zhu*. ConSIG: consistent discovery of molecular signature from OMIC data. Briefings in Bioinformatics. 23(4): bbac253 (2022). PMID: 35758241.

Browser and Operating System (OS) Tested for Smoothly Running ConSIG

ConSIG is powered by R shiny. It is free and open to all users with no login requirement and can be readily accessed by a variety of popular web browsers and operating systems as shown below.

OS	Chrome	Firefox	Edge	Safari
Linux (Ubuntu-17.04)	v96.0.4664.110	v52.0.1	N/A	N/A
MacOS (v10.1)	v96.0.4664.93	v70.0.1	N/A	v8
Windows (v10)	v96.0.4664.93	v70.0.1	v96.0.1054.53	N/A

Table of Contents

1. Brief Introduction of ConSIG

1.1 The Underlying Algorithmic Theory of ConSIG

1.2 Required Formats of the Input Files

1.3 The Compatibility of Browser and Operating System (OS)

2. Step-by-step Instruction on the Usage of ConSIG

2.1 Data Uploading and Preprocessing

2.2 Parameter Setting of ConSIG

2.3 Running ConSIG to Generate the Optimal Signature

2.4 Performance Evaluation of Signature Identified by ConSIG

2.5 Enrichment Analysis Based on Identified Signature

3. Filter Feature Selection Methods Used for Consistency Comparison with ConSIG

3.1 Univariate Filter Methods

3.2 Multivariate Filter Methods

4. Enrichment Analysis

4.1 GO Term Enrichment Analysis

4.2 DO Term Enrichment Analysis

1. Brief Introduction of ConSIG

1.1 The Underlying Algorithmic Theory of ConSIG

ConSIG is a new strategy based on SVM-RFE (support vector machine recursive feature elimination), constructed by (1) integrating repeated random sampling with consensus scoring and (2) evaluating the ranking consistency among multiple datasets. The workflow is as follows. First, the combined dataset is separated into multiple unique training-test datasets using repeated random sampling: each training dataset consists of a random half of the samples, and the corresponding test dataset comprises the remaining samples. Second, all these datasets are randomly grouped into N sampling groups (each with M unique training-test datasets). In each sampling group, the gene signature is identified from the training dataset using the SVM-RFE algorithm, and the classification performance of the signature is evaluated on the corresponding test dataset using an SVM model with the optimal parameters. Third, to increase the stability of the signatures identified from the various datasets, the ranking consistency among the M training-test datasets in each sampling group is evaluated by a sequential consensus scoring algorithm.

1.2 Required Formats of the Input Files

ConSIG supports only the analysis of comparative transcriptomics and proteomics data matrices, so users need to extract the processed data of their samples into a data matrix before performing the analysis. The file required at the beginning of a ConSIG analysis is a sample-by-feature matrix in csv, xls/xlsx or txt format. The sample name and the class label are provided, in that order, in the first 2 columns of the input file. These 2 columns must be named “Sample” and “Class”, and the names must not be changed during the entire analysis. The remaining columns should be named by UniProt ID, gene symbol or Entrez ID if a subsequent enrichment analysis is needed. Sample names are chosen freely by the user but must be unique; the class label refers to the 2 classes of samples being compared and is given as the number “0” or “1”.
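
As an illustration of the layout described above, the following minimal sketch checks a csv matrix before upload. It is not part of ConSIG; the function name validate_consig_matrix and the example gene columns are hypothetical, and only the rules stated in this section are checked.

```python
import csv
import io

def validate_consig_matrix(text, delimiter=","):
    """Return a list of problems found in a sample-by-feature matrix (empty if none)."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    # The first two columns must be named exactly "Sample" and "Class".
    if header[:2] != ["Sample", "Class"]:
        problems.append('first two columns must be named "Sample" and "Class"')
    if len(header) < 3:
        problems.append("no feature columns found")
    # Sample names are free-form but must be unique.
    names = [r[0] for r in body]
    if len(names) != len(set(names)):
        problems.append("sample names must be unique")
    # Exactly two classes, labeled 0 and 1.
    labels = {r[1] for r in body if len(r) > 1}
    if labels != {"0", "1"}:
        problems.append('"Class" must contain the labels 0 and 1 only')
    return problems

# Hypothetical example files (gene symbols as feature names).
good = "Sample,Class,TP53,KRAS\nS1,0,1.2,3.4\nS2,1,2.2,0.4\n"
bad = "ID,Group,TP53\nS1,A,1.2\nS1,B,2.2\n"
good_problems = validate_consig_matrix(good)
bad_problems = validate_consig_matrix(bad)
```

For a tab-delimited txt file, the same check can be run with delimiter="\t".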

1.3 The Compatibility of Browser and Operating System (OS)

ConSIG is powered by R shiny. It is free and open to all users with no login requirement and can be readily accessed by a variety of popular web browsers and operating systems as shown below.

OS	Chrome	Firefox	Edge	Safari
Linux (Ubuntu-17.04)	v96.0.4664.110	v52.0.1	N/A	N/A
MacOS (v10.1)	v96.0.4664.93	v70.0.1	N/A	v8
Windows (v10)	v96.0.4664.93	v70.0.1	v96.0.1054.53	N/A

2. Step-by-step Instruction on the Usage of ConSIG

Analysis and subsequent performance assessment are started by clicking on the “Home” panel on the homepage of ConSIG. The whole process provided by ConSIG consists of: (Step α) data uploading and preprocessing, (Step β) ConSIG parameter setting, (Step γ) running ConSIG, (Step δ) performance evaluation of ConSIG, and (Step ε) enrichment analysis based on the signature identified by ConSIG.

2.1 Step α: Data Uploading and Preprocessing

ConSIG is designed for the analysis of comparative transcriptomics and proteomics data matrices, so in this step users can upload only data in the format specified in Section 1.2. To start a new round of analysis, first select the first item, "Start a New Mission", under "Please Select the Mission Type", and then click "Browse..." to choose the file to upload. ConSIG also provides two sample datasets for testing, the proteomic dataset PXD005144 and the transcriptomic dataset GSE28702. The desired sample dataset can be selected for analysis by clicking the corresponding button.

After the data are uploaded, 3 preprocessing steps are provided: missing value imputation, data filtering and data normalization. The imputation methods available are BPCA Imputation, Column Mean Imputation, Column Median Imputation, Half of the Minimum Pos-value, KNN Imputation, SVD Imputation and Zero Imputation. Two methods frequently applied to data filtering are covered: Mean Intensity Value and Standard Deviation. Moreover, 21 popular data normalization methods are available in ConSIG: Auto Scaling, Contrast, Cubic Splines, Cyclic Loess, Eigen MS, MSTUS, PQN, Quantile, Level Scaling, Linear Baseline, Li-Wong, Mean, Median, Pareto Scaling, Power Scaling, Range Scaling, Total Sum, Vast Scaling, VSN, Log Transformation and Cube Root Transformation. After selecting the preferred methods and parameters, proceed by clicking the PROCESS button; a summary and visualization of the data before and after preprocessing are then generated automatically. All resulting data and figures can be downloaded by clicking the corresponding Download button.
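
To make the three preprocessing steps concrete, here is a minimal sketch that chains one method from each step: Column Median Imputation, Standard Deviation filtering and Auto Scaling. It is an illustration only, not the ConSIG implementation; the function names and the threshold min_sd are assumptions, and rows are samples with None marking a missing value.

```python
import statistics

def impute_column_median(matrix):
    """Replace each missing value (None) with the median of its column."""
    columns = list(zip(*matrix))
    medians = [statistics.median(v for v in col if v is not None) for col in columns]
    return [[medians[j] if v is None else v for j, v in enumerate(row)] for row in matrix]

def filter_by_sd(matrix, min_sd):
    """Keep only the columns whose sample standard deviation is at least min_sd."""
    columns = list(zip(*matrix))
    keep = [j for j, col in enumerate(columns) if statistics.stdev(col) >= min_sd]
    return [[row[j] for j in keep] for row in matrix], keep

def auto_scale(matrix):
    """Center each column to mean 0 and scale it to unit standard deviation."""
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    # Constant columns (sd == 0) must be filtered out beforehand.
    return [[(v - means[j]) / sds[j] for j, v in enumerate(row)] for row in matrix]

# Toy matrix: 3 samples x 3 features, one missing value, one constant feature.
raw = [
    [1.0, 5.0, 2.0],
    [3.0, 5.0, None],
    [5.0, 5.0, 6.0],
]
imputed = impute_column_median(raw)
filtered, kept_columns = filter_by_sd(imputed, min_sd=0.5)
scaled = auto_scale(filtered)
```

In this toy case the missing value is filled with 4.0, the constant second feature is removed by the filter, and both remaining features are scaled to the same range.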

2.2 Step β: Parameter Setting of ConSIG

The predictor genes of ConSIG are selected by SVM-Recursive Feature Elimination (SVM-RFE), a wrapper method that selects predictor genes by eliminating non-predictor genes according to a gene-ranking function generated from an SVM-based class differentiation system. It is therefore crucial to choose a kernel function type and parameters appropriate to the data. ConSIG offers two confirmation modes: parameter tuning by grid search and parameter setting by user definition.
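
The recursive elimination loop behind this kind of wrapper method can be sketched as follows. This is a simplified illustration, not the ConSIG code: the ranking criterion is pluggable, and the function mean_difference_weights is a deliberately simple stand-in (difference of class means) used here in place of the weights of a trained SVM. All names are hypothetical.

```python
def recursive_feature_elimination(X, y, weight_fn, n_keep):
    """Repeatedly drop the feature with the smallest squared weight until n_keep remain."""
    remaining = list(range(len(X[0])))   # column indices still in play
    eliminated = []                      # indices in order of removal (least useful first)
    while len(remaining) > n_keep:
        subset = [[row[j] for j in remaining] for row in X]
        weights = weight_fn(subset, y)   # re-estimated on the surviving features each round
        worst = min(range(len(remaining)), key=lambda i: weights[i] ** 2)
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated

def mean_difference_weights(X, y):
    """Stand-in ranking criterion (NOT an SVM): class-1 mean minus class-0 mean per feature."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    return [
        sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        for j in range(len(X[0]))
    ]

# Toy data: feature 0 separates the classes, features 1 and 2 barely differ.
X = [
    [5.0, 1.0, 0.2],
    [6.0, 1.2, 0.1],
    [1.0, 0.9, 0.3],
    [2.0, 1.0, 0.2],
]
y = [1, 1, 0, 0]
selected, eliminated = recursive_feature_elimination(X, y, mean_difference_weights, n_keep=1)
```

In SVM-RFE proper, weight_fn would train an SVM on the surviving features in every round and return its feature weights; the loop itself is unchanged.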

Grid search is a tuning technique that attempts to find the optimum values of hyperparameters by exhaustively evaluating a model over a specified set of parameter values. If you have not previously tuned parameters on your uploaded data, or if you are not familiar with your data or the SVM method, you can choose this confirmation mode: you simply select a range for each parameter, and ConSIG automatically searches within that range for the hyperparameters most suitable for your uploaded data.
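
A generic grid search can be sketched in a few lines. The example below is only an illustration under stated assumptions: the parameter names cost and gamma and the function toy_score are hypothetical, and toy_score stands in for the cross-validated performance of an SVM trained with the given parameters.

```python
import math
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Stand-in for model performance; constructed to peak at cost=10, gamma=0.01."""
    return -abs(math.log10(params["cost"]) - 1) - abs(math.log10(params["gamma"]) + 2)

grid = {"cost": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost of the search grows with the product of the range sizes (here 4 × 3 = 12 evaluations), which is why narrower ranges run faster.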

ConSIG uses repeated random sampling to divide the preprocessed dataset into N × M unique training-test datasets. Each training dataset consists of a random half of the samples, and the corresponding test dataset consists of the remaining samples. These unique training-test datasets are randomly grouped into N sampling groups (each group has M unique training-test datasets). In each sampling group, the gene signature is identified from the training datasets using the SVM-RFE algorithm, and the results of the different sampling groups are used to verify the consistency of ConSIG. Users therefore need to set the parameters "The number of sampling groups (N)" and "Training-test datasets in each sampling group (M)" according to their needs, weighing the accuracy of feature elimination against the time spent on computation.
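
The sampling scheme described above can be sketched as follows. This is a minimal illustration rather than the ConSIG implementation: the function name and the seed argument are assumptions, and the split is not stratified by class because the text does not say whether ConSIG stratifies.

```python
import math
import random

def make_sampling_groups(sample_ids, n_groups, m_splits, seed=0):
    """Build n_groups groups, each holding m_splits unique half/half training-test splits."""
    half = len(sample_ids) // 2
    needed = n_groups * m_splits
    if math.comb(len(sample_ids), half) < needed:
        raise ValueError("not enough samples to form that many unique splits")
    rng = random.Random(seed)
    seen, splits = set(), []
    while len(splits) < needed:
        train = frozenset(rng.sample(sample_ids, half))
        if train in seen:            # enforce uniqueness of the training sets
            continue
        seen.add(train)
        test = [s for s in sample_ids if s not in train]
        splits.append((sorted(train), test))
    rng.shuffle(splits)              # random assignment of splits to groups
    return [splits[g * m_splits:(g + 1) * m_splits] for g in range(n_groups)]

sample_ids = [f"S{i}" for i in range(1, 11)]
groups = make_sampling_groups(sample_ids, n_groups=3, m_splits=4)
```

With N = 3 and M = 4 this produces 12 training-test datasets; the number of SVM-RFE runs grows as N × M, which is the accuracy versus run-time trade-off mentioned above.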

Once you have set the parameters, click the "Start ConSIG" button that appears at the bottom of the sidebar to start the ConSIG program and begin feature elimination.

2.3 Step γ: Running ConSIG to Generate the Optimal Signature

Because of the repeated random sampling strategy, ConSIG must build SVM models and determine feature weights for many separate subsets of the data, so a run can take a relatively long time. For this reason, ConSIG provides the "ConSIG Program Process Monitor" page, where users can follow the run in real time through a real-time process report, a real-time evaluation of the ConSIG program and a visualization of the real-time elimination.

In addition, ConSIG can retrieve a submitted task using the unique Mission ID or URL generated for that task, so users can simply record their Mission ID or URL after submitting a task and then close ConSIG with confidence.

2.4 Step δ: Performance Evaluation of Signature Identified by ConSIG

Experts working on the discovery of predictive proteomic biomarkers have long been plagued by the difficulty of reproducing their research results, even with the same input dataset and feature selection (FS) method. To increase the confidence of domain experts in their findings and identified biomarkers, consistency has thus become an important criterion. In ConSIG, a variety of sub-datasets are first generated by repeated random sampling of the original dataset. Second, multiple feature-lists are identified from these sub-datasets. Finally, the consistency among the feature-lists discovered from the different sub-datasets is assessed using the Relative Weighted Consistency (CWrel). CWrel is calculated from multiple signatures: it counts the number of signatures in which each feature occurs and the total number of occurrences of all features in all signatures, and uses the ratio of the two to represent overall robustness. It is applied in ConSIG to allow a more reliable evaluation of the consistency of the various feature-lists.
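
The calculation can be sketched as follows, assuming the definition of relative weighted consistency commonly used in the feature selection stability literature; the exact variant used by ConSIG is not spelled out here, so this is an illustration and the function name is hypothetical. With n signatures, F_f the number of signatures containing feature f, N the sum of all F_f and |Y| the total number of features, the weighted consistency is CW = Σ_f (F_f / N) · (F_f − 1) / (n − 1), and CWrel rescales CW between its smallest and largest achievable values for the given N, n and |Y|.

```python
from collections import Counter

def relative_weighted_consistency(signatures, n_total_features):
    """CWrel of a list of signatures (feature collections), from 0 (unstable) to 1 (identical)."""
    n = len(signatures)
    if n < 2:
        raise ValueError("at least two signatures are required")
    counts = Counter(f for sig in signatures for f in set(sig))
    N = sum(counts.values())             # total occurrences of all features
    if N == 0:
        raise ValueError("signatures are empty")
    # Weighted consistency: frequent features contribute more.
    cw = sum((F / N) * ((F - 1) / (n - 1)) for F in counts.values())
    D = N % n_total_features
    H = N % n
    # Lowest CW: occurrences spread as evenly as possible over all features.
    cw_min = (N**2 - n_total_features * (N - D) - D**2) / (n_total_features * N * (n - 1))
    # Highest CW: as many features as possible occur in every signature.
    cw_max = (H**2 + N * (n - 1) - H * n) / (N * (n - 1))
    if cw_max == cw_min:
        return cw
    return (cw - cw_min) / (cw_max - cw_min)

identical = relative_weighted_consistency([{"A", "B"}, {"A", "B"}, {"A", "B"}], 10)
partial = relative_weighted_consistency([{"A", "B"}, {"A", "C"}, {"A", "D"}], 10)
disjoint = relative_weighted_consistency([{"A", "B"}, {"C", "D"}, {"E", "F"}], 10)
```

Identical signatures give the maximum value and fully disjoint signatures the minimum, with partially overlapping signatures in between.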

To show more intuitively how ConSIG compares in terms of consistency, 7 filter feature selection methods based on varied feature searching and scoring theories are included in ConSIG. They comprise univariate filter methods (Fold Change Analysis, Linear Model & Bayes, Student t-test, and Wilcoxon Rank-sum Test) and multivariate filter methods (Correlation-based Method, PLS-DA, and Relief).

In the consistency assessment, you can choose any number of these classical filter methods to compare with ConSIG. Each chosen method is applied in the same way as ConSIG to identify multiple signatures from the different data subsets, and the consistency of the signatures identified by each method is then evaluated with CWrel on a uniform scale.

2.5 Step ε: Enrichment Analysis Based on Identified Signature

The optimal feature-list identified in any biomarker discovery study should be directly related to the phenotype (preferably as upstream as possible) and should play a real role in the phenotype, as opposed to merely being correlated with it (Goh WWB, et al. Brief Bioinform. 20: 347-55, 2019).

To measure the level of phenotype association, all features in the identified optimal signature are first tested for enrichment in the Biological Process, Cellular Component and Molecular Function terms (or all terms) of the Gene Ontology (GO) database using the clusterProfiler package, which illustrates the processes, components and functions involved. They are then tested for enrichment against the Disease Ontology (DO) database and gene-disease associations using the DOSE package, which helps to elucidate the molecular mechanisms of complex diseases from high-throughput data. Finally, the enrichment results are presented in various graphical formats in ConSIG and are available for download.

3. Filter Feature Selection Methods Used for Consistency Comparison with ConSIG

3.1 Univariate Filter Methods

Fold Change Analysis. Fold change (FC) is a basic and widely used method for identifying differential gene expression; it refers to the ratio of expression levels between two groups of samples (Feng J, et al. Bioinformatics. 28: 2782-8, 2012). FC has also been applied in metabolomics, for example to identify urinary metabolomic biomarkers of aminoglycoside nephrotoxicity in newborn rats (Hanna MH, et al. Pediatr Res. 73: 585-91, 2013).
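
As a small worked example, the sketch below computes the log2 fold change of one feature between the two classes. It assumes positive intensities on the original (not log-transformed) scale; the function name is hypothetical.

```python
import math

def log2_fold_change(values, labels):
    """log2 of (mean of class 1) / (mean of class 0) for one feature."""
    case = [v for v, c in zip(values, labels) if c == 1]
    control = [v for v, c in zip(values, labels) if c == 0]
    return math.log2((sum(case) / len(case)) / (sum(control) / len(control)))

# Class 1 averages 8.0 and class 0 averages 2.0, a 4-fold increase.
lfc = log2_fold_change([8.0, 8.0, 2.0, 2.0], [1, 1, 0, 0])
```

Features are then ranked by the absolute value of the log2 fold change, so that 4-fold increases and 4-fold decreases are treated alike.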

Student t-test. Student's t-test compares the means of two data sets and judges whether the difference between them is statistically significant. It is a test based on normal distribution theory (Kumar N, et al. Bioinformation. 13: 202-8, 2017).
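
The test statistic can be computed as in the following sketch, which uses the classical pooled-variance (equal variance) form; the function name is hypothetical, and the p-value would be read from a t distribution with the returned degrees of freedom.

```python
import math
import statistics

def student_t_statistic(a, b):
    """Two-sample Student t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / df
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

# Means 6 and 2, both sample variances equal to 1.
t_value, dof = student_t_statistic([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
```

Here the mean difference is 4 and the standard error is sqrt(2/3), so t is about 4.90 with 4 degrees of freedom.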

Wilcoxon Rank-sum Test. The Wilcoxon rank-sum test is generally used to detect whether 2 data sets come from the same population. It is frequently used in statistical practice to compare measures of location when the underlying distributions are far from normal or not known in advance (Rosner B, et al. Biometrics. 59: 1089-98, 2003).
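
The rank-based statistic can be sketched as follows: the two groups are pooled and ranked, tied values receive the average of their ranks, and the ranks of the first group are summed. The function name is hypothetical, and the p-value calculation is omitted.

```python
def rank_sum_statistics(a, b):
    """Return (W, U): rank sum of group a and the corresponding Mann-Whitney U."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1        # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    w = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    u = w - len(a) * (len(a) + 1) / 2
    return w, u

# The three values of 2.3 share ranks 2, 3 and 4, so each receives rank 3.
w_stat, u_stat = rank_sum_statistics([1.1, 2.3, 2.3], [2.3, 3.5, 4.0, 5.2])
```

Because only ranks enter the statistic, the result is unaffected by outliers or by any monotone transformation of the data.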

3.2 Multivariate Filter Methods

Correlation-based Method. The correlation-based method is a multivariate filter method that evaluates a feature subset according to the predictive ability of each feature in it and the correlation among those features. Its core hypothesis is that good feature subsets contain features that are strongly predictive of the class yet weakly correlated with each other (Batushansky A, et al. Biomed Res Int. 2016: 8313272, 2016).
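
One common way to turn this hypothesis into a score is the merit heuristic of correlation-based feature selection, merit = k · r_cf / sqrt(k + k(k − 1) · r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. Whether ConSIG uses exactly this formula is an assumption; the sketch below, with a hypothetical function name, only illustrates the principle.

```python
import math

def cfs_merit(feature_class_corrs, mean_feature_feature_corr):
    """Merit of a feature subset: high class correlation, low redundancy."""
    k = len(feature_class_corrs)
    r_cf = sum(abs(r) for r in feature_class_corrs) / k
    return k * r_cf / math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)

# Two subsets with equally predictive features but different redundancy.
merit_low_redundancy = cfs_merit([0.6, 0.6, 0.6], mean_feature_feature_corr=0.1)
merit_high_redundancy = cfs_merit([0.6, 0.6, 0.6], mean_feature_feature_corr=0.8)
```

The subset whose features are less correlated with each other receives the higher merit, even though the individual features are equally predictive.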

PLS-DA. Partial Least Squares Discriminant Analysis (PLS-DA) uses the partial least squares (PLS) algorithm to establish a model for predicting the categories of samples or discriminative variable selection (Lee LC, et al. Analyst. 143: 3526-39, 2018). It consists of a classical PLS regression analysis in which the response regressor is the class label. PLS components are built by trying to find a proper compromise between describing the data and predicting the response.

Relief. Relief is a multivariate filter approach that can estimate attributes very efficiently. Its key idea is to estimate an attribute based on how well its values distinguish between instances that are near each other (Kononenko, 1994). In this method, the values of a significant attribute are correlated with the attribute values of an instance of the same class and uncorrelated with the attribute values of an instance of the other class. Relief has been applied to explore the association between taste and metabolite profiles of Japanese refined sake (Sugimoto et al., 2010) and to identify metabolic markers in prostate cancer (Osl et al., 2008).

4. Enrichment Analysis

4.1 GO Term Enrichment Analysis

The Gene Ontology (GO) annotates genes with biological processes, molecular functions and cellular components in a directed acyclic graph structure. Incorporating the biological knowledge provided by such ontologies has become the most common way to search for shared functions among genes (Ashburner M, et al. Nat Genet. 25(1): 25-9, 2000). ConSIG uses the enrichGO function in the clusterProfiler package to test all features in the final signature for enriched GO terms, and the enrichment results are presented as a bubble map, a gene-concept network and an enrichment map of GO terms.
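
Over-representation analyses of this kind are typically based on the hypergeometric test. The following sketch shows that calculation for a single term; it is a plain Python illustration of the statistic, not a call into clusterProfiler, and the function and argument names are hypothetical.

```python
from math import comb

def enrichment_p_value(k, n, M, N):
    """P(X >= k) for a hypergeometric draw.

    k: signature genes annotated to the term
    n: genes in the signature
    M: background genes annotated to the term
    N: genes in the background
    """
    upper = min(n, M)
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# 3 of 4 signature genes fall in a term that covers 5 of 20 background genes.
p_value = enrichment_p_value(k=3, n=4, M=5, N=20)
```

In practice one p-value is computed per term and the p-values are then adjusted for multiple testing before terms are reported as enriched.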

4.2 DO Term Enrichment Analysis

The Disease Ontology (DO) annotates human genes in the context of disease. DO is an important annotation for translating molecular findings from high-throughput data into clinical relevance (Schriml LM, et al. Nucleic Acids Res. 40(D1): D940-6, 2012). ConSIG uses the enrichDO function in the DOSE package (Yu G, et al. Bioinformatics. 31(4): 608-9, 2015) to test the features in the final signature for enriched DO terms, and the enrichment results are presented as a bubble map, a gene-concept network and an UpSet plot of DO terms.