POSREG was constructed to discover the optimal feature-list for a given proteomic study based on a comprehensive assessment from multiple perspectives. It works by (1) identifying various feature-lists of good reproducibility (Trends Biotechnol. 36: 488-498, 2018) using their Relative Weighted Consistency (Inf Fusion. 35: 132-147, 2017) and aggregating them into an ensemble feature rank using ensemble learning, (2) then assessing the generalizability of the ensemble feature rank to acquire the optimal signature (Nat Commun. 7: 10259, 2016), and (3) finally confirming the top-ranked list by enrichment analysis-based phenotype-association (Transl Psychiatry. 9: 233, 2019). This tool is capable of analyzing proteomic datasets acquired by three different quantification measurements: SWATH-MS, Peak Intensity and Spectral Counting.
Thanks a million for using and improving POSREG, and please feel free to report any errors to Dr. Li at lifengcheng@zju.edu.cn.
Cite POSREG:
F. C. Li, Y. Zhou, Y. Zhang, J. Y. Yin, Y. Q. Qiu, J. Q. Gao, F. Zhu*. POSREG: proteomic signature discovered by simultaneously optimizing its reproducibility and generalizability. Brief Bioinform. doi: 10.1093/bib/bbac040 (2022). PMID: 35183059
Browsers and Operating Systems (OS) Tested for Smoothly Running POSREG:
POSREG is powered by R Shiny. It is free and open to all users with no login requirement and can be readily accessed by a variety of popular web browsers and operating systems, as shown below.

The source code of POSREG is also provided to enable the assessment on a local computer. To run the local version of POSREG, five sequential steps should be performed. First, install the R and RStudio environments after downloading their installation files. Second, download the POSREG source code. Third, install all required R packages by executing the commands provided in the Packages Installation Manual. Fourth, set the working directory of the local POSREG by changing the home directory of POSREG in line 84 of the server.R file. Finally, run POSREG in RStudio by clicking the "Run App" button. This local version enables the discovery of the optimal feature-list based on all feature elimination methods.
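For steps 3-5, a minimal sketch of the corresponding RStudio console commands is given below, assuming the source code has been unpacked to ~/POSREG (this path is a placeholder; the authoritative package list is given in the Packages Installation Manual):

# Step 3: install the core dependency (the full list of required packages
# is provided in the Packages Installation Manual)
install.packages("shiny")
# Step 4: edit line 84 of server.R so that it points at your local copy, e.g.
# setwd("~/POSREG")   # hypothetical home directory of POSREG
# Step 5: launch the app (equivalent to clicking the "Run App" button)
shiny::runApp("~/POSREG")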
Welcome to Download the Sample Data for Testing and for Correcting the File Format
- Data-independent Acquisition (DIA)
In data-independent acquisition (DIA) analysis, all peptides within a defined mass-to-charge (m/z) window are subjected to fragmentation; the analysis is repeated as the mass spectrometer marches up the full m/z range, which results in accurate peptide quantification without being limited to profiling predefined peptides of interest (Doerr A, et al. Nat Methods. 12: 35, 2015). For the DIA mode, there is only one quantification measurement, which is SWATH-MS (Gupta S, et al. Mol Cell Proteomics. 18: 806-817, 2019).
The sample data acquired by the DIA method (SWATH-MS) is the proteomics benchmark dataset PXD003972. This dataset contains 20 GRB2 (OST) knock-in mouse samples and 20 GRB2 (WT) mouse samples and can be downloaded.
- Data-dependent Acquisition (DDA)
In traditional data-dependent acquisition (DDA), a proteomic sample is digested into peptides, ionized and analyzed by mass spectrometry. Peptide signals that rise above the noise in a full-scan mass spectrum are selected for fragmentation, producing tandem (MS/MS) mass spectra that can be matched to spectra in a database (Doerr A, et al. Nat Methods. 12: 35, 2015). For the DDA mode, there are two quantification measurements: peak intensity and spectral counting (Gao BB, et al. Mol Cell Proteomics. 7: 2399-409, 2008).
The sample data acquired by the DDA method (Peak Intensity/Spectral Counting) is the proteomics benchmark dataset PXD005144. This dataset contains 66 samples of patients with pancreatic cancer and 36 samples of people with chronic pancreatitis and can be downloaded.
Summary and Visualization of the Uploaded Raw Data
A. Overview of the Uploaded Raw Data
B. Distribution Visualization of the Uploaded Raw Data
Summary and Visualization of the Data after Preprocessing
A. Missing Value Imputation
B. Data Filtering
C. Data Normalization
Table of Contents
1. The Compatibility of Browser and Operating System (OS)
2. Required Formats of the Input Files
2.1 Preprocessed Data Acquired Based on SWATH-MS
2.2 Preprocessed Data Acquired Based on Peak Intensity
2.3 Preprocessed Data Acquired Based on Spectral Counting
3. Step-by-step Instruction on the Usage of POSREG
3.1 Data Upload & Preprocess
3.2 Reproducibility Evaluation
3.3 Collectively Assess Generalizability
3.4 Phenotype-association by Enrichment
4. A Variety of Methods for Feature Selection
POSREG is powered by R Shiny. It is free and open to all users with no login requirement and can be readily accessed by a variety of popular web browsers and operating systems, as shown below.
In general, the file required at the beginning of a POSREG analysis should be a sample-by-feature matrix in csv, xls/xlsx or txt format. The sample name and class label are sequentially provided in the first 2 columns of the input file. The names of these 2 columns must be kept as “Sample” and “Class” without any changes during the entire analysis, and the names of the remaining columns are UniProt IDs or Entrez IDs. The sample name is uniquely assigned according to the preference of the user; the class ID refers to the 2 differential analytical classes of samples and is labeled with the numbers “0” and “1”.
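A minimal illustration of the expected layout (the protein IDs and values below are invented for demonstration only):

Sample,Class,P01308,P02768,P04114
patient_01,1,1520.3,87.2,430.1
patient_02,1,1490.8,90.5,NA
control_01,0,610.2,75.9,388.4

Such a file can be sanity-checked in R before uploading, assuming it is saved as input.csv (the file name is a placeholder):

dat <- read.csv("input.csv", check.names = FALSE)
stopifnot(identical(colnames(dat)[1:2], c("Sample", "Class")),  # fixed column names
          all(dat$Class %in% c(0, 1)))                          # only the two class labels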
SWATH-MS is a newly developed quantification measurement which comprehensively detects and quantifies nearly all ionized peptide fragments (Anjo SI, et al. Proteomics. 17, 2017). Owing to its distinguished sensitivity, reproducibility, accuracy and extensive dynamic range for analyzing proteomics data (Fu J, et al. Front Pharmacol. 9: 681, 2018), SWATH-MS is known as one of the most popular techniques in current MS-based proteomics studies and has great potential to address the limitations of identifying diagnostic or therapeutic targets (Ludwig C, et al. Mol Syst Biol. 14: e8126, 2018).
In this situation, a variety of software tools have been proposed for preprocessing the data acquired based on SWATH-MS, which include:
- DIA-UMPIRE: a comprehensive computational workflow and open-source software for processing data-independent acquisition mass spectrometry-based proteomics data.
- OpenSWATH: an open-source software that allows targeted analysis of DIA data based on SWATH-MS in an automated, high-throughput fashion.
- PeakView: a commercial software which covers all major components of the in-silico processes in a SWATH workflow, from extended assay library building to final statistical analysis and reporting.
- Skyline: a freely available, open-source Windows client application for building selected reaction monitoring, multiple reaction monitoring, parallel reaction monitoring (targeted MS/MS), DIA/SWATH and targeted DDA with MS1 quantitative methods.
- Spectronaut: a computational tool for targeted analysis of DIA measurements based on SWATH-MS, independent of the mass spectrometer vendor.
Sample data of this data type can be downloaded.
Peak intensity and spectral counting are the two main quantification measurements under the data-dependent acquisition mode in current proteomics studies, and peak intensity renders more favorable accuracy and a wider dynamic range than spectral counting (Asara JM, et al. Proteomics. 8: 994-9, 2008). Peak intensity is more applicable to higher-resolution mass spectrometry, where it shows better quantification accuracy, while for lower-resolution mass spectrometry its precision is impaired on account of a great amount of thermal noise (Bantscheff M, et al. Anal Bioanal Chem. 389: 1017-31, 2007).
Under this circumstance, a list of software tools is available for preprocessing the data acquired based on peak intensity, which involves:
- MaxQuant: an integrated suite of algorithms specifically developed for processing high-resolution, quantitative mass-spectrometry data, and one of the most frequently used platforms for analyzing MS-based proteome information.
- MFPaQ: a web-based application that runs on a server on which Mascot Server 2.1 and Perl 5.8 must be installed.
- OpenMS: a robust, open-source, cross-platform software specifically designed for the flexible and reproducible analysis of high-throughput MS data.
- PEAKS: a software platform providing a complete solution for discovery proteomics, including protein identification and quantification, analysis of post-translational modifications and sequence variants, and peptide/protein de novo sequencing.
- Progenesis: a new generation of bioinformatics platform targeting small-molecule discovery analysis for metabolomics and proteomics, which quantifies proteins based on peptide ion signal peak intensity.
- Proteios SE: a software which integrates protein identification search engine access into several proteomic workflows, both gel-based and liquid chromatography-based, and allows seamless combination of search results, protein inference, protein annotation and quantitation tools.
- Scaffold: a commercial bioinformatic tool which attempts to increase the confidence in protein identification reports through the use of several statistical methods.
- Thermo Proteome Discoverer: a software for workflow-driven data analysis in proteomics, integrating all the different steps of a quantitative proteomics experiment (MS/MS spectrum extraction, peptide identification and quantification) into user-configurable, automated workflows.
Sample data of this data type can be downloaded.
Apart from peak intensity, the other popular quantification measurement under the data-dependent acquisition mode in current proteomics studies is spectral counting. As a quite simple label-free quantification technique, spectral counting stands out for its extensive quantification coverage, where all the spectra of identified proteins are taken into account (Mueller LN, et al. J Proteome Res. 7: 51-61, 2008). Because of its quick screening of the differences between samples and its broad estimation of protein identification, spectral counting has been considered the best quantification measurement in this field (Bringans SD, et al. EuPA Open Proteom. 14: 1-10, 2017).
To preprocess the data acquired based on spectral counting, a number of software tools are described as follows:
- Abacus: a computational tool for extracting and preprocessing spectral count data for label-free quantitative proteomic analysis.
- Census: a quantitative software tool which can efficiently analyze high-throughput mass spectrometry data from shotgun proteomics experiments, covering various stable isotope labeling experiments (e.g., 15N, 18O, SILAC, iTRAQ and TMT) in addition to label-free experiments.
- DTASelect: a Java tool used to organize, filter and interpret results generated by SEQUEST (one of the most widely used protein database searching programs for tandem mass spectrometry).
- IRMa-hEIDI: a toolbox which provides an interactive application to assist in the validation of Mascot search results, and allows automatic filtering of Mascot identification results as well as manual confirmation or rejection of individual PSMs (a PSM is a match between a fragmentation mass spectrum and a peptide).
- MaxQuant: an integrated suite of algorithms specifically developed for processing high-resolution, quantitative MS data, which has kept up with recent advances in high-resolution instrumentation and with the development of fragmentation techniques.
- MFPaQ: a software tool that facilitates the organization, mining and validation of Mascot results, and offers different functionalities to work on validated protein lists as well as data quantification using isotopic labeling methods or label-free approaches.
- ProteinProphet: a statistical model designed for computing the probabilities that proteins are present in a sample on the basis of peptides assigned to tandem mass (MS/MS) spectra acquired from a proteolytic digest of the sample.
- Scaffold: a feature-rich software suite assisting in the analysis, visualization, quantification, annotation and validation of complex LC-MS/MS experiments.
Sample data of this data type can be downloaded.
This website is free and open to all users, requires no login, and can be readily accessed by all popular web browsers, including Google Chrome, Mozilla Firefox, Safari, Internet Explorer 10 (or later), and so on. Analysis and subsequent performance assessment are started by clicking on the “Analysis” panel on the homepage of POSREG. The collection of web services and the whole process provided by POSREG can be summarized into 4 steps: (3.1) data upload & preprocess, (3.2) reproducibility evaluation, (3.3) collectively assess generalizability and (3.4) phenotype-association by enrichment. A report containing the evaluation results is also generated and can be downloaded in PDF, HTML or DOC format. The flowchart below summarizes the analysis process in POSREG.
The first radio checkbox in STEP-1 on the left side of the Analysis page is for users to select the way of uploading data. Users can choose to upload their own proteomics data or to directly load sample data.
When uploading customized proteomics data, 3 data formats (csv, xls/xlsx and txt) can be selected in the remaining radio checkboxes below. After selecting the corresponding radio checkboxes, the dataset provided by the user for further analysis can then be directly uploaded by clicking the Browse button. A preview of the uploaded data is subsequently provided on the web page.
When loading sample data, 2 sets of sample data are provided in this step, facilitating direct access to and evaluation of POSREG. They are data acquired based on data-independent acquisition (DIA) and data-dependent acquisition (DDA), respectively, which include (1) proteomic data acquired by the DIA method (SWATH-MS): the sample data is the proteomics benchmark dataset PXD003972 from the PRoteomics IDEntifications (PRIDE) database and can be downloaded HERE (Right Click to Save); and (2) proteomic data acquired by the DDA method (Peak Intensity/Spectral Counting): the sample data is the proteomics benchmark dataset PXD005144 from the PRoteomics IDEntifications (PRIDE) database and can be downloaded HERE (Right Click to Save). After selecting the sample data acquired based on DIA or DDA, a preview of the uploaded data is subsequently provided on the web page.
After uploading the corresponding proteomics data, 3 steps are subsequently provided for data preprocessing, which involve missing value imputation, data filtering and data normalization. The imputation methods used here are BPCA Imputation, Column Mean Imputation, Column Median Imputation, Half of the Minimum Positive Value, KNN Imputation, SVD Imputation and Zero Imputation. 2 methods frequently applied to data filtering are covered, which include Mean Intensity Value and Standard Deviation. Moreover, 21 popular data normalization methods are also adopted in POSREG, which involve Auto Scaling, Contrast, Cubic Splines, Cyclic Loess, EigenMS, MSTUS, PQN, Quantile, Level Scaling, Linear Baseline, Li-Wong, Mean, Median, Pareto Scaling, Power Scaling, Range Scaling, Total Sum, Vast Scaling, VSN, Log Transformation and Cube Root Transformation. After selecting or defining the preferred methods and parameters, please proceed by clicking the PROCESS button; a summary and visualization of the data before and after preprocessing are then automatically generated. All resulting data and figures can be downloaded by clicking the corresponding Download button.
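The sketch below illustrates, in base R, one possible combination of these preprocessing choices (half-minimum imputation, log transformation and median normalization); it is a simplified stand-in, not POSREG's internal implementation:

# x: numeric matrix (samples in rows, proteins in columns); NA marks missing values
preprocess <- function(x) {
  # missing value imputation: half of each protein's minimum positive value
  x <- apply(x, 2, function(col) {
    col[is.na(col)] <- min(col[col > 0], na.rm = TRUE) / 2
    col
  })
  x <- log2(x + 1)                       # log transformation to stabilize variance
  sweep(x, 1, apply(x, 1, median), "-")  # median-center each sample (normalization)
}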
Feature selection is subsequently provided in this step. POSREG offers 12 feature selection methods for analyzing proteomics data, and a detailed explanation of each feature selection method is provided in Section 4 of this Manual. Users can select 3 feature selection methods and set the proper parameters of the selected methods. After selecting the preferred methods and parameters, please proceed by clicking the PROCESS button; the results of the reproducibility among the identified feature-lists based on multiple random sampling are then automatically generated, which can be divided into two parts: (A) The Dependence of Reproducibility on the Number of Features Selected and (B) The Occurrence (in Percent) of Each Protein among All Feature-lists. If users have any questions about the previous steps, please click the BACK button and try again.
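To make reproducibility under random sampling concrete, the sketch below scores how consistently a ranking method recovers the same top-k proteins across random 80% subsamples. It uses the mean pairwise top-k overlap as a simple stand-in for the Relative Weighted Consistency index used by POSREG, and fold change as a stand-in ranking method:

# x: samples-by-proteins matrix with column names; y: 0/1 class vector
top_k <- function(x, y, k) {
  fc <- abs(colMeans(x[y == 1, , drop = FALSE]) - colMeans(x[y == 0, , drop = FALSE]))
  names(sort(fc, decreasing = TRUE))[1:k]        # top-k proteins by |fold change|
}
reproducibility <- function(x, y, k = 20, n = 50) {
  lists <- replicate(n, {
    idx <- sample(nrow(x), floor(0.8 * nrow(x))) # random 80% subsample
    top_k(x[idx, , drop = FALSE], y[idx], k)
  }, simplify = FALSE)
  pairs <- combn(n, 2)
  mean(apply(pairs, 2, function(p)               # mean pairwise top-k overlap
    length(intersect(lists[[p[1]]], lists[[p[2]]])) / k))
}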
POSREG provides users with 2 SVM kernel functions: the linear kernel and the radial basis function kernel. After inputting the number of top-ranked feature-lists, please select the SVM kernel function and set the corresponding parameter in STEP-3 on the left side of the Analysis page. Finally, please proceed by clicking the PROCESS button; the collective consideration of both generalizability and reproducibility is then automatically generated, which can be divided into two parts: (A) The Level of Generalizability for Those Feature-lists Top-ranked by Reproducibility and (B) The Collective Consideration of Both Generalizability and Reproducibility. If users have any questions about the previous steps, please click the BACK button and try again.
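The sketch below shows how the generalizability of a candidate feature-list could be estimated with either kernel, assuming the e1071 package and a simple 70/30 hold-out split (POSREG's internal evaluation scheme may differ):

library(e1071)
# x: feature matrix restricted to one candidate feature-list; y: factor of class labels
generalizability <- function(x, y, kernel = c("linear", "radial")) {
  kernel <- match.arg(kernel)
  train <- sample(nrow(x), floor(0.7 * nrow(x)))  # 70/30 hold-out split
  fit <- svm(x[train, ], y[train], kernel = kernel)
  mean(predict(fit, x[-train, ]) == y[-train])    # hold-out accuracy as generalizability
}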
By choosing the preferred enrichment type (All GO BP & CC & MF, Biological Processes, Cellular Components, Molecular Functions or KEGG Pathway) and clicking the PROCESS button, the functional enrichment analysis report is automatically generated, which is a bubble-chart illustrating the terms enriched based on the identified optimal feature-list. If users have any questions about the previous steps, please click the BACK button and try again.
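For reference, a comparable GO enrichment analysis can be sketched outside POSREG with the Bioconductor packages clusterProfiler and org.Hs.eg.db (the package choice, human annotation and cutoffs are assumptions; genes is a character vector of Entrez IDs from the optimal feature-list):

library(clusterProfiler)
library(org.Hs.eg.db)
ego <- enrichGO(gene = genes, OrgDb = org.Hs.eg.db,
                keyType = "ENTREZID", ont = "BP",   # "BP", "CC", "MF" or "ALL"
                pAdjustMethod = "BH", qvalueCutoff = 0.05)
dotplot(ego)  # bubble-chart of the enriched terms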
- Chi-squared Test. The chi-squared test (CHIS) is a widely used hypothesis testing method for count data. It first hypothesizes the independence of two events and then determines the correctness of the theoretical value by observing the deviation between the actual value and the theoretical value (McHugh ML, et al. Biochem Med (Zagreb). 23: 143-9, 2013). Using the CHIS statistic for feature selection is similar to importing hypothesis testing about the distribution of classes (Zhang H, et al. Biomed Res Int. 2014: 589290, 2014).
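A minimal sketch of CHIS scoring for a single protein, discretizing the continuous intensity at its median before testing its independence from the class label (the binning scheme is an assumption for illustration):

# f: one protein's intensities; y: 0/1 class vector
chis_score <- function(f, y) {
  bins <- cut(f, breaks = quantile(f, c(0, 0.5, 1)), include.lowest = TRUE)  # median split
  chisq.test(table(bins, y))$statistic  # larger statistic = stronger class dependence
}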
- Correlation-based Method. The correlation-based method is a multivariate filter method, which evaluates an attribute subset according to the prediction ability of each feature in it and the correlations among them. Subsets whose features have strong prediction ability and low internal correlation perform well, which is the core hypothesis of this method (Batushansky A, et al. Biomed Res Int. 2016: 8313272, 2016).
- Entropy-based Filters. Entropy-based filters are filter-based feature ranking techniques including information gain, gain ratio and symmetrical uncertainty (Tang J, et al. Brief Bioinform. doi: 10.1093/bib/bbz061). Information gain selects features based on their information contribution related to the class variable, without considering feature interaction. Gain ratio is a non-symmetrical measure introduced to compensate for the bias of information gain, and the symmetrical uncertainty criterion likewise compensates for information gain's inherent bias.
- Fold Change Analysis. Fold change (FC) is a basic and widely used method for identifying differential gene expression, referring to the ratio of expression between two groups of samples (Feng J, et al. Bioinformatics. 28: 2782-8, 2012). FC has been widely applied in metabolomics, for example to identify urinary metabolomic biomarkers of aminoglycoside nephrotoxicity in newborn rats (Hanna MH, et al. Pediatr Res. 73: 585-91, 2013).
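A minimal sketch of per-protein fold change on log2-scale data (x: samples-by-proteins matrix, y: 0/1 class vector):

log2_fc <- colMeans(x[y == 1, ]) - colMeans(x[y == 0, ])  # difference of log2 means = log2 FC
head(sort(abs(log2_fc), decreasing = TRUE))               # the most differential proteins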
- Linear Model & Bayes. Linear Model & Bayes assesses the differential intensities by measuring features based on t-statistics and fold changes simultaneously (Tang J, et al. Brief Bioinform. doi: 10.1093/bib/bbz061).
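This strategy is commonly implemented via the Bioconductor limma package; a minimal sketch, assuming expr is a proteins-by-samples matrix of log intensities and y a 0/1 class vector:

library(limma)
design <- model.matrix(~ y)           # intercept plus class effect
fit <- eBayes(lmFit(expr, design))    # moderated t-statistics via empirical Bayes
topTable(fit, coef = 2, number = 10)  # top features by moderated t and log fold change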
- PLS-DA. Partial Least Squares Discriminant Analysis (PLS-DA) uses the partial least squares (PLS) algorithm to establish a model for predicting the categories of samples or discriminative variable selection (Lee LC, et al. Analyst. 143: 3526-39, 2018). It consists of a classical PLS regression analysis in which the response regressor is the class label. PLS components are built by trying to find a proper compromise between describing the data and predicting the response.
- Random Forest. Random forest refers to a classifier that uses multiple decision trees to train and predict samples, and it belongs to supervised learning (Touw WG, et al. Brief Bioinform. 14: 315-26, 2013). The algorithm has become very popular for pattern recognition in OMIC data, mainly because it provides two aspects that are very important for data mining: high accuracy and information on the variable importance for classification.
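A minimal sketch of random-forest variable importance with the randomForest package (the number of trees is an arbitrary choice):

library(randomForest)
rf <- randomForest(x, as.factor(y), ntree = 500, importance = TRUE)
head(importance(rf, type = 1))  # type 1 = mean decrease in accuracy per feature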
- Random Forest-Recursive Feature Elimination. Random Forest-Recursive Feature Elimination (RF-RFE) combines random forest and recursive feature elimination. It is a recursive backward feature elimination procedure (Zhou L, et al. Anal Bioanal Chem. 403: 203-13, 2012). In each iteration, a random forest is constructed to measure the features’ importance and the least important feature is removed. This procedure is repeated until no feature is left. Finally, the features are ranked based on the deletion sequence, and the top-ranked feature is the last one deleted.
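A minimal sketch using the caret package's recursive feature elimination wrapper with random-forest functions (note that caret's rfe evaluates nested subsets of the stated sizes rather than deleting one feature per iteration; the sizes are arbitrary):

library(caret)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)  # 5-fold cross-validation
prof <- rfe(x, as.factor(y), sizes = c(5, 10, 20, 50), rfeControl = ctrl)
predictors(prof)  # the selected feature subset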
- Significance Analysis for Microarrays. Significance Analysis for Microarrays (SAM) is a statistical approach for the identification of molecular quantities that differ significantly between two measurement sets (Constantinou C, et al. J Proteome Res. 10: 869-79, 2011).
- Student t-test. The Student t-test compares the means of two data sets, judging whether the two are the same and whether the difference between them is significant. It is a test under normal curve theory (Kumar N, et al. Bioinformation. 13: 202-8, 2017).
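A minimal sketch of per-protein t-test ranking in base R (x: samples-by-proteins matrix, y: 0/1 class vector):

p_t <- apply(x, 2, function(f) t.test(f[y == 1], f[y == 0])$p.value)
head(sort(p_t))  # the most significant proteins first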
- Support Vector Machine-Recursive Features Elimination. Support Vector Machine-Recursive Features Elimination (SVM-RFE) identifies the least useful attributes to eliminate for further analysis or for the development of prediction models (Ding Y, et al. BMC Bioinformatics. 7 Suppl 2: S12, 2006). It is a permutation-based (non-parametric) hypothesis testing method for the identification of molecular quantities that differ significantly between two measurement sets that represent different physiological conditions (Larsson O, et al. BMC Bioinformatics. 6: 129, 2005).
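A minimal sketch of linear SVM-RFE with the e1071 package, removing one feature per iteration based on the squared weights of the separating hyperplane:

library(e1071)
# x: samples-by-proteins matrix with column names; y: 0/1 class vector
svm_rfe <- function(x, y) {
  ranking <- character(0)
  feats <- colnames(x)
  while (length(feats) > 0) {
    fit <- svm(x[, feats, drop = FALSE], as.factor(y), kernel = "linear")
    w <- t(fit$coefs) %*% fit$SV        # weight vector of the linear hyperplane
    weakest <- feats[which.min(w^2)]    # least useful feature in this iteration
    ranking <- c(weakest, ranking)      # the last feature deleted ends up top-ranked
    feats <- setdiff(feats, weakest)
  }
  ranking
}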
- Wilcoxon rank-sum test. The Wilcoxon rank-sum test is generally used to detect whether 2 data sets come from the same population, and it is frequently used in statistical practice for the comparison of measures of location when the underlying distributions are far from normal or not known in advance (Rosner B, et al. Biometrics. 59: 1089-98, 2003).
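A minimal sketch of per-protein Wilcoxon ranking in base R (the normal approximation is requested so that tied intensities do not raise warnings):

p_w <- apply(x, 2, function(f)
  wilcox.test(f[y == 1], f[y == 0], exact = FALSE)$p.value)
head(sort(p_w))  # the most significant proteins first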
@ ZJU
Please feel free to visit our website at https://idrblab.org
Dr. Fengcheng Li (lifengcheng@zju.edu.cn)
Dr. Ying Zhou (11918212@zju.edu.cn)
Prof. Feng Zhu* (zhufeng@zju.edu.cn)
Address
College of Pharmaceutical Sciences,
Zhejiang University,
Hangzhou, China
Postal Code: 310058
Phone/Fax