Instructions to the User
1. Please Choose a File in the Unified MetaFS Format in the Left Side Panel
2. Please Process the Uploaded Data by Clicking the “Upload Data” Button in the Left Side Panel
Summary and Visualization of Raw Data
A. Summary of the Raw Data
B. Distribution of Protein Intensities Before and After Log Transformation
Table of Contents
1. Input File(s) of MetaFS
2. One Example Illustrating The Whole Workflow Step By Step
2.1 Quantification Data Upload
2.2 Data Pre-treatment
2.3 Filtering/Missing Value Imputation
2.4 Feature Selection
2.5 Performance Assessment
3. Various Kinds of Feature Selection Methods for Identifying the Differential Proteins
3.1 Chi-square
3.2 Correlation-based Feature Selection
3.3 Entropy-based Filters
3.4 Fold Change
3.5 Linear Models and Empirical Bayes
3.6 Partial Least Squares Discriminant Analysis
3.7 Orthogonal Partial Least Squares Discriminant Analysis
3.8 Relief
3.9 Random Forest with Recursive Feature Elimination
3.10 Significance Analysis of Microarrays
3.11 Support Vector Machine Recursive Feature Elimination
3.12 T-test
3.13 Wilcoxon Rank-sum Test
4. Various Kinds of Criteria for Assessing the Feature Selection Methods
4.1 Method’s unsupervised clustering performance of the identified significantly differential peptides/proteins
4.2 Method’s robustness of the significantly differential peptides/proteins among multiple datasets
4.3 Method’s predictive accuracies based on the supervised classification models
4.4 Method’s capability of identifying the true positive markers
The required file should provide a sample-by-feature matrix in CSV format (the input format of PXD002099 is shown in the Figure). The unique sample IDs and the corresponding label information must be listed in the first two columns of the file, under the headers “SampleName” and “Label”, respectively. The peptide/protein abundances across all samples must not be log-transformed, and the unique ID of each peptide/protein must be listed in the first row of the file. A file in the correct format can be readily produced with popular quantification software (e.g. MaxQuant, Progenesis).
To assess the capability of identifying true positive markers, a second file is required for further analysis (the input format of PXD002099 is shown in the Figure). In this file, users only need to provide the concentration matrix of the spiked proteins in the samples. All samples in this file contain spiked proteins; the sample IDs and the corresponding classes must be given in the first two columns, under the headers “Sample ID” and “Label”, respectively. Importantly, the samples should be labeled with two different conditions. The first row must provide the unique IDs of all the spiked proteins.
Overall, the MetaFS workflow can be divided into five steps: (1) uploading the raw microbial peptide/protein quantification dataset, (2) pre-treatment of the raw dataset and normality testing, (3) missing value imputation, (4) identification of differentially abundant proteins using feature selection (FS) methods, and (5) evaluation of the performance of the FS methods.
The example dataset (PXD000672) is used to illustrate the MetaFS workflow step by step. The user manual can be downloaded Here.
In the metaproteomic data upload step, the required file should provide a sample-by-feature matrix in .csv format. Sample IDs must be unique. The sample IDs and the corresponding class information must be listed in the first two columns of the file, under the headers “Sample-Name” and “Label”, respectively. The peptide/protein abundances across all samples must not be log-transformed, and the unique ID of each peptide/protein must be listed in the first row of the file. A file in the correct format can be readily produced with popular quantification software (e.g. MaxQuant). An example file can be downloaded directly from the first “HERE” link under the “Upload Quantification Data” button in the first step of the “Analysis” module.
To assess the capability of identifying true positive markers, a separate file describing the spiked proteins is required. The sample IDs and the corresponding classes must be given in the first two columns, under the headers “sample” and “class”, respectively. Each sample ID must be unique, and the samples must belong to two different conditions. An example file can be obtained from the second “HERE” link under the “Load Sample Data” button in the first step of the “Analysis” module.
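The format rules above can be checked before uploading. The following is a minimal sketch, not part of MetaFS: the function name `validate_quant_csv`, the treatment of empty cells and "NA" as missing values, and the checks for exactly two classes and for negative values (a possible sign of log-scaled data) are assumptions made for illustration.

```python
import csv
import io


def validate_quant_csv(text, id_header="Sample-Name", label_header="Label"):
    """Return a list of problems found in a sample-by-feature CSV (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    if header[:2] != [id_header, label_header]:
        problems.append("first two columns must be %r and %r" % (id_header, label_header))
    features = header[2:]
    if len(set(features)) != len(features):
        problems.append("feature IDs in the first row are not unique")
    sample_ids = [r[0] for r in body if r]
    if len(set(sample_ids)) != len(sample_ids):
        problems.append("sample IDs are not unique")
    labels = {r[1] for r in body if len(r) > 1}
    if len(labels) != 2:
        # Assumption: the comparison is between two groups.
        problems.append("expected exactly two classes, found %d" % len(labels))
    for r in body:
        for value in r[2:]:
            if value.strip() in ("", "NA"):
                continue  # assumption: missing values are handled in a later step
            try:
                if float(value) < 0:
                    problems.append("negative abundance in sample %s (log-scaled data?)" % r[0])
                    break
            except ValueError:
                problems.append("non-numeric abundance %r in sample %s" % (value, r[0]))
                break
    return problems


# Hypothetical two-sample file with two features.
example = "Sample-Name,Label,P1,P2\nS1,A,1200.5,0\nS2,B,980.0,NA\n"
print(validate_quant_csv(example))
```

An empty list means that none of the checks failed; it does not guarantee that the file will be accepted by the server.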
Three sample datasets are also provided in this step to allow direct access to and evaluation of MetaFS. They are all benchmark datasets collected from the PRoteomics IDEntifications (PRIDE) database developed by the European Bioinformatics Institute. Specifically, the sample data for SWATH-MS is the dataset PXD000672, containing 12 non-tumorous samples and 12 samples from patients with clear cell renal cell carcinoma (Guo T et al., 2015); the sample data for protein intensity is the dataset PXD005144, with 66 samples from pancreatic cancer patients and 36 samples from chronic pancreatitis patients (Saraswat M et al., 2017); and the sample data for spectral counting is the dataset PXD001819, providing yeast cell lysate samples of different concentrations (0.5 vs 50 fmol/microgram) acquired by MS2 spectral counting (Ramus C et al., 2016). By clicking the “Load Data” button, the sample dataset selected by the user is uploaded for further analysis.
Pre-treatment is required before downstream statistical analysis. Transformation, centering, scaling and normalization are the key operations in upstream data analysis. Data are often transformed to the log scale (Callister et al., 2006; van den Berg et al., 2006), which aims to convert the distribution of peptide/protein intensities into a more symmetric or normal distribution (Xia and Wishart, 2011). Centering converts all concentrations to fluctuations around zero instead of around the mean of the protein concentrations (van den Berg et al., 2006). Scaling adjusts for the fold difference between the detected proteins (van den Berg et al., 2006). Normalization removes unwanted variation to make individual observations/samples more directly comparable (Xia and Wishart, 2011). A variety of pre-treatment methods are included in this step, and detailed information is provided in a previous study (Tang et al., 2019a). For more detailed information on each pre-treatment method (including its algorithm), please click HERE.
Moreover, as previously reported, the assumption of normality should be checked for some specific FS methods (Sedgwick, 2015). The QQ plot is a widely accepted visual method for testing for a normal distribution (Lv et al., 2017), and it is therefore provided in this step so that users can perform the normality check directly. If the points in the QQ plot generally fall on the line y = x, the unknown distribution conforms to the normal distribution (Lv et al., 2017).
Data filtering and missing value imputation are then provided in this step. Data filtering methods can reduce the dimensionality of the data (Yan et al., 2018). The filtering method used here is basic filtering, and several imputation methods frequently applied to missing values are included: Zero Imputation, Singular Value Decomposition (SVD) and K-nearest Neighbor (KNN). By clicking the “PROCESS” button, a summary of the processed data and a plot of the intensity distribution before and after data pre-treatment are automatically generated in the “Analysis” page.
To obtain the differentially abundant proteins between two distinct groups, an appropriate FS method must be applied to maximize relevance and eliminate redundancy in the data. In total, 13 FS methods are integrated in this tool, and detailed instructions on each method can be found in the following section.
Four well-established criteria for comprehensively evaluating the performance of an FS method are provided in MetaFS: (a) the method’s clustering performance on the identified differential features; (b) the method’s robustness in selecting significantly differential proteins among multiple datasets; (c) the method’s predictive accuracy based on supervised classification models; and (d) the method’s capability of identifying the true positive markers.
The output files of MetaFS include (1) histograms, boxplots and QQ plots before and after pre-treatment, (2) a variety of statistical results (.png and .csv) for the significantly differential features identified by each FS method and (3) various evaluation results on the performance of each FS method under the four independent criteria (e.g., Venn diagrams, unsupervised hierarchical clustering, ROC curves and so on). Users can download all these result files and performance assessment documents in the corresponding format (.png and .csv) using the “Download” button at each step of the MetaFS “Analysis” module.
In this tool, 13 FS methods popular for biomarker discovery in MS-based metaproteomics are integrated: (1) Chi-square; (2) Correlation-based Feature Selection; (3) Entropy-based Filters; (4) Fold Change; (5) Linear Models and Empirical Bayes; (6) Partial Least Squares Discriminant Analysis; (7) Orthogonal Partial Least Squares Discriminant Analysis; (8) Relief; (9) Random Forest with Recursive Feature Elimination; (10) Significance Analysis of Microarrays; (11) Support Vector Machine Recursive Feature Elimination; (12) Univariate T Test; (13) Wilcoxon Rank-sum Test. As previously reported, the assumption of normality should be checked for some specific FS methods (Sedgwick, 2015). In other words, the pre-treated data should be tested for normality before some FS methods are selected (Sedgwick, 2015), for example the T-test (Jorge et al., 2009). The quantile-quantile (QQ) plot is a widely accepted visual method for testing normality (Lv et al., 2017). Each FS method, including its requirements on data structure, is described as follows:
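The four operations can be illustrated with one simple representative each. This is only a sketch: log2 with an offset of 1, mean centering, auto scaling (unit variance) and median normalization are chosen here as common examples, and the function names are hypothetical. The methods and defaults actually offered by MetaFS may differ.

```python
import math
import statistics


def log2_transform(values, offset=1.0):
    """Log2-transform raw abundances; the offset (an assumption) avoids log(0)."""
    return [math.log2(v + offset) for v in values]


def mean_center(values):
    """Shift values so that they fluctuate around zero."""
    m = statistics.fmean(values)
    return [v - m for v in values]


def auto_scale(values):
    """Mean-center, then divide by the sample standard deviation."""
    centered = mean_center(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return centered  # a constant feature cannot be scaled
    return [v / sd for v in centered]


def median_normalize(sample):
    """Subtract the sample median so that samples become more comparable."""
    med = statistics.median(sample)
    return [v - med for v in sample]


# Rows are samples, columns are proteins (hypothetical raw intensities).
raw = [
    [100.0, 2000.0, 50.0, 800.0],
    [400.0, 8000.0, 150.0, 900.0],
    [200.0, 3000.0, 90.0, 2500.0],
]
logged = [log2_transform(sample) for sample in raw]            # per sample
normalized = [median_normalize(sample) for sample in logged]   # per sample
scaled_columns = [auto_scale(list(col)) for col in zip(*normalized)]  # per protein
pretreated = [list(row) for row in zip(*scaled_columns)]
print(pretreated)
```

In this sketch, transformation and normalization act on each sample (row), whereas centering and scaling act on each protein (column).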
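The points of a normal QQ plot can be computed as sketched below. The observed values are standardized so that the reference line is y = x, as described above. The plotting positions (i - 0.5) / n are one common convention among several, and `max_deviation` is a hypothetical helper added for illustration, not a formal normality test.

```python
import statistics


def qq_points(values):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    standard_normal = statistics.NormalDist()
    # Standardize so that normally distributed data fall close to y = x.
    observed = sorted((v - mean) / sd for v in values)
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


def max_deviation(points):
    """Largest vertical distance of any point from the line y = x."""
    return max(abs(obs - theo) for theo, obs in points)


# Hypothetical data: evenly spaced normal quantiles vs. strongly skewed values.
normal_like = [statistics.NormalDist(10, 2).inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
skewed = [float(2 ** i) for i in range(20)]
print(max_deviation(qq_points(normal_like)))
print(max_deviation(qq_points(skewed)))
```

A small deviation suggests that the points lie close to the line; how small is small enough remains a visual judgment on the plot itself.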
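A minimal sketch of filtering followed by imputation is given below. Missing values are represented by None. The missing-fraction cutoff used for filtering and the sample-wise KNN scheme (mean of the same protein in the k most similar samples) are assumptions made for illustration; they are not necessarily the procedures implemented in MetaFS, and SVD imputation is omitted.

```python
import math


def filter_features(matrix, max_missing_fraction=0.5):
    """Keep columns whose fraction of missing values (None) is within the cutoff.

    Returns the filtered matrix and the indices of the columns that were kept.
    """
    n_samples = len(matrix)
    keep = [
        j for j in range(len(matrix[0]))
        if sum(row[j] is None for row in matrix) / n_samples <= max_missing_fraction
    ]
    return [[row[j] for j in keep] for row in matrix], keep


def zero_impute(matrix):
    """Replace every missing value with zero."""
    return [[0.0 if v is None else v for v in row] for row in matrix]


def knn_impute(matrix, k=2):
    """Fill each missing value with the mean of that column in the k nearest samples."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for t, other in enumerate(matrix)
                if t != i and other[j] is not None
            )[:k]
            if donors:  # the value stays None if no sample has this column observed
                result[i][j] = sum(v for _, v in donors) / len(donors)
    return result


# Rows are samples, columns are proteins (hypothetical values).
data = [
    [10.0, None, 3.0],
    [11.0, 5.0, None],
    [30.0, 9.0, None],
    [12.0, 6.0, None],
]
filtered, kept = filter_features(data)  # the third protein is 75% missing
print(kept)
print(zero_impute(filtered))
print(knn_impute(filtered, k=2))
```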
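As an illustration of how two of the simpler FS methods can be combined, the sketch below selects proteins by fold change and by the Wilcoxon rank-sum test. It assumes raw, strictly positive abundances and uses the normal approximation without tie or continuity correction, so its p-values are approximate for small groups. The function names and cutoffs are hypothetical and do not reflect the defaults of MetaFS.

```python
import math
import statistics


def log2_fold_change(group_a, group_b):
    """log2 ratio of mean abundances (group_b over group_a)."""
    return math.log2(statistics.fmean(group_b) / statistics.fmean(group_a))


def rank_sum_p_value(group_a, group_b):
    """Two-sided rank-sum p-value from the normal approximation."""
    pooled = sorted(group_a + group_b)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j for ties
        i = j
    n_a, n_b = len(group_a), len(group_b)
    w = sum(ranks[v] for v in group_a)
    mean_w = n_a * (n_a + n_b + 1) / 2
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))


def select_features(abundances, labels, min_abs_log2_fc=1.0, max_p=0.05):
    """abundances maps a protein ID to a list of values ordered like labels."""
    first, second = sorted(set(labels))  # requires exactly two classes
    selected = []
    for feature, values in abundances.items():
        a = [v for v, lab in zip(values, labels) if lab == first]
        b = [v for v, lab in zip(values, labels) if lab == second]
        if (abs(log2_fold_change(a, b)) >= min_abs_log2_fc
                and rank_sum_p_value(a, b) <= max_p):
            selected.append(feature)
    return selected


# Hypothetical data: P1 differs between the groups, P2 does not.
labels = ["ctrl"] * 5 + ["case"] * 5
abundances = {
    "P1": [10.0, 11.0, 12.0, 13.0, 14.0, 40.0, 44.0, 48.0, 52.0, 56.0],
    "P2": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 12.0, 10.0, 13.0, 12.0],
}
print(select_features(abundances, labels))
```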
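Two of these criteria lend themselves to a compact illustration. The sketch below scores robustness (b) as the mean pairwise Jaccard overlap between the protein sets selected from different datasets, and scores the recovery of true positive markers (d) as precision and recall against the known spiked proteins. Both measures are assumptions chosen for illustration; the measures actually computed by MetaFS are those described in Section 4.

```python
from itertools import combinations


def mean_pairwise_jaccard(feature_sets):
    """Average Jaccard overlap over all pairs of selected-feature sets."""
    scores = [
        len(a & b) / len(a | b)
        for a, b in combinations(feature_sets, 2)
        if a | b
    ]
    return sum(scores) / len(scores) if scores else 0.0


def spiked_recovery(selected, spiked):
    """Precision and recall of the selected proteins against the spiked proteins."""
    selected, spiked = set(selected), set(spiked)
    true_positives = len(selected & spiked)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(spiked) if spiked else 0.0
    return precision, recall


# Hypothetical protein IDs selected from three sub-datasets.
runs = [{"P1", "P2", "P3"}, {"P2", "P3", "P4"}, {"P2", "P3"}]
print(mean_pairwise_jaccard(runs))
print(spiked_recovery(["P1", "P2"], ["P2", "P3", "P4"]))
```

A value of 1 for the overlap means that every dataset yields the same set of proteins, and a value of 0 means that no two sets share a protein.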
3.2 Correlation-based Feature Selection (CFS)
3.3 The Entropy-based Filters (ENTROPY)
3.5 Linear Models and Empirical Bayes (LMEB)
3.6 Partial Least Squares Discriminant Analysis (PLS-DA)
3.7 Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA)
3.9 Random Forest with Recursive Feature Elimination (RF-RFE)
3.10 Significance Analysis of Microarrays (SAM)
3.11 Support Vector Machine Recursive Feature Elimination (SVM-RFE)
3.12 Univariate T Test (T-test)
3.13 The Wilcoxon Rank-sum Test (Wilcox)
Compared with the previous publication on the assessment of FS methods (Tang et al., 2019b), two new and widely accepted criteria (unsupervised clustering performance and robustness of the identified features) were added in this study. In total, MetaFS integrates four independent criteria to assess the performance of an FS method. Combining multiple criteria may offer a more comprehensive assessment of the FS method applied. The evaluation results for all criteria are displayed directly in the online tool and can be downloaded in full. Each criterion and its corresponding measures are described as follows:
4.1 Method’s unsupervised clustering performance of the identified significantly differential peptides/proteins
4.3 Method’s predictive accuracies based on the supervised classification models
4.4 Method’s capability of identifying the true positive markers
@ ZJU
Please feel free to visit our website at https://idrblab.org
Dr. Jing Tang (tangj@cqu.edu.cn)
Dr. Minjie Mou (3160103528@zju.edu.cn)
Dr. Yongchao Luo (1012299105@qq.com)
Prof. Feng Zhu* (zhufeng@zju.edu.cn)
Address
College of Pharmaceutical Sciences,
Zhejiang University,
Hangzhou, China
Postal Code: 310058
Phone/Fax