Welcome to MetaFS


MetaFS is a web-based platform for evaluating the performance of feature selection methods from multiple perspectives. MetaFS integrates four well-established criteria (each with a distinct underlying theory) to ensure a more comprehensive evaluation than any single criterion can offer. It provides the most complete set of the available feature selection methods. Two key features that characterize MetaFS as a useful online tool for metaproteomics data analysis are:

(1) 13 Feature Selection Methods: in total, 13 feature selection methods popular for biomarker discovery in MS-based metaproteomics are integrated and analyzed.

(2) Simultaneous assessment from multiple perspectives: 4 well-established criteria for assessing the performance of feature selection methods are provided.

In addition, the source code of the different feature selection methods is released on the MetaFS website and can be downloaded HERE.




Instructions for the User


1. Please Choose a File in the Format Unified by MetaFS in the Left Side Panel

Format Unified by MetaFS (quantification data in the format unified by MetaFS)

A sample data file in the standard format unified by MetaFS can be downloaded HERE, and a file providing a set of GOLDEN STANDARDS (the spiked proteins) can also be downloaded HERE.

2. Please Process the Uploaded Data by Clicking the “Upload Data” Button in the Left Side Panel

Summary and Visualization of Raw Data

A. Summary of the Raw Data

B. Distribution of Protein Intensities Before and After Log Transformation

Table of Contents

1. Input File(s) of MetaFS

2. One Example Illustrating The Whole Workflow Step By Step

2.1 Quantification Data Upload

2.2 Data Pre-treatment

2.3 Filtering/Missing Value Imputation

2.4 Feature Selection

2.5 Performance Assessment

3. Various Kinds of Feature Selection Methods for Identifying the Differential Proteins

3.1 Chi-square

3.2 Correlation-based Feature Selection

3.3 Entropy-based Filters

3.4 Fold Change

3.5 Linear Models and Empirical Bayes

3.6 Partial Least Squares Discriminant Analysis

3.7 Orthogonal Partial Least Squares Discriminant Analysis

3.8 Relief

3.9 Random Forest with Recursive Feature Elimination

3.10 Significance Analysis of Microarrays

3.11 Support Vector Machine Recursive Features Elimination

3.12 T-test

3.13 Wilcoxon Rank-sum Test

4. Various Kinds of Criteria for Assessing the Feature Selection Methods

4.1 Method’s unsupervised clustering performance of the identified significantly differential peptides/proteins

4.2 Method’s robustness of the significantly differential peptides/proteins among multiple datasets

4.3 Method’s predictive accuracies based on the supervised classification models

4.4 Method’s capability of identifying the true positive markers


1. Input File(s) of MetaFS

The required file should provide a sample-by-feature matrix in csv format (the input format of PXD002099 is shown in the Figure). In the input file, the unique sample ID and the corresponding label must be listed in the first two columns, with the column names kept as “SampleName” and “Label”, respectively. The peptide/protein abundances across all samples must not be log-scaled, and the unique peptide/protein IDs must be listed in the first row of the input file. An input file in the correct format can easily be produced with popular quantification software (e.g. MaxQuant, Progenesis). The correct input file format is shown in the Figure.

To assess the capability of identifying the true positive markers, another specific file is required (the input format of PXD002099 is shown in the Figure). In this file, users only need to provide the concentration matrix of the spiked proteins in the samples. All samples in this file contain spiked proteins, and the sample IDs and the corresponding classes are required in the first 2 columns, named “Sample ID” and “Label”, respectively. Importantly, the samples should be labeled with two different conditions. Also, the first row must provide the unique IDs of all the spiked proteins.
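The format rules above can be checked before upload. The following is a minimal sketch, not part of MetaFS: the function name `check_metafs_matrix` is hypothetical, the header names are passed as parameters because they differ between the two files, and the two-class check is an assumption taken from the requirement that samples come from two conditions.

```python
import csv
import io

def check_metafs_matrix(csv_text, id_header="SampleName", label_header="Label"):
    """Return a list of format problems in a sample-by-feature CSV (empty list = OK)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], [r for r in rows[1:] if r]
    problems = []
    # The first two columns must carry the sample ID and the class label.
    if header[:2] != [id_header, label_header]:
        problems.append(f"first two columns must be {id_header!r} and {label_header!r}")
    # Feature (peptide/protein) IDs are listed in the first row and must be unique.
    feature_ids = header[2:]
    if len(set(feature_ids)) != len(feature_ids):
        problems.append("feature IDs in the first row are not unique")
    # Sample IDs must be unique.
    sample_ids = [r[0] for r in body]
    if len(set(sample_ids)) != len(sample_ids):
        problems.append("sample IDs are not unique")
    # Assumption: exactly two conditions are expected.
    labels = {r[1] for r in body if len(r) > 1}
    if len(labels) != 2:
        problems.append(f"expected two classes, found {len(labels)}")
    return problems

good = "SampleName,Label,P1,P2\ns1,A,10,20\ns2,B,30,\n"
bad = "SampleName,Label,P1,P1\ns1,A,1,2\ns1,A,3,4\n"
print(check_metafs_matrix(good))
print(check_metafs_matrix(bad))
```

For the spiked-protein file, the same check could be called with `id_header="Sample ID"`.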

2. One Example Illustrating The Whole Workflow Step By Step

Overall, the MetaFS workflow can be divided into five procedures: (1) uploading the raw microbial peptide/protein quantification dataset, (2) pre-treatment of the raw dataset and normality test, (3) missing value imputation, (4) identification of differentially abundant proteins and (5) evaluation of the performance of the FS methods.

The example data (PXD000672) is used to illustrate the workflow of MetaFS step by step. The user manual can be downloaded Here.

Step 1. Quantification Data Upload

In the metaproteomic data upload step, the required file should provide a sample-by-feature matrix in .csv format. The sample IDs should be unique. The sample ID and the corresponding class information must be listed in the first two columns of the required file, with the column names kept as “Sample-Name” and “Label”, respectively. In the input file, the peptide/protein abundances across all samples must not be log-scaled, and the unique peptide/protein IDs must be listed in the first row. An input file in the correct format can easily be produced with popular quantification software (e.g. MaxQuant). An example file can be downloaded directly from the first “HERE” link under the “Upload Quantification Data” button in the first step of the “Analysis” module.

To assess the capability of identifying the true positive markers, a specific file is required to provide information on the spiked proteins. The sample ID and the corresponding class are required in the first 2 columns, named “sample” and “class”, respectively. The sample IDs must also be unique, and the samples should belong to two different conditions. An example file can be obtained from the second “HERE” link under the “Load Sample Data” button in the first step of the “Analysis” module.

Three sets of sample data are also provided in this step to facilitate direct access to and evaluation of MetaFS. These sample data are all benchmark datasets collected from the PRoteomics IDEntifications (PRIDE) database developed by the European Bioinformatics Institute. In particular, the sample data for SWATH-MS is the dataset PXD000672, containing 12 non-tumorous samples and 12 samples from patients with clear cell renal cell carcinoma (Guo T et al., 2015); the sample data for protein intensity is the dataset PXD005144, with 66 samples from pancreatic cancer patients and 36 samples from chronic pancreatitis patients (Saraswat M et al., 2017); and the sample data for spectral counting is the dataset PXD001819, providing yeast cell lysate samples of different concentrations (0.5 vs 50 fmol/microgram) acquired by MS2 spectral counting (Ramus C et al., 2016). After the “Load Data” button is clicked, the sample dataset selected by the user is uploaded for further analysis.

Step 2. Data Pre-treatment

Pre-treatment is required before downstream statistical analysis. Transformation, centering, scaling and normalization are the key operations in upstream data analysis. Data are often transformed to the log scale (Callister et al., 2006; van den Berg et al., 2006), which aims to convert the distribution of peptide/protein intensities into a more symmetric or normal distribution (Xia and Wishart, 2011). Centering converts all concentrations to fluctuations around zero instead of around the mean of the protein concentrations (van den Berg et al., 2006). Scaling adjusts for the fold differences between the detected proteins (van den Berg et al., 2006). Normalization refers to removing unwanted variation to make individual observations/samples more directly comparable (Xia and Wishart, 2011). A variety of pre-treatment methods are included in this procedure, and detailed information has been provided in a previous study (Tang et al., 2019a). For more detailed information (including the algorithm) on each pre-treatment method, please click HERE.

Moreover, as previously reported, the assumption of normality should be checked for some specific FS methods (Sedgwick, 2015). Currently, the QQ plot is a widely accepted visual method for testing for a normal distribution (Lv et al., 2017). Thus, a QQ plot is also provided so that users can directly perform the normality test in this procedure. If the points in the QQ plot generally fall on the line y = x, the unknown distribution conforms to the normal distribution (Lv et al., 2017).
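The operations described above can be illustrated on a single protein's intensities. This is a minimal sketch only: the function names are hypothetical, base-2 logarithm and auto scaling (division by the standard deviation) are assumed as common choices, and the methods actually offered by MetaFS may differ.

```python
import math
from statistics import mean, stdev

def log2_transform(intensities):
    """Log2-transform positive raw intensities."""
    return [math.log2(x) for x in intensities]

def mean_center(values):
    """Shift values so that they fluctuate around zero."""
    m = mean(values)
    return [v - m for v in values]

def auto_scale(values):
    """Mean-center and divide by the sample standard deviation (assumed scaling)."""
    m, sd = mean(values), stdev(values)
    return [(v - m) / sd for v in values]

raw = [2.0, 4.0, 8.0, 16.0]
logged = log2_transform(raw)
centered = mean_center(logged)
scaled = auto_scale(logged)
print(logged, centered, scaled)
```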
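For readers who want to see what a QQ plot compares, the sketch below pairs theoretical standard-normal quantiles with the ordered, standardized sample values. It is an illustration under stated assumptions, not the MetaFS implementation: the function name is hypothetical, the sample is standardized first so that the points can be compared with y = x, and the plotting positions (i - 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Return (theoretical quantile, standardized sample quantile) pairs."""
    n = len(values)
    mu, sd = mean(values), stdev(values)
    ordered = sorted((v - mu) / sd for v in values)
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

points = qq_points([1.0, 2.0, 3.0, 4.0, 5.0])
for t, s in points:
    print(round(t, 3), round(s, 3))
```

Points lying close to the diagonal suggest approximate normality; systematic curvature at the ends suggests skewness or heavy tails.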


Step 3. Filtering/Missing Value Imputation

Data filtering and missing value imputation are provided next in this procedure. Data filtering methods can reduce the dimensionality of the data (Yan et al., 2018). The filtering method used here is basic filtering, and several imputation methods frequently applied to treat missing values are included: Zero Imputation, Singular Value Decomposition (SVD) and K-nearest Neighbor (KNN). After the “PROCESS” button is clicked, a summary of the processed data and a plot of the intensity distribution before and after data pre-treatment are automatically generated on the “Analysis” page.
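As a rough illustration of this step, the sketch below drops features with too many missing values and then applies zero imputation. The function name and the 50% missing-value threshold are assumptions made for the example; the rule used by the basic filtering in MetaFS is not specified here.

```python
def filter_and_zero_impute(matrix, max_missing_fraction=0.5):
    """matrix: dict feature_id -> list of intensities, with None marking a missing value."""
    cleaned = {}
    for feature_id, values in matrix.items():
        missing = sum(v is None for v in values)
        if missing / len(values) > max_missing_fraction:
            continue  # assumed filter: drop features that are mostly missing
        # Zero imputation: replace every missing value with 0.
        cleaned[feature_id] = [0.0 if v is None else v for v in values]
    return cleaned

example = {"P1": [1.0, None, 3.0, 4.0], "P2": [None, None, None, 2.0]}
result = filter_and_zero_impute(example)
print(result)
```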


Step 4. Feature Selection

To obtain differentially abundant proteins between two distinct groups, an appropriate FS method must be applied to maximize relevance and eliminate data redundancy. In total, 13 FS methods are integrated in this tool, and detailed instructions on each method can be found in the following section.


Step 5. Performance Assessment

Four well-established criteria for comprehensively evaluating the performance of an FS method are provided in MetaFS. These criteria are (a) the method’s clustering performance on the identified differential features; (b) the method’s robustness of the selected significantly differential proteins among multiple datasets; (c) the method’s predictive accuracy based on supervised classification models and (d) the method’s capability of identifying the true positive markers.

The output files of MetaFS include (1) histograms, boxplots and QQ plots before and after pre-treatment, (2) a variety of statistical results (.png and .csv) for the significantly differential features identified by each FS method and (3) various evaluation results on the performance of each FS method under the four independent criteria (e.g., Venn diagrams, unsupervised hierarchical clustering, ROC curves and so on). Users can directly download all these result files and performance assessment documents in the specific format (.png and .csv) via the “Download” button at each step of the MetaFS “Analysis” module.


3. Various Kinds of Feature Selection Methods for Identifying the Differential Proteins

In this tool, 13 FS methods popular for biomarker discovery in MS-based metaproteomics are integrated: (1) Chi-square; (2) Correlation-based Feature Selection; (3) Entropy-based Filters; (4) Fold Change; (5) Linear Models and Empirical Bayes; (6) Partial Least Squares Discriminant Analysis; (7) Orthogonal Partial Least Squares Discriminant Analysis; (8) Relief; (9) Random Forest with Recursive Feature Elimination; (10) Significance Analysis of Microarrays; (11) Support Vector Machine Recursive Features Elimination; (12) Univariate T Test; (13) Wilcoxon Rank-sum Test. As previously reported, the assumption of normality should be checked for some specific FS methods (Sedgwick, 2015). In other words, the pretreated data should be tested for normality before some FS methods are selected (Sedgwick, 2015), for example the T-test (Jorge et al., 2009). The quantile-quantile (QQ) plot is a widely accepted visual method for testing normality (Lv et al., 2017). Each FS method, including its requirements on data structure, is described below:

3.1 Chi-square (CHIS)

Chi-square (CHIS) is a widely applied statistical method that measures how far the observed distribution diverges from the one expected if the feature were actually independent of the class value (Koletsi and Pandis, 2016). The χ2 test is used to judge the independence of events and to analyze the deviation between the observed and the theoretical values of the sample, but its behavior is unstable for very small expected counts (McHugh, 2013). This method is a non-parametric test and does not require the pretreated data to be normally distributed (McHugh, 2013). Using CHIS for feature selection is similar to performing a hypothesis test on the class distribution (Zhang et al., 2014). CHIS has been applied to study the urinary microbiome in women with urgency urinary incontinence (Pearce et al., 2014).

3.2 Correlation-based Feature Selection (CFS)

Correlation-based Feature Selection (CFS) is a multivariate filter method, which evaluates a subset of attributes according to the predictive ability of each feature in it and the correlation between them. The core hypothesis of this method is that good feature subsets have strong predictive ability and low internal correlation (Hall and Smith, 1999). It does not rely on any data transformation method or data distribution (Hall and Smith, 1999). CFS has been applied to urinary metabolome profiles for the diagnosis of breast cancer (Kim et al., 2010) and to accurately classify ovarian cancer samples based on proteomic data (Liu et al., 2002b).

3.3 The Entropy-based Filters (ENTROPY)

The Entropy-based Filters (ENTROPY) are filter-based feature ranking techniques that include three classes: information gain, gain ratio and symmetrical uncertainty (Farina et al., 2008). Information gain selects features according to the information they contribute about the class variable, without considering interactions among features. Gain ratio is an asymmetric measure that compensates for the bias of information gain. The symmetrical uncertainty metric also compensates for the inherent bias of information gain. The Entropy-based Filters have been applied to develop a method that automatically detects and extracts blood vessels in retinal images (Chanwimaluang and Fan, 2003) and to discover wound metabolic biomarkers in Arabidopsis thaliana (Boccard et al., 2010).
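Information gain, the first of the three measures, can be written as H(class) - H(class | feature). The sketch below computes it for a feature that has already been discretized; the function names are hypothetical, and how continuous intensities would be discretized beforehand is left open as an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(class) - H(class | feature) for a discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

classes = ["A", "A", "B", "B"]
informative = information_gain(["hi", "hi", "lo", "lo"], classes)
uninformative = information_gain(["hi", "lo", "hi", "lo"], classes)
print(informative, uninformative)
```

A feature that separates the two classes perfectly receives the full class entropy as its gain, whereas a feature unrelated to the class receives a gain of zero.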

3.4 Fold Change (FC)

Fold Change (FC) selects features with large shifts between case and control groups. FC is calculated as the ratio of the mean protein intensities between the two groups. This method generates feature lists more reproducibly than the ordinary and modified t-statistics do (Witten and Tibshirani, 2007). FC has been widely used in metabolomics analysis, for example to identify urinary metabolomic markers of aminoglycoside nephrotoxicity in newborn rats (Hanna et al., 2013), and has also been applied in proteomics analysis to identify human protein markers in the saliva of individuals with periodontitis (Belstrøm et al., 2016).
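The ratio of group means can be sketched as follows. The function names and the cutoff of |log2 FC| >= 1 (a two-fold change in either direction) are assumptions chosen for illustration, not thresholds documented by MetaFS.

```python
import math
from statistics import mean

def fold_change(case, control):
    """Ratio of mean intensities (raw, non-log scale) between two groups."""
    return mean(case) / mean(control)

def select_by_fc(features, min_abs_log2_fc=1.0):
    """features: dict feature_id -> (case_values, control_values)."""
    selected = []
    for feature_id, (case, control) in features.items():
        if abs(math.log2(fold_change(case, control))) >= min_abs_log2_fc:
            selected.append(feature_id)
    return selected

features = {
    "P1": ([8.0, 12.0], [4.0, 6.0]),   # mean 10 vs 5
    "P2": ([5.0, 5.0], [4.0, 6.0]),    # mean 5 vs 5
    "P3": ([1.0, 3.0], [6.0, 10.0]),   # mean 2 vs 8
}
selected = select_by_fc(features)
print(selected)
```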

3.5 Linear Models and Empirical Bayes (LMEB)

Linear Models and Empirical Bayes (LMEB) is used to evaluate differential abundance in metabolites by drawing a volcano plot, which measures differentially accumulated features based on fold changes and t statistics simultaneously. The statistic of this method is reformulated as the moderated t-statistic (van Ooijen et al., 2018), and the pretreated MS data should be approximately Gaussian distributed to fulfill the statistical hypothesis of this method (Lahti et al., 2013b). LMEB has been applied to explore the association between the human intestinal microflora and serum lipids in humans (Lahti et al., 2013a), and to assess differentially expressed genes in microarray experiments (Smyth, 2004).

3.6 Partial Least Squares Discriminant Analysis (PLS-DA)

Partial Least Squares Discriminant Analysis (PLS-DA) is a chemometric method used for classification. On the one hand, it can identify the variables that maximize the differences among predetermined sample groups. On the other hand, it can infer the class membership of unclassified samples on the basis of a calibration set with known class distributions (Bartel et al., 2013). It consists of classical Partial Least Squares (PLS) regression in which the response variable is a class label; that is, it is the PLS variant used when Y is categorical. The method is based on regression analysis and requires the analyzed data to be normally distributed (Ghasemi and Zahediasl, 2012). This technique is particularly suitable for data with far more predictors than observations (Pérez-Enciso and Tenenhaus, 2003). PLS-DA has been applied to classify bacterial communities in fecal samples and colon mucosal samples of mice (Munyaka et al., 2016).

3.7 Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA)

Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) was developed as an improvement of the PLS-DA method and uses multivariate data to discriminate between two or more groups. In OPLS-DA, a regression-based model is calculated between the multivariate data and a response variable that only contains classification information (Westerhuis et al., 2010). Like PLS-DA, this method requires normally distributed data (Boccard and Rudaz, 2016). In contrast to PLS-DA, the major advantage of OPLS-DA for interpretation is its capacity to separate predictive variation from non-predictive variation (Bylesjö et al., 2006). OPLS-DA has been applied to explore the influence of dietary resistant starch on the gut microbiome and human metaproteome (Maier et al., 2017).

3.8 Relief (REF)

Relief (REF) is a multivariate filter approach that estimates attributes very efficiently. Its key idea is to estimate an attribute based on how well its values distinguish between near instances (Kononenko, 1994). In this method, the values of a significant attribute are correlated with the attribute values of an instance of the same class, and uncorrelated with the attribute values of an instance of the other class. Relief has been applied to explore the association between taste and metabolite profiles of Japanese refined sake (Sugimoto et al., 2010), and to identify metabolic markers in prostate cancer (Osl et al., 2008).

3.9 Random Forest with Recursive Feature Elimination (RF-RFE)

Random forest recursive feature elimination (RF-RFE) is a recursive process of backward feature elimination. It takes every feature into account and constructs a random forest in each iteration to measure the importance of each feature. The least important feature is then removed, and the process is repeated until no features are left. In the end, all features are sorted by their order of deletion, with the last deleted feature ranked first (Degenhardt et al., 2019). This method has strong feature selection ability and can analyze data with different structures, including nonlinear data as well as multivariate and collinear data matrices. Moreover, it can achieve effective feature selection even when the dataset is small (Granitto et al., 2006). RF-RFE has been used in PTR-TOF-MS data analysis to mine data from raw spectra (Cappellin et al., 2011), and has been applied in serum metabolomics analysis to reveal the imbalance of fatty acid metabolism in patients with chronic liver disease (Zhou et al., 2012).
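The elimination loop described above can be sketched independently of the model that supplies the importances. In the sketch below, `importance_fn` is a hypothetical stand-in: in RF-RFE it would retrain a random forest on the remaining features and return its importance scores, whereas here a fixed toy score table is used so that the example runs on its own.

```python
def recursive_feature_elimination(feature_ids, importance_fn):
    """Rank features by backward elimination (last removed ranks first).

    importance_fn(remaining) must return a dict feature_id -> importance,
    recomputed on the remaining features at every iteration.
    """
    remaining = list(feature_ids)
    removed = []
    while remaining:
        scores = importance_fn(remaining)
        worst = min(remaining, key=lambda f: scores[f])
        remaining.remove(worst)   # drop the least important feature
        removed.append(worst)
    return removed[::-1]

# Toy stand-in for model-based importances (not a random forest).
toy_scores = {"P1": 0.1, "P2": 0.7, "P3": 0.2}
ranking = recursive_feature_elimination(
    ["P1", "P2", "P3"],
    lambda remaining: {f: toy_scores[f] for f in remaining},
)
print(ranking)
```

The same loop applies to SVM-based elimination if the importance function is replaced by one derived from SVM weights.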

3.10 Significance Analysis of Microarrays (SAM)

Significance Analysis of Microarrays (SAM) is a permutation-based hypothesis test that identifies features with significant differences between two conditions. Because sample data may not follow a normal distribution, the method uses a non-parametric statistic to widen its scope of application; that is, SAM makes no assumptions about the distribution of the data (Young et al., 2011). It is particularly suitable for estimating the false discovery rate as well as the miss rate. This method was established to analyze changes in gene expression, has since become a mature statistical approach in metabolomics studies and has been used to screen for the most discriminant biomarkers (Sun et al., 2013). For instance, it has been applied to significance analysis in quantitative proteomics (Roxas and Li, 2008) and to the analysis of proteomics data from non-small cell lung cancer (Yanagisawa et al., 2003).

3.11 Support Vector Machine Recursive Features Elimination (SVM-RFE)

Support Vector Machine Recursive Features Elimination (SVM-RFE) is a wrapper algorithm that ranks features by backward feature elimination. It measures the weights of features in terms of support vectors (Lin et al., 2012). This method iteratively trains an SVM classifier on the current feature set and removes the least important features indicated by the SVM, thereby carrying out feature selection (Ding and Wilkins, 2006). Depending on the kernel, SVM-RFE can handle data with different structures, including nonlinear as well as linear data (Liu et al., 2002a; Liu, 2019). SVM-RFE has been used in proteomic analysis to improve the accuracy of cancer classification (Rajapakse et al., 2005).

3.12 Univariate T Test (T-test)

Univariate T Test (T-test) is not a classification method; it ranks features according to their p values (Christin et al., 2013). After multiple testing correction, a feature is considered significant if its p value is less than 0.05. The T-test focuses on the differences (or conversely the equality) among means. It is one of the most widely used tests in medicine and is a powerful unbiased parametric test based on normal curve theory, which requires that the data conform, at least roughly, to a normal distribution (Bridge and Sawilowsky, 1999). In other words, the T-test can be applied only if the points in the QQ plot of the pretreated data roughly fall on the line y = x (Lv et al., 2017), unless the sample size is large enough (Ghasemi and Zahediasl, 2012). This method has been applied in a quantitative proteomic analysis to examine the changes in protein expression of T. forsythia during biofilm formation (Pham et al., 2010).
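The multiple testing step can be illustrated with the Benjamini-Hochberg procedure. This is an assumption made for the example: the text does not state which correction MetaFS applies, and the function name below is hypothetical. The input is a list of per-feature p values that would come from the T-test.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted, significant)
```

In this toy example three raw p values are below 0.05, but only the first feature remains significant after correction.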

3.13 The Wilcoxon Rank-sum Test (Wilcox)

The Wilcoxon Rank-sum Test (Wilcox) is a non-parametric alternative to the two-sample T-test, which is based only on the rank order of the observed values of the two samples. Wilcox should be the best choice for extremely skewed distributions, such as those with heavy tails (Bridge and Sawilowsky, 1999). In other words, data that are not normally distributed can also be analyzed with Wilcox (Hicks et al., 2016). The Wilcoxon test has been applied in serum proteomics analysis to investigate prognostic biomarkers of gastric cancer (Qiu et al., 2009), and it has also been used to study colonic metaproteomic characteristics of bacteria in obese patients (Kolmeder et al., 2015).
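Because the test depends only on ranks, its statistic is easy to compute by hand. The sketch below returns the rank sum of one group in the pooled sample, using average ranks for ties; the function name is hypothetical, and deriving a p value from the statistic is left out.

```python
def rank_sum_statistic(group_a, group_b):
    """Sum of the ranks of group_a in the pooled sample (average ranks for ties)."""
    pooled = sorted(group_a + group_b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy 1-based ranks i+1 .. j; assign their average.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(rank_of[v] for v in group_a)

w_separated = rank_sum_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
w_tied = rank_sum_statistic([1.0, 2.0], [2.0, 3.0])
print(w_separated, w_tied)
```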


4. Various Kinds of Criteria for Assessing the Feature Selection Methods

Compared with the previous publication on the assessment of FS methods (Tang et al., 2019b), two new and widely accepted criteria (unsupervised clustering performance and robustness of the identified features) were added in this study. In total, MetaFS integrates four independent criteria to assess the performance of an FS method. Combining multiple criteria may offer a more comprehensive assessment of the FS method applied. The evaluation results for all criteria can be viewed directly and downloaded in full from the online tool. Each criterion and its corresponding measures are described below:

4.1 Method’s unsupervised clustering performance of the identified significantly differential peptides/proteins

An appropriate FS method is supposed to preserve, or sometimes enlarge, the difference between the 2 groups in a proteomics dataset (Griffin et al., 2010). Unsupervised hierarchical clustering based on the protein intensities of the samples (visualized as a heatmap) is therefore frequently applied as an effective metric (Griffin et al., 2010). First, feature selection reduces the total number of proteins studied. Then, columns (samples) and rows (proteins) are clustered by the similarity of their protein intensity profiles. The FS method is considered to perform well under this criterion when the samples of the two groups are clearly separated on the heatmap. This criterion highlights clustering performance and reflects the effectiveness of a method (Risso et al., 2014).

4.2 Method’s robustness of the significantly differential peptides/proteins among multiple datasets

Under this criterion, a consistency score is defined to quantify the overlap among the markers identified from different partitions of the available data. For a given dataset, a larger consistency score indicates more robust marker identification. The evaluation results under this criterion reflect the universality of the selected FS method (Wang et al., 2015) and the reproducibility of the identified significantly differential markers (Tang et al., 2019a).
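The exact formula of the consistency score is not given here. As a stand-in that captures the same idea of overlap among marker lists, the sketch below computes the mean pairwise Jaccard index of the marker sets obtained from several partitions of the data. Both the measure and the function name are assumptions for illustration and should not be read as the MetaFS definition.

```python
from itertools import combinations

def mean_pairwise_jaccard(marker_sets):
    """Average Jaccard overlap over all pairs of non-empty marker sets."""
    sets = [set(s) for s in marker_sets]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Markers selected by one FS method on three partitions of a dataset (toy data).
runs = [{"P1", "P2", "P3"}, {"P2", "P3", "P4"}, {"P2", "P3"}]
overlap = mean_pairwise_jaccard(runs)
print(round(overlap, 3))
```

A value of 1 would mean that every partition yields the same markers, and a value of 0 that no two partitions share any marker.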

4.3 Method’s predictive accuracies based on the supervised classification models

For this criterion, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) are provided on the basis of a support vector machine (SVM) (Valikangas et al., 2018a). First, differentially abundant features are identified by each FS method from the processed dataset. Second, SVM models are constructed from these identified features. A method whose ROC curve encloses a larger area, that is, a larger AUC value, is considered to perform well.
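The AUC equals the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name is hypothetical, and the scores are assumed to be decision values produced by a trained SVM, which is not shown.

```python
def roc_auc(scores, labels, positive):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.1], labels, positive=1)
auc_partial = roc_auc([0.9, 0.4, 0.6, 0.1], labels, positive=1)
print(auc_perfect, auc_partial)
```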

4.4 Method’s capability of identifying the true positive markers

As reported, a good FS method is expected to produce a comprehensive list of differential features relevant to the spiked proteins (Lichtman et al., 2016; Zhao et al., 2016). Thus, the optimal feature set derived from the differentially abundant proteins can be used to measure each algorithm’s ability to identify the true positive markers. The ideal feature set should include only features relevant to the spiked proteins (true positives). In practice, the differential features selected by each FS method contain both spiked features (true positives) and non-spiked features (false positives). The number of identified spiked proteins is then counted to assess the performance of the method (Christin et al., 2013).
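Counting true and false positives against the list of spiked proteins amounts to a set comparison. A minimal sketch, with a hypothetical function name and toy protein IDs:

```python
def count_spiked_hits(selected, spiked):
    """Return (true positives, false positives) for a list of selected features."""
    selected, spiked = set(selected), set(spiked)
    true_positives = selected & spiked    # selected features that were spiked
    false_positives = selected - spiked   # selected features that were not spiked
    return len(true_positives), len(false_positives)

tp, fp = count_spiked_hits(
    selected=["P1", "P2", "P5", "P9"],
    spiked=["P1", "P2", "P3"],
)
print(tp, fp)
```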

    @ ZJU

    Please feel free to visit our website at https://idrblab.org




    Email

    Dr. Jing Tang (tangj@cqu.edu.cn)

    Dr. Minjie Mou (3160103528@zju.edu.cn)

    Dr. Yongchao Luo (1012299105@qq.com)

    Prof. Feng Zhu* (zhufeng@zju.edu.cn)

    Address

    College of Pharmaceutical Sciences,

    Zhejiang University,

    Hangzhou, China

    Postal Code: 310058

    Phone/Fax

    +86-571-8820-8444