MMEASE is a web-based platform for the meta-analysis of multiple metabolomics datasets, designed to let biologists with little background in statistics perform sophisticated analyses of metabolomics data. Six analysis steps are included: Data Upload & Integration, Batch Effect Removal, Sample Separation, Marker Identification, Metabolite Annotation, and Enrichment Analysis. Four key features characterize MMEASE as a useful online tool for metabolomics:

(1) Meta-analysis: integration of metabolomics datasets from multiple experiments or laboratories is conducted and tested based on Zhang’s work (Analytical Chemistry, 2014, 86(13):6245-53).

(2) Enhanced metabolite annotation: 262,483 metabolites can be annotated, including 169,352 peptides, 29,290 endogenous metabolites and 42,330 exogenous metabolites.

(3) Diverse statistical analysis methods: MMEASE provides 15 statistical methods for identifying metabolic markers, 7 of which are applied for the first time among popular online servers for metabolomics data analysis.

(4) More optional databases for enrichment analysis: metabolite enrichment analysis is conducted based on KEGG and SMPDB pathways, HMDB bio-functions, CFam structures and the species/genus origin of traditional medicine.




Perform the Complete Analysis Steps or Metabolite Annotation and Enrichment Directly

A. Perform the four complete analysis steps. Please follow the STEPs provided in the panel on the left side.
B. Annotate metabolites directly, without the previous steps, by clicking the corresponding tab.
C. Perform metabolite enrichment directly by clicking the corresponding tab.

The General Workflow of MMEASE


1. Visualization of Raw Data for MZ and RT Values Uploaded



2. Visualization of Data after Integration and Batch Effect Removal

Intensities before removing batch effects

Intensities after removing batch effects

PCA plot before removing batch effects

PCA plot after removing batch effects

Dataset after Data Integration

Dataset after Batch Effect Removal


Help Document

Please make sure you have read the help instructions before uploading your datasets.

This help document provides a step-wise description of how to format and upload data to MMEASE, how to integrate data and remove batch effects, how to identify significant features and patterns through univariate and multivariate statistical methods, and, finally, how to use metabolite set enrichment analysis and metabolic pathway analysis to help elucidate possible biological mechanisms.

1 - Stepwise processing results for example dataset (Back to top)

To start an MMEASE analysis, CSV files containing a feature-by-sample matrix should be prepared in advance. Each CSV file contains five essential columns providing information on isotope, mass, adduct, intensity and retention time, and different CSV input files are prepared for different analytical experiments. In particular, the first two columns of each CSV file give the mass and retention time; samples must be kept in columns, with the sample names in the first row. The group label in the second row indicates distinct sample groups such as case and control. Input data values (mass, retention time, intensity) should be numeric, and a blank or “NA” should be used to indicate any missing values. An example input file with the corresponding contents separated by commas (CSV) is provided HERE. There are three different files, for three experiments, for downstream analysis using default parameters after data integration.
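As a sketch of the expected layout (all values below are hypothetical, not taken from the example dataset), the following Python snippet writes and re-reads a miniature CSV in the described five-column, samples-in-columns format:

```python
import csv
import io

# Hypothetical miniature input: first row = column labels / sample names,
# second row = group labels, first columns give mass and retention time,
# then the per-sample intensities; "NA" or blank marks a missing value.
rows = [
    ["mass", "rt", "isotope", "adduct", "Sample1", "Sample2", "Sample3", "Sample4"],
    ["", "", "", "", "case", "case", "control", "control"],
    [180.0634, 312.5, "[M]+", "[M+H]+", 1520.3, 1498.7, 880.2, "NA"],
    [146.0579, 208.1, "[M]+", "[M+Na]+", 310.4, 295.8, 410.9, 388.1],
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()
print(text)

# Reading the file back gives one list of strings per row.
parsed = list(csv.reader(io.StringIO(text)))
```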

(1) Step 1 - Datasets Upload & Integration

In this step, the example datasets can be loaded automatically. In the data integration section, the parameters Primary RT Tolerance, Secondary RT Tolerance and M/Z Tolerance are set to 10, 10 and 0.05, respectively. Here, the BMC/PAMR method is selected to remove batch effects.

(2) Step 2 - Sample Separation

In this step, there are four methods for sample separation. The K-means Clustering method is selected to separate all samples of the example dataset. The three parameters, Number of Clusters (K), Maximum Number of Iterations and Clustering Algorithm, are set to 2, 10 and Hartigan-Wong, respectively.

(3) Step 3 - Marker Identification

For biomarker identification, 13 methods are available to users. In this case, the Student's t-test method is used to identify the biomarkers. In addition, the comparison type and adjusted p-value are set to unequal variances and 0.05, respectively.

(4) Step 4 - Metabolite Annotation

In the metabolite annotation step, both mass spectra and tandem mass spectra are applied to annotate the metabolites. Taking tandem mass spectra as an example, the MS/MS peak list (m/z & intensity) and the parent ion mass are input into the server. Tolerance of Parent Ion, Tolerance of Mass/Charge, Ionization Mode and CID Energy are set to 0.1, 0.5, positive mode and low (10V), respectively. The table of annotation results is shown below. The metabolite Acamprosate (ID: MMEASE0009095), with the highest fit degree, is selected for the mirror plot of the annotation result.

(5) Step 5 - Metabolite Enrichment

In the metabolite enrichment step, eight categories are available to users. The KEGG pathway database is selected and the KEGG compound IDs are input into the web server. The p-value, adjusted p-value method and Species are set to 0.05, none and Homo sapiens, respectively.

2 - Dataset uploading (Back to top)

Data used for analysis can be uploaded as comma-separated value (CSV) files.

Depending on the aims or types of analysis that users want to perform, they can upload the data using any of the three available tab panel options: Data Integration, Metabolite Annotation and Enrichment Analysis.

Data uploading instructions for data integration, sample separation and marker identification are provided as follows:

(1) For data integration, please upload several comma-separated value (CSV) files.

(2) To perform metabolite annotation or enrichment analysis, please paste your m/z features or metabolite list, or upload a data file with the proper content, into the corresponding panel.

Note: the CSV file should be formatted with the first row reserved for column labels. Samples can be in columns, with group labels immediately following the sample names. The group label can be binary (control, case). Every CSV file must follow the five-column format (mass, retention time, intensities, isotopes and adduct). Input data values (mass, retention time, intensities) must be numeric, and missing values should be left blank or marked as NA. At present, the web server does not support uploading raw MS spectra. Given the size of MS spectra, users should first process the raw data into peak tables. Many MS spectral processing tools are freely available, such as MetAlign, MZmine or XCMS. The uploaded CSV file should look like the following Figure 1.

Figure 1. The content and format of .CSV file that you will upload


3 - Data integration (Back to top)

The integration algorithm is based on the compound alignment strategy proposed by Zhang et al. Metabolite features are matched between instruments by identifying those with an m/z difference of <0.01 and a mapped retention time difference of <10 s.

Please note: to ensure the quality of the integrated data, the datasets to be integrated should come from similar experiments (i.e., the same type of chromatography column). The input values of the primary and secondary RT tolerances should both be about twice your RT differences; the suggested m/z tolerance value is 0.05.

In this step, different datasets are merged into one large dataset and then analyzed as if all datasets were derived from the same type of experiment. Combinations of multiple batches or datasets are frequently utilized in large cross-sectional epidemiology studies in metabolomics. [PMID: 23240878]

Integrative analysis means combining the information of multiple independent studies, designed to study the same biological problem, in order to extract more general and more reliable conclusions. Variability in analytical conditions such as temperature, pressure and humidity can greatly affect an analyte's chromatographic elution time [J. Chemom. 2004, 18, 231-241]. Therefore, alignment is often necessary to correct the shift in chromatographic retention time among different experimental analyses [PMID: 16689529]. The RT tolerance (T(rt)) and M/Z tolerance (T(m/z)) in data integration analysis are quite sensitive; the merging results may therefore differ considerably under different parameter settings. In this step, T(m/z) and T(rt) are user-defined threshold parameters for the upper bounds on the m/z and RT shift tolerances. In practice, their choice depends on the measurement precision and is determined by the experimental instrument setup. The integration algorithm is based on the compound alignment strategy proposed by Zhang et al. (the flowchart of data integration is shown in Figure 2). Commonly used values are 20 seconds for the LC-MS RT tolerance and 10 ppm for the LC-MS M/Z tolerance.

The integration program automatically performs the following steps:

(1) Find the compound with the highest intensity across all samples in a specific dataset.

(2) Aggregate the compounds from all datasets.

(3) Find the compound with the highest intensity across datasets and set this compound as the tentative reference compound.

(4) Set Align_RT and Align_MolecularMass to this compound's retention time and molecular mass, respectively.

(5) Find the compounds that meet the requirements: RT tolerance 1, MZ tolerance 1.

(6) Find the compound with the median retention time of all the compounds in the tentative aligned compound group.

(7) Set this compound as the target reference compound and set Align_RT and Align_Molecular_Mass to this compound's retention time and molecular mass, respectively.

(8) From all of the aggregated compounds in step 2, find all the compounds that meet the requirements: RT tolerance 2, MZ tolerance 2. In step 8, if multiple compounds from a specific dataset match, select the one with the highest intensity.
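As a rough illustration of steps (3)-(8) (this is not MMEASE's actual code; the feature tuples, tolerance values and data are invented for the example), a single alignment round might look like:

```python
# Hypothetical feature tuples: (dataset_id, mz, rt, intensity).
T_RT, T_MZ = 10.0, 0.05  # illustrative RT and m/z tolerances

features = [
    (0, 180.063, 310.0, 900.0), (1, 180.065, 312.0, 1500.0),
    (2, 180.060, 305.0, 700.0), (0, 250.100, 500.0, 400.0),
]

def align_once(pool):
    # (3) tentative reference = highest-intensity compound across datasets
    ref = max(pool, key=lambda f: f[3])
    # (5) candidates within the RT and m/z tolerances of the reference
    group = [f for f in pool
             if abs(f[2] - ref[2]) <= T_RT and abs(f[1] - ref[1]) <= T_MZ]
    # (6)-(7) refine the reference using the median retention time
    group.sort(key=lambda f: f[2])
    ref = group[len(group) // 2]
    group = [f for f in pool
             if abs(f[2] - ref[2]) <= T_RT and abs(f[1] - ref[1]) <= T_MZ]
    # (8) keep at most one compound per dataset: the most intense match
    best = {}
    for f in group:
        if f[0] not in best or f[3] > best[f[0]][3]:
            best[f[0]] = f
    return list(best.values())

aligned = align_once(features)
print(sorted(f[0] for f in aligned))  # → [0, 1, 2]: datasets in this group
```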

Figure 2. The flowchart of data integrating


Note: if the experimental conditions of the datasets are very heterogeneous, e.g. because they come from different types of chromatography columns (RP, HILIC), please remove the datasets with large differences. Finally, please check again that the datasets are valid and suitable for subsequent analysis.

4 - Batch effect removal (Back to top)

Biomarker projects often include many batches of multiple experiments, and batch variations are commonly observed across different labs, array types or platforms. Normalization procedures are often not sufficient to adjust the data for batch effects.

The goal of metabolomics data pre-processing is to eliminate systematic variation, such that biologically-related metabolite signatures are detected by statistical pattern recognition.

In this step, the server adopts one of six options to remove batch effects: BMC, ComBat, DWD, GlobalNorm, XPN and None.

(1) Batch mean-centering (BMC): this simple method transforms the data by subtracting the mean of each gene over all samples (per batch) from its observed expression value, such that the mean of each gene becomes zero. Mean-centering has been widely used in the past to compare the relative expression of highly and lowly expressed genes together within a single dataset, particularly for heatmaps and clustering programs. Mean centering is also one of the common pre-treatment methods in metabonomic studies.
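A minimal numerical sketch of BMC on toy data (MMEASE's own implementation may differ in details):

```python
import numpy as np

# Toy matrix: 6 samples x 4 features, with a batch label per sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
batch = np.array([0, 0, 0, 1, 1, 1])

# BMC: subtract each feature's per-batch mean, so that within every batch
# each feature has mean zero.
X_bmc = X.copy()
for b in np.unique(batch):
    X_bmc[batch == b] -= X[batch == b].mean(axis=0)

print(np.allclose(X_bmc[batch == 0].mean(axis=0), 0))  # → True
```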

(2) Empirical Bayes method (EB, also known as Extended Johnson-Li-Rabinovich or ComBat): a method using estimates of the location/scale (L/S) parameters (mean and variance) for each gene. The parameters are estimated by pooling information from multiple genes with similar expression characteristics in each batch. Both a parametric and a non-parametric approach exist. Batch-effect correction using ComBat has been applied to the merging of apLCMS or XCMS sample processing results, and ComBat correction can remove some of the between-batch variance component, as the different classes are closer than in the raw data [PMID: 24990606].

(3) Distance-weighted discrimination (DWD): DWD (Benito et al., 2004) uses distance-based discrimination to correct batch effects. However, the method requires many samples (>25) in each batch for best performance. The systematic batch biases are corrected by the DWD algorithm: DWD eliminates source effects across different studies by finding a hyperplane that separates the two systematic biases and adjusts the data by projecting them onto the hyperplane, subtracting out the DWD plane multiplied by the batch mean.

(4) Global normalization: Z-score normalization is one of the simplest mathematical transformations for making datasets more comparable. In this method, for each gene expression value x(ij) in each study separately, all values are modified by subtracting the mean x(i) of the gene in that dataset and dividing by its standard deviation delta.
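The transformation can be written as x'(ij) = (x(ij) - mean(i)) / sd(i); a toy sketch:

```python
import numpy as np

# Toy matrix: 3 samples x 2 features (illustrative values).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Z-score normalization per feature: subtract the mean, divide by the SD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # ~[0, 0]
print(Z.std(axis=0))   # ~[1, 1]
```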

(5) Cross-platform normalization (XPN): the basic idea behind the cross-platform normalization approach is to identify homogeneous blocks (clusters) of genes and samples in both studies that have similar expression characteristics. In XPN, a gene measurement within one such block can be considered as a scaled and shifted block mean, where both the scaling and the shifting depend on the gene i and sample j.

(6) None: in some cases, you may not need to consider the batch effect and can choose 'None'.

5 - Sample Separation (Back to top)

The Sample Separation tab allows users to perform clustering and visualization on the integrated data. You can choose one or more methods depending on your analysis.

Unsupervised methods are often applied to summarize complex metabolomic data. They provide an effective way to detect data patterns that are correlated with experimental and/or biological variables. In this step, there are four options for the sample separation of metabolomic data, including three clustering methods and principal component analysis [PMID: 25798438]. Among these, hierarchical cluster analysis (HCA), k-means clustering and the self-organizing map (SOM) are the most prominent representatives in the analysis of metabolomics data. Principal component analysis (PCA) is the most commonly used unsupervised method in metabolomic studies (Wold et al., 1987; Bro and Smilde, 2014).

(1) Hierarchical cluster analysis (HCA)

When the number of clusters is unknown, the most prevalent clustering technique is hierarchical clustering. In metabolomics studies, hierarchical cluster analysis (HCA) clusters the data to form a tree diagram or dendrogram which shows the relationships between samples (Ebbels, 2007; PMID: 18007604). The metabolic profiles are clustered in a hierarchical tree. At the lowest level of the tree, each metabolic profile is considered a separate cluster, while all samples are grouped into one cluster at the highest level. Starting from the lowest level, at each round of the algorithm a correlation coefficient for each pair of the available clusters is estimated based on a particular distance metric. The clusters with the highest correlation coefficient are grouped into one cluster for the subsequent round of the algorithm. The acquired hierarchical tree has to be interpreted in the context of the biological problem in question.
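A small sketch of HCA using SciPy on simulated profiles (the distance metric, linkage method and data below are illustrative, not MMEASE's defaults):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two well-separated groups of simulated metabolic profiles (toy data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (3, 5)),
               rng.normal(3, 0.1, (3, 5))])

# Pairwise distances -> average-linkage tree -> cut into two clusters.
Z = linkage(pdist(X, metric="euclidean"), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the first three profiles share one cluster, the rest the other
```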

Figure 3. The heatmap for integrated metabolomic dataset

(2) K-means Clustering

If the number of underlying clusters k is known, k-means clustering is prevalently used. K-means clustering is a method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean. In k-means clustering, the Euclidean distance is used as the distance metric and variance is used as a measure of cluster scatter. The number of clusters k is an input parameter. When performing k-means, it is important to run diagnostic checks for determining the number of clusters in the dataset.

The program offers four algorithms for k-means clustering (Forgy, 1965; MacQueen, 1967; Hartigan, 1975; Hartigan and Wong, 1979; Lloyd, 1982). You can use one specific algorithm to complete the clustering.

Please note: the input cluster number must be >=2 and below the number of samples.
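A minimal k-means sketch with scikit-learn (note that scikit-learn implements the Lloyd/Elkan variants rather than Hartigan-Wong; the data and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two simulated groups of 5 samples each in 3 dimensions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (5, 3)),
               rng.normal(4, 0.2, (5, 3))])

# k = 2 clusters, capped at 10 iterations as in the example analysis above.
km = KMeans(n_clusters=2, max_iter=10, n_init=10, random_state=0).fit(X)
print(km.labels_)   # the two simulated groups are recovered
print(km.inertia_)  # within-cluster sum of squares (elbow diagnostic)
```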

Figure 4. The k-means plot for integrated metabolomic dataset

(3) Principal component analysis (PCA)

Principal component analysis (PCA) is the most commonly used unsupervised method in metabolomics studies (Wold et al., 1987; Bro and Smilde, 2014); it can visualize the dataset and display similarities and differences. A PCA plot shows a scatterplot with axes corresponding to two or more principal components (e.g. scores plots, loadings plots).

Scree plot

A scree plot displays the eigenvalues associated with each component or factor, in descending order, versus the number of the component or factor. You can use scree plots in principal component analysis and factor analysis to visually assess which components or factors explain most of the variability in the data.
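The quantities behind a scree plot are the per-component explained variances; a sketch with scikit-learn on toy data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix with one inflated direction of variance.
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 6))
X[:, 0] *= 5

pca = PCA().fit(X)
print(pca.explained_variance_)        # eigenvalues, in descending order
print(pca.explained_variance_ratio_)  # fraction of variance per component
```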

Loadings plot:

The loading plot of the metabolome from the model group represents the impact of the metabolites on the clustering results. PCA loading plots display variables positively correlated with the scores plots. Statistically significantly different metabolites responsible for the discrimination of the two groups were identified between the control and model groups. Data points far from the center indicate the ions most responsible for the variance in the scores plot.

Figure 5. The Scree plot for principal component analysis

Figure 6. The loading scatter plot for principal component analysis


Figure 7. The variable contribution plot for principal component analysis

Figure 8. The sample distribution plot for principal component analysis

Contribution plot of variables:

The contribution of each variable is represented in a barplot where each bar length corresponds to the loading weight (importance) of the feature in PCA for each component.

Scores plot:

Data are visualized with PC scores plots, where each point represents the individual spectrum of a sample, displaying the distribution of samples in multivariate space. The scores plot of the first two principal components establishes whether there is any intrinsic difference in the metabolic composition of the samples.

(4) Self-organizing map (SOM)

Self-organizing map (SOM) describes a mapping from a higher dimensional input space to a lower dimensional map space. The procedure for placing a vector from data space onto the map is to find the node with the closest weight vector to the vector taken from data space. Once the closest node is located, it is assigned the values from the vector taken from the data space. The SOM places similar input data in adjacent nodes. Therefore, SOM forms a semantic map where similar samples are mapped close together and dissimilar apart.

SOM has been applied to metabolic profiling for clustering blood plasma (Kaartinen et al., 1998) and NMR spectra of breast cancer tissues (Beckonert et al., 2003). More recently, Kouskoumvekaki et al. (2008) applied SOM to identify similarities among the metabolic profiles of different filamentous fungi. Meinicke et al. (2008) proposed a one-dimensional SOM for metabolite-based clustering and visualization of marker candidates. In a case study on the wound response of Arabidopsis thaliana, they showed how the clustering and visualization capabilities of SOM can be utilized to identify relevant groups of biomarkers.

SOMs have been used in metabolomics studies to visualize metabolic phenotypes and feature patterns as well as to prioritize the metabolites of interest based on their similarity (Kohonen et al., 2000; Meinicke et al., 2008; Mäkinen et al., 2008; Goodwin et al., 2014).


Figure 9. The Energy variation plot for SOM


Figure 10. The clustering plot for SOM (n = 9)



Figure 11. The sample class distribution for SOM


Figure 12. The sample superfamily class distribution for SOM

6 - Marker identification (Back to top)

(1) Fold Change (FC): a univariate analysis method commonly used in metabolomics data analysis. FC compares the absolute change between two group means. Given a metabolite, the fold change is calculated as the ratio of the mean metabolite levels between the two groups.
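A sketch of the fold-change computation on hypothetical intensities (the log2 fold change is also commonly reported):

```python
import numpy as np

# Toy intensities: 3 samples per group x 2 metabolites (hypothetical values).
case = np.array([[100.0, 50.0],
                 [120.0, 55.0],
                 [110.0, 45.0]])
control = np.array([[50.0, 50.0],
                    [55.0, 52.0],
                    [45.0, 48.0]])

# FC = ratio of group means per metabolite; log2(FC) centers "no change" at 0.
fc = case.mean(axis=0) / control.mean(axis=0)
log2fc = np.log2(fc)
print(fc)      # fold changes per metabolite: 2.2 and 1.0
print(log2fc)
```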

The top n differential metabolites in the FC analysis are visualized by a boxplot (Figure 13).

Figure 13. The boxplot for results of fold change


All metabolites are listed in the table below (Table 1).

Table 1. The analysis result of fold change



(2) PLS-DA:

Partial least squares (PLS; Fonville et al., 2010) is one of the most widely used supervised methods in metabolomics. It can be used as a binary classifier (PLS-DA, i.e., with a binary variable of interest). PLS components explain the covariance between the variable of interest and the metabolomics data. Therefore, the feature coefficients (loadings) of PLS components represent a measure of how much a feature contributes to the discrimination of the different sample groups. O-PLS models evolved from PLS models and factorize the data variance into two components: a first component which is correlated with the variable of interest, and a second, uncorrelated (i.e., orthogonal) component. A progressive move from the use of PLS models to O-PLS models has been observed in the metabolomics field (Fonville et al., 2010). Classification of metabolomics samples is commonly performed by fitting the discriminant analysis versions of PLS and O-PLS models (i.e., PLS-DA, O-PLS-DA; Kemsley, 1996; Bylesjö et al., 2006).

Scores plot

Scores plots generated by the supervised PLS-DA method provide visualizable representations of information-rich spectral data by means of dimensionality reduction. PLS-DA is a supervised method that guides this transformation using between-group variability to better reveal group structure. The resultant two- or three-dimensional scores plot is used to identify spectral features contributing to between-group variability, based on the separations observed between groups in the scores plot.

Loadings plot:

The loading plot of the metabolome from the model group represents the impact of the metabolites on the clustering results. PLS-DA loading plots display variables positively correlated with the scores plots. Statistically significantly different metabolites responsible for the discrimination of the two groups were identified between the control and model groups. Data points far from the center indicate the ions most responsible for the variance in the scores plot.


Validation

A common problem with PLS-DA is its propensity to overfit the data. Therefore, model validation, including permutation testing and cross-validation, is often required.

Permutation testing results of the PLS-DA models

A permutation test was implemented to validate the reliability of the model because of its propensity to overestimate the separation performance, which can be inspected by permutation tests, but not always by cross-validation [PMID: 22768978]. A permutation test involves randomly reassigning the class labels and performing PLS-DA on the newly relabeled dataset. The process is repeated hundreds or thousands of times, and the performance of the models is plotted as a validation plot for visual assessment. The x-axis reflects the extent of the permutations, with 1.0 representing the case that no class label is permuted and 0.0 meaning that all class labels are permuted. If the R2 and Q2 values calculated from the permuted data are lower than the original values in the validation plot, the validity of the supervised model is confirmed [PMID: 26443483]. The criteria for model validity are as follows: (1) all the Q2 values on the permuted dataset are lower than the Q2 value on the actual dataset; if this is not the case, the model is capable of fitting any kind of dataset well, i.e., it is overfitting. (2) The regression line (the line joining the actual Q2 point to the centroid of the cluster of permuted Q2 values) has a negative intercept on the y-axis. [PMID: 18767870]
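The idea of a permutation test can be sketched independently of PLS-DA with a simple separation statistic (toy data; a real validation would recompute the R2/Q2 of the PLS-DA model at each permutation):

```python
import numpy as np

# Toy one-dimensional data with a genuine group difference.
rng = np.random.default_rng(4)
X = np.concatenate([rng.normal(0, 1, 20), rng.normal(2, 1, 20)])
y = np.array([0] * 20 + [1] * 20)

def stat(x, labels):
    # Separation statistic: absolute difference of the group means.
    return abs(x[labels == 1].mean() - x[labels == 0].mean())

# Recompute the statistic under randomly permuted class labels and count
# how often the permuted value reaches the observed one.
observed = stat(X, y)
perms = [stat(X, rng.permutation(y)) for _ in range(999)]
p_value = (1 + sum(s >= observed for s in perms)) / (1 + len(perms))
print(round(p_value, 4))  # small p-value: separation unlikely by chance
```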

VIP scores

Variables are selected according to their VIP (Variable Importance in the Projection) values, which reflect the influence of each metabolite on the sample grouping. PLS-DA also produces variable importance measures. Two variable importance measures are available in MetaboAnalyst. The first, variable importance in projection (VIP), is a weighted sum of squares of the PLS loadings that takes into account the amount of explained Y-variance of each component. The other importance measure is based on a weighted sum of the PLS regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components.

To discover potential biomarkers among the thousands of variables, the potential biomarkers with the highest VIP values were selected for further analysis.


Figure 14. The scores plot for PLS-DA


Figure 15. The VIP plot for PLS-DA (n = 9)


Table 2. The results for PLS-DA


(3) OPLS-DA: OPLS-DA was introduced as an improvement of the PLS-DA method to discriminate two or more groups (classes) using multivariate data (Bylesjo et al. 2006; Trygg and Wold 2002). In OPLS-DA a regression model is calculated between the multivariate data and a response variable that only contains class information. The advantage of OPLS-DA compared to PLS-DA is that a single component is used as a predictor for the class, while the other components describe the variation orthogonal to the first predictive component. Wiklund et al. (2008) used the terms between treatment variations to describe the average effect of treatment and within treatment variation to describe the systematic remainder variation which is not related to the treatment. OPLS-DA is designed for modeling two classes of data to increase the class separation, simplify interpretation, and to find potential biomarkers.

Potential markers of interest were extracted from the combined S- and VIP-plots constructed from the OPLS analysis, and markers were chosen based on their contribution to the variation and correlation within the dataset.

The analysis results are very similar to those of PLS-DA.

Additionally, a further loading plot, the S-plot, is a visual method that can be used for the selection of biomarkers. Variables that are farthest from the origin in the S-plot are selected as potential biomarkers [PMID: 21458633]. The S-plot is a scatter plot that can be utilized to explain the variable influence on the OPLS-DA model. By combining the covariance (x-axis, p1) and correlation loading profiles (y-axis, p(corr)1), the S-plot can be utilized for the extraction of putative metabolites.

Figure 16. The visualization for analysis result of OPLS-DA


The analysis results for OPLS-DA are similar to those of PLS-DA (Table 2).

(4) Student's t-test: Student's t-test is often used to obtain an overview or rough ranking of potentially important features before applying more sophisticated multivariate analyses in omics. This statistical method is a univariate analysis and is applicable to metabolomics data; it analyzes metabolomics features independently. When assessing differences between two groups, Student's t-test is commonly applied. It is also frequently combined with multivariate analyses for further differential metabolite selection.
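A one-metabolite sketch using SciPy's Welch variant (equal_var=False), matching the "unequal variances" option used in the example analysis above (toy intensities):

```python
import numpy as np
from scipy import stats

# Toy intensities of one metabolite in two groups (hypothetical values).
case = np.array([5.1, 5.3, 4.9, 5.2, 5.0])
control = np.array([3.8, 4.1, 4.0, 3.9, 4.2])

# Welch's t-test: two-sample t-test without assuming equal variances.
t_stat, p = stats.ttest_ind(case, control, equal_var=False)
print(round(t_stat, 2), p < 0.05)  # significant difference between groups
```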

Figure 17. The visualization for analysis result of t-test


To control the multiple testing issue (i.e., false positives), the program offers many multiple-testing correction approaches.

Table 3. The results for Student's t-test


(5) Chi-squared test (CS): the chi-squared test is a filter-based feature ranking technique. It is used to examine the distribution of the class as it relates to the values of the feature in question. The null hypothesis is that there is no correlation: each value is as likely to have instances in any one class as in any other class. Given the null hypothesis, the chi-squared statistic measures how far the actual value is from the expected value. The larger this chi-squared statistic, the more likely it is that the distribution of values and classes are dependent; that is, the feature is relevant to the class.

Please note: CS is more likely to find significance to the extent that (1) the relationship is strong, (2) the sample size is large, and/or (3) the number of values of the two associated features is large.
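A sketch of chi-squared feature ranking with scikit-learn (which expects non-negative feature values, e.g. intensities; the data and labels are simulated):

```python
import numpy as np
from sklearn.feature_selection import chi2

# Simulated non-negative features; only feature 0 depends on the class.
rng = np.random.default_rng(6)
y = np.array([0] * 15 + [1] * 15)
X = rng.uniform(0, 1, size=(30, 4))
X[y == 1, 0] += 3.0

# Larger chi-squared score = stronger dependence between feature and class.
scores, pvals = chi2(X, y)
print(np.argmax(scores))  # 0: the class-dependent feature ranks first
```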

The analysis results include a series of features ordered by a specific rule.

(6) Correlation-based feature selection (CFS): CFS is a simple filter algorithm which is able to assess subsets of attributes rather than individual variables. Its evaluation mechanism relies on both the predictive power of a given variable and its degree of correlation with other features already included in the selected subset. This balance promotes a high correlation with the class information and a low level of inter-correlation within the subset. A heuristic merit is calculated by taking prediction and redundancy into account, discarding irrelevant and redundant information to select an optimal feature subset. Forward (iteratively adding variables to existing subsets until convergence) and backward (iteratively removing variables) search methods can be applied. CFS performs a parsimonious selection and can therefore drastically reduce the number of features in a highly inter-correlated dataset without loss of prediction performance.

Figure 17. The visualization for correlation analysis of samples

The CFS algorithm is usually applied to microarray datasets for gene selection. Recently, the method has also been applied in metabolic studies for the discovery of biomarkers. At the same time, the authors also showed that CFS had a remarkably different behaviour, producing feature sets of much lower cardinality which nevertheless resulted in classification models with high predictive power.

(7) Entropy-based Filters: entropy-based filter methods include three feature ranking techniques: information gain, gain.ratio and symmetrical.uncertainty. In MMEASE, information gain and symmetrical uncertainty are used for feature selection.

The analysis results for the entropy-based filter methods contain several features (m/z features) ordered by importance.
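The information-gain score underlying these filters can be sketched for one discretized feature (a toy, perfectly informative example):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Toy class labels and a discretized feature that fully determines the class.
y = np.array([0, 0, 0, 1, 1, 1])
x = np.array(["low", "low", "low", "high", "high", "high"])

# Information gain = class entropy minus conditional entropy given the feature.
cond = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
info_gain = entropy(y) - cond
print(info_gain)  # → 1.0 bit: the feature fully determines the class
```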

Table 4. The results for entropy-based filter methods


(8) Linear Model and Empirical Bayes Method: the linear model and empirical Bayes method (Smyth, 2004) assesses differential expression in microarray experiments. In metabonomic studies, the method is also applied to evaluate differences in the accumulation of metabolites [PMID: 24828308], using volcano plots, which assess differentially accumulated metabolites based on t-statistics and fold changes simultaneously. In a volcano plot, each point represents a single metabolite, with the y- and x-axes representing the p-value and fold change of that metabolite, respectively. Based on adjusted p-values (<0.05), a group of metabolites was selected as the most significantly altered during starvation.
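The coordinates of a volcano plot can be sketched with plain t-statistics and fold changes (toy data; the empirical-Bayes moderation of the linear-model approach is omitted here):

```python
import numpy as np
from scipy import stats

# Simulated intensities: 10 samples per group x 5 metabolites; metabolite 0
# is truly altered between the groups.
rng = np.random.default_rng(7)
case = rng.normal(10, 1, size=(10, 5))
control = rng.normal(10, 1, size=(10, 5))
case[:, 0] += 4

# Volcano coordinates: log2 fold change (x) vs -log10 p-value (y).
log2fc = np.log2(case.mean(axis=0) / control.mean(axis=0))
pvals = stats.ttest_ind(case, control, axis=0).pvalue
neglog_p = -np.log10(pvals)
print(np.argmax(neglog_p))  # 0: the altered metabolite stands out
```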

Figure 17. The volcano plot for linear model and empirical Bayes


Table 5. The results for linear model and empirical Bayes


(9) Relief: Recursive Elimination of Features (Relief) is a multivariate filter approach. The main idea of Relief is that the values of a significant attribute are correlated with the attribute values of an instance of the same class, and uncorrelated with the attribute values of an instance of the other class. For a given instance, Relief determines its two nearest neighbors: one from the same class, and one from the other class. It then estimates the value of an attribute ai by the difference between the conditional probabilities P(different value of ai | nearest instance from different class) and P(different value of ai | nearest instance from same class). Note that the nearest instances are identified according to the sum of differences over all attributes.

Table 6. The results for Relief


The method has been widely applied for gene selection on transcriptomic data. It has also been applied to UPLC-TOF/MS metabolic fingerprinting for the discovery of wound biomarkers in Arabidopsis thaliana [Pubmedxxxx] and to finding metabolic markers in prostate cancer using tandem mass spectrometry.
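The nearest-hit/nearest-miss weight update described above can be sketched as follows. This is a minimal sketch of the basic two-class Relief algorithm on simulated data, not the server's implementation:

```python
import numpy as np

def relief_scores(X, y, n_iter=100, seed=0):
    # Basic Relief for a two-class problem with continuous features.
    # A feature's weight grows when it differs on the nearest miss
    # (other class) and shrinks when it differs on the nearest hit
    # (same class); neighbors are found using all features at once.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                   # avoid division by zero
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(X))
        dist = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance to all
        dist[i] = np.inf                     # never pick the instance itself
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(dist[same])]
        miss = diff[np.argmin(dist[diff])]
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w / n_iter

# Toy data: feature 0 tracks the class, feature 1 is pure noise
rng = np.random.default_rng(1)
y = np.array([0] * 20 + [1] * 20)
X = np.column_stack([y + rng.normal(0, 0.1, 40), rng.normal(0, 1, 40)])
scores = relief_scores(X, y)    # feature 0 scores higher than feature 1
```

Features with large positive scores separate the classes; noise features score near zero.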

(10) Genetic Algorithm: The genetic algorithm is based on Darwin's theory of evolution, which holds that better adapted individuals win against their competitors under equal external conditions; this principle can be exploited for variable selection. Genetic algorithms have been widely applied to identify metabolic signatures of drug side effects, to cluster metabolomics data, and to classify metabolic profiles between bladder cancer patients and normal volunteers.

A feature list is obtained by genetic-algorithm-based feature selection.
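A genetic algorithm for feature selection evolves a population of binary masks through selection, crossover, and mutation. The sketch below is illustrative only: the fitness function (a simple class-separation score with a size penalty) and all parameters are assumptions, not MMEASE's actual configuration:

```python
import numpy as np

def ga_select(X, y, pop_size=30, n_gen=40, p_mut=0.05, seed=0):
    # Minimal genetic algorithm for feature-subset selection.
    # Individuals are binary masks over the features; fitness rewards
    # class separation of the selected features and small subsets.
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    a, b = X[y == 0], X[y == 1]
    # per-feature separation score between the two classes
    sep = np.abs(a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-9)

    def fitness(mask):
        if mask.sum() == 0:
            return -np.inf
        return sep[mask].mean() - 0.01 * mask.sum()

    pop = rng.random((pop_size, n)) < 0.5          # random initial masks
    for _ in range(n_gen):
        fit = np.array([fitness(m) for m in pop])
        order = np.argsort(fit)[::-1]
        parents = pop[order[: pop_size // 2]]      # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            p1, p2 = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)               # one-point crossover
            child = np.concatenate([p1[:cut], p2[cut:]])
            child ^= rng.random(n) < p_mut         # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    fit = np.array([fitness(m) for m in pop])
    return pop[np.argmax(fit)]

# Toy data: only feature 0 is informative
rng = np.random.default_rng(2)
y = np.array([0] * 25 + [1] * 25)
X = rng.normal(0, 1, (50, 10))
X[y == 1, 0] += 3.0
best = ga_select(X, y)      # mask retaining the informative feature
```

Because parents are carried over unchanged, good subsets are never lost, and the population converges toward masks containing the discriminative feature.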

(11) RF-RFE: Random forest–recursive feature elimination (RF-RFE), which combines RF with RFE, is a recursive backward feature elimination procedure. It begins with all the features; in each iteration, a random forest is constructed to measure feature importance and the least important feature is removed. This procedure is repeated until no feature is left. Finally, the features are ranked according to the order of deletion: the top-ranked feature is the last deleted and the most important.

Figure 18. The visualization for results of RF-RFE

Table 7. The results for RF-RFE

RF-RFE has been adopted to select the informative variables from the serum metabolome data.
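The recursive procedure above can be sketched with scikit-learn, pairing `RFE` with a random forest's built-in importance measure. This is an illustrative example on synthetic data, not the server's code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Toy two-class dataset: 8 features, of which 3 are informative
X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

# Recursively drop the least important feature (by RF importance)
# until `n_features_to_select` remain; `ranking_` records the order.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=3, step=1)
rfe.fit(X, y)

selected = np.where(rfe.support_)[0]   # indices of the retained features
```

Features with `ranking_ == 1` survive every elimination round; larger ranks mark features deleted earlier (i.e., judged less important).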


(12) SAM technique: Significance Analysis of Microarrays (SAM) is a permutation-based (non-parametric) hypothesis testing method for identifying molecular quantities that differ significantly between two measurement sets representing different physiological conditions. SAM was tailored for the analysis of transcriptional profiling data from DNA microarrays and has similarly been used for the analysis of other omic datasets. In metabolomic analysis specifically, SAM identifies as significant those metabolites whose absolute difference in concentration between two sets of samples exceeds the difference anticipated from random variation alone by more than a significance threshold δ.

Figure 19. The visualization for results of SAM

Unlike parametric hypothesis testing methods, permutation-based (non-parametric) methods do not require the data to follow a particular distribution. They also provide an estimation of the false discovery rate (FDR), the probability that a metabolite identified as differentially changing in concentration is a false positive. In addition, SAM allows the user to adjust the significance threshold δ and observe how the FDR and the number of significant metabolites respond to the change.

A series of mz features was obtained by SAM technique.
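The core of SAM is a moderated d-statistic compared against a permutation null. The sketch below is illustrative on simulated data; in particular, the choice of the exchangeability factor s0 (here simply the median standard error) is an assumption, as the original SAM tunes it more carefully:

```python
import numpy as np

def sam_d(X1, X2, s0=None):
    # SAM relative-difference statistic: a t-like statistic with a small
    # exchangeability factor s0 added to the pooled standard error so
    # that low-variance metabolites are not falsely called significant.
    m1, m2 = X1.mean(1), X2.mean(1)
    n1, n2 = X1.shape[1], X2.shape[1]
    pooled = ((X1 - m1[:, None]) ** 2).sum(1) + ((X2 - m2[:, None]) ** 2).sum(1)
    s = np.sqrt((1 / n1 + 1 / n2) * pooled / (n1 + n2 - 2))
    if s0 is None:
        s0 = np.median(s)      # simple choice; SAM tunes s0 more carefully
    return (m2 - m1) / (s + s0)

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 12))    # 200 metabolites, 6 + 6 samples
X[:10, 6:] += 3.0                  # 10 truly changed metabolites
d_obs = sam_d(X[:, :6], X[:, 6:])

# Permutation null: shuffle the group labels and recompute d
null = []
for _ in range(50):
    perm = rng.permutation(12)
    null.append(sam_d(X[:, perm[:6]], X[:, perm[6:]]))
d_null = np.abs(np.concatenate(null))

delta = np.quantile(d_null, 0.99)  # significance threshold from the null
called = np.abs(d_obs) >= delta
```

Varying the quantile (i.e., δ) and re-counting `called` against the null reproduces SAM's FDR-vs-threshold trade-off described above.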

(13) SVM-RFE: SVM recursive feature elimination (SVM-RFE) is a wrapper approach that uses the norm of the weight vector w to rank the variables. Initially, a classifier is computed on all the data. The norm of w is then computed for each feature and the feature with the smallest norm is eliminated. This process is repeated until all features are ranked. SVM-RFE is a good choice for avoiding overfitting when the number of features is high.

SVM-RFE is one of the best gene feature selection algorithms [PMID: 17666757]. It has gained popularity due to its effectiveness in discovering informative features in cancer classification and drug activity analysis [PMID: 15446820].

Feature selection using SVM-RFE has also been applied to identify the biomarkers driving classification in metabonomic data (Metabolomics (2011) 7:549–558).

The algorithm is used in microarray data analysis, particularly for disease gene finding. It eliminates redundant genes and yields better and more compact gene subsets.

Table 8. The results for SVM-RFE
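The weight-norm elimination loop described above can be sketched with scikit-learn's `RFE` wrapped around a linear SVM, whose `coef_` supplies the |w| criterion. This is an illustrative example on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Toy data; the linear SVM's weight vector w supplies the ranking criterion
X, y = make_classification(n_samples=100, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# At each step the feature with the smallest |w| component is eliminated;
# eliminating down to one feature yields a full ranking of all ten.
svm_rfe = RFE(estimator=LinearSVC(C=1.0, dual=False, max_iter=5000),
              n_features_to_select=1, step=1)
svm_rfe.fit(X, y)

ranking = svm_rfe.ranking_   # 1 = last feature standing (most important)
```

With `step=1` and one feature retained, `ranking_` is a complete permutation recording the reverse order of elimination.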


(14) SDCS: Stable SVM recursive feature elimination (SDCS) is a signature selection method that incorporates consensus scoring of multiple random sampling and multistep evaluation of gene-ranking consistency to maximally avoid erroneous elimination of predictor genes [PMID: 17942933].

Table 9. The results for SDCS


(15) Wilcoxon Rank Sum Test: The Wilcoxon rank sum test is based solely on the order in which the observations from two samples fall. It has been applied to delineate the role of sarcosine in prostate cancer progression and to reveal metabolic tissue biomarkers for gastric cancer.

Table 10. The results for Wilcoxon rank sum test
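Applying the test to one metabolite's intensities can be sketched with SciPy. This is an illustrative example on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two small groups of intensities for one metabolite; skewed (non-normal)
# data is fine because the test uses only the ranks of the pooled values
control = rng.lognormal(mean=2.0, sigma=0.4, size=8)
case = rng.lognormal(mean=2.7, sigma=0.4, size=8)

stat, p = stats.ranksums(case, control)   # Wilcoxon rank sum test
```

Running the test per m/z feature and sorting by p-value yields a marker table like Table 10.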


7 - Metabolite Annotation (Back to top)

Candidate molecules:

The annotation of high-precision tandem mass spectra of metabolites is the first and critical step toward identifying a molecule's structure. On the web server, users can choose a database for candidate molecule annotation. Candidate molecules from different databases often carry various types of chemical identifiers, including PubChem Compound ID (CID), KEGG ID, common name, METLIN ID, HMDB ID, and so on.

Accurate m/z values are searched against the following metabolite MS databases: the Human Metabolome Database (HMDB, 41,514 metabolites) (Wishart et al. 2007, 2009, 2013) and the general Metabolite and Tandem MS Database (METLIN, 242,766 metabolites) (Smith et al. 2005), with a threshold window specified in ppm or Da. A wide window is used to guarantee a thorough search, while a more stringent mass tolerance (e.g. ±1 ppm) is used when making the final assignment.
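The ppm-window search can be sketched as below. The candidate table here is a tiny illustrative stand-in (names and monoisotopic masses chosen for the example), not data from HMDB or METLIN:

```python
# Illustrative candidate table: (common name, monoisotopic mass in Da)
CANDIDATES = [
    ("glucose", 180.06339),
    ("lactate", 90.03169),
    ("alanine", 89.04768),
]

def annotate(mz, tol_ppm=10.0):
    # A candidate matches when |observed - reference| / reference * 1e6
    # is within the ppm tolerance window.
    hits = []
    for name, ref in CANDIDATES:
        ppm = abs(mz - ref) / ref * 1e6
        if ppm <= tol_ppm:
            hits.append((name, round(ppm, 2)))
    return hits

matches = annotate(180.0634, tol_ppm=10.0)   # matches glucose within ~0.1 ppm
```

A wide tolerance casts a broad net for candidates; re-running with a tight tolerance (e.g. ±1 ppm) filters the list for the final assignment, as described above.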

Metabolite category:

The human metabolome is not a single entity but consists of several components, including the following:

1) the endogenous metabolome (consisting of chemicals needed for, or excreted from, cellular metabolism),

2) the food metabolome (consisting of essential and nonessential chemicals derived from foods after digestion and subsequent metabolism by the tissues and the microbiota),

3) other xenobiotics derived from drugs, and

4) xenobiotics derived from environmental or workplace chemicals.

In the metabolite annotation module, metabolite category information is offered (e.g. endogenous, microbial, toxin/pollutant, food, food additive, drug, drug metabolite, cosmetic, traditional Chinese medicine ingredient, agricultural chemical).

Table 11. The results for metabolite annotation


8 - Metabolite Enrichment (Back to top)

Metabolite sets:

A group of metabolites is considered a metabolite set if there are established, empirically observed, or theoretically predicted functional associations among them. On the basis of these criteria, we have collected a total of 14,566 metabolite sets organized into four categories: 309 pathway-associated metabolite sets; 748 biofunction-associated metabolite sets, further divided into three groups according to the type of biofluid (CSF, blood, or urine) from which they were reported; 11,489 CFam-associated metabolite sets; and 959 metabolite sets based on species, genus, family, and order.

MMEASE supports a set of submodules for metabolite enrichment analysis: (1) KEGG pathways, (2) SMPDB pathways, (3) chemical families, (4) classes of food components and food additives, (5) biological function classes, (6) therapeutic classes of secondary metabolites of traditional medicine, (7) species taxonomy, and (8) categories of toxins and environmental pollutants.

Please note: for all eight enrichment analysis modules, users should upload their data and choose the proper module to perform enrichment analysis.

Taking the KEGG metabolite pathway and HMDB bio-function analyses as examples, results are displayed as bar, pie, or chord plots to visualize the enriched categories.

KEGG metabolite pathways:

The idea of metabolite pathway enrichment analysis is to identify coordinately changed Kyoto Encyclopedia of Genes and Genomes (KEGG; Ogata et al., 1999 PMID: 9847135) and Small Molecule Pathway Database (SMPDB; Frolkis et al., 2010 PMID: 19948758) pathways using metabolite data. [ http://bioinformatics.oxfordjournals.org/content/27/13/1878.full#ref-7]

As a starting input, metabolite common names, synonyms, or major database identifiers are supported. Query compounds are mapped against the annotation library and marked as belonging to KEGG pathways. A hypergeometric test is then applied to calculate the statistical enrichment of KEGG compounds of each pathway within the data.
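The hypergeometric test mentioned above can be sketched with SciPy. The counts in the example are illustrative, not taken from the MMEASE annotation library:

```python
from scipy.stats import hypergeom

def enrichment_pvalue(N, K, n, k):
    # Given N annotated metabolites in total, K of which belong to a
    # pathway, and a query of n metabolites containing k pathway members,
    # the enrichment p-value is P(X >= k) under the hypergeometric
    # distribution (survival function at k - 1).
    return hypergeom.sf(k - 1, N, K, n)

# Example: 1,000 annotated metabolites, a 40-member pathway, and a
# 20-metabolite query hitting 6 pathway members (expected ~0.8 by chance)
p = enrichment_pvalue(N=1000, K=40, n=20, k=6)
```

Repeating this for every pathway and sorting by p-value produces the ranking table described below.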

As a final output, a ranking table of the reference metabolite pathways is generated, including their p-values. For each pathway, a hyperlink lets the user jump to the KEGG metabolite pathway website for more detailed information. An enrichment graph is also generated for further visualization (a pie plot if the number of enriched pathways is below 5, else a bar plot).

Please note: the metabolite pathway annotations were taken from the KEGG database.


Figure 20. The results for KEGG pathway enrichment

HMDB biofunction:

The idea of biofunction enrichment analysis (a biofunction describes the biological role or activity of a metabolite) is to identify common biological roles using metabolite data.

As a starting input, metabolite common names, synonyms, or major database identifiers are supported. Query compounds are mapped against the annotation library and marked as belonging to HMDB biofunction terms. A hypergeometric test is then applied to calculate the statistical enrichment of HMDB metabolites of each biofunction term within the data.

As a final output, a ranking table of the reference biofunction terms is generated, including their p-values. An enrichment graph is also generated for further visualization (a pie plot if the number of enriched terms is below 5, else a bar plot). Please note: the biofunction term annotations were taken from the HMDB database.

Figure 21. The results for HMDB bio-function

The operation of the other enrichment modules is similar to that of the KEGG and HMDB modules; please refer to them.

    @ ZJU

    Please feel free to visit our website at https://idrblab.org




    Email

    Dr. Qingxia Yang (yangqx@cqu.edu.cn)

    Dr. Bo Li (libcell@cqnu.edu.cn)

    Dr. Sijie Chen (chansigit@gmail.com)

    Dr. Jing Tang (tangj@cqu.edu.cn)

    Prof. Yan Lou* (yanlou@zju.edu.cn)

    Prof. Feng Zhu* (zhufeng@zju.edu.cn)

    Address

    College of Pharmaceutical Sciences,

    Zhejiang University,

    Hangzhou, China

    Postal Code: 310058

    Phone/Fax

    +86-571-8820-8444