Currently, multiclass metabolomics has attracted increasing attention, because multiclass questions (such as distinguishing different tumor types or different responses to a therapy) are commonly encountered in real-world applications. However, a multiclass classification problem is intrinsically more difficult than a binary one. Therefore, it is necessary to provide a publicly available service for comprehensively and comparatively evaluating the performance of biomarker discovery and classification methods in multiclass metabolomics studies.

MultiClassMetabo is constructed to enable the online services of (a) identifying metabolic markers using marker identification methods, (b) constructing classification models using classification methods, and (c) performing a comprehensive assessment from multiple perspectives to construct the superior classification model for multiclass metabolomics. In particular, three well-established criteria, each with a distinct underlying theory, are integrated to ensure a much more comprehensive evaluation than any single criterion could provide. MultiClassMetabo offers the unique capability of selecting the most appropriate biomarker discovery and classification methods, through consistent assessment, for a given multiclass metabolomic dataset.



MultiClassMetabo is powered by R Shiny. It is free and open to all users with no login requirement, and can be readily accessed by a variety of popular web browsers (such as Chrome, Firefox, Edge, and Safari) and operating systems (such as Linux, MacOS, and Windows).

A local version of MultiClassMetabo, enabling assessment on a local computer, will be provided as soon as possible.

Thank you for using and improving MultiClassMetabo, and please feel free to report any errors to Dr. YANG at yangqx@njupt.edu.cn.

Summary and Visualization of the Uploaded Metabolomic Data


  • Summary of the Uploaded Data






  • Visualization of Data Distribution after Log Transformation






Download the Sample Data for Testing and File Format Checking


  • Multiclass Metabolomic Data

In a multiclass study (No. of classes >2), the metabolomic data are first collected from multiple model types (such as tissue types and probiotic species) and then analyzed to discover differential metabolites among the different classes (Lee CK, et al. Science. 363: 644-9, 2019). As shown in the sample data, the sample name and the class of each sample are required in the first two columns of the input file. The following columns of the input file provide the raw peak intensities across all samples. Unique IDs of each metabolite are listed in the first row of the csv file. The sample data of multiclass metabolomic data can be downloaded.

Summary and Visualization of the Uploaded Metabolomic Data


  • Summary of the Uploaded Data






  • Visualization of Data Distribution






Methods of Identifying Metabolic Markers


  • Kruskal–Wallis Test (KWT)

KWT is a non-parametric statistical test, which is applied to test the difference between multiple samples when the underlying population distributions are nonnormal or unknown (Abenavoli A, et al. J Am Osteopath Assoc. 120: 647-54, 2020). A significant Kruskal-Wallis test of a feature among the various classes indicates a statistically significant difference in that feature.

  • One-Way analysis of variance (ANOVA)

One-Way ANOVA compares means by partitioning the observed variance. It tests the means of two or more independent groups to determine whether there is statistical evidence that the associated population means are significantly different (Wu G, et al. Biochem Biophys Res Commun. 358: 1108-13, 2007).

  • Partial Least Squares-Discriminant Analysis (PLS-DA)

PLS-DA is a well-known tool for performing classification and regression. The popularity of PLS-DA is due to its widespread availability in most well-known statistical software packages. In addition, one of the perceived advantages of PLS-DA is that it can analyze highly collinear and noisy data (Gromski PS, et al. Anal Chim Acta. 879: 10-23, 2015).

  • Variable Selection from Random Forests (VSRF)

VSRF measures the variable importance of potentially influential parameters through the percent increase of the mean squared error (Kapwata T, et al. Geospat Health. 11: 434, 2016). The strength of VSRF lies in its flexibility and interpretability (Mayer J, et al. Bioinformatics. 34: 1336-44, 2018).

  • Support Vector Machine-Recursive Feature Elimination (SVM-RFE)

SVM-RFE is an efficient feature selection technique that measures the weights of the features according to the support vectors; noise and non-informative variables in high-dimensional data may affect the hyperplane of the SVM learning model (Lin X, et al. J Chromatogr B. 910: 149-55, 2012).

Summary and Visualization of the Metabolic Markers


  • Markers Identified by the Selected Method






Difference of the Metabolic Marker in all Classes

Input Rank No. of the Marker

  • Visualization of Different Classes Using PCA Plots






Individuals of PCA Plot Using Metabolomic Markers

Scree Plot of PCA Using Metabolomic Markers

  • Visualization of Contributions of Markers Using PCA Plots






Variables of PCA Plot Using Metabolomic Markers

Contribution of Markers to Dimension one


Multiclass Classification Methods of MultiClassMetabo


  • Adaptive Boosting (AdaBoost)

The idea of boosting is to take a weak classifier and use it to build a much better classifier, thereby boosting the performance of the weak classification algorithm. The most popular boosting algorithm is AdaBoost because it is adaptive (Dou L, et al. J Proteome Res. 20: 191-201, 2021). The AdaBoost classifier has been widely used for prediction tasks in bioinformatics (Yang X, et al. Comput Struct Biotechnol J. 18: 153-61, 2019).

  • Bagging

In bagging, a random sample of the training set is selected with replacement, so individual data points can be chosen more than once. After several data samples are generated, these weak models are trained independently, and the average of their predictions yields a more accurate estimate (Datta S, et al. BMC Bioinformatics. 11: 427, 2010). Bagging is a crucial concept for avoiding overfitting (Mi X, et al. Biometrics. 75: 674-84, 2019).

  • Decision Trees (DT)

A decision tree is a reliable and effective decision-making technique that provides high classification accuracy with a simple representation of the gathered knowledge (Luna JM, et al. Proc Natl Acad Sci U S A. 116: 19887-93, 2019). The most common applications of decision trees are data mining and data classification in different areas of medical decision making (Podgorelec V, et al. J Med Syst. 26: 445-63, 2002).

  • K-Nearest Neighbor (KNN)

KNN is a non-parametric algorithm without any assumption on underlying data (Wang Y, et al. IEEE Trans Neural Netw Learn Syst. 31: 1544-56, 2020). KNN is one of the simplest and most common classifiers and the core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples (Abu Alfeilat HA, et al. Big Data. 7: 221-48, 2019).

  • Linear Discriminant Analysis (LDA)

LDA is a widely used classification method with ready implementability and close relationships to many modern machine learning techniques (Ye Q, et al. Neural Netw. 105: 393-404, 2018). LDA can also represent data for more than two classes. Linear discriminant analysis takes the mean value for each class and considers the variance in order to make predictions, assuming a Gaussian distribution.

  • Naive Bayes (NB)

Naive Bayes is a statistical classification technique based on Bayes' theorem; it uses class labels, represented as feature values or vectors of predictors, for classification (Miasnikof P, et al. BMC Med. 13: 286, 2015). Naive Bayes classifiers achieve high accuracy and speed on large datasets (Zhang H, et al. Food Chem Toxicol. 143: 111513, 2020).

  • Partial Least Squares (PLS)

PLS operates by forming linear combinations of the predictors in a supervised manner, and then regresses the response on these latent variables. It can handle both univariate and multivariate response and is computationally fast. All of these properties make PLS an attractive candidate for high dimensional genomic data problems such as classification of tumor samples (Fort G, et al. Bioinformatics. 21: 1104-11, 2005).

  • Random Forest (RF)

The random forest classifier consists of multiple decision trees, just as a forest has many trees. It builds decision trees from random selections of data samples and gets predictions from every tree (de Santana FB, et al. Food Chem. 293: 323-32, 2019). Random forest classification offers a rapid, sensitive, and accurate solution for identifying signatures in omics data (Roguet A, et al. Microbiome. 6: 185, 2018).

  • Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm used for both classification and regression. The objective of SVM is to find a hyperplane in an N-dimensional space that distinctly classifies the data points (Nedaie A, et al. Neural Netw. 98: 87-101, 2018). The dimension of the hyperplane depends upon the number of features (de Boves Harrington P. Anal Chim Acta. 954: 14-21, 2017).

Summary and Visualization of the Multiclass Classification Model


  • Visualization of Receiver Operating Characteristic Curve and Precision-Recall Curve






Receiver Operator Characteristic (ROC) Curve

Precision-Recall (PR) Curve

  • Importance of the Metabolic Markers in the Multiclass Classification Model






Selection Frequency of Metabolic Markers

Average Importance of Metabolic Markers

Introduction to the Comprehensive Assessment Step of MultiClassMetabo


  • Criterion Ca: Separation Degree of Samples in Clustering Analysis Using Metabolomic Markers

Clustering analysis was adopted to indicate the separation degree of samples in multiple classes using metabolic markers in multiclass metabolomic data (Jacob S, et al. Diabetes Care. 40: 911-9, 2017). For a clustering outcome, a method of identifying metabolic markers was regarded as superior when an obvious separation was observed for different classes. A well-established index (purity) in the clustering was calculated and used as a representative measure to assess the quality of clustering (Jiang H, et al. Bioinformatics. 34: 3684-94, 2018). If the purity value is close to 0, the quality of the clustering is poor; if the purity value is close to 1, the quality of the clustering is excellent (Yang QX, et al. Nucleic Acids Res. 48: W436-48, 2020).

  • Criterion Cb: Consistency of Metabolic Markers Identified in Different Subgroups

The consistency of metabolic markers identified in different subgroups was regarded as an indispensable criterion to assess the performance of different methods (Somol P, et al. IEEE Trans Pattern Anal Mach Intell. 32: 1921-39, 2010). The raw dataset was first divided into three subgroups using stratified sampling. Three lists of metabolic markers were identified from three subgroups based on the same method of identifying metabolic markers. Then, the relative weighted consistency (CWrel) was used as a powerful measure to quantitatively assess the consistency of the three groups of metabolic markers. If the CWrel value is close to 1, the robustness of the metabolomic markers is high (Song X, et al. J Am Med Inform Assoc. 26: 242-53, 2019).

  • Criterion Cc: Accuracy of the Classification Model Using Metabolic Markers

Metabolic markers were identified using a specific method in a multiclass metabolomic dataset. Based on these metabolic markers, a multiclass classification model was constructed using a certain classification method (Peeters L, et al. J Chromatogr A. 1595: 240-7, 2019). Both the receiver operating characteristic (ROC) curve and the area under the curve (AUC) value were applied to assess the performance of the classification model. If the AUC value is close to 1, the performance of the classification model is excellent for the studied multiclass metabolomic data (Jiang J, et al. Hematology. 23: 221-7, 2018).

Summary and Visualization for a Comprehensive Assessment Using Multiple Criteria


  • Criterion Ca: Separation Degree of Samples in Clustering Analysis Using Metabolomic Markers

In the clustering analysis, a method of identifying metabolic markers is regarded as superior when an obvious separation is observed for different classes. If the purity value (a representative measure to assess the quality of clustering) is close to 1, the quality of the clustering is excellent.













Purity of clustering analysis:

  • Criterion Cb: Consistency of Metabolic Markers Identified in Different Subgroups

In the consistency analysis, a method of identifying metabolic markers is regarded as superior when there is a large overlap among three lists of metabolic markers identified from three subgroups. If the CWrel value (a powerful measure to assess the consistency of markers) is close to 1, the robustness of the metabolomic markers is high.













Consistency in three subgroups:

  • Criterion Cc: Accuracy of the Classification Model Using Metabolic Markers

For a multiclass metabolomic dataset, a receiver operating characteristic (ROC) curve of a classification model is applied for assessing the accuracy of a certain classification method. If the AUC value is close to 1, the performance of the classification model is excellent.













Area Under the Curve (AUC):


Table of Contents

1. The Compatibility of Browser and Operating System (OS)

2. Required Formats of the Input Files

3. Step-by-step Instruction on the Usage of MultiClassMetabo

3.1 Uploading the Customized Metabolomic Data or the Sample Data Provided in MultiClassMetabo

3.2 Methods of Identifying Metabolic Markers for Multiclass Metabolomics

3.3 Methods of Constructing Classification Model for Multiclass Metabolomics

3.4 Comprehensive Assessment to Construct the Superior Classification Model Based on Multiple Criteria


1. The Compatibility of Browser and Operating System (OS)

MultiClassMetabo is powered by R Shiny. It is free and open to all users with no login requirement, and can be readily accessed by a variety of popular web browsers and operating systems, as shown below.


2. Required Formats of the Input Files

In general, the file required at Step 1 of MultiClassMetabo should be a sample-by-feature matrix in csv format. In the uploaded dataset, only the sample name and class are required in the first two columns of the input file, with the headers kept as "sample name" and "class". The class label refers to the different classes of samples and is labeled with ordinal numbers, e.g., 1, 2, 3. The following columns of the input file provide the metabolites' raw intensities across all samples. Unique IDs of each metabolite are listed in the first row of the csv file. The sample data of multiclass metabolomic data can be downloaded.
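For reference, this layout can be sketched and checked programmatically. The metabolite IDs and intensity values below are hypothetical examples, not part of the provided sample data:

```python
import csv
import io

# Hypothetical example of the expected input layout: the first two columns
# must be "sample name" and "class"; the remaining columns are metabolite IDs.
sample_csv = (
    "sample name,class,M001,M002,M003\n"
    "S1,1,1200.5,83.1,410.0\n"
    "S2,2,980.0,77.4,395.2\n"
    "S3,3,1105.3,90.8,402.7\n"
)

rows = list(csv.reader(io.StringIO(sample_csv)))
header, data = rows[0], rows[1:]

# Basic checks mirroring the format described above.
assert header[:2] == ["sample name", "class"]
metabolite_ids = header[2:]          # unique metabolite IDs from the first row
classes = {row[1] for row in data}   # ordinal class labels, e.g. 1, 2, 3
```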


3. Step-by-step Instruction on the Usage of MultiClassMetabo

This website is free and open to all users, has no login requirement, and can be readily accessed by all popular web browsers, including Google Chrome, Mozilla Firefox, Safari, and Internet Explorer 10 (or later). Analysis and subsequent performance assessment are started by clicking on the "Analysis" panel on the homepage of MultiClassMetabo. The collection of web services and the whole process provided by MultiClassMetabo can be summarized in 4 steps: (3.1) Upload Metabolomic Data, (3.2) Identify Metabolic Markers, (3.3) Construct Multiclass Classification Model, and (3.4) Comprehensive Assessment for Constructing the Superior Classification Model.

3.1 Uploading the Customized Metabolomic Data or the Sample Data Provided in MultiClassMetabo

There is one set of radio buttons in STEP-1 on the left side of the Analysis page. Users can choose to upload their own metabolomic data or to directly load the sample data. After selecting the corresponding radio button, datasets provided by the users can be uploaded directly for further analysis by clicking "Browse". A preview of the uploaded data is subsequently provided on the web page. Moreover, users can process their data by uploading the raw data in a unified format.

The sample data are also provided in this step, facilitating direct access to and evaluation of MultiClassMetabo. In this sample dataset, the metabolomic data were collected from 180 healthy pregnant women, representing six time points and providing sufficient coverage to model the progression of normal pregnancy. Here, three time points (the first, intermediate, and last) were selected as the example dataset (Luan H, et al. Gigascience. 9; 4: 16, 2015). By clicking the Load Data button, the sample dataset selected by the users can be uploaded for further analysis.

3.2 Identify Metabolic Markers for Multiclass Metabolomics

MultiClassMetabo provides five methods for identifying metabolic markers in multiclass metabolomic data, including the Kruskal–Wallis Test, One-Way ANOVA, Partial Least Squares-Discriminant Analysis, Random Forest, and Support Vector Machine-Recursive Feature Elimination. A detailed explanation of each method is provided in this Manual. After selecting or defining the preferred methods and parameters, proceed by clicking the "PROCESS" button; the summary and plots of the biomarkers are then generated automatically. All resulting data and figures can be downloaded by clicking the corresponding "Download" button.

Kruskal–Wallis Test (KWT)

Kruskal-Wallis Test (KWT) is a non-parametric statistical test that is applied when the goal is to test the difference between multiple samples and the underlying population distributions are nonnormal or unknown (Abenavoli A, et al. J Am Osteopath Assoc. 120: 647-54, 2020). A significant Kruskal-Wallis test of a feature among the various classes indicates a statistically significant difference in that feature. In metabolomics studies, the Kruskal-Wallis Test is widely applied to identify metabolomic markers (Sawicka-Smiarowska E, et al. J Clin Med. 10: 5074, 2021).
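As an illustration of how the test works, a minimal pure-Python sketch of the Kruskal-Wallis H statistic is shown below. Average ranks are used for ties, but the tie-correction factor applied by standard statistical packages is omitted, and the three toy classes are made up:

```python
def average_ranks(values):
    """Assign 1-based ranks, averaging over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H = 12/(N(N+1)) * sum(R_g^2 / n_g) - 3(N+1), no tie correction."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)

# Three toy classes with clearly shifted intensity distributions.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

The H statistic is then compared against a chi-squared distribution to obtain the p-value, which a full implementation would report.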

One-Way ANOVA

One-Way ANOVA (analysis of variance) is one of the most frequently used statistical methods; it compares means by partitioning the observed variance. One-Way ANOVA tests the means of two or more independent groups to determine whether there is statistical evidence that the associated population means are significantly different (Wu G, et al. Biochem Biophys Res Commun. 358: 1108-13, 2007). Recently, one-way ANOVA has been widely used to discover features in metabolomic research (He Y, et al. Biomed Chromatogr. 33: e4478, 2019).
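The calculation behind the F statistic can be sketched in a few lines of Python (toy data; a real implementation would also report the p-value from the F distribution):

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of samples around their own group mean.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

f = one_way_anova_f([[1, 2, 3], [2, 3, 4]])
```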

Partial Least Squares-Discriminant Analysis (PLS-DA)

In metabolomic studies, partial least squares-discriminant analysis (PLS-DA) is the most well-known tool for performing classification and regression. The popularity of PLS-DA is due to its widespread availability in most well-known statistical software packages. In addition, one of the perceived advantages of PLS-DA is that it can analyze highly collinear and noisy data (Gromski PS, et al. Anal Chim Acta. 879: 10-23, 2015). PLS-DA is widely applied in various metabolomics studies (Khan A, et al. Analyst. 145: 1695-705, 2020).

Random Forest (RF)

The random forest (RF) statistical learning method, a relatively new variable-importance ranking method, measures the variable importance of potentially influential parameters through the percent increase of the mean squared error (Kapwata T, et al. Geospat Health. 11: 434, 2016). The strength of RF lies in its flexibility, interpretability, and ability to handle a number of features typically larger than the sample size (Mayer J, et al. Bioinformatics. 34: 1336-44, 2018). Random forest is widely used to select features in metabolomics (Oh TG, et al. Cell Metab. 32: 878-88.e6, 2020).
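The increase-in-MSE idea can be illustrated with a minimal permutation-importance sketch in Python. The "model" and data here are trivial stand-ins, and a deterministic rotation replaces the random permutation so the example is reproducible:

```python
def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, feature):
    """Increase in MSE after permuting one feature column.

    A deterministic rotation of the column stands in for the random
    permutation used in practice, so the example is reproducible."""
    baseline = mse(y, [predict(row) for row in X])
    column = [row[feature] for row in X]
    rotated = column[1:] + column[:1]
    X_perm = [row[:feature] + [v] + row[feature + 1:]
              for row, v in zip(X, rotated)]
    return mse(y, [predict(row) for row in X_perm]) - baseline

# Trivial stand-in model: the target simply equals feature 0.
X = [[1, 10], [2, 20], [3, 30], [4, 40]]
y = [1, 2, 3, 4]
predict = lambda row: row[0]

imp0 = permutation_importance(predict, X, y, 0)  # informative feature
imp1 = permutation_importance(predict, X, y, 1)  # irrelevant feature
```

Permuting the informative feature inflates the error, while permuting the irrelevant one leaves it unchanged; this gap is what the RF importance ranking exploits.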

Support Vector Machine-Recursive Feature Elimination (SVM-RFE)

Support vector machine-recursive feature elimination (SVM-RFE) is an efficient feature selection technique that has shown promising applications in the analysis of metabolome data. SVM-RFE measures the weights of the features according to the support vectors; noise and non-informative variables in high-dimensional data may affect the hyperplane of the SVM learning model (Lin X, et al. J Chromatogr B. 910: 149-55, 2012). Nowadays, SVM-RFE is widely used in metabolomics studies for revealing metabolite biomarkers (Gromski PS, et al. Anal Chim Acta. 829: 1-8, 2014).

3.3 Construct Classification Model for Multiclass Metabolomics

MultiClassMetabo provides nine methods for constructing classification models for multiclass metabolomic data, including AdaBoost, Bagging, Decision Trees, K-Nearest Neighbor, Linear Discriminant Analysis, Naive Bayes, Partial Least Squares, Random Forest, and Support Vector Machine. A detailed explanation of each method is provided in this Manual. After selecting or defining the preferred methods and parameters, proceed by clicking the "PROCESS" button; the summary and plots of the classification model are then generated automatically. All resulting data and figures can be downloaded by clicking the corresponding "Download" button.

AdaBoost

Boosting is a general strategy for learning classifiers by combining simpler ones. The idea of boosting is to take a weak classifier and use it to build a much better classifier, thereby boosting the performance of the weak classification algorithm. This is done by averaging the outputs of a collection of weak classifiers. The most popular boosting algorithm is AdaBoost because it is adaptive (Dou L, et al. J Proteome Res. 20: 191-201, 2021). The AdaBoost classifier has been widely applied to prediction tasks in bioinformatics (Yang X, et al. Comput Struct Biotechnol J. 18: 153-61, 2019). The AdaBoost algorithm is a well-performing classifier in untargeted metabolomics (Chetnik K, et al. Metabolomics. 16: 117, 2020).
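The adaptive re-weighting at the heart of AdaBoost can be illustrated with a single round of the classic weight update (the toy predictions are made up; a full implementation would repeat this over many weak learners and combine their weighted votes):

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost round: compute the learner weight (alpha) and
    re-weight samples so that misclassified points gain influence."""
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)                      # renormalize to sum to 1
    return alpha, [w / total for w in updated]

# Four samples, uniform weights; the weak learner misclassifies the last one.
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
```

After one round, the single misclassified sample carries half of the total weight, so the next weak learner concentrates on it; that is what makes the algorithm adaptive.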

Bagging

Bagging, also known as bootstrap aggregation, is an ensemble learning method commonly used to reduce variance within a noisy dataset. In bagging, a random sample of the training set is selected with replacement, which means that individual data points can be chosen more than once. After several data samples are generated, these weak models are trained independently, and the average or majority of their predictions yields a more accurate estimate (Datta S, et al. BMC Bioinformatics. 11: 427, 2010). Bagging is a crucial concept in statistics and machine learning that helps to avoid overfitting (Mi X, et al. Biometrics. 75: 674-84, 2019). Bagging is capable of improving classification or regression performance in metabolomics studies (Asakura T, et al. Anal Chim Acta. 1037: 230-6, 2018).
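The resampling-and-voting procedure can be sketched in pure Python with a one-nearest-neighbor base learner (the one-dimensional toy data and the choice of base learner are illustrative only):

```python
import random

def nn_predict(train, x):
    """1-NN base learner: label of the closest training point."""
    return min(train, key=lambda t: abs(t[0] - x))[1]

def bagged_predict(train, x, rounds=15, seed=0):
    """Train one 1-NN per bootstrap resample and take a majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(rounds):
        # Bootstrap: sample with replacement, so points may repeat.
        boot = [rng.choice(train) for _ in train]
        votes.append(nn_predict(boot, x))
    return max(set(votes), key=votes.count)

# Two well-separated classes on a line: class 0 near 0, class 1 near 5.
train = [(0.0, 0), (0.1, 0), (0.2, 0), (5.0, 1), (5.1, 1), (5.2, 1)]
label = bagged_predict(train, 0.05)
```

Each resample yields a slightly different weak model; aggregating their votes smooths out the variance any single model would show.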

Decision Trees (DT)

The Decision Tree algorithm is one of the most popular machine learning algorithms. It uses a tree-like structure of decisions and their possible combinations to solve a particular problem. It belongs to the class of supervised learning algorithms and can be used for both classification and regression. A decision tree is a reliable and effective decision-making technique that provides high classification accuracy with a simple representation of the gathered knowledge (Luna JM, et al. Proc Natl Acad Sci U S A. 116: 19887-93, 2019). The most common applications of decision trees are data mining and data classification in different areas of medical decision making (Podgorelec V, et al. J Med Syst. 26: 445-63, 2002). The decision tree algorithm can be used to build classification models based on metabolomic profiles (Shao CH, et al. Oncotarget. 8: 38802-10, 2017).

K-Nearest Neighbor (KNN)

K-Nearest Neighbour is one of the simplest machine learning algorithms based on the supervised learning technique. The KNN algorithm stores all the available data and classifies a new data point based on similarity, and it can be used for regression as well as classification. KNN is a non-parametric algorithm, which means it does not make any assumption about the underlying data (Wang Y, et al. IEEE Trans Neural Netw Learn Syst. 31: 1544-56, 2020). The KNN classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples (Abu Alfeilat HA, et al. Big Data. 7: 221-48, 2019). KNN can be applied to metabolomic data classification (Fan X, et al. Comput Intell Neurosci. 2021: 1051172, 2021).
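The distance-and-vote core of KNN can be sketched in a few lines of Python (toy two-dimensional data with made-up class labels):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify by majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Two toy clusters of samples: class "A" near the origin, class "B" near (5, 5).
train = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
    ((5.0, 5.0), "B"), ((5.1, 5.2), "B"), ((5.2, 5.1), "B"),
]
label = knn_predict(train, (0.1, 0.1), k=3)
```

Euclidean distance is used here; as the cited work discusses, the choice of distance measure is central to KNN performance.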

Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) is a dimensionality reduction technique that is commonly used for supervised classification problems. It models differences between groups in order to separate two or more classes, projecting features from a higher-dimensional space into a lower-dimensional space. LDA is a widely used classification method with ready implementability and close relationships to many modern machine learning techniques (Ye Q, et al. Neural Netw. 105: 393-404, 2018). LDA can also represent data for more than two classes. Linear discriminant analysis takes the mean value for each class and considers the variance in order to make predictions, assuming a Gaussian distribution. LDA is widely used for discovering metabolomic biomarkers (Saude EJ, et al. Am J Respir Crit Care Med. 179: 25-34, 2009).
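The mean-per-class idea can be sketched as a nearest-class-mean classifier; note this coincides with full LDA only under equal class priors and a shared spherical covariance, so it is a simplified illustration on toy data:

```python
import math

def class_means(X, y):
    """Mean feature vector for each class label."""
    means = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def nearest_mean_predict(means, query):
    """Assign the query to the class with the closest mean."""
    return min(means, key=lambda label: math.dist(means[label], query))

X = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]]
y = ["c1", "c1", "c2", "c2"]
means = class_means(X, y)
label = nearest_mean_predict(means, [0.2, 0.2])
```

Full LDA additionally weights distances by the pooled within-class covariance, which this sketch omits.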

Naive Bayes (NB)

The Naive Bayes algorithm is one of the popular classification machine learning algorithms; it classifies data based on the computation of conditional probability values. It implements Bayes' theorem for the computation and uses class labels, represented as feature values or vectors of predictors, for classification (Miasnikof P, et al. BMC Med. 13: 286, 2015). Naive Bayes is a statistical classification technique based on Bayes' theorem and is one of the simplest supervised learning algorithms. Naive Bayes classifiers are fast, accurate, and reliable, achieving high accuracy and speed on large datasets (Zhang H, et al. Food Chem Toxicol. 143: 111513, 2020). Naive Bayes can be used for the identification and validation of multivariable prediction models in metabolomics (Adam MG, et al. Gut. 70: 2150-8, 2021).
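A minimal Gaussian Naive Bayes sketch in pure Python shows the conditional-probability computation (one feature, two classes, toy data; variances are floored with a small epsilon to avoid division by zero):

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Per-class mean and variance for each feature, plus class priors."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(zip(*rows), means)]
        model[label] = (means, variances, n / len(X))
    return model

def predict_gaussian_nb(model, row):
    """Pick the class maximizing log prior + sum of Gaussian log-likelihoods."""
    def log_posterior(params):
        means, variances, prior = params
        ll = math.log(prior)
        for x, m, v in zip(row, means, variances):
            ll += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return ll
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy 1-D intensities: "low" values near 1, "high" values near 5.
X = [[1.0], [1.2], [0.8], [5.0], [5.4], [4.6]]
y = ["low", "low", "low", "high", "high", "high"]
model = fit_gaussian_nb(X, y)
label = predict_gaussian_nb(model, [1.1])
```

The "naive" part is the per-feature independence assumption: the log-likelihoods of the features are simply summed.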

Partial Least Squares (PLS)

Partial least squares (PLS), a well known dimension reduction method, has been gaining a lot of attention in high dimensional classification problems of computational biology. PLS has been promoted as a multivariate linear regression method that can deal with large number of predictors, small sample size, and high collinearity among predictors (Boulesteix AL, et al. Brief Bioinform. 8: 32-44, 2007). PLS operates by forming linear combinations of the predictors in a supervised manner, and then regresses the response on these latent variables. It can handle both univariate and multivariate response and is computationally fast. All of these properties make PLS an attractive candidate for high dimensional genomic data problems such as classification of tumor samples that are in the order of tens or hundreds based on thousands of features (Fort G, et al. Bioinformatics. 21: 1104-11, 2005). PLS is widely used for prediction in metabolomics (Lee SY, et al. J Sci Food Agric. 98: 240-52, 2018).

Random Forest (RF)

The random forest classifier is a supervised learning algorithm that can be used for regression and classification problems. It is among the most popular machine learning algorithms due to its high flexibility and ease of implementation, as it consists of multiple decision trees just as a forest has many trees. On top of that, it uses randomness to enhance its accuracy and combat overfitting, which can be a serious issue for such a sophisticated algorithm. The algorithm builds decision trees from random selections of data samples and gets predictions from every tree (de Santana FB, et al. Food Chem. 293: 323-32, 2019). Random forest classification offers a rapid, sensitive, and accurate solution for identifying signatures in omics data (Roguet A, et al. Microbiome. 6: 185, 2018). Random forest can be used to validate biomarker metabolites and establish a diagnostic model (Yang QJ, et al. J Cachexia Sarcopenia Muscle. 9: 71-85, 2018).

Support Vector Machine (SVM)

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. SVM is a supervised machine learning algorithm used for both classification and regression. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points (Nedaie A, et al. Neural Netw. 98: 87-101, 2018). The dimension of the hyperplane depends upon the number of features; if the number of input features is two, the hyperplane is just a line (de Boves Harrington P. Anal Chim Acta. 954: 14-21, 2017). SVM has great potential in metabolomics research because of its speed advantage (de Boves Harrington P. Anal Chem. 87: 11065-71, 2015).

3.4 A Comprehensive Assessment for Superior Multiclass Classification Model Based on Multiple Criteria

In this study, the comprehensive assessment of methods for identifying metabolomic markers and constructing classification models was achieved using three evaluation criteria for multiclass metabolomic studies. Additionally, one metric was selected to quantify the performance of the methods under each criterion. Based on the well-defined cutoff of each metric, the performance of different methods can be categorized into superior, good, or poor.

Criterion Ca: Separation Degree of Samples in the Clustering Using Metabolomic Markers

K-means clustering is a commonly used method to partition data into several groups by minimizing the variation in values within clusters (Jacob S, et al. Diabetes Care. 40: 911-9, 2017). First, samples are randomly assigned to one of a prespecified number of groups. Then, the mean value of the observations in each group is calculated, and samples are reassigned to the group with the closest mean. This process proceeds iteratively until the mean value of each group no longer changes (Jacob S, et al. Diabetes Care. 40: 911-9, 2017). Therefore, the k-means clustering plot can be used to evaluate a method's effect on differential metabolic analysis: more distinct separation between groups indicates better performance of the applied method (Välikangas T, et al. Brief Bioinform. 19: 1-11, 2018). In the clustering analysis, a method of identifying metabolic markers is regarded as superior when an obvious separation is observed for the different classes. If the purity value (a representative measure to assess the quality of clustering) is close to 1, the quality of the clustering is excellent (Huang S, et al. PLoS One. 9: e90109, 2014). When the purity values fall within the ranges of >0.8, ≤0.8 & >0.5, and ≤0.5, the corresponding methods are categorized as having superior, good, and poor performance, respectively.
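The purity measure and the cutoffs above can be sketched directly in Python (the cluster label lists are toy data):

```python
from collections import Counter

def purity(clusters):
    """Fraction of samples carrying the majority class label of their cluster.

    Each cluster is given as the list of true class labels of its members."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

def categorize(p):
    """Cutoffs from the text: >0.8 superior, >0.5 good, otherwise poor."""
    return "superior" if p > 0.8 else "good" if p > 0.5 else "poor"

# Two clusters of true class labels; 5 of 6 samples sit with their cluster's majority.
p = purity([[1, 1, 2], [2, 2, 2]])
```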

Criterion Cb: Consistency of Metabolic Markers Identified in Different Subgroups

Under this criterion, a consistency score is defined to quantitatively measure the overlap of identified metabolic markers among different partitions of a given dataset (Wang X, et al. Mol Biosyst. 11: 1235-40, 2015). A higher consistency score represents more robust metabolic marker identification for that dataset. In the consistency analysis, a method of identifying metabolic markers is regarded as superior when there is a large overlap among the three lists of metabolic markers identified from the three subgroups. If the CWrel value (a powerful measure to assess the consistency of markers) is close to 1, the robustness of the metabolomic markers is high (Song X, et al. J Am Med Inform Assoc. 26: 242-53, 2019). When the CWrel values fall within the ranges of >0.3, ≤0.3 & >0.15, and ≤0.15, the corresponding methods are categorized as having superior, good, and poor performance, respectively.
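The exact CWrel definition follows the cited Somol et al. work; as a simplified illustration only (this is not the CWrel formula), the average pairwise Jaccard overlap below captures the same intuition of agreement among subgroup marker lists (the metabolite names are made up):

```python
from itertools import combinations

def mean_pairwise_jaccard(marker_lists):
    """Average Jaccard overlap between every pair of marker sets.

    A simplified stand-in for CWrel: 1.0 means identical marker lists
    across subgroups, while values near 0 mean little overlap."""
    sets = [set(m) for m in marker_lists]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Three subgroup marker lists sharing two of their three markers.
score = mean_pairwise_jaccard([
    ["glucose", "lactate", "alanine"],
    ["glucose", "lactate", "serine"],
    ["glucose", "lactate", "valine"],
])
```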

Criterion Cc: Accuracy of the Classification Model using Metabolic Markers

Under this criterion, the receiver operating characteristic (ROC) curve together with the area under the curve (AUC) value are provided. First, differential metabolic features are identified by partial least squares-discriminant analysis (PLS-DA). Second, SVM models are constructed based on the identified differential features. After k-fold cross-validation, a method with a larger area under the ROC curve and a higher AUC value is recognized as performing well (De Livera AM, et al. Anal Chem. 84: 10768-76, 2012; Risso D, et al. Nat Biotechnol. 32: 896-902, 2014; Gromski PS, et al. Metabolomics. 11: 684-95, 2015). For a multiclass metabolomic dataset, the ROC curve of a classification model is applied to assess the accuracy of a certain classification method. If the AUC value is close to 1, the performance of the classification model is excellent (Jiang J, et al. Hematology. 23: 221-7, 2018). When the AUC values fall within the ranges of >0.9, ≤0.9 & >0.7, and ≤0.7, the corresponding methods are categorized as having superior, good, and poor performances, respectively.
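The AUC can be computed from pairwise score comparisons, which is equivalent to the area under the ROC curve; for a multiclass dataset, such one-vs-rest AUCs are commonly averaged over the classes. A minimal sketch with toy scores:

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive scores above a random negative
    (ties count one half); equivalent to the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# One-vs-rest scores for a single class (toy values): one positive is
# ranked above both negatives, the other above only one of them.
score = auc([0.8, 0.3], [0.5, 0.1])
```

A score of 1 means every positive outranks every negative (perfect separation); 0.5 is no better than chance.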


    Email

    Qingxia Yang

    (yangqx@njupt.edu.cn)


    Address

    Nanjing University of Posts and Telecommunications,

    Nanjing, China

    Postal Code: 210023