ANPELA is an online tool that specializes in the optimization of proteome quantification.
ANPELA 2.0 is updated to satisfy the research demands of Single-cell Proteomics (SCP). Particularly, it (a) describes the first systematic workflow for quantifying SCP data generated by both flow and mass cytometry, (b) assesses quantification performance based on multiple independent criteria, and (c) identifies the proper quantification workflow for the studied dataset by comprehensively ranking over 1,000 available workflows. These unique functions make ANPELA capable of supporting Cell Subpopulation Identification (CSI) and Pseudo-time Trajectory Inference (PTI) in current SCP research.
ANPELA 1.0 primarily targets quantification for Bulk Proteomics. Particularly, it (1) enables label-free proteome quantification (LFQ) based on three measurements (SWATH-MS, Peak Intensity and Spectral Counting), (2) realizes LFQ performance evaluation from different perspectives, and (3) identifies the optimal LFQ workflows based on comprehensive performance ranking.
Citing ANPELA:
1. Zhang Y, Sun HC, Lian XC, Tang J, Zhu F*. ANPELA: significantly enhanced quantification tool for cytometry-based single-cell proteomics. Advanced Science. 10(15): e2207061 (2023). doi: 10.1002/advs.202207061; PMID: 36950745
2. Tang J, Fu JB, Wang YX, Li B, Li YH, Yang QX, Cui XJ, Hong JJ, Li XF, Chen YZ, Xue WW, Zhu F*. ANPELA: analysis and performance assessment of the label-free quantification workflow for metaproteomic studies. Briefings in Bioinformatics. 21(2): 621-636 (2020). doi: 10.1093/bib/bby127; PMID: 30649171
Browser and Operating System (OS) Tested for Smoothly Running ANPELA:
ANPELA is powered by R shiny. It is free and open to all users with no login requirement and can be readily accessed by a variety of popular web browsers and operating systems as shown below.
Manual for Using the Standalone Version of ANPELA:
To run this standalone tool, three sequential procedures should be performed. First, install the R and RStudio environments after downloading their Installation Files. Second, download the Source Code of standalone ANPELA. Third, run ANPELA in RStudio by executing the R commands provided in the User Manual. Besides the same sets of assessment metrics and plots as the online ANPELA, the standalone version enables the discovery of the optimal workflows from hundreds of preprocessing workflows based on overall performance ranking. The exemplar input/output files can be downloaded HERE.
ANPELA 2.0 is capable of AUTOMATICALLY detecting the raw SCP flow-cytometry-standard files (.fcs) generated by flow/mass cytometry. ANPELA 1.0 can detect the diverse formats of data generated by all quantification software for SWATH-MS, Peak Intensity and Spectral Counting.
The Previous Version of ANPELA can be accessed at: http://idrblab.cn/anpela2020/
Thanks a million for using and improving ANPELA, and please feel free to report any errors to Dr. Zhang.
Welcome to Download the Sample Data for Testing and for File Format Correction
- Cell Subpopulation Identification
The compressed file (in .zip format) containing raw FCS files generated by flow cytometry, together with the corresponding gold-standard data for performance evaluation using Criterion Cd, can be downloaded.
The compressed file (in .zip format) containing raw FCS files generated by mass cytometry, together with the corresponding gold-standard data for performance evaluation using Criterion Cd, can be downloaded.
- Pseudo-time Trajectory Inference
The compressed file (in .zip format) containing raw FCS files generated by flow cytometry, the metadata file providing the correspondence between filenames and time points, together with the corresponding pathway-hierarchy data for performance evaluation using Criterion Cd, can be downloaded.
The compressed file (in .zip format) containing raw FCS files generated by mass cytometry, the metadata file providing the correspondence between filenames and time points, together with the corresponding pathway-hierarchy data for performance evaluation using Criterion Cd, can be downloaded.
Summary and Visualization of the Uploaded Raw SCP Data
- The Expression of Proteins (columns) in Different Cells (rows)
- Stacked Density Plot for Different Samples
WARNING
The filenames of your uploaded FCS files are inconsistent with those of your uploaded metadata file.
Note that ANPELA requires the user to upload FCS files whose filename order is exactly the same as that of the metadata file.
Please refresh the page and reupload the raw FCS files & metadata in the correct format.
WARNING
The number of uploaded FCS file(s) is not enough for the subsequent analysis.
Particularly, for two-class research, at least two samples for each class are required; for trajectory inference research, at least two time points are required.
Please refresh the page and reupload the raw FCS files & metadata in the correct format.
Summary and Visualization of the Uploaded Raw Data
The Data File Has Been Successfully Uploaded and Recognized as the Output of the Quantification Software:
Please Upload the Corresponding Label File Indicating the Classes of Each Sample
Summary and Visualization of the Uploaded Raw Data
The Label File Has Been Successfully Uploaded; Please Upload the Corresponding Data File Generated by Popular Software of the Selected MOA
Instruction to the User
1. Please Choose a File in the Unified Format Defined by ANPELA in the Left Side Panel
2. Please Process the Uploaded Data by Clicking the "Upload Data" Button in the Left Side Panel
Instruction to the User
1. Please Choose Your Preferred “Mode of Acquisition (MOA)” in the Left Side Panel
SWATH-MS Data (sequential windowed acquisition of all theoretical fragment ion mass spectra)
A sample data file of this MOA can be downloaded HERE, together with an additional label file
Peak Intensity (pre-processing the data acquired based on precursor ion signal intensity)
A sample data file of this MOA can be downloaded HERE, together with an additional label file
Spectral Counting (pre-processing the data acquired based on MS2 spectral counting)
A sample data file of this MOA can be downloaded HERE, together with an additional label file
2. Please Upload the Data File Generated by Popular Software of the Selected MOA in the Left Side Panel
3. Please Upload the Label File Indicating the Classes of Each Sample in the Left Side Panel
4. Please Process the Uploaded Data by Clicking the “Upload Data” Button in the Left Side Panel
Summary and Visualization of Raw Data
A. Summary of the Raw Data
B. Distribution of Protein Intensities Before and After Log Transformation
Table of Contents
1. Step-by-step Instruction on the Usage of ANPELA
1.1 Uploading Quantification Data
1.2 Data Transformation & Pretreatment
1.3 Data Filtering & Missing Value Imputation
1.4 Performance Assessment of Label-free Quantification from Multiple Perspectives
2. Various Kinds of Quantification Software for Pre-processing Raw Proteomics Data
2.1 Software for Pre-processing the Data Acquired Based on SWATH-MS
2.2 Software for Pre-processing the Data Acquired Based on Peak Intensity
2.3 Software for Pre-processing the Data Acquired Based on Spectral Counting
3. A Variety of Methods for Data Manipulation at Different Manipulation Stages
3.1 Methods for Transformation
3.2 Methods for Pretreatment
3.2.1 Methods for Centering
3.2.2 Methods for Scaling
3.2.3 Methods for Normalization
3.3 Methods for Missing Value Imputation
4. Diverse MS Systems for Proteome Quantification
4.1 AB SCIEX Q-TOF Systems
4.2 Agilent Q-TOF Mass Spectrometer
4.3 Bruker Hybrid Q-TOF Mass Spectrometer
4.4 Thermo Fisher Scientific Orbitrap
5. References
Analysis and subsequent performance assessment start by clicking the “Analysis” panel on the homepage of ANPELA. The complete workflow provided by ANPELA comprises: (Step 1) uploading the quantification data, (Step 2) assessing each method's assumptions and performing data transformation & pretreatment, (Step 3) data filtering & missing value imputation, and (Step 4) performance assessment of the proteome quantification.
By clicking “Upload Quantification Data”, users can upload their data in the various formats generated by popular software tools for label-free quantification. All of these tools process raw proteomics data acquired by 3 quantification measurements (SWATH-MS, peak intensity and spectral counting). Users are asked to upload the file containing the data generated by those tools, together with a label file indicating the class of each sample (detailed information on the file format can be found in Section 2 of this Manual). Moreover, in case users want to process their data before ANPELA analysis, they can upload their processed data in a unified format defined by ANPELA, which can be readily found HERE (Right Click to Save). By clicking the “Upload Data” button, the quantification data provided by the users are uploaded for further analysis.
Three sets of sample data are also provided in this step, facilitating direct access to and evaluation of ANPELA. These sample data are all benchmark datasets collected from the PRoteomics IDEntifications (PRIDE) database developed by the European Bioinformatics Institute. Particularly, the sample data for SWATH-MS are the dataset PXD000672 containing 12 non-tumorous samples and 12 samples of patients with clear cell renal cell carcinoma (Guo T, et al. Nat Med. 21(4):407-413, 2015); the sample data for peak intensity are the dataset PXD005144 with 66 samples of pancreatic cancer patients and 36 samples of chronic pancreatitis patients (Saraswat M, et al. Cancer Med. 6(7):1738-1751, 2017); and the sample data for spectral counting are the dataset PXD001819 providing yeast cell lysate samples of different concentrations (0.5 vs 50 fmol/microgram) acquired by MS2 spectral counting (Ramus C, et al. J Proteomics. 132:51-62, 2016). By clicking the “Load Data” button, the sample dataset selected by the users is uploaded for further analysis.
Each manipulation method is based on its own statistical assumptions about the data, which may make it inappropriate for manipulating some proteomics data. Taking pretreatment methods as examples, there are generally three types of assumptions: (Assumption A) all proteins are assumed to be equally important; (Assumption B) the level of protein abundance is assumed to be constant among all samples; (Assumption C) the intensities of the vast majority of proteins are assumed to be unchanged under the studied conditions. Due to these distinct assumptions, some methods may be fundamentally inappropriate for certain datasets and cannot be assessed on them. Therefore, before any performance assessment, users should first analyze the nature of their datasets, and then assess and indicate whether each method's assumption holds for these data.
Users are provided with the option to pretreat their uploaded data. In total, 3 types of transformation methods frequently applied to label-free proteomics data are included. Furthermore, the current version of ANPELA offers 18 pretreatment methods popular for centering, scaling and normalizing proteomics data. A detailed explanation of each method is provided in Section 3 of this Manual. By clicking the “PROCESS” button, a summary of the processed data and a plot of the intensity distribution before and after data manipulation are automatically generated. All resulting data and figures can be downloaded by clicking the “Download” button. Moreover, the sample outputs of "Summary of the Processed Data" and "Distribution of Protein Intensities", which behave interactively in the same way as the real output, are provided.
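For intuition on what a transformation followed by a centering/scaling pretreatment does to one protein's intensities, here is a minimal Python sketch (ANPELA itself implements these steps in R; the log2 transform with a small offset and the autoscaling scheme shown are common conventions, not ANPELA's exact code):

```python
import math

def log2_transform(values, offset=1.0):
    """Transformation: compresses the dynamic range of raw intensities.
    The small offset avoiding log of zero is an assumed convention."""
    return [math.log2(v + offset) for v in values]

def autoscale(values):
    """Pretreatment: mean-centering followed by unit-variance scaling
    (autoscaling, one common centering + scaling scheme)."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

# one protein's intensities across samples, before and after manipulation
scaled = autoscale(log2_transform([100.0, 400.0, 1600.0]))
```

After autoscaling, the values are centered on zero with unit sample variance, so intensities measured on very different scales become directly comparable.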
Data filtering and missing value imputation are provided in this step. The filtering method used here is basic filtering, and 7 imputation methods frequently applied to treat missing values are covered: Background Imputation, Bayesian Principal Component Imputation, Censored Imputation, K-nearest Neighbor Imputation, Local Least Squares Imputation, Singular Value Decomposition and Zero Imputation. A detailed explanation of each imputation method is provided in Section 3 of this Manual. By clicking the “PROCESS” button, a summary of the processed data and a plot of the intensity distribution before and after data manipulation are automatically generated. All resulting data and figures can be downloaded by clicking the “Download” button. Moreover, the sample outputs of "Summary of the Processed Data" and "Distribution of Protein Intensities", which behave interactively in the same way as the real output, are provided.
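The idea behind K-nearest Neighbor Imputation, one of the 7 methods above, can be sketched in a few lines of Python (ANPELA's actual implementation is in R; the Euclidean distance on shared proteins and the toy matrix here are simplifying assumptions):

```python
import math

def knn_impute(matrix, k=2):
    """K-nearest-neighbor imputation: a missing protein intensity (None) is
    replaced by the average of that protein in the k samples whose observed
    profiles are closest (Euclidean distance over shared proteins)."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))
    imputed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v is None:
                # rank the other samples that observed protein j by distance to sample i
                donors = sorted(
                    (dist(row, other), other[j])
                    for other in matrix if other is not row and other[j] is not None
                )[:k]
                imputed[i][j] = sum(val for _, val in donors) / len(donors)
    return imputed

# rows = samples, columns = proteins; None marks a missing value
data = [[1.0, 2.0, 3.0], [1.1, 2.1, None], [5.0, 6.0, 7.0]]
filled = knn_impute(data, k=2)
```

With k=2 both remaining samples act as donors, so the missing value becomes the mean of their third-protein intensities; with k=1 only the nearest profile would contribute.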
Five well-established criteria for a comprehensive evaluation of LFQ performance are provided in ANPELA, and each criterion is assessed either quantitatively or qualitatively by various metrics. These criteria include:
Different quantification measurements, various kinds of software for pre-processing raw proteomics data, and diverse methods for data manipulation profoundly affect the precision of LFQ, which can be assessed by the coefficient of variation (CV) of reported protein intensities among replicates (Navarro P, et al. Nat Biotechnol. 34(11):1130-1136, 2016; Kuharev J, et al. Proteomics. 15(18):3140-3151, 2015). In particular, the metric CV is designed to reflect LFQ's ability to reduce variation among replicates, and therefore to enhance technical reproducibility (Chawade A, et al. J Proteome Res. 13(6):3114-3120, 2014). A lower CV (illustrated by the boxplots below) denotes more thorough removal of experimentally induced noise and indicates better LFQ precision. Moreover, the sample outputs of "Distribution of CV", which behave interactively in the same way as the real output, are provided.
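The CV computation itself is simple; a self-contained Python sketch (the protein IDs are hypothetical, and this is not ANPELA's R code):

```python
from statistics import mean, stdev

def cv_per_protein(intensity_matrix):
    """Coefficient of variation (sample stdev / mean) for each protein
    across replicates; a lower CV indicates better precision.

    intensity_matrix: dict mapping protein name -> list of intensities,
    one value per replicate.
    """
    cvs = {}
    for protein, values in intensity_matrix.items():
        m = mean(values)
        cvs[protein] = stdev(values) / m if m != 0 else float("nan")
    return cvs

# tight replicates give a small CV; noisy replicates a larger one
replicates = {
    "P12345": [100.0, 102.0, 98.0],   # reproducible quantification
    "P67890": [50.0, 150.0, 100.0],   # noisy quantification
}
cvs = cv_per_protein(replicates)
```

In a real assessment these per-protein CVs would be summarized as the boxplots described above, one box per quantification workflow.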
An appropriate LFQ is expected to retain or even enlarge the difference in proteomics data between two distinct sample groups (Griffin NM, et al. Nat Biotechnol. 28(1):83-89, 2010). A heatmap hierarchically clustering samples based on their protein intensities is therefore frequently used as an effective metric to assess LFQ's classification ability (Griffin NM, et al. Nat Biotechnol. 28(1):83-89, 2010). First, the total number of protein intensities in each sample is reduced by feature selection. Then, proteins (rows) and samples (columns) are clustered based on the similarity of their protein intensity profiles. A detailed description of how to assess LFQ's classification ability can be found in the publication by Griffin NM, et al. (Nat Biotechnol. 28(1):83-89, 2010). Moreover, the sample outputs of "Two-way clustering of differential proteins", which behave interactively in the same way as the real output, are provided.
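For intuition, the sample-side clustering can be sketched as a minimal single-linkage agglomerative procedure in Python (the heatmap itself uses full hierarchical clustering in R; the Euclidean metric and toy two-protein profiles here are assumptions for illustration):

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage_clusters(profiles, n_clusters):
    """Agglomerative (single-linkage) clustering of sample intensity profiles:
    repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclid(profiles[i], profiles[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

# two control-like samples and two case-like samples (toy 2-protein profiles)
samples = [[1.0, 1.1], [1.05, 1.0], [5.0, 5.1], [5.2, 5.0]]
groups = single_linkage_clusters(samples, 2)
```

A well-performing LFQ workflow should make the recovered clusters coincide with the known sample classes, as they do for these toy profiles.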
To avoid overfitting or confounding in LFQ, the distribution of P-values of protein intensities between distinct sample groups is examined (Risso D, et al. Nat Biotechnol. 32(9):896-902, 2014). Ideally, one expects a uniform distribution for the bulk of non-differentially expressed proteins, with a peak in the [0.00, 0.05] interval corresponding to proteins with differential intensity (Risso D, et al. Nat Biotechnol. 32(9):896-902, 2014). Moreover, the volcano plot coloring proteins with differential intensity gives an at-a-glance view of the total number of differentially expressed proteins (Välikangas T, et al. Brief Bioinform. doi:10.1093/bib/bbx054, 2017). In proteomics (and other OMICs) studies that explore the mechanisms underlying complex biological processes, a limited number of differentially expressed proteins may result in false discovery (Blaise BJ. Anal Chem. 85(19):8943-8950, 2013). Therefore, the differential significance of protein intensities between sample groups, measured by P-values, is first calculated using the reproducibility-optimized test statistic (ROTS) package in ANPELA (Pursiheimo A, et al. J Proteome Res. 14(10):4118-4126, 2015). Second, the distribution of P-values and the volcano plot are provided. A skewed distribution of P-values may indicate overfitting and/or confounding (Karpievitch YV, et al. BMC Bioinformatics. 13(S16):S5, 2012). Moreover, the sample outputs of "Distribution of P-values" and "Volcano plot of protein markers", which behave interactively in the same way as the real output, are provided.
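ANPELA derives significance from the ROTS R package; as a rough stand-in, the two quantities a volcano plot displays per protein (log2 fold change and a test statistic, here Welch's t instead of ROTS) can be sketched as:

```python
import math

def welch_t(a, b):
    """Welch's t statistic between two sample groups for one protein."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def volcano_points(groups_a, groups_b):
    """For each protein, the (log2 fold change, |t|) pair that a volcano
    plot displays on its x and y axes."""
    points = {}
    for protein in groups_a:
        a, b = groups_a[protein], groups_b[protein]
        lfc = math.log2((sum(a) / len(a)) / (sum(b) / len(b)))
        points[protein] = (lfc, abs(welch_t(a, b)))
    return points

# P1 differs strongly between groups; P2 is essentially unchanged
groups_a = {"P1": [10.0, 11.0, 10.5], "P2": [10.0, 10.5, 11.0]}
groups_b = {"P1": [40.0, 41.0, 42.0], "P2": [10.2, 10.6, 10.9]}
points = volcano_points(groups_a, groups_b)
```

Differential proteins land far from zero on both axes, so they appear in the colored upper corners of the plot, while unchanged proteins cluster near the origin.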
Consistency score is a popular criterion used to represent the robustness of protein marker identification (Li B, et al. Nucleic Acids Res. 45(W1):162-170, 2017), which is calculated to quantitatively measure the overlap of identified protein markers among different partitions of a given dataset (Wang X, et al. Mol Biosyst. 11(5):1235-1240, 2015). A higher consistency score represents more robust results in protein marker identification (Li B, et al. Nucleic Acids Res. 45(W1):162-170, 2017). Thus, random sampling is first performed within the LFQ dataset to produce multiple sub-datasets. Then, each protein is ranked according to its significance measured by q-value and absolute fold change. Third, the top-ranked proteins in each sub-dataset are selected as markers. Finally, a consistency score is calculated based on these markers using the equation (Wang X, et al. Mol Biosyst. 11(5):1235-1240, 2015) as follows:
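The equation itself does not survive in this text; reconstructed from the symbol definitions given below (so the exact weighting is an assumption to be checked against Wang X, et al.), it reads:

$$\mathrm{Consistency\ Score} = \sum_{i=2}^{C} \sum_{S \in I_i} i \cdot n_S$$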
where C is the total number of sub-datasets, Ii indicates a set of significant protein markers containing the intersections of any i sub-datasets, and nS refers to the number of markers in the intersection S. Moreover, the sample outputs of "Venn diagram illustrating marker numbers", which behave interactively in the same way as the real output, are provided.
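A short Python sketch of this weighted-overlap computation (the i·nS weighting is an assumption inferred from the description above; the published equation may weight the intersections differently):

```python
from itertools import combinations

def consistency_score(marker_sets):
    """Consistency score over the marker lists identified in C sub-datasets:
    markers shared by more sub-datasets contribute with a larger weight."""
    c = len(marker_sets)
    score = 0
    for i in range(2, c + 1):                   # intersections of any i sub-datasets
        for subset in combinations(marker_sets, i):
            shared = set.intersection(*subset)  # markers common to these i sets
            score += i * len(shared)
    return score

# hypothetical marker lists from three random sub-datasets
sets_stable = [{"P1", "P2"}, {"P1", "P2"}, {"P1", "P3"}]   # largely overlapping
sets_unstable = [{"P1", "P2"}, {"P3", "P4"}, {"P5", "P6"}]  # disjoint
```

Overlapping marker lists score well above disjoint ones, which is exactly the robustness the criterion is meant to capture.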
Additional experimental data (e.g. spiked proteins) are frequently generated and used as references to validate or adjust the performance of LFQ (Kuharev J, et al. Proteomics. 15(18):3140-3151, 2015; Navarro P, et al. Nat Biotechnol. 34(11):1130-1136, 2016), and the expected log fold changes (logFCs) are known for both the spiked and the background proteins (the expected logFC for background proteins equals zero) (Välikangas T, et al. Brief Bioinform. doi:10.1093/bib/bbx054, 2017). In ANPELA, the reproducibility-optimized test statistic (ROTS) is first applied to identify the differentially expressed proteins. Then, the true positive rate (TPR), the true negative rate (TNR) and the precision (PRE) for the successful discovery of the spiked proteins are calculated. The higher the TPR, the more accurate the LFQ. Moreover, the logFCs of protein intensities (for both spiked and background proteins) between two sample groups are calculated, and the level of correspondence between the quantified and the expected logFCs is then assessed by the mean squared error (MSE). The performance of LFQ is reflected by how well the quantified logFCs correspond to what is expected based on the references (Välikangas T, et al. Brief Bioinform. doi:10.1093/bib/bbx054, 2017). Moreover, a boxplot illustrating the deviations of both quantified and expected logFCs of the spiked proteins is provided. The preferred median in the boxplot is zero with minimized deviations. The required format of the file providing the information on the spiked proteins can be readily downloaded HERE (Right Click to Save). Users will be asked to upload this file in the “Performance Assessment” step, and multiple metrics under this criterion will be calculated for users to evaluate their selected quantification workflow.
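The TPR/TNR/precision and MSE computations reduce to simple set and error arithmetic; a Python sketch with hypothetical protein names (ANPELA performs the differential calls with ROTS in R):

```python
def spike_in_metrics(predicted_diff, spiked, background):
    """TPR, TNR and precision for recovering known spiked proteins.

    predicted_diff: set of proteins called differentially expressed
    spiked: truly spiked proteins; background: truly unchanged proteins
    """
    tp = len(predicted_diff & spiked)
    fn = len(spiked - predicted_diff)
    fp = len(predicted_diff & background)
    tn = len(background - predicted_diff)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    pre = tp / (tp + fp) if tp + fp else float("nan")
    return tpr, tnr, pre

def mse_logfc(observed_lfc, expected_lfc):
    """Mean squared error between quantified and expected log fold changes;
    the expected logFC defaults to 0 for background proteins."""
    errs = [(observed_lfc[p] - expected_lfc.get(p, 0.0)) ** 2 for p in observed_lfc]
    return sum(errs) / len(errs)

tpr, tnr, pre = spike_in_metrics({"A", "B", "X"}, {"A", "B", "C"}, {"X", "Y", "Z"})
mse = mse_logfc({"A": 1.9, "Y": 0.1}, {"A": 2.0})
```

A workflow that recovers the spiked proteins with few false calls pushes all three rates toward 1 and the MSE toward 0.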
Moreover, the sample outputs of "Deviations between the quantification and the expected LogFCs of the spiked proteins", "Deviations of both spiked and background proteins between the quantification and the expected", "Metrics measuring LFQ performance" and "ROC curve of classification accuracy", which behave interactively in the same way as the real output, are provided.
ANPELA accepts a variety of data generated by 18 kinds of popular quantification software, all of which aim at pre-processing the raw proteomics data acquired by 3 quantification measurements:
2.1 A List of Software for Pre-processing the Data Acquired Based on SWATH-MS
(software sorted alphabetically)
2.2 A List of Software for Pre-processing the Data Acquired Based on Precursor Ion Signal Intensity (Peak Intensity)
(software sorted alphabetically)
2.3 A List of Software for Pre-processing the Data Acquired Based on Spectral Counting
(software sorted alphabetically)
Users are provided with the option to conduct transformation, pretreatment and imputation on their uploaded data. In total, 3 transformation, 18 pretreatment and 7 imputation methods frequently applied to manipulate the label-free proteomics data are provided in the current version of ANPELA.
3.1 Methods for Data Transformation
(methods sorted alphabetically)
3.2 Methods for Data Pretreatment
Pretreatment Methods include 2 centering methods, 4 scaling methods and 12 normalization methods.
3.3 Methods for Missing Value Imputation
(methods sorted alphabetically)
4. Diverse MS Systems for Proteome Quantification
The popular software listed in Section 2 of this Manual quantifies raw proteomics data derived from a diverse set of MS systems, including the AB SCIEX Q-TOF systems, the Agilent Q-TOF mass spectrometer, the Bruker hybrid Q-TOF mass spectrometer and the Thermo Fisher Scientific Orbitrap.
4.1 AB SCIEX Q-TOF Systems
(4000 System, API 3200 System)
4.2 Agilent Q-TOF Mass Spectrometer
4.3 Bruker Hybrid Q-TOF Mass Spectrometer
4.4 Thermo Fisher Scientific Orbitrap
Al Shweiki MR, et al. Assessment of Label-Free Quantification in Discovery Proteomics and Impact of Technological Factors and Natural Variability of Protein Abundance. J Proteome Res. 16(4):1410-1424, 2017
Almeida AM, et al. The longissimus thoracis muscle proteome in Alentejana bulls as affected by growth path. J Proteomics. 152:206-215, 2017
Alter O, et al. Singular value decomposition for genome-wide expression data processing and modeling. PNAS. 97(18):10101-10106, 2000
Andjelkovic V, et al. Changes in gene expression in maize kernel in response to water and salt stress. Plant Cell Rep. 25(1):71-99, 2006
Anjo SI, et al. SWATH-MS as a tool for biomarker discovery: From basic research to clinical applications. Proteomics. 17(3-4), 2017
Ballman KV, et al. Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics. 20(16):2778-86, 2004
Blaise BJ. Data-driven sample size determination for metabolic phenotyping studies. Anal Chem. 85(19):8943-8950, 2013
Bolstad BM, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 19(2):185-93, 2003
Borgaonkar SP, et al. Comparison of normalization methods for the identification of biomarkers using MALDI-TOF and SELDI-TOF mass spectra. OMICS. 14(1):115-26, 2010
Bouyssié D, et al. Mascot file parsing and quantification (MFPaQ), a new software to parse, validate, and quantify proteomics data generated by ICAT and SILAC mass spectrometric analyses: application to the proteomics study of membrane proteins from primary human endothelial cells. Mol Cell Proteomics. 6(9):1621-1637, 2007
Broudy D, et al. A framework for installable external tools in Skyline. Bioinformatics. 30(17):2521-2523, 2014
Bruderer R, et al. Extending the limits of quantitative proteome profiling with data-independent acquisition and application to acetaminophen-treated three-dimensional liver microtissues. Mol Cell Proteomics. 14(5):1400-1410, 2015
Bruderer R, et al. High-precision iRT prediction in the targeted analysis of data-independent acquisition and its impact on identification and quantitation. Proteomics. 16(15-16):2246-2256, 2016
Callister SJ, et al. Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J Proteome Res. 5(2):277-86, 2006
Cao MQ, et al. Identification of salivary biomarkers in breast cancer patients with thick white or thick yellow tongue fur using isobaric tags for relative and absolute quantitative proteomics. Zhong Xi Yi Jie He Xue Bao. 9(3):275-280, 2011
Casado-Vela J, et al. iTRAQ-based quantitative analysis of protein mixtures with large fold change and dynamic range. Proteomics. 10(2):343-347, 2010
Chai LE, et al. Investigating the effects of imputation methods for modelling gene networks using a dynamic bayesian network from gene expression data. Malays J Med Sci. 21(2):20-22, 2014
Chawade A, et al. Normalyzer: a tool for rapid evaluation of normalization methods for omics data sets. J Proteome Res. 13(6):3114-3120, 2014
Cheadle C, et al. Analysis of microarray data using Z score transformation. J Mol Diagn. 5(2):73-81, 2003
Chen YY, et al. Refining comparative proteomics by spectral counting to account for shared peptides and multiple search engines. Anal Bioanal Chem. 404(4):1115-1125, 2012
Cho CK, et al. Proteomics analysis of human amniotic fluid. Mol Cell Proteomics. 6(8):1406-15, 2007
Cociorva D, et al. Validation of tandem mass spectrometry database search results using DTASelect. Curr Protoc Bioinformatics. Chapter 13:Unit 13.4, 2007
Codrea MC, et al. Platforms and Pipelines for Proteomics Data Analysis and Management. Adv Exp Med Biol. 919:203-215, 2016
Colaert N, et al. Thermo-msf-parser: an open source Java library to parse and visualize Thermo Proteome Discoverer msf files. J Proteome Res. 10(8):3840-3843, 2011
Cox J, et al. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol. 26(12):1367-1372, 2008
De Livera AM, et al. Normalizing and integrating metabolomics data. Anal Chem. 84(24):10768-10776, 2012
Table of Contents
1. The Compatibility of Browser and Operating System (OS)
2. Required Formats of the Input Files
2.1 Flow Cytometry Data for Cell Subpopulation Identification (CSI)
2.2 Mass Cytometry Data for Cell Subpopulation Identification (CSI)
2.3 Flow Cytometry Data for Pseudo-time Trajectory Inference (PTI)
2.4 Mass Cytometry Data for Pseudo-time Trajectory Inference (PTI)
3. Step-by-step Instruction on the Usage of ANPELA 2.0
3.1 Uploading Your Data or the Sample Data Provided in ANPELA
3.2 Feature Selection and Data Quantification Workflow (Compensation & Transformation & Normalization & Signal Clean)
3.3 Performance Evaluation Based on Multiple Criteria
4. A Variety of Methods for Data Quantification
4.1 Compensation Methods
4.2 Transformation Methods
4.3 Normalization Methods
4.4 Signal Clean Methods
In general, the file required at the beginning of an ANPELA 2.0 analysis should be in Flow Cytometry Standard (FCS) format. The structure of an FCS file is as follows; the parameters needed in the ANPELA workflow are shown in orange:
The data used for single-cell proteomic quantification are extracted from the "exprs" slot of the FCS file. The column names of the data matrix are generated from the "name" and "desc" fields of the FCS parameters, indicating the protein and the fluorescent antibody or non-radioactive rare-earth-metal isotope used to stain it. Each row corresponds to a single cell detected by the cytometer.
In cell subpopulation identification based on flow cytometry data, ANPELA compares the protein expression of cells under two different conditions; therefore, at least two samples per condition (four FCS files in total) are needed. A metadata csv file matching each file name to its condition is also required: the first column is the name of the FCS file (without filename extension), followed by the second column giving the condition of the sample. For all compensation methods except CytoSpill, additional single-stained control samples are needed. Sample data of this data type can be downloaded.
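For illustration, a minimal metadata csv with hypothetical file names (the names are made up here and must match your uploaded FCS files without their extension) could look like:

```
filename,condition
patient1_blood,case
patient2_blood,case
control1_blood,control
control2_blood,control
```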
In cell subpopulation identification based on mass cytometry data (MC/CyTOF), ANPELA compares the protein expression of cells under two different conditions; therefore, at least two samples per condition (four FCS files in total) are needed. A metadata file matching each file name to its condition is also required: the first column is the name of the FCS file (without filename extension), followed by the second column giving the condition of the sample. Sample data of this data type can be downloaded.
In pseudo-time trajectory inference based on flow cytometry data, ANPELA 2.0 can generate pseudo-progression trajectories from samples collected at two or more different time points, meaning that at least two FCS files (one per time point) should be uploaded by the user. A metadata csv file specifying a time point for each FCS file is also needed: the first column contains the FCS file names, followed by the second column containing the time points at which the samples were collected. For all compensation methods except CytoSpill, additional single-stained control samples are needed.
For evaluation under Criterion Cd (biological meaning), an extra csv file containing the order of proteins in prior known signal transduction cascades is needed; the format requirements for this csv are shown in the following figure. Each column represents a prior known protein activation pathway, read from the first line to the bottom in the sequence of protein activation. All proteins in the pathway csv should be included among the markers selected in step two and named identically to those markers. Sample data of this data type can be downloaded.
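A hypothetical pathway csv might look like the following (the marker names are made up for illustration; each column is one known cascade, read top to bottom in activation order, and shorter pathways simply leave trailing cells empty):

```
pathway_A,pathway_B
pSTAT3,pERK
pS6,pAKT
pCREB,
```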
In pseudo-time trajectory inference based on mass cytometry data, ANPELA 2.0 can generate pseudo-progression trajectories from samples collected at two or more different time points, meaning that at least two FCS files (one per time point) should be uploaded by the user. A metadata csv file specifying a time point for each FCS file is also needed: the first column contains the FCS file names, followed by the second column containing the time points at which the samples were collected.
For evaluation under Criterion Cd (biological meaning), an extra csv file containing the order of the respective proteins in known signal transduction cascades is needed; the format requirements for this csv are shown in the following figures. Each column represents a prior known protein activation pathway, read from the first line to the bottom in the sequence of protein activation. All proteins in the pathway csv should be included among the markers selected in step two and named identically to those markers. Sample data of this data type can be downloaded.
This website is free and open to all users with no login requirement, and can be readily accessed by all popular web browsers including Google Chrome, Mozilla Firefox, Safari, Internet Explorer 10 (or later), and so on. Quantification and comprehensive performance assessment for single-cell proteomics are started by clicking the "Single-cell Proteomics" panel on the homepage of ANPELA 2.0. The collection of web services and the whole process provided by ANPELA 2.0 can be summarized into three steps: (3.1) uploading single-cell proteomics data, (3.2) running the data quantification workflow, and (3.3) performance assessment. A report containing the evaluation results is also generated and can be downloaded in PDF, HTML or DOC format. The flowchart below summarizes the whole process in ANPELA 2.0.
There are three radio button groups and a drop-down box in STEP-1 on the left side of the analysis page. Users can choose to upload their own cytometry data or to directly load the sample data. The type of study (cell subpopulation identification/pseudo-time trajectory inference) and the measurement method (flow cytometry/mass cytometry) are selected in the remaining two radio button groups below.
Four different merge methods are available in the drop-down box: (1) Ceil: up to a fixed number (specified by fixedNum) of cells are sampled without replacement from each FCS file and combined for analysis. (2) Fixed: a fixed number (specified by fixedNum) of cells are sampled (with replacement when the total number of cells is less than fixedNum) from each FCS file and combined for analysis. (3) All: all cells from each FCS file are combined for analysis. (4) Min: the minimum number of cells among all the selected FCS files is sampled from each FCS file and combined for analysis.
The fixed number for "Ceil" or "Fixed" can be assigned in the input box below.
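As a language-agnostic illustration of the four merge strategies (a Python sketch with tiny made-up per-file cell lists; ANPELA's internal code is written in R):

```python
import random

def merge_fcs(samples, method="ceil", fixed_num=2, seed=0):
    """Combine per-file cell matrices (lists of cell rows) into one matrix.
    'samples' maps a file name to its list of cells."""
    rng = random.Random(seed)
    merged = []
    for cells in samples.values():
        if method == "ceil":      # up to fixed_num cells, without replacement
            merged += rng.sample(cells, min(fixed_num, len(cells)))
        elif method == "fixed":   # exactly fixed_num, with replacement if too few
            if len(cells) >= fixed_num:
                merged += rng.sample(cells, fixed_num)
            else:
                merged += rng.choices(cells, k=fixed_num)
        elif method == "all":     # every cell from every file
            merged += cells
        elif method == "min":     # same count (the smallest file's) from each file
            k = min(len(c) for c in samples.values())
            merged += rng.sample(cells, k)
    return merged

samples = {"s1.fcs": [[1], [2], [3]], "s2.fcs": [[4], [5]]}
print(len(merge_fcs(samples, "all")))   # 5
print(len(merge_fcs(samples, "min")))   # 4 (2 cells from each file)
```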
Four sets of sample data are also provided in this step, facilitating direct access to and evaluation of ANPELA 2.0. These sample data are all benchmark datasets collected from previous articles, including (1) the CSI-FC dataset for flow cytometry-based cell subpopulation identification, which contains blood and thymus samples from three myasthenia gravis patients and six healthy controls; (2) the CSI-MC dataset for mass cytometry-based cell subpopulation identification, which contains 6 peripheral blood mononuclear cell samples and 6 intestinal biopsy samples; (3) the PTI-FC dataset for flow cytometry-based pseudo-time trajectory inference, which contains 6 sequential time points of the human embryonic stem cell line HUES9 after the induction of hematopoietic differentiation; and (4) the PTI-MC dataset for mass cytometry-based pseudo-time trajectory inference, which contains peripheral blood mononuclear cells sampled at 7 sequential time points after activation by pVO4.
3.2 Feature Selection and Data Preprocessing (Compensation & Transformation & Normalization & Signal Clean)
Quantification of cytometry-based single-cell proteomics data requires a workflow consisting of compensation, transformation, normalization and signal clean.
A detailed explanation of each compensation, transformation, normalization and signal clean method is provided in Section 4 of this Manual. After selecting the preferred methods, please proceed by clicking the "PROCESS" button; a summary of the preprocessed data will be shown on the left. The resulting data can be downloaded by clicking the "Download" button.
After the quantification process, please select the protein markers (column names) you want for the subsequent process from the drop-down list at the bottom.
In ANPELA 2.0, both cell subpopulation identification and pseudo-time trajectory inference have four well-established criteria for comprehensive evaluation of the performance of the selected quantification workflow.
For Cell Subpopulation Identification (CSI), these criteria include:
After cell subpopulation identification, a k-nearest-neighbor (KNN) classifier is applied to assign the cells within each cluster to the two conditions. The F1 score or AUC is then used to evaluate the agreement between the classification result and the real condition labels.
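The F1 part of this criterion can be sketched as follows (plain Python with made-up labels, for illustration only; ANPELA itself runs in R):

```python
def f1_score(true_labels, pred_labels, positive):
    """Harmonic mean of precision and recall for one 'positive' condition."""
    tp = sum(t == positive and p == positive for t, p in zip(true_labels, pred_labels))
    fp = sum(t != positive and p == positive for t, p in zip(true_labels, pred_labels))
    fn = sum(t == positive and p != positive for t, p in zip(true_labels, pred_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

true_cond = ["case", "case", "control", "control"]   # real condition labels
knn_pred  = ["case", "control", "control", "control"]  # hypothetical KNN output
print(round(f1_score(true_cond, knn_pred, "case"), 2))  # 0.67
```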
- External Criterion "Precision" (Jiang H, et al. Bioinformatics. 34: 3684-3694, 2018)
In order to assess the precision of the clusters between the two conditions for each cell subpopulation, two well-established measures, purity and the Rand index, are calculated by matching the cluster structures against the a priori condition information of the data.
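Purity, one of the two measures, can be sketched as follows (illustrative Python with made-up cluster and condition labels; a common definition is the fraction of cells matching their cluster's majority condition):

```python
from collections import Counter

def purity(cluster_assignments, conditions):
    """Fraction of cells whose condition matches the majority condition
    of their cluster."""
    members = {}
    for c, cond in zip(cluster_assignments, conditions):
        members.setdefault(c, []).append(cond)
    majority = sum(Counter(group).most_common(1)[0][1] for group in members.values())
    return majority / len(conditions)

clusters   = [0, 0, 0, 1, 1, 1]
conditions = ["case", "case", "control", "control", "control", "case"]
print(round(purity(clusters, conditions), 2))  # 0.67
```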
- Internal Criterion "Coherence" (Lee HC, et al. Bioinformatics. 33: 1689-1695, 2017)
Coherence evaluation is based on the hypothesis that an ideal clustering result should show high similarity within each cluster and high heterogeneity between clusters. Therefore, the Silhouette coefficient (SC), which measures how close a datum is to its own cluster compared with the other clusters, is used to evaluate coherence. Similar measures such as the Xie-Beni index (XB), Calinski-Harabasz index (CH) and Davies-Bouldin index (DB) are also adopted in ANPELA 2.0.
After cell subpopulation identification, each cluster is randomly sampled to create three subsets, and the p-value of each protein between the two conditions is used to find biomarkers within each subset. For each cluster, the consistency score of the biomarkers found across the three subsets is calculated in order to evaluate the robustness of the quantification workflow.
A t-test is conducted to find differentially expressed proteins between the two conditions across all sampled cells. The recall of the prior known biomarkers is then calculated in order to evaluate the correspondence of the quantification workflow.
For Pseudo-time Trajectory Inference (PTI), these criteria include:
The conformance of the selected quantification workflow is assessed by comparing the inferred trajectory with the sample collection times. Specifically, the probability is calculated that the order of two cells in pseudo-time is consistent with their actual collection times.
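This concordant-pair probability can be sketched as follows (illustrative Python with made-up pseudo-time values, not ANPELA's internal code; pairs with identical collection times are skipped):

```python
from itertools import combinations

def conformance(pseudo_time, collection_times):
    """Probability that a random cell pair is ordered the same way in
    pseudo-time as by its actual collection time."""
    concordant = total = 0
    for i, j in combinations(range(len(pseudo_time)), 2):
        if collection_times[i] == collection_times[j]:
            continue  # ties in collection time carry no ordering information
        total += 1
        concordant += (pseudo_time[i] - pseudo_time[j]) * \
                      (collection_times[i] - collection_times[j]) > 0
    return concordant / total

pseudo = [0.1, 0.3, 0.2, 0.9]   # inferred pseudo-time per cell
times  = [0, 0, 1, 1]           # actual collection time points
print(conformance(pseudo, times))  # 0.75
```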
The smoothness of the inferred trajectory for each protein is scored by calculating the expression differences between consecutive cells along the pseudo-timeline. The performance of the selected quantification workflow is assessed by a p-value comparing all protein smoothness scores between the inferred order and a random order.
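The underlying idea can be sketched as follows (illustrative Python with made-up expression values; here a lower sum of consecutive jumps means a smoother trajectory):

```python
def roughness(expr_in_order):
    """Sum of absolute expression jumps between consecutive cells;
    lower values mean a smoother trajectory."""
    return sum(abs(b - a) for a, b in zip(expr_in_order, expr_in_order[1:]))

inferred  = [0.1, 0.2, 0.35, 0.5, 0.8]   # expression along the inferred order
scrambled = [0.5, 0.1, 0.8, 0.2, 0.35]   # the same values in a random order
print(roughness(inferred) < roughness(scrambled))  # True: inferred order is smoother
```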
In this criterion, four subsets are created by extracting 20% of the cells from the original data. The selected quantification workflow is applied to the subsets to generate four inferred trajectories. The Spearman or Kendall rank correlation coefficient is then calculated by comparing each subset's trajectory with the original one.
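The Spearman coefficient used here is simply the Pearson correlation of ranks; a tie-free Python sketch with made-up pseudo-time vectors:

```python
def rank(values):
    """1-based ranks of the values (no tie handling in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

full_traj   = [0.0, 0.2, 0.5, 0.7, 1.0]    # pseudo-time from the full data
subset_traj = [0.1, 0.15, 0.6, 0.65, 0.9]  # same cells re-inferred on a subset
print(spearman(full_traj, subset_traj))    # 1.0 (identical cell ordering)
```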
In this criterion, proteins are ordered according to when they reach peak expression in pseudo-time. The correspondence score of the selected quantification workflow is calculated by comparing this peak-expression order with the prior known signal transduction pathway.
- AutoSpill. AutoSpill uses single-color controls combined with automated gating to calculate the spillover matrix, relying on robust linear regression and iterative refinement to reduce error. (Roca CP, et al. Nat Commun. 12(1):2890, 2021).
- CATALYST. CATALYST is a compensation method for mass cytometry data that calculates a spillover matrix based on single-stained beads, which is then used to compensate the mass cytometry data. (Helena L, et al. Bioconductor. DOI: 10.18129/B9.bioc.CATALYST).
- CytoSpill. By achieving optimal error through finite mixture modeling and sequential quadratic programming, CytoSpill quantifies and compensates the spillover effects in mass cytometry data without requiring single-stained controls. (Miao Q, et al. Cytometry A. 99(9):899-909, 2021).
- FlowCore. The compensation methods in flowCore can provide an estimation of the spillover matrix based on single-color controls, or extract a pre-calculated spillover matrix from the original FCS file by checking valid keywords; the matrix is then used to compensate the corresponding data. (Hahne F, et al. BMC Bioinformatics. 10:106, 2009).
- MetaCyto. MetaCyto can extract the pre-calculated spillover matrix of each FCS file and use it to compensate the corresponding data. (Hu ZC, et al. Cell Rep. 24(5):1377-1388, 2018).
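However the spillover matrix is obtained, all of these methods ultimately apply it to the raw data through the same linear unmixing step, which can be sketched as follows (illustrative Python with a hypothetical two-channel spillover matrix; the real packages handle full multi-channel matrices read from controls or FCS keywords):

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix (sufficient for this two-channel sketch)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def compensate(observed, spillover):
    """Linear unmixing: true = observed @ inv(S), where S[i][j] is the
    fraction of channel i's signal spilling into channel j."""
    inv = invert_2x2(spillover)
    return [[sum(row[k] * inv[k][j] for k in range(2)) for j in range(2)]
            for row in observed]

# Hypothetical matrix: 12.5% of channel 1 spills into channel 2.
S = [[1.0, 0.125],
     [0.0, 1.0]]
obs = [[100.0, 12.5]]      # one cell: pure channel-1 signal plus its spill
print(compensate(obs, S))  # [[100.0, 0.0]] -- the spill is removed
```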
- Arcsinh Transformation . The definition of this function is currently x <- asinh(a + b*x) + c and it is used to convert a linearly valued parameter to a log-like scale. By default, a and b are both equal to 1 and c to 0. (Rybakowska P, et al. Comput Struct Biotechnol J. 18:874-886, 2020).
- Asinh with Non-negative Value . This is the method suggested by X-shift. Before the asinh transformation, a specified noise threshold (set at 1) is subtracted from every raw value, and all negative values are then set to zero. (Liu X, et al. Genome Biol. 20(1):297, 2019).
- Asinh with Randomized Negative Value . This is the method suggested by PhenoGraph. It is similar to Asinh with Non-negative Value except that negative values are randomized according to a normal distribution rather than set to zero. (Liu X, et al. Genome Biol. 20(1):297, 2019).
- Biexponential Transformation . Biexponential is an over-parameterized inverse of the hyperbolic sine and should be used with care as numerical inversion routines often have problems with the inversion process due to the large range of values that are essentially 0. (Hahne F, et al. BMC Bioinformatics. 10:106, 2009).
- Box-Cox Transformation . The Box-Cox transformation converts non-normally distributed dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if the data are not normal, applying a Box-Cox transformation enables a broader range of tests. (Finak G, et al. Bioconductor. DOI: 10.18129/B9.bioc.flowTrans).
- FlowVS Transformation . FlowVS is a variance stabilization (VS) method that removes the mean-variance correlation from cell populations identified in each fluorescence channel. It transforms each channel from all samples of a data set by the inverse hyperbolic sine (asinh) transformation. For each channel, the parameters of the transformation are optimally selected by Bartlett's likelihood-ratio test so that the populations attain homogeneous variances. The optimal parameters are then used to transform the corresponding channels in every sample. (Azad A, et al. BMC Bioinformatics. 17:291, 2016).
- Hyperlog Transformation . The HyperLog transform is a log-like transform that admits negative, zero, and positive values. The transform is a hybrid type of transform specifically designed for compensated data. One of its parameters allows it to smoothly transition from a logarithmic to linear type of transform that is ideal for compensated data. (Bagwell CB, et al. Cytometry A. 64(1):34-42, 2005).
- Linear Transformation . The definition of this function is currently x <- a*x+b and is a basic transformation commonly used in preprocessing cytometry data. (Novo D, et al. Cytometry A. 73(8):685-692, 2008).
- LnTransform . The definition of this function is currently x <- log(x)*(r/d). The transformation is normally used to convert a linearly valued parameter to the natural-logarithm scale. Typically, r and d are both equal to 1.0, and both must be positive. (Hahne F, et al. BMC Bioinformatics. 10:106, 2009).
- Log Transformation . Log is one of the most commonly used flow cytometry data transformation methods (Arcsinh for mass cytometry data). The definition of this function is currently x <- log(x, logbase)*(r/d). The transformation is normally used to convert a linearly valued parameter to a logarithmic scale; logbase = 10 corresponds to the base-10 logarithm. Typically, r and d are both equal to 1, and both must be positive. (Schoof EM, et al. Nat Commun. 12(1):3341, 2021).
- Logicle Transformation . The Logicle transformation is a particular generalization of the hyperbolic sine function (implemented as a special case of the biexponential transform) with one more adjustable parameter than linear or logarithmic functions. The Logicle display method provides a more complete, appropriate and readily interpretable representation of data that includes populations with low-to-zero means, including distributions resulting from fluorescence compensation procedures. (Diggins KE, et al. Methods. 82:55-63, 2015).
- QuadraticTransform . The definition of this function is currently x <- a*x^2 + b*x + c; it has been adopted as a transformation method within the flowCore package. (Hahne F, et al. BMC Bioinformatics. 10:106, 2009).
- ScaleTransform . The definition of this function is currently x <- (x-a)/(b-a). The transformation is normally used to convert values to a 0-1 scale; in this case, b is the maximum possible value and a the minimum possible value. (Hahne F, et al. BMC Bioinformatics. 10:106, 2009).
- TruncateTransform . In the Truncate transformation, all values less than a are replaced by a. The typical use is to replace all values less than 1 by 1, removing fluorescence values < 1. (Hahne F, et al. BMC Bioinformatics. 10:106, 2009).
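Several of the transformations above are defined by explicit formulas, so they can be condensed into a minimal Python sketch (for illustration only; ANPELA uses the R implementations cited above, and the default parameter values mirror those stated in the method descriptions):

```python
import math

def arcsinh_transform(x, a=1.0, b=1.0, c=0.0):
    """asinh(a + b*x) + c (defaults a = b = 1, c = 0)."""
    return math.asinh(a + b * x) + c

def log_transform(x, logbase=10, r=1.0, d=1.0):
    """log(x, logbase) * (r/d); x must be positive."""
    return math.log(x, logbase) * (r / d)

def scale_transform(x, a, b):
    """(x - a) / (b - a): map the range [a, b] onto [0, 1]."""
    return (x - a) / (b - a)

def truncate_transform(x, a=1.0):
    """Replace every value below a with a itself."""
    return max(x, a)

print(round(arcsinh_transform(0.0), 3))   # 0.881 = asinh(1)
print(log_transform(100))                 # 2.0
print(scale_transform(50.0, 0.0, 200.0))  # 0.25
print(truncate_transform(0.3))            # 1.0
```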
- Bead-based Normalization. This method first identifies the isotope-containing bead events and converts the raw data to local medians; the average across all files is then computed, and these global means are used to calculate slopes for each time point, by which all data acquired at the corresponding time are finally multiplied. (Chevrier S, et al. Cell Syst. DOI: 10.18129/B9.bioc.flowStats).
- GaussNorm. This method normalizes a set of flow cytometry data samples by identifying and aligning the high-density regions (landmarks or peaks) of each channel. The data of each channel are shifted in such a way that the identified high-density regions are moved to fixed locations called base landmarks. (Hahne F, et al. Bioconductor. 6(5):612-620.e5, 2018).
- WarpSet. WarpSet is a normalization method from the flowStats package that normalizes flow cytometry data based on warping functions computed on high-density-region landmarks for individual flow channels. It rests on three ideas: (1) high-density areas represent particular subtypes of cells; (2) markers are binary, i.e., cells are either positive or negative for a particular marker; (3) if the above statements hold, peaks should align. (Hahne F, et al. Bioconductor. DOI: 10.18129/B9.bioc.flowStats).
- FlowAI. FlowAI is an automatic method that checks for and removes suspected anomalies deriving from (i) abrupt changes in the flow rate, (ii) instability of signal acquisition and (iii) outliers in the lower limit and margin events in the upper limit of the dynamic range. (Monaco G, et al. Bioinformatics. 32(16):2473-80, 2016).
- FlowClean. FlowClean tracks subset frequency changes within a sample during acquisition and reports aberrant time periods as a new parameter added to the data file, allowing users to exclude those events. (Fletez-Brant K, et al. Cytometry A. 89(5):461-71, 2016).
- FlowCut. FlowCut can identify and delete regions of low density and segments that are significantly different from the rest by calculating eight measures (mean, median, 5th, 20th, 80th and 95th percentiles, second moment (variation) and third moment (skewness)) and two parameters (MaxValleyHgt and MaxPercCut). (Justin Meskas, et al. bioRxiv. 2020).