SSizer enables the online assessment and determination of the sample size required for comparative biological studies. It integrates 3 types of statistical indexes, each with a distinct underlying theory, to ensure a more comprehensive evaluation of the pilot biological data than any single type. Moreover, sample simulation based on the original pilot data is performed to expand the data size and determine the required sample size. This server is powered by R Shiny and freely accessible to users with no login requirement. It can be accessed by popular web browsers such as Google Chrome, Mozilla Firefox, Safari and Internet Explorer 10 (or later). Bug reports and (new) feature requests are warmly welcomed; please feel free to inform Dr. LI and Dr. ZHOU. We would be happy to update our service according to your valuable comments.
Thanks for using and improving SSizer; you are welcome to visit our lab at https://idrblab.org/
SSizer is Unique in Facilitating Comparative Biological Studies by:
Comprehensively Assessing the Sufficiency of Sample Size from Multiple Perspectives
3 types of statistical indexes are applied to comprehensively assess the sufficiency of sample size for a specific comparative biological study. Type I: statistical power analyzing the level of difference between comparative groups (Eng J. Radiology. 227(2): 309-313, 2003); Type II: overall diagnostic accuracy and classification accuracy on independent data (Xia J, et al. Metabolomics. 9(2): 280-299, 2013); Type III: robustness among lists of markers identified from multiple datasets (Domany E, et al. Cancer Res. 74(17): 4612-4621, 2014). Each type assesses the sufficiency of sample size based on a distinct underlying theory, and the combination of multiple types can thus provide a more comprehensive evaluation of the pilot data than any single one. Assessment results are represented by a colored bar as shown in the figure below (not enough region in RED: no type of index satisfied; passable region in ORANGE: only one type satisfied; good region in BLUE: two types satisfied; very good region in GREEN: all types satisfied). The results of all indexes are displayed on the web page, and all figures and tables can also be downloaded from the website.
Determining the Sample Size Required for a Specific Biological Study by Sample Simulation
Sample simulation based on the original pilot data is performed to expand the data size and determine the required sample size (Blaise BJ, et al. Anal Chem. 88(10): 5179-5188, 2016). The agreement of the assessment results between the simulated and pilot data is evaluated to prove the simulation accuracy (Ein-Dor L, et al. PNAS. 103(15): 5923-5928, 2006). In SSizer, the benchmark dataset (MTBLS28) containing samples of 469 lung cancer patients and 536 healthy individuals (Mathé EA, et al. Cancer Res. 74(12): 3259-3270, 2014) is repeatedly sampled to obtain a series of sub-datasets with various sample sizes (50:50, 100:100, 150:150, 200:200 and 250:250). Based on these sub-datasets, sample simulation expands the data size to 400:400. As shown in the figure below, the data distributions and assessment results of the simulated and original pilot data agree well with each other (Blaise BJ, et al. Anal Chem. 88(10): 5179-5188, 2016; Ein-Dor L, et al. PNAS. 103(15): 5923-5928, 2006), and the assessment results for the simulated data by all indexes are also represented by a colored bar displayed on the web page. All corresponding figures and tables can also be downloaded from the website.
The required sample size determined for user-input data based on sample simulation is validated using the benchmark dataset (MTBLS28). The predictive performance is assessed by the same measure (the agreement of index means and standard deviations between the simulated and user-input data) as that used in Ein-Dor L, et al. PNAS. 103(15): 5923-5928, 2006. The assessments have revealed good agreements, which are downloadable HERE (Right Click to Save).
Instruction to the Users
Sample Data Files for Download
Sample metabolomics data (MTBLS354) can be downloaded HERE (Right Click to Save). Sample proteomics data (PXD005144) can be downloaded HERE (Right Click to Save). Sample transcriptomics data (GSE28702) can be downloaded HERE (Right Click to Save).
Select the Format of the Uploaded File
SSizer accepts datasets in various formats including csv, tab delimited, xls, xlsx and txt. The rows and columns of the input file should correspond to samples and features, respectively. In particular, the first row should give the feature names, the first column must give the sample names, and the second column indicates the class label (case or control) of each sample. At least 3 samples are needed in each class.
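For illustration, a minimal input file in csv format (with hypothetical sample names, labels and feature values) would look like:

```
Sample,Label,Feature_1,Feature_2,Feature_3
S01,case,5.21,0.83,112.4
S02,case,4.97,0.91,108.9
S03,case,5.35,0.77,119.2
S04,control,3.12,1.42,95.1
S05,control,3.40,1.37,91.8
S06,control,2.98,1.50,97.6
```

Here the first row gives the feature names, the first column the sample names, and the second column the class label, with 3 samples per class (the stated minimum).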
Summary and Visualization of Raw Data
- Summary of the Raw Data
- Visualization of Data Distribution
Overview of Data Pre-processing
Normalization
Please select the normalization method on the left side panel; the "NONE" option can be selected to skip normalization.
Transformation
Please select the transformation method on the left side panel; the "NONE" option can be selected to skip transformation.
Missing Value Imputation
Please select or provide your preferred number of neighbors, the maximum percentage of missing values allowed in a row or a column, and the largest block of features to be imputed.
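These parameters match those of a standard k-nearest-neighbor (KNN) imputation. A minimal sketch using the Bioconductor impute package (illustrative parameter values; SSizer's internal implementation may differ):

```r
# KNN imputation sketch; assumes 'mat' is a features-by-samples numeric matrix
library(impute)  # Bioconductor package providing impute.knn()

imputed <- impute.knn(mat,
                      k      = 10,    # No. of neighbors
                      rowmax = 0.5,   # max fraction of missing values per row
                      colmax = 0.8,   # max fraction of missing values per column
                      maxp   = 1500)  # largest block of features imputed at once
mat_complete <- imputed$data          # matrix with missing values filled in
```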
Data Filtering
Please select the data filtering method on the left side panel; the "NONE" option can be selected to skip filtering.
Please Process the Data by Clicking the “PROCESS” Button
WARNING: The method you have chosen may not be suitable for your data; please refresh the page and try again.
Results after Data Preprocessing
- Normalization and Transformation
- Missing Value Imputation
- Data Filtering
- Test Statistics Distribution Overview (for data after preprocessing)
Overview of Sample Size Assessment
Sample size assessment is achieved by 3 types of statistical indexes.
Type I. Statistical Power Analyzing the Level of Difference between Comparative Groups
Please provide your preferred false discovery rate, cutoff of power value, minimum sample size per group, number of increments, and number of sampling repeats. The default cutoff of the power value is set to 0.8.
Type II. Classification Accuracy Based on the Identified Markers
Please select your preferred statistical index(es) for assessment (AUC and ACC), and set the cutoff of these indexes. The default cutoffs of AUC and ACC are 0.9 and 0.7, respectively.
Type III. Robustness of the Identified Markers
Please select your preferred statistical index(es) for assessment (overlap, concordance and CW), and set the cutoff for each index. The default cutoffs are set to 0.5, 0.36 and 0.5 for overlap, concordance and CW, respectively. SSizer also needs you to choose the preferred feature selection method, the number of top-ranked features, and the classification algorithm.
Please Assess the Sample Size of the Pilot Data by Clicking the “ASSESS” Button
Assessing the Sample Size of Pilot Data
Assessment Results for the Pilot Data
The colored bar shown above illustrates the assessment results of sample size by multiple types of statistical indexes. A sample number within the RED (not enough) region indicates that no type of index is satisfied. A sample number within the ORANGE (passable) region indicates that only one type of index is satisfied, and the exact type number is also provided under this region. A sample number within the BLUE (good) region indicates that two types of indexes are satisfied, and the additional type number is provided under this region. A sample number within the GREEN (very good) region denotes that all three types of indexes are satisfied.
Type I. Statistical 'Power' Analyzing the Level of Difference between Comparative Groups
Type II. Classification Accuracy Based on the Identified Markers
Type III. Robustness of the Identified Markers
Determination of the Adequate Size by Sample Simulation based on the Pilot Data
Sample Simulation based on the Pilot Data
Please provide the intended sample size to be simulated based on the pilot data.
Determining the Adequate Size by Simulated Data
Please choose your preferred type(s) of indexes for determining the adequate size of the biological study. The default cutoffs of power value, AUC and overlap are set to 0.9, 0.7 and 0.5, respectively.
Please Determine the Adequate Sample Size by Clicking the “DETERMINE” Button
Determination of the Adequate Size by Sample Simulation based on the Pilot Data
WARNING: The sample size of the pilot data exceeds the number you have chosen for sample simulation; please reset this parameter, which must be larger than the sample size of the pilot data!
Simulated Data (red dots) and the Original Pilot Data (green dots)
Assessment Results for the Simulated Data
The colored bar shown above illustrates the assessment results of sample size by multiple types of statistical indexes. A sample number within the RED (not enough) region indicates that no type of index is satisfied. A sample number within the ORANGE (passable) region indicates that only one type of index is satisfied, and the exact type number is also provided under this region. A sample number within the BLUE (good) region indicates that two types of indexes are satisfied, and the additional type number is provided under this region. A sample number within the GREEN (very good) region denotes that all three types of indexes are satisfied.
Type I. Statistical 'Power' Analyzing the Level of Difference between Comparative Groups
Type III. Robustness of the Identified Markers
Table of Contents
1. Instruction on the Usage of SSizer
1.1 Data Uploading and Pre-processing
1.2 Assessing the Sample Size of Pilot Data
1.3 Determining the Adequate Size by Sample Simulation
2. Statistical Indexes for the Assessment and Determination of Sample Size
2.1 Statistical 'Power' Analyzing the Level of Difference between Comparative Groups
2.2 Classification Accuracy Based on the Identified Markers
2.2.1 'AUC' Denoting the Overall Diagnostic Accuracy
2.2.2 'ACC' Indicating the Classification Accuracy on Independent Data
2.3 Robustness of the Identified Markers
2.3.1 'Overlap' between Lists of Markers Identified from Two Sub-datasets
2.3.2 'Concordance' between Lists of Markers Identified from Two Sub-datasets
2.3.3 'Weighted Consistency' among Lists of Markers Identified from Multiple Sub-datasets
Assessment and determination of the sample size for a specific biological study start by clicking the 'Analysis' panel on the homepage of SSizer. The whole analysis includes: uploading & pre-processing of the biological dataset, assessing the sample size of the pilot data, and determining the adequate size by sample simulation.
By checking the "Upload Biological Data" radio button, users can upload their data matrix to SSizer in various formats (csv, tab delimited, xls and xlsx). In particular, the first row of the data matrix should give the feature names, and the first column provides the sample names. The second column indicates the class label (case or control) of each sample. The standard format accepted by SSizer can be downloaded HERE (Right Click to Save). By clicking the "Upload Data" button, the biological dataset provided by the users can be uploaded for further assessment.
Three sets of sample data are also provided in this step, facilitating a direct access and evaluation of SSizer. These sample data are all benchmark datasets collected from a variety of public databases. The first set of sample data, MTBLS354, is a metabolomics benchmark dataset collected from the MetaboLights database. This dataset contains 142 samples of community-acquired pneumonia (CAP) patients and 97 samples of people without CAP (non-CAP controls) (To KKW, et al. Diagn Microbiol Infect Dis. 85(2): 249-254, 2016). The second set of sample data, PXD005144, is a proteomics benchmark dataset from the PRoteomics IDEntifications (PRIDE) database constructed by the European Bioinformatics Institute. This dataset contains 66 samples of patients with pancreatic cancer and 36 samples of people with chronic pancreatitis (Saraswat M, et al. Cancer Med. 6(7): 1738-1751, 2017). The last set of sample data, GSE28702, is a transcriptomics benchmark dataset from the Gene Expression Omnibus database maintained by the National Center for Biotechnology Information. This dataset contains 42 samples of responders to FOLFOX therapy and 41 samples of non-responders (Tsuji S, et al. Br J Cancer. 106(1): 126-132, 2012). By clicking the "Load Data" button, the sample dataset selected by the users can be uploaded for further assessment.
Several factors, such as unwanted experimental & biological variations and technical errors, can hamper the analysis of OMICs and other biological data, which therefore require normalization before further study (Li B, et al. Nucleic Acids Res. 45(W1): 162-170, 2017). Because of the huge number of hypothesis tests during the analysis, it is also necessary to conduct a multiple testing adjustment. However, in the case of a large number of tests and few differential features, this adjustment will lead to a substantially lower power for identifying truly differential features. Data filtering is thus introduced to reduce the number of tests and increase the power (Hackstadt AJ, et al. BMC Bioinformatics. 10: 11, 2009). In the current version of SSizer, the option to conduct data normalization and filtering before the assessment and determination of sample size is provided to the users. In particular, 17 normalization methods frequently used to pre-process OMICs and other biological data (Auto Scaling, Contrast, Cubic Splines, Cyclic Loess, EigenMS, Level Scaling, Linear Baseline, Log-transform, Mean Normalization, Median Normalization, MSTUS, Pareto Scaling, Power Scaling, PQN, Range Scaling, Vast Scaling and VSN) are provided, and 2 popular filtering methods (Standard Deviation and Mean) in current biological research are further included. Users can select a specific normalization and filtering method by checking the corresponding radio button in the left side panel, and can also check the "NONE" option to skip one or both of these pre-processing steps. Moreover, the resulting data can be previewed and are fully downloadable from the website. The distributions of feature intensity before and after normalization are visualized by boxplots, and the distributions of feature intensity means before and after data filtering are also viewable as histograms.
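As a purely illustrative sketch (not SSizer's own code), two of the listed normalization methods and the Mean filtering method could be written as:

```r
# Illustrative implementations of selected pre-processing methods;
# 'mat' is assumed to be a samples-by-features numeric matrix
log_transform <- function(mat, base = 2) log(mat + 1, base = base)  # +1 guards zeros

pareto_scale <- function(mat) {
  # center each feature and divide by the square root of its standard deviation
  apply(mat, 2, function(v) (v - mean(v)) / sqrt(sd(v)))
}

mean_filter <- function(mat, cutoff) {
  # Mean filtering: keep features whose mean intensity exceeds the cutoff
  mat[, colMeans(mat) > cutoff, drop = FALSE]
}
```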
Three types of well-established statistical indexes are used for the comprehensive assessment and determination of sample size. The colored bar shown below illustrates the assessment results of sample size (SS) by multiple types of statistical indexes. An SS within the RED (not enough) region indicates that no type of index is satisfied. An SS within the ORANGE (passable) region indicates that only one type of index is satisfied, and the specific type ID is also provided under this colored region. An SS within the BLUE (good) region indicates that two types of indexes are satisfied, and the additional type ID is provided under the BLUE region. An SS within the GREEN (very good) region denotes that all three types of indexes are satisfied.
To assess the level of difference between comparative groups, statistical power analysis is performed to assess whether the "statistical power" is at a desired level (≥ 0.8), which in turn helps to estimate the required sample size (Blaise BJ, et al. Anal Chem. 88(10): 5179-5188, 2016). Under this index, users need not only to specify a false discovery rate and an expected power value, but also to select the minimum sample size per group and the numbers of increments and repeats. The larger the power value of a biological dataset, the higher the probability that an observed effect can pass the required threshold for claiming its discovery (Button KS, et al. Nat Rev Neurosci. 14(5): 365-376, 2013). Moreover, sample outputs of "Line-plots of POWER Changing along with the Sample Size", which perform interactively in the same way as the real output, are provided.
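A minimal sketch of how such a power curve can be computed with base R's power.t.test, assuming feature-wise effect sizes (Cohen's d) estimated from the pilot data; the function name, grid and defaults below are illustrative, not SSizer's internals:

```r
# For each candidate group size n, compute the fraction of features whose
# two-sample t-test power reaches the desired level at significance 'alpha'
# ('alpha' stands for the FDR-adjusted per-test significance level)
power_curve <- function(effect_sizes, n_grid = seq(10, 100, by = 10),
                        alpha = 0.05, target = 0.8) {
  sapply(n_grid, function(n) {
    pw <- sapply(effect_sizes, function(d)
      power.t.test(n = n, delta = d, sd = 1, sig.level = alpha)$power)
    mean(pw >= target)  # fraction of adequately powered features at size n
  })
}

set.seed(1)
d_hat <- abs(rnorm(200, mean = 0.6, sd = 0.3))  # hypothetical pilot effect sizes
frac_powered <- power_curve(d_hat)              # one value per candidate size
```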
Type II: Classification Accuracy Based on the Identified Markers
"AUC" Denoting the Overall Diagnostic Accuracy
The overall diagnostic accuracy is evaluated by the receiver operating characteristic (ROC) curve together with the area under that curve (AUC), based on 3 popular machine learning algorithms. An adequate sample size is reported to be reflected by a desired AUC value (≥ 0.9), and the users can also define their preferred cutoff of the desired AUC value (Xia J, et al. Nucleic Acids Res. 43(W1): 251-257, 2015). Firstly, markers are identified by choosing a feature selection method from 3 popular ones (Student's t-test, PLS-DA and OPLS-DA) based on the users' preference. Secondly, users are asked to select their preferred machine learning algorithm (support vector machine, random forest or diagonal linear discriminant analysis) for constructing the classification models based on those identified markers. After k-fold cross-validation on these models, results with a higher AUC value are recognized as well-performing (Sing T, et al. Bioinformatics. 21(20): 3940-3941, 2005). Sample outputs of "Boxplots and Median Values of AUC Changing along with Sample Size", which perform interactively in the same way as the real output, are also provided.
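A hedged sketch of this procedure for one of the three algorithms (SVM), computing the cross-validated AUC with the e1071 and pROC packages; it assumes a numeric marker matrix x and a label vector y with levels "case"/"control", and is not SSizer's exact implementation:

```r
library(e1071)  # svm()
library(pROC)   # roc(), auc()

cv_auc <- function(x, y, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(x)))  # random fold assignment
  aucs <- sapply(1:k, function(i) {
    fit  <- svm(x[folds != i, ], factor(y[folds != i]), probability = TRUE)
    pred <- predict(fit, x[folds == i, ], probability = TRUE)
    prob_case <- attr(pred, "probabilities")[, "case"]  # assumes a "case" level
    as.numeric(auc(roc(y[folds == i], prob_case, quiet = TRUE)))
  })
  median(aucs)  # the median over folds is what the boxplots summarize
}
```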
"ACC" Indicating the Classification Accuracy on Independent Data
The classification accuracy on independent test data is assessed by the accuracies (ACCs) of the classification models constructed with 3 popular machine learning algorithms (support vector machine, random forest and diagonal linear discriminant analysis). According to previous studies, an adequate sample size can be reflected by a desired ACC value (≥ 0.7), and this desired value can also be defined by users in SSizer (Billoir E, et al. Brief Bioinform. 16(5): 813-819, 2015). In particular, 3 popular feature selection methods (Student's t-test, PLS-DA and OPLS-DA) are applied in the first place to identify the markers based on feature intensities. Secondly, users are asked to select their preferred machine learning algorithm for sample classification. A higher value of ACC denotes better classification accuracy and indicates better prediction capacity (Nyamundanda G, et al. BMC Bioinformatics. 14(1): 338, 2013). Sample outputs of "Boxplots and Median Values of ACC Changing along with Sample Size", which perform interactively in the same way as the real output, are also provided.
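Given the predicted labels on the held-out samples, ACC reduces to the fraction of correct predictions; a one-line sketch (with hypothetical vector names):

```r
# ACC on an independent test set: fraction of correctly classified samples
acc <- mean(predicted_labels == true_labels)
```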
Type III: Robustness of the Identified Markers
"Overlap" between Lists of Markers Identified from Two Sub-datasets
In simple terms, overlap is the fraction of shared features that appear in both of two lists of markers; it quantifies the robustness of the identified markers by measuring the similarity of the two lists. According to previous studies, an adequate sample size can be reflected by a desired overlap value (≥ 0.5), and this desired value can also be defined by users in SSizer (Wang C, et al. Nat Biotechnol. 32(9): 926-932, 2014). A higher value of overlap indicates better robustness of the identified markers (Ein-Dor L, et al. PNAS. 103(15): 5923-5928, 2006). A sample output of "Boxplots and Medians of OVERLAP Changing along with Sample Size", which performs interactively in the same way as the real output, is also provided.
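In code, the overlap between two marker lists reduces to a set intersection; a minimal sketch, assuming two character vectors of feature names:

```r
# Overlap between two marker lists: shared features / shorter list length
overlap <- function(list_a, list_b) {
  length(intersect(list_a, list_b)) / min(length(list_a), length(list_b))
}

overlap(c("F1", "F2", "F3", "F4"), c("F2", "F3", "F5", "F6"))  # returns 0.5
```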
"Concordance" between Lists of Markers Identified from Two Sub-datasets
Concordance is reported to be more relevant than the mere overlap between markers in evaluating the similarity between different signatures (Fan C, et al. N Engl J Med. 355(6): 560-569, 2006). The concordance in the classification of individual samples is the relevant measure here, and it is assessed by Cramér's V statistic: different lists of markers could track a common set of biological characteristics and result in similar predictions of outcome. According to previous studies, an adequate sample size and robust identified markers can be reflected by a desired concordance value (≥ 0.36), and this desired value can also be defined by users in SSizer (Fan C, et al. N Engl J Med. 355(6): 560-569, 2006). Similar to overlap, a higher value of concordance denotes better robustness of the identified markers. A sample output of "Boxplots and Medians of CONCORDANCE Changing along with Sample Size", which performs interactively in the same way as the real output, is also provided.
"Weighted Consistency" among Lists of Markers Identified from Multiple Sub-datasets
Weighted Consistency (CW) is a different kind of index, distinguished from overlap and concordance by its calculation principle based on the statistics of all markers. It counts the number of times every single feature appears across all lists of markers, and thereby represents the robustness of the identified markers from an overall perspective. According to previous studies, a higher weighted consistency indicates better robustness of the identified markers (Somol P, et al. IEEE Trans Pattern Anal Mach Intell. 32(11): 1921-1939, 2010). Since there is no well-established desired value of weighted consistency, a recommended desired value (≥ 0.5) is provided, and this desired value can also be defined by users in SSizer. Sample outputs of "Linear Graph of CW Changing along with the Sample Size", which perform interactively in the same way as the real output, are also provided.
Hypothetical data are simulated in SSizer to enlarge the data size. The newly generated large dataset is then used to assess and further determine the adequate sample size for a specific biological problem. Data simulation starts from a relatively small cohort, in which variables are identified by the statistical recoupling of variables (SRV) procedure (Blaise BJ, et al. Anal Chem. 88(10): 5179-5188, 2016; Navratil V, et al. Bioinformatics. 29(10): 1348-1349, 2013). A larger dataset is then generated based on the kernel density estimation of the SRV variables (Rosenblatt M. Annals of Mathematical Statistics. 27(3): 832-837, 1956; Parzen E. Annals of Mathematical Statistics. 33(3): 1065-1076, 1962). The statistical significance of the variables identified by SRV is assessed by the Benjamini-Yekutieli correction for simulated datasets of various sizes (Benjamini Y, et al. Annals of Statistics. 29(4): 1165-1188, 2001). The robustness of the simulated model is evaluated by receiver operating characteristic analysis on an independent cohort and by cross-validation.
According to the preliminary case study of SSizer, overlap has proved to be a stable and effective index for sample size evaluation. The sample size evaluation in SSizer is therefore achieved by data simulation and subsequent overlap analysis. The data are simulated using a multivariate log-normal distribution fit to the pilot data, which allows SSizer to maintain the long tails and strong correlations that are typically seen. This data-driven simulation approach is based on the assumption that a small random cohort is a good estimator of the general population. Users are required to input the size of the simulated data and to choose the expected overlap, the minimum sample size per group, the number of increments and the number of repeats for the overlap calculation.
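A hedged sketch of such a simulation, fitting a multivariate log-normal distribution to the pilot matrix on the log scale and sampling a larger cohort with MASS::mvrnorm (the exact fitting procedure inside SSizer may differ):

```r
library(MASS)  # mvrnorm()

# 'pilot' is assumed to be a strictly positive samples-by-features matrix
simulate_cohort <- function(pilot, n_new) {
  lp    <- log(pilot)   # log scale: multivariate normal assumption
  mu    <- colMeans(lp)
  sigma <- cov(lp)      # preserves the strong feature correlations
  exp(mvrnorm(n = n_new, mu = mu, Sigma = sigma))  # back-transform to original scale
}

big_cohort <- simulate_cohort(pilot, n_new = 400)  # e.g. expand to 400 samples
```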
The colored bar shown above illustrates the assessment results of sample size (SS) by multiple types of statistical indexes. An SS within the RED (not enough) region indicates that no type of index is satisfied. An SS within the ORANGE (passable) region indicates that only one type of index is satisfied, and the specific type ID is also provided under this colored region. An SS within the BLUE (good) region indicates that two types of indexes are satisfied, and the additional type ID is provided under the BLUE region. An SS within the GREEN (very good) region denotes that all three types of indexes are satisfied.
Three well-established statistical indexes for a comprehensive evaluation on the adequacy of sample size are provided in SSizer, which include:
2.1 Statistical Power Analyzing the Level of Difference between Comparative Groups
Statistical power analysis relates the sample size, effect size and significance level to the chance of detecting an effect in a dataset. If β represents the risk of falsely rejecting truly positive results as nonsignificant, the power equals the probability 1-β of flagging a true effect as statistically significant (Blaise BJ, et al. Anal Chem. 88(10): 5179-5188, 2016). The lower the power of a study, the lower the probability that a discovered effect is genuinely true. In other words, even when an underpowered study discovers a true effect, it is likely to exaggerate the magnitude of that effect (Button KS, et al. Nat Rev Neurosci. 14(5): 365-376, 2013).
To evaluate the sensitivity of the studied dataset, statistical power analysis is usually performed by fixing the power value at a desired level (usually ≥ 0.9, leading to false rejection of true effects in 10% of the cases) and estimating the required sample size (Blaise BJ, et al. Anal Chem. 88(10): 5179-5188, 2016). Since for most studies there is no preconception about which variables will be affected, it is preferable to set the sample size to a number at which the majority of variables reach the minimum power value. The pilot study is the primary source of data for this calculation, and a pilot of 20 samples is suggested to be sufficient for a robust power analysis. In SSizer, statistical power analysis is used as a safeguard for estimating the probability of obtaining meaningful results (Button KS, et al. Nat Rev Neurosci. 14(5): 365-376, 2013).
2.2 Classification Accuracy Based on the Identified Markers
The robustness and predictive capacity of each simulated model are evaluated by receiver operating characteristic (ROC) analysis and the area under the curve (AUC) value obtained from k-fold cross-validation (Blaise BJ, et al. Anal Chem. 88(10): 5179-5188, 2016). There is usually a trade-off between the robustness and the predictive capacity, which means a different threshold may lead to higher robustness at the expense of lower predictive capacity, or vice versa. One of the best ways to observe how a decision threshold affects these 2 measures is through the ROC curve (Blaise BJ, et al. Anal Chem. 88(10): 5179-5188, 2016). This curve is plotted using "robustness" against "1-predictive capacity" at various thresholds and can be quantitatively represented by the AUC value (Xia J, et al. Nucleic Acids Res. 43(W1): 251-257, 2015).
AUC values are widely considered to be among the most objective and valid metrics for the performance evaluation of biomarker discovery (Blaise BJ, et al. Anal Chem. 88(10): 5179-5188, 2016). ROC curves and AUC values are provided by SSizer via the following steps. First, the differential features are identified by partial least squares discriminant analysis (PLS-DA). Then, machine learning classifiers are constructed using these identified features. Based on k-fold cross-validation, the method with a larger area under the ROC curve (a higher AUC value) is recognized as having better performance.
The performance of the constructed statistical model can be assessed by its classification accuracy (ACC) on an independent test dataset. In SSizer, the independent test dataset is constructed by randomly selecting data from the studied dataset (Billoir E, et al. Brief Bioinform. 16(5): 813-819, 2015). Based on the prediction results of the constructed statistical model, 4 metrics are provided, which include the numbers of all positive (AP) & negative (AN) samples and of successfully predicted true positive (TP) & negative (TN) samples (Nyamundanda G, et al. BMC Bioinformatics. 14(1): 338, 2013). Finally, ACC is defined by the following equation (Sing T, et al. Bioinformatics. 21(20): 3940-3941, 2005):
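With AP, AN, TP and TN as defined above, the referenced equation is:

$$\mathrm{ACC} = \frac{TP + TN}{AP + AN}$$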
2.3 Robustness of the Identified Markers
In biological studies, biomarkers are usually identified by the analysis of differentially expressed features. For biomarker discovery from a number of datasets, the N_i top-ranked features are discovered from each of the i datasets (i = 1,..., a,..., b,..., n; n ≥ 4). The value of overlap is then calculated as the fraction of shared features that appear in both of two lists (a and b) of markers. The closer the overlap value is to 1, the more robust are the markers discovered in that study (Wang C, et al. Nat Biotechnol. 32(9): 926-932, 2014). The metric overlap can be calculated using the following equation (Ein-Dor L, et al. PNAS. 103(15): 5923-5928, 2006):
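A plausible reconstruction of the referenced equation, with $L_a$ and $L_b$ denoting the two lists of top-ranked features of sizes $N_a$ and $N_b$ (the normalization by the shorter list is an assumption for unequal list sizes):

$$\mathrm{overlap}(a, b) = \frac{\left| L_a \cap L_b \right|}{\min(N_a, N_b)}$$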
In some biological problems, the overlap is not the only determinant of the robustness of the identified markers; in this circumstance the metric concordance is reported to be more relevant than the mere overlap between markers in evaluating the similarity between different signatures (Fan C, et al. N Engl J Med. 355(6): 560-569, 2006). In SSizer, the strength of concordance between markers is assessed using Cramér's V statistic (Fan C, et al. N Engl J Med. 355(6): 560-569, 2006). Let a sample of size n of the jointly distributed variables a_i and b_j (i = 1,..., r and j = 1,..., k) be given by the following frequencies: n_ij = the number of times the values (a_i, b_j) are observed. The chi-squared statistic is:
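In standard notation, with $n_{i\cdot}$ and $n_{\cdot j}$ the row and column marginal totals:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{\left( n_{ij} - \dfrac{n_{i\cdot}\, n_{\cdot j}}{n} \right)^2}{\dfrac{n_{i\cdot}\, n_{\cdot j}}{n}}$$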
The concordance represented by Cramér's V is computed by taking the square root of the chi-squared statistic divided by the sample size and by the minimum dimension minus 1:
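Written out, this gives the standard form of Cramér's V:

$$V = \sqrt{\frac{\chi^2 / n}{\min(r, k) - 1}}$$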
As demonstrated above, 2 metrics (overlap and concordance) are calculated based on the lists of markers identified from different sub-datasets. Another well-established measure (weighted consistency, CW in short) is provided to assess the robustness of markers from a very different perspective. In particular, the CW makes the assessment of the robustness of the identified markers more multifaceted and credible (Somol P, et al. IEEE Trans Pattern Anal Mach Intell. 32(11): 1921-1939, 2010). The procedure used for calculating CW can be represented by the following steps: the first step is to define a measure of the occurrence stability of each feature, and the second step is to extend this definition of consistency to evaluate the whole system. Finally, the CW of the system S can be defined by the following equation:
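A reconstruction based on Somol et al. (2010), where $Y$ is the set of all features appearing in at least one marker list, $F_f$ is the number of lists in the system $S$ that contain feature $f$, $n$ is the number of lists, and $N = \sum_{f \in Y} F_f$ (the notation here may differ slightly from SSizer's source):

$$CW(S) = \sum_{f \in Y} \frac{F_f}{N} \cdot \frac{F_f - 1}{n - 1}$$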
Detailed information on the CW calculation can be found in the original publications (Somol P, et al. Structural, Syntactic, and Statistical Pattern Recognition. 956-966, 2008; Somol P, et al. IEEE Trans Pattern Anal Mach Intell. 32(11): 1921-1939, 2010).
Dr. Fengcheng Li lifengcheng@zju.edu.cn
Dr. Ying Zhou 11918212@zju.edu.cn
Prof. Feng Zhu* zhufeng@zju.edu.cn
Address
College of Pharmaceutical Sciences,
Zhejiang University,
Hangzhou, China
Postal Code: 310058
Phone/Fax