SISPRO enables signature identification and biological interpretation for spatial proteomics. It is unique in (a) identifying proteomic signatures with good robustness and accuracy and (b) interpreting the identified signatures based on comprehensive sets of subcellular information. Given the increasing concerns about the neglect of robustness by standard statistical frameworks (Trends Biotechnol. 36: 488-98, 2018) and the lack of automated subcellular interpretation (Nucleic Acids Res. 45: W6-11, 2017), SISPRO is expected to be essential in current proteomics.
Thanks a million for using and improving SISPRO, and please feel free to report any errors to Dr. ZHOU at zhou_ying@zju.edu.cn.
Browser and Operating System (OS) Tested for Smoothly Running SISPRO:
SISPRO is powered by R Shiny. It is free and open to all users without a login requirement, and can be readily accessed from a variety of popular web browsers and operating systems as shown below.

Application Methodology and Data Statistics Incorporated in SISPRO:
In SISPRO, proteomic signature is first identified by collectively considering robustness and accuracy through calculating Relative Weighted Consistency (CWrel, Inf Fusion. 35: 132-47, 2017) and Area Under the Curve (AUC, J Extracell Vesicles. 9: 1750202, 2020); and the biological interpretation of the identified signature is then realized by integrating a comprehensive set of spatial information (9 organelles and 22 subcellular structures).
Welcome to Download the Sample Data for Testing and for File Format Correcting
- Spatial Proteomics
Spatial proteomics is the systematic and high-throughput study of proteins' localizations and their dynamics at the subcellular level (Lundberg E et al. Nat Rev Mol Cell Biol 20:285-302, 2019; Gatto L et al. Curr Opin Chem Biol 48:123-49, 2019). The spatial distribution and dynamic changes of proteins at the subcellular level are essential for a complete understanding of cell biology (Lundberg E et al. Nat Rev Mol Cell Biol 20:285-302, 2019). The power of comparative spatial proteomics as a discovery tool to unravel disease mechanisms has been successfully harnessed by several studies (Krahmer N et al. Dev Cell 47:205-21, 2018; Zilocchi M et al. Front Cell Dev Biol 48:123-49, 2019).
Recent substantial advances in high-throughput microscopy, quantitative mass spectrometry (MS) and machine learning applications for data analysis have enabled proteome-wide investigations of spatial cellular regulation (Aebersold R et al. Nature 537:347-55, 2016; Borner GHH et al. Mol Cell Proteomics 19:1076-1087, 2020). The basic strategy for spatial proteomics is to carry out tailored biochemical fractionation to enrich for a specific organelle and then to quantify proteins across the different steps of the enrichment protocol using MS. An abundance distribution profile is obtained for each protein. Proteins associated with the target organelle have similar profiles and thus can be distinguished from contaminants, which have different profiles. Owing to its focus on a specific organelle, the approach is generally best suited to address targeted research questions (Lundberg E et al. Nat Rev Mol Cell Biol 20:285-302, 2019).
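The profile-matching idea described above can be illustrated with a minimal Python sketch. It assumes that a known marker protein of the target organelle is available and compares abundance profiles across fractions by Pearson correlation. The marker, the profile values and the choice of Pearson correlation are illustrative assumptions and are not taken from SISPRO.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical relative abundances across four enrichment fractions.
marker = [0.1, 0.3, 0.9, 1.0]          # assumed organelle marker
candidate = [0.2, 0.4, 1.0, 1.1]       # follows the marker profile
contaminant = [1.0, 0.8, 0.3, 0.1]     # depleted during enrichment

print("candidate vs marker:", round(pearson(marker, candidate), 3))
print("contaminant vs marker:", round(pearson(marker, contaminant), 3))
```

A protein whose profile correlates strongly with the marker would be a plausible organelle resident, while a negative or weak correlation points to a contaminant.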
The sample dataset PXD010361 studies changes of plasma membrane proteins during Escherichia coli infection. It contains 9 samples with Escherichia coli infection and 9 samples without Escherichia coli infection, and can be downloaded.
The sample dataset JPST000934 compares mitochondrial proteins between acute leukemia cells and healthy peripheral blood mononuclear cells. It contains 5 healthy peripheral blood mononuclear cell samples and 15 acute leukemia cell samples, and can be downloaded.
Summary and Visualization of the Uploaded Raw Data
A. Overview of the Uploaded Raw Data
B. Distribution Visualization of the Uploaded Raw Data
Summary and Visualization of the Data after Preprocessing
A. Missing Value Imputation
B. Data Filtering
C. Data Normalization
Table of Contents
1. The Compatibility of Browser and Operating System (OS)
2. Required Formats of the Input Files
3. Step-by-step Instruction on the Usage of SISPRO
3.1 Data Upload & Preprocessing
3.2 Assessing Signature Robustness
3.3 Assessing Prediction Accuracy
3.4 Protein Function & Signaling Pathway Interpretation
3.5 Protein-protein Interaction Network Analysis
4. A Variety of Organelles for Analysis
5. A Variety of Methods for Signature Identification
6. A Variety of Ensemble Methods for Signature Identification
SISPRO is powered by R shiny. It is free and open to all users with no login requirement and can be readily accessed by a variety of popular web browsers and operating systems as shown below.
In general, the file required at the beginning of a SISPRO analysis should be a sample-by-feature matrix in csv, xls/xlsx or txt format. The sample names and class labels are provided in the first 2 columns of the input file, in that order. These 2 columns must be named “SampleName” and “Class”, and the names must stay unchanged during the entire analysis; the names of the remaining columns are UniProt IDs, UniProt ACs or Entrez IDs. Each sample name is unique and chosen by the user; the class label distinguishes the 2 classes of samples being compared and is given as the number “0” or “1”.
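Before uploading, the format rules above can be checked locally. The following is a minimal Python sketch for the csv case only. The function name `check_input_matrix` and the accession numbers in the example are hypothetical, and the checks mirror the rules stated in this manual rather than the validation SISPRO itself performs.

```python
import csv
import io

REQUIRED_HEADER = ["SampleName", "Class"]  # column names required by the manual


def check_input_matrix(csv_text):
    """Return a list of format problems in a sample-by-feature csv (empty if none)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    if header[:2] != REQUIRED_HEADER:
        problems.append("first two columns must be SampleName and Class")
    if len(header) < 3:
        problems.append("no protein columns found")
    names = [row[0] for row in body if row]
    if len(names) != len(set(names)):
        problems.append("sample names are not unique")
    labels = {row[1] for row in body if len(row) > 1}
    if not labels <= {"0", "1"}:
        problems.append("Class must be 0 or 1")
    return problems


# Hypothetical two-sample file; the empty last cell is a missing value.
example = "SampleName,Class,P12345,Q67890\ns1,0,10.2,8.1\ns2,1,11.5,\n"
print(check_input_matrix(example))
```

An empty list means that the header and label rules are satisfied; missing intensity values are left to the imputation step.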
This website is freely accessible to all users without a login requirement and can be readily accessed by all popular web browsers, including Google Chrome, Mozilla Firefox, Safari and Internet Explorer 10 (or later). Analysis and subsequent performance assessment are started by clicking on the “Analysis” panel on the homepage of SISPRO. The collection of web services and the whole process provided by SISPRO can be summarized into 5 steps: (3.1) data upload & preprocessing, (3.2) assessing signature robustness, (3.3) assessing prediction accuracy, (3.4) protein function & signaling pathway interpretation and (3.5) protein-protein interaction network analysis.
The first radio checkbox in STEP-1 on the left side of the Analysis page is for users to select the way of uploading data. Users can choose to upload their own spatial proteomics data or directly load sample data.
When users upload their own spatial proteomics data, one of 3 data formats (csv, xls/xlsx and txt) can be selected in the remaining radio checkbox below. After selecting the corresponding radio checkboxes, the dataset to be analyzed can be uploaded directly by clicking the Browse button. A preview of the uploaded data is subsequently provided on the web page.
When loading sample data, 2 sets of sample data are provided in this step to facilitate direct access to and evaluation of SISPRO. (1) The plasma membrane proteomics benchmark dataset PXD010361 from the PRoteomics IDEntifications (PRIDE) database studies changes of plasma membrane proteins during Escherichia coli infection. Sample data can be downloaded HERE (right click to save). (2) The mitochondrial proteomics benchmark dataset JPST000934 from the Japan ProteOme STandard Repository (jPOSTrepo) database studies differences in mitochondrial proteins between acute leukemia cells and healthy peripheral blood mononuclear cells. Sample data can be downloaded HERE (right click to save). After selecting the plasma membrane or mitochondrial sample data, a preview of the loaded data is subsequently provided on the web page.
After uploading the spatial proteomics data, 3 steps are provided for data preprocessing: missing value imputation, data filtering and data normalization. The imputation methods available are BPCA Imputation, Column Mean Imputation, Column Median Imputation, Half of the Minimum Pos-value, KNN Imputation, SVD Imputation and Zero Imputation. 2 methods frequently applied to data filtering are covered: Mean Intensity Value and Standard Deviation. Moreover, 19 popular data normalization methods are adopted in SISPRO: Auto Scaling, Contrast, Cubic Splines, Cyclic Loess, Eigen MS, MSTUS, PQN, Quantile, Level scaling, Linear Baseline, Li-Wong, Mean, Median, Pareto Scaling, Power Scaling, Range Scaling, Total Sum, Vast Scaling and VSN. 2 transformation methods, Log Transformation and Cube Root Transformation, are also provided. After selecting or defining the preferred methods and parameters, please proceed by clicking the Process button; a summary and visualization of the data before and after preprocessing are then generated automatically. All resulting data and figures can be downloaded by clicking the corresponding Download button.
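To make the preprocessing chain more concrete, the following minimal Python sketch applies one option from each family to a single protein column: half-of-the-minimum-positive-value imputation, log transformation and auto scaling. The function names are hypothetical, the definitions are the commonly used textbook ones, and the exact formulas and order of operations inside SISPRO may differ.

```python
import math
import statistics


def half_min_impute(column):
    """Replace missing values (None) with half of the smallest positive value."""
    positives = [v for v in column if v is not None and v > 0]
    fill = min(positives) / 2
    return [fill if v is None else v for v in column]


def log2_transform(column):
    """Log transformation; assumes all values are positive after imputation."""
    return [math.log2(v) for v in column]


def auto_scale(column):
    """Auto scaling: mean-center, then divide by the sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]


# Hypothetical intensities of one protein across four samples.
raw = [8.0, None, 2.0, 4.0]
imputed = half_min_impute(raw)
logged = log2_transform(imputed)
scaled = auto_scale(logged)
print(imputed)
print(logged)
print([round(v, 3) for v in scaled])
```

After auto scaling, the column has zero mean and unit standard deviation, which puts proteins of very different abundance on a comparable scale.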
Signature identification and ensemble are provided in this step. SISPRO offers 12 signature identification methods for analyzing spatial proteomics data, together with 3 types of signature ensemble and 6 ensemble methods for producing an ensemble signature rank. Detailed descriptions of the signature identification methods and of the ensemble methods are provided in Section 5 and Section 6 of this Manual, respectively. Users select the signature ensemble type and the signature identification method(s), and set the parameters of the selected methods. After selecting the preferred methods and parameters, please proceed by clicking the Process button. The robustness of the identified signatures, assessed by multiple sampling, is then generated automatically in two parts: (A) The Dependence of Signature Robustness on the Number of Signature Selected and (B) The Occurrence (in Percent) of Each Protein among All Signature. If users have any questions about the previous steps, please click the Back button and try again.
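Part (B) of this output, the occurrence of each protein among the signatures obtained by multiple sampling, can be sketched in Python as follows. The ranking criterion (absolute difference of class means), the stratified 80% subsampling and all function names are assumptions made for illustration; SISPRO uses its own identification methods and additionally reports the relative weighted consistency, which is not computed here.

```python
import random
from collections import Counter


def top_k_by_mean_difference(samples, labels, k):
    """Rank proteins by the absolute difference of class means; return the top k."""
    scores = {}
    for protein in samples[0]:
        g0 = [s[protein] for s, y in zip(samples, labels) if y == 0]
        g1 = [s[protein] for s, y in zip(samples, labels) if y == 1]
        scores[protein] = abs(sum(g1) / len(g1) - sum(g0) / len(g0))
    return sorted(scores, key=scores.get, reverse=True)[:k]


def occurrence_percent(samples, labels, k, rounds=100, fraction=0.8, seed=0):
    """Percentage of subsampling rounds in which each protein enters the top k."""
    rng = random.Random(seed)
    idx0 = [i for i, y in enumerate(labels) if y == 0]
    idx1 = [i for i, y in enumerate(labels) if y == 1]
    counts = Counter()
    for _ in range(rounds):
        # Stratified subsample so that both classes are always represented.
        chosen = (rng.sample(idx0, max(1, int(len(idx0) * fraction)))
                  + rng.sample(idx1, max(1, int(len(idx1) * fraction))))
        sub_samples = [samples[i] for i in chosen]
        sub_labels = [labels[i] for i in chosen]
        counts.update(top_k_by_mean_difference(sub_samples, sub_labels, k))
    return {p: 100.0 * counts[p] / rounds for p in samples[0]}


# Hypothetical data: protein A differs between classes, B and C do not.
samples = [
    {"A": 1.0, "B": 5.0, "C": 3.0}, {"A": 1.2, "B": 5.1, "C": 2.9},
    {"A": 0.9, "B": 4.9, "C": 3.1}, {"A": 1.1, "B": 5.0, "C": 3.0},
    {"A": 9.0, "B": 5.0, "C": 3.1}, {"A": 9.3, "B": 5.2, "C": 3.0},
    {"A": 8.8, "B": 4.8, "C": 2.9}, {"A": 9.1, "B": 5.1, "C": 3.0},
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
result = occurrence_percent(samples, labels, k=1)
print(result)
```

A protein that is selected in nearly every round is a robust member of the signature; proteins with low occurrence depend on which samples happened to be drawn.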
SISPRO provides users with 2 SVM kernel functions: the linear kernel and the radial basis function kernel. After selecting the upper bound of the AUC-based golden section search scope, please select the SVM kernel function and set the corresponding parameter in STEP-3 on the left side of the Analysis page. Finally, please proceed by clicking the Process button. The collective consideration of both prediction accuracy and robustness is then generated automatically in two parts: (A) The Level of Prediction Accuracy for Signature Top-ranked by Signature Robustness and (B) Assessment of Prediction Accuracy for Identified Signatures. If users have any questions about the previous steps, please click the Back button and try again.
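The AUC used to assess prediction accuracy can be computed from classifier scores alone. The minimal Python sketch below uses the rank-based definition: the AUC equals the probability that a randomly chosen class-1 sample receives a higher score than a randomly chosen class-0 sample, with ties counted as one half. The scores in the example are hypothetical; in SISPRO they would come from the selected SVM, which is not reproduced here.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and 0/1 labels (pairwise definition)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical decision values for four samples.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # prints 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.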
Biological interpretation of the selected signature is performed in this step. After selecting the preferred organelle(s), please proceed by clicking the Process button. A collapsible tree of biological annotation with four levels is then generated automatically. The first level is the organelle(s) selected by the user. The second level is the substructures of the organelle (for organelles without substructures, the name of the organelle is shown again). The third level includes two types of annotation for each substructure: protein function and signaling pathway. The last level holds the annotations corresponding to the substructure and annotation type. Pathway information, including description, protein(s) in the pathway, p value, q value, adjusted p value and reference, is shown when clicking on or hovering over the last level.
After selecting the preferred organelle(s) and parameters, please proceed by clicking the Process button. The PPI network is then generated automatically. Node shapes represent different types of proteins: triangles represent the identified signature and dots represent interacting proteins. The color of a line between a signature protein and an interacting protein indicates the organelle/substructure in which the PPI occurs. Protein information, including gene name, gene ID, organelle and substructure of the corresponding protein, is shown when clicking on or hovering over a node. PPI information, including organelle and substructure of the PPI, experiment type and reference, is shown when clicking on or hovering over a line. All PPI information can be downloaded by clicking the corresponding Download PPI Info button.
- Centrosome: The centrosome is the organelle that serves as the main microtubule organizing center (MTOC) of the animal cell, as well as a regulator of cell-cycle progression. It is composed of Centriole and Centriolar satellite (Azimzadeh J, et al. Curr Opin Struct Biol. 66: 96-103, 2020).
The centriole is a complex microtubule-based structure. The core of the centrosome is a pair of centrioles, each made of nine groups of triplet microtubules, arranged perpendicular to each other. As part of the main microtubule organizing center (MTOC), centrioles play an important part in many microtubule-dependent processes such as mitosis and intracellular transport (Azimzadeh J, et al. Curr Opin Struct Biol. 66: 96-103, 2020).
Centriolar satellites are small, electron-dense, membraneless granules that cluster around centrioles. They contain numerous proteins that are directly involved in centrosome maintenance, ciliogenesis and neurogenesis (Azimzadeh J, et al. PLoS Biol. 18: e3000679, 2020).
- Cytoskeleton: The cytoskeleton is a complex, dynamic network of interlinking protein filaments present in the cytoplasm of all cells. It extends from the cell nucleus to the cell membrane and is composed of similar proteins in the various organisms. In eukaryotes, it is composed of three main components: (1) actin filaments, (2) intermediate filaments and (3) microtubules (Wickstead B, et al. J Cell Biol. 194: 513-25, 2011).
Actin filaments, also known as microfilaments or F-actin, are polymers of globular actin subunits that are organized in long, straight bundles or three-dimensional networks of filaments. They are one of the major components of the cytoskeleton, and their dynamics are important to their organization and function. They provide a scaffold for cellular processes, form the contractile ring at the cell cortex during cytokinesis and carry out many other functions (Svitkina T, et al. Cold Spring Harb Perspect Biol. 10: a018267, 2018).
Intermediate filaments are made from a large group of proteins and connect the nuclear membrane and the plasma membrane by forming an extensive network in the cytosol of cells. Their association with numerous proteins and other subcellular structures allows intermediate filaments to play an important role in many cell activities, such as providing mechanical support and shape to cells, intracellular organization, cytoskeletal cross-talk, cell adhesion and cell signaling (Etienne-Manneville S, et al. Annu Rev Cell Dev Biol. 34: 1-28, 2018).
Microtubules are hollow tubes with a diameter of about 25 nm, formed from two globular subunits: alpha- and beta-tubulin. Dynamic instability is one of their main features: microtubules constantly undergo polymerization and rapidly alternate between phases of growth and shrinkage through changes in the relative rates of polymerization and depolymerization. Microtubules not only provide mechanical support to cells, but also play a prominent role in intracellular polarization, organization and transport. They also form the mitotic spindle, which helps to separate chromosomes into the daughter cells during mitosis (Goodson HV, et al. Cold Spring Harb Perspect Biol. 10: a022608, 2018).
- Endoplasmic reticulum: The endoplasmic reticulum (ER) is a membranous network continuous with the outer nuclear membrane. It can be divided into two categories: smooth ER (sER) and rough ER (rER), which has ribosomes attached to its cytoplasmic surface. The rER synthesizes membrane proteins and proteins released to the extracellular space, and the ER also contributes to the synthesis of lipids and steroids. The ER is also a storage site for intracellular ions and maintains homeostasis in the cell by regulating ion release (Csordas G, et al. Trends Cell Biol. 28: 523-40, 2018).
- Golgi apparatus: The Golgi apparatus is a central hub in the endomembrane system of human cells and can be divided into cis-, medial- and trans-Golgi compartments. Vesicles carrying proteins and membrane components from the ER enter at the tubular cis-Golgi network, and their cargo is then modified by various enzymes that reside in the Golgi membranes. Finally, the cargo is sorted and exits in vesicles that bud from the trans-Golgi network (Ravichandran Y, et al. Curr Opin Cell Biol. 62: 104-13, 2020).
- Mitochondria: The mitochondrion is a double-membrane-bound organelle found in most eukaryotic organisms. Because of this double-membrane organization, a mitochondrion has four distinct parts: (1) Mito inner membrane, (2) Mito intermembrane space, (3) Mito matrix and (4) Mito outer membrane (Nunnari J, et al. Cell. 148: 1145-59, 2012).
Mito inner membrane is folded into characteristic cristae, creating multiple sub-compartments where many different biochemical reactions take place. Mitochondria produce energy in the form of ATP at the inner mitochondrial membrane and in the matrix within (Kuhlbrandt W, et al. BMC Biol. 13: 89, 2015).
Mito intermembrane space is the gap between the outer and inner membranes of the mitochondrion, which is filled with amorphous liquid. Its content is very close to that of the cytoplasmic matrix and contains many biochemical substrates, soluble enzymes and cofactors (Kuhlbrandt W, et al. BMC Biol. 13: 89, 2015).
Mito matrix is the liquid between the characteristic cristae created by the folded inner membrane. It contains many enzymes involved in biochemical reactions such as the tricarboxylic acid cycle, fatty acid oxidation and amino acid degradation. Mitochondrial DNA, RNA and ribosomes are also found in the matrix (Kuhlbrandt W, et al. BMC Biol. 13: 89, 2015).
Mito outer membrane has a lower protein content and connects to the endoplasmic reticulum, forming a structure called the mitochondria-associated ER membrane (MAM). It carries out the preliminary oxidation of substrates and is involved in many different biochemical reactions, such as fatty acid chain elongation, oxidation of adrenaline and tryptophan degradation (Kuhlbrandt W, et al. BMC Biol. 13: 89, 2015).
- Nucleus: The nucleus is a membrane-bound organelle found in eukaryotic cells. The main structures making up the nucleus are the (1) Nuclear membrane, (2) Nucleoli and (3) Nucleoplasm (Lusk CP, et al. Curr Opin Cell Biol. 44: 44-50, 2017).
Nuclear membrane consists of two lipid bilayers that separate the nucleoplasm from the cytoplasm. The inner membrane and the underlying nuclear lamina contain intermediate filament proteins and serve as an anchoring site for chromatin. The outer membrane is attached to the endoplasmic reticulum. Nuclear pore complexes are distributed throughout the membrane, allowing free diffusion of small molecules and selective transport of large molecules (Dingwall C, et al. Science. 258: 942-7, 2018).
Nucleoli are non-membrane-bound subcompartments in the nucleoplasm, assembled around nucleolar organizing regions (NORs). The main function of nucleoli is the synthesis, processing and assembly of ribosomes; they also contribute to stress responses and cell cycle regulation (Fulka H, et al. Trends Mol Med. 21: 663-72, 2015).
Nucleoplasm contains most of the human genome and a large number of proteins involved in DNA-related cellular processes, some of which are involved in the formation of substructures such as nucleoli, nucleosomes and nuclear spots (Galganski L, et al. Nucleic Acids Res. 45: 10350-68, 2017).
- Plasma membrane: The plasma membrane is a lipid bilayer that separates the interior of the cell from the exterior. It is composed of phospholipids, cholesterol, glycolipids and a large fraction of membrane proteins. It plays an important role in cell communication and signalling, cell adhesion, cell shape and cell motility. The plasma membrane also controls material transport: some small molecules diffuse through it, whereas larger molecules have to pass through transmembrane protein channels and transporters. Cell junctions are large protein complexes in the plasma membrane that connect cells to the extracellular matrix (ECM) or to neighboring cells (Krapf D, et al. Curr Opin Cell Biol. 53: 15-21, 2018).
- Vesicle: Vesicle is a collective term for small structures or organelles enclosed by a lipid bilayer. Based on function, there are transport vesicles, which carry proteins and lipids between different cellular compartments; secretory vesicles, which release substances to the exterior of the cell through exocytosis; endosomes and lysosomes, which take up substances by endocytosis or phagocytosis; and vesicles with special functions, such as peroxisomes (van Niel G, et al. Nat Rev Mol Cell Biol. 19: 213-28, 2018).
- Chi-squared Test. The chi-squared test (CHIS) is a widely used hypothesis testing method for count data. It first hypothesizes the independence of two events and then assesses this hypothesis by the deviation between the observed values and the values expected under independence (McHugh ML, et al. Biochem Med (Zagreb). 23: 143-9, 2013). Using the CHIS statistic for signature identification is similar to performing a hypothesis test on the distribution of classes (Zhang H, et al. Biomed Res Int. 2014: 589290, 2014).
- Correlation-based Method. The Correlation-based Method is a multivariate filter method that evaluates a feature subset according to the prediction ability of each feature in it and the correlation between the features. Its core hypothesis is that subsets with strong prediction ability and low internal correlation perform well (Batushansky A, et al. Biomed Res Int. 2016: 8313272, 2016).
- Entropy-based Filters. Entropy-based Filters is a filter-based feature ranking technique including information gain, gain ratio and symmetrical uncertainty (Tang J, et al. Brief Bioinform. 21: 1378-90, 2020). Information gain selects the features based on the information contribution related to the class variable without considering feature interaction. Gain ratio is the non-symmetrical measure that is introduced to compensate for the bias of the information gain. Symmetrical uncertainty criterion compensates for information gain's inherent bias.
- Fold Change Analysis. Fold Change (FC) is a basic and widely used method for identifying differential gene expression, referring to the fold change between two samples (Feng J, et al. Bioinformatics. 28: 2782-8, 2012). For example, FC has been applied in metabolomics to identify urinary metabolomic biomarkers of aminoglycoside nephrotoxicity in newborn rats (Hanna MH, et al. Pediatr Res. 73: 585-91, 2013).
- Linear Model & Bayes. Linear Model & Bayes assesses differential intensities by measuring features based on t-statistics and fold changes simultaneously (Tang J, et al. Brief Bioinform. 21: 1378-90, 2020).
- PLS-DA. Partial Least Squares Discriminant Analysis (PLS-DA) uses the partial least squares (PLS) algorithm to establish a model for predicting the categories of samples or discriminative variable selection (Lee LC, et al. Analyst. 143: 3526-39, 2018). It consists of a classical PLS regression analysis in which the response regressor is the class label. PLS components are built by trying to find a proper compromise between describing the data and predicting the response.
- Random Forest. Random forest refers to a classifier that uses multiple decision trees to train and predict samples. It belongs to supervised learning (Touw WG, et al. Brief Bioinform. 14: 315-26, 2013). The algorithm has become very popular for pattern recognition in OMIC data, mainly because it provides two aspects that are very important for data mining: high accuracy and information on the variable importance for classification.
- Random Forest-Recursive Feature Elimination. Random Forest-Recursive Feature Elimination (RF-RFE) combines random forest and recursive feature elimination. It is a recursive backward feature elimination procedure (Zhou L, et al. Anal Bioanal Chem. 403: 203-13, 2012). In each iteration, a random forest is constructed to measure the features' importance and the least important feature is removed. This procedure is repeated until no feature is left. Finally, the features are ranked by the order of deletion, with the last deleted feature ranked first.
- Significance Analysis for Microarrays. Significance Analysis for Microarrays (SAM) is a statistical approach for the identification of molecular quantities that differ significantly between two measurement sets (Constantinou C, et al. J Proteome Res. 10: 869-79, 2011). It is a permutation-based (non-parametric) hypothesis testing method, where the two measurement sets represent different physiological conditions (Larsson O, et al. BMC Bioinformatics. 6: 129, 2005).
- Student t-test. The Student t-test compares the means of two data sets and judges whether the difference between them is significant. It is a test under normal curve theory (Kumar N, et al. Bioinformation. 13: 202-8, 2017).
- Support Vector Machine-Recursive Features Elimination. Support Vector Machine-Recursive Features Elimination (SVM-RFE) identifies the least useful attributes and eliminates them from further analysis or from the development of prediction models (Ding Y, et al. BMC Bioinformatics. 7 Suppl 2: S12, 2006).
- Wilcoxon rank-sum test. The Wilcoxon rank-sum test is generally used to detect whether 2 data sets come from the same population. It is frequently used in statistical practice for the comparison of measures of location when the underlying distributions are far from normal or not known in advance (Rosner B, et al. Biometrics. 59: 1089-98, 2003).
- Homogeneous. Homogeneous ensemble uses the same signature identification method with different training data. By distributing the dataset over several nodes and parallelizing the training task, it can considerably reduce training time while ensuring reasonable classification accuracy. It is therefore well suited to the analysis of large data (B. Seijo-Pardo, et al. Knowl Based Syst. 118: 124-39, 2017).
- Heterogeneous. Heterogeneous ensemble uses different signature identification methods on the same training data. Its advantage is the ability to maintain or improve classification performance while freeing users from choosing the most appropriate signature identification method for their study (B. Seijo-Pardo, et al. Knowl Based Syst. 118: 124-39, 2017).
- Hybrid. Hybrid ensemble combines homogeneous and heterogeneous ensemble. It uses different signature identification methods with different training data (Neumann U, et al. BioData Min. 10: 21, 2017).
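As an illustration of the heterogeneous case, the following minimal Python sketch ranks proteins with two of the methods listed earlier (fold change and Student t-test) on the same data and merges the two rankings by their mean rank. Mean-rank aggregation is an assumption chosen for simplicity; it is not confirmed to be one of the ensemble methods offered by SISPRO, and all function names and data values are hypothetical.

```python
import math
import statistics


def split_by_class(values, labels):
    """Split one protein's intensities into class-0 and class-1 groups."""
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    return g0, g1


def abs_log2_fold_change(values, labels):
    """Absolute log2 ratio of the class means."""
    g0, g1 = split_by_class(values, labels)
    return abs(math.log2(statistics.mean(g1) / statistics.mean(g0)))


def abs_student_t(values, labels):
    """Absolute two-sample Student t statistic with pooled variance."""
    g0, g1 = split_by_class(values, labels)
    n0, n1 = len(g0), len(g1)
    pooled = ((n0 - 1) * statistics.variance(g0)
              + (n1 - 1) * statistics.variance(g1)) / (n0 + n1 - 2)
    diff = abs(statistics.mean(g1) - statistics.mean(g0))
    return diff / math.sqrt(pooled * (1 / n0 + 1 / n1))


def rank_proteins(matrix, labels, score):
    """Map each protein to its rank (1 = strongest) under one scoring method."""
    ordered = sorted(matrix, key=lambda p: score(matrix[p], labels), reverse=True)
    return {p: i + 1 for i, p in enumerate(ordered)}


def mean_rank_ensemble(matrix, labels, scorers):
    """Order proteins by their mean rank over all scoring methods."""
    rankings = [rank_proteins(matrix, labels, s) for s in scorers]
    mean_rank = {p: statistics.mean([r[p] for r in rankings]) for p in matrix}
    return sorted(matrix, key=mean_rank.get)


# Hypothetical intensities: three proteins, three samples per class.
labels = [0, 0, 0, 1, 1, 1]
matrix = {
    "P1": [1.0, 1.1, 0.9, 4.0, 4.2, 3.8],     # large and consistent change
    "P2": [1.0, 3.0, 2.0, 4.0, 8.0, 6.0],     # large but noisy change
    "P3": [5.0, 5.01, 4.99, 5.5, 5.51, 5.49], # small but very consistent change
}
ensemble = mean_rank_ensemble(matrix, labels, [abs_log2_fold_change, abs_student_t])
print(ensemble)
```

In this toy example the two methods disagree: fold change prefers the noisy P2 over P3, while the t-test puts P3 first. The aggregated ranking keeps P1 on top because it scores well under both, which is the kind of stability an ensemble is meant to provide.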
Please feel free to visit the website of Prof. Feng Zhu (the corresponding author):
Official website: https://idrblab.org/Peoples.php
Get in touch directly at zhou_ying@zju.edu.cn