FCCLnc: functional characterization of disease/comorbidity-associated long noncoding RNAs

FCCLnc was constructed to enable functional characterization of lncRNAs by (1) integrating diverse SNPs that were associated with 193 diseases standardized by the WHO International Classification of Diseases (ICD-11), (2) enabling lncRNA functional characterization in 193 diseases and a large number of comorbidities, (3) reducing false discovery by detecting the condition-specific expression of lncRNAs and (4) providing interactive visualization and full download of lncRNA-centered co-expression networks.

Thanks a million for using and improving FCCLnc, and please feel free to report any errors to Dr. TANG at tangj@cqu.edu.cn.

Browser and Operating System (OS) Tested for Smoothly Running FCCLnc:

FCCLnc is free and open to all users with no login requirement and can be readily accessed by popular web browsers and operating systems shown below.

Brief Introduction to the Function and Usage of FCCLnc

Discovering the Potentially Disease-associated lncRNAs by SNP-disease Associations

The SNP-disease associations were first collected from GWASdb (Li MJ, et al. Nucleic Acids Res. 44: D869-D876, 2016), the NHGRI-EBI GWAS Catalog (Buniello A, et al. Nucleic Acids Res. 47: D1005-D1012, 2019) and GRASP2 (Zhong C, et al. BMC Bioinformatics. 20: 276, 2019), together with our comprehensive literature review on PubMed, which led to 24,339 associations between 193 standardized diseases and 22,458 SNPs. Second, the chromosomal location data of lncRNAs were downloaded from NONCODEV5 (Fang S, et al. Nucleic Acids Res. 46: D308-D314, 2018) to match the disease-associated SNPs to lncRNA regions. Finally, 10,936 lncRNAs (each containing at least one disease-associated SNP) were identified as “potentially disease-associated”.
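The matching of SNPs to lncRNA regions can be illustrated with a minimal Python sketch. It is not the FCCLnc implementation: the record layouts, the function name map_snps_to_lncrnas and the toy coordinates are hypothetical, and a SNP is assumed to match a lncRNA when its position lies between the lncRNA start and end on the same chromosome.

```python
from collections import defaultdict

def map_snps_to_lncrnas(snps, lncrnas):
    """Return {lncRNA id: sorted list of SNP ids located inside it}.

    snps:    iterable of (snp_id, chromosome, position)
    lncrnas: iterable of (lncrna_id, chromosome, start, end), ends inclusive
    """
    # Group SNP positions by chromosome so each lncRNA is only compared
    # with SNPs on its own chromosome.
    by_chrom = defaultdict(list)
    for snp_id, chrom, pos in snps:
        by_chrom[chrom].append((pos, snp_id))

    hits = {}
    for lnc_id, chrom, start, end in lncrnas:
        inside = sorted(s for p, s in by_chrom[chrom] if start <= p <= end)
        if inside:  # keep only lncRNAs carrying at least one SNP
            hits[lnc_id] = inside
    return hits

# Hypothetical toy records (not real coordinates)
snps = [("rs1", "chr1", 150), ("rs2", "chr1", 900), ("rs3", "chr2", 150)]
lncrnas = [("lncA", "chr1", 100, 200), ("lncB", "chr2", 500, 600)]
print(map_snps_to_lncrnas(snps, lncrnas))
```

In this toy example only lncA contains a SNP, so it is the only lncRNA that would be labelled “potentially disease-associated”.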

Detecting the Interindividual Variability of lncRNA by Condition-specific Expression

The interindividual expression variability of a lncRNA is assessed using the standard measure ‘coefficient of variation (CV)’ (Ecker S, et al. Genome Med. 7: 8, 2015). A low CV suggests a lncRNA that functions in normal cells, while a high CV suggests a disease-related lncRNA (Signal B, et al. Trends Genet. 32: 620-637, 2016). Herein, the CV is defined as the ratio of the standard deviation of the lncRNA expression levels measured across the patients to their mean (Ecker S, et al. Genome Med. 7: 8, 2015). The CV values of the “potentially disease-associated” lncRNAs identified in the previous section were then calculated and ranked. Finally, the top-N ranked lncRNAs were identified as “disease-associated”.
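The CV calculation and top-N ranking described above can be sketched in a few lines of Python. This is an illustration only: the function names and toy expression values are hypothetical, and the sample standard deviation is assumed because the text does not state whether the sample or population form is used.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    m = mean(values)
    if m == 0:
        return float("nan")  # CV is undefined when the mean is zero
    return stdev(values) / m

def top_n_by_cv(expression, n):
    """Rank lncRNAs by decreasing CV and return the first n identifiers.

    expression: {lncRNA id: list of expression levels across patients}
    """
    cvs = {g: coefficient_of_variation(v) for g, v in expression.items()}
    ranked = sorted(
        (g for g in cvs if cvs[g] == cvs[g]),  # drop NaN (NaN != NaN)
        key=lambda g: cvs[g],
        reverse=True,
    )
    return ranked[:n]

# Hypothetical toy expression levels for three lncRNAs in four patients
expression = {
    "lncA": [10, 10, 10, 10],  # no variability, CV = 0
    "lncB": [1, 5, 9, 13],     # high variability
    "lncC": [8, 10, 12, 10],   # moderate variability
}
print(top_n_by_cv(expression, 2))
```

With N = 2, the two most variable lncRNAs (lncB, then lncC) are retained.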

Constructing the Co-expression Network Based on lncRNAs’ Neighboring Genes

The comprehensive data of 96,308 lncRNAs and 19,975 protein-coding genes were first collected from NONCODEV5 (Fang S, et al. Nucleic Acids Res. 46: D308-D314, 2018) and GENCODEV31 (Frankish A, et al. Nucleic Acids Res. 47: D766-D773, 2019), respectively. Second, the genes located within 5 kb to 500 kb upstream or downstream of the studied lncRNAs were identified, which resulted in a collection of neighboring genes of the studied disease-associated lncRNAs. Third, WGCNA (Langfelder P, et al. BMC Bioinformatics. 9: 559, 2008) was used to compute a co-expression network based on the studied lncRNAs and their neighboring genes. The resulting co-expression network is illustrated and downloadable in FCCLnc.
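The search for neighboring genes can be sketched as follows. This is a hedged illustration rather than the FCCLnc code: the function names, record layout and coordinates are hypothetical, strand is ignored, and the distance is assumed to be measured between the nearest ends of the two intervals (0 if they overlap).

```python
def gap(a_start, a_end, b_start, b_end):
    """Distance in bp between two intervals; 0 if they overlap."""
    return max(0, b_start - a_end, a_start - b_end)

def neighboring_genes(lncrna, genes, window):
    """Return ids of genes within `window` bp up/downstream of the lncRNA.

    lncrna: (lncrna_id, chromosome, start, end)
    genes:  iterable of (gene_id, chromosome, start, end)
    """
    _, chrom, start, end = lncrna
    return [
        gene_id
        for gene_id, g_chrom, g_start, g_end in genes
        if g_chrom == chrom and gap(start, end, g_start, g_end) <= window
    ]

# Hypothetical toy coordinates
lncrna = ("lncA", "chr1", 100_000, 101_000)
genes = [
    ("g1", "chr1", 103_000, 110_000),  # 2 kb downstream
    ("g2", "chr1", 700_000, 710_000),  # 599 kb downstream
    ("g3", "chr2", 100_500, 100_900),  # different chromosome
    ("g4", "chr1", 90_000, 96_000),    # 4 kb upstream
]
print(neighboring_genes(lncrna, genes, window=5_000))
```

Widening the window to 500 kb would still exclude g2 in this example, because it lies 599 kb away.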

Characterizing the lncRNA Function in Comorbidity Using Common Disease Genes

The mechanism of lncRNAs in comorbid diseases can be explained by their shared genetic factors (namely “common disease genes”) (Goh KI, et al. Proc Natl Acad Sci U S A. 104: 8685-8690, 2007; Ko Y, et al. Sci Rep. 6: 39433, 2016). Thus, uploaded matrices containing the data of multiple diseases are allowed. First, the RNA expression data of each disease are analyzed using the sequential steps discussed above, which results in multiple co-expression networks. Second, a direct overlap of the disease-associated RNAs among the multiple diseases is computed, which identifies a set of common disease genes. Third, the multiple networks are linked together based on this set of common disease genes, and the resulting network is the network of the comorbidity. Finally, the common disease lncRNAs are characterized as "comorbidity-associated", and their function is annotated by their co-expressed mRNAs.
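The overlap and linking steps can be sketched in Python as below. The text does not specify how the networks are linked, so this sketch assumes the simplest rule: the per-disease edge lists are pooled, and the networks become connected through the nodes they share. All names and toy data are hypothetical.

```python
def common_disease_genes(rna_sets):
    """Direct overlap of disease-associated RNAs across all diseases.

    rna_sets: {disease name: set of disease-associated RNA ids}
    """
    sets = [set(s) for s in rna_sets.values()]
    return set.intersection(*sets) if sets else set()

def link_networks(networks):
    """Pool per-disease edges; record which diseases support each edge.

    networks: {disease name: list of (node_a, node_b) co-expression edges}
    Returns {frozenset({node_a, node_b}): set of disease names}.
    """
    merged = {}
    for disease, edges in networks.items():
        for a, b in edges:
            merged.setdefault(frozenset((a, b)), set()).add(disease)
    return merged

# Hypothetical toy data for two comorbid diseases
rna_sets = {
    "diabetes": {"lnc1", "geneA", "geneB"},
    "obesity": {"lnc1", "geneB", "geneC"},
}
networks = {
    "diabetes": [("lnc1", "geneA"), ("lnc1", "geneB")],
    "obesity": [("lnc1", "geneB"), ("lnc1", "geneC")],
}
common = common_disease_genes(rna_sets)
merged = link_networks(networks)
print(sorted(common))
print(len(merged))
```

Here lnc1 and geneB are shared by both diseases, so lnc1 would be the candidate "comorbidity-associated" lncRNA, and the merged network joins the two disease networks through these shared nodes.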

Table of Contents

1. The Compatibility of Browser and Operating System (OS)

2. Step-by-step Instruction on the Usage of FCCLnc

2.1 Required Formats of the Input Files

2.2 Upload Your lncRNA and mRNA Expression Matrices Separately or Load the Sample Data Provided in FCCLnc

2.3 Discover Disease-associated lncRNAs

2.4 Construct Co-expression Network

2.5 Annotate lncRNA Function

3. FCCLnc platform integrated the SNP-disease associations based on GWAS from various public databases and manual searching

3.1 NHGRI

3.2 GWASdb

3.3 GRASP2

4. WGCNA computation methods

5. GO terms (BP, MF, CC) and KEGG pathway

5.1 GO

5.2 KEGG Pathway

6. Combining several computational methods

6.1 Differential expression

6.2 Guilt-by-association

6.3 Condition-specific expression

6.4 Combining several computational methods

1. The Compatibility of Browser and Operating System (OS)

FCCLnc is powered by R Shiny. It is free and open to all users with no login requirement and can be readily accessed by a variety of popular browsers and operating systems.

2. Step-by-step Instruction on the Usage of FCCLnc

This website is free and open to all users with no login requirement, and can be readily accessed by all popular web browsers, including Google Chrome, Mozilla Firefox, Safari and Internet Explorer 10 (or later). Analysis and subsequent functional characterization of disease-associated lncRNAs are started by clicking on the "Analysis" panel on the homepage of FCCLnc. The collection of web services and the whole process provided by FCCLnc can be summarized into 4 steps: (2.2) uploading gene expression data, (2.3) discovering disease-associated lncRNAs, (2.4) constructing the co-expression network and (2.5) annotating lncRNA function. All resulting data and figures can be downloaded by clicking the corresponding "Download" button. The flowchart below summarizes the analysis process in FCCLnc.

2.1 Required Formats of the Input Files

In general, the file required at the beginning of an FCCLnc analysis is an expression matrix in TXT/CSV format, with samples in columns and genes in rows. The lncRNA gene name can be a NONCODE Gene ID (for example: NONHSAG000001.2) or an Ensembl Gene ID (for example: ENSG00000263089), and the mRNA gene name should be an Ensembl Gene ID (for example: ENSG00000168746). The expression values should be read counts or normalized values. The group name (label) of each sample can be "control" and "case" for a single disease, or the disease name for comorbidity. The sample format of the lncRNA and mRNA datasets is shown below.
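Before uploading, gene identifiers can be checked against the formats listed above. The Python sketch below is not part of FCCLnc; the regular expressions are assumptions inferred from the three example IDs (human NONCODE gene IDs starting with NONHSAG, and Ensembl gene IDs made of ENSG plus 11 digits, each with an optional version suffix), and the function name is hypothetical.

```python
import re

# Patterns inferred from the example IDs in this manual (assumptions)
NONCODE_GENE = re.compile(r"^NONHSAG\d+(\.\d+)?$")
ENSEMBL_GENE = re.compile(r"^ENSG\d{11}(\.\d+)?$")

def invalid_gene_ids(ids, kind):
    """Return the IDs that do not match the expected format.

    kind: "lncRNA" accepts NONCODE or Ensembl gene IDs;
          "mRNA" accepts Ensembl gene IDs only.
    """
    if kind == "mRNA":
        patterns = [ENSEMBL_GENE]
    else:
        patterns = [NONCODE_GENE, ENSEMBL_GENE]
    return [i for i in ids if not any(p.match(i) for p in patterns)]

print(invalid_gene_ids(["NONHSAG000001.2", "ENSG00000263089", "BRCA1"], "lncRNA"))
print(invalid_gene_ids(["ENSG00000168746", "NONHSAG000001.2"], "mRNA"))
```

Any identifier returned by this check (for instance a gene symbol such as BRCA1) would need to be converted to one of the accepted ID types first.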

2.2 Upload Your lncRNA and mRNA Expression Matrices Separately or Load the Sample Data Provided in FCCLnc

In STEP 1, users can choose to upload their own gene expression matrices or to directly load the sample data on the left side of the Analysis page. The disease corresponding to the uploaded samples, the top-N lncRNAs ranked by CV values for detecting condition-specific expression, and the distance (up/downstream) for defining the neighboring mRNAs of a studied lncRNA can be selected in the 3 remaining drop-down selection boxes. After selecting all parameters, the datasets provided by the users can be uploaded for further analysis by clicking “Submit”.

Two sets of sample data are also provided in this step to facilitate direct access to and evaluation of FCCLnc. The breast cancer benchmark dataset was collected from the TCGA database and includes samples of 115 breast cancer tissues and 113 paracancerous tissues (Cancer Genome Atlas Network, et al. Nature. 490(7418):61-70, 2012). The benchmark dataset GSE133099 was collected from the GEO database and includes samples from six patients with diabetes and six people with obesity (Barrett T, et al. Nucleic Acids Res. 41: D991-D995, 2013).

2.3 Discover Disease-associated lncRNAs

In this procedure, disease-associated lncRNAs are identified by integrating SNP-disease associations with condition-specific expression. The outputs contain: (1) the distribution of the uploaded raw data and (2) the distribution of the log-transformed data of the disease-associated lncRNAs, which are identified via the disease-SNP associations. All resulting data can be downloaded by clicking the corresponding download button.

2.4 Construct Co-expression Network

In this procedure, the co-expression network between the disease-associated lncRNAs and their neighboring genes is constructed. The resulting co-expression network can be downloaded in HTML format (for visual analysis) and in CYS format (supporting network analysis in Cytoscape).

The dynamic network visualization provides the following features:

(1) Select a specific lncRNA or mRNA by Gene Name or Gene Class to display its information.

(2) When a node is clicked, the edges and other nodes connected to it are highlighted.

(3) When hovering over a node, the annotation information of the node is displayed, including gene ID, gene symbol, gene location, NGG_Name: name of the lncRNA’s neighboring gene (show 1), NGG_ID: Ensembl ID of the lncRNA’s neighboring gene (show 1), SNP: SNPs located on the lncRNA (show 3), Group: the group of the node, KEGG_Pathway: the KEGG pathway that the mRNA is involved in for an mRNA node, or the KEGG pathway that the NGG is involved in for a lncRNA node (show 1), GO_Term: the GO term that the mRNA is involved in for an mRNA node, or the GO term that the NGG is involved in for a lncRNA node (show 1).

2.5 Annotate lncRNA Function

In this procedure, enrichment analysis of GO terms (biological processes, molecular functions and cellular components) and KEGG pathways is performed on the mRNAs co-expressed with the disease-associated lncRNAs. The outputs contain: (1) a chord diagram of the KEGG pathway enrichment result and (2) a chord diagram of the GO term enrichment result, both based on the mRNAs co-expressed with the disease-associated lncRNAs. All resulting data and figures can be downloaded by clicking the corresponding download button.
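The idea behind term enrichment can be illustrated with a one-sided hypergeometric (over-representation) test. This manual does not state which statistical test FCCLnc uses, so the test choice is an assumption, and the function name and the numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_in_term, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_in_term are annotated to the term."""
    total = comb(n_background, n_selected)
    upper = min(n_in_term, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20 background genes, 5 annotated to a pathway,
# 4 co-expressed mRNAs of which 3 fall in the pathway.
p = enrichment_p_value(n_background=20, n_in_term=5, n_selected=4, n_overlap=3)
print(round(p, 4))
```

A small p-value suggests that the co-expressed mRNAs contain more members of the term than expected by chance; in practice the p-values of many terms would also need multiple-testing correction.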

3. FCCLnc platform integrated the SNP-disease associations based on GWAS from various public databases and manual searching

Given the characteristic specificity of lncRNA expression, there is growing interest in the use of these molecules as disease biomarkers. SNPs have been reported to affect the structure, expression, and function of lncRNAs (Castellanos-Rubio, et al. Front. Immunol. 10, 420). SNPs identified by genome-wide association studies (GWAS) and other genomic variations located within or near lncRNAs can also point to functional roles in specific phenotypes. FCCLnc provides disease-SNP associations collected from the NHGRI GWAS Catalog, GWASdb and GRASP2, together with manual literature searching. All diseases in FCCLnc were standardized according to ICD-11.

3.1 NHGRI

The National Human Genome Research Institute (NHGRI) focuses on advancing genomics research. The NHGRI Catalog of Published Genome-Wide Association Studies (GWAS Catalog) (Buniello, et al. Nucleic Acids Res. 47(D1):D1005-D1012) provides a publicly available, manually curated collection of published GWAS assaying at least 100,000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10^-5. The Catalog includes 1751 curated publications of 11912 SNPs. In addition to the SNP-trait association data, the Catalog also publishes a quarterly diagram of all SNP-trait associations mapped to the SNPs' chromosomal locations.

3.2 GWASdb

The genome-wide association studies database (GWASdb) (Li M.J, et al. Nucleic Acids Res. 44, D869-D876) is a database of human genetic variants (GVs) identified by genome-wide association studies. GWASdb contains 20 times more data than the GWAS Catalog and includes less significant GVs (P < 1 × 10^-3) manually curated from the literature. In addition, GWASdb provides comprehensive functional annotations for each GV, including gene expression and disease associations.

3.3 GRASP2

GRASP v2.0 (Zhong C, et al. BMC Bioinformatics. 20, 276) contains over 8.87 million SNP associations reported in 2082 studies. GRASP v2.0 is a user-friendly means for diverse sets of researchers to query reported SNP associations (P<0.05) with human traits, including methylation and expression quantitative trait loci (QTL) studies.

4. WGCNA computation methods

Correlation networks are increasingly used in bioinformatics applications. Weighted correlation network analysis (WGCNA) (Langfelder P, et al. BMC Bioinformatics. 9, 559) is a powerful guilt-by-association (GBA) method for constructing co-expression networks from expression data. WGCNA was developed for and has been applied to high-throughput microarray and RNA-seq datasets, since it provides system-level insights and is highly sensitive to genes with low abundance or small fold changes, without information loss. WGCNA can detect clusters (modules) of highly correlated genes. Co-expression is measured by the correlation coefficients between the expression profiles of molecules: molecules in the same module have similar expression patterns, which differ from those of other modules. Molecules with similar expression patterns may participate in the same biological process or pathway. Because the network is weighted, the connection strength between nodes remains continuous. WGCNA has strong analytical power, and the more samples, the better the results. However, as the number of samples and genes increases, more computational resources are needed, and this heavy resource consumption is the main disadvantage of WGCNA.
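The weighting step at the core of WGCNA can be illustrated in plain Python. In an unsigned weighted network, the adjacency between two genes is the absolute Pearson correlation of their expression profiles raised to a soft-thresholding power beta. The sketch below shows only this step, not module detection, and does not call the WGCNA R package; the function names, the toy profiles and the choice beta = 6 are illustrative assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def soft_threshold_adjacency(expression, beta=6):
    """Unsigned weighted adjacency: a_ij = |cor(x_i, x_j)| ** beta.

    expression: {gene id: list of expression levels across samples}
    Returns {(gene_i, gene_j): adjacency} for each unordered pair.
    """
    names = list(expression)
    adjacency = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(expression[a], expression[b])
            adjacency[(a, b)] = abs(r) ** beta
    return adjacency

# Hypothetical toy profiles across four samples
expression = {
    "lncA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],   # perfectly correlated with lncA
    "geneC": [4, 1, 3, 2],   # weakly anti-correlated with lncA
}
for pair, weight in soft_threshold_adjacency(expression).items():
    print(pair, round(weight, 6))
```

Raising the correlation to a power keeps strong correlations close to 1 while pushing weak ones towards 0, so edge weights stay continuous instead of being cut at a hard threshold.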

5. GO terms (BP, MF, CC) and KEGG pathway

5.1 GO

The Gene Ontology (GO) (Gene Ontology, et al. Nucleic Acids Res. 47, D330-D33) is a resource that supplies information about gene product function using ontologies to represent biological knowledge. These ontologies cover three domains: Cellular Component (CC), Molecular Function (MF), and Biological Process (BP). The GO database thus standardizes the description of gene products in terms of their functions, the biological processes they participate in and their localization in cells. Starting from this simple annotation of gene products, GO enrichment analysis gives a rough picture of the biological functions, processes or cellular localizations in which differential genes are enriched. The GO terms are organized as a directed acyclic graph (DAG). Their annotations describe the specific roles of candidate genes and summarize the interaction network of their products without redundancy. The functional differentiation of cancer-relevant and cancer-irrelevant lncRNAs may be directly reflected by their respective GO terms and can be easily studied with the help of the corresponding analytic tools (AmiGO, CGAP, DAVID, etc.). This method can also be summarized as ontology-based screening.

5.2 KEGG Pathway

KEGG (Kanehisa M, et al. Nucleic Acids Res. 45, D353-D361) is an encyclopedia of genes and genomes. The primary objective of the KEGG database project is to assign functional meanings to genes and genomes at both the molecular and higher levels. Higher-level functions are represented by networks of molecular interactions, reactions and relations in the forms of KEGG pathway maps, BRITE hierarchies and KEGG modules. KEGG is moving towards becoming a comprehensive knowledge base for both functional interpretation and practical application of genomic information. The database has four parts: systems information, genomic information, chemical information and health information. The systems information contains the pathway maps for cellular and organismal functions, which may help to screen the diversity of functional lncRNAs in tumor and normal tissues. In this context, pathways mainly refer to metabolic pathways, and pathway analysis of differential genes can be used to understand which of them are significantly altered under experimental conditions, which is particularly important in mechanism studies. Based on the KEGG database, the functional pathways in tumorigenesis and their respective regulatory lncRNAs can be identified and confirmed, which can be summarized as pathway-based screening. Supported by Kanehisa Laboratories, the KEGG database is a functional database that stores high-level functions and utilities of biological systems. LncRNAs play an important role in the regulation of gene expression. Identifying the GO terms and KEGG pathways of cancer-related lncRNAs is very helpful for revealing cancer-related biological processes. For an accurate description of detailed biological functions, gene ontology (GO) and KEGG pathways are introduced for further functional clustering and summary of lncRNAs. We identified the relationships between cancer-related lncRNAs and GO terms or KEGG pathways based on the constructed dataset. However, not all GO terms and KEGG pathways are equally associated with cancer-related lncRNAs.

6. Combining several computational methods

Core features of functional lncRNAs can be probed via an array of computational methods strengthened by publicly available datasets. Several methods based on binding and sequence features can be applied to build evidence for function and point towards particular mechanisms, and predictive algorithms are beginning to show promise in interrogating lncRNA functional properties. Because predictive tools can work on lowly and specifically expressed transcripts, it is foreseeable that their continued development will enable functional characterization of a much wider pool of lncRNAs. Such computational methods include differential expression, guilt-by-association and condition-specific expression (Signal B, et al. Trends Genet. 32, 620-637).

6.1 Differential expression

The most common method of inferring lncRNA function in a system is differential expression analysis. This widely used and generally accepted method is adept at prioritizing candidates for further examination, but differential expression alone does not typically produce any functional insights.

6.2 Guilt-by-association

Guilt-by-association, as the name suggests, assigns putative functions to transcripts based on those they are co-expressed with. It is predicated on the idea that co-expressed transcripts are more likely to be coregulated, share similar functions, or be involved in similar biological processes. Guilt-by-association takes advantage of the general characteristics of lncRNAs by exploiting other biological contexts. The difference between the two methods (6.1/6.2) is that guilt-by-association can use expression patterns from multiple related biological conditions, enabling the identification of distinct relationships between transcripts.
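The guilt-by-association principle can be shown with a small Python sketch: a lncRNA inherits candidate functions from the annotated mRNAs it is co-expressed with. This is a simplified illustration, not the FCCLnc procedure; the function name, the edges and the annotation terms are hypothetical.

```python
from collections import Counter

def annotate_by_association(lncrna, edges, mrna_terms):
    """Count the functional terms of the mRNAs co-expressed with a lncRNA.

    edges:      iterable of (node_a, node_b) co-expression edges
    mrna_terms: {mRNA id: list of functional terms}
    Returns a Counter {term: number of co-expressed mRNAs carrying it}.
    """
    # Direct co-expression partners of the lncRNA
    partners = {b if a == lncrna else a for a, b in edges if lncrna in (a, b)}
    return Counter(t for p in partners for t in mrna_terms.get(p, ()))

# Hypothetical toy network and annotations
edges = [("lnc1", "g1"), ("lnc1", "g2"), ("g2", "g3")]
mrna_terms = {
    "g1": ["apoptosis", "cell cycle"],
    "g2": ["apoptosis"],
    "g3": ["transport"],
}
print(dict(annotate_by_association("lnc1", edges, mrna_terms)))
```

In this example "apoptosis" is supported by two co-expressed mRNAs and would be the strongest putative function for lnc1, while "transport" is not assigned because g3 is not directly co-expressed with it.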

6.3 Condition-specific expression

Numerous lncRNAs show specific temporal and spatial expression patterns, which can direct us towards the biological context in which they are acting. Multiple algorithms are available for the detection of condition-specific expression, which can be used in place of differential expression testing when a larger number of conditions is involved. Compared to protein-coding genes, transcribed lncRNAs tend to have higher expression variability within the same condition, which can complicate annotation. Low variability may be used as a potential indicator of a transcript's role in normal cell functions, whereas high variability may indicate environment- and disease-related function.

6.4 Combining several computational methods

However, a single approach cannot detect all aspects of the functional characteristics of a gene. All these approaches can be combined to get a more complete picture of lncRNA function. Combining several computational methods that are complementary to each other is an effective approach to maximize research findings and effectively deploy laboratory resources (Signal B, et al. Trends Genet. 32, 620-637).

@ ZJU