MathJax v3 with TeX input and HTML output Font Awesome 图标

Table of Contents

1. User Manual Requirements

1.1 The Compatibility of Browser and Operating System (OS)

1.2 Required Formats of the Input Files

2. Step-by-step Instruction on the Usage of intRNA

2.1 Steps for using Module 'Analysis'

2.1.1 Uploading RNA Data

2.1.2 Selecting Feature Type

2.1.3 Coding Result Analysis

3. RNA Encoding Methods (D-features)

3.1 Sequence-Intrinsic Features

3.1.1 Codon related

3.1.1.1 Fickett Score

3.1.1.2 Stop codon related Features

3.1.2 Open reading frame

3.1.2.1 Basic ORF Features

3.1.2.2 Entropy density profiles of ORF

3.1.2.3 Measurement of Hexamer on ORF

3.1.3 Guanine-cytosine related

3.1.4 Transcript K-mers content

3.1.5 Global transcript sequence descriptors

3.1.6 Entropy Density Profiles on Transcript

3.2 Physico-chemical Features

3.2.1 Pseudo protein related

3.2.2 Nucleotide related

3.2.2.1 Autocorrelation of dinucleotide

3.2.2.2 Pseudo dinucleotide composition

3.2.3 EIIP based spectrum

3.3 Structure-based Features

3.3.1 Multi-scale Secondary Structure Information

4. RNA Encoding Methods (G-features)

4.1 G-gap Features


1. User Manual Requirements

1.1 The Compatibility of Browser and Operating System (OS)

It is free and open to all users with no login requirement and can be readily accessed by a variety of popular web browsers and operating systems as shown below.

Table 1 The Compatibility of Browser and Operating System

OS

Chrome

Firefox

Edge

Safari

Linux(Ubuntu-17.04)

v78.0.3904.108

v52.0.1

n/a

n/a

MacOS(v10.1)

v78

v71

n/a

v8

Window(v10)

v78.0.3904.108

v70.0.1

V44.18362.449.0

n/a

1.2 Required Formats of the Input Files

In general, the files required at the beginning of intRNA analysis should be the FASTA files of RNA.

2. Step-by-step Instruction on the Usage of intRNA

2.1 Steps for using Module ‘Analysis’

2.1.1 Uploading RNA Data

By clicking on 'Upload Your RNA Data', users can upload one RNA fasta file. Firstly, click the button on the left of the 'No file selected' to select the file with the specified path. Secondly, click the 'Upload' button after the selection to upload the local file. After uploading files, the number of four bases of RNA sequence distribution can be visualized. Thirdly, click 'Load data' button for further analysis steps.

By clicking on 'Upload Sample RNA Data', users can directly use the sample provided on the server for analysis. The right figure visualizes the four bases of RNA sequence count distribution for the sample. Sample data of this module can be downloaded on the analysis interface. Click 'Load data' for further analysis steps.

2.1.2 RNA Image

There are 2 encoding feature groups: D-features are divided into 3 categories (sequence-intrinsic features, physicochemical features and structure-based features) and G-features are divided into 2 categories (K-mer features and G-gap features), which are integrated into intRNA for RNA. The specific description of the feature are presented on the right side. Users can click 'Calculate' to encode RNA into RNAimage.

The process of RNAimage generation will spend about 2 minutes. Please wait patiently.

The calculated results will be displayed on the right side. Users can download encoding features in csv, tsv, and npy formats. The RNA template images and an example RNA image of D-features and G-features are presented in the right side. Subsequently, click 'Next' to further step.

2.1.3 Classification Result

In this section, users can classify the RNA based on the RNA image features obtained from the previous encoding step. As for the classifier, we use deep learning-based classification model. The RNA will be classified to one category. The detailed description is provided in the right side. Then click 'Submit' button to show the final results.

The result of RNA classification will presented in the right side. The content includes RNA sequence id, RNA sequence length, the RNA category and RNA sequence.

3. RNA Encoding Methods (D-features)

3.1 Sequence-intrinsic Features

3.1.1 Codon related

Codon is a group of nucleotide triplets which specifies a single amino acid. It’s the set of rules used to construct genetic information. Codons are important indicators for transcripts coding into proteins (Shuai Liu, et al. Genes (Basel). 10: 672, 2019). Codon characteristics are fundamental sequence features and closely related to RNA coding potential.

3.1.1.1 Fickett Score

Fickett score (FickS) is a simple linguistic feature that distinguishes protein-coding RNA (pcRNA) and non-coding RNA (ncRNA) according to the combinational effect of nucleotide composition and codon usage bias, first proposed by Fickett (J W Fickett, et al. Nucleic Acids Res. 10: 5303-18, 1982). It’s for RNA coding potential assessing independent of the ORF, and when the length of the test RNA region is ≥200nt, Fickett score alone can achieve 94% sensitivity and 97% specificity (Liguo Wang, et al. Nucleic Acids Res. 41: e74, 2013).
Calculation:
Fickett score is obtained by computing four position values and four composition values (base content) related coding potential from the transcript.
Position value reflects the degree to which each base is favored in one codon position versus another. For example, position value of A(Aposition):

\[ A_1=\text{Number of As in position 0,3,6,…,3n} \]

\[ A_2=\text{Number of As in position 1,4,7,…,3n+1} \]

\[ A_3=\text{Number of As in position 0,3,6,…,3n+2} \]

$$ A_{position}=\frac{\max{(A_1,A_2,A_3)}}{\min{(A_1,A_2,A_3)}+1} $$ Position value of Tposition, Cposition, Gposition are determined in the same way.
Composition values (base content) of each base (Acontent, Tcontent, Ccontent, Gcontent) is easy to calculate.
Then these eight values can be converted into a probabilities (p) of coding table, which quantified from coding potential of real sequences of known function by Fickett (J W Fickett, et al. Nucleic Acids Res. 10: 5303-18, 1982).

Table 2 Characteristic Parameters of Coding and Noncoding Sequence

Position Parameter

Probability of Coding

A

C

G

T

0.0~1.1

0.22

0.23

0.08

0.09

1.1~1.2

0.30

0.30

0.08

0.09

1.2~1.3

0.34

0.33

0.16

0.20

1.3~1.4

0.45

0.51

0.27

0.54

1.4~1.5

0.68

0.48

0.48

0.44

1.5~1.6

0.58

0.66

0.53

0.69

1.6~1.7

0.93

0.81

0.64

0.68

1.7~1.8

0.84

0.70

0.74

0.91

1.8~1.9

0.68

0.70

0.88

0.97

1.9~2.0+

0.94

0.80

0.90

0.97

Content Parameter

Probability of Coding

A

C

G

T

0.0~0.17

0.21

0.31

0.29

0.58

0.17~0.19

0.81

0.39

0.33

0.51

0.19~0.21

0.65

0.44

0.41

0.69

0.21~0.23

0.67

0.43

0.41

0.56

0.23~0.25

0.49

0.59

0.73

0.75

0.25~0.27

0.62

0.59

0.64

0.55

0.27~0.29

0.55

0.64

0.64

0.40

0.29~0.31

0.44

0.51

0.47

0.39

0.31~0.33

0.49

0.64

0.54

0.24

0.33~0.99

0.28

0.82

0.40

0.28

Then Each probability is multiplied by a weight $w$ for the respective base, where the value of $w$ reflects the percentage of each parameter alone successfully predictions for the sequence of known function, also reported by Fickett (J W Fickett, et al. Nucleic Acids Res. 10: 5303-18, 1982) in weights $w$ of parameters table.

Table 3 Weight to be Given to the Individual Parameters

Methods

Position

Cotent

A

0.26

0.11

C

0.18

0.12

G

0.31

0.15

T

0.33

0.14

Finally, the Fickett score is calculated as follows:

\[ \text{Fickett Score}=\sum_{i=1}^8p_iw_i \]

3.1.1.2 Stop codon related Features

A stop codon signals the termination of the current protein translation process (Martin Buncek, et al. Biologia Plantarum. 45:50-50, 2002).
Under standard conditions, there are three kinds of stop codons: UAG called amber, UGA called opal (sometimes also called umber), and UAA called ochre. While the start codon alone is not sufficient to begin the translation, usually need nearby sequences or initiation factors, stop codons alone are sufficient to initiate termination (Christian Touriol, et al. Biol Cell. 95:169-78, 2003). So it’s an important feature for RNA encoding.
Here we consider four types of STOP Codon-related Features.

Table 4 STOP Codon-related Feature

Methods

Description

Stop Codon Count (SCC)

the number of STOP Codon in the transcript

Stop Codon Frequency (SCF)

STOP Codon Frequency in the transcript

Stop Codon Frame score (SCFs)

the variance of stop codon count among three reading frames

Stop Codon Frequency Frame score (SCFFs)

the variance of stop codon frequency among three reading frames

3.1.2 Open reading frame

Open reading frame (ORF) is a continuous stretch of codons, begins with a start codon (for RNA, usually AUG) and ends at a stop codon (for RNA, usually UAA, UAG or UGA), which has the ability to be translated.
The start-stop definition of an ORF only applies to mature mRNAs, not genomic DNA, since introns may contain stop codons and/or cause shifts between reading frames. Such an ORF corresponds to parts of a gene rather than the complete gene.
Since DNA/RNA is interpreted in groups of three-nucleotide codons, a DNA/RNA strand has three distinct reading frames. Meanwhile, the double helix of a DNA has two anti-parallel strands. With the two strands having three reading frames each, there are six possible frame translations for one DNA (W R Pearson, et al. Genomics. 46: 24-36, 1997).
Cheng Yang et al (Cheng Yang, et al. Bioinformatics. 34: 3825-3834, 2018) concluded that Open reading frame (ORF) is one of the most widely used and fundamental criteria to evaluate the coding potential of a sequence and distinguish an lncRNA from a coding transcript.

3.1.2.1 Basic ORF Features

It’s reported that pcRNAs often contain longer ORFs than ncRNAs, and long ORFs are unlikely to be observed by random chance in ncRNAs, thus long ORFs are often one of the ideal and fundamental evidences to initially identify candidate protein-coding regions (Sen Yang, et al. Front Genet. doi: 10.3389/fgene.2020.00090. eCollection 2020, 2020; Jinfeng Liu, et al. PLoS Genet. 2: e29, 2006). ORF length as a feature is frugal, but it has high concordance with more sophisticated discrimination methods and remains the primary criterion in almost all coding-potential prediction methods (G S Zubenko, et al. J Neuropathol Exp Neurol. 49: 206-14, 1990; Liguo Wang, et al. Nucleic Acids Res. 41: e74, 2013; Lei Kong, et al. Nucleic Acids Res. 35: W345-9, 2007).
Here, not only restrict to ORF length, we consult to CPPred (Xiaoxue Tong, et al. Nucleic Acids Res. 47: e43, 2019) and introduce other ORF-based descriptions. For start codon setting, although the use of non-AUG has been constantly described (Nicholas T Ingolia, et al. Cell. 147: 789-802, 2011; Yafeng Zhu, et al. Nat Commun. 9: 903, 2018), AUG is still selected.
We use five types of Open Reading Frame Features (ORFF):

Table 5 Five types of Open Reading Frame Features (ORFF)

Methods

Description

Longest ORF length

the length of the longest ORF

First ORF length

the length of the first ORF

ORF coverage

the Ratio of the longest ORF length and transcript length

ORF integrity

whether the longest ORF starts with a start codon and ends with a stop codon

ORF frame score

the variance of ORF length among ORFs

3.1.2.2 Entropy density profiles on ORF

The entropy density profile (EDP) model is a global statistical description for a nucleotide sequence. It employs a Shannon's artificial linguistic description for a nucleotide sequence of finite length like an ORF or a transcript (Yongchu Liu, et al. BMC Bioinformatics. 14 Suppl 5(Suppl 5):S12, 2013).The EDP model has been proved to be successful and validates the hypothesis that the pcRNAs distribute separately from the ncRNAs in the EDP phase space, which may be caused by different selection pressures during the evolution (Huaiqiu Zhu, et al. BMC Bioinformatics. 8: 97, 2007).
Calculation:
Here, we employ EDP model on ORFs identified from input sequence. For the EDP based on k-mer frequency, we preset k-mer as a 3-mer (codon). And the EDP {si} of an ORF is defined as:

\[ s_i=-\frac{1}{H}c_i\log c_i \]

where $c_i$ is the abundance of the ith codon obtained by counting the number of it in the ORF sequence divided by the total number of codons, $i=1,2,\ldots,61$ represents the index of the 61 codons (excluding 3 stop codons), and $H=-\sum_{i=1}^{61}c_i\log c_i$ is the Shannon entropy.
Finally, we transform these s_i of 61 codons into eigenvalues of twenty corresponding native amino acids, and respectively named as:

Table 6 Entropy density profiles on ORF

Entropy density A

Entropy density C

Entropy density D

Entropy density E

on ORF (ORFEA)

on ORF (ORFEC)

on ORF (ORFED)

on ORF (ORFEE)

Entropy density F

Entropy density G

Entropy density H

Entropy density I

on ORF (ORFEF)

on ORF (ORFEG)

on ORF (ORFEH)

on ORF (ORFEI)

Entropy density K

Entropy density L

Entropy density M

Entropy density N

on ORF (ORFEK)

on ORF (ORFEL)

on ORF (ORFEM)

on ORF (ORFEN)

Entropy density P

Entropy density Q

Entropy density R

Entropy density S

on ORF (ORFEP)

on ORF (ORFEQ)

on ORF (ORFER)

on ORF (ORFES)

Entropy density T

Entropy density V

Entropy density W

Entropy density Y

on ORF (ORFET)

on ORF (ORFEV)

on ORF (ORFEW)

on ORF (ORFEY)

3.1.2.3 Measurement of Hexamer on ORF

Some biologically important oligomers (such as Hexamers) consist of many important macromolecules. For instance, hemoglobin is a protein tetramer. An oligomer of amino acids is called an oligopeptide. An oligonucleotide is a short single-stranded fragment of nucleic acid such as DNA or RNA. Many researches (J M Claverie, et al. Nucleic Acids Res. 14: 179-96, 1986; J M Claverie, et al. Methods Enzymol. 183: 237-52, 1990), revealed that some model for hexamers measure in frames can reflect to sequences differences (J W Fickett, et al. Nucleic Acids Res. 20: 6441-50, 1992).
Hexamer score on ORF (ORFHS), which means Hexamer usage bias on ORF. Hexamer usage bias (Hexamer score) is an essential feature because of the dependence between adjacent amino acids in pseudo proteins. Hexamer score is calculated based on in-frame hexamer frequency of protein-coding and non-coding RNA (J W Fickett, et al. Nucleic Acids Res. 20: 6441-50, 1992).
Calculation:
Here we consult to CPAT (Liguo Wang, et al. Nucleic Acids Res. 41: e74, 2013) and NCResNet (Sen Yang, et al. Front Genet. doi: 10.3389/fgene.2020.00090. eCollection 2020, 2020).CPAT used a log-likelihood ratio to measure differential hexamer usage between coding and noncoding sequences. For a given sequence, we identify its ORF and calculated the probability under the model of pcRNA and ncRNA trained by CPAT, and then we took the logarithm of the ratio of these probabilities as the score of coding potential. We used $F(H_i)(i=0,1,\ldots,4^6-1)$ and$F'(H_i)(i=0,1,\ldots,4^6-1)$to represent in-frame hexamer frequency. For a given hexamer sequence $S=H_1,H_2,…,H_m$

\[ \text{Hexamer Score}=\frac{1}{m}\sum_{i=1}^{m}\log(\frac{F(H_i)}{F'(H_i)}) \]

Hexamer score determines the relative degree of hexamer usage bias in a particular sequence. Positive values indicate a coding sequence, whereas negative values indicate a noncoding sequence (Shuai Liu, et al. Genes (Basel). 10: 672, 2019).
However, Hexamer score only computes the average hexamer frequencies of the training data sets. For any unknown sequence, the hexamer patterns are scanned, but their frequencies are abandoned. The main idea underlying the proposed measurements is to estimate the unevaluated sequence is ‘close to’ ncRNA or pcRNA. Here, we consult to LncFinder (Siyu Han, et al. Brief Bioinform. 20: 2009-2027, 2019) and propose the following two new measurements to quantify the usage bias of hexamer on ORF, Euclidean distance and Logarithm distance, and get six eigenvalues:Hexamer logarithm distance to lncRNA ORF (ORLoL); Hexamer logarithm distance to pcRNA ORF (ORLoP); Hexamer logarithm distance ORF ratio (ORLoR); Hexamer euclidean distance to lncRNA ORF (OREuL); Hexamer euclidean distance to pcRNA ORF (OREuP); Hexamer euclidean distance ORF ratio (OREuR).
Calculation:

\[ ORLoL=\frac{1}{n}\sum{In\frac{freq.seq(i)}{freq.lnc(i)}},i=1,2,3,\ldots,4^k \]

\[ ORLoP=\frac{1}{n}\sum{In\frac{freq.seq(i)}{freq.pc(i)}},i=1,2,3,\ldots,4^k \]

\[ ORLoR=\frac{\log{Dist.LNC}}{\log{Dist.PC}} \]

\[ OREuL=\sqrt{\sum(freq.seq(i)-freq.lnc(i))^2},i=1,2,3,\ldots,4^k \]

\[ OREuP=\sqrt{\sum(freq.seq(i)-freq.pct(i))^2},i=1,2,3,\ldots,4^k \]

\[ OREuR=\frac{EucDist.LNC}{EucDist.PC} \]

where freq.seq are the k-adjoining base(s) frequencies of one unevaluated sequence; freq.lnc denotes the average frequencies of lncRNAs’ k-adjoining base(s) ; freq.pct denotes the average frequencies of pcRNAs’ k-adjoining base(s); i denotes the different types of k-adjoining base(s), and n is the total number of the k-adjoining base(s) in one sequence. Based on first two equations, OREuL and logORLoP can be computed. EucDist.PC and logDist.PC can be obtained similarly.

3.1.3 Guanine-cytosine related

As we all know, there are four kinds of nitrogenous bases, Adenine (A), Guanine (G), Cytosine (C), and Thymine (T), which bind to sugar and phosphate to form four kinds of nucleotides for DNA. In its double-stranded helical structure, the bases on one strand must pair up with bases on another strand, Adenine pair up with Thymine (A/T or T/A) and Cytosine pair up with Guanine (C/G or G/C). This principle of arrangement is called Complementary Base Pairing (James D Watson, et al. Clin Orthop Relat Res. 462: 3-5, 2007). For RNAs, Uracil (U) equals to Thymine (T).
Quantitatively, each Guanine/Cytosine (G/C) base pair is held by three hydrogen bonds, while Adenine/Thymine (A/T) and Adenine/ Uracil (A/U) base pairs are held by two hydrogen bonds. For a long nucleotide sequence, interactions of base stacking have obvious influence on molecular stability (Peter Yakovchuk, et al. Nucleic Acids Res. 34: 564-74, 2006). So, it’s easy to understand that nucleic acid with low GC content (the percentage of nitrogenous bases guanine and cytosine on a nucleic acid molecule) is less stable than nucleic acid with high GC-content.
For RNAs, Pozzoli U et al. pointed that the length of the coding sequence (CDS) is directly proportional to higher GC content (Uberto Pozzoli, et al. BMC Evol Biol. 27; 8:99, 2008). And it has been pointed the fact that the stop codon has a bias towards A and T nucleotides, and, thus, the shorter the sequence the higher the AT content (J D Wuitschick, et al. J Eukaryot Microbiol. 46: 239-47, 1999). Also, it has been pointed that there is a strong correlation between the GC-content of structural RNAs and the optimal growth of prokaryotes at higher temperatures, such as ribosomal RNA, transfer RNA, and many other non-coding RNAs (L D Hurst, et al. Proc Biol Sci. 268:493-7, 2001). Shuai Liu et al. pointed that GC content may play an important role in the prediction of RNA coding potential (Shuai Liu, et al. Genes (Basel). 10: 672, 2019).
Calculation
The percentage of G and C on nucleotide strand can be used to calculate the GC content of the sequence and GC content in the first, second and third position of codons.

\[ \frac{G+C}{A+T+G+C}\times100\% \]

The variance of them among three reading frames can also be considered.

Table 7 Three reading frames and their variances

Methods

Description

GC content

GC content of the transcript

GC1

GC content in the first position of codons

GC1 variance

The Variance of GC1 among three reading frames

GC2

GC content in the second position of codons

GC2 variance

The Variance of GC2 among three reading frames

GC3

GC content in the third position of codons

GC3 variance

The Variance of GC3 among three reading frames

3.1.4 Transcript K-mers content

K-mer means dividing a sequence into substrings containing k bases. If the length of sequence is L and the length of k-mer is k, the number of k-mers generated is L-k+1, and the number of kinds of k-mers generated is 4k.
K-mer is a simple approach to represent nucleotide sequences, which are represented as the occurrence frequencies of k neighboring nucleic acids. This approach has been successfully applied to human gene regulatory sequence prediction (William Stafford Noble, et al. Bioinformatics. Suppl 1: i338-43, 2005), enhancer identification (, Dongwon Lee, et al. Genome Res. 21: 2167-80, 2011), etc.
Here, we consult to Jessime M Kirk et al.’s research (Jessime M Kirk, et al. Nat Genet. 50: 1474-1482, 2018) , and count each kind of mer as a 4kD feature (preset k=3 (codon)),respectively named as:

Table 9 Transcript k-mers content (preset k=3)

Transcript k-mer AAA content (KMAAA)

Transcript k-mer AAG content (KMAAG)

Transcript k-mer AAT content (KMAAT)

Transcript k-mer AAC content (KMAAC)

Transcript k-mer AGA content (KMAGA)

Transcript k-mer AGG content (KMAGG)

Transcript k-mer AGT content (KMAGT)

Transcript k-mer AGC content (KMAGC)

Transcript k-mer ATA content (KMATA)

Transcript k-mer ATG content (KMATG)

Transcript k-mer ATT content (KMATT)

Transcript k-mer ATC content (KMATC)

Transcript k-mer ACA content (KMACA)

Transcript k-mer ACG content (KMACG)

Transcript k-mer ACT content (KMACT)

Transcript k-mer ACC content (KMACC)

Transcript k-mer GAA content (KMGAA)

Transcript k-mer GAG content (KMGAG

Transcript k-mer GAT content (KMGAT)

Transcript k-mer GAC content (KMGAC)

Transcript k-mer GGA content (KMGGA)

Transcript k-mer GGG content (KMGGG)

Transcript k-mer GGT content (KMGGT)

Transcript k-mer GGC content (KMGGC)

Transcript k-mer GTA content (KMGTA)

Transcript k-mer GTG content (KMGTG)

Transcript k-mer GTT content (KMGTT)

Transcript k-mer GTC content (KMGTC)

Transcript k-mer GCA content (KMGCA)

Transcript k-mer GCG content (KMGCG)

Transcript k-mer GCT content (KMGCT)

Transcript k-mer GCC content (KMGCC)

Transcript k-mer TAA content (KMTAA)

Transcript k-mer TAG content (KMTAG)

Transcript k-mer TAT content (KMTAT)

Transcript k-mer TAC content (KMTAC)

Transcript k-mer TGA content (KMTGA)

Transcript k-mer TGG content (KMTGG)

Transcript k-mer TGT content (KMTGT)

Transcript k-mer TGC content (KMTGC)

Transcript k-mer TTA content (KMTTA)

Transcript k-mer TTG content (KMTTG)

Transcript k-mer TTT content (KMTTT)

Transcript k-mer TTC content (KMTTC)

Transcript k-mer TCA content (KMTCA)

Transcript k-mer TCG content (KMTCG)

Transcript k-mer TCT content (KMTCT)

Transcript k-mer TCC content (KMTCC)

Transcript k-mer CAA content (KMCAA)

Transcript k-mer CAG content (KMCAG)

Transcript k-mer CAT content (KMCAT)

Transcript k-mer CAC content (KMCAC)

Transcript k-mer CGA content (KMCGA)

Transcript k-mer CGG content (KMCGG)

Transcript k-mer CGT content (KMCGT)

Transcript k-mer CGC content (KMCGC)

Transcript k-mer CTA content (KMCTA)

Transcript k-mer CTG content (KMCTG)

Transcript k-mer CTT content (KMCTT)

Transcript k-mer CTC content (KMCTC)

Transcript k-mer CCA content (KMCCA)

Transcript k-mer CCG content (KMCCG)

Transcript k-mer CCT content (KMCCT)

Transcript k-mer CCC content (KMCCC)

3.1.5 Global transcript sequence descriptors

CTD encoding is a computing strategy for sequence representation such as nucleotide sequences, amino acid sequence and so on (L Y Han, et al. Nucleic Acids Res. 32:6437-44, 2004). CTD features (composition, transition, and distribution) captured can convert a sequence into a digital feature vector, based on the characteristics including basic sequence information, physicochemical properties or other appropriate features (Xiaoxue Tong, et al. Nucleic Acids Res. 47: e43, 2019; Feng Zhu, et al. J Pharmacol Exp Ther. 330:304-15, 2009).
Global transcript sequence descriptors denote the global transcript sequence descriptions using CTD encoding, including nucleotide composition, nucleotide transition and nucleotide distribution. The first descriptor nucleotide composition describes the percent composition of each nucleotide in a transcript sequence. The second descriptor nucleotide transition describes the percent frequency with conversion of four nucleotides between adjacent positions. The third descriptor nucleotide distribution describes five relative positions along the transcript sequence of each nucleotide, calculated with the 0 (first one), 25%, 50%, 75% and 100% (last one) (Xiaoxue Tong, et al. Nucleic Acids Res. 47: e43, 2019).
Take one nucleotide sequence as an example, ACTTGCAGCCCCCCGCCTGTCCCGAGCCGCGCGGGCGCCAGCTCAGTTTGTCCGCGGCGG, with 5 adenines (As), 9 thymines (Ts), 20 guanines (Gs) and 26 cytidines (Cs) (Xiaoxue Tong, et al. Nucleic Acids Res. 47: e43, 2019).
First descriptor composition C are four features respectively:
$$Composition of A (ComRA) = 5/60 = 0.083;$$ $$Composition of T (ComRT) = 9/60 = 0.15;$$ $$Composition of G (ComRG) = 20/60 = 0.33;$$ $$Composition of C (ComRC) = 26/60 = 0.43$$ Second descriptor transition T are six features respectively, there is zero transition between A and T, five transitions between A and G, four transitions between A and C, five transitions between T and G, six transitions between T and C and twenty transitions between G and C: $$Transition between A and T (TraAT) = 0/59 = 0.00;$$ $$Transition between A and G (TraAG) = 5/59 = 0.085;$$ $$Transition between A and C (TraAC) = 4/59 = 0.068;$$ $$Transition between T and G (TraTG) = 5/59 = 0.085;$$ $$Transition between T and C (TraTC) = 6/59 = 0.10;$$ $$Transition between G and C (TraGC) = 20/59 = 0.34.$$ Third descriptor distribution D are 20 features respectively, The first, 25%, 50%, 75% and 100% of As are located within 1, 1, 25, 40 and 45 residues: $$Distribution of 0.00A (DPRA0) = 1/60 = 0.017;$$ $$Distribution of 0.25A (DPRA1) = 1/60 = 0.017;$$ $$Distribution of 0.50A (DPRA2) = 25/60 = 0.42;$$ $$Distribution of 0.75A (DPRA3) = 40/60 = 0.67;$$ $$Distribution of 1.00A (DPRA4) = 45/60 = 0.75.$$ Likewise, the D descriptors for Ts, Gs and Cs are: $$Distribution of 0.00T (DPRT0) = 0.05;$$ $$Distribution of 0.25T (DPRT1) = 0.067;$$ $$Distribution of 0.50T (DPRT2) = 0.72;$$ $$Distribution of 0.75T (DPRT3) = 0.80;$$ $$Distribution of 1.00T (DPRT4) = 0.85;$$ $$Distribution of 0.00G (DPRG0) = 0.083;$$ $$Distribution of 0.25G (DPRG1) = 0.40;$$ $$Distribution of 0.50G (DPRG2) = 0.57;$$ $$Distribution of 0.75G (DPRG3) = 0.83;$$ $$Distribution of 1.00G (DPRG4) = 1.00;$$ $$Distribution of 0.00C (DPRC0) = 0.033;$$ $$Distribution of 0.25C (DPRC1) = 0.22;$$ $$Distribution of 0.50C (DPRC2) = 0.38;$$ $$Distribution of 0.75C (DPRC3) = 0.65;$$ $$Distribution of 1.00C (DPRC4) = 0.97;$$

3.1.6 Entropy density profiles on transcript

Similar to the Entropy density profiles on ORF, we also consider Entropy density profiles on transcript, which can be calculated by analogy.

Table 10 Entropy density profiles on transcript

Entropy density A on transcript (TraEA)

Entropy density C on transcript (TraEC)

Entropy density D on transcript (TraED)

Entropy density E on transcript (TraEE)

Entropy density F on transcript (TraEF)

Entropy density G on transcript (TraEG)

Entropy density H on transcript (TraEH)

Entropy density I on transcript (TraEI)

Entropy density K on transcript (TraEK)

Entropy density L on transcript (TraEL)

Entropy density M on transcript (TraEM)

Entropy density N on transcript (TraEN)

Entropy density P on transcript (TraEP)

Entropy density Q on transcript (TraEQ)

Entropy density R on transcript (TraER)

Entropy density S on transcript (TraES)

Entropy density T on transcript (TraET)

Entropy density V on transcript (TraEV)

Entropy density W on transcript (TraEW)

Entropy density Y on transcript (TraEY)

3.2. Physico-chemical Features

3.2.1 Pseudo protein related (PPF)

NcRNAs cannot be translated to produce true proteins, so it’s easy to understand that the pseudo protein translated by ncRNA doesn’t have true protein sequence and physicochemical features (Peter J A Cock, et al. Front Genet. 11: 90, 2020). Pseudo protein related physicochemical features are reliable for ncRNA description.
Based on the understanding above, we select and calculate five related protein characters as features by Biopython (Sen Yang, et al. Bioinformatics. 25: 1422-3, 2009):

Table 11 Five related protein characters as features

Methods

Description

Pseudo protein molecular weight (ProMW)

The molecular weight of the predicted peptide

Pseudo protein isoelectric point (ProPI)

The theoretical isoelectric point of the predicted peptide

Pseudo protein PI-MW frame score (PPMFS)

The pI/Mw shows the log 10 transformed ratio of pI and Mw, and it is also applied to the ORFs as pI/Mw frame score

Pseudo protein average hydropathy (ProAH)

The grand average of hydropathicity of predicted peptide

Pseudo protein instability index (ProII)

The stability of predicted peptide

3.2.2 Nucleotide related

Here, we consult to Bin Liu et al.’s research (Bin Liu, et al. Bioinformatics. 31: 1307-9, 2015) and use two kinds of model, Autocorrelation and pseudo nucleic acid composition, to formulate transcript with dinucleotide physicochemical features. It’s calculated by their repDNA python package. The Table 12 include the 38 dinucleotide physicochemical features considered in the calculation below (Bin Liu, et al. Bioinformatics. 31: 1307-9, 2015).

Table 12 The names of the 35 physicochemical indices for dinucleotides

BST:Base stacking

PID:Protein induced deformability

BDT:B-DNA twist

DGC: Dinucleotide GC Content

APH:A-philicity

PTW:Propeller twist

DST:Duplex stability:(freeenergy)

DTB: Duplex tability (disrupt energy)

DDT: DNA denaturation

BDS: Bending stiffness

PDT: Protein DNA twist

SEZ: Stabilising energy of Z-DNA

ABA: Aida_BA_transition

BDG: Breslauer_dG

Helix-BDH: Breslauer_dH

BLS: Breslauer_dS

ETI: Electron_interaction

HTF: Hartman_trans_free_energy

HCT: Helix-Coil_transition

IBA: Ivanov_BA_transition

LBZ: Lisser_BZ_transition

PIA: Polar_interaction

SDG: SantaLucia_dG

SDH: SantaLucia_dH

SDS: SantaLucia_dS

SFB: Sarai_flexibility energy of Z-DNA

STB: Stability

SKE: Stacking_energy

SMG: Sugimoto_dG

SMH: Sugimoto_dH

SMS: Sugimoto_dS

WCI: Watson-Crick_interaction

TWI: Twist

TIL: Tilt

ROL: Roll

SHI: Shift

SLI: Slide

RIS: Rise

 

3.2.2.1 Autocorrelation of dinucleotide

Autocorrelation is a multivariate modeling tools, which can transform different length of nucleotide sequences into fixed-length vectors by measuring the correlation between any two properties. Autocorrelation results in two kinds of variables: autocorrelation (AC) between the same property, and cross-covariance (CC) between two different properties (Bin Liu, et al. Bioinformatics. 31: 1307-9, 2015).
Dinucleotide auto covariance (Bin Liu, et al. Bioinformatics. 31: 1307-9, 2015).
Dinucleotide auto covariance (DAC) incorporates the correlation of the same property between two dinucleotides.
Suppose the nucleotide sequence D with L nucleic acid residues:

\[ D=R_1R_2R_3R_4R_5R_6R_7\ldots R_L \]

$R_1$ represents the nucleic acid residue at the sequence position 1, $R_2$ represents the nucleic acid residue at position 2 and so forth.
The DAC measures the correlation of the same physicochemical index between two dinucleotides separated by a distance of lag along the sequence, which can be calculated as:

\[ DAC(u,lag)=\sum_{i=1}^{L-lag-1} (P_u(R_iR_{i+1})-\overline{P_u})(P_u(R_{i+lag}R_{i+lag+1})-\overline{P_u})/(L-lag-1) \]

where $u$ is a physicochemical index considered that are listed in the Table 12, $L$ is the length of the sequence, $P_u(R_iR_{i+1})$ means the numerical value of the physicochemical index $u$ for the dinucleotide $R_iR_{i+1}$ at position $i$, $\overline{P_u}$ is the average value for physicochemical index $u$ along the whole sequence:

\[ \overline{P_u}=\sum_{j=1}^{L-1}P_u (R_j R_{j+1} ) /(L-1) \]

In such a way, there are N*LAG of DAC feature values, where N is the number of physicochemical indices and LAG is the maximum of lag (lag=1,2,…,LAG). Here, N = 38 and LAG is preset to 2. Thus, the feature vector is Dinucleotide auto covariance of lag1 on BST (ABST1) ~ Dinucleotide auto covariance of lag2 on RIS (ARIS2), 78 values totally. The complete short names are summarized in the following Table 13.
Table 13 Dinucleotide auto covariance physicochemical features

ABST1

APID1

ABDT1

ADGC1

AAPH1

APTW1

ADST1

ADTB1

ADDT1

ABDS1

APDT1

ASEZ1

AABA1

ABDG1

ABDH1

ABLS1

AETI1

AHTF1

AHCT1

AIBA1

ALBZ1

APIA1

ASDG1

ASDH1

ASDS1

ASFB1

ASTB1

ASKE1

ASMG1

ASMH1

ASMS1

AWCI1

ATWI1

ATIL1

AROL1

AROL1

ASHI1

ASLI1

ARIS1

ABST2

APID2

ABDT2

ADGC2

AAPH2

APTW2

ADST2

ADTB2

ADDT2

ABDS2

APDT2

ASEZ2

AABA2

ABDG2

ABDH2

ABLS2

AETI2

AHTF2

AHCT2

AIBA2

ALBZ2

APIA2

ASDG2

ASDH2

ASDS2

ASFB2

ASTB2

ASKE2

ASMG2

ASMH2

ASMS2

AWCI2

ATWI2

ATIL2

AROL2

ASHI2

ASLI2

ARIS2

 

This DAC approach is similar as the approach used for protein fold recognition (Qiwen Dong, et al. Bioinformatics. 25: 2655-62, 2009).
Dinucleotide-based cross covariance (Bin Liu, et al. Bioinformatics. 31:1307-9, 2015)
Given a nucleotide sequence D (analogy to above), the DCC measures the correlation of two different physicochemical indices between two dinucleotides separated by lag nucleic acids along the sequence, which can be calculated by:

\[ DCC(u_1,u_2,lag)=\sum_{i=1}^{L-lag-1}(P_{u_1}(R_iR_{i+1})-\overline{P_{u_1}})(P_{u_2}(R_{i+lag}R_{i+lag+1})-\overline{P_{u_2}})/(L-lag-1)) \]

where $u_1$, $u_2$ are two different physicochemical indices considered that are listed in the Table 12, $L$ is the length of the sequence, $P_{u_1}(R_iR_{i+1})$($P_{u_2}(R_iR_{i+1})$) is the numerical value of the physicochemical index $u_1$ ($u_2$) for the dinucleotide $R_iR_{i+1}$ at position $i$, $\overline{P_{u_1}}$ ($\overline{P_{u_2}}$) is the average value for physicochemical index value $u_1$, $u_2$ along the whole sequence:

\[ \overline{P_u}=\sum_{j=1}^{L-1}P_u (R_j R_{j+1} ) /(L-1) \]

In such a way, the length of the DCC feature vector is N*(N-1)*LAG, where N is the number of physicochemical indices and LAG is the maximum of $lag$ ($lag$=1,2,…,LAG).
Here, we pick out N=6 representative physicochemical features for DCC calculation, including PID, BDS, ETI, SKE, WCI, SLI, PID, BDS, ETI, SKE, WCI, SLI, and LAG is preset to 2. Thus, the feature vector is Cross covariance of lag1 on PID-BDS (CC101) ~ Cross covariance of lag2 on SLI-WCI (CC230), 60 values totally.
This DCC approach is similar as the approach used for protein fold recognition (Qiwen Dong, et al. Bioinformatics. 25: 2655-62, 2009).

3.2.2.2 Pseudo dinucleotide composition

Pseudo nucleic acid composition (PseNAC) is a kind of powerful approaches to represent the nucleotide sequence considering both local sequence-order information and long range or global sequence-order effects. Concretely here, as for Pseudo dinucleotide composition (PseDNC), it incorporates the contiguous local sequence-order information and the global sequence-order information into the feature vector of the nucleotide sequence. Here are two types of PseNAC considered (Bin Liu, et al. Bioinformatics. 31: 1307-9, 2015).
Parallel correlation pseudo dinucleotide composition (PC-PseDNC)
Parallel correlation PseDN composition (PC-PseDNC) can be calculated not only based on 38 built-in physiochemical properties (Table 13), but also other indices defined by users.
Given a nucleotide sequence D, the PC-PseDNC feature vector of D is defined:

\[ D=[d_1 d_2 \ldots d_{16}d_{16+1} \ldots d_{16+λ}]^T \]

where:

\[ d_k= \begin{cases} f_k\over\sum_{i=1}^{16}f_i +w\sum_{j+1}^\lambda \theta_j& (1\leq k \leq 16)\\\\ w\theta_{k-16}\over\sum_{i=1}^{16}f_i+w\sum_{j=1}^\lambda\theta_j& (17\leq k\leq16+\lambda) \end{cases} \]

where $f_k$ ($k$=1,2,⋯,16) is the normalized occurrence frequency of dinucleotide in the sequence; the parameter λ is an integer, representing the highest counted rank (or tier) of the correlation along a sequence; $w$ is the weight factor ranged from 0 to 1; $θ_j$ ($j$=1,2,⋯,λ) is called the j-tier correlation factor that reflects the sequence-order correlation between all the most contiguous dinucleotides along a sequence, which is defined:

\[ (\lambda\lt L) \begin{cases} \theta_1=\frac{1}{L-2}\sum_{i=1}^{L-2}\theta(R_iR_{i+1},R_{i+1}R_{i+2})\\\ \theta_2=\frac{1}{L-3}\sum_{i=1}^{L-3}\theta(R_iR_{i+1},R_{i+2}R_{i+3})\\\ \theta_3=\frac{1}{L-4}\sum_{i=1}^{L-4}\theta(R_iR_{i+1},R_{i+3}R_{i+4})\\\ \cdots\\\ \theta_\lambda=\frac{1}{L-1-\lambda}\sum_{i=1}^{L-1-\lambda}\theta(R_iR_{i+1},R_{i+\lambda}R_{i+1+\lambda}) \end{cases} \]

where the correlation function is given by

\[ \theta(R_iR_{i+1},R_{j}R_{j+1})=\frac{1}{\mu}\sum_{u=1}^\mu[P_u(R_iR_{i+1})-P_u(R_{j}R_{j+1})]^2 \]

where $μ$ is the number of physicochemical indices considered that are listed in the Table 12; $P_u(R_iR_{i+1})$ ($P_u(R_jR_{j+1})$) represents the numerical value of the $u$-th ($u$=1,2,⋯,μ) physicochemical index for the dinucleotide $R_iR_{i+1}$ ($R_jR_{j+1}$) at position $i$ and $j$, respectively.
Here, λ is preset to 2. Thus, the feature vector (d_ks) is Parallel correlation PseDN composition of AA (PCDAA) ~ Parallel correlation PseDN composition of GG (PCDGG), & Parallel correlation PseDN composition of lamda1 (PCDl1) ~ Parallel correlation PseDN composition of lamda2 (PCDl2), 18 values totally. The complete short names are summarized in the following Table 14.
Table 14 Parallel correlation PseDN composition

PCDAA

PCDAT

PCDAC

PCDAG

PCDTA

PCDTT

PCDTC

PCDTG

PCDCA

PCDCT

PCDCC

PCDCG

PCDGA

PCDGT

PCDGC

PCDGG

PCDl1

PCDl2

 

For more information of PC-PseDNC approach, you can refer to (Wei Chen, et al. Anal Biochem. 456: 53-60, 2014).
Series correlation PseDN composition:
Series correlation PseDN composition (SC-PseDNC) is a variant of PC-PseDNC. Combining dinucleotide composition and global sequence-order effects by series correlation.
Given a nucleotide sequence D, the SC-PseDNC feature vector of D is defined:

\[ D=[d_1d_2\ldots d_{16}d_{16+1}\ldots d_{16+\lambda}d_{16+\lambda+1}\ldots d_{16+\lambda\Lambda} ]^T \]

where

\[ d_k=\begin{cases} \frac{f_k}{\sum_{i=1}^{16}f_i+w\sum_{j=1}^\lambda\theta_j}&(1\leq k\leq 16)\\\ \frac{w\theta_{k-16}}{\sum_{i=1}^{16}f_i+w\sum_{j=1}^{\lambda\Lambda}\theta_j}&(17\leq k\leq16+\lambda\Lambda) \end{cases} \]

where $f_k$ ($k$=1,2,⋯,16) is the normalized occurrence frequency of dinucleotide in the sequence; the parameter λ is an integer, representing the highest counted rank (or tier) of the correlation along a sequence; $w$ is the weight factor ranged from 0 to 1; Λ is the number of physicochemical indices; $θ_j$ ($j$=1,2,⋯,λ) is called the j-tier correlation factor that reflects the sequence-order correlation between all the most contiguous dinucleotides along a sequence, which is defined:

\[ (\lambda\lt L-2) \begin{cases} \theta_1=\frac{1}{L-3}\sum_{i=1}^{L-3}\it J_{i.i+1}^1\\\ \theta_2=\frac{1}{L-3}\sum_{i=1}^{L-3}\it J_{i.i+1}^2\\\ \ldots\\\ \theta_\Lambda=\frac{1}{L-3}\sum_{i=1}^{L-3}\it J_{i.i+1}^\Lambda\\\ \cdots\\\ \theta_{\lambda\Lambda-1}=\frac{1}{L-\lambda-2}\sum_{i=1}^{L-\lambda-2}\it J_{i.i+\lambda}^{\Lambda-1}\\\ \theta_{\lambda\Lambda}=\frac{1}{L-\lambda-2}\sum_{i=1}^{L-\lambda-2}\it J_{i.i+\lambda}^{\Lambda} \end{cases} \]

The correlation function is given by:

\[ \begin{cases} J_{i,i+m}^\zeta=P_u(R_iR_{i+1})\cdot P_u(R_{i+m}R_{i+m+1})\\\ \zeta=1,2,\cdots,\Lambda;m=1,2,\cdots,\lambda;i=1,2,\cdots,L-\lambda-2 \end{cases} \]

where $μ$ is the number of total physiochemical indices (Table 12); $P_u(R_iR_{i+1})$ ($P_u(R_jR_{j+1})$) represents the numerical value of the $u$-th ($u$=1,2,⋯,μ) physiochemical index for the dinucleotide $R_iR_{i+1}$ ($R_jR_{j+1}$) at position $i$($j$).
Here, we pick out Λ=6 representative physicochemical features for SC-PseDNC calculation, including PID, BDS, ETI, SKE, WCI, SLI, PID, BDS, ETI, SKE, WCI, SLI, λ is preset to 2. Thus, the feature vector (dks) is Series correlation PseDN composition of AA (SCDAA) ~ Series correlation PseDN composition of GG (SCDGG), & Series correlation PseDN composition of lamda1-PID (L1PID) ~ Series correlation PseDN composition of lamda2-SLI (L2SLI), 28 values totally. The complete short names are summarized in the following Table 15.
Table 15 Series correlation PseDN composition

SCDAA

SCDAT

SCDAC

SCDAG

SCDTA

SCDTT

SCDTC

SCDTG

SCDCA

SCDCT

SCDCC

SCDCG

SCDGA

SCDGT

SCDGC

SCDGG

L1PID

L1BDS

L1ETI

L1SKE

L1WCI

L1SLI

L2PID

L2BDS

L2ETI

L2SKE

L2WCI

L2SLI

For more information of the SC-PseDNC, please refer to (Wei Chen, et al. Anal Biochem. 456: 53-60, 2014).

3.2.3 EIIP based spectrum

Electron-ion interaction pseudopotential (EIIP), calculate the energy of delocalized electrons in amino acids and nucleotides (Siyu Han, et al. Brief Bioinform. 20: 2009–2027, 2019). For any DNA sequence, each nucleotide (A, C, G and T) owns and can be replaced by the following EIIP values: {A→0.1260; C→0.1340; G→0.0806; T→0.1335} (Achuthsankar S Nair, et al. Bioinformation. 1: 197-202, 2006). It was primordially used to locate exons (Achuthsankar S Nair, et al. Bioinformation. 1: 197-202, 2006). Compared with pI values, DNA EIIP values can be directly applied to RNA sequences, which can avoid the potential bias caused by the speculated translation process (Siyu Han, et al. Brief Bioinform. 20: 2009–2027, 2019).
EIIP-based spectrum embodies the physicochemical and 3-base periodicity properties of pcRNAs (A A Tsonis, et al. J Theor Biol. 151: 323-31, 1991; S Tiwari, et al. Comput Appl Biosci. 13: 263-270, 1997). It has been proved that features from this category can present robust results (Siyu Han, et al. Brief Bioinform. 20: 2009–2027, 2019).
Calculation:
We consult to LncFinder (Siyu Han, et al. Brief Bioinform. 20: 2009–2027, 2019) and let $X_e[n]$ be the EIIP indicator value at nth position of a sequence. Using Fast Fourier transform (FFT) on $X_e[n]$, a Sequence Power Spectrum of an RNA can be calculated by the following equation (Sen Yang, et al. Front Genet. 11: 90, 2020):

\[ X_e[k]=\sum_{n=0}^{N-1}X_e[n]e^{-j\frac{2\pi kn}{N}} \]

\[ S_e[k]=|X_e[k]|^2=|\sum_{n=0}^{N-1}X_e[n]e^{-j\frac{2\pi kn}{N}}|^2 \]

\[ (k=0,1,2,\ldots,N-1;\text{N is the sequence length}) \]

It has been reported that pcRNA has a different Power Spectrum distribution compared with ncRNAs. And generally, in the power spectrum of a pcRNA, a peak value will emerge in the thirds position but will not appear in ncRNA (Siyu Han, et al. Brief Bioinform. 20: 2009–2027, 2019).
Thus, based on the difference, here we can employ three physicochemical properties from Sequence Power Spectrum as features.
Electron-ion interaction pseudopotential signal peak (EipSP), which records the signal value of the third position (peak value):

\[ EipSP=S_e[\frac{N}{3}] \]

It has been noted that most of lncRNAs possess lower $S_e[\frac{N}{3}]$ by Han et al (Siyu Han, et al. Brief Bioinform. 20: 2009–2027, 2019).
Electron-ion interaction pseudopotential average power (EipAP), which records the averaging power of a sequence:

\[ EipAP=\frac{\sum_{k=0}^{N-1}S_e[k]}{N} \]

Electron-ion interaction pseudopotential signal/noise ratio (EiSNR), which is equal to EipSP divided by the EipAP:

\[ EiSNR=\frac{EipSP}{EipAP} \]

It has been noted that most of lncRNAs possess lower EipAP and EiSNR values by Han et al (Siyu Han, et al. Brief Bioinform. 20: 2009–2027, 2019).
In addition to three EIIP based sequence power spectrum features above, we sorted the sequence power spectrum in descending order to sample five position values, which are:
Electron-ion interaction pseudopotential spectrum 0 (EiPS0), corresponding to the minimum of sorted Power Spectrum values; Electron-ion interaction pseudopotential spectrum 0.25 (EiPS1), corresponding to the lower quartile value of sorted Power Spectrum values; Electron-ion interaction pseudopotential spectrum 0.5 (EiPS2), corresponding to the medium value of sorted Power Spectrum values; Electron-ion interaction pseudopotential spectrum 0.75 (EiPS3), corresponding to the upper quartile value of sorted Power Spectrum values; Electron-ion interaction pseudopotential spectrum 1 (EiPS4), corresponding to the maximum value of sorted Power Spectrum values (Siyu Han, et al. Brief Bioinform. 20: 2009–2027, 2019; Sen Yang, et al. Front Genet. 11: 90, 2020).
The ranges in the first group varies from the top 10 to top 100 of the sorted power spectrum, and the ranges are also from the top 10% to 100% of the sorted power spectrum. As the signals of mRNAs are generally stronger than those of lncRNAs, pcRNAs should tend to have higher values of these quantiles’ statistics than lncRNAs (Siyu Han, et al. Brief Bioinform. 20: 2009–2027, 2019).

3.3 Structure-based Features

3.3.1 Multi-Scale Secondary Structure Information

Siyu Han et al.(Siyu Han, et al. Brief Bioinform. 20: 2009-2027, 2019) first introduce a kind of multi-scale secondary structural information based self-design RNA sequence from three levels: stability, secondary structure elements (SSEs) combined with pairing condition and structure-nucleotide sequences, which is a special way for sequence representation.

Low-scale feature (stability): MFE is a basic structural outline displaying the stability of the RNA structure, Thus, MFE is selected as the low-scale feature.

Medium-scale feature (SSEs): Let $seq[n]$ be an RNA sequence with length $N$, and the nucleotides are denoted in lowercase $seq[n]\in\{a,c,g,u\}$;Let $SS[n]$ be the secondary structure sequence of $seq[n]$, and $SS[n]$ is defined using dot-bracket notation $(SS[n]\in\{\ .\ \ ,(\ )\})$. Siyu Han et al(Siyu Han, et al. Brief Bioinform. 20: 2009-2027, 2019) employ four Secondary Structure Elements (SSEs) to depict RNA’s basic structure components: stem (s), bulge (b), loop (l) and hairpin (h), and design three secondary structure-derived sequences converted from SS[n] alone without using the nucleotide composition of seq[n]:
Replacing nucleotides of $seq[n]$ with corresponding SSEs, we can obtain the first secondary structure-derived sequence—SSE full sequence ($\text{SSE.Full Seq}$).
Regarding the continuous identical SSE as one SSE, we can obtain the second secondary structure-derived sequence—SSE abbreviated sequence ($\text{SSE.Abbr Seq}$).
In the dot-bracket notation, a dot ‘.’ means unpaired base and brackets ‘(‘or’)’ represent paired base (also the SSE stem). Thus, $SS[n]$ can be converted to $\text{Paired-Unpaired Seq}$ using the following formula:
$$ \text{Paired-Unpaired Seq[n]}= \begin{cases} U,\ \ \ SS[n]=\\\ P,\ \ \ SS[n]\neq \end{cases} $$

High-scale feature: On a high-scale level, the other three secondary structure-derived sequences, acgu-Dot Sequence ($ \text{acguD Seq}$), acgu-Stem Sequence ($\text{acgus Seq}$) and $\text{acgu-ACGU Seq}$, are obtained by combining secondary structure sequence $ SS[n]$ and primary sequence $ seq[n]$:
$$ \text{acguD Seq[n]}= \begin{cases} D,\cdots SS[n]=\\\ Seq[n],\cdots SS[n]\neq \end{cases} $$ $$ \text{acguS Seq[n]}= \begin{cases} Seq[n],\cdots SS[n]=\\\ S,\cdots SS[n]\neq \end{cases} $$ $$ \text{acgu-ACGU Seq[n]}= \begin{cases} A, Seq[n]=a\cdots\Lambda\cdots SS[n]\neq\\\ C, Seq[n]=c\cdots\Lambda\cdots SS[n]\neq\\\ G, Seq[n]=g\cdots\Lambda\cdots SS[n]\neq\\\ U, Seq[n]=u\cdots\Lambda\cdots SS[n]\neq\\\ Seq[n],\cdots SS[n]= \end{cases} $$ In $\text{acguD Seq}$, unpaired nucleotides are replaced with character ‘D’, thus $\text{acguD Seq}$ can be regarded as a portrait describing the percentage of the unpaired base and the intrinsic composition of the SSE stem. Similarly, $\text{acguS Seq}$ can be viewed as a portrait serving the complementary roles. The third sequence $\text{acgu-ACGU Seq}$ is obtained by converting nucleotides of $ seq[n]$ into uppercase if they are paired bases. The combination of these three sequences can be considered a high-resolution panorama presenting the integration of sequence and structural information.

Here, two strategies, improved k-mer scheme(Delphine Charif, et al. BIOMEDICAL. doi: 10.1007/978-3-540-35306-5_10, 2007) and Logarithm-distance (similar to the calculation of Hexamer logarithm distance above) of k-adjoining bases, are employed to extract and calculate features on four multi-scale secondary structural sequences ($\text{Paired-Unpaired Seq}$, $\text{acguD Seq}$, $\text{acguS Seq}$, $\text{acgu-ACGU Seq}$) designed above(Siyu Han, et al. Brief Bioinform. 20: 2009-2021, 2019). We choose seven optimal feature values for RNA encoding, which were verified by LncFinder for lncRNA representation, explained below:

Secondary structural UP frequency paired-unpaired (SFPUS), which means the percentage of unpaired based on $\text{Paired-Unpaired Seq}$;
Structural logarithm distance to lncRNA of acguD (SLDLD), which means the logarithm distance of nucleotide sequence to lncRNA based on $\text{acguD Seq}$;
Structural logarithm distance to pcRNA of acguD (SLDPD), which means the logarithm distance of nucleotide sequence to pcRNA based on $\text{acguD Seq}$;
Structural logarithm distance acguD ratio (SLDRD), which means the ratio of SLDLD and SLDPD;
Structural logarithm distance to lncRNA of acguACGU (SLDLN), which means the logarithm distance of nucleotide sequence to lncRNA based on $\text{acguACGU Seq}$;
Structural logarithm distance to pcRNA of acguACGU (SLDPN), which means the logarithm distance of nucleotide sequence to pcRNA based on $\text{acguACGU Seq}$;
Structural logarithm distance acguACGU ratio (SLDRN), which means the ratio of SLDLN and SLDPN.

4. RNA Encoding Methods (G-features)

4.1 G-gap Features

G-features represent RNA sequences by enumerating the frequency of RNA bases at various intervals. For instance, the G-feature designated as 3-GC calculates the frequency of GC base pairs separated by 3 bases within an RNA sequence. Then, this frequency was normalized to generate the value of 3-GC. For the GC type, this study calculated the frequency of GCs spaced from 0 bases to 40 bases. The 0-GC feature represents the frequency of GCs with no intervening bases, and the 1-GC feature represents the frequency of GCs separated by 1 base. By analogy, a total of 41 features were generated for the GC type, spanning intervals from 0 to 40 bases.