3.0 RNA-Seq analysis pipelines

Over the years, microarray and gene-chip technologies have provided insight into the genetic changes in biological samples. However, these techniques have known limitations in dynamic range, resolution and accuracy.51 Advances in transcriptome technology have allowed a deeper understanding of the intricacies of gene expression regulation. In particular, high-throughput RNA sequencing has made it possible to observe whole-transcriptome variation, discover novel splice sites and splicing events, investigate the functions of noncoding RNAs, and support the correct construction and annotation of complex genomes.52 It also helps to determine qualitatively which RNA transcripts and RNA editing sites are present, and to quantify how much of each transcript is expressed.53 It is therefore important to give an overview of the pipelines and workflows applied to bladder cancer RNA-Seq analyses.
A number of computational pipelines and workflows are used for the pre-processing of RNA-Seq data in cancer studies and for other experimental purposes.50,54,55 A typical RNA-Seq workflow consists of seven steps: (1) pre-processing of raw data, (2) alignment of reads to the reference, (3) transcriptome reconstruction, (4) quantification at the transcript or gene level, (5) differential expression analysis, (6) functional profiling and (7) advanced analysis (Figure 2).56 These stages, which include quality control (QC) and data analysis, can be carried out with a variety of computational platforms and tools. For example, reads may be aligned using tools such as Spliced Transcripts Alignment to a Reference (STAR) or TopHat.57,58 Read counts can then be obtained from the aligned reads using either HTSeq or the Rsubread R/Bioconductor package.59,60 The advantage of Rsubread over HTSeq is that it is faster, requires less memory and produces read counts that are closer to the true values.61,62
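To make the quantification step concrete, the following is a minimal Python sketch of how aligned reads can be summarized into gene-level counts. It is a conceptual illustration only and does not reproduce how HTSeq or Rsubread are implemented. The function name, the half-open coordinate convention, the gene identifiers and the `no_feature` and `ambiguous` labels are all assumptions made for this example.

```python
from collections import Counter


def count_reads_per_gene(alignments, genes):
    """Assign each aligned read to a gene and tally the counts.

    alignments: iterable of (chrom, start, end) tuples, half-open coordinates.
    genes: dict mapping a gene id to a (chrom, start, end) tuple.
    A read is counted only if it overlaps exactly one gene; reads that
    overlap no gene or several genes are tallied separately.
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    unassigned = Counter()
    for chrom, start, end in alignments:
        # Collect every gene on the same chromosome that overlaps the read.
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and g_start < end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            unassigned["no_feature"] += 1
        else:
            unassigned["ambiguous"] += 1
    return dict(counts), dict(unassigned)


# Hypothetical annotation and alignments, invented for illustration.
genes = {"GENE_A": ("chr1", 100, 200), "GENE_B": ("chr1", 180, 300)}
alignments = [
    ("chr1", 110, 160),  # overlaps GENE_A only
    ("chr1", 150, 190),  # overlaps both genes
    ("chr1", 250, 290),  # overlaps GENE_B only
    ("chr1", 400, 450),  # overlaps no gene
    ("chr2", 110, 160),  # different chromosome
]
counts, unassigned = count_reads_per_gene(alignments, genes)
print(counts)
print(unassigned)
```

Production tools additionally handle strandedness, multi-mapping reads, exon-level features and paired-end fragments, none of which this sketch attempts.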
Raw RNA-Seq data often have quality problems that can significantly distort analytical findings and lead to incorrect conclusions.63 For instance, the quality of raw RNA-Seq data can be compromised by residual ribosomal RNA, RNA degradation and variation in read coverage.63 Hence, to obtain accurate transcript or gene measurements and to extract information from the data properly, raw RNA-Seq data must be evaluated by quality control measures before subsequent analyses are conducted.27,63 At present, the most widely used computational tools for RNA-Seq QC include FastQC and MultiQC. FastQC processes one sample at a time, while MultiQC generates a single report that visualizes the output of several samples from multiple tools, allowing easy comparison.64,65 Other commonly used QC software includes RSeQC, RNA-SeQC and RNA-QC-Chain.66-68 Although both RSeQC and RNA-SeQC provide QC statistics for aligned reads, RSeQC partially relies on the University of California Santa Cruz (UCSC) Genome Browser.67 Moreover, both are slow and cannot trim sequences or filter contaminants. In contrast, RNA-QC-Chain can remove low-quality reads and contamination, in addition to providing fast and reliable QC that yields data for downstream analysis.63 The steps of an RNA-Seq data analysis depend on the data quality and the specific aims of the study. These steps have been reviewed in detail elsewhere.27,69
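As an illustration of what read-level quality filtering involves, the following minimal Python sketch computes the mean Phred score of each read in FASTQ-formatted text and keeps only the reads above a threshold. It is not the algorithm used by FastQC, RNA-QC-Chain or any other tool named here. The function names, the threshold of 20, the Phred+33 encoding and the strict four-lines-per-record layout with non-empty reads are assumptions made for this example.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from FASTQ text.

    Assumes the simple layout of exactly four lines per record.
    """
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_mean_quality=20):
    """Keep reads whose mean Phred score is at least min_mean_quality."""
    return [
        (read_id, sequence, quality)
        for read_id, sequence, quality in records
        if mean_phred(quality) >= min_mean_quality
    ]


# Hypothetical reads: 'I' encodes Phred 40, '5' Phred 20 and '!' Phred 0.
fastq_text = """@read1
ACGT
+
IIII
@read2
ACGT
+
!!!!
@read3
ACGT
+
5555
"""
kept = filter_reads(parse_fastq(fastq_text))
print([read_id for read_id, _, _ in kept])  # ['read1', 'read3']
```

Dedicated QC tools go further, for example by trimming low-quality read ends, removing adapters and screening for ribosomal RNA or other contaminants.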
RNA-Seq analysis relies on computationally intensive tools assembled into pipelines that orchestrate the entire workflow and optimize the use of available computational resources.67 The development of such analytical tools has expanded owing to the complex nature of transcriptome data, and selecting the correct processing pipeline and normalization strategy has a significant impact on downstream analysis.70 These pipelines consist of multiple independent software packages, tools and platforms that employ R, Python, Unix/Bash, JavaScript, Perl and C++. Because these software packages run in programmable environments, they allow flexible manipulation of data and methods. However, they require the user to have expertise in programming languages, especially Bash or the Unix command line.71 With the growing application of RNA-Seq in biomedical research, integrated, user-friendly platforms are needed to overcome the barriers encountered with code-based platforms. Graphical user interface (GUI) or web-based platforms provide a convenient environment for non-experts and are well suited to quick exploratory analysis, although not at the scale of large datasets.71 Table 1 summarizes the various computational tools and their associated platforms used in RNA-Seq analyses.
Variations in RNA-Seq analysis results may arise from the use of different platforms and analytical frameworks. The number of computational tools and bioinformatics methods currently in use adds further challenges to the analysis and interpretation of RNA-Seq data. To address the challenges caused by variation in RNA-Seq analysis techniques, standard pipelines need to be enforced and redesigned to integrate the analysis of multiple experiments. Workflow construction software packages such as Chipster,72 Anduril73,74 and Galaxy75 could be relevant in solving some of these challenges. For example, Anduril was developed for designing complex RNA-Seq pipelines with large-scale datasets that require automated parallelization, while Chipster and Galaxy are powerful in integrative data visualization, which makes them very useful for data exploration and interpretation. Other workflow management frameworks for RNA-Seq analysis include KNIME,76 which aids the visual assembly and interactive execution of data pipelines, and Snakemake,77 a Python-based workflow management engine that provides a powerful execution environment. Workflow management frameworks that focus specifically on RNA-Seq data analysis are reviewed elsewhere.83 In addition, the large-scale nature of the data analyses associated with RNA-Seq brings many challenges that are beyond the scope of this review. Han and colleagues78 reviewed these challenges comprehensively and proposed solutions. Moreover, results from RNA-Seq studies of tumours have revealed molecular subsets characterized by cellular signatures and the microenvironment, which can inform choices to circumvent treatment failure.79 Thus, single-cell RNA sequencing (scRNA-Seq) may prove to be a suitable method for understanding tumour progression and pathogenesis and for discovering biomarkers that could lead to better treatment and management of bladder cancer.
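The core idea behind workflow management engines, namely resolving step dependencies and identifying steps that can run in parallel, can be sketched with the Python standard library (version 3.9 or later). This is a conceptual illustration and not the API or scheduling logic of Snakemake, Anduril or any other framework named above. The step names and the dependency graph are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_batches(dependencies):
    """Group workflow steps into batches that respect their dependencies.

    dependencies: dict mapping a step to the set of steps it depends on.
    Steps within one batch have no unmet dependencies and could, in
    principle, be run in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical RNA-Seq workflow: each step lists the steps it needs first.
workflow = {
    "align": {"quality_control"},
    "qc_report": {"quality_control"},
    "quantify": {"align"},
    "differential_expression": {"quantify"},
    "functional_profiling": {"differential_expression"},
}
for batch in execution_batches(workflow):
    print(batch)
```

In this example, alignment and QC reporting both depend only on quality control, so they fall into the same batch. Real workflow engines add file-based rule matching, job submission to clusters, caching and restart of failed steps.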