systemPipeTools package extends the widely used systemPipeR (SPR) (H Backman and Girke 2016)
workflow environment with enhanced toolkit for data visualization,
including utilities to automate the analysis of differentially expressed genes (DEGs).
systemPipeTools provides functions for data transformation and data exploration via
scatterplots, hierarchical clustering heatMaps, principal component analysis,
multidimensional scaling, generalized principal components, t-Distributed
Stochastic Neighbor embedding (t-SNE), and MA and volcano plots.
All these utilities can be integrated with the modular design of the systemPipeR
environment that allows users to easily substitute any of these features and/or
custom with alternatives.
Metadata and Reads Counting Information
The first step is importing the targets file and raw reads counting table.
The targets file defines all FASTQ files and sample comparisons of the analysis workflow.
The raw reads counting table represents all the reads that map to gene (row) for each sample (columns).
For gene differential expression, raw counts are required, however for data
visualization or clustering, it can be useful to work with transformed count data.
exploreDDS function is convenience wrapper to transform raw read counts using the
DESeq2 package transformations methods. The input file
has to contain all the genes, not just differentially expressed ones. Supported
methods include variance stabilizing transformation (vst) (Anders and Huber (2010)), and
regularized-logarithm transformation or rlog (Love, Huber, and Anders (2014)).
Users are strongly encouraged to consult the DESeq2 vignette for
more detailed information on this topic and how to properly run DESeq2 on data
sets with more complex experimental designs.
Scatterplot
To decide which transformation to choose, we can visualize the transformation effect
comparing two samples or a grid of all samples, as follows:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The scatterplots are created using the log2 transform normalized reads count,
variance stabilizing transformation (VST) (Anders and Huber (2010)), and
regularized-logarithm transformation or rlog (Love, Huber, and Anders (2014)).
Hierarchical Clustering Dendrogram
The following computes the sample-wise correlation coefficients using the stats::cor()
function from the transformed expression values. After transformation to a
distance matrix, hierarchical clustering is performed with the stats::hclust
function and the result is plotted as a dendrogram, as follows:
hclustplot(exploredds, method = "spearman")
The function provides the utility to save the plot automatically.
Hierarchical Clustering HeatMap
This function performs hierarchical clustering on the transformed expression
matrix generated within the DESeq2 package. It uses, by default, a Pearson
correlation-based distance measure and complete linkage for cluster join.
If samples selected in the clust argument, it will be applied the stats::dist()
function to the transformed count matrix to get sample-to-sample distances. Also,
it is possible to generate the pheatmap or plotly plot format.
The function provides the utility to save the plot automatically.
Principal Component Analysis
This function plots a Principal Component Analysis (PCA) from transformed expression
matrix. This plot shows samples variation based on the expression values and
identifies batch effects.
PCAplot(exploredds, plotly = FALSE)
The function provides the utility to save the plot automatically.
Multidimensional scaling with MDSplot
This function computes and plots multidimensional scaling analysis for dimension
reduction of count expression matrix. Internally, it is applied the stats::dist()
function to the transformed count matrix to get sample-to-sample distances.
MDSplot(exploredds, plotly = FALSE)
The function provides the utility to save the plot automatically.
Dimension Reduction with GLMplot
This function computes and plots generalized principal components analysis for
dimension reduction of count expression matrix.
The function provides the utility to save the plot automatically.
t-Distributed Stochastic Neighbor embedding with tSNEplot
This function computes and plots t-Distributed Stochastic Neighbor embedding (t-SNE)
analysis for unsupervised nonlinear dimensionality reduction of count expression
matrix. Internally, it is applied the Rtsne::Rtsne() (Krijthe 2015) function, using the exact
t-SNE computing with theta=0.0.
tSNEplot(countMatrix, targets, perplexity = 5)
Volcano plot
A simple function that shows statistical significance (p-value) versus magnitude
of change (log2 fold change).
Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biol. 11 (10): R106.
H Backman, Tyler W, and Thomas Girke. 2016. “systemPipeR: NGS workflow and report generation environment.” BMC Bioinformatics 17 (1): 388. https://doi.org/10.1186/s12859-016-1241-0.
Krijthe, Jesse H. 2015. Rtsne: T-Distributed Stochastic Neighbor Embedding Using Barnes-Hut Implementation. https://github.com/jkrijthe/Rtsne.
Love, Michael, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Genome Biol. 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.