July 22, 2015

Outline

  • Introduction
  • Motivation
  • Design
  • Templates
  • Getting started

Introduction

  • systemPipeR is an R package for building end-to-end analysis pipelines with automated report generation for next generation sequencing (NGS) applications (Girke 2014).
  • Important features:
    • Support for R and command-line software, such as NGS aligners, peak callers, variant callers, etc.
    • Runs on single machines and compute clusters with schedulers
    • Uniform sample handling and annotation

Motivation

  • Many NGS applications share several analysis routines, such as:
    • Read QC and preprocessing
    • Alignments
    • Quantification
    • Feature annotations
    • Enrichment analysis
  • Thus, a common workflow environment has many advantages for improving efficiency, standardization and reproducibility

Advantages of systemPipeR

  • Facilitates design of complex NGS workflows involving multiple R/Bioconductor packages (Huber et al. 2015).
  • Makes NGS analysis with Bioconductor utilities more accessible to new users
  • Simplifies usage of command-line software from within R
  • Reduces complexity of using compute clusters for R and command-line software
  • Accelerates runtime of workflows via parallelization on computer systems with multiple CPU cores and/or multiple compute nodes
  • Automates generation of analysis reports to improve reproducibility

Workflow design in systemPipeR

  • Workflow steps with input/output file operations are controlled by SYSargs objects.

  • Each SYSargs instance is constructed from a targets file and a param file.

  • The only input the user provides is the initial targets file. Subsequent targets instances are created automatically.

  • Any number of predefined or custom workflow steps are supported.

systemPipeRdata: template workflows

  • Helper package that generates NGS workflow templates for systemPipeR with a single command.
  • Includes sample data for testing.
  • Users can create new workflows or modify and extend existing ones.

RNA-Seq workflow template

  1. Read preprocessing
    • Quality filtering (trimming)
    • FASTQ quality report
  2. Alignments: rsubread, Bowtie2/Tophat2
  3. Alignment statistics
  4. Read counting per annotation
  5. Sample-wise correlation analysis
  6. DEG analysis with edgeR or DESeq2
  7. Enrichment analysis of GO terms or other annotation types
  8. Gene-wise cluster analysis

VAR-Seq workflow template

  1. Read preprocessing
    • Quality filtering (trimming)
    • FASTQ quality report
  2. Alignments: gsnap, bwa
  3. Alignment statistics
  4. Variant calling: VariantTools, GATK, BCFtools
  5. Variant filtering: VariantTools and VariantAnnotation
  6. Variant annotation: VariantAnnotation
  7. Combine results from many samples
  8. Summary statistics of samples

ChIP-Seq workflow template

  1. Read preprocessing
    • Quality filtering and/or trimming
    • FASTQ quality report
  2. Alignments: rsubread, Bowtie2
  3. Alignment statistics
  4. Genome-wide coverage statistics
  5. Peak calling: MACS2, BayesPeak
  6. Peak annotation with genomic context
  7. Differential binding analysis
  8. Enrichment analysis of GO terms or other annotation types
  9. Motif analysis

Ribo-Seq workflow template

  1. Read preprocessing
    • Adaptor trimming and quality filtering
    • FASTQ quality report
  2. Alignments: Tophat2 (or any other RNA-Seq aligner)
  3. Alignment statistics
  4. Compute read distribution across genomic features
  5. Adding custom features to workflow (e.g. uORFs)
  6. Genomic read coverage along transcripts
  7. Read counting
  8. Sample-wise correlation analysis
  9. Analysis of differentially expressed genes (DEGs)
  10. GO term enrichment analysis
  11. Gene-wise clustering
  12. Differential ribosome binding (translational efficiency)

Coming soon

Workflow templates for:

  • miRNA-Seq
  • BS-Seq

Install and load packages

Install required packages

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("systemPipeR") # Install systemPipeR from Bioconductor
BiocManager::install("tgirke/systemPipeRdata", build_vignettes=TRUE, dependencies=TRUE) # Install systemPipeRdata from GitHub

Load packages

library("systemPipeR")
library("systemPipeRdata")

Access help

library(help="systemPipeR")
vignette("systemPipeR")

Targets file organizes samples

Structure of targets file for single-end (SE) library

targetspath <- system.file("extdata", "targets.txt", package="systemPipeR")
read.delim(targetspath, comment.char = "#")[1:3,1:5]
##                   FileName SampleName Factor SampleLong Experiment
## 1 ./data/SRR446027_1.fastq        M1A     M1  Mock.1h.A          1
## 2 ./data/SRR446028_1.fastq        M1B     M1  Mock.1h.B          1
## 3 ./data/SRR446029_1.fastq        A1A     A1   Avr.1h.A          1

Structure of targets file for paired-end (PE) library

targetspath <- system.file("extdata", "targetsPE.txt", package="systemPipeR")
read.delim(targetspath, comment.char = "#")[1:3,1:4]
##                  FileName1                FileName2 SampleName Factor
## 1 ./data/SRR446027_1.fastq ./data/SRR446027_2.fastq        M1A     M1
## 2 ./data/SRR446028_1.fastq ./data/SRR446028_2.fastq        M1B     M1
## 3 ./data/SRR446029_1.fastq ./data/SRR446029_2.fastq        A1A     A1
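
For paired-end data it can be useful to confirm that the two files in each row belong together. The sketch below is not part of systemPipeR; it assumes the `_1.fastq` / `_2.fastq` naming convention seen in the example above and uses a small hand-built table in place of the real targets file.

```r
# Hand-built stand-in for a paired-end targets table (values from the example above)
targets_pe <- data.frame(
  FileName1  = c("./data/SRR446027_1.fastq", "./data/SRR446028_1.fastq"),
  FileName2  = c("./data/SRR446027_2.fastq", "./data/SRR446028_2.fastq"),
  SampleName = c("M1A", "M1B"),
  Factor     = c("M1", "M1"),
  stringsAsFactors = FALSE
)

# Strip the assumed mate suffix to recover the shared run identifier
prefix1 <- sub("_1\\.fastq$", "", basename(targets_pe$FileName1))
prefix2 <- sub("_2\\.fastq$", "", basename(targets_pe$FileName2))

# TRUE for every row whose two mates share the same identifier
mates_match <- prefix1 == prefix2
print(data.frame(SampleName = targets_pe$SampleName, mates_match))
```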

SYSargs: targets & param

SYSargs instances are constructed from a targets file and a param file. The param file contains the settings for running command-line software.

parampath <- system.file("extdata", "tophat.param", package="systemPipeR")
(args <- suppressWarnings(systemArgs(sysma=parampath, mytargets=targetspath)))
## An instance of 'SYSargs' for running 'tophat' on 18 samples

Slots and accessor functions have the same names

names(args)[c(5,8,13)]
## [1] "software"  "reference" "sysargs"

Return the command-line arguments for the given software, here Tophat2, for the first sample.

sysargs(args)[1]
## tophat -p 4 -o SRR446027_1.fastq.tophat tair10.fasta SRR446027_1.fastq SRR446027_2.fastq
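
To illustrate the idea of deriving one command line per sample from a sample table plus a set of software settings, here is a conceptual sketch in base R. It is not how systemPipeR implements SYSargs: the `param` list and the `build_commands()` helper are hypothetical names introduced only for this example.

```r
# Hypothetical stand-in for the settings a param file would provide
param <- list(software = "tophat", options = "-p 4", reference = "tair10.fasta")

# Hand-built stand-in for a paired-end targets table
targets_pe <- data.frame(
  FileName1  = c("SRR446027_1.fastq", "SRR446028_1.fastq"),
  FileName2  = c("SRR446027_2.fastq", "SRR446028_2.fastq"),
  SampleName = c("M1A", "M1B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assemble one command string per row of the targets table
build_commands <- function(param, targets) {
  outdir <- paste0(targets$FileName1, ".tophat")
  cmds <- paste(param$software, param$options, "-o", outdir,
                param$reference, targets$FileName1, targets$FileName2)
  setNames(cmds, targets$SampleName)  # label each command by sample name
}

cmds <- build_commands(param, targets_pe)
print(cmds[1])
```

Because `paste()` is vectorized, adding rows to the sample table yields additional commands without any change to the helper.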

Run on single machines or clusters

Run a command-line tool, here Tophat2, on a single machine. The tool itself must be installed on the system for this to work.

runCommandline(args)

Submit command-line or R processes to a computer cluster with a queueing system.

clusterRun(args, ...) 

The last step requires additional resource allocation arguments. For details, see the main systemPipeR manual.

Workflow templates

Generate workflow template, e.g. "rnaseq", "varseq" or "chipseq"

genWorkenvir(workflow="varseq", mydirname=NULL)
setwd("varseq")



Command-line alternative for generating workflow environments

$ echo 'library(systemPipeRdata); 
        genWorkenvir(workflow="varseq", mydirname=NULL)' | R --slave

Workflow template structure

The workflow templates generated by genWorkenvir contain the following preconfigured directory structure:

workflow_name/            # *.Rnw/*.Rmd scripts, targets file, etc.
                param/    # parameter files for command-line software 
                data/     # inputs e.g. FASTQ, reference, annotations
                results/  # analysis result files



The above structure can be customized as needed, but for first-time users it is easier to keep changes to a minimum.
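
For readers who want to see the layout on disk without installing anything, the following sketch recreates the empty directory skeleton with base R. It is only an illustration of the structure listed above, not what genWorkenvir does internally, and it works inside a temporary directory so that no existing files are touched.

```r
# Create the skeleton inside a temporary directory so nothing is overwritten
root <- file.path(tempdir(), "workflow_name")
subdirs <- c("param", "data", "results")

for (d in file.path(root, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List the immediate subdirectories that now exist under the workflow root
created <- list.dirs(root, full.names = FALSE, recursive = FALSE)
print(sort(created))
```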

Run workflows

  • Next, run the chosen sample workflow from within R by executing the code provided in the corresponding *.Rnw template file (or its *.Rmd or *.R version).
  • Alternatively, one can run an entire workflow from start to finish with a single command by executing from the command-line:
$ make -B
  • Analysis reports in PDF or HTML format are generated automatically when running a workflow, using the standard R tools for scientific report generation, knitr and rmarkdown, respectively.

  • Integration of ReportingTools is also straightforward.

Future development

  • Workflow templates with support for both PDF (.Rnw) and HTML (.Rmd) reports
  • Workflow templates for additional NGS applications
  • docopt support for generating .param files
  • Additional visualization functions
  • Streamline support of very complex experimental designs

References

Girke, Thomas. 2014. “systemPipeR: NGS Workflow and Report Generation Environment.” UC Riverside. https://github.com/tgirke/systemPipeR.

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat. Methods 12 (2): 115–21. doi:10.1038/nmeth.3252.