systemPipeR's
New CWL Command-line Interface

Daniela Cassol ()
Thomas Girke ()

24 June, 2019

Outline

Outline

  • Introduction
    • Motivation
    • systemPipeR
  • Design
  • Getting started
  • Workflow Demonstration

Introduction

Motivation

  • Many NGS applications share several analysis routines, such as:
    • Read QC and preprocessing
    • Alignments
    • Quantification
    • Feature annotations
    • Enrichment analysis
  • Thus, a common workflow environment has many advantages for improving efficiency, standardization and reproducibility

Drawing

Common Workflow Environment: Requirements

  • Design and execution of end-to-end analysis pipelines
  • Support for R and command-line software
  • Runs on single machines and compute clusters
  • Uniform interface across different data analysis applications
  • Uniform sample handling and annotation
  • Automated report generation
  • Flexibility to support custom changes and new software/tools

systemPipeR

  • systemPipeR is an R package for building end-to-end analysis pipelines with automated report generation for data analysis applications
Drawing

.

Adopting CWL

Drawing

Advantages

  • Using CWL to generated systemPipeR param files: command-line and workflow definition format
  • Flexible debugging process for all the steps
  • Allows complex experimental design: using targets files
  • Downstream visualization report: integration of workflow plot and data analysis report
  • Sharing of workflows with related environments (e.g., Galaxy or Snakemake)

Design

Workflow design in systemPipeR

Workflow design in systemPipeR

  • SYSargs2 S4 class is a list-like container
  • Workflow steps operations are controlled by SYSargs2 instances
  • Each SYSargs2 instances are generated by two constructor functions, loadWorkflow and renderWF
  • Only input provided by user is initial targets file. Subsequent targets instances are created automatically
  • Any number of predefined or custom workflow steps are supported

  • SYSargs2Pipe allows loading as many SYSargs2 instances into a single compound object
    • Storage all information required to run, control and monitor workflows from start to finish

Getting Started

Install and load packages

  • Install required packages
  • Load packages and accessing help
  • Access help

Load Sample Workflow

systemPipeRdata

  • Helper package to generate with a single command NGS workflow templates for systemPipeR
  • Includes sample data for testing
  • User can create new workflows or change and extend existing ones
  • Template Workflows:
    • Sample workflows can be loaded with the genWorkenvir function from systemPipeRdata

Structure of Workflow Templates

  • The workflow templates generated contain the following preconfigured directory structure
Drawing
  • The above structure can be customized as needed, but users need to change the code/vignette accordingly

Targets file organizes samples

  • Structure of targets file for single-end (SE) library
##                      FileName SampleName Factor SampleLong
## 1 ./data/SRR446027_1.fastq.gz        M1A     M1  Mock.1h.A
## 2 ./data/SRR446028_1.fastq.gz        M1B     M1  Mock.1h.B
## 3 ./data/SRR446029_1.fastq.gz        A1A     A1   Avr.1h.A
  • Structure of targets file for paired-end (PE) library
##                     FileName1                   FileName2 SampleName
## 1 ./data/SRR446027_1.fastq.gz ./data/SRR446027_2.fastq.gz        M1A
## 2 ./data/SRR446028_1.fastq.gz ./data/SRR446028_2.fastq.gz        M1B
## 3 ./data/SRR446029_1.fastq.gz ./data/SRR446029_2.fastq.gz        A1A
##   Factor SampleLong
## 1     M1  Mock.1h.A
## 2     M1  Mock.1h.B
## 3     A1   Avr.1h.A

Design

SYSargs2

  • SYSargs2 instances are constructed from a targets file and two param file
    • hisat2-mapping-se.cwl file contains the settings for running command-line software
    • hisat2-mapping-se.yml file define all the variables to be input in the specific command-line step

Drawing

SYSargs2 instance

Slots and accessor functions have the same names

cmdlist return command-line arguments for the specific software, here HISAT2 for the first sample:

The output components of SYSargs2 define all the expected output files for each step in the workflow; some of which are the input for the next workflow step

Workflow Demonstration

Drawing

Read Preprocessing

Drawing

Read mapping with Hisat2

  • The NGS reads of this project will be aligned against the reference genome sequence using Hisat2 (Kim, Langmead, and Salzberg 2015).
  • Subsetting SYSargs2 instance slots for each workflow step:

Run on single machines

  • Run command-line tool, here Hisat2, on single machine. Command-line tool needs to be installed for this.
  • runCommandline, by default, auto detects SAM file outputs and converts them to sorted and indexed BAM files, using internally Rsamtools package (Morgan et al. 2019)
  • runCommandline allows the user to create an exclusive results folder for each step in the workflow and a sub-folder for each sample defined in the targets file

Parallelization on clusters

  • Submit command-line or R processes to a computer cluster with a queueing system (e.g. Slurm)
  • clusterRun function submits non-R command-line software to queuing systems of clusters using run specifications defined by runCommandline

SYSargs2Pipe

  • Check whether all BAM files have been created with the construction of SYSargs2Pipe
  • Additional features to be developed here

Alignment Stats

  • The following shows the alignment statistics for a sample file provided by the systemPipeR package
##   FileName Nreads2x Nalign Perc_Aligned Nalign_Primary
## 1      M1A   192918 177961     92.24697         177961
## 2      M1B   197484 159378     80.70426         159378
## 3      A1A   189870 176055     92.72397         176055
## 4      A1B   188854 147768     78.24457         147768
##   Perc_Aligned_Primary
## 1             92.24697
## 2             80.70426
## 3             92.72397
## 4             78.24457

Create new targets files

  • To establish the connectivity between different instances, it is possible by writing the subsetting output with the writeTargetsout function to a new targets file that serves as input to the next loadWorkflow and renderWF call

Read Quantification

  • Read counting with summarizeOverlaps in parallel mode using multiple cores (BiocParallel)
  • The read counting is preformed for exonic gene regions in a non-strand-specific manner while ignoring overlaps among different genes
  • CODE
##           M1A M1B A1A A1B V1A V1B M6A M6B A6A A6B V6A  V6B M12A M12B A12A
## AT1G01010  57 244 201 169 365 229  41  38 152  46 294  405  117  139  132
## AT1G01020  23  93  69 126 107  88  18  25  20   5  88  151   43   74   33
## AT1G01030  41  98  73  58  94 156   9  13   8   6  36  147   32   29   12
## AT1G01040 180 684 522 664 585 680 162 163 249  66 697 1060  338  604  360
##           A12B V12A V12B
## AT1G01010   64  230  120
## AT1G01020   18   43   31
## AT1G01030    9   70   85
## AT1G01040   80  203  171

DEG analysis

  • Fully automated for simple pairwise designs
  • Supports edgeR and DESeq2
  • The following shows DESeq2 (Love, Huber, and Anders 2014) for any number of pairwise sample comparisons specified under the cmp argument
##           M1-A1_baseMean M1-A1_logFC M1-A1_lfcSE  M1-A1_stat M1-A1_pvalue
## AT1G01010      145.25327  0.03975293   0.4624215  0.08596688    0.9314927
## AT1G01020       44.13612 -0.26585739   0.4340369 -0.61252257    0.5401921
## AT1G01030       44.26939  0.63339311   0.6246251  1.01403724    0.3105650
## AT1G01040      315.30420 -0.01585074   0.3126902 -0.05069151    0.9595713
##           M1-A1_FDR
## AT1G01010         1
## AT1G01020         1
## AT1G01030         1
## AT1G01040         1

DEG analysis

  • Filter and plot DEG results for up and down regulated genes.
Drawing

Run Workflows with a Single Command

  • All the workflows can run the code from the provided *.Rmd template file from within R interactively
  • Alternatively, it is possible to run the entire workflows or their components from the command-line
    • The following shows how to execute the chosen sample workflow (e.g., systemPipeRNAseq.Rmd) by executing from the command-line:
  • During the evaluation of the R code, reports are dynamically autogenerated in PDF or HTML format.

Acknowledgement

Acknowledgement

  • Girke Lab - University of California Riverside

  • Funding Sources

Questions

Additional

Advantages of systemPipeR

  • Design of complex NGS workflows involving multiple R/Bioconductor packages
  • Simplifies usage of command-line software from within R
  • Accelerates runtime of workflows via parallelization on computer systems with multiple CPU cores and/or multiple compute nodes
  • Automates generation of analysis reports improving reproducibility
  • Common workflow interface for different NGS applications
  • Makes NGS analysis more accessible to new users
Drawing

Return

CWL Design

Drawing

Return / Design

Read quality filtering and trimming

The function preprocessReads allows to apply predefined or custom read preprocessing functions to all FASTQ files referenced in a SYSargs container, such as quality filtering or adapter trimming routines. The following example performs adapter trimming with the trimLRPatterns function from the Biostrings package. After the trimming step a new targets file is generated (here targets_trim.txt) containing the paths to the trimmed FASTQ files. The new targets file can be used for the next workflow step with an updated SYSargs instance, e.g. running the NGS alignments using the trimmed FASTQ files.

Return

FASTQ quality report

The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The results are written to a PDF file named fastqReport.pdf.

Return

Read counting with summarizeOverlaps in parallel mode using multiple cores

Reads overlapping with annotation ranges of interest are counted for each sample using the summarizeOverlaps function (Lawrence et al. 2013). The read counting is preformed for exonic gene regions in a non-strand-specific manner while ignoring overlaps among different genes. The raw read count table (countDFeByg.xls) is written to a files in the directory of this project. Parallelization is achieved with the BiocParallel package, here using 8 CPU cores. Return

Sample of data slice of count table

Return

References

Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4): 357–60.

Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8): e1003118. https://doi.org/10.1371/journal.pcbi.1003118.

Love, Michael, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Genome Biol. 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.

Morgan, Martin, Hervé Pagès, Valerie Obenchain, and Nathaniel Hayden. 2019. Rsamtools: Binary Alignment (Bam), Fasta, Variant Call (Bcf), and Tabix File Import. http://bioconductor.org/packages/Rsamtools.