systemPipeR's New CWL Command-line Interface
Daniela Cassol (danielac@ucr.edu) Thomas Girke (thomas.girke@ucr.edu)
24 June, 2019
Outline
- Introduction
- Motivation
- systemPipeR
- Design
- Getting started
- Workflow Demonstration
Introduction
Motivation
- Many NGS applications share several analysis routines, such as:
- Read QC and preprocessing
- Alignments
- Quantification
- Feature annotations
- Enrichment analysis
- Thus, a common workflow environment has many advantages for improving efficiency, standardization and reproducibility
Common Workflow Environment: Requirements
- Design and execution of end-to-end analysis pipelines
- Support for R and command-line software
- Runs on single machines and compute clusters
- Uniform interface across different data analysis applications
- Uniform sample handling and annotation
- Automated report generation
- Flexibility to support custom changes and new software/tools
systemPipeR
- systemPipeR is an R package for building end-to-end analysis pipelines with automated report generation for data analysis applications
Adopting CWL
- CWL is a community standard for describing computational data analysis workflows
- Provides portability and scalability across platforms and environments
- Integration of CWL allows running systemPipeR workflows from a single specification instance:
- Entirely from R
- Command-line wrappers (cwl-runner or cwltool)
- Other languages
- Support for containerization and parallel evaluation on compute clusters
- Can be applied to any command-line tool/software
Advantages
- Uses CWL to generate systemPipeR param files: a command-line and workflow definition format
- Flexible debugging of all workflow steps
- Supports complex experimental designs via targets files
- Downstream visualization and reporting: integration of workflow plots and the data analysis report
- Sharing of workflows with related environments (e.g., Galaxy or Snakemake)
Design
Workflow design in systemPipeR
- SYSargs2 is an S4 class and list-like container
- Workflow step operations are controlled by SYSargs2 instances
- Each SYSargs2 instance is generated by two constructor functions, loadWorkflow and renderWF
- The only input provided by the user is the initial targets file. Subsequent targets instances are created automatically
- Any number of predefined or custom workflow steps are supported
- SYSargs2Pipe allows loading as many SYSargs2 instances as needed into a single compound object
- This compound object stores all information required to run, control and monitor workflows from start to finish
Getting Started
Install and load packages
- Install required packages
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("systemPipeR")
BiocManager::install("systemPipeRdata", build_vignettes=TRUE, dependencies=TRUE) - Load packages and accessing help
- Access help
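A minimal sketch of loading the packages and looking up their documentation; the help accessors are standard base R, and the exact vignette titles may differ between releases:
library("systemPipeR")
library("systemPipeRdata")
library(help="systemPipeR")          # list the package's help topics
vignette(package="systemPipeR")      # list the available vignettes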
Load Sample Workflow
systemPipeRdata
- Helper package that generates NGS workflow templates for systemPipeR with a single command
- Includes sample data for testing
- User can create new workflows or change and extend existing ones
- Template Workflows:
- Sample workflows can be loaded with the genWorkenvir function from systemPipeRdata
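A brief sketch of generating a workflow environment from one of the templates (here the RNA-Seq template; the template name and directory layout follow the systemPipeRdata documentation):
library(systemPipeRdata)
genWorkenvir(workflow="rnaseq")   # creates an 'rnaseq' directory with the workflow template and sample data
setwd("rnaseq")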
Structure of Workflow Templates
- Each generated workflow template contains a preconfigured directory structure
- This structure can be customized as needed, but users need to change the code/vignette accordingly
Targets file organizes samples
- Structure of targets file for single-end (SE) library
targetspath <- system.file("extdata", "targets.txt", package="systemPipeR")
read.delim(targetspath, comment.char = "#")[1:3,1:4]
## FileName SampleName Factor SampleLong
## 1 ./data/SRR446027_1.fastq.gz M1A M1 Mock.1h.A
## 2 ./data/SRR446028_1.fastq.gz M1B M1 Mock.1h.B
## 3 ./data/SRR446029_1.fastq.gz A1A A1 Avr.1h.A
- Structure of targets file for paired-end (PE) library
targetspath <- system.file("extdata", "targetsPE.txt", package="systemPipeR")
read.delim(targetspath, comment.char = "#")[1:3,1:5]
## FileName1 FileName2 SampleName
## 1 ./data/SRR446027_1.fastq.gz ./data/SRR446027_2.fastq.gz M1A
## 2 ./data/SRR446028_1.fastq.gz ./data/SRR446028_2.fastq.gz M1B
## 3 ./data/SRR446029_1.fastq.gz ./data/SRR446029_2.fastq.gz A1A
## Factor SampleLong
## 1 M1 Mock.1h.A
## 2 M1 Mock.1h.B
## 3 A1 Avr.1h.A
SYSargs2
- SYSargs2 instances are constructed from a targets file and two param files
- The hisat2-mapping-se.cwl file contains the settings for running the command-line software
- The hisat2-mapping-se.yml file defines all the variables to be input to the specific command-line step
targets <- system.file("extdata", "targets.txt", package="systemPipeR")
dir_path <- system.file("extdata/cwl/hisat2-se", package="systemPipeR")
align <- loadWorkflow(targets=targets, wf_file="hisat2-mapping-se.cwl",
input_file="hisat2-mapping-se.yml", dir_path=dir_path)
align <- renderWF(align, inputvars=c(FileName="_FASTQ_PATH_", SampleName="_SampleName_"))
## Instance of 'SYSargs2':
## Slot names/accessors:
## targets: 18 (M1A...V12B), targetsheader: 4 (lines)
## modules: 2
## wf: 0, clt: 1, yamlinput: 7 (components)
## input: 18, output: 18
## cmdlist: 18
## WF Steps:
## 1. hisat2-mapping-se.cwl (rendered: TRUE)
SYSargs2 instance
Slots and accessor functions have the same names
names(align)
# [1] "targets" "targetsheader" "modules" "wf" "clt"
# [6] "yamlinput" "cmdlist" "input" "output" "cwlfiles"
# [11] "inputvars"
cmdlist returns the command-line arguments for the specific software, here HISAT2 for the first sample:
cmdlist(align)[1]
# $M1A
# $M1A$`hisat2-mapping-se.cwl`
# [1] "hisat2 -S results/M1A.sam -x ./data/tair10.fasta -k 1 --min-intronlen 30 --max-intronlen 3000 -U ./data/SRR446027_1.fastq.gz --threads 4"
The output components of SYSargs2 define all the expected output files for each step in the workflow, some of which serve as input for the next workflow step
Workflow Demonstration
Read Preprocessing
Read mapping with Hisat2
- The NGS reads of this project will be aligned against the reference genome sequence using Hisat2 (Kim, Langmead, and Salzberg 2015).
targets <- system.file("extdata", "targets.txt", package="systemPipeR")
dir_path <- system.file("extdata/cwl/hisat2-se", package="systemPipeR")
align <- loadWorkflow(targets=targets, wf_file="hisat2-mapping-se.cwl",
input_file="hisat2-mapping-se.yml", dir_path=dir_path)
align <- renderWF(align, inputvars=c(FileName="_FASTQ_PATH_", SampleName="_SampleName_"))
- Subsetting SYSargs2 instance slots for each workflow step:
subsetWF(align, slot="input", subset='FileName')[1:2]
#> M1A M1B
#> "./data/SRR446027_1.fastq.gz" "./data/SRR446028_1.fastq.gz"
subsetWF(align, slot="output", subset=1)[1:2]
#> M1A M1B
#> "results/M1A.sam" "results/M1B.sam"
subsetWF(align, slot="step", subset=1)[1] ## subset all the HISAT2 commandline
#> M1A
#> "hisat2 -S results/M1A.sam -x ./data/tair10.fasta -k 1 --min-intronlen 30 --max-intronlen 3000 -U ./data/SRR446027_1.fastq.gz --threads 4"
subsetWF(align, slot="output", subset=1, delete=TRUE)[1] ##DELETE
#> The subset cannot be deleted: no such file
#> M1A
#> "results/M1A.sam"Run on single machines
- Run the command-line tool, here HISAT2, on a single machine. The command-line tool needs to be installed for this.
cmdlist(align)[1:2]
system("hisat2-build ./data/tair10.fasta ./data/tair10.fasta")
runCommandline(align, make_bam = FALSE)
output(align)
- runCommandline, by default, auto-detects SAM file outputs and converts them to sorted and indexed BAM files, using the Rsamtools package internally (Morgan et al. 2019)
- runCommandline allows the user to create a dedicated results folder for each step in the workflow and a sub-folder for each sample defined in the targets file
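A short sketch combining the two behaviors just described; the make_bam and dir arguments are the same ones used elsewhere in these slides:
## Convert SAM outputs to sorted/indexed BAM files and write each sample's
## results to its own sub-folder under results/
runCommandline(align, make_bam=TRUE, dir=TRUE)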
Parallelization on clusters
- Submit command-line or R processes to a computer cluster with a queueing system (e.g. Slurm)
- The clusterRun function submits non-R command-line software to the queueing systems of clusters using run specifications defined by runCommandline
library(batchtools)
resources <- list(walltime=120, ntasks=1, ncpus=4, memory=1024)
reg <- clusterRun(align, FUN = runCommandline, more.args = list(dir=TRUE),
conffile = ".batchtools.conf.R", template = "batchtools.slurm.tmpl",
Njobs=18, runid="01", resourceList=resources)
getStatus(reg=reg)
output(align)
SYSargs2Pipe
- Check whether all BAM files have been created with the construction of SYSargs2Pipe
WF_track <- run_track(WF_ls = c(align))
names(WF_track)
WF_steps(WF_track)
track(WF_track)
summaryWF(WF_track)
# $`hisat2-mapping-se`
# $`hisat2-mapping-se`$Done
# [1] "Existing expected outputs files: 18"
#
# $`hisat2-mapping-se`$NotRun
# $`hisat2-mapping-se`$NotRun$Summary
# [1] "Missing expected outputs files: 0"
#
# $`hisat2-mapping-se`$NotRun$ListFiles
# character(0)
- Additional features to be developed here
Alignment Stats
- The following shows the alignment statistics for a sample file provided by the systemPipeR package
## FileName Nreads2x Nalign Perc_Aligned Nalign_Primary
## 1 M1A 192918 177961 92.24697 177961
## 2 M1B 197484 159378 80.70426 159378
## 3 A1A 189870 176055 92.72397 176055
## 4 A1B 188854 147768 78.24457 147768
## Perc_Aligned_Primary
## 1 92.24697
## 2 80.70426
## 3 92.72397
## 4 78.24457
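A minimal sketch of how such a table can be produced with systemPipeR's alignStats function; the call follows the package vignette and the output path is illustrative:
read_statsDF <- alignStats(args=align)
write.table(read_statsDF, "results/alignStats.xls",
            row.names=FALSE, quote=FALSE, sep="\t")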
Create new targets files
- To establish connectivity between different workflow instances, the outputs of a step can be written with the writeTargetsout function to a new targets file that serves as input to the next loadWorkflow and renderWF call
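An illustrative call for the HISAT2 step above; the argument names are assumptions based on the package documentation and may differ between systemPipeR releases:
## Write the HISAT2 output paths to a new targets file for the next step
writeTargetsout(x=align, file="targets_hisat2.txt", step=1,
                new_col="FileName", new_col_output_index=1, overwrite=TRUE)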
Read Quantification
- Read counting with summarizeOverlaps in parallel mode using multiple cores (BiocParallel)
- The read counting is performed for exonic gene regions in a non-strand-specific manner while ignoring overlaps among different genes
- The corresponding code is shown in the Additional section (Read counting with summarizeOverlaps in parallel mode using multiple cores)
## M1A M1B A1A A1B V1A V1B M6A M6B A6A A6B V6A V6B M12A M12B A12A
## AT1G01010 57 244 201 169 365 229 41 38 152 46 294 405 117 139 132
## AT1G01020 23 93 69 126 107 88 18 25 20 5 88 151 43 74 33
## AT1G01030 41 98 73 58 94 156 9 13 8 6 36 147 32 29 12
## AT1G01040 180 684 522 664 585 680 162 163 249 66 697 1060 338 604 360
## A12B V12A V12B
## AT1G01010 64 230 120
## AT1G01020 18 43 31
## AT1G01030 9 70 85
## AT1G01040 80 203 171
DEG analysis
- Fully automated for simple pairwise designs
- Supports edgeR and DESeq2
- The following shows the use of DESeq2 (Love, Huber, and Anders 2014) for any number of pairwise sample comparisons specified under the cmp argument
targetspath <- system.file("extdata", "targets.txt", package="systemPipeR")
targets <- read.delim(targetspath, comment="#")
cmp <- readComp(file=targetspath, format="matrix", delim="-")
countDFeBygpath <- system.file("extdata", "countDFeByg.xls", package="systemPipeR")
countDFeByg <- read.delim(countDFeBygpath, row.names=1)
degseqDF <- run_DESeq2(countDF=countDFeByg, targets=targets, cmp=cmp[[1]], independent=FALSE)
degseqDF[1:4,1:6]
## M1-A1_baseMean M1-A1_logFC M1-A1_lfcSE M1-A1_stat M1-A1_pvalue
## AT1G01010 145.25327 0.03975293 0.4624215 0.08596688 0.9314927
## AT1G01020 44.13612 -0.26585739 0.4340369 -0.61252257 0.5401921
## AT1G01030 44.26939 0.63339311 0.6246251 1.01403724 0.3105650
## AT1G01040 315.30420 -0.01585074 0.3126902 -0.05069151 0.9595713
## M1-A1_FDR
## AT1G01010 1
## AT1G01020 1
## AT1G01030 1
## AT1G01040 1
DEG analysis
- Filter and plot DEG results for up- and down-regulated genes; a sketch is shown below.
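A brief sketch using systemPipeR's filterDEGs function; the fold-change and FDR cutoffs are illustrative values taken from the package vignette:
DEG_list <- filterDEGs(degDF=degseqDF, filter=c(Fold=2, FDR=10))
names(DEG_list)   # components with the filtered up/down gene sets and a summary table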
Run Workflows with a Single Command
- All workflows can be run interactively from within R by executing the code in the provided *.Rmd template file
- Alternatively, it is possible to run entire workflows or their components from the command-line
- The following shows how to execute a chosen sample workflow (e.g., systemPipeRNAseq.Rmd) from the command-line (see the sketch after this list)
- During the evaluation of the R code, reports are dynamically auto-generated in PDF or HTML format.
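A minimal sketch of the two options just listed; the rmarkdown::render calls are assumptions based on standard R Markdown usage, and the output format may differ in the workflow documentation:
## From within R
rmarkdown::render("systemPipeRNAseq.Rmd")
## From the command-line (shell)
Rscript -e "rmarkdown::render('systemPipeRNAseq.Rmd')"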
Acknowledgement
- Girke Lab - University of California Riverside
- Funding Sources
- NSF ABI-1661152
- Bioconductor Travel Award
Questions
Additional
Advantages of systemPipeR
- Design of complex NGS workflows involving multiple R/Bioconductor packages
- Simplifies usage of command-line software from within R
- Accelerates runtime of workflows via parallelization on computer systems with multiple CPU cores and/or multiple compute nodes
- Automates generation of analysis reports improving reproducibility
- Common workflow interface for different NGS applications
- Makes NGS analysis more accessible to new users
CWL Design
Read quality filtering and trimming
The function preprocessReads allows applying predefined or custom read preprocessing functions to all FASTQ files referenced in a SYSargs container, such as quality filtering or adapter trimming routines. The following example performs adapter trimming with the trimLRPatterns function from the Biostrings package. After the trimming step, a new targets file is generated (here targets_trim.txt) containing the paths to the trimmed FASTQ files. The new targets file can be used for the next workflow step with an updated SYSargs instance, e.g. running the NGS alignments using the trimmed FASTQ files.
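A sketch of such a call, assuming a SYSargs object named args constructed for the trimming step; the adapter sequence is illustrative and taken from the package vignette:
## Apply trimLRPatterns to all FASTQ files referenced in args and
## write the paths of the trimmed files to a new targets file
preprocessReads(args=args, Fct="trimLRPatterns(Rpattern='GCCCGGGTAA', subject=fq)",
                batchsize=100000, overwrite=TRUE, compress=TRUE)
writeTargetsout(x=args, file="targets_trim.txt", overwrite=TRUE)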
FASTQ quality report
The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The results are written to a PDF file named fastqReport.pdf.
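A sketch following the systemPipeR vignette, assuming a SYSargs object args whose input FASTQ paths are returned by the infile1 accessor:
## Compute quality statistics for all FASTQ files and write the plots to a PDF
fqlist <- seeFastq(fastq=infile1(args), batchsize=10000, klength=8)
pdf("./results/fastqReport.pdf", height=18, width=4*length(fqlist))
seeFastqPlot(fqlist)
dev.off()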
Read counting with summarizeOverlaps in parallel mode using multiple cores
Reads overlapping with annotation ranges of interest are counted for each sample using the summarizeOverlaps function (Lawrence et al. 2013). The read counting is performed for exonic gene regions in a non-strand-specific manner while ignoring overlaps among different genes. The raw read count table (countDFeByg.xls) is written to a file in the results directory of this project. Parallelization is achieved with the BiocParallel package, here using two CPU cores (workers=2 in the code below).
library("GenomicFeatures"); library(BiocParallel)
args <- systemArgs(sysma="param/hisat2.param", mytargets="targets.txt")
txdb <- makeTxDbFromGFF(file="data/tair10.gff", format="gff",
dataSource="TAIR", organism="Arabidopsis thaliana")
saveDb(txdb, file="./data/tair10.sqlite")
txdb <- loadDb("./data/tair10.sqlite")
(align <- readGAlignments(outpaths(args)[1])) # Demonstrates how to read bam file into R
eByg <- exonsBy(txdb, by=c("gene"))
bfl <- BamFileList(outpaths(args), yieldSize=50000, index=character())
multicoreParam <- MulticoreParam(workers=2); register(multicoreParam)
registered()
counteByg <- bplapply(bfl, function(x) summarizeOverlaps(eByg, x, mode="Union",
ignore.strand=TRUE,
inter.feature=FALSE,
singleEnd=TRUE))
countDFeByg <- sapply(seq(along=counteByg),
function(x) assays(counteByg[[x]])$counts)
rownames(countDFeByg) <- names(rowRanges(counteByg[[1]]))
colnames(countDFeByg) <- names(bfl)
write.table(countDFeByg, "results/countDFeByg.xls", col.names=NA,
quote=FALSE, sep="\t")
Sample of data slice of count table
References
Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4): 357–60.
Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8): e1003118. https://doi.org/10.1371/journal.pcbi.1003118.
Love, Michael, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Genome Biol. 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.
Morgan, Martin, Hervé Pagès, Valerie Obenchain, and Nathaniel Hayden. 2019. Rsamtools: Binary Alignment (Bam), Fasta, Variant Call (Bcf), and Tabix File Import. http://bioconductor.org/packages/Rsamtools.