Introduction
Note: if you use {
systemPipeR
} in published research, please cite: Backman, T.W.H and Girke, T. (2016). systemPipeR: NGS Workflow and Report Generation Environment. BMC Bioinformatics, 17: 388. >10.1186/s12859-016-1241-0.
systemPipeR
(SPR)
provides flexible utilities for designing, building and running automated end-to-end
analysis workflows for a wide range of research applications, including
next-generation sequencing (NGS) experiments, such as RNA-Seq, ChIP-Seq,
VAR-Seq and Ribo-Seq (H Backman and Girke 2016). Important features include a uniform
workflow interface across different data analysis applications, automated
report generation, and support for running both R and command-line software,
such as NGS aligners or peak/variant callers, on local computers or compute
clusters (Figure 1). The latter supports interactive job submissions and batch submissions to queuing systems of clusters.
systemPipeR
has been designed to improve the reproducibility of large-scale data
analysis projects while substantially reducing the time it takes to analyze
complex omics data sets. It provides a uniform workflow interface and management
system that allows the users to run selected workflow steps, as well as customize and design entirely new workflows. Additionally, the package take advantage of central community S4
classes of the Bioconductor ecosystem, and enhances them with command-line software support.
The main motivation and advantages of using systemPipeR
for complex data analysis tasks are:
- Design of complex workflows involving multiple R/Bioconductor packages
- Common workflow interface for different applications
- User-friendly access to widely used Bioconductor utilities
- Support of command-line software from within R
- Reduced complexity of using compute clusters from R
- Accelerated runtime of workflows via parallelization on computer systems with multiple CPU cores and/or multiple nodes
- Improved reproducibility by automating the generation of analysis reports
A central concept for designing workflows within the systemPipeR
environment is the use of workflow management containers. Workflow management
containers facilitate the design and execution of complex data
analysis steps. For its command-line interface systemPipeR
adopts the
widely used Common Workflow Language (CWL)
(Amstutz et al. 2016). The interface to CWL is established by systemPipeR's
workflow control class called SYSargsList
(see Figure 2). This
design offers many advantages such as: (i) options to run workflows either
entirely from within R, from various command-line wrappers (e.g., cwl-runner)
or from other languages (, e.g., Bash or Python). Apart from providing
support for both command-line and R/Bioconductor software, the package provides
utilities for containerization, parallel evaluations on computer clusters and
automated generation of interactive analysis reports.
An important feature of systemPipeR's
CWL interface is that it provides two
options to run command-line tools and workflows based on CWL. First, one can
run CWL in its native way via an R-based wrapper utility for cwl-runner or
cwl-tools (CWL-based approach). Second, one can run workflows using CWL’s
command-line and workflow instructions from within R (R-based approach). In the
latter case the same CWL workflow definition files (e.g. *.cwl
and *.yml
)
are used but rendered and executed entirely with R functions defined by
systemPipeR
, and thus use CWL mainly as a command-line and workflow
definition format rather than execution software to run workflows. In this regard
systemPipeR
also provides several convenience functions that are useful for
designing and debugging workflows, such as a command-line rendering function to
retrieve the exact command-line strings for each data set and processing step
prior to running a command-line.
Workflow Management with SYSargsList
The SYSargsList
S4 class is a list-like container that stores the paths to
all input and output files along with the corresponding parameters used in each
analysis step (see Figure below). SYSargsList
instances are constructed from an
optional targets files, and two CWL parameter files including *.cwl
and
*.yml
(for details, see below). When running preconfigured NGS workflows, the
only input the user needs to provide is the initial targets file containing the
paths to the input files (e.g., FASTQ) and experiment design information, such
as sample labels and biological replicates. Subsequent targets instances are
created automatically, based on the connectivity establish between each
workflow step. SYSargsList
containers store all information required for
one or multiple steps. This establishes central control for running, monitoring
and debugging complex workflows from start to finish.
Workflow steps with input/output file operations are controlled by the SYSargsList
container. Each of its components (SYSargs2
) are constructed from an optional targets and two param files. Alternatively, LineWise
instances containing pure R code can be used.
Command-line software support
systemPipeR
adopted the widely used community standard Common Workflow
Language (CWL) (Amstutz et al. 2016) for describing command-line tools and workflows
in a declarative, generic, and reproducible manner. CWL specifications are
text-based files and can be structured using YAML (https://yaml.org/) syntax.
Therefore, the description files can be easily accessed and are readable.
The significant advantage of adopting CWL as a standard description of
command-line tools within is the flexibility of workflow
reusability for different computing environments and workflow frameworks in the
community, improving reproducibility, portability, and shareability between
collaborators and community.
Following the CWL Command Line Tool Description Specification (https://www.commonwl.org/v1.2/CommandLineTool.html), the basic elements of the CWL tool description are defined in two files. Figure 3A-B illustrate the “hello world” example. The main file contains all the information necessary to build the command-line that will be executed, specifying the input, expected output files, and arguments for the command-line (Figure 3A). The second file is optional yet provides flexibility to assign values to parameters required to make the input or output objects when building the command-line.
Reference
Amstutz, Peter, Michael R Crusoe, Nebojša Tijanić, Brad Chapman, John Chilton, Michael Heuer, Andrey Kartashov, et al. 2016. “Common Workflow Language, V1.0,” July. https://doi.org/10.6084/m9.figshare.3115156.v2.
H Backman, Tyler W, and Thomas Girke. 2016. “systemPipeR: NGS workflow and report generation environment.” BMC Bioinformatics 17 (1): 388. https://doi.org/10.1186/s12859-016-1241-0.