Build workflow interactively
To start, we use the same workflow instance like the last section.
sal <- SPRproject(logs.dir= ".SPRproject", sys.file=".SPRproject/SYSargsList.yml")
## Creating directory: /home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/data
## Creating directory: /home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/param
## Creating directory: /home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/results
## Creating directory '/home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/.SPRproject'
## Creating file '/home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/.SPRproject/SYSargsList.yml'
sal
## Instance of 'SYSargsList':
## No workflow steps added
Adding the first step
The first step is R code based, and we are splitting the iris
dataset by Species
and for each Species
will be saved on file. Please note that this code will
not be executed now; it is just store in the container for further execution.
This constructor function requires the step_name
and the R-based code under
the code
argument.
The R code should be enclosed by braces ({}
) and separated by a new line.
appendStep(sal) <- LineWise(code = {
mapply(function(x, y) write.csv(x, y),
split(iris, factor(iris$Species)),
file.path("results", paste0(names(split(iris, factor(iris$Species))), ".csv"))
)
},
step_name = "export_iris")
For a brief overview of the workflow, we can check the object as follows:
sal
## Instance of 'SYSargsList':
## WF Steps:
## 1. export_iris --> Status: Pending
##
Also, for printing and double-check the R code in the step, we can use the
codeLine
method:
codeLine(sal)
## export_iris
## mapply(function(x, y) write.csv(x, y), split(iris, factor(iris$Species)), file.path("results", paste0(names(split(iris, factor(iris$Species))), ".csv")))
Adding more steps
Next, an example of how to compress the exported files using
gzip
command-line.
The constructor function creates an SYSargsList
S4 class object using data from
three input files:
- CWL command-line specification file (`wf_file` argument);
- Input variables (`input_file` argument);
- Targets file (`targets` argument).
In CWL, files with the extension .cwl
define the parameters of a chosen
command-line step or workflow, while files with the extension .yml
define the
input variables of command-line steps.
The targets
file is optional for workflow steps lacking input
files. The connection
between input
variables and the targets
file is defined under the inputvars
argument. It is required a named vector
, where each element name needs to match
with column names in the targets
file, and the value must match the names of
the input
variables defined in the *.yml
files (see Figure ??).
A detailed description of the dynamic between input
variables and targets
files can be found here.
In addition, the CWL syntax overview can be found here.
Besides all the data form targets
, wf_file
, input_file
and dir_path
arguments,
SYSargsList
constructor function options include:
step_name
: a unique name for the step. This is not mandatory; however, it is highly recommended. If no name is provided, a defaultstep_x
, wherex
reflects the step index, will be added.dir
: this option allows creating an exclusive subdirectory for the step in the workflow. All the outfiles and log files for this particular step will be generated in the respective folders.dependency
: after the first step, all the additional steps appended to the workflow require the information of the dependency tree.
The appendStep<-
method is used to append a new step in the workflow.
targetspath <- system.file("extdata/cwl/gunzip", "targets_gunzip.txt", package = "systemPipeR")
appendStep(sal) <- SYSargsList(step_name = "gzip",
targets = targetspath, dir = TRUE,
wf_file = "gunzip/workflow_gzip.cwl", input_file = "gunzip/gzip.yml",
dir_path = system.file("extdata/cwl", package = "systemPipeR"),
inputvars = c(FileName = "_FILE_PATH_", SampleName = "_SampleName_"),
dependency = "export_iris")
Note: This will not work if the gzip
is not available on your system
(installed and exported to PATH) and may only work on Windows systems using PowerShell.
For a overview of the workflow, we can check the object as follows:
sal
## Instance of 'SYSargsList':
## WF Steps:
## 1. export_iris --> Status: Pending
## 2. gzip --> Status: Pending
## Total Files: 3 | Existing: 0 | Missing: 3
## 2.1. gzip
## cmdlist: 3 | Pending: 3
##
Note that we have two steps, and it is expected three files from the second step. Also, the workflow status is Pending, which means the workflow object is rendered in R; however, we did not execute the workflow yet. In addition to this summary, it can be observed this step has three command lines.
For more details about the command-line rendered for each target file, it can be checked as follows:
cmdlist(sal, step = "gzip")
## $gzip
## $gzip$SE
## $gzip$SE$gzip
## [1] "gzip -c results/setosa.csv > results/SE.csv.gz"
##
##
## $gzip$VE
## $gzip$VE$gzip
## [1] "gzip -c results/versicolor.csv > results/VE.csv.gz"
##
##
## $gzip$VI
## $gzip$VI$gzip
## [1] "gzip -c results/virginica.csv > results/VI.csv.gz"
Using the outfiles
for the next step
For building this step, all the previous procedures are being used to append the next step. However, here, we can observe power features that build the connectivity between steps in the workflow.
In this example, we would like to use the outfiles from gzip Step, as input from the next step, which is the gunzip. In this case, let’s look at the outfiles from the first step:
outfiles(sal)
## $export_iris
## DataFrame with 0 rows and 0 columns
##
## $gzip
## DataFrame with 3 rows and 1 column
## gzip_file
## <character>
## SE results/SE.csv.gz
## VE results/VE.csv.gz
## VI results/VI.csv.gz
The column we want to use is “gzip_file”. For the argument targets
in the
SYSargsList
function, it should provide the name of the correspondent step in
the Workflow and which outfiles
you would like to be incorporated in the next
step.
The argument inputvars
allows the connectivity between outfiles
and the
new targets
file. Here, the name of the previous outfiles
should be provided
it. Please note that all outfiles
column names must be unique.
It is possible to keep all the original columns from the targets
files or remove
some columns for a clean targets
file.
The argument rm_targets_col
provides this flexibility, where it is possible to
specify the names of the columns that should be removed. If no names are passing
here, the new columns will be appended.
appendStep(sal) <- SYSargsList(step_name = "gunzip",
targets = "gzip", dir = TRUE,
wf_file = "gunzip/workflow_gunzip.cwl", input_file = "gunzip/gunzip.yml",
dir_path = system.file("extdata/cwl", package = "systemPipeR"),
inputvars = c(gzip_file = "_FILE_PATH_", SampleName = "_SampleName_"),
rm_targets_col = "FileName",
dependency = "gzip")
We can check the targets automatically create for this step,
based on the previous outfiles
:
targetsWF(sal[3])
## $gunzip
## DataFrame with 3 rows and 2 columns
## gzip_file SampleName
## <character> <character>
## SE results/SE.csv.gz SE
## VE results/VE.csv.gz VE
## VI results/VI.csv.gz VI
We can also check all the expected outfiles
for this particular step, as follows:
outfiles(sal[3])
## $gunzip
## DataFrame with 3 rows and 1 column
## gunzip_file
## <character>
## SE results/SE.csv
## VE results/VE.csv
## VI results/VI.csv
Now, we can observe that the third step has been added and contains one substep.
sal
## Instance of 'SYSargsList':
## WF Steps:
## 1. export_iris --> Status: Pending
## 2. gzip --> Status: Pending
## Total Files: 3 | Existing: 0 | Missing: 3
## 2.1. gzip
## cmdlist: 3 | Pending: 3
## 3. gunzip --> Status: Pending
## Total Files: 3 | Existing: 0 | Missing: 3
## 3.1. gunzip
## cmdlist: 3 | Pending: 3
##
In addition, we can access all the command lines for each one of the substeps.
cmdlist(sal["gzip"], targets = 1)
## $gzip
## $gzip$SE
## $gzip$SE$gzip
## [1] "gzip -c results/setosa.csv > results/SE.csv.gz"
Getting data from a workflow instance
The final step in this simple workflow is an R code step. For that, we are using
the LineWise
constructor function as demonstrated above.
One interesting feature showed here is the getColumn
method that allows
extracting the information for a workflow instance. Those files can be used in
an R code, as demonstrated below.
getColumn(sal, step = "gunzip", 'outfiles')
## SE VE VI
## "results/SE.csv" "results/VE.csv" "results/VI.csv"
appendStep(sal) <- LineWise(code = {
df <- lapply(getColumn(sal, step = "gunzip", 'outfiles'), function(x) read.delim(x, sep = ",")[-1])
df <- do.call(rbind, df)
stats <- data.frame(cbind(mean = apply(df[,1:4], 2, mean), sd = apply(df[,1:4], 2, sd)))
stats$species <- rownames(stats)
plot <- ggplot2::ggplot(stats, ggplot2::aes(x = species, y = mean, fill = species)) +
ggplot2::geom_bar(stat = "identity", color = "black", position = ggplot2::position_dodge()) +
ggplot2::geom_errorbar(ggplot2::aes(ymin = mean-sd, ymax = mean+sd), width = .2, position = ggplot2::position_dodge(.9))
},
step_name = "iris_stats",
dependency = "gzip")
Session
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] systemPipeR_2.3.4 ShortRead_1.54.0
## [3] GenomicAlignments_1.32.0 SummarizedExperiment_1.26.1
## [5] Biobase_2.56.0 MatrixGenerics_1.8.0
## [7] matrixStats_0.62.0 BiocParallel_1.30.2
## [9] Rsamtools_2.12.0 Biostrings_2.64.0
## [11] XVector_0.36.0 GenomicRanges_1.48.0
## [13] GenomeInfoDb_1.32.2 IRanges_2.30.0
## [15] S4Vectors_0.34.0 BiocGenerics_0.42.0
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-45 png_0.1-7 assertthat_0.2.1
## [4] digest_0.6.29 utf8_1.2.2 R6_2.5.1
## [7] evaluate_0.15 ggplot2_3.3.6 blogdown_1.10
## [10] pillar_1.7.0 zlibbioc_1.42.0 rlang_1.0.2
## [13] rstudioapi_0.13 jquerylib_0.1.4 Matrix_1.4-1
## [16] rmarkdown_2.14 stringr_1.4.0 htmlwidgets_1.5.4
## [19] RCurl_1.98-1.6 munsell_0.5.0 DelayedArray_0.22.0
## [22] compiler_4.2.0 xfun_0.31 pkgconfig_2.0.3
## [25] htmltools_0.5.2 tidyselect_1.1.2 tibble_3.1.7
## [28] GenomeInfoDbData_1.2.8 bookdown_0.26 fansi_1.0.3
## [31] dplyr_1.0.9 crayon_1.5.1 bitops_1.0-7
## [34] grid_4.2.0 DBI_1.1.2 jsonlite_1.8.0
## [37] gtable_0.3.0 lifecycle_1.0.1 magrittr_2.0.3
## [40] scales_1.2.0 cli_3.3.0 stringi_1.7.6
## [43] hwriter_1.3.2.1 latticeExtra_0.6-29 bslib_0.3.1
## [46] generics_0.1.2 ellipsis_0.3.2 vctrs_0.4.1
## [49] RColorBrewer_1.1-3 tools_4.2.0 glue_1.6.2
## [52] purrr_0.3.4 jpeg_0.1-9 parallel_4.2.0
## [55] fastmap_1.1.0 yaml_2.3.5 colorspace_2.0-3
## [58] knitr_1.39 sass_0.4.1