Explore workflow instances

Earlier sections discussed how to run and manage a workflow. The SYSargsList class has many useful methods, some of which, such as appendStep and addResources, were introduced in previous sections. This section explores them in more detail. For quick help, run ?`SYSargsList-class` to open the help page.

suppressPackageStartupMessages({
    library(systemPipeR)
})

We continue to use the simple example workflow for demonstration.

sal <- SPRproject()
## Creating directory:  /home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/data 
## Creating directory:  /home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/param 
## Creating directory:  /home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/results 
## Creating directory '/home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/.SPRproject'
## Creating file '/home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/.SPRproject/SYSargsList.yml'
sal <- importWF(sal, file_path = system.file("extdata", "spr_simple_wf.Rmd", package = "systemPipeR"), verbose = FALSE)
sal
## Instance of 'SYSargsList': 
##     WF Steps:
##        1. load_library --> Status: Pending
##        2. export_iris --> Status: Pending
##        3. gzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          3.1. gzip
##              cmdlist: 3 | Pending: 3
##        4. gunzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          4.1. gunzip
##              cmdlist: 3 | Pending: 3
##        5. stats --> Status: Pending
##

Accessor Methods

systemPipeR provides several accessor methods and helper functions for exploring a SYSargsList workflow object.

Many of the accessor methods are named after the slots of the SYSargsList object, which can be listed with names():

names(sal)

## [1] "stepsWF"            "statusWF"           "targetsWF"         
## [4] "outfiles"           "SE"                 "dependency"        
## [7] "targets_connection" "projectInfo"        "runInfo"

Check the length of the workflow:

length(sal)
## [1] 5

Extract all steps of the workflow in a list:

stepsWF(sal)
## $load_library
## Instance of 'LineWise'
##     Code Chunk length: 1
## 
## $export_iris
## Instance of 'LineWise'
##     Code Chunk length: 1
## 
## $gzip
## Instance of 'SYSargs2':
##    Slot names/accessors: 
##       targets: 3 (SE...VI), targetsheader: 1 (lines)
##       modules: 0
##       wf: 1, clt: 1, yamlinput: 4 (inputs)
##       input: 3, output: 3
##       cmdlist: 3
##    Sub Steps:
##       1. gzip (rendered: TRUE)
## 
## 
## 
## $gunzip
## Instance of 'SYSargs2':
##    Slot names/accessors: 
##       targets: 3 (SE...VI), targetsheader: 1 (lines)
##       modules: 0
##       wf: 1, clt: 1, yamlinput: 4 (inputs)
##       input: 3, output: 3
##       cmdlist: 3
##    Sub Steps:
##       1. gunzip (rendered: TRUE)
## 
## 
## 
## $stats
## Instance of 'LineWise'
##     Code Chunk length: 5

Check the command line for each target sample:

The cmdlist() method returns the system commands for running command-line software, as specified by a given *.cwl file combined with the paths to the input samples (e.g. FASTQ files) provided by a targets file. The example below requests steps 2 and 3 for the first sample. Because export_iris is an R code step rather than a SYSargs2 step, it is dropped with a message, and only the gzip command is returned. Evaluating the output of cmdlist() is very helpful when designing and debugging *.cwl files for new command-line software or when changing the parameter settings of existing ones.

cmdlist(sal, step = c(2,3), targets = 1)
## export_iris step have been dropped because it is not a SYSargs2 object.
## $gzip
## $gzip$SE
## $gzip$SE$gzip
## [1] "gzip -c  results/setosa.csv > results/SE.csv.gz"
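
The printed output suggests that cmdlist() returns a nested list (step, then sample, then sub-step), so it can help to flatten it before inspecting or logging the commands. The sketch below uses a plain list copied by hand from the output above so that it runs on its own; in practice one would presumably pass the result of cmdlist(sal, ...) instead. The variable names `cmds` and `flat_cmds` are illustrative only.

```r
# Stand-in for the nested list printed by cmdlist() above
# (step -> sample -> sub-step); copied by hand for illustration.
cmds <- list(
    gzip = list(
        SE = list(gzip = "gzip -c  results/setosa.csv > results/SE.csv.gz")
    )
)

# unlist() collapses the nesting and joins the list names with dots,
# so each command keeps a "step.sample.substep" label.
flat_cmds <- unlist(cmds)
names(flat_cmds)

# Print one command per line, e.g. to paste into a log or a shell.
cat(flat_cmds, sep = "\n")
```

The dotted names make it easy to see which step and sample a given command belongs to when many commands are listed at once.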

Check the workflow status:

statusWF(sal)
## $load_library
## DataFrame with 1 row and 2 columns
##           Step status.summary
##    <character>    <character>
## 1 load_library        Pending
## 
## $export_iris
## DataFrame with 1 row and 2 columns
##          Step status.summary
##   <character>    <character>
## 1 export_iris        Pending
## 
## $gzip
## DataFrame with 3 rows and 5 columns
##        Targets Total_Files Existing_Files Missing_Files     gzip
##    <character>   <numeric>      <numeric>     <numeric> <matrix>
## SE          SE           1              0             1  Pending
## VE          VE           1              0             1  Pending
## VI          VI           1              0             1  Pending
## 
## $gunzip
## DataFrame with 3 rows and 5 columns
##        Targets Total_Files Existing_Files Missing_Files   gunzip
##    <character>   <numeric>      <numeric>     <numeric> <matrix>
## SE          SE           1              0             1  Pending
## VE          VE           1              0             1  Pending
## VI          VI           1              0             1  Pending
## 
## $stats
## DataFrame with 1 row and 2 columns
##          Step status.summary
##   <character>    <character>
## 1       stats        Pending

Check the workflow targets files:

targetsWF(sal[2])
## $export_iris
## DataFrame with 0 rows and 0 columns

Check the expected output files:

The outfiles component of SYSargsList defines the expected output files for each step in the workflow, some of which are the input for the next workflow step.

outfiles(sal[2])
## $export_iris
## DataFrame with 0 rows and 0 columns

Check the workflow dependencies:

dependency(sal)
## $load_library
## [1] NA
## 
## $export_iris
## [1] "load_library"
## 
## $gzip
## [1] "export_iris"
## 
## $gunzip
## [1] "gzip"
## 
## $stats
## [1] "gunzip"
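
Since dependency() returns a named list with one entry per step, it can be reshaped with base R to inspect the workflow graph. The following is a minimal sketch, assuming the list has the shape printed above; the list is retyped by hand here so the code runs without a project, and `edges` and `roots` are illustrative names rather than part of systemPipeR.

```r
# Stand-in for the named list returned by dependency(sal) above,
# copied by hand so the sketch runs without a project.
dep <- list(
    load_library = NA,
    export_iris  = "load_library",
    gzip         = "export_iris",
    gunzip       = "gzip",
    stats        = "gunzip"
)

# One row per (upstream, downstream) pair; steps without a dependency
# (NA) contribute no edge.
edges <- do.call(rbind, lapply(names(dep), function(step) {
    up <- dep[[step]]
    up <- up[!is.na(up)]
    if (length(up) == 0) return(NULL)
    data.frame(from = up, to = step, stringsAsFactors = FALSE)
}))
edges

# Entry points of the workflow: steps with no upstream dependency.
roots <- names(dep)[vapply(dep, function(x) all(is.na(x)), logical(1))]
roots
```

For this linear workflow the edge table has four rows and load_library is the only entry point.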

Check the sample comparisons:

Sample comparisons are defined in the header lines of the targets file starting with ‘# <CMP>’. This information can be accessed as follows:

targetsheader(sal, step = "gzip")
## $targetsheader
## [1] "# Project ID: SPR Example gzip"

This demo workflow does not have any comparison groups. In a real analysis, such as RNA-Seq or ChIP-Seq, comparisons can be defined in the header of the targets file.

Get the workflow step names:

stepName(sal)
## [1] "load_library" "export_iris"  "gzip"         "gunzip"       "stats"

Get the sample IDs for a particular step:

SampleName(sal, step = "gzip")
## [1] "SE" "VE" "VI"
SampleName(sal, step = "stats")
## NULL

Get a column of files from the `outfiles` or `targets` of a step:

getColumn(sal, "outfiles", step = "gzip", column = "gzip_file")
##                  SE                  VE                  VI 
## "results/SE.csv.gz" "results/VE.csv.gz" "results/VI.csv.gz"
getColumn(sal, "targetsWF", step = "gzip", column = "FileName")
##                       SE                       VE                       VI 
##     "results/setosa.csv" "results/versicolor.csv"  "results/virginica.csv"
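
Because getColumn() returns a character vector named by sample, the result can be fed straight into base R file utilities. The sketch below, which is only an illustration, checks which expected output files are not yet on disk. The vector is retyped from the output above so the code is self-contained; with a live project one would presumably use the getColumn() call itself. `gzip_out` and `missing_samples` are made-up names.

```r
# Stand-in for the named character vector returned by
# getColumn(sal, "outfiles", step = "gzip", column = "gzip_file") above.
gzip_out <- c(
    SE = "results/SE.csv.gz",
    VE = "results/VE.csv.gz",
    VI = "results/VI.csv.gz"
)

# Samples whose expected output file does not exist yet
# (paths are resolved relative to the current working directory).
missing_samples <- names(gzip_out)[!file.exists(gzip_out)]
missing_samples
```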

Get the R code for a `LineWise` step:

codeLine(sal, step = "export_iris")
## export_iris
##     mapply(function(x, y) write.csv(x, y), split(iris, factor(iris$Species)), file.path("results", paste0(names(split(iris, factor(iris$Species))), ".csv")))

View all the objects in the running environment:

viewEnvir(sal)
## <environment: 0x55761c0b2b20>
## character(0)

Copy one or multiple objects from the running environment to a new environment:

copyEnvir(sal, list = c("plot"), new.env = globalenv(), silent = FALSE)
## <environment: 0x55761c0b2b20>
## Copying to 'new.env': 
## plot

Accessing the `*.yml` data

yamlinput(sal, step = "gzip")
## $file
## $file$class
## [1] "File"
## 
## $file$path
## [1] "_FILE_PATH_"
## 
## 
## $SampleName
## [1] "_SampleName_"
## 
## $ext
## [1] "csv.gz"
## 
## $results_path
## $results_path$class
## [1] "Directory"
## 
## $results_path$path
## [1] "./results"

Subsetting the workflow details

The `SYSargsList` class and its subsetting operator `[`:

sal[1]
## Instance of 'SYSargsList': 
##     WF Steps:
##        1. load_library --> Status: Pending
## 
sal[1:3]
## Instance of 'SYSargsList': 
##     WF Steps:
##        1. load_library --> Status: Pending
##        2. export_iris --> Status: Pending
##        3. gzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          3.1. gzip
##              cmdlist: 3 | Pending: 3
## 
sal[c(1,3)]
## Instance of 'SYSargsList': 
##     WF Steps:
##        1. load_library --> Status: Pending
##        2. gzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          2.1. gzip
##              cmdlist: 3 | Pending: 3
##

Alternatively, subset with a character vector of step names:

sal[c("gzip", "stats")]
## Instance of 'SYSargsList': 
##     WF Steps:
##        1. gzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          1.1. gzip
##              cmdlist: 3 | Pending: 3
##        2. stats --> Status: Pending
##

The `SYSargsList` class and its subsetting by steps and input samples:

sal_sub <- subset(sal, subset_steps = c(3,4), input_targets = ("SE"), keep_steps = TRUE)
stepsWF(sal_sub)
## $load_library
## Instance of 'LineWise'
##     Code Chunk length: 1
## 
## $export_iris
## Instance of 'LineWise'
##     Code Chunk length: 1
## 
## $gzip
## Instance of 'SYSargs2':
##    Slot names/accessors: 
##       targets: 1 (SE...SE), targetsheader: 1 (lines)
##       modules: 0
##       wf: 1, clt: 1, yamlinput: 4 (inputs)
##       input: 1, output: 1
##       cmdlist: 1
##    Sub Steps:
##       1. gzip (rendered: TRUE)
## 
## 
## 
## $gunzip
## Instance of 'SYSargs2':
##    Slot names/accessors: 
##       targets: 1 (SE...SE), targetsheader: 1 (lines)
##       modules: 0
##       wf: 1, clt: 1, yamlinput: 4 (inputs)
##       input: 1, output: 1
##       cmdlist: 1
##    Sub Steps:
##       1. gunzip (rendered: TRUE)
## 
## 
## 
## $stats
## Instance of 'LineWise'
##     Code Chunk length: 5
targetsWF(sal_sub)
## $load_library
## DataFrame with 0 rows and 0 columns
## 
## $export_iris
## DataFrame with 0 rows and 0 columns
## 
## $gzip
## DataFrame with 1 row and 2 columns
##              FileName  SampleName
##           <character> <character>
## SE results/setosa.csv          SE
## 
## $gunzip
## DataFrame with 1 row and 2 columns
##            gzip_file  SampleName
##          <character> <character>
## SE results/SE.csv.gz          SE
## 
## $stats
## DataFrame with 0 rows and 0 columns
outfiles(sal_sub)
## $load_library
## DataFrame with 0 rows and 0 columns
## 
## $export_iris
## DataFrame with 0 rows and 0 columns
## 
## $gzip
## DataFrame with 1 row and 1 column
##           gzip_file
##         <character>
## 1 results/SE.csv.gz
## 
## $gunzip
## DataFrame with 1 row and 1 column
##      gunzip_file
##      <character>
## 1 results/SE.csv
## 
## $stats
## DataFrame with 0 rows and 0 columns

In this way, only sample SE is selected to run in steps 3 (gzip) and 4 (gunzip). The other samples in these steps are removed.

The `SYSargsList` class and its operator `+`

sal[1] + sal[2] + sal[3]
## Instance of 'SYSargsList': 
##     WF Steps:
##        1. load_library --> Status: Pending
##        2. export_iris --> Status: Pending
##        3. gzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          3.1. gzip
##              cmdlist: 3 | Pending: 3
##

Replacement Methods

Update an `input` parameter in the workflow

sal_c <- sal
## check values
yamlinput(sal_c, step = "gzip")
## $file
## $file$class
## [1] "File"
## 
## $file$path
## [1] "_FILE_PATH_"
## 
## 
## $SampleName
## [1] "_SampleName_"
## 
## $ext
## [1] "csv.gz"
## 
## $results_path
## $results_path$class
## [1] "Directory"
## 
## $results_path$path
## [1] "./results"
## check on command-line
cmdlist(sal_c, step = "gzip", targets = 1)
## $gzip
## $gzip$SE
## $gzip$SE$gzip
## [1] "gzip -c  results/setosa.csv > results/SE.csv.gz"
## Replace
yamlinput(sal_c, step = "gzip", paramName = "ext") <- "txt.gz"

## check NEW values
yamlinput(sal_c, step = "gzip")
## $file
## $file$class
## [1] "File"
## 
## $file$path
## [1] "_FILE_PATH_"
## 
## 
## $SampleName
## [1] "_SampleName_"
## 
## $ext
## [1] "txt.gz"
## 
## $results_path
## $results_path$class
## [1] "Directory"
## 
## $results_path$path
## [1] "./results"
## Check on command-line
cmdlist(sal_c, step = "gzip", targets = 1)
## $gzip
## $gzip$SE
## $gzip$SE$gzip
## [1] "gzip -c  results/setosa.csv > results/SE.txt.gz"

Here the yaml input "ext" is changed from "csv.gz" to "txt.gz", and the rendered command changes accordingly. This makes it easy to tweak the command-line parameters of individual steps. For example, suppose a step trains a machine learning model, the results are not ideal, and we want to increase the number of iterations. We can change the iteration parameter in the same way and then call runWF(.., steps = "train_model", force = TRUE) to rerun only this step, instead of rebuilding the whole workflow and rerunning all steps.
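
The update-then-rerun pattern can be wrapped in a small helper. This is a sketch under stated assumptions, not part of systemPipeR: `tune_and_rerun` is a made-up function name, and the step name "train_model" and parameter name "iterations" in the usage comment are hypothetical, since the demo workflow has no such step. It relies only on the yamlinput() replacement method and the runWF() arguments shown in this section.

```r
# Assumes systemPipeR is attached (library(systemPipeR)), as at the top
# of this section. Defining the helper does not touch any project.
tune_and_rerun <- function(sal, step, param, value) {
    # Replace a single yaml input value for the chosen step.
    yamlinput(sal, step = step, paramName = param) <- value
    # Rerun only that step, even if it has already been executed.
    runWF(sal, steps = step, force = TRUE)
}

# Usage (not run here; step and parameter names are hypothetical):
# sal <- tune_and_rerun(sal, step = "train_model",
#                       param = "iterations", value = 500)
```

Working on a copy of the workflow object first, as done with sal_c above, keeps the original parameters available for comparison.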

Append and Replacement methods for R Code Steps

appendCodeLine(sal_c, step = "export_iris", after = 1) <- "log_cal_100 <- log(100)"
codeLine(sal_c, step = "export_iris")
## export_iris
##     mapply(function(x, y) write.csv(x, y), split(iris, factor(iris$Species)), file.path("results", paste0(names(split(iris, factor(iris$Species))), ".csv")))
##     log_cal_100 <- log(100)

replaceCodeLine(sal_c, step="export_iris", line = 2) <- LineWise(code={
                    log_cal_100 <- log(50)
                    })
codeLine(sal_c, step = 1)
## load_library
##     library(systemPipeR)

Rename a Step

renameStep(sal_c, step = 1) <- "newStep"
renameStep(sal_c, c(1, 2)) <- c("newStep2", "newIndex")
sal_c
## Instance of 'SYSargsList': 
##     WF Steps:
##        1. newStep2 --> Status: Pending
##        2. newIndex --> Status: Pending
##        3. gzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          3.1. gzip
##              cmdlist: 3 | Pending: 3
##        4. gunzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          4.1. gunzip
##              cmdlist: 3 | Pending: 3
##        5. stats --> Status: Pending
## 
names(outfiles(sal_c))
## [1] "newStep2" "newIndex" "gzip"     "gunzip"   "stats"
names(targetsWF(sal_c))
## [1] "newStep2" "newIndex" "gzip"     "gunzip"   "stats"
dependency(sal_c)
## $newStep2
## [1] NA
## 
## $newIndex
## [1] "newStep2"
## 
## $gzip
## [1] "newIndex"
## 
## $gunzip
## [1] "gzip"
## 
## $stats
## [1] "gunzip"

Replace a Step

sal_test <- sal[c(1,2)]
replaceStep(sal_test, step = 1, step_name = "gunzip" ) <- sal[3]
sal_test
## Instance of 'SYSargsList': 
##     WF Steps:
##        1. gunzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          1.1. gzip
##              cmdlist: 3 | Pending: 3
##        2. export_iris --> Status: Pending
##

Note: Use this method with caution, because it can break the dependency graph of the workflow.
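
One way to spot a broken dependency graph is to look for dependencies that point to step names no longer present in the workflow. The helper below is a base R sketch; `dangling_dependencies` is not a systemPipeR function, and the `dep_after` list is hypothetical, since this section does not show what dependency() returns after replaceStep. With a live object one would presumably pass dependency(sal_test).

```r
# Return, per step, any declared dependency that is not itself a step name.
dangling_dependencies <- function(dep) {
    known <- names(dep)
    bad <- lapply(dep, function(up) {
        up <- up[!is.na(up)]   # NA means "no dependency"
        up[!up %in% known]     # keep only names that cannot be resolved
    })
    bad[lengths(bad) > 0]
}

# Hypothetical dependency list after a step was replaced and renamed:
# export_iris still points at a step that is no longer in the workflow.
dep_after <- list(gunzip = NA, export_iris = "load_library")
dangling_dependencies(dep_after)
```

An empty result means every declared dependency can be resolved to a step in the workflow.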

Removing a Step

sal_test <- sal[-2]
sal_test
## Instance of 'SYSargsList': 
##     WF Steps:
##        1. load_library --> Status: Pending
##        2. gzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          2.1. gzip
##              cmdlist: 3 | Pending: 3
##        3. gunzip --> Status: Pending 
##            Total Files: 3 | Existing: 0 | Missing: 3 
##          3.1. gunzip
##              cmdlist: 3 | Pending: 3
##        4. stats --> Status: Pending
##

Session

sessionInfo()

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] systemPipeR_2.3.4           ShortRead_1.54.0           
##  [3] GenomicAlignments_1.32.0    SummarizedExperiment_1.26.1
##  [5] Biobase_2.56.0              MatrixGenerics_1.8.0       
##  [7] matrixStats_0.62.0          BiocParallel_1.30.2        
##  [9] Rsamtools_2.12.0            Biostrings_2.64.0          
## [11] XVector_0.36.0              GenomicRanges_1.48.0       
## [13] GenomeInfoDb_1.32.2         IRanges_2.30.0             
## [15] S4Vectors_0.34.0            BiocGenerics_0.42.0        
## 
## loaded via a namespace (and not attached):
##  [1] lattice_0.20-45        png_0.1-7              assertthat_0.2.1      
##  [4] digest_0.6.29          utf8_1.2.2             R6_2.5.1              
##  [7] evaluate_0.15          ggplot2_3.3.6          blogdown_1.10         
## [10] pillar_1.7.0           zlibbioc_1.42.0        rlang_1.0.2           
## [13] rstudioapi_0.13        jquerylib_0.1.4        Matrix_1.4-1          
## [16] rmarkdown_2.14         stringr_1.4.0          htmlwidgets_1.5.4     
## [19] RCurl_1.98-1.6         munsell_0.5.0          DelayedArray_0.22.0   
## [22] compiler_4.2.0         xfun_0.31              pkgconfig_2.0.3       
## [25] htmltools_0.5.2        tidyselect_1.1.2       tibble_3.1.7          
## [28] GenomeInfoDbData_1.2.8 bookdown_0.26          fansi_1.0.3           
## [31] dplyr_1.0.9            crayon_1.5.1           bitops_1.0-7          
## [34] grid_4.2.0             DBI_1.1.2              jsonlite_1.8.0        
## [37] gtable_0.3.0           lifecycle_1.0.1        magrittr_2.0.3        
## [40] scales_1.2.0           cli_3.3.0              stringi_1.7.6         
## [43] hwriter_1.3.2.1        latticeExtra_0.6-29    bslib_0.3.1           
## [46] generics_0.1.2         ellipsis_0.3.2         vctrs_0.4.1           
## [49] RColorBrewer_1.1-3     tools_4.2.0            glue_1.6.2            
## [52] purrr_0.3.4            jpeg_0.1-9             parallel_4.2.0        
## [55] fastmap_1.1.0          yaml_2.3.5             colorspace_2.0-3      
## [58] knitr_1.39             sass_0.4.1

Last modified 2022-06-03 : new SPR version docs no_render (cba777c6)