Visualize workflows
suppressPackageStartupMessages({
library(systemPipeR)
})
In the last section, we learned how to run and manage workflows. In this section, we will learn advanced options for visualizing workflows.
First let’s set up the workflow using the example workflow template. For real production purposes, we recommend checking out the more complex templates over here.
Dependency graph
The workflow plot is also called the dependency graph. It shows users how one step depends on another. This is very important in SPR: a step will not be run unless all of its dependencies have been executed successfully.
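The dependency rule can be illustrated with a minimal base R sketch. It does not use any SPR internals: the `deps` table, the `status` vector, and the `can_run()` helper are hypothetical, and the dependencies are assumed for illustration only.

```r
# Hypothetical dependency table: each step lists the steps it depends on.
# Step names mirror the simple workflow; the dependencies are assumed.
deps <- list(
  load_library = character(0),
  export_iris  = "load_library",
  gzip         = "export_iris",
  gunzip       = "gzip",
  stats        = "gunzip"
)

# Hypothetical run status of each step.
status <- c(load_library = "Success", export_iris = "Success",
            gzip = "Failed", gunzip = "Pending", stats = "Pending")

# A step may run only if every one of its dependencies succeeded.
# A step without dependencies is always allowed to run.
can_run <- function(step, deps, status) {
  all(status[deps[[step]]] == "Success")
}

runnable <- vapply(names(deps), can_run, logical(1),
                   deps = deps, status = status)
runnable
```

In this sketch, gunzip and stats are not runnable because gzip, which they depend on directly or indirectly, did not succeed.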
To understand a workflow, we can simply print the sal object to the console:
sal
## Instance of 'SYSargsList':
## WF Steps:
## 1. load_library --> Status: Success
## 2. export_iris --> Status: Success
## 3. gzip --> Status: Success
## Total Files: 3 | Existing: 3 | Missing: 0
## 3.1. gzip
## cmdlist: 3 | Success: 3
## 4. gunzip --> Status: Success
## Total Files: 3 | Existing: 3 | Missing: 0
## 4.1. gunzip
## cmdlist: 3 | Success: 3
## 5. stats --> Status: Success
However, when the workflow becomes very long and complex, the relations between steps are hard to see from the console. The workflow plot is a useful tool for understanding the workflow.
For example, the VARseq workflow is complex. We can load and print it with:
systemPipeRdata::genWorkenvir("varseq")
setwd("varseq")
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeVARseq.Rmd")
sal
## Instance of 'SYSargsList':
## WF Steps:
## 1. load_SPR --> Status: Pending
## 2. fastq_report_pre --> Status: Pending
## 3. trimmomatic --> Status: Pending
## Total Files: 32 | Existing: 0 | Missing: 32
## 3.1. trimmomatic-pe
## cmdlist: 8 | Pending: 8
## 4. preprocessing --> Status: Pending
## Total Files: 16 | Existing: 0 | Missing: 16
## 4.1. preprocessReads-pe
## cmdlist: 8 | Pending: 8
## 5. fastq_report_pos --> Status: Pending
## 6. bwa_index --> Status: Pending
## Total Files: 5 | Existing: 0 | Missing: 5
## 6.1. bwa_index_-a_bwtsw
## cmdlist: 1 | Pending: 1
## 7. fasta_index --> Status: Pending
## Total Files: 1 | Existing: 1 | Missing: 0
## 7.1. gatk_CreateSequenceDictionary
## cmdlist: 1 | Pending: 1
## 8. faidx_index --> Status: Pending
## Total Files: 1 | Existing: 1 | Missing: 0
## 8.1. samtools_faidx
## cmdlist: 1 | Pending: 1
## 9. bwa_alignment --> Status: Pending
## Total Files: 32 | Existing: 0 | Missing: 32
## 9.1. bwa_mem
## cmdlist: 8 | Pending: 8
## 9.2. samtools-view
## cmdlist: 8 | Pending: 8
## 9.3. samtools-sort
## cmdlist: 8 | Pending: 8
## 9.4. samtools-index
## cmdlist: 8 | Pending: 8
## 10. align_stats --> Status: Pending
## 11. bam_urls --> Status: Pending
## 12. fastq2ubam --> Status: Pending
## Total Files: 8 | Existing: 0 | Missing: 8
## 12.1. gatk
## cmdlist: 8 | Pending: 8
## 13. merge_bam --> Status: Pending
## Total Files: 8 | Existing: 0 | Missing: 8
## 13.1. gatk
## cmdlist: 8 | Pending: 8
## 14. sort --> Status: Pending
## Total Files: 8 | Existing: 0 | Missing: 8
## 14.1. gatk
## cmdlist: 8 | Pending: 8
## 15. mark_dup --> Status: Pending
## Total Files: 16 | Existing: 0 | Missing: 16
## 15.1. gatk
## cmdlist: 8 | Pending: 8
## 16. fix_tag --> Status: Pending
## Total Files: 16 | Existing: 0 | Missing: 16
## 16.1. gatk
## cmdlist: 8 | Pending: 8
## 17. hap_caller --> Status: Pending
## Total Files: 16 | Existing: 0 | Missing: 16
## 17.1. gatk
## cmdlist: 8 | Pending: 8
## 18. import --> Status: Pending
## Total Files: 1 | Existing: 0 | Missing: 1
## 18.1. bash
## cmdlist: 1 | Pending: 1
## 19. call_variants --> Status: Pending
## Total Files: 2 | Existing: 0 | Missing: 2
## 19.1. gatk
## cmdlist: 1 | Pending: 1
## 20. filter --> Status: Pending
## Total Files: 2 | Existing: 0 | Missing: 2
## 20.1. bash
## cmdlist: 1 | Pending: 1
## 21. create_vcf --> Status: Pending
## Total Files: 8 | Existing: 0 | Missing: 8
## 21.1. bcftools
## cmdlist: 8 | Pending: 8
## 22. create_vcf_BCFtool --> Status: Pending
## Total Files: 40 | Existing: 0 | Missing: 40
## 22.1. mark
## cmdlist: 8 | Pending: 8
## 22.2. sort
## cmdlist: 8 | Pending: 8
## 22.3. index
## cmdlist: 8 | Pending: 8
## 22.4. raw_call
## cmdlist: 8 | Pending: 8
## 22.5. vcf_call
## cmdlist: 8 | Pending: 8
## 23. filter_vcf --> Status: Pending
## 24. filter_vcf_BCFtools --> Status: Pending
## 25. annotate_vcf --> Status: Pending
## 26. combine_var --> Status: Pending
## 27. summary_var --> Status: Pending
## 28. venn_diagram --> Status: Pending
## 29. plot_variant --> Status: Pending
##
Directly printing the sal object as above does not give us the dependencies between steps and it is hard to see the full picture. Here, we can use plotWF to visualize the full workflow.
plotWF(sal)
Advanced use
The VARseq workflow is too large and complex for demonstrating the plotting options, so here we go back to the simple workflow.
sal <- SPRproject()
## Creating directory: /home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/data
## Creating directory: /home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/param
## Creating directory: /home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/results
## Creating directory '/home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/.SPRproject'
## Creating file '/home/lab/Desktop/spr/systemPipeR.github.io/content/en/sp/spr/sp_run/.SPRproject/SYSargsList.yml'
sal <- importWF(sal, file_path = system.file("extdata", "spr_simple_wf.Rmd", package = "systemPipeR"), verbose = FALSE)
plotWF(sal, rstudio = TRUE)
Rstudio
By default, the plot is opened in a new browser tab, because workflows can be very large and long, and viewing them in the small Rstudio window is not ideal. This is controlled by the rstudio argument, which defaults to rstudio = FALSE. It determines whether to open the plot in a new browser tab or to view it inside the current tool, for example an IDE like Rstudio.
If you prefer to view it in the built-in viewer, or you are rendering R markdown from an interactive session where opening a new browser tab is not wanted, rstudio = TRUE must be added.
Height and width
The workflow plot height and width are adjustable with height and width. They can take any valid CSS unit. By default, the plot takes 100% of the parent element width, and the height is calculated automatically as needed. Sometimes these fraction-based or automatically generated sizes are not right, so we can set them manually:
plotWF(sal, width = "50%", rstudio = TRUE)
plotWF(sal, height = "300px")
Color and text
On the plot, different colors and numbers indicate different statuses. This information can also be found in the plot legend.
Shapes:
- circular steps: pure R code steps
- rounded square steps: sysargs steps, steps that will invoke command-line calls
- blue colored steps and arrows: main branch (see main branch section below)
Step colors
- Black: pending steps
- Green: successful steps, all pass
- Orange: successful steps, but some samples have warnings
- Red: failed steps, at least one sample failed
Number and colors
There are 4 numbers in the second row of each step, separated by /
- First No.: number of passed samples
- Second No.: number of warning samples
- Third No.: number of error samples
- Fourth No.: number of total samples
Duration
The duration is shown after the sample information and tells how long it took to run the step.
Units are seconds (s), minutes (m), or hours (h).
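As a rough illustration of how the four sample numbers and the duration could be combined into one label, here is a small base R sketch. The helpers `format_duration()` and `step_label()`, the unit thresholds, and the "; " separator are assumptions made for this example; they are not taken from the plotWF source.

```r
# Hypothetical helper: express a duration in seconds, minutes, or hours.
# The 60 s and 3600 s thresholds are assumed for illustration.
format_duration <- function(seconds) {
  if (seconds < 60) {
    paste0(round(seconds, 1), "s")
  } else if (seconds < 3600) {
    paste0(round(seconds / 60, 1), "m")
  } else {
    paste0(round(seconds / 3600, 1), "h")
  }
}

# Hypothetical helper: passed/warning/error/total, followed by the duration.
step_label <- function(pass, warn, error, total, seconds) {
  paste0(pass, "/", warn, "/", error, "/", total, "; ",
         format_duration(seconds))
}

step_label(3, 0, 0, 3, 42)   # "3/0/0/3; 42s"
format_duration(90)          # "1.5m"
format_duration(7200)        # "2h"
```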
To see these colors and numbers in action, let’s append a warning step and an error step to sal:
appendStep(sal) <- LineWise(step_name = "warning_step", {warning("this creates a warning")}, dependency = "stats")
appendStep(sal) <- LineWise(step_name = "error_step", {stop("this creates an error")}, dependency = "stats")
sal
## Instance of 'SYSargsList':
## WF Steps:
## 1. load_library --> Status: Pending
## 2. export_iris --> Status: Pending
## 3. gzip --> Status: Pending
## Total Files: 3 | Existing: 0 | Missing: 3
## 3.1. gzip
## cmdlist: 3 | Pending: 3
## 4. gunzip --> Status: Pending
## Total Files: 3 | Existing: 0 | Missing: 3
## 4.1. gunzip
## cmdlist: 3 | Pending: 3
## 5. stats --> Status: Pending
## 6. warning_step --> Status: Pending
## 7. error_step --> Status: Pending
##
sal <- runWF(sal)
Then let’s plot it:
plotWF(sal, width = "80%")
Do you see the color difference?
On hover
By default, plotWF uses SVG to make the plot, so it is interactive. When the mouse hovers over a step, detailed information is displayed, such as sample information, processing time, duration, etc.
Embedding
In addition to SVG embedding, PNG embedding is supported. The plot will no longer be interactive, but this is a good option for browsers without optimal SVG support.
plotWF(sal, plot_method = "png", width = "80%")
If we right-click on the SVG and PNG plots, we can see that SVGs are not directly savable, but PNGs are. However, PNGs are not vectorized, so they become blurry when we zoom in.
Responsiveness
This is a term often used in web development. It describes whether the plot resizes itself when the user resizes the document window. By default, plotWF is responsive, meaning it fits the current window container size and adjusts its size once the window size changes. To always display the full-sized plot, use responsive = FALSE. This is useful for embedding the plot in a full-screen mode.
plotWF(sal, responsive = FALSE, width = "80%")
Now resize your window width and compare the plot above with the other plots.
Pan-zoom
The pan-zoom option enables users to drag the plot instead of scrolling, and to use the mouse wheel to zoom in and out. If you do not like the scroll bars that come with responsive = FALSE, try this option. Note that it cannot be used together with responsive = TRUE. If both are TRUE, responsive will automatically be set to FALSE. An internet connection is required for this feature, because the Javascript libraries are downloaded on the fly.
plotWF(sal, pan_zoom = TRUE)
## Warning in plotWF(sal, pan_zoom = TRUE): Pan-zoom and responsive cannot be used
## together. Pan-zoom has priority, now `responsive` has set to FALSE
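The precedence rule behind this warning can be written down as a small base R sketch. `resolve_plot_options()` is a hypothetical helper written for this example, not a systemPipeR function, and its warning text is only an approximation of the one shown above.

```r
# Hypothetical helper that mimics the documented rule:
# pan_zoom has priority, so responsive is switched off when both are TRUE.
resolve_plot_options <- function(responsive = TRUE, pan_zoom = FALSE) {
  if (pan_zoom && responsive) {
    warning("pan_zoom and responsive cannot be used together; ",
            "setting responsive = FALSE")
    responsive <- FALSE
  }
  list(responsive = responsive, pan_zoom = pan_zoom)
}

# Requesting pan-zoom alone silently turns off responsiveness
# (the warning is suppressed here only to keep the output short).
opts <- suppressWarnings(resolve_plot_options(pan_zoom = TRUE))
opts$responsive   # FALSE
opts$pan_zoom     # TRUE
```

Passing responsive = FALSE explicitly together with pan_zoom = TRUE avoids the conflict in this sketch, and the source suggests the same holds for plotWF, since the two options only clash when both are TRUE.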
Layout
There are a few different layouts you can choose from. There is no best layout; it all depends on the workflow structure you have. The default is compact, but we recommend trying different layouts to find the best-fitting one.
- compact: try to plot steps as close together as possible.
- vertical: the main branch is placed vertically, side branches are placed on the same horizontal level, and sub steps of side branches are placed vertically.
- horizontal: the main branch is placed horizontally, and side branches and sub steps are placed vertically.
- execution: a linear plot showing the workflow execution order of all steps.
These descriptions rely on the concept of a main branch, which is a way to decide the plot center. We will discuss it in more detail below.
vertical
plotWF(sal, layout = "vertical", height = "600px")
If the plot is very long, use height to make it smaller.
horizontal
plotWF(sal, layout = "horizontal")
execution
plotWF(sal, layout = "execution", height = "600px", responsive = FALSE)
If the plot is too long, we can use height to limit it and/or use responsive to make it scrollable.
Main branch
From the examples above, you can see that many steps do not point to any downstream step. These dead ends are called ending steps. If we connect the first step, the steps in between, and one of these ending steps, we get a branch. Imagine the workflow as a top-down tree whose root is the first step: there are many possible ways to connect it. For the convenience of plotting, we introduce the concept of a “main branch”, meaning the one branch that is placed at the center of the plot. Steps that are not in the main branch are arranged around it. The main branch does not affect how a workflow is run; it only affects how the workflow is plotted.
The main branch has little impact on the compact layout but a large effect on the horizontal and vertical layouts.
By default, the algorithm in plotWF automatically chooses the best branch for you. In simple terms, it favors branches that (a) connect the first and last step and (b) are as long as possible.
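The two preferences can be illustrated with a simplified base R sketch. This is not the plotWF implementation: the `children` edge list, `list_branches()`, and `pick_main_branch()` are hypothetical, and the edges are assumed to match the simple workflow with the two appended steps.

```r
# Hypothetical edge list: each step lists the downstream steps it points to.
children <- list(
  load_library = "export_iris",
  export_iris  = "gzip",
  gzip         = "gunzip",
  gunzip       = "stats",
  stats        = c("warning_step", "error_step"),
  warning_step = character(0),
  error_step   = character(0)
)

# Depth-first enumeration of every path from `step` down to an ending step.
list_branches <- function(step, children) {
  nxt <- children[[step]]
  if (length(nxt) == 0) return(list(step))
  paths <- list()
  for (child in nxt) {
    for (p in list_branches(child, children)) {
      paths[[length(paths) + 1]] <- c(step, p)
    }
  }
  paths
}

# Prefer branches that end at the last step, then the longest branch.
pick_main_branch <- function(branches, last_step) {
  reaches_last <- vapply(branches,
                         function(b) b[length(b)] == last_step, logical(1))
  branches[[order(!reaches_last, -lengths(branches))[1]]]
}

branches <- list_branches("load_library", children)
main <- pick_main_branch(branches, last_step = "error_step")
main
```

Here both branches have the same length, so the one ending at the last step (error_step) is selected.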
You can also choose a branch yourself with branch_method = "choose". It will first list all possible branches and then prompt you for your preferred branch. Here, because we are rendering R markdown, we cannot have a prompt, so we add a second argument, branch_no = x, to directly choose a branch and skip the prompt. Also, we use verbose = TRUE to imitate the branch listing in the console. In a real case, you only need branch_method = "choose".
To have the main branch marked, mark_main_branch = TRUE must be added (default FALSE).
Watch closely how the plot changes when different branches are chosen. Here we use the vertical layout for the demo. Remember, the main branch is marked in blue.
Choose branch 1
plotWF(sal, mark_main_branch = TRUE, layout = "vertical", branch_method = "choose", branch_no = 1, verbose = FALSE, height = "450px")
Choose branch 2
plotWF(sal, mark_main_branch = TRUE, layout = "vertical", branch_method = "choose", branch_no = 2, verbose = FALSE, height = "450px")
Do you see the difference?
Legends
The legend can also be removed with show_legend = FALSE:
plotWF(sal, show_legend = FALSE, height = "500px")
Output formats
There are currently three output formats: "html", "dot", and "dot_print". If one of the first two is chosen, you also need to provide a path, out_path, to save the file.
- html: a single html file that contains the plot.
- dot: a DOT script file with the code to reproduce the plot in a Graphviz DOT engine.
- dot_print: directly cat the DOT script to the console.
HTML
The HTML format is very useful if you want to view the plot later or share it with other people. This format is also helpful when you are working on a remote computer cluster. To view the workflow plot, a browser device (viewer) must be available, but often this is not the case on computer clusters. When you plot a workflow and see the message "Couldn't get a file descriptor referring to the console", it means your computer (cluster) does not have a browser device. Saving to HTML format is then the best option.
plotWF(sal, out_format = "html", out_path = "example_out.html")
file.exists("example_out.html")
## [1] TRUE
DOT
Saving the workflow plot to dot format allows one to reproduce the plot with the Graphviz language.
plotWF(sal, out_format = "dot", out_path = "example_out.dot")
file.exists("example_out.dot")
## [1] TRUE
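For readers unfamiliar with DOT, the following base R sketch shows what a bare-bones DOT script for a dependency graph looks like. It is not the script that plotWF writes, which also carries styling and legend information; the `deps` table and the `to_dot()` helper are hypothetical.

```r
# Hypothetical dependency table: each step lists the steps it depends on.
deps <- list(
  load_library = character(0),
  export_iris  = "load_library",
  gzip         = "export_iris",
  gunzip       = "gzip",
  stats        = "gunzip"
)

# Turn the table into a minimal DOT digraph: one edge per dependency,
# pointing from the upstream step to the step that depends on it.
to_dot <- function(deps) {
  edges <- unlist(lapply(names(deps), function(step) {
    if (length(deps[[step]]) == 0) return(character(0))
    sprintf("  \"%s\" -> \"%s\";", deps[[step]], step)
  }))
  paste(c("digraph workflow {", edges, "}"), collapse = "\n")
}

dot <- to_dot(deps)
cat(dot, "\n")
```

The printed text can be pasted into any Graphviz DOT engine to draw the five steps as a simple chain.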
DOT print
Instead of saving the Graphviz plotting code to a file, this option prints the code directly to the console. If you have a Graphviz plotting device at hand, simply copy and paste the code into that engine to reproduce the plot. For example, use our Workflow Plot Editor.
plotWF(sal, out_format = "dot_print")
Saving a static image file
Some users may want to save the plot to a static image, such as the .png format. This takes some extra work. The reason we cannot directly save it to a png file is that the plot is generated in real time by a browser javascript engine. It requires some kind of javascript engine, like Chrome, MS Edge, or the Viewer in Rstudio, to render the plot before we can see it, no matter whether you use SVG or PNG embedding.
Interactive
- With the plot_ctr = TRUE (default) option, a plot control panel is displayed in the top-left corner. One can choose from different formats like png, jpg, svg or pdf to download them from the webpage. An internet connection is required to enable these buttons, because the underlying Javascript libraries are downloaded on the fly. There are known conflicts between the underlying image-export libraries and the R markdown web libraries, so some of these buttons may not work inside R markdown, as in the vignette you are reading right now. However, they should work properly if the workflow plot is saved to a stand-alone HTML file.
- If you are working in Rstudio, you can also use the export button in the viewer to save an image file.
Note: due to conflicts between the web libraries of this website and the libraries used in plotWF, some buttons may not work when you click them here, but they will work when you make workflow plots interactively and view them in a stand-alone browser tab.
Non-interactive
If you cannot have an interactive session, for example when submitting a job to a cluster, but still want the png, we recommend using the {webshot2} package to take a screenshot of the plot. It runs headless Chrome in the back-end (which has a javascript engine).
Install the package
remotes::install_github("rstudio/webshot2")
Save to html first
plotWF(sal, out_format = "html", out_path = "example_out.html")
file.exists("example_out.html")
Use webshot2 to save the image
webshot2::webshot("example_out.html", "example_out.png")
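In a batch job it can help to guard the screenshot step, so that a missing HTML file or a missing {webshot2} installation does not stop the whole script. The wrapper below is a minimal sketch under that assumption; `save_plot_png()` is a hypothetical helper, and only `webshot2::webshot()` is an actual package function.

```r
# Hypothetical wrapper around webshot2::webshot() for non-interactive jobs.
# Returns TRUE if the png was written, FALSE otherwise.
save_plot_png <- function(html, png) {
  if (!file.exists(html)) {
    message("HTML file not found: ", html)
    return(FALSE)
  }
  if (!requireNamespace("webshot2", quietly = TRUE)) {
    message("Package 'webshot2' is not installed; skipping screenshot.")
    return(FALSE)
  }
  # Headless Chrome renders the page before the screenshot is taken.
  webshot2::webshot(html, png)
  file.exists(png)
}

# A missing input file is reported instead of raising an error.
save_plot_png("no_such_file.html", "no_such_file.png")
```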
In logs
The workflow steps also become clickable if in_log = TRUE. This creates a link for each step that navigates to the corresponding section in the SPR workflow log file. Normally this option is handled by the SPR log file generating function renderLogs, which places the plot on top of the log file, so when a certain step is clicked, it navigates to the detailed section further down the page.
Visit this page to see a real example. Try to click on the step in the workflow plot and watch what happens.
Session
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] systemPipeR_2.3.4 ShortRead_1.54.0
## [3] GenomicAlignments_1.32.0 SummarizedExperiment_1.26.1
## [5] Biobase_2.56.0 MatrixGenerics_1.8.0
## [7] matrixStats_0.62.0 BiocParallel_1.30.2
## [9] Rsamtools_2.12.0 Biostrings_2.64.0
## [11] XVector_0.36.0 GenomicRanges_1.48.0
## [13] GenomeInfoDb_1.32.2 IRanges_2.30.0
## [15] S4Vectors_0.34.0 BiocGenerics_0.42.0
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-45 png_0.1-7 assertthat_0.2.1
## [4] digest_0.6.29 utf8_1.2.2 R6_2.5.1
## [7] evaluate_0.15 highr_0.9 ggplot2_3.3.6
## [10] blogdown_1.10 pillar_1.7.0 zlibbioc_1.42.0
## [13] rlang_1.0.2 rstudioapi_0.13 jquerylib_0.1.4
## [16] Matrix_1.4-1 rmarkdown_2.14 labeling_0.4.2
## [19] stringr_1.4.0 htmlwidgets_1.5.4 RCurl_1.98-1.6
## [22] munsell_0.5.0 DelayedArray_0.22.0 compiler_4.2.0
## [25] xfun_0.31 pkgconfig_2.0.3 htmltools_0.5.2
## [28] tidyselect_1.1.2 tibble_3.1.7 GenomeInfoDbData_1.2.8
## [31] bookdown_0.26 fansi_1.0.3 dplyr_1.0.9
## [34] crayon_1.5.1 bitops_1.0-7 grid_4.2.0
## [37] DBI_1.1.2 jsonlite_1.8.0 gtable_0.3.0
## [40] lifecycle_1.0.1 magrittr_2.0.3 scales_1.2.0
## [43] cli_3.3.0 stringi_1.7.6 farver_2.1.0
## [46] hwriter_1.3.2.1 latticeExtra_0.6-29 bslib_0.3.1
## [49] generics_0.1.2 ellipsis_0.3.2 vctrs_0.4.1
## [52] RColorBrewer_1.1-3 tools_4.2.0 glue_1.6.2
## [55] purrr_0.3.4 jpeg_0.1-9 parallel_4.2.0
## [58] fastmap_1.1.0 yaml_2.3.5 colorspace_2.0-3
## [61] knitr_1.39 sass_0.4.1