featureCoverage.Rd
Computes read coverage along single and multi component features based on
genomic alignments. The coverage segments of component features are spliced to
continuous ranges, such as exons to transcripts or CDSs to ORFs. The results
can be obtained with single nucleotide resolution (e.g. around start and stop
codons) or as mean coverage of relative bin sizes, such as 100 bins for each
feature. The latter allows comparisons of coverage trends among transcripts of
variable length. The results can be obtained for single or many features (e.g.
any number of transcritpts) at once. Visualization of the coverage results is
facilitated by a downstream plotfeatureCoverage
function.
featureCoverage(bfl, grl, resizereads = NULL, readlengthrange = NULL, Nbins = 20,
method = mean, fixedmatrix, resizefeatures, upstream, downstream,
outfile, overwrite = FALSE)
Paths to BAM files provided as BamFileList
object. The name slot of the
BAM files will be used for naming samples in the results.
Genomic ranges provided as GRangesList
typically generated form
txdb
instances with operations like: cdsBy(txdb, "tx")
or
exonsBy(txdb, "tx")
. Single component features will be processed the
same way as multi component features.
Positive integer defining the length alignments should be resized to prior to the
coverage calculation. NULL
will omit the resizing step.
Positive integer of length 2 determining the read length range to use for
the coverage calculation. Reads falling outside of the specified length range
will be excluded from the coverage calculation. For instance,
readlengthrange=c(30:40)
will base the coverage calculation on reads
between 30 to 40 bps. Assigning NULL
will skip this filtering step.
Single positive integer defining the number of segments the coverage of each feature should be binned into in order to obtain coverage summaries of constant length, e.g. for plotting purposes.
Defines the summary statistics to use for binning. The default is method=mean
.
If set to TRUE
, a coverage matrix with single nucleotide resolution will be
returned for any number of transcripts centered around precise anchor points in
a genome annotation, such a stop/start codons or transcription start sites. For instance, a
matrix with coverage information 20bps upstream and downstream of the stop/start codons can be
obtained with fixedmatrix=TRUE, upstream=20, downstream=20
along with a grl
instance containing the CDS exon ranges required for this operation, e.g. generated
with cdsBy(txdb, "tx")
.
Needs to be set to TRUE
when fixedmatrix=TRUE
. Internally, this will use
the systemPipeR::.resizeFeature
function to extend single and multi component
features at their most left and most right end coordinates. The corresponding extension
values are specified under the upstream
and downstream
arguments.
Single positive integer specifying the upstream extension length relative to the orientation of each feature in the genome. More details are given above.
Single positive integer specifying the downstream extension length relative to the orientation of each feature in the genome. More details are given above.
Default NULL
omits writing of the results to a file. If a file name is specified then
the results are written to a tabular file. If bfl
contains the paths to several BAM files
then the results will be appended to the same file where the first column specifies the
sample labels. Redirecting the results to file is particularly useful when processing large
files of many sample where computation times can be significant.
If set to TRUE
any existing file assigned to outfile
will be overwritten.
The function allows to return the following four distinct outputs. The settings to return these instances are illustrated below in the example section.
data.frame
containing binned coverage where rows are features and columns
coverage bins. The first four columns contain (i) the sample names, (ii) the number of
total aligned reads in the corresponding BAM files (useful for normalization), (iii) the
feature IDs, (iv) strand of the coverage. All following columns are numeric and contain
the actual coverage data for the sense and antisense strand of each feature.
data.frame
containing coverage with single nucleotide
resolution around anchor points such as start and stop codons. The two matrix
components are appended column-wise. To clearly distinguish the two data components,
they are separated by a specialty column containing pipe characters. The first four
columns are the same as described under (A). The column title for the
anchor point is 0. For instance, if the features are CDSs then the first 0 corresponds to the
first nucleotide of the start codon and the second 0 to the last nucleotide of
the stop codon. Upstream and downstream positions are indicated by negative and
positive column numbers, respectively.
data.frame
containing combined results of (A) and (B) where the
first set of columns contains to the coverage around the start codons, the second one
the binned coverage of the CDSs and the third one the coverage around the stop codons
separated by the same pipe columns mentioned under (B).
Rle
list containing the nucleotide level coverage of each feature
plotfeatureCoverage
## Construct SYSargs2 object from param and targets files
targetspath <- system.file("extdata", "targets.txt", package="systemPipeR")
dir_path <- system.file("extdata/cwl", package="systemPipeR")
args <- loadWorkflow(targets=targetspath, wf_file="hisat2/hisat2-mapping-se.cwl",
input_file="hisat2/hisat2-mapping-se.yml", dir_path=dir_path)
args <- renderWF(args, inputvars=c(FileName="_FASTQ_PATH1_", SampleName="_SampleName_"))
args
#> Instance of 'SYSargs2':
#> Slot names/accessors:
#> targets: 18 (M1A...V12B), targetsheader: 4 (lines)
#> modules: 1
#> wf: 0, clt: 1, yamlinput: 7 (inputs)
#> input: 18, output: 18
#> cmdlist: 18
#> Sub Steps:
#> 1. hisat2-mapping-se (rendered: TRUE)
#>
#>
if (FALSE) {
## Features from sample data of systemPipeRdata package
library(GenomicFeatures)
file <- system.file("extdata/annotation", "tair10.gff", package="systemPipeRdata")
txdb <- makeTxDbFromGFF(file=file, format="gff3", organism="Arabidopsis")
targetspath <- system.file("extdata", "targets.txt", package="systemPipeR")
dir_path <- system.file("extdata/cwl", package="systemPipeR")
args <- loadWorkflow(targets=targetspath, wf_file="hisat2/hisat2-mapping-se.cwl",
input_file="hisat2/hisat2-mapping-se.yml", dir_path=dir_path)
args <- renderWF(args, inputvars=c(FileName="_FASTQ_PATH1_", SampleName="_SampleName_"))
args <- runCommandline(args, make_bam = TRUE, dir = TRUE)
outpaths <- subsetWF(args , slot="output", subset=1, index=1)
file.exists(outpaths)
## (A) Generate binned coverage for two BAM files and 4 transcripts
grl <- cdsBy(txdb, "tx", use.names=TRUE)
fcov <- featureCoverage(bfl=BamFileList(outpaths[1:2]), grl=grl[1:4], resizereads=NULL,
readlengthrange=NULL, Nbins=20, method=mean, fixedmatrix=FALSE,
resizefeatures=TRUE, upstream=20, downstream=20,
outfile="results/featureCoverage.xls", overwrite=TRUE)
plotfeatureCoverage(covMA=fcov, method=mean, scales="fixed", scale_count_val=10^6)
## (B) Coverage matrix upstream and downstream of start/stop codons
fcov <- featureCoverage(bfl=BamFileList(outpaths[1:2]), grl=grl[1:4], resizereads=NULL,
readlengthrange=NULL, Nbins=NULL, method=mean, fixedmatrix=TRUE,
resizefeatures=TRUE, upstream=20, downstream=20,
outfile="results/featureCoverage_up_down.xls", overwrite=TRUE)
plotfeatureCoverage(covMA=fcov, method=mean, scales="fixed", scale_count_val=10^6)
## (C) Combined matrix for both binned and start/stop codon
fcov <- featureCoverage(bfl=BamFileList(outpaths[1:2]), grl=grl[1:4], resizereads=NULL,
readlengthrange=NULL, Nbins=20, method=mean, fixedmatrix=TRUE,
resizefeatures=TRUE, upstream=20, downstream=20,
outfile="results/featureCoverage_binned.xls", overwrite=TRUE)
plotfeatureCoverage(covMA=fcov, method=mean, scales="fixed", scale_count_val=10^6)
## (D) Rle coverage objects one for each query feature
fcov <- featureCoverage(bfl=BamFileList(outpaths[1:2]), grl=grl[1:4], resizereads=NULL,
readlengthrange=NULL, Nbins=NULL, method=mean, fixedmatrix=FALSE,
resizefeatures=TRUE, upstream=20, downstream=20,
outfile="results/featureCoverage_query.xls", overwrite=TRUE)
}