Use Rsubread to align reads against one or more filter libraries and subsequently remove mapped reads
Source:R/filter_host.R
filter_host.Rd
After aligning your sample to a target library with align_target()
,
use filter_host()
to remove unwelcome host contamination using filter
reference libraries. This function takes as input the name of the .bam file
produced via align_target()
, and produces a sorted .bam file with any
reads that match the filter libraries removed. This resulting .bam file may
be used upstream for further analysis. This function uses Rsubread. For the
Rbowtie2 equivalent of this function, see filter_host_bowtie
.
Usage
filter_host(
reads_bam,
lib_dir = NULL,
libs,
make_bam = FALSE,
output = paste(tools::file_path_sans_ext(reads_bam), "filtered", sep = "."),
subread_options = align_details,
YS = 1e+05,
threads = 1,
quiet = TRUE
)
Arguments
- reads_bam
The name of a merged, sorted .bam file that has previously been aligned to a reference library. Likely, the output from running an instance of
align_target()
.- lib_dir
Path to the directory that contains the filter Subread index files.
- libs
The basename of the filter libraries (without index extension).
- make_bam
Logical, whether to also output a bam file with host reads filtered out. A .csv.gz file will be created instead if
FALSE
. Creating a bam file is costly on resources over creating a compressed csv file with only relevant information, so default isFALSE
.- output
The desired name of the output .bam or .csv.gz file. Extension is automatically defined by whether
make_bam = TRUE.
Default is the basename ofunfiltered_bam
+.filtered
+ extension.- subread_options
A named
list
specifying alignment parameters for theRsubread::align()
function, which is called insidealign_target()
. Elements should include type, nthreads, maxMismatches, nsubreads, phredOffset, unique, and nBestLocations. Descriptions of these parameters are available under?Rsubread::align
. Defaults to the globalalign_details
object.- YS
yieldSize, an integer. The number of alignments to be read in from the bam file at once for chunked functions. Default is 100000.
- threads
The amount of threads available for the function. Default is 1 thread.
- quiet
Turns off most messages. Default is
TRUE
.
Value
The name of a filtered, sorted .bam file written to the user's
current working directory. Or, if make_bam = FALSE
, a .csv.gz file
containing a data frame of only requisite information to run
Details
A compressed .csv can be created to produce a smaller output file that is
created more efficiently and is still compatible with metascope_id()
.
Examples
#### Filter reads from bam file that align to any of the filter libraries
## Assuming a bam file has been created previously with align_target()
# \donttest{
## Create temporary directory
filter_ref_temp <- tempfile()
dir.create(filter_ref_temp)
## Download filter genome
all_species <- c("Staphylococcus aureus subsp. aureus str. Newman")
all_ref <- vapply(all_species, MetaScope::download_refseq,
reference = FALSE, representative = FALSE, compress = TRUE,
out_dir = filter_ref_temp, caching = FALSE,
FUN.VALUE = character(1))
#> No ENTREZ API key provided
#> Get one via taxize::use_entrez()
#> See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
#> No ENTREZ API key provided
#> Get one via taxize::use_entrez()
#> See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
ind_out <- vapply(all_ref, mk_subread_index, FUN.VALUE = character(1))
#>
#> ========== _____ _ _ ____ _____ ______ _____
#> ===== / ____| | | | _ \| __ \| ____| /\ | __ \
#> ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
#> ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
#> ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
#> ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
#> Rsubread 2.18.0
#>
#> //================================= setting ==================================\\
#> || ||
#> || Index name : Staphylococcus_aureus_subsp._aureus_str._N ... ||
#> || Index space : base space ||
#> || Index split : no-split ||
#> || Repeat threshold : 100 repeats ||
#> || Gapped index : no ||
#> || ||
#> || Free / total memory : 198.0GB / 251.3GB ||
#> || ||
#> || Input files : 1 file in total ||
#> || o Staphylococcus_aureus_subsp._aureus_str._Newman.fasta.gz ||
#> || ||
#> \\============================================================================//
#>
#> //================================= Running ==================================\\
#> || ||
#> || Check the integrity of provided reference sequences ... ||
#> || No format issues were found ||
#> || Scan uninformative subreads in reference sequences ... ||
#> || Estimate the index size... ||
#> || 8%, 0 mins elapsed, rate=3946.6k bps/s ||
#> || 16%, 0 mins elapsed, rate=5305.1k bps/s ||
#> || 24%, 0 mins elapsed, rate=5989.7k bps/s ||
#> || 41%, 0 mins elapsed, rate=6859.2k bps/s ||
#> || 49%, 0 mins elapsed, rate=7031.3k bps/s ||
#> || 58%, 0 mins elapsed, rate=7143.5k bps/s ||
#> || 66%, 0 mins elapsed, rate=7252.9k bps/s ||
#> || 74%, 0 mins elapsed, rate=7335.4k bps/s ||
#> || 83%, 0 mins elapsed, rate=7392.9k bps/s ||
#> || 91%, 0 mins elapsed, rate=7442.8k bps/s ||
#> || 3.0 GB of memory is needed for index building. ||
#> || Build the index... ||
#> || 8%, 0 mins elapsed, rate=520.6k bps/s ||
#> || 16%, 0 mins elapsed, rate=913.1k bps/s ||
#> || 24%, 0 mins elapsed, rate=1221.5k bps/s ||
#> || 41%, 0 mins elapsed, rate=1689.5k bps/s ||
#> || 49%, 0 mins elapsed, rate=1855.9k bps/s ||
#> || 58%, 0 mins elapsed, rate=1997.4k bps/s ||
#> || 66%, 0 mins elapsed, rate=2120.2k bps/s ||
#> || 74%, 0 mins elapsed, rate=2220.7k bps/s ||
#> || 83%, 0 mins elapsed, rate=2311.3k bps/s ||
#> || 91%, 0 mins elapsed, rate=2392.5k bps/s ||
#> || Save current index block... ||
#> || [ 0.0% finished ] ||
#> || [ 10.0% finished ] ||
#> || [ 20.0% finished ] ||
#> || [ 30.0% finished ] ||
#> || [ 40.0% finished ] ||
#> || [ 50.0% finished ] ||
#> || [ 60.0% finished ] ||
#> || [ 70.0% finished ] ||
#> || [ 80.0% finished ] ||
#> || [ 90.0% finished ] ||
#> || [ 100.0% finished ] ||
#> || ||
#> || Total running time: 0.2 minutes. ||
#> ||Index /scratch/261666.1.ood/Rtmp4IgmVN/file3f4b0e7ecc02b/Staphylococcu ... ||
#> || ||
#> \\============================================================================//
#>
## Get path to example reads
readPath <- system.file("extdata", "subread_target.bam",
package = "MetaScope")
## Copy the example reads to the temp directory
refPath <- file.path(filter_ref_temp, "subread_target.bam")
file.copy(from = readPath, to = refPath)
#> [1] TRUE
utils::data("align_details")
align_details[["type"]] <- "rna"
align_details[["phredOffset"]] <- 10
# Just to get it to align - toy example!
align_details[["maxMismatches"]] <- 10
## Align and filter reads
filtered_map <- filter_host(
refPath, lib_dir = filter_ref_temp,
libs = stringr::str_replace_all(all_species, " ", "_"),
threads = 1, subread_options = align_details)
#>
|
| | 0%
|
|======================================================================| 100%
#>
#> ========== _____ _ _ ____ _____ ______ _____
#> ===== / ____| | | | _ \| __ \| ____| /\ | __ \
#> ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
#> ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
#> ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
#> ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
#> Rsubread 2.18.0
#>
#> //================================= setting ==================================\\
#> || ||
#> || Function : Read alignment (RNA-Seq) ||
#> || Input file : intermediate.fastq.gz ||
#> || Output file : subread_target.Staphylococcus_aureus_subsp._aureus_str ... ||
#> || Index name : Staphylococcus_aureus_subsp._aureus_str._Newman ||
#> || ||
#> || ------------------------------------ ||
#> || ||
#> || Threads : 1 ||
#> || Phred offset : 64 ||
#> || Min votes : 3 / 10 ||
#> || Max mismatches : 10 ||
#> || Max indel length : 5 ||
#> || Report multi-mapping reads : yes ||
#> || Max alignments per multi-mapping read : 16 ||
#> || ||
#> \\============================================================================//
#>
#> //=============== Running (25-Jun-2024 14:00:38, pid=4147982) ================\\
#> || ||
#> || Check the input reads. ||
#> || The input file contains base space reads. ||
#> || Initialise the memory objects. ||
#> || Estimate the mean read length. ||
#> || WARNING - The specified Phred score offset (64) seems incorrect. ||
#> || ASCII values of the quality scores of read bases included in ||
#> || the first 1049 reads were found to be within the range of ||
#> || [19,19]. ||
#> || ||
#> || Create the output BAM file. ||
#> || Check the index. ||
#> || Init the voting space. ||
#> || Global environment is initialised. ||
#> || Load the 1-th index block... ||
#> || The index block has been loaded. ||
#> || Start read mapping in chunk. ||
#> || ||
#> || Completed successfully. ||
#> || ||
#> \\==================================== ====================================//
#>
#> //================================ Summary =================================\\
#> || ||
#> || Total reads : 1,050 ||
#> || Mapped : 930 (88.6%) ||
#> || Uniquely mapped : 7 ||
#> || Multi-mapping : 923 ||
#> || ||
#> || Unmapped : 120 ||
#> || ||
#> || Indels : 158 ||
#> || ||
#> || WARNING : Phred offset (64) incorrect? ||
#> || ||
#> || Running time : 1.0 minutes ||
#> || ||
#> \\============================================================================//
#>
#>
|
| | 0%
|
|======================================================================| 100%
## Remove temporary directory
unlink(filter_ref_temp, recursive = TRUE)
# }