Identify which genomes are represented in a processed sample
Source:R/metascope_id.R
metascope_id.Rd
This function will read in a .bam or .csv.gz file, annotate the taxonomy and genome names, reduce the mapping ambiguity using a mixture model, and output a .csv file with the results. Currently, it assumes that the genome library/.bam files use NCBI accession names for reference names (rnames in .bam file).
Usage
metascope_id(
input_file,
input_type = "csv.gz",
aligner = "bowtie2",
db = "ncbi",
db_feature_table = NULL,
NCBI_key = NULL,
out_dir = dirname(input_file),
tmp_dir = NULL,
convEM = 1/10000,
maxitsEM = 25,
q50 = FALSE,
update_bam = FALSE,
num_species_plot = NULL,
blast_fastas = FALSE,
num_genomes = 100,
num_reads = 50,
quiet = TRUE
)
Arguments
- input_file
The .bam or .csv.gz file of sample reads to be identified.
- input_type
Extension of file input. Should be either "bam" or "csv.gz". Default is "csv.gz".
- aligner
The aligner which was used to create the bam file. Default is "bowtie2" but can also be set to "subread" or "other".
- db
Currently accepts one of
c("ncbi", "silva", "other")
Default is"ncbi"
, appropriate for samples aligned against indices compiled from NCBI whole genome databases. Alternatively, usage of an alternate database (like Greengenes2) should be specified with"other"
.- db_feature_table
If
db = "other"
, a data.frame must be supplied with two columns, "Feature ID" matching the names of the alignment indices, and a secondcharacter
column supplying the taxon identifying information.- NCBI_key
(character) NCBI Entrez API key. optional. See taxize::use_entrez(). Due to the high number of requests made to NCBI, this function will be less prone to errors if you obtain an NCBI key. You may enter the string as an input or set it as ENTREZ_KEY in .Renviron.
- out_dir
The directory to which the .csv output file will be output. Defaults to
dirname(input_file)
.- tmp_dir
Path to a directory to which bam and updated bam files can be saved. Required.
- convEM
The convergence parameter of the EM algorithm. Default set at
1/10000
.- maxitsEM
The maximum number of EM iterations, regardless of whether the convEM is below the threshhold. Default set at
50
. If set at0
, the algorithm skips the EM step and summarizes the .bam file 'as is'- q50
(inc.)
- update_bam
Whether to update BAM file with new read assignments. Default is
FALSE
. IfTRUE
, requiresinput_type = TRUE
such that a BAM file is the input to the function.- num_species_plot
The number of genome coverage plots to be saved. Default is
NULL
, which saves coverage plots for the ten most highly abundant species.- blast_fastas
Logical, whether or not to output fasta files for MetaBlast. Default is
FALSE
.- num_genomes
Number of genomes to output fasta files for MetaBlast. Default is
100
.- num_reads
Number of reads per genome per fasta file for MetaBlast. Default is
50
.- quiet
Turns off most messages. Default is
TRUE
.
Value
This function returns a .csv file with annotated read counts to genomes with mapped reads. The function itself returns the output .csv file name. Depending on the parameters specified, can also output an updated BAM file, and fasta files for usage downstream with MetaBLAST.
Examples
#### Align reads to reference library and then apply metascope_id()
## Assuming filtered bam files already exist
## Create temporary directory
file_temp <- tempfile()
dir.create(file_temp)
#### Subread aligned bam file
## Create object with path to filtered subread csv.gz file
filt_file <- "subread_target.filtered.csv.gz"
bamPath <- system.file("extdata", filt_file, package = "MetaScope")
file.copy(bamPath, file_temp)
#> [1] TRUE
## Run metascope id with the aligner option set to subread
metascope_id(input_file = file.path(file_temp, filt_file),
aligner = "subread", num_species_plot = 0,
input_type = "csv.gz")
#> No ENTREZ API key provided
#> Get one via taxize::use_entrez()
#> See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
#> # A tibble: 2 × 6
#> TaxonomyID Genome read_count Proportion readsEM EMProportion
#> <chr> <chr> <int> <dbl> <dbl> <dbl>
#> 1 273036 Staphylococcus aureus R… 102 0.85 102. 0.847
#> 2 418127 Staphylococcus aureus s… 18 0.15 18.3 0.153
#### Bowtie 2 aligned .csv.gz file
## Create object with path to filtered bowtie2 bam file
bowtie_file <- "bowtie_target.filtered.csv.gz"
bamPath <- system.file("extdata", bowtie_file, package = "MetaScope")
file.copy(bamPath, file_temp)
#> [1] TRUE
## Run metascope id with the aligner option set to bowtie2
metascope_id(file.path(file_temp, bowtie_file), aligner = "bowtie2",
num_species_plot = 0, input_type = "csv.gz")
#> No ENTREZ API key provided
#> Get one via taxize::use_entrez()
#> See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
#> # A tibble: 1 × 6
#> TaxonomyID Genome read_count Proportion readsEM EMProportion
#> <chr> <chr> <int> <dbl> <dbl> <dbl>
#> 1 11234 Measles virus, complete… 6 1 6 1
## Remove temporary directory
unlink(file_temp, recursive = TRUE)