This function allows the user to check a subset of identified reads against NCBI BLAST and the nucleotide database to confirm or contradict results provided by MetaScope. It aligns the top `metascope_id()` results to NCBI BLAST database. It REQUIRES that command-line BLAST and a separate nucleotide database have already been installed on the host machine. It returns a csv file updated with BLAST result metrics.
Usage
metascope_blast(
metascope_id_path,
bam_file_path = list.files(tmp_dir, ".updated.bam$", full.names = TRUE)[1],
tmp_dir,
out_dir,
sample_name,
fasta_dir = NULL,
num_results = 10,
num_reads = 100,
hit_list = 10,
num_threads = 1,
db_path,
quiet = FALSE,
db = NULL,
accession_path = NULL
)
Arguments
- metascope_id_path
Character string; path to a csv file output by `metascope_id()`.
- bam_file_path
Character string; full path to bam file for the same sample processed by `metascope_id`. Note that the `filter_bam` function must have output a bam file, and not a .csv.gz file. See `?filter_bam_bowtie` for more details. Defaults to
list.files(file_temp, ".updated.bam$")[1]
.- tmp_dir
Character string, a temporary directory in which to host files.
- out_dir
Character string, path to output directory.
- sample_name
Character string, sample name for output files.
- fasta_dir
Directory where fasta files for blast will be stored.
- num_results
Integer, the maximum number of taxa from the metascope_id output to check reads. Takes the top n taxa, i.e. those with largest abundance. Default is 10.
- num_reads
Integer, the maximum number of reads to blast per microbe. If the true number of reads assigned to a given taxon is smaller, then the smaller number will be chosen. Default is 100. Too many reads will involve more processing time.
- hit_list
Integer, number of blast hit results to keep. Default is 10.
- num_threads
Integer, number of threads if running in parallel (recommended). Default is 1.
- db_path
Character string. The database file to be searched (including basename, but without file extension). For example, if the nt database is in the nt folder, this would be /filepath/nt/nt assuming that the database files have the nt basename. Check this path if you get an error message stating "No alias or index file found".
- quiet
Logical, whether to print out more informative messages. Default is FALSE.
- db
Currently accepts one of
c("ncbi", "silva", "other")
Default is"ncbi"
, appropriate for samples aligned against indices compiled from NCBI whole genome databases. Alternatively, usage of an alternate database (like Greengenes2) should be specified with"other"
.- accession_path
(character) Filepath to NCBI accessions SQL database. See
taxonomzr::prepareDatabase()
.
Details
This function assumes that you used the NCBI nucleotide database to process samples, with a download date of 2021 or later. This is necessary for compatibility with the bam file headers.
This is highly computationally intensive and should be ran with multiple cores, submitted as a multi-threaded computing job if possible.
Note, if best_hit_strain is FALSE, then no strain was observed more often among the BLAST results.