This function allows the user to check a subset of identified reads against NCBI BLAST and the nucleotide database to confirm or contradict results provided by MetaScope. It aligns the top `metascope_id()` results to NCBI BLAST database. It REQUIRES that command-line BLAST and a separate nucleotide database have already been installed on the host machine. It returns a csv file updated with BLAST result metrics.
Usage
metascope_blast(
  metascope_id_path,
  bam_file_path = list.files(tmp_dir, ".updated.bam$", full.names = TRUE)[1],
  tmp_dir,
  out_dir,
  sample_name,
  fasta_dir = NULL,
  num_results = 10,
  num_reads = 100,
  hit_list = 10,
  num_threads = 1,
  db_path,
  quiet = FALSE,
  db = NULL,
  accession_path = NULL
)Arguments
- metascope_id_path
- Character string; path to a csv file output by `metascope_id()`. 
- bam_file_path
- Character string; full path to bam file for the same sample processed by `metascope_id`. Note that the `filter_bam` function must have output a bam file, and not a .csv.gz file. See `?filter_bam_bowtie` for more details. Defaults to - list.files(file_temp, ".updated.bam$")[1].
- tmp_dir
- Character string, a temporary directory in which to host files. 
- out_dir
- Character string, path to output directory. 
- sample_name
- Character string, sample name for output files. 
- fasta_dir
- Directory where fasta files for blast will be stored. 
- num_results
- Integer, the maximum number of taxa from the metascope_id output to check reads. Takes the top n taxa, i.e. those with largest abundance. Default is 10. 
- num_reads
- Integer, the maximum number of reads to blast per microbe. If the true number of reads assigned to a given taxon is smaller, then the smaller number will be chosen. Default is 100. Too many reads will involve more processing time. 
- hit_list
- Integer, number of blast hit results to keep. Default is 10. 
- num_threads
- Integer, number of threads if running in parallel (recommended). Default is 1. 
- db_path
- Character string. The database file to be searched (including basename, but without file extension). For example, if the nt database is in the nt folder, this would be /filepath/nt/nt assuming that the database files have the nt basename. Check this path if you get an error message stating "No alias or index file found". 
- quiet
- Logical, whether to print out more informative messages. Default is FALSE. 
- db
- Currently accepts one of - c("ncbi", "silva", "other")Default is- "ncbi", appropriate for samples aligned against indices compiled from NCBI whole genome databases. Alternatively, usage of an alternate database (like Greengenes2) should be specified with- "other".
- accession_path
- (character) Filepath to NCBI accessions SQL database. See - taxonomzr::prepareDatabase().
Details
This function assumes that you used the NCBI nucleotide database to process samples, with a download date of 2021 or later. This is necessary for compatibility with the bam file headers.
This is highly computationally intensive and should be ran with multiple cores, submitted as a multi-threaded computing job if possible.
Note, if best_hit_strain is FALSE, then no strain was observed more often among the BLAST results.