Blast reads from MetaScope aligned files

This function allows the user to check a subset of identified reads against NCBI BLAST and the nucleotide database to confirm or contradict results provided by MetaScope. It aligns the top `metascope_id()` results to NCBI BLAST database. It REQUIRES that command-line BLAST and a separate nucleotide database have already been installed on the host machine. It returns a csv file updated with BLAST result metrics.

Usage

metascope_blast(
  metascope_id_path,
  bam_file_path = list.files(tmp_dir, ".updated.bam$", full.names = TRUE)[1],
  tmp_dir,
  out_dir,
  sample_name,
  fasta_dir = NULL,
  num_results = 10,
  num_reads = 100,
  hit_list = 10,
  num_threads = 1,
  db_path,
  quiet = FALSE,
  db = NULL,
  accession_path = NULL
)

Arguments

metascope_id_path: Character string; path to a csv file output by `metascope_id()`.
bam_file_path: Character string; full path to bam file for the same sample processed by `metascope_id`. Note that the `filter_bam` function must have output a bam file, and not a .csv.gz file. See `?filter_bam_bowtie` for more details. Defaults to list.files(file_temp, ".updated.bam$")[1].
tmp_dir: Character string, a temporary directory in which to host files.
out_dir: Character string, path to output directory.
sample_name: Character string, sample name for output files.
fasta_dir: Directory where fasta files for blast will be stored.
num_results: Integer, the maximum number of taxa from the metascope_id output to check reads. Takes the top n taxa, i.e. those with largest abundance. Default is 10.
num_reads: Integer, the maximum number of reads to blast per microbe. If the true number of reads assigned to a given taxon is smaller, then the smaller number will be chosen. Default is 100. Too many reads will involve more processing time.
hit_list: Integer, number of blast hit results to keep. Default is 10.
num_threads: Integer, number of threads if running in parallel (recommended). Default is 1.
db_path: Character string. The database file to be searched (including basename, but without file extension). For example, if the nt database is in the nt folder, this would be /filepath/nt/nt assuming that the database files have the nt basename. Check this path if you get an error message stating "No alias or index file found".
quiet: Logical, whether to print out more informative messages. Default is FALSE.
db: Currently accepts one of c("ncbi", "silva", "other") Default is "ncbi", appropriate for samples aligned against indices compiled from NCBI whole genome databases. Alternatively, usage of an alternate database (like Greengenes2) should be specified with "other".
accession_path: (character) Filepath to NCBI accessions SQL database. See taxonomzr::prepareDatabase().

Value

This function writes an updated csv file with metrics.

Details

This function assumes that you used the NCBI nucleotide database to process samples, with a download date of 2021 or later. This is necessary for compatibility with the bam file headers.

This is highly computationally intensive and should be ran with multiple cores, submitted as a multi-threaded computing job if possible.

Note, if best_hit_strain is FALSE, then no strain was observed more often among the BLAST results.

Usage

Arguments

Value

Details

Examples