After mapping bisulfite sequencing (BSseq) data to the Crassostrea virginica (Eastern oyster) genome, Steven noticed that there were a large number of unmapped reads. He asked that I attempt to taxonomically claissify the unmapped reads (GitHub Issue), with the idea that maybe these reads could provide additional data on an associated microbiome (GitHub Discussion).
To handle this issue, I used DIAMOND
BLASTx, followed by the MEGAN6 tool daa-meganizer
to prepare the output files for import in MEGAN6 to view taxonomic assignments. See the SBATCH script and the RESULTS section below for info on BLAST database and input FastQ files used, respectively. FastQ files were provided by Steven, so I don’t have any info on how they were generated; although, I’m assuming they came from Bismark
.
One side note on this process. In the past, I’ve opted to convert the output DAA files into RMA6 format instead of performing the “meganization” of the DAA files. The rationale for this was that when you import the DAA files into MEGAN6, they get converted to RMA6, so it seems reasonable to use a more powerful computer (e.g. Mox) to do this conversion. However, the developer indicates that this conversion via the daa-rma
tool is slow and it is faster/more efficient to “meganize” the DAA files and import them into MEGAN6. So, that’s why I’ve reverted back to just performing the “meganization” of the DAA files instead of converting directly to RMA6.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=20220302-cvir-diamond-meganizer-unmapped_bsseq
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=3-12:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20220302-cvir-diamond-meganizer-unmapped_bsseq
## Perform DIAMOND BLASTx on unmapped C.virginica BSseq FastQ files.
## Unmapped reads generated by Steven Roberts, per this GitHub Issue:
## https://github.com/sr320/ceabigr/issues/22
## Expects input FastQ files to be match this pattern: *unmapped_reads*.fq.gz
###################################################################################
# These variables need to be set by user
## Assign Variables
# Program paths
diamond=/gscratch/srlab/programs/diamond-0.9.29/diamond
meganizer=/gscratch/srlab/programs/MEGAN-6.22.0/tools/daa-meganizer
# DIAMOND NCBI nr database
dmnd_db=/gscratch/srlab/blastdbs/ncbi-nr-20190925/nr.dmnd
# MEGAN mapping files
megan_mapping_dir=/gscratch/srlab/sam/data/databases/MEGAN
prot_acc2tax=${megan_mapping_dir}/prot_acc2tax-Jul2019X1.abin
acc2interpro=${megan_mapping_dir}/acc2interpro-Jul2019X.abin
acc2eggnog=${megan_mapping_dir}/acc2eggnog-Jul2019X.abin
# FastQ files directory
fastq_dir=/gscratch/srlab/sam/data/C_virginica/BSseq
# CPU threads
threads=40
# Programs associative array
declare -A programs_array
programs_array=(
[diamond]="${diamond}" \
[meganizer]="${meganizer}"
)
###################################################################################################
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Loop through FastQ files, log filenames to fastq_list.txt.
# Run DIAMOND on each FastQ
for fastq in "${fastq_dir}"/*unmapped_reads*.fq.gz
do
# Log input FastQs
echo ""
echo "Generating MD5 checksum for ${fastq}..."
md5sum "${fastq}" | tee --append input_fastqs.md5
echo ""
# Strip leading path and extensions
no_path=$(echo "${fastq##*/}")
no_ext=$(echo "${no_path%%.*}")
# Run DIAMOND with blastx
# Output format 100 produces a DAA binary file for use with MEGAN
"${programs_array[diamond]}" blastx \
--db ${dmnd_db} \
--query "${fastq}" \
--out "${no_ext}".blastx.meganized.daa \
--outfmt 100 \
--top 5 \
--block-size 15.0 \
--index-chunks 4 \
--threads ${threads}
# Meganize DAA files
# Used for ability to import into MEGAN6
"${programs_array[meganizer]}" \
--in "${no_ext}".blastx.meganized.daa \
--threads ${threads} \
--acc2taxa ${prot_acc2tax} \
--acc2interpro2go ${acc2interpro} \
--acc2eggnog ${acc2eggnog}
done
# Generate MD5 checksums
for file in *
do
echo ""
echo "Generating MD5 checksums for ${file}:"
md5sum "${file}" | tee --append checksums.md5
echo ""
done
#######################################################################################################
# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
echo "Logging program options..."
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
# Handle samtools help menus
if [[ "${program}" == "samtools_index" ]] \
|| [[ "${program}" == "samtools_sort" ]] \
|| [[ "${program}" == "samtools_view" ]]
then
${programs_array[$program]}
# Handle DIAMOND BLAST menu
elif [[ "${program}" == "diamond" ]]; then
${programs_array[$program]} help
# Handle NCBI BLASTx menu
elif [[ "${program}" == "blastx" ]]; then
${programs_array[$program]} -help
fi
${programs_array[$program]} -h
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
# If MultiQC is in programs_array, copy the config file to this directory.
if [[ "${program}" == "multiqc" ]]; then
cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
fi
done
echo "Finished logging programs options."
echo ""
fi
# Document programs in PATH (primarily for program version ID)
echo "Logging system $PATH..."
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
echo "Finished logging system $PATH."
RESULTS
Run time was was ~3.25 days:
Output folder:
20220302-cvir-diamond-meganizer-unmapped_bsseq/
Meganized DAA files:
*blastx.meganized.daa
These are all ~100MB.
List of input FastQs and MD5 checksums (text)
These are now ready for import into MEGAN6 to view taxonomic distribution of reads.