Earlier today I quality filtered (>=Q7) our C.baird NanoPore reads. One of the things I’d like to do now is to attempt to filter reads taxonomically, since the NanoPore data came from both an uninfected crab and Hematodinium-infected crab.
I ran DIAMOND BLASTx (on Mox), followed by MEGAN6 daa2rma
(on swoose - MEGAN6 requires some weird Java X11 thingy that won’t work on Mox) to convert the DIAMOND output to the proper MEGAN6 format for visualizing taxonomic assignments.
SBATCH script (GitHub) for DIAMOND BLASTx on Mox:
#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_blastx_DIAMOND_nanopore_all
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200917_cbai_diamond_blastx_nanopore_all_Q7
# Script to run DIAMOND BLASTx on all quality filtered (Q7) C.bairdi NanoPore reads
# from 20200917 using the --long-reads option
# for subsequent import into MEGAN6 to try to separate reads taxonomically.
###################################################################################
# These variables need to be set by user
# Input FastQ file
fastq=/gscratch/srlab/sam/data/C_bairdi/DNAseq/20200917_cbai_nanopore_all_quality-7.fastq
# DIAMOND Output filename prefix
prefix=20200917_cbai_diamond_blastx_nanopore_all_Q7
# Set number of CPUs to use
threads=28
# Program paths
diamond=/gscratch/srlab/programs/diamond-2.0.4/diamond
# DIAMOND NCBI nr database with taxonomy dumps
dmnd_db=/gscratch/srlab/blastdbs/ncbi-nr-20190925/nr.dmnd
###################################################################################
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# SegFault fix?
export THREADS_DAEMON_MODEL=1
# Inititalize arrays
programs_array=()
# Programs array
programs_array=("${diamond}")
md5sum "${fastq}" > fastq_checksums.md5
# Run DIAMOND with blastx
# Output format 6 produces a standard BLAST tab-delimited file
# Run DIAMOND with blastx
# Output format 100 produces a DAA binary file for use with MEGAN
${diamond} blastx \
--long-reads \
--db ${dmnd_db} \
--query "${fastq}" \
--out "${prefix}".blastx.daa \
--outfmt 100 \
--top 5 \
--block-size 8.0 \
--index-chunks 1 \
--threads ${threads}
# Capture program options
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${programs_array[program]}: "
echo ""
${programs_array[program]} help
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
done
# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
Bash script (GitHub) for daa2rma
on Emu:
#!/bin/bash
# Script to run MEGAN6 daa2rma on DIAMOND DAA files from
# 20200917_cbai_diamond_blastx_nanopore_all_Q7.
# Utilizes the --longReads option
# Requires MEGAN mapping file from:
# http://ab.inf.uni-tuebingen.de/data/software/megan6/download
# MEGAN mapping file
megan_map=/home/sam/data/databases/MEGAN6/megan-map-Jul2020-2.db
# Programs array
declare -A programs_array
programs_array=(
[daa2rma]="/home/shared/megan_6.19.9/tools/daa2rma"
)
threads=16
#########################################################################
# Exit script if any command fails
set -e
## Run daa2rma
# Capture start "time"
# Uses builtin bash variable called ${SECONDS}
start=${SECONDS}
for daa in *.daa
do
start_loop=${SECONDS}
sample_name=$(basename --suffix ".blastx.daa" "${daa}")
echo "Now processing ${sample_name}.daa2rma.rma6"
echo ""
# Run daa2rma with long reads option
${programs_array[daa2rma]} \
--in "${daa}" \
--longReads \
--mapDB ${megan_map} \
--out "${sample_name}".daa2rma.rma6 \
--threads ${threads} \
2>&1 | tee --append daa2rma_log.txt
end_loop=${SECONDS}
loop_runtime=$((end_loop-start_loop))
echo "Finished processing ${sample_name}.daa2rma.rma6 in ${loop_runtime} seconds."
echo ""
done
# Caputure end "time"
end=${SECONDS}
runtime=$((end-start))
# Print daa2rma runtime, in seconds
{
echo ""
echo "---------------------"
echo ""
echo "Total runtime was: ${runtime} seconds"
} >> daa2rma_log.txt
# Capture program options
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
${programs_array[$program]} --help
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
done
# Document programs in PATH
{
date
echo ""
echo "System PATH:"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
RESULTS
DIAMOND runtime was pretty quick, 37mins:
DIAMOND BLASTx Output folder:
20200917_cbai_diamond_blastx_nanopore_all_Q7/
DIAMOND BLASTx DAA file:
daa2rma Output folder:
I opened the RMA6 file in MEGAN6 to see the taxonomic breakdown and this is what I got:
NOTE: When using --long-read
mode, taxonomic assignments counts are always in aligned bases, not aligned reads!
Well, the results are pretty intriguing. Here’s a short table of the most intriguing data:
Taxonomy | Bases |
---|---|
Alveolata | 436,841 |
Arthropoda | 35,864,420 |
Enterospora canceri | 18,134,246 |
A few things to note:
As expected, Arthropoda comprises the bulk of assignments.
Alveolata is far fewer than I expected, but it is important to remember that the NanoPore data analyzed here is from two samples: one uninfected, the other infected, so Hematodinium sequence is going to make up a much smaller proportion of reads/bases.
Enterospora canceri (a known microsporidian agent of infection in crabs) ha the second most number of bases assigned. AND, this is at the species level, not Phylum level like Arthropoda and Alveolata!
- I think I’ve previously noticed a relatively high abundance of Enterospora canceri read assignments in some previous RNAseq taxonomic assignment I performed, but I’ll have to double check…
Although not highlighted in the tree/table above, I should not ignore Aquifex sp., which has 6,338,292 bases assigned to it! It turns out that members of the Aquifex genus are thermophylic bacteria… Crabs live on the sea floor (which is where one would find hydrothermal vents), but why would we see such a high abundance of these bacteria in muscle/hemolymph? Very curious!
Where to go from here. Well, I think I might extract the reads and, well, I don’t know what. Individual genome assemblies? I need to think about this a bit more and probably discuss with Steven.