INTRO
In preparation for a grant proposal involving sea star wasting disease, due in about two weeks, Steven asked that I perform taxonomic classification on a Trinity assembly of unmapped P.helanthoides RNA-seq reads (GitHub Issue). This was to compare to the results of classifying just the unmapped reads (Notebook). Due to time constraints, and a bit of luck, I opted to run this on Mox (Mox should have been decommissioned on 11/1/2024, but it’s still running and being able to use Mox, which is already set up, instead of resorting to Klone, was much faster to get this going).
METHODS
DIAMOND BLASTx and MEGANization
The Trinitay assembly FastA was DIAMOND
BLASTx’d against a previously constructed DIAMON BLAST database. Since this was using an assembly and not individual sequencing reads, I used the --long-reads
option. After DIAMOND
BLASTx, the resulting DAA files were “MEGANized” in preparation for use in MEGAN6. Both steps were run on Mox.
SBATCH SCRIPT
GitHub:
#!/bin/bash
## Job Name
#SBATCH --job-name=20241108-phel-diamond-meganizer-trinity_unmapped
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=3-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20241108-phel-diamond-meganizer-trinity_unmapped
## Perform DIAMOND BLASTx on Trinity assembly of unmapped P.helanthoides RNA-seq reads from Steven.
## Run with the --long-reads option due to using assembled reads.
## Will be used to view taxonomic breakdown.
###################################################################################
# These variables need to be set by user
## Assign Variables
# Program paths
diamond=/gscratch/srlab/programs/diamond-v2.1.1/diamond
meganizer=/gscratch/srlab/programs/MEGAN-6.22.0/tools/daa-meganizer
# DIAMOND NCBI nr database
dmnd_db=/gscratch/srlab/blastdbs/20230215-ncbi-nr/20230215-ncbi-nr.dmnd
# MEGAN mapping files
megan_mapping_dir=/gscratch/srlab/sam/data/databases/MEGAN
megan_mapdb="${megan_mapping_dir}/megan-map-Feb2022.db"
# CPU threads
threads=40
# MEGAN memory limit
mem_limit=100G
# Programs associative array
declare -A programs_array
programs_array=(
[diamond]="${diamond}" \
[meganizer]="${meganizer}"
)
###################################################################################################
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Run DIAMOND BLASTx, followed by "MEGANization"
# Run DIAMOND with blastx
# Output format 100 produces a DAA binary file for use with MEGAN
echo "Running DIAMOND BLASTx on ${fastq}."
echo ""
"${programs_array[diamond]}" blastx \
${dmnd_db} \
--db \
--query Trinity.fasta \
--out Trinity.fasta-phel-unmapped.blastx.meganized.daa \
--long-reads \
--outfmt 100 \
--top 5 \
--block-size 15.0 \
--index-chunks 4 ${mem_limit} \
--memory-limit ${threads}
--threads echo "DIAMOND BLASTx complete: Trinity.fasta-phel-unmapped.blastx.meganized.daa"
echo ""
# Meganize DAA files
# Used for ability to import into MEGAN6
echo "Now MEGANizing Trinity.fasta-phel-unmapped.blastx.meganized.daa"
"${programs_array[meganizer]}" \
\
--in Trinity.fasta-phel-unmapped.blastx.meganized.daa ${threads} \
--threads ${megan_mapdb}
--mapDB echo "MEGANization of Trinity.fasta-phel-unmapped.blastx.meganized.daa completed."
echo ""
# Generate MD5 checksums
for file in *
do
if [ "${file}" != "checksums.md5" ]; then
echo ""
echo "Generating MD5 checksums for ${file}:"
md5sum "${file}" | tee --append checksums.md5
echo ""
fi
done
# Generate checksum for MEGAN database(s)
{
md5sum "${megan_mapdb}"
md5sum "${dmnd_db}"
} >> checksums.md5
#######################################################################################################
# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
echo "Logging program options..."
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
# Handle samtools help menus
if [[ "${program}" == "samtools_index" ]] \
|| [[ "${program}" == "samtools_sort" ]] \
|| [[ "${program}" == "samtools_view" ]]
then
${programs_array[$program]}
# Handle DIAMOND BLAST menu
elif [[ "${program}" == "diamond" ]]; then
${programs_array[$program]} help
# Handle NCBI BLASTx menu
elif [[ "${program}" == "blastx" ]]; then
${programs_array[$program]} -help
fi
${programs_array[$program]} -h
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
# If MultiQC is in programs_array, copy the config file to this directory.
if [[ "${program}" == "multiqc" ]]; then
cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
fi
done
echo "Finished logging programs options."
echo ""
fi
# Document programs in PATH (primarily for program version ID)
echo "Logging system $PATH..."
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
echo "Finished logging system $PATH."
DAA to RMA Conversion
Resulting DAA file was converted to RMA6 format for import and analysis in MEGAN6. Due to time constraints, this was performed on Raven without a script. Command used:
/home/shared/megan-6.24.20/tools/daa2rma \
\
--in ./Trinity.fasta-phel-unmapped.blastx.meganized.daa \
--mapDB ~/data/databases/MEGAN/megan-map-Feb2022.db \
--out ./Trinity.fasta-phel-unmapped.blastx.meganized.daa.rma6 2>&1 | tee --append daa2rma_log.txt --threads 40
RESULTS
The Mox run took just under 2hrs to complete:
DISCUSSION
Looking at the entire taxonomies, down to the species level, most of the asembled reads get assigned to Asterias rubens (Wikipedia), the common sea star. This is the same as we saw when we performed this on just the RNA-seq reads. This is surprising, as I’m fairly certain these sequencing reads came from Pycnopodia helianthoides (Wikipedia), the sunflower star. Of course, these results are also dependent upon how much sequence data in NCBI for any given species. However, with that said, sequences are only assigned to a given taxonomic level based on the BLASTx results. Thus, even if a sequence comes from a sea star, it would (should?) not get assigned to a particular species unless the BLASTx confidence was very high…
Looking at the highest read assignments for non-eukaryotic species, we see that of the Bacteria (7,153 reads), Bacillus yapensis (Journal article website), while in Archae (17,742 reads), Methanocaldococcus jannaschii (Wikipedia) had the most reads assigned.
Output files
Output directory:
DIAMOND BLASTX/MEGANIZER
Output files:
Trinity.fasta-phel-unmapped.blastx.meganized.daa
(211MB)- MD5:
c9eb53478aa937ea7e094e023895918c
- MD5:
RMA6
Trinity.fasta-phel-unmapped.blastx.meganized.daa.rma6
- MD5:
50f150a9ce97034ffd5bad54b4017903
- MD5: