Although I previously annotated our C.bairdi transcriptome from 20191218, I realized that the assembly and annotations combined infected and uninfected samples, possibly making it more difficult to separate crab and Hematodinium sequences.
I also realized that the MEGAN6 software that I’d previously used for metagenomic taxonomic classification can actually extract sequencing reads. So, I decided to run all of our Tanner crab RNAseq reads through the MEGAN6 process. At the end, I’ll separate out reads, based on taxonomy, and then generate “clean” de novo assemblies of Tanner crab and Hematodinium!
To start this process, the trimmed reads need to be annotated using DIAMOND BLASTx. Then, the DIAMOND output files need to be “meganized” for importing to MEGAN6.
DIAMOND BLASTx took place on Mox, while “meganization” took place on my lab computer (swoose); this is because of the way MEGAN6 uses Java: it doesn’t run properly on Mox.
For reference, these include RNAseq data using a newly established “shorthand”: 2018, 2019.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_blastx_DIAMOND
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=20-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200103_cbai_diamond_blastx
## Perform DIAMOND BLASTx on trimmed Chionoecetes bairdi (Tanner crab) FastQ files.
## Trimmed FastQ files originated here:
## https://gannet.fish.washington.edu/Atumefaciens/20191218_cbai_fastp_RNAseq_trimming
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# SegFault fix?
export THREADS_DAEMON_MODEL=1
# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
# Program paths
diamond=/gscratch/srlab/programs/diamond-0.9.29/diamond
# DIAMOND NCBI nr database
dmnd=/gscratch/srlab/blastdbs/ncbi-nr-20190925/nr.dmnd
# FastQ files directory
fastq_dir=/gscratch/srlab/sam/data/C_bairdi/RNAseq
# Loop through FastQ files, log filenames to fastq_list.txt.
# Run DIAMOND on each FastQ
for fastq in "${fastq_dir}"/*fastp-trim*.fq.gz
do
  # Log input FastQs
  echo "${fastq}" >> fastq_list.txt
  # Strip leading path and extensions
  no_path="${fastq##*/}"
  no_ext="${no_path%%.*}"
  # Run DIAMOND with blastx
  # Output format 100 produces a DAA binary file for use with MEGAN
  "${diamond}" blastx \
  --db "${dmnd}" \
  --query "${fastq}" \
  --out "${no_ext}".blastx.daa \
  --outfmt 100 \
  --top 5 \
  --block-size 15.0 \
  --index-chunks 4
done
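Before committing a multi-day job to the queue, it can help to confirm which FastQ files the loop will pick up and what each output will be called. The following is a minimal sketch of a dry run, not part of the original job; the function name `dry_run` and the sample filename in the comment are hypothetical. It reuses the same glob pattern and the same parameter expansions as the loop above.

```shell
#!/bin/bash
# Dry run (hypothetical helper): print "input -> output" for each FastQ
# the DIAMOND loop would process. Returns non-zero if nothing matches.
dry_run() {
  local dir="$1"
  local fastq no_path no_ext found=1
  for fastq in "${dir}"/*fastp-trim*.fq.gz
  do
    # An unmatched glob is left as a literal string, so test for a real file
    [ -e "${fastq}" ] || continue
    found=0
    # Strip leading directories
    no_path="${fastq##*/}"
    # Strip everything from the first "." onward,
    # e.g. sample_01_R1.fastp-trim.fq.gz becomes sample_01_R1
    no_ext="${no_path%%.*}"
    printf '%s -> %s.blastx.daa\n' "${no_path}" "${no_ext}"
  done
  return "${found}"
}
```

Calling `dry_run "${fastq_dir}"` prints one line per matching file. Because `%%.*` cuts at the first period, any read-pair label (such as R1 or R2) has to appear before that period, or paired files would map to the same DAA name and overwrite each other.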
MEGANIZER script (GitHub):
#!/bin/bash
# Script to run MEGAN6 meganizer on DIAMOND DAA files from
# 20200103_cbai_diamond_blastx Mox job.
# Requires MEGAN mapping files from:
# http://ab.inf.uni-tuebingen.de/data/software/megan6/download
# Program path
meganizer=/home/sam/programs/megan/tools/daa-meganizer
# MEGAN mapping files
prot_acc2tax=/home/sam/data/databases/MEGAN/prot_acc2tax-Jul2019X1.abin
acc2interpro=/home/sam/data/databases/MEGAN/acc2interpro-Jul2019X.abin
acc2eggnog=/home/sam/data/databases/MEGAN/acc2eggnog-Jul2019X.abin
# Variables
threads=20
## Run MEGANIZER
# Capture start "time"
start=${SECONDS}
for daa in *.daa
do
  "${meganizer}" \
  --in "${daa}" \
  --threads "${threads}" \
  --acc2taxa "${prot_acc2tax}" \
  --acc2interpro2go "${acc2interpro}" \
  --acc2eggnog "${acc2eggnog}"
done
# Capture end "time"
end=${SECONDS}
runtime=$((end-start))
# Print MEGANIZER runtime, in seconds
echo "Runtime was: ${runtime} seconds"
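The script above reports runtime in raw seconds, which is hard to read for jobs that run for days. As a minimal sketch (the function name `format_runtime` is hypothetical and not part of the original script), the value from `SECONDS` can be broken into days, hours, minutes, and seconds using shell integer arithmetic:

```shell
#!/bin/bash
# Hypothetical helper: convert a number of seconds into "Nd HHh MMm SSs".
format_runtime() {
  local secs="$1"
  printf '%dd %02dh %02dm %02ds\n' \
    $((secs / 86400)) \
    $((secs % 86400 / 3600)) \
    $((secs % 3600 / 60)) \
    $((secs % 60))
}
```

For example, `format_runtime 180000` prints `2d 02h 00m 00s`. In the script, the final line could call `format_runtime "${runtime}"` instead of echoing the raw value.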
RESULTS
Runtime was just over two days (although the job sat in the queue for a full day before it could run):
Output folder:
Now that this is complete, I will import the files into MEGAN6 to create rma6 files, and then separately extract crab reads and Hematodinium reads. These will then be used to generate “clean” transcriptome assemblies for Tanner crab and Hematodinium.
Here’s the full list of MEGANIZED DIAMOND daa files and their sizes (note: they’re huge files):