After generating a number of C.bairdi (Tanner crab) transcriptomes, we decided we should compare them to evaluate which to help decide which one should become our “canonical” version. As part of that, the Trinity wiki offers a list of tools that one can use to check the quality of transcriptome assemblies. Some of those require a transcriptome of a related species.
A cursory search didn’t yield anything that jumped out at me, so I downloaded (and checked strandedness) of NCBI SRA RNAseq data for P.trituberculatus (Japanese blue crab) and decided to assemble a transcriptome myself.
Ran Trinity on Mox.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=trinity_cbai
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=12-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200523_ptri_trinity_transcriptome
### Trinity de novo assembly of all pooled P.trituberculatus RNAseq data.
### Assembly to be used for assessing our C.bairdi transcriptome assemblies.
### Data was taken from NCBI SRA BioProject PRJNA597187.
### See fastq.list.txt file for list of input files used for assembly.
# Exit script if a command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
# User-defined variables
reads_dir=/gscratch/srlab/sam/data/P_trituberculatus/RNAseq
transcriptome_dir=/gscratch/srlab/sam/data/P_trituberculatus/transcriptomes
threads=28
assembly_stats=assembly_stats.txt
timestamp=$(date +%Y%m%d)
fasta_name="${timestamp}.P_trituberculatus.Trinity.fasta"
# Paths to programs
trinity_dir="/gscratch/srlab/programs/trinityrnaseq-v2.9.0"
samtools="/gscratch/srlab/programs/samtools-1.10/samtools"
## Inititalize arrays
R1_array=()
R2_array=()
# Variables for R1/R2 lists
R1_list=""
R2_list=""
# Create array of fastq R1 files
R1_array=("${reads_dir}"/*.fastq)
# Create array of fastq R2 files
R2_array=("${reads_dir}"/*.fastq)
# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${reads_dir}"/*.fastq
do
echo "${fastq##*/}" >> fastq.list.txt
done
# Create comma-separated lists of FastQ reads
R1_list=$(echo "${R1_array[@]}" | tr " " ",")
R2_list=$(echo "${R2_array[@]}" | tr " " ",")
# Run Trinity
# Running as stranded, based off of analysis on 20200521:
# https://robertslab.github.io/sams-notebook/posts/2020/2020-05-21-SRA-Library-Assessment---Determine-RNAseq-Library-Strandedness-from-P.trituberculatus-SRA-BioProject-PRJNA597187/
${trinity_dir}/Trinity \
--seqType fq \
--max_memory 500G \
--CPU ${threads} \
--SS_lib_type RF \
--left "${R1_list}" \
--right "${R2_list}"
# Rename generic assembly FastA
mv trinity_out_dir/Trinity.fasta trinity_out_dir/"${fasta_name}"
# Assembly stats
${trinity_dir}/util/TrinityStats.pl trinity_out_dir/"${fasta_name}" \
> ${assembly_stats}
# Create gene map files
${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
trinity_out_dir/"${fasta_name}" \
> trinity_out_dir/"${fasta_name}".gene_trans_map
# Create sequence lengths file (used for differential gene expression)
${trinity_dir}/util/misc/fasta_seq_length.pl \
trinity_out_dir/"${fasta_name}" \
> trinity_out_dir/"${fasta_name}".seq_lens
# Create FastA index
${samtools} faidx \
trinity_out_dir/"${fasta_name}"
# Copy files to transcriptome directory
rsync -av \
trinity_out_dir/"${fasta_name}"* \
${transcriptome_dir}
# Generate FastA MD5 checksum
# See last line of SLURM output file
cd trinity_out_dir
md5sum "${fasta_name}" > "${fasta_name}".checksum.md5
RESULTS
Took ~2 days to run:
This assembly did hit a small hiccup during the Butterfly portion where it hung indefinitely. I posted an issue on the Trinity GitHub Issues and received instructions on how to resolve it. Apparently, it was due to a very high number of low complexity sequences. Check that issue for info on how it was resolved.
Also, while this was running, I ended up tracking down two additional crab transcriptomes on NCBI:
Carcinus maenas (European green crab; Thanks to Grace)
So, I don’t really need the resulting transcriptome assembly… Oh well, here it is anyway!
Output folder:
FastA:
- 20200523_ptri_trinity_transcriptome/trinity_out_dir/20200526.P_trituberculatus.Trinity.fasta (386MB)
FastA Index:
Assembly stats (text):
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 408543
Total trinity transcripts: 580098
Percent GC: 44.63
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3865
Contig N20: 2697
Contig N30: 1980
Contig N40: 1442
Contig N50: 1012
Median contig length: 331
Average contig: 628.37
Total assembled bases: 364514830
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 3226
Contig N20: 2047
Contig N30: 1310
Contig N40: 840
Contig N50: 564
Median contig length: 284
Average contig: 483.10
Total assembled bases: 197366539