SRA Library Assessment - Determine RNAseq Library Strandedness from P.trituberculatus SRA BioProject PRJNA597187

We’ve produced a number of C.bairid transcriptomes utilizing different assembly approaches (e.g. Arthropoda reads only, stranded libraries only, mixed strandedness libraries, etc) and we want to determine which of them is “best”. Trinity has a nice list of tools to assess the quality of transcriptome assemblies, but most of the tools rely on comparison to a transcriptome of a related species.

I was unable to readily find a crab transcriptome assembly anywhere, so decided to make our own assembly using RNAseq data from NCBI SRA BioProject PRJNA597187. This is from the Japanese blue crab, Portunus trituberculatus and has a fair amount of sequencing data, which should result in a pretty comprehensive transcriptome.

Before assembling, however, I want to determine if the libraries in this particular project were stranded or not, as Trinity has an option to indicate stranded libraries for use in the assembly. Trinity also offers a script to determine library strandedness which produces a set of violin plots to help discern the type of library one is working with.

Basically, you have to create a transcriptome assembly and then map the reads back to the assembly. Then, the Trinity script will generate the violin plots. I ran the Trinity script on just a single set of paired-end reads FastQs to make this happen relatively quickly. This was run on Mox.

SBATCH script (GitHub):

## Job Name
#SBATCH --job-name=trinity_ptri_strand_check
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=9-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200521_ptri_trinity_strandedness_check

### De novo transcriptome assembly of Portunus trituberculatus (Japanese blue crab)
### RNAseq data from  NCBI BioProject PRJNA597187.
### Use single set of FastQ reads to determine library standedness.

# Exit script if a command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# User-defined variables

# Paths to programs

## Inititalize arrays

# Variables for R1/R2 lists

# Create array of fastq R1 files

# Create array of fastq R2 files

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${!R1_array[@]}"
    echo "${R1_array[${fastq}]##*/}"
    echo "${R2_array[${fastq}]##*/}"
  } >> fastq.list.txt

# Create comma-separated lists of FastQ reads
R1_list=$(echo "${R1_array[@]}" | tr " " ",")
R2_list=$(echo "${R2_array[@]}" | tr " " ",")

# Run Trinity
## Not running as "stranded", due to mix of library types
${trinity_dir}/Trinity \
--seqType fq \
--max_memory 100G \
--CPU ${threads} \
--left "${R1_list}" \
--right "${R2_list}"

# Assembly stats
${trinity_dir}/util/ trinity_out_dir/Trinity.fasta \
> ${assembly_stats}

# Create FastA index
${samtools} faidx \

# Align reads to assembly
${trinity_bowtie} \
--target trinity_out_dir/Trinity.fasta \
--left "${R1_list}" \
--right "${R2_list}" \
| ${samtools} view \
--threads ${threads} \
-Sb - \
| ${samtools} sort \
--threads ${threads} \
- -o bowtie2.coordSorted.bam

# Examine strand specificity
${trinity_strand} bowtie2.coordSorted.bam


This took a surprisingly long time to run, considering I was only using a single set of paired-end reads; ~15hrs (note: failure shown below was due to a missin R package (vioplot) needed for plotting)

strandedness runtime

Output folder:

Violin plots (PDF):

ptri strandedness violin plots

Trinity Example:

Trinity example of stranded library aligned to non-stranded assembly

The shape of my violin plots (“barbell”) match the Trinity example of stranded libraries aligned to an assembly created with the non-stranded setting, indicating that the libraries for these RNAseq reads are stranded. I will create a transcriptome assembly from all of the reads using the stranded setting in Trinity.