Transcriptome Comparison - C.bairdi Transcriptomes Compared with DETONATE on Mox

We’ve produced a number of C.bairdi transcriptomes and we’re interested in doing some comparisons to try to determine which one might be “best”. I previously compared the BUSCO scores of each of these transcriptomes and now will be using the DETONATE software package to perform two different types of comparisons: compared to a reference (REF-EVAL) and determine an overall quality “score” (RSEM-EVAL). I’ll be running REF-EVAL in this notebook.

A link to the paper is here and explains both:

I opted to just “quickly” run through REF-EVAL, as it only requires (minimally) transcriptome FastAs for performing assembly comparisons; although it probably provides more accurate comparisons if you generate a “true assembly” first. However, this can’t always be done if using a publicly available transcriptome which doesn’t have the corresponding FastQ files used for the assembly.

Since this is designed to compare a transcriptome assembly to a “reference” transcriptome of a related species, I did a couple of things:

  1. Downloaded Japanese blue crab (Portunus trituberculatus) transcriptome from NCBI (accession: GFFJ01.1).

  2. Downloaded green crab (Carcinus maenas) transcriptome from NCBI (accession: GBXE01.1).

  3. Downloaded and assembled Portunus trituberculatus NCBI SRA RNAseq data.

I then compared all of our assemblies to each other, to the three “reference” sequences, and the three “reference” sequences to our C.bairdi assemblies (i.e. using our assemblies as the “reference” in the comparison).

This job was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_detonate_transcriptome_comparisons
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=8-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200529_cbai_detonate_transcriptome_comparisons


###################################################################################
# These variables need to be set by user

# Array of the various comparisons to evaluate
# Each condition in each comparison should be separated by a "-"
transcriptomes_array=(
cbai_transcriptome_v1.0.fa \
cbai_transcriptome_v1.5.fa \
cbai_transcriptome_v1.6.fa \
cbai_transcriptome_v1.7.fa \
cbai_transcriptome_v2.0.fa \
cbai_transcriptome_v3.0.fa \
20200526.P_trituberculatus.Trinity.fa \
GFFJ01.1.fa \
GBXE01.1.fa
)


###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log


threads=28

#programs
pblat="/gscratch/srlab/programs/pblat-2.1/pblat"
detonate="/gscratch/srlab/programs/detonate-1.11/ref-eval/ref-eval"


# Determine length of transcriptomes array
transcriptomes_array_length=${#transcriptomes_array[@]}

# Loop through each comparison
for (( i=0; i < transcriptomes_array_length; i++ ))
do
  transcriptome1="${transcriptomes_array[$i]}"

  # Capture FastA checksums for verification
  echo "Generating checksum for ${transcriptome1}"
  md5sum "${transcriptome1}" >> fasta.checksums.md5
  echo "Finished generating checksum for ${transcriptome1}"
  echo ""

  for (( j=0; j < transcriptomes_array_length; j++ ))
  do
    # Don't run comparison of the same transcriptome
    if [[ "${transcriptomes_array[$j]}" != "${transcriptomes_array[$i]}" ]]; then
      transcriptome2="${transcriptomes_array[$j]}"
      comparison1="${transcriptome1}-vs-${transcriptome2}"
      comparison2="${transcriptome2}-vs-${transcriptome1}"

      # Check if pblat output files are not present
      if [[ ! -f "${comparison1}.psl" ]] && [[  -f "${comparison2}".psl ]]; then
        # Run pblat
        echo "Starting pblat: ${comparison1}"
        ${pblat} -minIdentity=80 -threads=${threads} "${transcriptome2}" "${transcriptome1}" "${comparison1}".psl
        echo "Finished pblat: ${comparison1}"
        echo ""

        echo "Starting pblat: ${comparison2}"
        ${pblat} -minIdentity=80 -threads=${threads} "${transcriptome1}" "${transcriptome2}" "${comparison2}".psl
        echo "Finished pblat: ${comparison2}"
        echo ""
      fi

      # Check if DETONATE output file exists
      if [[ ! -f "${comparison1}.scores.txt" ]]; then
        # Run ref-eval, unweighted scores only
        echo "Running DETONATE with ${transcriptome1} and ${transcriptome2}."
        ${detonate} \
        --scores=nucl,pair,contig \
        --weighted=no \
        --A-seqs "${transcriptome1}" \
        --B-seqs "${transcriptome2}" \
        --A-to-B "${comparison1}".psl \
        --B-to-A "${comparison2}".psl \
        | tee "${comparison1}.scores.txt"
        echo "Finished DETONATE with ${transcriptome1} and ${transcriptome2}."
        echo ""
      fi
    fi


  done
done

RESULTS

Took a bit over 3 days to complete:

cbai transcriptome comparisons runtime

NOTE: There are two jobs associated with this because I had to modify SBATCH script after initial run to account for missed reference comparisons.

Well, I ran this and I am completely unsure what it all means. Hah! And, ugh! It was easy enough to setup and run, so it didn’t require too much effort. However, even after reading the paper, it’s still not clear if this has any real value; particularly since the paper focuses on evaluating de novo assemblies. The evaluation to a reference transcriptome seems like it’s an aside. However, the two sets of software are packaged together as DETONATE, so I figured I’d run them both.

All the files are linked below for anyone curious enough to spend time trying to figure out if these results mean anything…

Output folder:

Reference: cbai_transcriptome_v1.0


Reference: cbai_transcriptome_v1.5


Reference: cbai_transcriptome_v1.6


Reference: cbai_transcriptome_v1.7


Reference: cbai_transcriptome_v2.0


Reference: cbai_transcriptome_v3.0


Reference: 20200526.P_trituberculatus


Reference: GBXE01.1


Reference: GFFJ01.1