Genome Annotation - Pgenerosa_v074 Transcript Isoform ID with Stringtie on Mox

After annotating Pgenerosa_v070 and comparing feature counts, there was a drastic difference between the two genome versions. Additionally, both of those genomes ended up with no CDS/exon/gene/mRNA features identified in the largest scaffold. So, to explore this further by seeing where (if??) sequencing reads map to the scaffold, and to obtain transcript isoforms for the genome, I ran Stringtie. A Hisat2 index was prepared earlier.

Here’s the quick rundown of how transcript isoform annotation with Stringtie runs:

  1. Use Hisat2 reference index with identified splice sites and exons (this was done earlier).

  2. Use Hisat2 to create alignments from each pair of trimmed FastQ files. Need to use the --downstream-transcriptome-assembly option!!!

  3. Use Stringtie to create a GTF for each alignment.

  4. Use Stringtie to create a singular, merged GTF from all of the individual GTFs.

The trimmed FastQ files used were extracted from the following Trinity assemblies (links to notebooks):

See the fastq.list.txt for the list of FastQ files used as input. Also, see the Nightingales Google Sheet for more details on sequencing files. NOTE: The “P” in the *.P.qtrim.gz represents trimmed reads that are properly paired, as determined by Trimmomatic/Trinity.

This was run on Mox.

SBATCH script (GitHub):

## Job Name
#SBATCH --job-name=pgen_v074_stringtie
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_stringtie_pgen_v074

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log


# Paths to programs

# Input/output files

## Inititalize arrays

# Copy Hisat2 genome index
rsync -av "${genome_index_dir}"/Pgenerosa_v074*.ht2 .

# Create array of fastq R1 files
for fastq in "${fastq_dir}"*R1*.gz

# Create array of fastq R2 files
for fastq in "${fastq_dir}"*R2*.gz

# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in "${fastq_dir}"*R1*.gz
  names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"_" '{print $1}'))

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${fastq_dir}"*.gz
  echo "${fastq#${fastq_dir}}" >> fastq.list.txt

# Hisat2 alignments
for index in "${!fastq_array_R1[@]}"
  sample_name=$(echo "${names_array[index]}")
  "${hisat2}" \
  -x "${genome_index_name}" \
  --downstream-transcriptome-assembly \
  -1 "${fastq_array_R1[index]}" \
  -2 "${fastq_array_R2[index]}" \
  -S "${sample_name}".sam \
  2> "${sample_name}"_hisat2.err

# Sort SAM files, convert to BAM, and index
  "${samtools}" view \
  -@ "${threads}" \
  -Su "${sample_name}".sam \
  | "${samtools}" sort - \
  -@ "${threads}" \
  -o "${sample_name}".sorted.bam
  "${samtools}" index "${sample_name}".sorted.bam

# Run stringtie on alignments
  "${stringtie}" "${sample_name}".sorted.bam \
  -p "${threads}" \
  -o "${sample_name}".gtf \
  -G "${genome_gff}" \
  -C "${sample_name}.cov_refs.gtf"

# Add GTFs to list file, only if non-empty
# Identifies GTF files that only have header
  gtf_lines=$(wc -l < ${sample_name}.gtf )
  if [ "${gtf_lines}" -gt 2 ]; then
    echo "${sample_name}.gtf" >> "${gtf_list}"

# Create singular transcript file, using GTF list file
"${stringtie}" --merge \
"${gtf_list}" \
-p "${threads}" \
-G "${genome_gff}" \
-o "${genome_index_name}".stringtie.gtf

# Delete unneccessary index files
rm "${genome_index_name}"*.ht2

# Delete unneded SAM files
rm ./*.sam


The process of getting this to run took longer than it should have. The job got interrupted by an unintended Mox shutdown and also failed a few times before I was able to figure out the problem. The failure was triggered by one sample that didn’t end up having any transcripts present in the output GTF. When Stringtie encounters an empty GTF file, it not only spits out an error (indicating as such), but it also kills the program. I had to re-write part of the script to evaluate whether or not input GTFs for the Stringtie merge step had any content.

Once it was running with this change, and didn’t get interrupted by Mox problems, it took ~ one day:

Screencap of Pgenerosa_v074 Stringtie runtime on Mox

Output folder:

Merged GTF:

Although I won’t link them here, each input FastQ pair has a corresponding alignment file (BAM), coverage file (.cov_refs.gtf), Hisat2 alignment stats file (_hisat2.err), and GTF.

Additionally, I’m not going to delve too much into the analysis of this. I’ll write a separate notebook post that goes into more depth on how this data looks, how it compares to the Pgenerosa_v074 Stringtie data, and what this means for the annotation of Scaffold 1…

UPDATE 20190829

Created a massive, merged BAM file for use with GenSAS for further annotation of the Pgenerosa_v074 assembly. Ran the following commands to create merged BAM file and corresponding index file:

samtools merge 20190723_sorted.merged.bam *.bam
samtools index 20190723_sorted.merged.bam