After annotating Pgenerosa_v074 and comparing feature counts, there was a drastic difference between the two genome versions. Additionally, both of those genomes ended up with no CDS/exon/gene/mRNA features identified in the largest scaffold. So, to explore this further by seeing where (if??) sequencing reads map to the scaffold, and to obtain transcript isoforms for the genome, I ran Stringtie. A Hisat2 index was prepared earlier.
Here’s the quick rundown of how transcript isoform annotation with Stringtie runs:
Use Hisat2 reference index with identified splice sites and exons (this was done earlier).
Use Hisat2 to create alignments from each pair of trimmed FastQ files. Need to use the
--downstream-transcriptome-assembly
option!!!Use Stringtie to create a GTF for each alignment.
Use Stringtie to create a singular, merged GTF from all of the individual GTFs.
The trimmed FastQ files used were extracted from the following Trinity assemblies (links to notebooks):
See the fastq.list.txt for the list of FastQ files used as input. Also, see the Nightingales Google Sheet for more details on sequencing files. NOTE: The “P” in the *.P.qtrim.gz
represents trimmed reads that are properly paired, as determined by Trimmomatic/Trinity.
This was run on Mox.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=pgen_v070_stringtie
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_stringtie_pgen_v070
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log
threads=28
genome_index_name="Pgenerosa_v070"
# Paths to programs
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2="${hisat2_dir}/hisat2"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
stringtie="/gscratch/srlab/programs/stringtie-1.3.6.Linux_x86_64/stringtie"
# Input/output files
genome_gff="/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gff"
genome_index_dir="/gscratch/srlab/sam/data/P_generosa/genomes"
fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/"
gtf_list="gtf_list.txt"
## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
names_array=()
# Copy Hisat2 genome index
rsync -av "${genome_index_dir}"/Pgenerosa_v070*.ht2 .
# Create array of fastq R1 files
for fastq in "${fastq_dir}"*R1*.gz
do
fastq_array_R1+=("${fastq}")
done
# Create array of fastq R2 files
for fastq in "${fastq_dir}"*R2*.gz
do
fastq_array_R2+=("${fastq}")
done
# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in "${fastq_dir}"*R1*.gz
do
names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"_" '{print $1}'))
done
# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${fastq_dir}"*.gz
do
echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done
# Hisat2 alignments
for index in "${!fastq_array_R1[@]}"
do
sample_name=$(echo "${names_array[index]}")
"${hisat2}" \
-x "${genome_index_name}" \
--downstream-transcriptome-assembly \
-1 "${fastq_array_R1[index]}" \
-2 "${fastq_array_R2[index]}" \
-S "${sample_name}".sam \
2> "${sample_name}"_hisat2.err
# Sort SAM files, convert to BAM, and index
"${samtools}" view \
-@ "${threads}" \
-Su "${sample_name}".sam \
| "${samtools}" sort - \
-@ "${threads}" \
-o "${sample_name}".sorted.bam
"${samtools}" index "${sample_name}".sorted.bam
# Run stringtie on alignments
"${stringtie}" "${sample_name}".sorted.bam \
-p "${threads}" \
-o "${sample_name}".gtf \
-G "${genome_gff}" \
-C "${sample_name}.cov_refs.gtf"
# Add GTFs to list file
echo "${sample_name}.gtf" >> "${gtf_list}"
done
# Create singular transcript file, using GTF list file
"${stringtie}" --merge \
"${gtf_list}" \
-p "${threads}" \
-G "${genome_gff}" \
-o "${genome_index_name}".stringtie.gtf
# Delete unneccessary index files
rm "${genome_index_name}"*.ht2
# Delete unneded SAM files
rm ./*.sam
RESULTS
This took a bit more than one day to complete:
Output folder:
Merged GTF:
Although I won’t link them here, each input FastQ pair has a corresponding alignment file (BAM), coverage file (.cov_refs.gtf
), Hisat2 alignment stats file (_hisat2.err
), and GTF.
Additionally, I’m not going to delve too much into the analysis of this. I’ll write a separate notebook post that goes into more depth on how this data looks, how it compares to the Pgenerosa_v074 Stringtie data, and what this means for the annotation of Scaffold 1…