Per this thread in Slack, we realized that the “final” annotation of the Olurida_v081 genome only seemed to have singular mRNA annotations and no apparent isoforms. As such, I decided to see if I could tease out this type of info.
I decided to use Stringtie, as it seemed robust and relatively straightforward. Plus, it had a decent user guide explaining how to tackle this exact problem.
However, before being able to start in with Stringtie, a couple of things needed to be done first to prepare files for use with Stringtie:
Create GTF file (basically a GFF specifically for use with transcripts - thus the “T” in GTF) from input GFF file. Done with GFF utilities software.
Identify splice sites and exons in newly-created GTF. Done with Hisat2 software.
Create a Hisat2 reference index that utilizes the GTF. Done with Hisat2 software.
This was run on Mox. The SBATCH script has a bunch of leftover extraneous steps that aren’t relevant to this step of the annotation process; specifically the FastQ manipulation steps. Originally, I had a large script running both this and the subsequent Stringtie process. However, I was having issues with Stringtie and it made more sense to have these GTF/indexing steps as a separate script, so I chopped off the Stringtie stuff and neglected to remove the FastQ stuff. I didn’t want to edit the script after I actually ran it, so have left it in here.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=oly_hisat2
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190625_hisat2-build_oly_v081
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log
threads=27
genome_index_name="Olurida_v081"
# Paths to programs
gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2_build="${hisat2_dir}/hisat2-build"
hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"
# Input/output files
genome_gff="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081/20181127_oly_genome_snap02.all.renamed.putative_function.domain_added.gff"
exons="hisat2_exons.tab"
fastq_dir="/gscratch/srlab/sam/data/O_lurida/RNAseq/"
genome_fasta="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081/Olurida_v081.fa"
splice_sites="hisat2_splice_sites.tab"
transcripts_gtf="20190620_oly_genome_snap02.all.renamed.putative_function.domain_added.transcripts.gtf"
## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
# Create array of fastq R1 files
for fastq in ${fastq_dir}/*R1*.gz
do
fastq_array_R1+=(${fastq})
done
# Create array of fastq R2 files
for fastq in ${fastq_dir}/*R2*.gz
do
fastq_array_R2+=(${fastq})
done
# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in ${fastq_dir}/*R1*.gz
do
names_array+=($(echo ${R1_fastq#${fastq_dir}} | awk -F"[_.]" '{print $1 "_" $5}'))
done
# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in ${fastq_dir}*.gz
do
echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done
# Create transcipts GTF from genome FastA
"${gffread}" -T \
"${genome_gff}" \
-o "${transcripts_gtf}"
# Create Hisat2 exons tab file
"${hisat2_exons}" \
"${transcripts_gtf}" \
> "${exons}"
# Create Hisate2 splice sites tab file
"${hisat2_splice_sites}" \
"${transcripts_gtf}" \
> "${splice_sites}"
# Build Hisat2 reference index using splice sites and exons
"${hisat2_build}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2_build.err
RESULTS
Output folder:
The Hisat2 index files are: *.ht2
. These will be used with Stringtie for transcript isoform annotation.