Essentially, the steps below (which is what was done here) are needed to prepare files for use with Stringtie:
Create GTF file (basically a GFF specifically for use with transcripts - thus the “T” in GTF) from input GFF file. Done with GFF utilities software.
Identify splice sites and exons in newly-created GTF. Done with Hisat2 software.
Create a Hisat2 reference index that utilizes the GTF. Done with Hisat2 software.
This was run on Mox.
The SBATCH script has a bunch of leftover extraneous steps that aren’t relevant to this step of the annotation process; specifically the FastQ manipulation steps. This is due to a copy/paste from a previous Hisat2 run that I neglected to edit out and I didn’t want to edit the script after I actually ran it, so have left it in here.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=oly_hisat2
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_hisat2-build_pgen_v074
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log
threads=28
genome_index_name="Pgenerosa_v074"
# Paths to programs
gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2_build="${hisat2_dir}/hisat2-build"
hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"
# Input/output files
fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq"
genome_dir="/gscratch/srlab/sam/data/P_generosa/genomes"
genome_gff="${genome_dir}/Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff"
exons="hisat2_exons.tab"
genome_fasta="${genome_dir}/Pgenerosa_v074.fa"
splice_sites="hisat2_splice_sites.tab"
transcripts_gtf="Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gtf"
## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
# Create array of fastq R1 files
for fastq in "${fastq_dir}"/*R1*.gz
do
fastq_array_R1+=("${fastq}")
done
# Create array of fastq R2 files
for fastq in "${fastq_dir}"/*R2*.gz
do
fastq_array_R2+=("${fastq}")
done
# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in "${fastq_dir}"/*R1*.gz
do
names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
done
# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${fastq_dir}"/*.gz
do
echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done
# Create transcipts GTF from genome FastA
"${gffread}" -T \
"${genome_gff}" \
-o "${transcripts_gtf}"
# Create Hisat2 exons tab file
"${hisat2_exons}" \
"${transcripts_gtf}" \
> "${exons}"
# Create Hisate2 splice sites tab file
"${hisat2_splice_sites}" \
"${transcripts_gtf}" \
> "${splice_sites}"
# Build Hisat2 reference index using splice sites and exons
"${hisat2_build}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2_build.err
# Copy Hisat2 index files to my data directory
rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"
RESULTS
This took a shockingly short amount of time to complete: ~10mins!
Other Hisat indexing I’ve done with Ostrea lurida and the Pgen_v070 assemblies all took ~1hr.
Output folder:
The Hisat2 index files are: *.ht2
. These will be used with Stringtie for transcript isoform annotation.