This is the first step in getting transcript isoform annotations. The annotations and alignments that will be generated with Stringtie will be used to help us get a better grasp of what’s going on with our annotations of the different Panopea generosa genome assembly versions.
Essentially, the steps below (which is what was done here) are needed to prepare files for use with Stringtie:
Create GTF file (basically a GFF specifically for use with transcripts - thus the “T” in GTF) from input GFF file. Done with GFF utilities software.
Identify splice sites and exons in newly-created GTF. Done with Hisat2 software.
Create a Hisat2 reference index that utilizes the GTF. Done with Hisat2 software.
This was run on Mox.
The SBATCH script has a bunch of leftover extraneous steps that aren’t relevant to this step of the annotation process; specifically the FastQ manipulation steps. This is due to a copy/paste from a previous Hisat2 run that I neglected to edit out and I didn’t want to edit the script after I actually ran it, so have left it in here.
SBATCH script (GitHub):
-20190723_hisat2-build_pgen_v070.sh
#!/bin/bash
## Job Name
#SBATCH --job-name=oly_hisat2
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_hisat2-build_pgen_v070
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log
threads=28
genome_index_name="Pgenerosa_v070"
# Paths to programs
gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2_build="${hisat2_dir}/hisat2-build"
hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"
# Input/output files
fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq"
genome_dir="/gscratch/srlab/sam/data/P_generosa/genomes"
genome_gff="${genome_dir}/Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gff"
exons="hisat2_exons.tab"
genome_fasta="${genome_dir}/Pgenerosa_v070.fa"
splice_sites="hisat2_splice_sites.tab"
transcripts_gtf="Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gtf"
## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
# Create array of fastq R1 files
for fastq in "${fastq_dir}"/*R1*.gz
do
fastq_array_R1+=("${fastq}")
done
# Create array of fastq R2 files
for fastq in "${fastq_dir}"/*R2*.gz
do
fastq_array_R2+=("${fastq}")
done
# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in "${fastq_dir}"/*R1*.gz
do
names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
done
# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${fastq_dir}"/*.gz
do
echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done
# Create transcipts GTF from genome FastA
"${gffread}" -T \
"${genome_gff}" \
-o "${transcripts_gtf}"
# Create Hisat2 exons tab file
"${hisat2_exons}" \
"${transcripts_gtf}" \
> "${exons}"
# Create Hisate2 splice sites tab file
"${hisat2_splice_sites}" \
"${transcripts_gtf}" \
> "${splice_sites}"
# Build Hisat2 reference index using splice sites and exons
"${hisat2_build}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2_build.err
# Copy Hisat2 index files to my data directory
rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"
RESULTS
Took about an hour to run:
Output folder:
The Hisat2 index files are: *.ht2
. These will be used with Stringtie for transcript isoform annotation.