With the P.generosa v070 annotation taking a very long time (due to the genome/transcriptome sizes, and, hitting some snags in the script I put together), Steven asked if I could subset the genome and annotate the v071 assembly (>10kbp subset) in order to have some annotations to use for some talks coming up ASLO (in this GitHub issue).
After a number of fixes, I finally got the annotation to run to completion. Here’s the SBATCH script used to run on Mox:
#!/bin/bash
## Job Name
#SBATCH --job-name=maker
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=2
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190213_geoduck_maker_genome_annotation
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2
# SegFault fix?
export THREADS_DAEMON_MODEL=1
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log
# Add BLAST to system PATH
export PATH=$PATH:/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin
export BLASTDB=/gscratch/srlab/blastdbs/UniProtKB_20181008
## Establish variables for more readable code
wd=$(pwd)
maker_dir=/gscratch/srlab/programs/maker-2.31.10/bin
snap_dir=/gscratch/srlab/programs/maker-2.31.10/exe/snap
### Paths to Maker binaries
maker=${maker_dir}/maker
gff3_merge=${maker_dir}/gff3_merge
maker2zff=${maker_dir}/maker2zff
fathom=${snap_dir}/fathom
forge=${snap_dir}/forge
hmmassembler=${snap_dir}/hmm-assembler.pl
fasta_merge=${maker_dir}/fasta_merge
map_ids=${maker_dir}/maker_map_ids
map_gff_ids=${maker_dir}/map_gff_ids
map_fasta_ids=${maker_dir}/map_fasta_ids
functional_fasta=${maker_dir}/maker_functional_fasta
functional_gff=${maker_dir}/maker_functional_gff
ipr_update_gff=${maker_dir}/ipr_update_gff
iprscan2gff3=${maker_dir}/iprscan2gff3
blastp_dir=${wd}/blastp_annotation
maker_blastp=${wd}/blastp_annotation/blastp.outfmt6
maker_prot_fasta=${wd}/snap02/Pgenerosa_v071_snap02.all.maker.proteins.fasta
maker_prot_fasta_renamed=${wd}/snap02/Pgenerosa_v071_snap02.all.maker.proteins.renamed.fasta
maker_transcripts_fasta=${wd}/snap02/Pgenerosa_v071_snap02.all.maker.transcripts.fasta
maker_transcripts_fasta_renamed=${wd}/snap02/Pgenerosa_v071_snap02.all.maker.transcripts.renamed.fasta
snap01_est_gff=${wd}/snap01/Pgenerosa_v071_snap01.maker.all.noseqs.est2genome.gff
snap02_gff=${wd}/snap02/Pgenerosa_v071_snap02.all.gff
snap02_gff_renamed=${wd}/snap02/Pgenerosa_v071_snap02.all.renamed.gff
snap01_protein_gff=${wd}/snap01/Pgenerosa_v071_snap01.maker.all.noseqs.protein2genome.gff
snap01_rm_gff=${wd}/snap01/Pgenerosa_v071_snap01.maker.all.noseqs.repeats.gff
put_func_gff=Pgenerosa_v071_genome_snap02.all.renamed.putative_function.gff
put_func_prot=Pgenerosa_v071_genome_snap02.all.maker.proteins.renamed.putative_function.fasta
put_func_trans=Pgenerosa_v071_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta
put_domain_gff=Pgenerosa_v071_genome_snap02.all.renamed.putative_function.domain_added.gff
ips_dir=${wd}/interproscan_annotation
ips_base=Pgenerosa_v071_maker_proteins_ips
ips_name=Pgenerosa_v071_maker_proteins_ips.tsv
id_map=${wd}/snap02/Pgenerosa_v071_genome.map
ips_domains=Pgenerosa_v071_genome_snap02.all.renamed.visible_ips_domains.gff
## Path to blastp
blastp=/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp
## Path to InterProScan5
interproscan=/gscratch/srlab/programs/interproscan-5.31-70.0/interproscan.sh
## Store path to options control file
maker_opts_file=./maker_opts.ctl
### Path to genome FastA file
genome=/gscratch/srlab/sam/data/P_generosa/generosa_genomes/Pgenerosa_v071.fasta
### Path to transcriptome FastA file
transcriptome=/gscratch/srlab/sam/data/P_generosa/generosa_transcriptomes/20180827_trinity_geoduck.fasta
### Path to Crassotrea gigas NCBI protein FastA
gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa
### Path to Crassostrea virginica NCBI protein FastA
virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa
### Path to Panopea generosa TransDecoder protein FastA
panopea_td_proteome=/gscratch/srlab/sam/data/P_generosa/generosa_proteomes/20180827_trinity_geoduck.fasta.transdecoder.pep
### Path to concatenated NCBI prteins FastA
combined_proteomes=${wd}/combined_proteomes.fasta
### Path to P.generosa-specific repeat library
repeat_library=/gscratch/srlab/sam/data/P_generosa/generosa_repeats/Pgenerosa_v070-families.fa
### Path to SwissProt database for BLASTp
sp_db_blastp=/gscratch/srlab/blastdbs/UniProtKB_20190109/uniprot_sprot.fasta
## Make directories
mkdir blastp_annotation
mkdir interproscan_annotation
## Create Maker control files needed for running Maker, only if it doesn't already exist and then edit it.
### Edit options file
### Set paths to P.generosa genome and transcriptome.
### Set path to combined C. gigas, C.virginica, P.generosa proteomes.
### The use of the % symbol sets the delimiter sed uses for arguments.
### Normally, the delimiter that most examples use is a slash "/".
### But, we need to expand the variables into a full path with slashes, which screws up sed.
### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
if [ ! -e maker_opts.ctl ]; then
$maker -CTL
sed -i "/^genome=/ s% %$genome %" "$maker_opts_file"
sed -i "/^est=/ s% %$transcriptome %" "$maker_opts_file"
sed -i "/^protein=/ s% %$combined_proteomes %" "$maker_opts_file"
sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"
sed -i "/^est2genome=0/ s/est2genome=0/est2genome=1/" "$maker_opts_file"
sed -i "/^protein2genome=0/ s/protein2genome=0/protein2genome=1/" "$maker_opts_file"
fi
## Create combined proteome FastA file, only if it doesn't already exist.
if [ ! -e combined_proteomes.fasta ]; then
touch combined_proteomes.fasta
cat "$gigas_proteome" >> combined_proteomes.fasta
cat "$virginica_proteome" >> combined_proteomes.fasta
cat "$panopea_td_proteome" >> combined_proteomes.fasta
fi
## Run Maker
### Specify number of nodes to use.
mpiexec -n 56 $maker
## Merge gffs
${gff3_merge} -d Pgenerosa_v071.maker.output/Pgenerosa_v071_master_datastore_index.log
## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v071.maker.output/Pgenerosa_v071_master_datastore_index.log > Pgenerosa_v071.maker.all.noseqs.gff
## Merge all FastAs
${fasta_merge} -d Pgenerosa_v071.maker.output/Pgenerosa_v071_master_datastore_index.log
## Extract GFF alignments for use in subsequent MAKER rounds
### Transcript alignments
awk '{ if ($2 == "est2genome") print $0 }' Pgenerosa_v071.maker.all.noseqs.gff > Pgenerosa_v071.maker.all.noseqs.est2genome.gff
### Protein alignments
awk '{ if ($2 == "protein2genome") print $0 }' Pgenerosa_v071.maker.all.noseqs.gff > Pgenerosa_v071.maker.all.noseqs.protein2genome.gff
### Repeat alignments
awk '{ if ($2 ~ "repeat") print $0 }' Pgenerosa_v071.maker.all.noseqs.gff > Pgenerosa_v071.maker.all.noseqs.repeats.gff
## Run SNAP training, round 1
cd ${wd}
cd snap01
${maker2zff} ../Pgenerosa_v071.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} genome . > Pgenerosa_v071_snap01.hmm
## Initiate second Maker run.
### Copy initial maker control files and
### - change gene prediction settings to 0 (i.e. don't generate Maker gene predictions)
### - use GFF subsets generated in first round of MAKER
### - set location of snaphmm file to use for gene prediction
### Percent symbols used below are the sed delimiters, instead of the default "/",
### due to the need to use file paths.
if [ ! -e maker_opts.ctl ]; then
$maker -CTL
sed -i "/^genome=/ s% %$genome %" maker_opts.ctl
sed -i "/^est2genome=1/ s/est2genome=1/est2genome=0/" maker_opts.ctl
sed -i "/^protein2genome=1/ s/protein2genome=1/protein2genome=0/" maker_opts.ctl
sed -i "/^est_gff=/ s% %../Pgenerosa_v071.maker.all.noseqs.est2genome.gff %" maker_opts.ctl
sed -i "/^protein_gff=/ s% %../Pgenerosa_v071.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl
sed -i "/^rm_gff=/ s% %../Pgenerosa_v071.maker.all.noseqs.repeats.gff %" maker_opts.ctl
sed -i "/^snaphmm=/ s% %Pgenerosa_v071_snap01.hmm %" maker_opts.ctl
fi
## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base Pgenerosa_v071_snap01
## Merge gffs
${gff3_merge} -d Pgenerosa_v071_snap01.maker.output/Pgenerosa_v071_snap01_master_datastore_index.log
## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v071_snap01.maker.output/Pgenerosa_v071_snap01_master_datastore_index.log > Pgenerosa_v071_snap01.maker.all.noseqs.gff
## Run SNAP training, round 2
cd ${wd}
cd snap02
${maker2zff} ../snap01/Pgenerosa_v071_snap01.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} genome . > Pgenerosa_v071_snap02.hmm
## Initiate third and final Maker run.
if [ ! -e maker_opts.ctl ]; then
$maker -CTL
sed -i "/^genome=/ s% %$genome %" maker_opts.ctl
sed -i "/^est2genome=1/ s/est2genome=1/est2genome=0/" maker_opts.ctl
sed -i "/^protein2genome=1/ s/protein2genome=1/protein2genome=0/" maker_opts.ctl
sed -i "/^est_gff=/ s% %../Pgenerosa_v071.maker.all.noseqs.est2genome.gff %" maker_opts.ctl
sed -i "/^protein_gff=/ s% %../Pgenerosa_v071.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl
sed -i "/^rm_gff=/ s% %../Pgenerosa_v071.maker.all.noseqs.repeats.gff %" maker_opts.ctl
sed -i "/^snaphmm=/ s% %Pgenerosa_v071_snap02.hmm %" maker_opts.ctl
fi
## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base Pgenerosa_v071_snap02
## Merge gffs
${gff3_merge} \
-d Pgenerosa_v071_snap02.maker.output/Pgenerosa_v071_snap02_master_datastore_index.log
## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v071_snap02.maker.output/Pgenerosa_v071_snap02_master_datastore_index.log > Pgenerosa_v071_snap02.maker.all.noseqs.gff
## Merge FastAs
${fasta_merge} \
-d Pgenerosa_v071_snap02.maker.output/Pgenerosa_v071_snap02_master_datastore_index.log
# Create copies of files for mapping
cp ${maker_prot_fasta} ${maker_prot_fasta_renamed}
cp ${maker_transcripts_fasta} ${maker_transcripts_fasta_renamed}
cp ${snap02_gff} ${snap02_gff_renamed}
# Map IDs
## Change gene names
${map_ids} \
--prefix PGEN_ \
--justify 8 \
${snap02_gff} \
> ${id_map}
## Map GFF IDs
${map_gff_ids} \
${id_map} \
${snap02_gff_renamed}
## Map FastAs
### Proteins
${map_fasta_ids} \
${id_map} \
${maker_prot_fasta_renamed}
### Transcripts
${map_fasta_ids} \
${id_map} \
${maker_transcripts_fasta_renamed}
# Run InterProScan 5
## disable-precalc since this requires external database access (which Mox does not allow)
cd ${ips_dir}
${interproscan} \
--input ${maker_prot_fasta_renamed} \
--goterms \
--output-file-base ${ips_base} \
--disable-precalc
# Run BLASTp
cd ${blastp_annotation}
${blastp} \
-query ${maker_prot_fasta_renamed} \
-db ${sp_db_blastp} \
-out ${maker_blastp} \
-max_target_seqs 1 \
-evalue 1e-6 \
-outfmt 6 \
-num_threads 28
# Functional annotations
cd ${wd}
## Add putative gene functions
### GFF
${functional_gff} \
${sp_db_blastp} \
${maker_blastp} \
${snap02_gff_renamed} \
> ${put_func_gff}
### Proteins
${functional_fasta} \
${sp_db_blastp} \
${maker_blastp} \
${maker_prot_fasta_renamed} \
> ${put_func_prot}
### Transcripts
${functional_fasta} \
${sp_db_blastp} \
${maker_blastp} \
${maker_transcripts_fasta_renamed} \
> ${put_func_trans}
## Add InterProScan domain info
### Add searchable tags
${ipr_update_gff} \
${put_func_gff} \
${ips_dir}/${ips_name} \
> ${put_domain_gff}
### Add viewable features for genome browsers (JBrowse, Gbrowse, Web Apollo)
${iprscan2gff3} \
${ips_dir}/${ips_name} \
${snap02_gff_renamed} \
> ${ips_domains}
Here are links to the input files used:
Genome assembly (from 20190211)
- Pgenerosa_v071.fasta (1.3GB)
Transcriptome assembly (from 20180904):
- 20180827_trinity_geoduck.fasta (972MB)
Protein FastA (from 20181121)
Repeats library FastA (from 20181219)
- Pgenerosa_v070-families.fa (1.4MB)
RESULTS
After all the rigamarole with fixing various script issues, this finally completed on 20190226. I think this would have finished much, much sooner if I didn’t have so many small oversights in the SBATCH script.
Output folder:
Final annotation GFF:
Supplemental InterProScan domains GFF: