Remarkably, I managed to burn through our Xsede computing resources and don’t have terribly much to show for it. Ooof! This is a major bummer, as it “only” takes ~8hrs for a WQ-MAKER job to run there, as opposed to months the last time I tried running it on Mox. Although we have used up our Xsede allocation, all is not lost! The experience of setting up/running WQ-MAKER has enlightened me on how it all works and how to run it on Mox so it will (hopefully) take far, far less time than the last Mox attempt. With that said, here we go…
Firstly, I re-installed MAKER (v2.31.10) and configured for OpenMPI support. This is computing jargon that basically allows MAKER to work on a computer cluster efficiently. Now that we have two Mox nodes, I think this will help accelerate the process.
With that out of the way, here’s a very brief overview of the entire MAKER annotation process. Be aware, despite it’s “brevity”, this is still a lengthy read:
Create custom repeat library for your organism (did this with RepeatModeler on 20181022)
Provide RNAseq data (transcriptome v3)
Provide species-specific and/or related proteomic data (used publicly available proteomes from NCBI for C.gigas GCA_000297895.1_oyster_v9) and C.virginica (GCF_002022765.2_C_virginica-3.0)
Run MAKER to produce initial gene models.
Merge all the hundred thousands (seriously) of individual GFFs and FastAs in to a singular file of each file type. MAKER has built-in scripts to do this.
Generate ab initio gene prediction using SNAP. This is integrated in to MAKER.
Run MAKER again, using the SNAP gene models.
Merge the new set of GFFs.
Run SNAP a second time.
Run MAKER a third time using the second set of SNAP gene models.
Merge the final set of GFFs.
Done???
So, that’s how it’s done! Easy!
With each round of MAKER, a “control” file needs to be generated and modified appropriately. Modifications consist of telling MAKER locations of files and whether or not to use certain types of files when producing a new model (e.g. RNAseq data, SNAP HMM file, etc.). Here are the three control files that were used to run MAKER. The links are simply text files, despite their extension, so they can be downloaded and viewed in any text editor, if desired, but I’ve pasted their contents below for easier review:
ROUND 1 (Initial MAKER run)
#-----Genome (these are always required)
genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081/Olurida_v081.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
#-----EST Evidence (for best results provide a file for at least one)
est=/gscratch/srlab/sam/data/O_lurida/oly_transcriptome_assemblies/Olurida_transcriptome_v3.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/gscratch/scrubbed/samwhite/outputs/20181127_oly_maker_genome_annotation/gigas_virginica_ncbi_proteomes.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=/gscratch/srlab/sam/data/O_lurida/Ostrea_lurida_v081-families.fa #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/gscratch/srlab/programs/maker-2.31.10/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file
#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
ROUND 2 (After initial SNAP run)
#-----Genome (these are always required)
genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081/Olurida_v081.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
#-----EST Evidence (for best results provide a file for at least one)
est=/gscratch/srlab/sam/data/O_lurida/oly_transcriptome_assemblies/Olurida_transcriptome_v3.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/gscratch/scrubbed/samwhite/outputs/20181127_oly_maker_genome_annotation/gigas_virginica_ncbi_proteomes.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=/gscratch/srlab/sam/data/O_lurida/Ostrea_lurida_v081-families.fa #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/gscratch/srlab/programs/maker-2.31.10/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm=20181127__oly_snap01.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file
#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
ROUND 3 (After second SNAP run)
#-----Genome (these are always required)
genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081/Olurida_v081.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
#-----EST Evidence (for best results provide a file for at least one)
est=/gscratch/srlab/sam/data/O_lurida/oly_transcriptome_assemblies/Olurida_transcriptome_v3.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/gscratch/scrubbed/samwhite/outputs/20181127_oly_maker_genome_annotation/gigas_virginica_ncbi_proteomes.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=/gscratch/srlab/sam/data/O_lurida/Ostrea_lurida_v081-families.fa #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/gscratch/srlab/programs/maker-2.31.10/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm=20181127__oly_snap01.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file
#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
[samwhite@mox1 snap01]$ cd ../snap02
[samwhite@mox1 snap02]$ cat maker_opts.ctl
#-----Genome (these are always required)
genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081/Olurida_v081.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
#-----EST Evidence (for best results provide a file for at least one)
est=/gscratch/srlab/sam/data/O_lurida/oly_transcriptome_assemblies/Olurida_transcriptome_v3.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/gscratch/scrubbed/samwhite/outputs/20181127_oly_maker_genome_annotation/gigas_virginica_ncbi_proteomes.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=/gscratch/srlab/sam/data/O_lurida/Ostrea_lurida_v081-families.fa #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/gscratch/srlab/programs/maker-2.31.10/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm=20181127__oly_snap02.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file
#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
Here’s the SBATCH script (plain text file) used to actually run this job:
#!/bin/bash
## Job Name
#SBATCH --job-name=20181127_oly_maker_genome_annotation
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=2
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=15-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20181127_oly_maker_genome_annotation
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2
# SegFault fix?
export THREADS_DAEMON_MODEL=1
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log
## Establish variables for more readable code
### Paths to Maker binaries
maker=/gscratch/srlab/programs/maker-2.31.10/bin/maker
gff3_merge=/gscratch/srlab/programs/maker-2.31.10/bin/gff3_merge
fasta_merge=/gscratch/srlab/programs/maker-2.31.10/bin/fasta_merge
maker2zff=/gscratch/srlab/programs/maker-2.31.10/bin/maker2zff
fathom=/gscratch/srlab/programs/maker-2.31.10/exe/snap/fathom
forge=/gscratch/srlab/programs/maker-2.31.10/exe/snap/forge
hmmassembler=/gscratch/srlab/programs/maker-2.31.10/exe/snap/hmm-assembler.pl
### Path to Olympia oyster genome FastA file
oly_genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081/Olurida_v081.fa
### Path to Olympia oyster transcriptome FastA file
oly_transcriptome=/gscratch/srlab/sam/data/O_lurida/oly_transcriptome_assemblies/Olurida_transcriptome_v3.fasta
### Path to Crassotrea gigas NCBI protein FastA
gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa
### Path to Crassostrea virginica NCBI protein FastA
virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa
### Path to concatenated NCBI prteins FastA
gigas_virginica_ncbi_proteomes=/gscratch/scrubbed/samwhite/outputs/20181127_oly_maker_genome_annotation/gigas_virginica_ncbi_proteomes.fasta
### Path to O.lurida-specific repeat library
oly_repeat_library=/gscratch/srlab/sam/data/O_lurida/Ostrea_lurida_v081-families.fa
## Create Maker control files needed for running Maker
$maker -CTL
## Store path to options control file
maker_opts_file=./maker_opts.ctl
## Create combined proteome FastA file, only if it doesn't already exist.
if [ ! -e gigas_virginica_ncbi_proteomes.fasta ]; then
touch gigas_virginica_ncbi_proteomes.fasta
cat "$gigas_proteome" >> gigas_virginica_ncbi_proteomes.fasta
cat "$virginica_proteome" >> gigas_virginica_ncbi_proteomes.fasta
fi
## Edit options file
### Set paths to O.lurida genome and transcriptome.
### Set path to combined C. gigas and C.virginica proteomes.
## The use of the % symbol sets the delimiter sed uses for arguments.
## Normally, the delimiter that most examples use is a slash "/".
## But, we need to expand the variables into a full path with slashes, which screws up sed.
## Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^genome=/ s% %$oly_genome %" "$maker_opts_file"
sed -i "/^est=/ s% %$oly_transcriptome %" "$maker_opts_file"
sed -i "/^protein=/ s% %$gigas_virginica_ncbi_proteomes %" "$maker_opts_file"
sed -i "/^rmlib=/ s% %$oly_repeat_library %" "$maker_opts_file"
sed -i "/^est2genome=0/ s/est2genome=0/est2genome=1/" "$maker_opts_file"
sed -i "/^protein2genome=0/ s/protein2genome=0/protein2genome=1/" "$maker_opts_file"
## Run Maker
### Specify number of nodes to use.
mpiexec -n 56 $maker
## Merge gffs
${gff3_merge} -d Olurida_v081.maker.output/Olurida_v081_master_datastore_index.log
## GFF with no FastA in footer
${gff3_merge} -n -s -d Olurida_v081.maker.output/Olurida_v081_master_datastore_index.log > Olurida_v081.maker.all.noseqs.gff
## Merge all FastAs
${fasta_merge} -d Olurida_v081.maker.output/Olurida_v081_master_datastore_index.log
## Run SNAP training, round 1
mkdir snap01 && cd snap01
${maker2zff} ../Olurida_v081.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} test_snap1 . > 20181127__oly_snap01.hmm
## Initiate second Maker run.
### Copy initial maker control files and
### - change gene prediction settings to 0 (i.e. don't generate Maker gene predictions)
### - set location of snaphmm file to use for gene prediction
cp ../maker_* .
sed -i "/^est2genome=1/ s/est2genome=1/est2genome=0/" maker_opts.ctl
sed -i "/^protein2genome=1/ s/protein2genome=1/protein2genome=0/" maker_opts.ctl
sed -i "/^snaphmm=/ s% %20181127__oly_snap01.hmm %" maker_opts.ctl
## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base 20181127_oly_genome_snap01
## Merge gffs
${gff3_merge} -d 20181127_oly_genome_snap01.maker.output/20181127_oly_genome_snap01_master_datastore_index.log
### GFF with no FastA in footer
${gff3_merge} -n -s -d 20181127_oly_genome_snap01.maker.output/20181127_oly_genome_snap01_master_datastore_index.log > 20181127_oly_genome_snap01.all.noseqs.gff
### Merge all FastAs
${fasta_merge} -d 20181127_oly_genome_snap01.maker.output/20181127_oly_genome_snap01_master_datastore_index.log
## Run SNAP training, round 2
cd ..
mkdir snap02 && cd snap02
${maker2zff} ../snap01/20181127_oly_genome_snap01.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} test_snap1 . > 20181127__oly_snap02.hmm
## Initiate third and final Maker run.
### Copy initial maker control files and:
### - change gene prediction settings to 0 (i.e. don't generate Maker gene predictions)
### - set location of snaphmm file to use for gene prediction
cp ../maker_* .
sed -i "/^est2genome=1/ s/est2genome=1/est2genome=0/" maker_opts.ctl
sed -i "/^protein2genome=1/ s/protein2genome=1/protein2genome=0/" maker_opts.ctl
sed -i "/^snaphmm=/ s% %20181127__oly_snap02.hmm %" maker_opts.ctl
## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base 20181127_oly_genome_snap02
## Merge gffs
${gff3_merge} \
-d 20181127_oly_genome_snap02.maker.output/20181127_oly_genome_snap02_master_datastore_index.log
### GFF with no FastA in footer
{gff3_merge} -n -s -d 20181127_oly_genome_snap02.maker.output/20181127_oly_genome_snap02_master_datastore_index.log > 20181127_oly_genome_snap02.all.noseqs.gff
### Merge all FastAs
${fasta_merge} -d 20181127_oly_genome_snap02.maker.output/20181127_oly_genome_snap02_master_datastore_index.log
We’ll see how this goes…
RESULTS
This actually completed, and in a reasonable (relatively) amount of time (2 weeks)!
All files were rsync
’d to my folder on Gannet.
Output directory:
Here are some of the key output files:
MAKER Protein FastA
MAKER Transcripts FastAs
Initial MAKER Run GFF3
2nd MAKER Run GFF (after 1st SNAP)
3rd MAKER Run GFF (after 2nd SNAP)
So, what to do with all of this?
Firstly, I’ve learned a lot about how all of this should work and recently discovered this remarkably informative GitHub Gist that walks through the process of annotating a snake genome:
The post above provides details on how to speed the process up (hint: use GFFs for subsequent MAKER rounds to avoid repeated BLAST-ing. BLAST-ing is one of the slowest parts of the process.) and provides some explanations of how to evaluate the process, as well as how to run BUSCO/Augustus.
The 3rd MAKER Run GFF should contain the most refined gene models. This GFF has individual genes, coding sequences (CDS), mRNAs, and proteins. However, it’s a good idea to load all three GFFs in a genome browser and see how they compare.
A run through BUSCO/Augustus gene prediction should refine these models even further and seems to be the standard practice when annotating genomes.
Additionally, the protein FastA file needs to be subject to BLASTp, as well as run through InterProScan to actually assign functions to the genome.
Finally, MAKER can put all this together and create better sequence ID info in the FastA files and the GFFs (will create NCBI-standardized sequence IDs).