For reference, this annotation will be referred to as Pgenerosa_v074.1.
I previously created a subset of the Pgenerosa_v070 genome assembly that contains just the largest 18 scaffolds (these scaffolds were produced by Phase Genomics, utilizing some Hi-C sequencing). The new subsetted genome is labeled as Pgenerosa_v074.fa (914MB).
As part of that, I previously annotated this using MAKER, on 20190709.
This time, I’ve add transcripts and their isoforms, as predicted by Stringtie on (from 20190723) to the MAKER process to see if we can improve annotations, particularly in Scaffold1. This scaffold has been void of any gene-feature annotations in any previous MAKER runs (Pgenerosa_v070, Pgenerosa_v071, and Pgenerosa_v074).
This will perform the following:
- one round of MAKER gene model predictions
- two rounds of SNAP gene model training/predictions
- renaming of gene models to NCBI-standardized convention
- functional characterization of protein models (via BLASTp)
- functional characterization of protein domains (via InterProScan5)
Here are a list of the input files used for the various components of the MAKER annotation:
Stringtie GFF
Pgenerosa_v074.stringtie.gff: Made from Pgenerosa_v074.stringtie.gtf using
gffread Pgenerosa_v074.stringtie.gtf -o Pgenerosa_v074.stringtie.gff
on 20190819. Run using Mox build node.
Transcriptome FastA files (links to notebook entries):
NCBI Protein FastA files
NCBI Crassostrea gigas proteome (downloaded 20181119):
GCA_000297895.1_oyster_v9_protein.faa
NCBI Crassostrea virginica proteome (downloaded 20181119):
GCF_002022765.2_C_virginica-3.0_protein.faa
SwissProt BLASTp database(downloaded 20190109): uniprot_sprot.fasta
TransDecoder protein FastA files (links to notebook entries)
Repeats Files
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=maker_stingtie_pgen074
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=2
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=21-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20190819_pgen_maker_v074_with_stringtie_annotation
# Exit if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2
# SegFault fix?
export THREADS_DAEMON_MODEL=1
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log
# Add BLAST to system PATH
export PATH=$PATH:/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin
export BLASTDB=/gscratch/srlab/blastdbs/UniProtKB_20181008
## Establish variables for more readable code
wd=$(pwd)
maker_dir=/gscratch/srlab/programs/maker-2.31.10/bin
snap_dir=/gscratch/srlab/programs/maker-2.31.10/exe/snap
### Paths to Maker binaries
maker=${maker_dir}/maker
gff3_merge=${maker_dir}/gff3_merge
maker2zff=${maker_dir}/maker2zff
fathom=${snap_dir}/fathom
forge=${snap_dir}/forge
hmmassembler=${snap_dir}/hmm-assembler.pl
fasta_merge=${maker_dir}/fasta_merge
map_ids=${maker_dir}/maker_map_ids
map_gff_ids=${maker_dir}/map_gff_ids
map_fasta_ids=${maker_dir}/map_fasta_ids
functional_fasta=${maker_dir}/maker_functional_fasta
functional_gff=${maker_dir}/maker_functional_gff
ipr_update_gff=${maker_dir}/ipr_update_gff
iprscan2gff3=${maker_dir}/iprscan2gff3
blastp_dir=${wd}/blastp_annotation
maker_blastp=${wd}/blastp_annotation/blastp.outfmt6
maker_prot_fasta=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.proteins.fasta
maker_prot_fasta_renamed=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.proteins.renamed.fasta
maker_transcripts_fasta=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.transcripts.fasta
maker_transcripts_fasta_renamed=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.transcripts.renamed.fasta
snap02_gff=${wd}/snap02/Pgenerosa_v074_snap02.all.gff
snap02_gff_renamed=${wd}/snap02/Pgenerosa_v074_snap02.all.renamed.gff
put_func_gff=Pgenerosa_v074_genome_snap02.all.renamed.putative_function.gff
put_func_prot=Pgenerosa_v074_genome_snap02.all.maker.proteins.renamed.putative_function.fasta
put_func_trans=Pgenerosa_v074_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta
put_domain_gff=Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff
ips_dir=${wd}/interproscan_annotation
ips_base=Pgenerosa_v074_maker_proteins_ips
ips_name=Pgenerosa_v074_maker_proteins_ips.tsv
id_map=${wd}/snap02/Pgenerosa_v074_genome.map
ips_domains=Pgenerosa_v074_genome_snap02.all.renamed.visible_ips_domains.gff
## Path to blastp
blastp=/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp
## Path to InterProScan5
interproscan=/gscratch/srlab/programs/interproscan-5.31-70.0/interproscan.sh
## Store path to options control file
maker_opts_file=./maker_opts.ctl
### Path to genome FastA file
genome=/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v074.fa
### Paths to transcriptome FastA files
ctendia_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia/Trinity.fasta
gonad_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad/Trinity.fasta
heart_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/heart/Trinity.fasta
EPI99_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99/Trinity.fasta
EPI115_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115/Trinity.fasta
EPI116_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116/Trinity.fasta
EPI123_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123/Trinity.fasta
EPI124_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124/Trinity.fasta
### Path to Stringtie transcripts GFF
stringtie_gff=/gscratch/srlab/sam/data/P_generosa/transcriptomes/Pgenerosa_v074.stringtie.gff
### Path to Crassotrea gigas NCBI protein FastA
gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa
### Path to Crassostrea virginica NCBI protein FastA
virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa
### Path to Panopea generosa TransDecoder protein FastAs
panopea_td_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/20180827_trinity_geoduck.fasta.transdecoder.pep
pgen_td_ctenidia_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/ctenidia/Trinity.fasta.transdecoder.pep
pgen_td_larvae_EPI99_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/larvae/EPI99/Trinity.fasta.transdecoder.pep
pgen_td_juv_EPI115_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI115/Trinity.fasta.transdecoder.pep
pgen_td_juv_EPI116_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI116/Trinity.fasta.transdecoder.pep
pgen_td_juv_EPI123_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI123/Trinity.fasta.transdecoder.pep
pgen_td_juv_EPI124_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI124/Trinity.fasta.transdecoder.pep
pgen_td_gonad_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/gonad/Trinity.fasta.transdecoder.pep
pgen_td_heart_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/heart/Trinity.fasta.transdecoder.pep
### Path to P.generosa-specific RepeatModeler library
repeat_library=/gscratch/srlab/sam/data/P_generosa/repeats/Pgenerosa_v074-families.fa
### Path to P.generosa-specific RepeatMasker GFF
rm_gff=/gscratch/srlab/sam/data/P_generosa/repeats/Pgenerosa_v074.fa.out.gff
### Path to SwissProt database for BLASTp
sp_db_blastp=/gscratch/srlab/blastdbs/UniProtKB_20190109/uniprot_sprot.fasta
## Make directories
mkdir blastp_annotation
mkdir interproscan_annotation
mkdir snap01
mkdir snap02
## Create Maker control files needed for running Maker, only if it doesn't already exist and then edit it.
### Edit options file
### Set paths to P.generosa genome and transcriptome.
### Set path to combined C. gigas, C.virginica, P.generosa proteomes.
### The use of the % symbol sets the delimiter sed uses for arguments.
### Normally, the delimiter that most examples use is a slash "/".
### But, we need to expand the variables into a full path with slashes, which screws up sed.
### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
if [ ! -e maker_opts.ctl ]; then
$maker -CTL
sed -i "/^genome=/ s% %$genome %" "$maker_opts_file"
# Set transcriptomes to use
sed -i "/^est=/ s% %\
${ctendia_transcriptome},\
${gonad_transcriptome},\
${heart_transcriptome},\
${EPI99_transcriptome},\
${EPI115_transcriptome},\
${EPI116_transcriptome},\
${EPI123_transcriptome},\
${EPI124_transcriptome} \
%" \
"$maker_opts_file"
# Set transcipt GFFs to use
sed -i "est_gff=/ s% %\
${stringtie_gff} \
%" \
"$maker_opts_file"
# Set proteomes to use
sed -i "/^protein=/ s% %\
${gigas_proteome},\
${panopea_td_proteome},\
${pgen_td_ctenidia_proteome},\
${pgen_td_gonad_proteome},\
${pgen_td_heart_proteome},\
${pgen_td_juv_EPI115_proteome},\
${pgen_td_juv_EPI116_proteome},\
${pgen_td_juv_EPI123_proteome},\
${pgen_td_juv_EPI124_proteome},\
${pgen_td_larvae_EPI99_proteome},\
${virginica_proteome} \
%" \
"$maker_opts_file"
# Set RepeatModeler library to use
sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"
# Set RepeatMasker GFF to use
sed -i "/^rm_gff=/ s% %${rm_gff} %" "$maker_opts_file"
# Set est2ggenome to 1 - tells MAKER to use transcriptome FastAs
sed -i "/^est2genome=0/ s/est2genome=0/est2genome=1/" "$maker_opts_file"
# Set protein2genome to 1 - tells MAKER to use protein FastAs
sed -i "/^protein2genome=0/ s/protein2genome=0/protein2genome=1/" "$maker_opts_file"
fi
## Run Maker
### Specify number of nodes to use.
mpiexec -n 56 $maker
## Merge gffs
${gff3_merge} -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log
## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log > Pgenerosa_v074.maker.all.noseqs.gff
## Merge all FastAs
${fasta_merge} -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log
## Extract GFF alignments for use in subsequent MAKER rounds
### Transcript alignments
awk '{ if ($2 == "est2genome") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.est2genome.gff
### Protein alignments
awk '{ if ($2 == "protein2genome") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.protein2genome.gff
### Repeat alignments
awk '{ if ($2 ~ "repeat") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.repeats.gff
## Run SNAP training, round 1
cd "${wd}"
cd snap01
${maker2zff} ../Pgenerosa_v074.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} genome . > Pgenerosa_v074_snap01.hmm
## Initiate second Maker run.
### Copy initial maker control files and
### Default gene prediction settings are 0 (i.e. don't generate Maker gene predictions)
### - use GFF subsets generated in first round of MAKER
### - set location of snaphmm file to use for gene prediction
### Percent symbols used below are the sed delimiters, instead of the default "/",
### due to the need to use file paths.
if [ ! -e maker_opts.ctl ]; then
$maker -CTL
sed -i "/^genome=/ s% %$genome %" maker_opts.ctl
# Set transcriptomes to use
sed -i "/^est=/ s% %\
${ctendia_transcriptome},\
${gonad_transcriptome},\
${heart_transcriptome},\
${EPI99_transcriptome},\
${EPI115_transcriptome},\
${EPI116_transcriptome},\
${EPI123_transcriptome},\
${EPI124_transcriptome} %" \
"$maker_opts_file"
# Set proteomes to use
sed -i "/^protein=/ s% %\
${gigas_proteome},\
${panopea_td_proteome},\
${pgen_td_ctenidia_proteome},\
${pgen_td_gonad_proteome},\
${pgen_td_heart_proteome},\
${pgen_td_juv_EPI115_proteome},\
${pgen_td_juv_EPI116_proteome},\
${pgen_td_juv_EPI123_proteome},\
${pgen_td_juv_EPI124_proteome},\
${pgen_td_larvae_EPI99_proteome},\
${virginica_proteome} \
%" \
"$maker_opts_file"
# Set RepeatModeler library to use
sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"
sed -i "/^est_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.est2genome.gff %" maker_opts.ctl
sed -i "/^protein_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl
sed -i "/^rm_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.repeats.gff %" maker_opts.ctl
sed -i "/^snaphmm=/ s% %Pgenerosa_v074_snap01.hmm %" maker_opts.ctl
fi
## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base Pgenerosa_v074_snap01
## Merge gffs
${gff3_merge} -d Pgenerosa_v074_snap01.maker.output/Pgenerosa_v074_snap01_master_datastore_index.log
## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v074_snap01.maker.output/Pgenerosa_v074_snap01_master_datastore_index.log > Pgenerosa_v074_snap01.maker.all.noseqs.gff
## Run SNAP training, round 2
cd "${wd}"
cd snap02
${maker2zff} ../snap01/Pgenerosa_v074_snap01.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} genome . > Pgenerosa_v074_snap02.hmm
## Initiate third and final Maker run.
if [ ! -e maker_opts.ctl ]; then
$maker -CTL
sed -i "/^genome=/ s% %$genome %" maker_opts.ctl
# Set transcriptomes to use
sed -i "/^est=/ s% %\
${ctendia_transcriptome},\
${gonad_transcriptome},\
${heart_transcriptome},\
${EPI99_transcriptome},\
${EPI115_transcriptome},\
${EPI116_transcriptome},\
${EPI123_transcriptome},\
${EPI124_transcriptome} %" \
"$maker_opts_file"
# Set proteomes to use
sed -i "/^protein=/ s% %\
${gigas_proteome},\
${panopea_td_proteome},\
${pgen_td_ctenidia_proteome},\
${pgen_td_gonad_proteome},\
${pgen_td_heart_proteome},\
${pgen_td_juv_EPI115_proteome},\
${pgen_td_juv_EPI116_proteome},\
${pgen_td_juv_EPI123_proteome},\
${pgen_td_juv_EPI124_proteome},\
${pgen_td_larvae_EPI99_proteome},\
${virginica_proteome} \
%" \
"$maker_opts_file"
# Set RepeatModeler library to use
sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"
sed -i "/^est_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.est2genome.gff %" maker_opts.ctl
sed -i "/^protein_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl
sed -i "/^rm_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.repeats.gff %" maker_opts.ctl
sed -i "/^snaphmm=/ s% %Pgenerosa_v074_snap02.hmm %" maker_opts.ctl
fi
## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base Pgenerosa_v074_snap02
## Merge gffs
${gff3_merge} \
-d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log
## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log > Pgenerosa_v074_snap02.maker.all.noseqs.gff
## Merge FastAs
${fasta_merge} \
-d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log
# Create copies of files for mapping
cp "${maker_prot_fasta}" "${maker_prot_fasta_renamed}"
cp "${maker_transcripts_fasta}" "${maker_transcripts_fasta_renamed}"
cp "${snap02_gff}" "${snap02_gff_renamed}"
# Map IDs
## Change gene names
${map_ids} \
--prefix PGEN_ \
--justify 8 \
"${snap02_gff}" \
> "${id_map}"
## Map GFF IDs
${map_gff_ids} \
"${id_map}" \
"${snap02_gff_renamed}"
## Map FastAs
### Proteins
${map_fasta_ids} \
"${id_map}" \
"${maker_prot_fasta_renamed}"
### Transcripts
${map_fasta_ids} \
"${id_map}" \
"${maker_transcripts_fasta_renamed}"
# Run InterProScan 5
## disable-precalc since this requires external database access (which Mox does not allow)
cd "${ips_dir}"
${interproscan} \
--input "${maker_prot_fasta_renamed}" \
--goterms \
--output-file-base ${ips_base} \
--disable-precalc
# Run BLASTp
cd "${blastp_dir}"
${blastp} \
-query "${maker_prot_fasta_renamed}" \
-db ${sp_db_blastp} \
-out "${maker_blastp}" \
-max_target_seqs 1 \
-evalue 1e-6 \
-outfmt 6 \
-num_threads 28
# Functional annotations
cd "${wd}"
## Add putative gene functions
### GFF
${functional_gff} \
${sp_db_blastp} \
"${maker_blastp}" \
"${snap02_gff_renamed}" \
> ${put_func_gff}
### Proteins
${functional_fasta} \
${sp_db_blastp} \
"${maker_blastp}" \
"${maker_prot_fasta_renamed}" \
> ${put_func_prot}
### Transcripts
${functional_fasta} \
${sp_db_blastp} \
"${maker_blastp}" \
"${maker_transcripts_fasta_renamed}" \
> ${put_func_trans}
## Add InterProScan domain info
### Add searchable tags
${ipr_update_gff} \
${put_func_gff} \
"${ips_dir}"/${ips_name} \
> ${put_domain_gff}
### Add viewable features for genome browsers (JBrowse, Gbrowse, Web Apollo)
${iprscan2gff3} \
"${ips_dir}"/${ips_name} \
"${snap02_gff_renamed}" \
> ${ips_domains}
RESULTS
Well, this ran relatively quickly: A little over seven days.
Output folder:
The important files:
- Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff (373MB)
- GFF file with all contigs annotated with putative functions and functional domains.
- INCLUDES SEQUENCE FASTAS AT END OF FILE!
- Generated with one round of MAKER gene prediction, followed by two rounds of SNAP ab-initio gene prediction.
- MD5: 8e8391912e481695addac8fed320a8a1
- Pgenerosa_v074_genome_snap02.all.maker.proteins.renamed.putative_function.fasta (432KB)
- Annotated proteins FastA file.
- Generated with one round of MAKER gene prediction, followed by two rounds of SNAP ab-initio gene prediction.
- Pgenerosa_v074_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta (1.2MB)
- Annotated transcripts FastA file.
- Generated with one round of MAKER gene prediction, followed by two rounds of SNAP ab-initio gene prediction.
I should’ve just split the GFF as part of the Mox SBATCH script, but I didn’t so I did it locally on my computer. Here are the commands for splitting the GFF. All the GFF files have been addeed to our Genomic Resources wiki (GitHub).
CDS GFF:
- Pgenerosa_v074.1.CDS.gff (605KB)
- MD5:
78165439a8fd3e399e90e7eb1a0ea221
awk 'BEGIN { print "##gff-version 3" ; } $3 == "CDS" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.1.CDS.gff
exon GFF:
- Pgenerosa_v074.1.exon.gff (655KB)
- MD5:
a9f88dfc6fa3a2a6b43535341cd421f2
awk 'BEGIN { print "##gff-version 3" ; } $3 == "exon" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.1.exon.gff
five prime UTR GFF:
- Pgenerosa_v074.1.five_prime_UTR.gff (54KB)
- MD5:
f651827d5e6c9a235b78670af08f7b2c
awk 'BEGIN { print "##gff-version 3" ; } $3 == "five_prime_UTR" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.1.five_prime_UTR.gff
gene GFF:
- Pgenerosa_v074.1.gene.gff (269KB)
- MD5:
5b3af590a9b34c0aa87abef2442d9bde
awk 'BEGIN { print "##gff-version 3" ; } $3 == "gene" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.1.gene.gff
mRNA GFF:
- Pgenerosa_v074.1.mRNA.gff (337KB)
- MD5:
fd4889cfc5b1184f2c39054a4ead3f62
awk 'BEGIN { print "##gff-version 3" ; } $3 == "mRNA" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.1.mRNA.gff
three prime UTR GFF:
- Pgenerosa_v074.1.three_prime_UTR.gff (37KB)
- MD5:
3e9547af413d5e4230c4d556cac13d68
awk 'BEGIN { print "##gff-version 3" ; } $3 == "three_prime_UTR" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.1.three_prime_UTR.gff
With all of that out of the way, a cursory glance at the results are, honestly, quite shocking. A quick grep -c ">"
on the FastA files reveals:
- 825 proteins/transcripts
Here’s table with feature counts from each of the GFF files listed above:
Feature Count | Feature File |
---|---|
4905 | Pgenerosa_v074.1.CDS.gff |
5037 | Pgenerosa_v074.1.exon.gff |
371 | Pgenerosa_v074.1.five_prime_UTR.gff |
826 | Pgenerosa_v074.1.gene.gff |
826 | Pgenerosa_v074.1.mRNA.gff |
250 | Pgenerosa_v074.1.three_prime_UTR.gff |
So, how does this compare to the previous annotation of Pgenerosa_v074 from 20190709?
Pgenerosa_v074.1 has ~50% fewer genes annotated than the Pgenerosa_v074 annotation.
Pgenerosa_v074.1 only exhibits annotations on PGA_scaffold_18. The Pgenerosa_v074 annotation had annotation present on PGA_scaffold_17 & PGA_scaffold_17 (no others).
The Stringtie information clearly had an influence on gene predictions and the resulting annotations. Admittedly, I was expecting more gene predictions, since the Stringtie info shows transcripts present on all the other scaffolds. Instead, we ended up with a more stringent set of predicted genes!
Will peruse all of this data a bit more and see what I can make of it…