In our various attempts to get the Panopea generosa genome annotated in a manner we’re comfortable with (the previous annotation attempts were lacking annotations on almost all of the largest scaffolds, which didn’t seem right), Steven stumbled across GenSAS, a web/GUI-based genome annotation program, so we gave it a shot.
This version of the genome annotation will be referred to as:
- Panopea-generosa-vv0.74.a3
I uploaded the following to the GenSAS website to potentially use as “evidence files”:
Transcriptome FastA files (links to notebook entries):
TransDecoder protein FastA files (links to notebook entries)
Repeats Files
RESULTS
This took way longer than I was expecting! It took nearly an entire month (the majority of that time was spent running Augustus ab initio gene prediction, which took ~3 weeks):
Output folder:
Merged GFF (SwissProt IDs in Column 9 - from BLASTp and DIAMOND):
Feature counts:
awk 'NR>3 { print $3 }' Panopea-generosa-vv0.74.a3-merged-2019-09-03-6-14-33.gff3 | sort | uniq -c
192022 CDS
192022 exon
45748 gene
45748 mRNA
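The `NR>3` in the command above assumes the merged GFF has exactly three header lines. A minimal sketch of a variant that skips any line starting with `#` instead, so the count doesn't depend on the header length (the function name `count_gff_features` is just something I made up for illustration):

```shell
# Count GFF3 feature types (column 3), skipping every comment/header line
# rather than assuming a fixed number of header lines.
# NOTE: count_gff_features is a hypothetical helper name, not part of GenSAS.
count_gff_features() {
  awk -F'\t' '!/^#/ && NF >= 9 { print $3 }' "$1" | sort | uniq -c
}

# Example call (filename taken from the command above):
# count_gff_features Panopea-generosa-vv0.74.a3-merged-2019-09-03-6-14-33.gff3
```

If the file really does have only three header lines, this should give the same counts as the original command.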
BUSCO assessment:
- 68.4% complete BUSCOs present in predicted genes
Individual feature GFFs were made with the following shell commands:
# Split the merged GFF3 into one file per feature type
features_array=(CDS exon gene mRNA)
input="Panopea-generosa-vv0.74.a3-merged-2019-09-24-9-20-04.gff3"
for feature in "${features_array[@]}"
do
  output="Panopea-generosa-vv0.74.a3.${feature}.gff3"
  # Write the three GFF3 header lines, overwriting any previous output
  head -n 3 "${input}" \
  > "${output}"
  # Append only the records whose type (column 3) matches the feature
  awk -v feature="$feature" '$3 == feature {print}' "${input}" \
  >> "${output}"
done
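A quick sanity check on the per-feature files can be sketched like this. It only confirms that every non-comment line in a file has the expected feature type in column 3; the helper name `check_feature_gff` is hypothetical, and it assumes the output filenames produced by the loop above:

```shell
# Succeed only if every non-comment line in a per-feature GFF3 has the
# expected feature type in column 3.
# NOTE: check_feature_gff is a hypothetical helper name for illustration.
check_feature_gff() {
  awk -F'\t' -v feature="$1" '
    /^#/ { next }                 # skip header/comment lines
    $3 != feature { bad++ }       # count records of any other type
    END { exit (bad > 0) }        # non-zero exit if any mismatch was seen
  ' "$2"
}

# Example, assuming the filenames written by the loop above:
# for feature in CDS exon gene mRNA; do
#   check_feature_gff "${feature}" "Panopea-generosa-vv0.74.a3.${feature}.gff3" \
#     || echo "Unexpected feature types in ${feature} file"
# done
```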
SwissProt functional annotations (tab-delimited text):
- BLASTp
- DIAMOND
Pfam annotations (tab-delimited text):
Grabbed the top 10 most abundant Pfam Accessions to see how things looked:
awk '{print $2}' Panopea-generosa-vv0.74.a3.5d65aaa449919-pfam.tab | sort | uniq -c | sort -nr | head
Feature Count | Pfam Accession | Pfam |
---|---|---|
1062 | PF00078.22 | Reverse transcriptase (RNA-dependent DNA polymerase) |
370 | PF00665.21 | Integrase |
353 | PF13358.1 | DDE superfamily endonuclease |
264 | PF03372.18 | Endonuclease/Exonuclease/phosphatase family |
213 | PF14529.1 | Endonuclease-reverse transcriptase |
208 | PF00643.19 | B-box zinc finger |
199 | PF07690.11 | Major Facilitator Superfamily |
158 | PF00001.16 | 7 transmembrane receptor (rhodopsin family) |
148 | PF12796.2 | Ankyrin repeat |
136 | PF00096.21 | Zinc finger |
A couple of interesting things that I notice from this table:
Four of the top five most abundant Pfams are involved in DNA transposition.
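A rhodopsin protein family (PF00001) appears in the Top 10 most abundant Pfams?! Rhodopsin itself is involved in light detection, although this Pfam covers rhodopsin-like G protein-coupled receptors more broadly, so not every hit is necessarily a light receptor…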
InterProScan annotations (tab-delimited text):
Panopea-generosa-vv0.74.a3.5d65aaa22961e-interproscan.tab
- Contains gene ontology (GO) terms
Project Summary file (TEXT):
=================================
Project Summary
---------------------------------
# Project Information
Project Name : Pgenerosa_v074
Create Date : 2019-07-09 14:07:39
# Project Properties
Genus : Panopea
Species : generosa
Project Type : invertebrate
Prefix : PGEN_
Common Name : Pacific geoduck
Genetic Code : Standard Code
# Input FASTA
Filename : Pgenerosa_v074.fa
Filesize : 913.68 MB
Number of Sequence : 18
=================================
Job Information
---------------------------------
# Official Gene Set
>PASA Refinement
- version : 2.3.3
- Transcripts FASTA file : Trinity.fasta
# The source Job of the refinement job
>Augustus-01
- version : 3.3.1
- Species : fly
- Report genes on : both
- Allowed gene structure : partial
- cDNA (transcripts) sequences : Trinity.fasta
- Protein sequences : 20180827_trinity_geoduck.fasta.transdecoder.fa
# The consensus mask Job
>Masked Repeat Consensus
# The source jobs for consensus mask job
>RepeatMasker
>RepeatModeler
# Family copy number summary
Family Copy Numbers
DNA 675
DNA/Academ 1327
DNA/Crypton 344
DNA/Ginger 130
DNA/Kolobok-T2 141
DNA/Maverick 942
DNA/MuLE-MuDR 201
DNA/MuLE-NOF? 142
DNA/P 167
DNA/PIF-Harbinger 227
DNA/RC 587
DNA/Sola 508
DNA/TcMar-Fot1 117
DNA/TcMar-Mariner 6734
DNA/TcMar-Tc1 3718
DNA/hAT-Tip100 516
DNA/hAT-hAT5 1037
Type:DNA 17513
LINE 883
LINE/CR1 5204
LINE/CR1-Zenon 14653
LINE/I 980
LINE/I-Nimb 1119
LINE/L1 4031
LINE/L1-Tx1 6620
LINE/L2 8879
LINE/L2-Hydra 113
LINE/Penelope 1026
LINE/RTE-X 21214
Type:LINE 64722
SINE/tRNA-Core-L2 41152
Type:SINE 41152
LTR/Caulimovirus 140
LTR/DIRS 448
LTR/Gypsy 1031
LTR/Ngaro 343
LTR/Pao 82
Type:LTR 2044
Type:EVERYTHING_TE 125431
Type:Simple_repeat 19235
Type:Unknown 1465471
# The functional Jobs on the OGS
>InterProScan
- version : 5.29-68.0
>Pfam
- version : 1.6
- E-value Sequence : 1
- E-value Domain : 10
>SignalP
- version : 4.1
- Organism group : euk
- Method : best
- D-cutoff for SignalP-noTM networks : 0.45
- D-cutoff for SignalP-TM networks : 0.50
- Minimal predicted signal peptide length : 10
- Truncate to sequence length : 70
>BLAST protein vs protein (blastp)_SP01
- version : 2.7.1
- Protein Data Set : SwissProt
- Maximum HSP Distinace : 30000
- Output type : tab
- Matrix : BLOSUM62
- Expect : 1e-8
- Word Size : 3
- Gap Open : 11
- Gap Extend : 1
>DIAMOND Functional_SP01
- version : 0.9.22
- Protein Data Set : SwissProt
>BLAST protein vs protein (blastp)
- version : 2.7.1
- Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa
- Maximum HSP Distinace : 30000
- Output type : tab
- Matrix : BLOSUM62
- Expect : 1e-8
- Word Size : 3
- Gap Open : 11
- Gap Extend : 1
>DIAMOND Functional
- version : 0.9.22
- Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa
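The repeat family counts in the summary above can be cross-checked against the `Type:` total rows. Here's a rough sketch, assuming the family/copy number lines have been copied into their own whitespace-delimited, two-column text file (`repeat_families.txt` is a hypothetical filename, and `sum_repeat_classes` is a made-up helper name):

```shell
# Sum copy numbers per repeat class (the text before "/") from two-column
# "family count" lines. "Type:" total rows and the header line are skipped.
# NOTE: sum_repeat_classes and repeat_families.txt are hypothetical names.
sum_repeat_classes() {
  awk '
    $1 ~ /^Type:/ || $2 !~ /^[0-9]+$/ { next }   # skip totals and header
    { split($1, parts, "/"); total[parts[1]] += $2 }
    END { for (cls in total) print cls, total[cls] }
  ' "$1" | sort
}

# Example call:
# sum_repeat_classes repeat_families.txt
```

Summing the values listed above by hand gives the same numbers as the reported Type:DNA, Type:LINE, Type:SINE and Type:LTR rows, so the summary looks internally consistent.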
Overall, this annotation is much more believable than the previous MAKER annotations because GenSAS actually predicts genes on all the scaffolds (unlike MAKER)! As such, this will likely become the canonical P. generosa genome annotation going forward. That said, we should still manually curate it when we have the time, to see how well the predictions line up with the evidence.
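For anyone wanting to check the gene distribution across scaffolds themselves, here's a minimal sketch that tallies `gene` features per sequence ID (column 1) in a GFF3 file. The function name `genes_per_scaffold` is hypothetical, and this is just one way to do it:

```shell
# Count "gene" features per sequence ID (column 1) in a GFF3 file and
# print tab-delimited "sequence<TAB>count" lines.
# NOTE: genes_per_scaffold is a hypothetical helper name for illustration.
genes_per_scaffold() {
  awk -F'\t' '
    !/^#/ && $3 == "gene" { n[$1]++ }
    END { for (s in n) print s "\t" n[s] }
  ' "$1" | sort
}

# Example call (gene-only GFF name assumed from the split loop above):
# genes_per_scaffold Panopea-generosa-vv0.74.a3.gene.gff3
```

Sequences with zero predicted genes won't show up in the output at all, so the number of lines printed would need to be compared against the 18 sequences listed in the project summary.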