In our various attempts to get the Panopea generosa genome annotated in a way that we’re comfortable with (the previous annotation attempts were lacking any annotations on almost all of the largest scaffolds, which didn’t seem right), Steven stumbled across GenSAS, a web/GUI-based genome annotation program, so we gave it a shot.
This version of the genome annotation will be referred to as:
- Panopea-generosa-vv0.71.a2
I uploaded the following to the GenSAS website to potentially use as “evidence files”:
Transcriptome FastA files (links to notebook entries):
TransDecoder protein FastA files (links to notebook entries)
Repeats Files
RESULTS
This took way longer than I was expecting! The whole run took nearly an entire month (the majority of that time was spent running Augustus ab initio gene prediction, which took ~3 weeks):
Output folder:
Feature counts:
awk 'NR>3 { print $3 }' Panopea-generosa-v1.0.a2-merged-2019-08-29-15-28-54.gff3 | sort | uniq -c
264153 CDS
264153 exon
56167 gene
56167 mRNA
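The `NR>3` in the command above assumes the GFF3 has exactly three header lines. A minimal sketch of a variant that skips any comment/directive line instead is shown here; the function name `count_features` is hypothetical, and the output is printed as `count<TAB>feature` rather than in `uniq -c` layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count GFF3 feature types (column 3).
# Skips every line starting with "#" instead of assuming a fixed
# number of header lines, and ignores lines with fewer than 9 columns.
count_features() {
  local gff="$1"
  awk -F'\t' '
    !/^#/ && NF >= 9 { n[$3]++ }
    END { for (f in n) printf "%d\t%s\n", n[f], f }
  ' "$gff" | LC_ALL=C sort -k2,2
}

# Usage (file name taken from this post):
#   count_features Panopea-generosa-v1.0.a2-merged-2019-08-29-15-28-54.gff3
```

If the file really has only three header lines, the counts should match the ones listed above.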
BUSCO assessment:
- 80.7% complete BUSCOs present in predicted genes
Individual feature GFFs were made with the following shell commands:
features_array=(CDS exon gene mRNA)
input="Panopea-generosa-v1.0.a2-merged-2019-08-29-15-28-54.gff3"
for feature in "${features_array[@]}"
do
  # Per-feature output file
  output="Panopea-generosa-v1.0.a2.${feature}.gff3"
  # Copy the three GFF3 header lines; ">" overwrites so re-runs don't duplicate content
  head -n 3 "${input}" \
  > "${output}"
  # Append only the rows whose feature type (column 3) matches
  awk -v feature="$feature" '$3 == feature {print}' "${input}" \
  >> "${output}"
done
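Once the per-feature files exist, a quick sanity check can confirm that each one holds only its own feature type. This is a rough sketch, not part of the original workflow; `verify_feature_gff` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) only if every non-comment row
# in the given GFF3 has the expected feature type in column 3.
verify_feature_gff() {
  local gff="$1" feature="$2"
  awk -F'\t' -v feature="$feature" '
    /^#/ { next }              # skip header/directive lines
    $3 != feature { bad++ }    # count rows of any other type
    END { exit (bad > 0) }     # non-zero exit if any were found
  ' "$gff"
}

# Usage (file names follow the pattern used in this post):
#   for feature in CDS exon gene mRNA; do
#     verify_feature_gff "Panopea-generosa-v1.0.a2.${feature}.gff3" "$feature" \
#       || echo "Unexpected rows in ${feature} file"
#   done
```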
SwissProt functional annotations (tab-delimited text):
- BLASTp
- DIAMOND
Pfam annotations (tab-delimited text):
Grabbed the top 10 most abundant Pfam Accessions to see how things looked:
Feature Count | Pfam Accession | Pfam |
---|---|---|
364 | PF00643.19 | B-box zinc finger |
293 | PF07690.11 | Major facilitator family |
228 | PF00001.16 | Rhodopsin-like receptors |
220 | PF12796.2 | Ankyrin repeat |
209 | PF00651.26 | BTB/POZ domain |
206 | PF00069.20 | Protein kinase domain |
180 | PF00067.17 | Cytochrome P450 |
175 | PF02931.18 | Ligand-gated ion channel |
174 | PF00400.27 | WD40 repeat |
174 | PF00059.16 | C-type lectin |
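A rhodopsin-like protein family appears in the Top 10 most abundant Pfams?! Opsins in this family are involved in light detection, although PF00001 covers the broad rhodopsin-like (class A) G protein-coupled receptor family, so these hits are not necessarily light sensors…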
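A tally like the one in the table can be produced with a short awk/sort pipeline. The sketch below is an assumption about how one might do it, not a record of the command actually used: the function name `top_values` is hypothetical, and the column holding the Pfam accession depends on the layout of the GenSAS Pfam file, so it is passed in as an argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the N most frequent values found in one
# column of a tab-delimited file, as "count<TAB>value".
#   $1 = file, $2 = 1-based column number, $3 = how many to show (default 10)
top_values() {
  local file="$1" col="$2" n="${3:-10}"
  awk -F'\t' -v col="$col" '
    !/^#/ && $col != "" { c[$col]++ }
    END { for (k in c) printf "%d\t%s\n", c[k], k }
  ' "$file" \
    | LC_ALL=C sort -t $'\t' -k1,1nr -k2,2 \
    | head -n "$n"
}

# Usage (file name and column number are placeholders):
#   top_values pfam_annotations.tab 2 10
```

Note that this counts rows, so a gene with several hits to the same domain is counted more than once.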
InterProScan annotations (tab-delimited text):
Panopea-generosa-v1.0.a2.5d65aa8055fad-interproscan.tab
- Contains gene ontology (GO) terms
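To pull the GO terms out of that file, something like the following sketch may help. It assumes the file follows the standard InterProScan TSV layout, where column 1 is the protein ID and column 14 holds pipe-separated GO terms; the GenSAS export has not been checked against that layout here, and `extract_go_terms` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list unique "protein<TAB>GO term" pairs from an
# InterProScan-style TSV. Assumes GO terms are pipe-separated in column 14.
extract_go_terms() {
  local tsv="$1"
  awk -F'\t' '
    NF >= 14 && $14 ~ /GO:/ {
      n = split($14, terms, "|")
      for (i = 1; i <= n; i++) {
        # Keep only the GO:NNNNNNN identifier from each entry
        if (match(terms[i], /GO:[0-9]+/)) {
          print $1 "\t" substr(terms[i], RSTART, RLENGTH)
        }
      }
    }
  ' "$tsv" | LC_ALL=C sort -u
}

# Usage (file name taken from this post):
#   extract_go_terms Panopea-generosa-v1.0.a2.5d65aa8055fad-interproscan.tab
```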
Project Summary file (TEXT):
=================================
Project Summary
---------------------------------
# Project Information
Project Name : Pgenerosa_v071
Create Date : 2019-07-10 12:30:24
# Project Properties
Genus : Panopea
Species : generosa
Project Type : invertebrate
Prefix : PGEN_
Common Name : Pacific geoduck
Genetic Code : Standard Code
# Input FASTA
Filename : Pgenerosa_v071.fasta
Filesize : 1.32 GB
Number of Sequence : 14014
=================================
Job Information
---------------------------------
# Official Gene Set
>PASA Refinement
- version : 2.3.3
- Transcripts FASTA file : Trinity.fasta
# The source Job of the refinement job
>Augustus-01
- version : 3.3.1
- Species : fly
- Report genes on : both
- Allowed gene structure : partial
- cDNA (transcripts) sequences : Trinity.fasta
- Protein sequences : 20180827_trinity_geoduck.fasta.transdecoder.fa
# The consensus mask Job
>Masked Repeat Consensus
# The source jobs for consensus mask job
>RepeatMasker
>RepeatModeler
# Family copy number summary
Family Copy Numbers
DNA 85
DNA/Academ 264
DNA/Crypton 200
DNA/Kolobok-T2 188
DNA/MuLE-MuDR 94
DNA/PIF-Harbinger 482
DNA/Sola 122
DNA/TcMar-Mariner 599
DNA/TcMar-Tc1 1266
DNA/hAT-Tip100 808
DNA/hAT-Tip100? 255
Type:DNA 4363
LINE 2153
LINE/CR1 4122
LINE/CR1-Zenon 1717
LINE/I-Nimb 72
LINE/Jockey 510
LINE/L1-Tx1 967
LINE/L2 1896
LINE/Penelope 735
LINE/Proto2 155
LINE/R2-Hero 211
LINE/RTE-X 2275
LINE/Tad1 97
Type:LINE 14910
Type:SINE 0
LTR/DIRS 192
LTR/Gypsy 1420
LTR/Ngaro 533
LTR/Pao 146
Type:LTR 2291
Type:EVERYTHING_TE 21564
Type:Simple_repeat 107
Type:Unknown 115322
# The functional Jobs on the OGS
>BLAST protein vs protein (blastp)_SP01
- version : 2.7.1
- Protein Data Set : SwissProt
- Maximum HSP Distinace : 30000
- Output type : tab
- Matrix : BLOSUM62
- Expect : 1e-8
- Word Size : 3
- Gap Open : 11
- Gap Extend : 1
>DIAMOND Functional SP01
- version : 0.9.22
- Protein Data Set : SwissProt
>BLAST protein vs protein (blastp)
- version : 2.7.1
- Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa
- Maximum HSP Distinace : 30000
- Output type : tab
- Matrix : BLOSUM62
- Expect : 1e-8
- Word Size : 3
- Gap Open : 11
- Gap Extend : 1
>DIAMOND Functional
- version : 0.9.22
- Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa
>InterProScan
- version : 5.29-68.0
>Pfam
- version : 1.6
- E-value Sequence : 1
- E-value Domain : 10
>SignalP
- version : 4.1
- Organism group : euk
- Method : best
- D-cutoff for SignalP-noTM networks : 0.45
- D-cutoff for SignalP-TM networks : 0.50
- Minimal predicted signal peptide length : 10
- Truncate to sequence length : 70
Overall, this annotation is much more believable than the previous MAKER annotations, because GenSAS actually predicts genes on all the scaffolds (unlike MAKER)! It will be interesting to compare it to the GenSAS Panopea-generosa-vv0.74.a3 annotation.
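One way to double-check the scaffold coverage is to list any scaffolds in the genome FASTA that have no `gene` feature in the GFF3. The sketch below assumes the scaffold names in column 1 of the GFF3 match the FASTA sequence IDs (the header text up to the first whitespace); `scaffolds_without_genes` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print FASTA sequence IDs that have no "gene"
# feature in the GFF3. Assumes GFF3 column 1 matches the FASTA IDs.
#   $1 = GFF3 file, $2 = genome FASTA
scaffolds_without_genes() {
  local gff="$1" fasta="$2"
  awk -F'\t' '
    # First file (GFF3): remember scaffolds carrying at least one gene
    NR == FNR { if (!/^#/ && $3 == "gene") has[$1] = 1; next }
    # Second file (FASTA): report header IDs never seen in the GFF3
    /^>/ {
      id = substr($1, 2)
      sub(/[ \t].*/, "", id)   # drop any description after the ID
      if (!(id in has)) print id
    }
  ' "$gff" "$fasta"
}

# Usage (file names taken from this post):
#   scaffolds_without_genes \
#     Panopea-generosa-v1.0.a2-merged-2019-08-29-15-28-54.gff3 \
#     Pgenerosa_v071.fasta
```

An empty result would mean every scaffold has at least one predicted gene.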