Steven asked me to create a canonical genome annotation file (GitHub Issue). I wanted a file containing gene IDs, SwissProt (SP) IDs, gene names, gene descriptions, and gene ontology (GO) accessions. To build it, I used the NCBI BLAST and DIAMOND BLAST annotations generated by our GenSas P.generosa genome annotation. Per Steven’s suggestion, I used the best match (i.e. lowest e-value) for any given gene between the two files.
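As a rough illustration of the "lowest e-value wins" selection, here is a minimal Python sketch. It is not the code from the notebook. The function name `best_hits` and the three-column layout (gene ID, SPID, e-value) are assumptions made for the example; real BLAST tabular output has more columns, so the column indexes would need adjusting. The rows are made up.

```python
import csv
import io


def best_hits(*blast_tables):
    """Return {gene_id: (spid, evalue)}, keeping the lowest e-value per gene.

    Assumes simplified tab-delimited rows: gene ID, SPID, e-value.
    On an exact e-value tie, the first hit encountered is kept.
    """
    best = {}
    for table in blast_tables:
        for row in csv.reader(table, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            gene_id, spid, evalue = row[0], row[1], float(row[2])
            if gene_id not in best or evalue < best[gene_id][1]:
                best[gene_id] = (spid, evalue)
    return best


# Made-up example rows standing in for the two annotation files.
ncbi_blast = io.StringIO("gene1\tP12345\t1e-50\ngene2\tQ99999\t1e-5\n")
diamond_blast = io.StringIO("gene1\tO00001\t1e-80\ngene3\tP54321\t1e-20\n")

hits = best_hits(ncbi_blast, diamond_blast)
for gene_id in sorted(hits):
    print(gene_id, *hits[gene_id], sep="\t")
```

A gene that appears in only one of the two files simply keeps its single hit.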
After doing that, I used the SPIDs to perform a batch retrieval from UniProt. The batch retrieval returns a lengthy annotation file, which needed to be parsed to pull out just the info we want for each SPID. The desired info was then joined with the list of Panopea generosa (Pacific geoduck) genes and their corresponding SPIDs from the NCBI BLAST and DIAMOND BLAST annotations generated by our GenSas P.generosa genome annotation.
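The join step can be sketched roughly like this in Python. This is an illustration, not the notebook's actual code. The names `join_annotations`, `gene_to_spid`, and `uniprot_info`, along with the field names and example values, are hypothetical. Leaving the fields empty when an SPID has no parsed UniProt record is a choice made for the sketch.

```python
def join_annotations(gene_to_spid, uniprot_info):
    """Join each gene's SPID with the fields parsed from the UniProt retrieval.

    gene_to_spid: {gene_id: spid} from the best-hit selection.
    uniprot_info: {spid: {"gene": ..., "description": ..., "go_ids": ...}}.
    Returns tab-delimited lines; fields are empty when an SPID has no record.
    """
    rows = []
    for gene_id in sorted(gene_to_spid):
        spid = gene_to_spid[gene_id]
        info = uniprot_info.get(spid, {})
        rows.append("\t".join([
            gene_id,
            spid,
            info.get("gene", ""),
            info.get("description", ""),
            info.get("go_ids", ""),
        ]))
    return rows


# Made-up example data.
gene_to_spid = {"gene1": "P12345", "gene2": "Q99999"}
uniprot_info = {
    "P12345": {
        "gene": "geneA",
        "description": "Example protein A",
        "go_ids": "GO:0000001; GO:0000002",
    },
}

table = join_annotations(gene_to_spid, uniprot_info)
print("\n".join(table))
```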
This data wrangling also identified 8 SPIDs that were either obsolete/deleted from UniProt or that redirect to a different SPID. The obsolete/deleted SPIDs were not dealt with further. For the SPIDs that redirect, I ran the notebook once to identify them, modified it to replace them in the SPID list with their new SPIDs, and then ran the notebook a second time to incorporate the updated SPIDs.
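The two-pass approach might look something like the following Python sketch. It assumes that SPIDs missing from the batch retrieval are the ones that need checking, and that the redirect mapping is filled in by hand after looking them up on UniProt. All names and IDs here are hypothetical, not taken from the notebook.

```python
def find_missing(requested, returned):
    """SPIDs that were requested but have no record in the batch retrieval."""
    return sorted(set(requested) - set(returned))


def apply_redirects(spids, redirects):
    """Replace redirected SPIDs with their new SPIDs; leave all others as-is."""
    return [redirects.get(spid, spid) for spid in spids]


# First pass: see which SPIDs came back empty (made-up IDs).
requested = ["P12345", "Q99999", "O00001"]
returned = ["P12345"]
missing = find_missing(requested, returned)

# Hypothetical mapping, filled in manually: Q99999 redirects to Q88888,
# while O00001 is obsolete/deleted and gets no entry.
redirects = {"Q99999": "Q88888"}

# Second pass: rerun the retrieval with the updated SPID list.
updated = apply_redirects(requested, redirects)
print(missing)
print(updated)
```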
The output file is tab-delimited, with the following columns:
gene_ID
: Gene ID from our Panopea generosa (Pacific geoduck) genome.
SPIDs
: Comma-delimited list of SPIDs from UniProt. One SPID in this list is the match corresponding to our original BLAST annotations.
UniProt_gene_ID
: Gene accession from UniProt.
gene
: Abbreviated gene name from UniProt.
gene_description
: Human-readable gene description from UniProt.
alternate_gene_description
: Human-readable alternate gene description from UniProt.
GO_IDs
: GO IDs from UniProt.
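For anyone wanting to use the file downstream, here is a minimal Python sketch for reading it. The column order follows the list above. Whether the file has a header line is an assumption left to the `has_header` flag, and the example row is made up. The `GO_IDs` field is left as a raw string because its delimiter is not specified here.

```python
import csv
import io

COLUMNS = [
    "gene_ID",
    "SPIDs",
    "UniProt_gene_ID",
    "gene",
    "gene_description",
    "alternate_gene_description",
    "GO_IDs",
]


def read_annotations(handle, has_header=False):
    """Yield one dict per gene, with the comma-delimited SPIDs split into a list."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header line if the file has one
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, fields))
        record["SPIDs"] = [s.strip() for s in record["SPIDs"].split(",") if s.strip()]
        yield record


# Made-up example row in the seven-column layout.
example = io.StringIO(
    "gene1\tP12345,Q67890\tGENEA_EXAMPLE\tgeneA\t"
    "Example protein A\tAlternate name A\tGO:0000001\n"
)
records = list(read_annotations(example))
print(records[0]["gene_ID"], records[0]["SPIDs"])
```

To read the real file, pass an open file handle (e.g. `open(path, newline="")`) in place of the `io.StringIO` example.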
All analysis was run in the following Jupyter Notebook:
20220419-pgen-gene_annotation_mapping.ipynb (NBViewer)
RESULTS
Output folder:
20220419-pgen-gene_annotation_mapping/
Annotation file (3.4MB; text)
Intermediate files, as well as the batch retrieval data from UniProt, are also available. See the directory README file for more info.