Steven tasked me with updating our P.generosa genome annotation file (GitHub Issue) a while back and I finally managed to get it all figured out. Although I wanted to perform most of this using the GSEABase package (PDF), as this package is geared towards storage/retrieval of gene set data, I eventually decided to abandon this approach due to the time it was taking and my lack of familiarity with manipulating objects in R. Despite that, I still used GSEABase for the one task it made very simple: identifying GOslims (IDs and Terms).

I also struggled with the UniProt API. UniProt has updated the API since I created the initial annotation file on 20220419, and the change rendered my previous API usage inoperable! I tried for a bit to get the new API figured out, but eventually said “F it!” and used the data I had previously downloaded from UniProt (which, when I started, I didn’t actually realize I had kept).

Then, after all this, I decided to integrate the entire thing into an R Project. This kept things a bit more cohesive, as I didn’t need to bounce between a Jupyter Notebook and R.

The end result is a tab-delimited file with added columns grouping GO IDs by Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). I also added columns with the BP GOslims and corresponding BP GOslim terms for each gene.
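
For anyone curious what the BP/CC/MF grouping step amounts to, here is a minimal Python sketch. It is not the code from the R Project. The GO_ASPECT lookup is a tiny hand-written stand-in for a real ontology lookup, the group_go_ids function name is made up for illustration, and the semicolon separator between GO IDs is an assumption.

```python
from collections import defaultdict

# Hypothetical lookup of GO ID -> aspect. A real run would build this
# from an ontology file or annotation database, not a hard-coded dict.
GO_ASPECT = {
    "GO:0006412": "BP",  # translation
    "GO:0005737": "CC",  # cytoplasm
    "GO:0005840": "CC",  # ribosome
    "GO:0003735": "MF",  # structural constituent of ribosome
}


def group_go_ids(all_go_ids, sep=";"):
    """Split a delimited string of GO IDs and regroup them by aspect.

    The separator is an assumption; IDs missing from the lookup are
    collected under "unknown" instead of being silently dropped.
    """
    grouped = defaultdict(list)
    for go_id in all_go_ids.split(sep):
        go_id = go_id.strip()
        if not go_id:
            continue
        grouped[GO_ASPECT.get(go_id, "unknown")].append(go_id)
    # Re-join each group so it can be written back out as a single column.
    return {aspect: sep.join(ids) for aspect, ids in grouped.items()}


example = group_go_ids("GO:0006412; GO:0005737;GO:0003735;GO:0005840")
```

Each key of the returned dict would map onto one of the new columns (BP_GO_ids, CC_GO_ids, MF_GO_ids).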

See Results section below for file and layout.

RESULTS

Output folder:

20230328-pgen-gene_annotation-update (GitHub; R Project)

Annotation file
- 20230329-pgen-annotations-SwissProt-GO-BP_GOslim.tab (7.8MB; tab-delimited)
Table layout:

gene	accessions	gene_id	gene_name	gene_description	alt_gene_description	all_GO_ids	BP_GO_ids	CC_GO_ids	MF_GO_ids	GOslim	Term
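
As a quick sanity check on that layout, here is a rough Python sketch for loading the file and confirming that all twelve columns are present. The column names come from the header above. The read_annotations helper and the sample row are made-up placeholders for illustration, not real records from the annotation file.

```python
import csv
import io

# Column names taken from the table layout above.
COLUMNS = [
    "gene", "accessions", "gene_id", "gene_name", "gene_description",
    "alt_gene_description", "all_GO_ids", "BP_GO_ids", "CC_GO_ids",
    "MF_GO_ids", "GOslim", "Term",
]


def read_annotations(handle):
    """Read a tab-delimited annotation table into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return list(reader)


# Placeholder data standing in for the real file (values are invented).
sample = "\t".join(COLUMNS) + "\n" + "\t".join(
    ["example_gene"] + ["NA"] * (len(COLUMNS) - 1)
) + "\n"

rows = read_annotations(io.StringIO(sample))
```

To use this on the real file, swap the StringIO handle for open("20230329-pgen-annotations-SwissProt-GO-BP_GOslim.tab", newline="").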
