Data Wrangling - S.salar Gene Annotations from NCBI RefSeq GCF_000233375.1_ICSASG_v2_genomic.gff for Shelly

Shelly posted a GitHub Issue asking if I could create a file of S.salar genes with their UniProt annotations (e.g. gene name, UniProt accession, GO terms).

Here’s the approach I took:

  1. Use GFFutils to pull out only gene features, along with:

  • chromosome name

  • start position

  • end position

  • Dbxref attribute (which, in this case, is the NCBI gene ID)

  2. Submit the NCBI gene IDs to UniProt to map them to UniProt accessions. This was accomplished using the Perl batch submission script provided by UniProt.

  3. Parse out the fields we were interested in (gene name, UniProt accession, GO terms).

  4. Join it all together!
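
For readers who want a feel for the extraction and join steps without opening the notebook, here is a minimal Python sketch. It is not the code from the notebook: it uses only the standard library rather than GFFutils, and it assumes the gene ID appears in the ninth GFF column as `Dbxref=GeneID:<number>`. The function names, the toy GFF lines, and the `ACC0001` accession are all made up for illustration.

```python
import io
import re

# Matches the NCBI gene ID inside a Dbxref attribute, e.g. "Dbxref=GeneID:111"
GENE_ID_RE = re.compile(r"GeneID:(\d+)")


def extract_genes(gff_lines):
    """Yield (chrom, start, end, gene_id) for each 'gene' feature in a GFF3 stream."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep only well-formed rows whose feature type is "gene"
        match = GENE_ID_RE.search(cols[8])
        if match:
            yield cols[0], int(cols[3]), int(cols[4]), match.group(1)


def join_annotations(genes, id_to_accession):
    """Attach a UniProt accession to each gene; unmapped genes get an empty string."""
    return [
        (chrom, start, end, gene_id, id_to_accession.get(gene_id, ""))
        for chrom, start, end, gene_id in genes
    ]


# Toy input (hypothetical values, not taken from the real GFF)
example_gff = (
    "##gff-version 3\n"
    "chrA\tGnomon\tgene\t100\t900\t.\t+\t.\tID=gene0;Dbxref=GeneID:111;Name=abc\n"
    "chrA\tGnomon\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0;Dbxref=GeneID:111\n"
    "chrB\tGnomon\tgene\t50\t400\t.\t-\t.\tID=gene1;Dbxref=GeneID:222;Name=def\n"
)
example_mapping = {"111": "ACC0001"}  # hypothetical gene ID -> UniProt accession

genes = list(extract_genes(io.StringIO(example_gff)))
joined = join_annotations(genes, example_mapping)
for row in joined:
    print("\t".join(str(field) for field in row))
```

Keeping the gene ID as a string on both sides of the join avoids mismatches between integer and text keys when the mapping is read from a file.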

All of this is documented in the Jupyter Notebook below:

Jupyter Notebook (GitHub):

Jupyter Notebook (NBviewer):



RESULTS

Output folder:

Parsing the UniProt Perl batch retrieval results file (~7.2M lines!) took ~6.5 hrs!
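
As a rough illustration of what the parsing step involves, the sketch below reads a results file in a single pass and collects accessions, gene name, and GO terms per record. It assumes the results are in UniProt's flat-text format (two-letter line codes such as `AC`, `GN`, `DR`, with `//` ending each record), which may not match the file actually retrieved here. The function name and the example record are hypothetical, and this is not the notebook's code.

```python
import io


def new_record():
    """Return an empty container for one UniProt entry."""
    return {"accessions": [], "gene_name": "", "go_terms": []}


def parse_uniprot_flat(lines):
    """Yield one dict per flat-text record, streaming the input line by line."""
    record = new_record()
    for line in lines:
        # Flat-text lines carry a two-letter code, three spaces, then the value
        code, value = line[:2], line[5:].strip()
        if code == "AC":
            record["accessions"] += [a.strip() for a in value.split(";") if a.strip()]
        elif code == "GN" and not record["gene_name"] and "Name=" in value:
            name = value.split("Name=", 1)[1].split(";", 1)[0]
            record["gene_name"] = name.split(" {", 1)[0]  # drop any evidence tag
        elif code == "DR" and value.startswith("GO;"):
            record["go_terms"].append(value.split(";")[1].strip())
        elif code == "//":
            yield record  # end of entry
            record = new_record()


# Made-up record in the assumed format (not a real UniProt entry)
example = (
    "ID   EXAMPLE_SALSA           Unreviewed;       100 AA.\n"
    "AC   X0X0X0; X0X0X1;\n"
    "GN   Name=abc {ECO:0000313|EMBL:XYZ}; Synonyms=abcd;\n"
    "DR   GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell.\n"
    "DR   GO; GO:0003677; F:DNA binding; IEA:InterPro.\n"
    "//\n"
)

records = list(parse_uniprot_flat(io.StringIO(example)))
for rec in records:
    print(rec["accessions"][0], rec["gene_name"], ";".join(rec["go_terms"]), sep="\t")
```

Because the function is a generator, only one record is held in memory at a time, which matters for a file with millions of lines.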