Shelly posted a GitHub Issue asking if I could create a file of S. salar genes with their UniProt annotations (e.g. gene name, UniProt accession, GO terms).
Here’s the approach I took:
- Use GFFutils to pull out only gene features, along with:
  - chromosome name
  - start position
  - end position
  - Dbxref attribute (which, in this case, is the NCBI gene ID)
- Submit the NCBI gene IDs to UniProt to map them to UniProt accessions. This was accomplished using the Perl batch submission script provided by UniProt.
- Parse out the fields we were interested in (UniProt accession, gene name, gene description, GO IDs).
- Join it all together!
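The first and last steps above can be sketched in plain Python. This is an illustration only, not the GFFutils command or the notebook code: the function names, the example GFF lines, and the example annotation are all made up, and keeping genes that have no UniProt match (a left join) is an assumption.

```python
import csv
import io


def gene_rows(gff_lines):
    """Yield (chromosome, gene_id, start, end) for each gene feature in GFF3 lines."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        # Dbxref can hold several comma-separated cross-references; keep the GeneID.
        gene_id = None
        for xref in attrs.get("Dbxref", "").split(","):
            if xref.startswith("GeneID:"):
                gene_id = xref.split(":", 1)[1]
        if gene_id is not None:
            yield cols[0], gene_id, cols[3], cols[4]


def join_annotations(genes, annotations):
    """Left-join gene rows to annotations keyed by NCBI gene ID (assumed behavior)."""
    empty = ("", "", "", "")
    for chrom, gene_id, start, end in genes:
        yield (chrom, gene_id, start, end) + annotations.get(gene_id, empty)


# Made-up example data, for illustration only.
example_gff = [
    "##gff-version 3\n",
    "chr1\tGnomon\tgene\t100\t900\t.\t+\t.\tID=gene0;Dbxref=GeneID:111;Name=abc\n",
    "chr1\tGnomon\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0\n",
    "chr2\tGnomon\tgene\t50\t400\t.\t-\t.\tID=gene1;Dbxref=GeneID:222\n",
]
# Hypothetical mapping: gene ID -> (accession, gene name, description, GO IDs).
example_annotations = {
    "111": ("X0X0X0", "abc", "Example protein", "GO:0005737;GO:0003677"),
}

rows = list(join_annotations(gene_rows(example_gff), example_annotations))

# Write tab-delimited output: chromosome, gene ID, start, end, then annotations.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
print(out.getvalue(), end="")
```

Only the two `gene` lines produce rows here; the `mRNA` line is skipped, and the gene without an annotation gets empty fields.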
All of this is documented in the Jupyter Notebook below:
Jupyter Notebook (GitHub):
Jupyter Notebook (NBviewer):
RESULTS
Output folder:
20210601_ssal_gff-annotations/
Parsing of the UniProt Perl batch retrieval results file (~7.2M lines!) took ~6.5 hrs!
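For reference, parsing the UniProt batch results for accession, gene name, description, GO IDs and NCBI gene ID can be done in a single pass over the file. The sketch below assumes the results are in UniProt's flat-text format (two-letter line codes such as `AC`, `GN`, `DE`, `DR`, with records ending in `//`); that format is an assumption about this file, and the function names and example record are made up. It is not the parsing code used in the notebook.

```python
import re


def new_record():
    """Return an empty record with the fields of interest."""
    return {"accession": "", "gene": "", "description": "", "gene_ids": [], "go_ids": []}


def parse_uniprot_flat(lines):
    """Yield one dict per record from UniProt flat-text lines, in a single pass."""
    record = new_record()
    for line in lines:
        # Flat-text lines start with a two-letter code; data begins at column 6.
        code, value = line[:2], line[5:].strip()
        if code == "//":  # end-of-record marker
            if record["accession"]:
                yield record
            record = new_record()
        elif code == "AC" and not record["accession"]:
            # Keep only the primary (first) accession.
            record["accession"] = value.split(";")[0].strip()
        elif code == "GN" and not record["gene"]:
            match = re.search(r"Name=([^;{]+)", value)
            if match:
                record["gene"] = match.group(1).strip()
        elif code == "DE" and not record["description"]:
            match = re.search(r"Full=([^;{]+)", value)
            if match:
                record["description"] = match.group(1).strip()
        elif code == "DR":
            fields = [field.strip() for field in value.split(";")]
            if len(fields) > 1 and fields[0] == "GeneID":
                record["gene_ids"].append(fields[1])
            elif len(fields) > 1 and fields[0] == "GO":
                record["go_ids"].append(fields[1])


# Made-up record, for illustration only.
example_lines = [
    "ID   EXAMPLE_SALSA   Unreviewed;   100 AA.",
    "AC   X0X0X0;",
    "DE   SubName: Full=Example protein {ECO:0000313|EMBL:XXX};",
    "GN   Name=abc;",
    "DR   GeneID; 111; -.",
    "DR   GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell.",
    "DR   GO; GO:0003677; F:DNA binding; IEA:UniProtKB-KW.",
    "//",
]

records = list(parse_uniprot_flat(example_lines))
print(records)
```

Because the function is a generator, it can be fed an open file handle directly, so the whole results file never needs to be held in memory.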
Final tab-delimited file (10MB):
It’s organized in the following fashion:
- chromosome
- NCBI gene ID
- start
- end
- UniProt accession
- gene abbreviation/name
- gene description
- GO IDs
Other files:
- 20210601_ssal_accession-gene_id-gene-gene_description-go_ids.csv (8.0M)
  - MD5: e7d970782d7f531967dbfce01e5df549
- 20210601_ssal_chrom-start-end-Dbxref.csv (2.9M)
  - MD5: f4182e5129978328b0e9ae2b07d0bbf7
- 20210601_ssal_gene-list.txt (772K)
  - MD5: 0d330da91260189090ba2fac1ca0340f
- 20210601_ssal_uniprot_batch_results.txt (350M)
  - MD5: 81f63345d2f2cfbabdc8d60c3326ba66
- checksums.md5 (4.0K)
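After downloading, the files can be checked against their MD5 checksums. Here is a minimal Python sketch, assuming `checksums.md5` uses the usual `md5sum` layout (`<hash>  <filename>` per line); that layout is an assumption, and the function names and the temporary demo file are made up for illustration.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(checksum_path):
    """Return {filename: True/False} for each entry in an md5sum-style file."""
    base = os.path.dirname(os.path.abspath(checksum_path))
    results = {}
    with open(checksum_path) as handle:
        for line in handle:
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            # md5sum marks binary-mode entries with a leading '*'.
            name = name.strip().lstrip("*")
            results[name] = md5sum(os.path.join(base, name)) == expected.lower()
    return results


# Self-contained demo using a temporary directory and a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "example.txt")
    with open(data_path, "w") as handle:
        handle.write("example contents\n")
    checksum_path = os.path.join(tmp, "checksums.md5")
    with open(checksum_path, "w") as handle:
        handle.write(f"{md5sum(data_path)}  example.txt\n")
    demo_results = verify(checksum_path)

print(demo_results)
```

To check the real files, `verify()` would be pointed at the downloaded `checksums.md5`, with the listed files in the same directory.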