I finally had some time to tackle this GitHub Issue and create a canonical genes FastA file using the MAKER IDs, instead of the original contig IDs from our Olympia oyster genome assembly - https://owl.fish.washington.edu/halfshell/genomic-databank/Olurida_v081.fa (FastA; 1.1GB).
Everything was documented in a Jupyter Notebook (see link below), but here’s the skinny on how I did it:
Pull existing FastA-formatted sequences from the fully annotated GFF (GFF; 2.9GB; MAKER appended the FastAs to the end of the GFF).
Use ‘bedTools fastaFromBed’ to create a FastA for all genes using the gene GFF coordinates and generate unique FastA headers for each sequence.
Use sed to do a substitution using the MAKER IDs and the bedTools fastaFromBed IDs.
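For anyone curious about the first step, here is a minimal Python sketch of pulling the appended sequences out of a GFF. It is not the code from the notebook. It assumes the sequences start after a `##FASTA` directive line, as in the GFF3 convention; the function name `fasta_section` and the example records are made up for illustration.

```python
from typing import Iterable, Iterator


def fasta_section(gff_lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that follow the ##FASTA directive."""
    in_fasta = False
    for line in gff_lines:
        if in_fasta:
            # Everything after the directive is FastA-formatted sequence.
            yield line
        elif line.strip() == "##FASTA":
            in_fasta = True


# Tiny made-up GFF with one feature and one appended sequence.
example = [
    "##gff-version 3\n",
    "contig1\tmaker\tgene\t1\t4\t.\t+\t.\tID=gene1\n",
    "##FASTA\n",
    ">contig1\n",
    "ACGT\n",
]
extracted = list(fasta_section(example))
```

With a real file, passing the open file handle instead of a list streams it line by line, so a multi-gigabyte GFF never has to sit in memory.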
Jupyter Notebook (GitHub):
RESULTS
This ran for a surprisingly long time: a bit over 17 hours just for a find/replace. I think I could have sped things up if the last sed command had looked only at lines beginning with “>”, instead of scanning every line for every possible match. Oh well.
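Here is a rough sketch of that header-only idea in Python, not what I actually ran. It only touches lines starting with ">" and looks each ID up in a dictionary, so each header costs one lookup instead of one pass per possible substitution. The function name `rename_headers`, the coordinate-style header, and the MAKER-style ID are all hypothetical placeholders, and it assumes each header line holds nothing but the ID.

```python
from typing import Dict, Iterable, Iterator


def rename_headers(fasta_lines: Iterable[str], id_map: Dict[str, str]) -> Iterator[str]:
    """Swap header IDs using id_map; sequence lines pass through untouched."""
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumes the whole header (minus ">") is the ID to look up.
            old_id = line[1:].strip()
            # Headers without a mapping are kept unchanged.
            yield ">" + id_map.get(old_id, old_id) + "\n"
        else:
            yield line


# Hypothetical mapping from a coordinate-style header to a MAKER-style ID.
id_map = {"contig1:0-4": "hypothetical_maker_gene_1"}
fasta = [">contig1:0-4\n", "ACGT\n", ">contig2:5-9\n", "TTGA\n"]
renamed = list(rename_headers(fasta, id_map))
```

Because the sequence lines are never compared against the IDs, the run time should scale with the number of headers rather than with the number of lines times the number of IDs.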
Output folder:
Renamed FastA ():
Renamed FastA Index (txt):
Will add to Genomic Resources wiki.