I finally had some time to tackle this GitHub Issue and create a canonical genes FastA file using the MAKER IDs, instead of the original contig IDs from our Olympia oyster genome assembly - https://owl.fish.washington.edu/halfshell/genomic-databank/Olurida_v081.fa (FastA; 1.1GB).
Everything was documented in a Jupyter Notebook (see link below), but here’s the skinny on how I did it:
Pull existing FastA-formatted sequences from the fully annotated GFF (GFF; 2.9GB; MAKER appended the FastAs to the end of the GFF).
Use ‘bedTools fastaFromBed’ to create FastA for all genes using gene GFF coordinates and generate unique FastA headers for each sequence.
sed to do a substitution using the MAKER IDs and the
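For anyone curious about the first step, here is a minimal Python sketch of pulling the appended FastA out of a GFF. It is not the command used in the notebook. It assumes the sequences sit after a `##FASTA` directive line, which is the GFF3 convention, and the function name and the tiny demo record are made up for illustration.

```python
import io


def extract_fasta_from_gff(gff_lines):
    """Yield every line that follows the ##FASTA directive in a GFF3 stream."""
    in_fasta = False
    for line in gff_lines:
        if in_fasta:
            yield line
        elif line.strip() == "##FASTA":
            # Everything after this directive is FastA-formatted sequence.
            in_fasta = True


# Hypothetical miniature GFF, for demonstration only.
demo_gff = io.StringIO(
    "##gff-version 3\n"
    "Contig0\tmaker\tgene\t10\t20\t.\t+\t.\tID=gene-1\n"
    "##FASTA\n"
    ">Contig0\n"
    "ACGTACGTACGTACGTACGTACGT\n"
)

fasta_text = "".join(extract_fasta_from_gff(demo_gff))
print(fasta_text, end="")
```

Because it streams line by line, the same function should work on a real multi-GB file opened with `open()` without loading it all into memory.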
Jupyter Notebook (GitHub):
This ran for a surprisingly long time - a bit over 17 hours just for a find/replace. I think I could’ve sped things up if the last sed command had looked only at lines beginning with “>”, instead of scanning every line for each possible match. Oh well.
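A rough sketch of that header-only idea, in Python rather than sed: load the old-ID-to-MAKER-ID pairs into a dictionary once, then touch only the lines that start with ">". This is not what the notebook ran. The function name, the header format (`Contig0:9-20`), and the MAKER-style ID are all hypothetical stand-ins.

```python
def rename_headers(fasta_lines, id_map):
    """Yield FastA lines, swapping header IDs found in id_map.

    Sequence lines pass through untouched, so each header costs a single
    dictionary lookup instead of one scan per possible ID.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            old_id = line[1:].rstrip("\n")
            # Leave the header unchanged if there is no mapping for it.
            new_id = id_map.get(old_id, old_id)
            yield ">" + new_id + "\n"
        else:
            yield line


# Hypothetical mapping and records, for demonstration only.
id_map = {"Contig0:9-20": "maker-gene-1"}
demo_lines = [
    ">Contig0:9-20\n",
    "ACGTACGTACG\n",
    ">Contig1:0-5\n",
    "TTTTT\n",
]

renamed = list(rename_headers(demo_lines, id_map))
print("".join(renamed), end="")
```

The work then scales with the number of lines in the file, not with the number of lines multiplied by the number of IDs, which is presumably where most of the 17 hours went.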
Renamed FastA ():
Renamed FastA Index (txt):
Will add to Genomic Resources wiki.