Steven wanted me to generate FastA files (GitHub Issue) for Panopea generosa (Pacific geoduck) coding sequences (CDS), genes, and mRNAs. One of the primary needs, though, was to have an ID that could be used for downstream table joining/mapping. I ended up using a combination of GFFutils and bedtools getfasta. I took advantage of being able to create a custom name column in BED files to generate the desired FastA description lines, with IDs that identify, and map between, CDS, genes, and mRNAs across the FastAs and GFFs.
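As a rough illustration of the custom name column idea (this is not the notebook's actual code), the Python sketch below converts GFF3 lines into BED lines whose name column is the feature ID joined to its ancestors' IDs with `|`. The function names `parse_attributes` and `gff_to_named_bed` are hypothetical. The sketch assumes every feature has an `ID` attribute and at most one `Parent`, and the sample coordinates are made up.

```python
def parse_attributes(attr_field):
    """Parse a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in attr_field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff_to_named_bed(gff_lines, feature_type):
    """Yield BED6 lines for one feature type with a custom name column.

    The name is the feature ID followed by its ancestors' IDs, joined by "|".
    Assumption: each feature has an ID and at most one Parent.
    """
    records = [line.rstrip("\n").split("\t") for line in gff_lines
               if line.strip() and not line.startswith("#")]
    records = [rec for rec in records if len(rec) == 9]

    # Map every feature ID to its Parent ID so ancestors can be looked up.
    parent_of = {}
    for rec in records:
        attrs = parse_attributes(rec[8])
        if "ID" in attrs and "Parent" in attrs:
            parent_of[attrs["ID"]] = attrs["Parent"]

    for rec in records:
        seqid, _source, ftype, start, end, score, strand, _phase, attr = rec
        if ftype != feature_type:
            continue
        ids = [parse_attributes(attr)["ID"]]
        # Walk up the Parent chain (CDS -> mRNA -> gene).
        while ids[-1] in parent_of:
            ids.append(parent_of[ids[-1]])
        # GFF is 1-based inclusive; BED is 0-based half-open.
        bed_start = int(start) - 1
        yield "\t".join([seqid, str(bed_start), end, "|".join(ids), score, strand])


# Made-up sample records, for illustration only.
sample_gff = [
    "##gff-version 3",
    "Scaffold_01\tsrc\tgene\t3\t500\t.\t+\t.\tID=PGEN_.00g000010",
    "Scaffold_01\tsrc\tmRNA\t3\t500\t.\t+\t.\tID=PGEN_.00g000010.m01;Parent=PGEN_.00g000010",
    "Scaffold_01\tsrc\tCDS\t3\t125\t.\t+\t0\tID=PGEN_.00g000010.m01.CDS01;Parent=PGEN_.00g000010.m01",
]
for bed_line in gff_to_named_bed(sample_gff, "CDS"):
    print(bed_line)
```

A BED file written this way can be passed to bedtools getfasta with its `-name` option, which carries the name column into the FastA description line.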
This was all documented in a Jupyter Notebook:
GitHub:
NB Viewer:
RESULTS
Output folder:
-
MD5 checksums for all files (text):
- checksums.md5 (4.0K)
FastA files and FastA index files:
- Panopea-generosa-v1.0.a4.CDS.fasta (67M)
  - MD5: fb192eab0aefd5d3ba5bebef2a012f15
- Panopea-generosa-v1.0.a4.CDS.fasta.fai (26M)
  - MD5: f2266a449290ea0383d2eb98eb3ed426
- Panopea-generosa-v1.0.a4.gene.fasta (362M)
  - MD5: 7c956b1c27d14bd91959763403f81265
- Panopea-generosa-v1.0.a4.gene.fasta.fai (2.4M)
  - MD5: 588d18f5fe0e4f2259a25586349fc244
- Panopea-generosa-v1.0.a4.mRNA.fasta (475M)
  - MD5: 1823be75694cf70f0ea6f1abc072ba16
- Panopea-generosa-v1.0.a4.mRNA.fasta.fai (3.4M)
  - MD5: e120b4c1d3bb0917868e72cd22507bbc
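For anyone downloading these files, the checksums can be verified with a few lines of Python. This is a minimal sketch that assumes checksums.md5 uses the usual `md5sum` layout (hash, whitespace, file name) and sits in the same folder as the files. The helper names `md5sum` and `verify_checksums` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Return {file name: True/False} for each entry in an md5sum-style file.

    Assumption: each non-empty line is "<hash>  <file name>" and the files
    live next to the checksum file.
    """
    checksum_file = Path(checksum_file)
    results = {}
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading "*" on the file name.
        name = name.strip().lstrip("*")
        results[name] = md5sum(checksum_file.parent / name) == expected.lower()
    return results
```

Calling `verify_checksums("checksums.md5")` from the download folder returns a dictionary in which any `False` value flags a file that should be downloaded again.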
CDS FastA description lines look like this:
>PGEN_.00g000010.m01.CDS01|PGEN_.00g000010.m01|PGEN_.00g000010::Scaffold_01:2-125
Explanation for CDS:
- PGEN_.00g000010.m01.CDS01: Unique sequence ID.
- PGEN_.00g000010.m01: “Parent” ID. Corresponds to unique mRNA ID.
- PGEN_.00g000010: “Parent” ID. Corresponds to unique gene ID.
- Scaffold_01: Originating scaffold.
- 2-125: Sequence coordinates from scaffold mentioned above.
mRNA FastA description lines look like this:
>PGEN_.00g000030.m01|PGEN_.00g000030::Scaffold_01:49248-52578
Explanation for mRNA:
- PGEN_.00g000030.m01: Unique sequence ID.
- PGEN_.00g000030: “Parent” ID. Corresponds to unique gene ID.
- Scaffold_01: Originating scaffold.
- 49248-52578: Sequence coordinates from scaffold mentioned above.
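For downstream table joining, the description lines described above can be split back into their parts. The sketch below is one possible way to do that in Python. The function name `parse_description` and the dictionary keys are hypothetical. Gene description lines are not shown here, so treating a line with no parent IDs as a gene is an assumption.

```python
def parse_description(line):
    """Split a description line of the form ID|parent|...::scaffold:start-end.

    Returns a dict with the sequence ID, its parent IDs (closest first),
    the gene ID, the scaffold, and the integer coordinates.
    """
    header = line.strip().lstrip(">")
    names, _, location = header.partition("::")
    # Split on the last ":" so the scaffold name is kept whole.
    scaffold, _, span = location.rpartition(":")
    start, _, end = span.partition("-")
    sequence_id, *parent_ids = names.split("|")
    return {
        "sequence_id": sequence_id,
        "parent_ids": parent_ids,
        # Assumption: the last parent is the gene; with no parents,
        # the sequence itself is taken to be a gene.
        "gene_id": parent_ids[-1] if parent_ids else sequence_id,
        "scaffold": scaffold,
        "start": int(start),
        "end": int(end),
    }


cds = parse_description(
    ">PGEN_.00g000010.m01.CDS01|PGEN_.00g000010.m01|PGEN_.00g000010::Scaffold_01:2-125"
)
mrna = parse_description(
    "PGEN_.00g000030.m01|PGEN_.00g000030::Scaffold_01:49248-52578"
)
print(cds["gene_id"], mrna["gene_id"])
```

Because the CDS and mRNA records both expose a `gene_id`, that key can serve as the join column between tables built from the different FastAs.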