I’ve been reviewing some of the CEABIGR (GitHub repo) data I’ve generated, specifically the transcript count data and calculations. As part of that, I think we should annotate the transcripts so that we can draw more informed conclusions. Steven previously performed annotations (Notebook), as did Yaamini (GitHub Issue). However, both approaches have shortcomings: Steven’s annotation relied only on coding sequences (CDS), while Yaamini’s used only mRNAs.

To get a more robust annotation for all transcripts/genes, including long non-coding RNAs (lncRNAs), I opted to extract all gene sequences (as FastA) for subsequent BLASTx and gene ontology (GO) annotation. To extract the FastA, I used gffread with the NCBI Crassostrea virginica (Eastern oyster) genome, GCF_002022765.2_C_virginica-3.0_genomic.fna, and the genes BED file, C_virginica-3.0_Gnomon_genes.bed (available on the Genomic Resources Handbook page). All work was run on Raven in the following Jupyter Notebook:
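
As a rough sketch of the extraction step described above (the notebook linked below has the actual commands, which may differ), the gffread call can be assembled like this. The flag choices are my assumption about a typical invocation: `-g` for the genome FastA, `-w` for the output FastA, and `--in-bed` to read the input as BED. The function names `build_gffread_cmd` and `run_gffread` are hypothetical helpers, not something from the notebook.

```python
import shlex
import shutil
import subprocess


def build_gffread_cmd(genome_fasta, genes_bed, out_fasta):
    """Assemble a gffread command that writes one FastA record per BED feature.

    Assumed flags (not confirmed against the notebook):
      --in-bed  parse the input annotation as BED
      -g        genome FastA to pull sequence from
      -w        output FastA of the extracted sequences
    """
    return [
        "gffread",
        "--in-bed",
        "-g", str(genome_fasta),
        "-w", str(out_fasta),
        str(genes_bed),
    ]


def run_gffread(genome_fasta, genes_bed, out_fasta):
    """Run gffread, failing early if the program is not on the PATH."""
    if shutil.which("gffread") is None:
        raise FileNotFoundError("gffread was not found on the PATH")
    subprocess.run(
        build_gffread_cmd(genome_fasta, genes_bed, out_fasta), check=True
    )


# Show the command that would be run for the files named in this post.
cmd = build_gffread_cmd(
    "GCF_002022765.2_C_virginica-3.0_genomic.fna",
    "C_virginica-3.0_Gnomon_genes.bed",
    "GCF_002022765.2_C_virginica-3.0-genes.fasta",
)
print(shlex.join(cmd))
```

Calling `run_gffread(...)` with the same three paths would actually run the extraction, provided gffread is installed and the input files are in the working directory.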

20230726-cvir-genes_bed-to-fasta.ipynb (GitHub)
20230726-cvir-genes_bed-to-fasta.ipynb (NB Viewer)

RESULTS

Now, on to BLASTing.

Output folder:

20230726-cvir-genes_bed-to-fasta

FastA
- GCF_002022765.2_C_virginica-3.0-genes.fasta (408M)
  - MD5: a0546fd42642673d80b3071089a6711b
FastA Index
- GCF_002022765.2_C_virginica-3.0-genes.fasta.fai (1.5M)
  - MD5: e69ecc217c2e695a6dab7e599984d592
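
For anyone downloading these files, here is a minimal sketch for checking them against the MD5 checksums listed above. It assumes the files sit in a single local directory; the helper names `md5sum` and `verify_downloads` are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksums copied from the list above.
EXPECTED_MD5 = {
    "GCF_002022765.2_C_virginica-3.0-genes.fasta": "a0546fd42642673d80b3071089a6711b",
    "GCF_002022765.2_C_virginica-3.0-genes.fasta.fai": "e69ecc217c2e695a6dab7e599984d592",
}


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5sum(path) == expected
    return results
```

Running `verify_downloads(".")` in the download directory returns a dictionary with one True/False entry per file; a False means the file is missing or its checksum does not match.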
