I’ve been reviewing some of the CEABIGR (GitHub repo) data I’ve generated, specifically the transcript count data and calculations. As part of that, I think we should annotate the transcripts so we can draw better-informed conclusions. Steven had previously performed annotations (Notebook), as had Yaamini (GitHub Issue). However, both approaches have shortcomings: Steven’s annotation relied only on coding sequences (CDS), while Yaamini’s used only mRNAs.
To get a more robust annotation for all transcripts/genes (including long non-coding RNAs (lncRNAs)), I opted to extract all gene sequences (as FastA) for subsequent BLASTx and gene ontology (GO) annotation. To extract the FastA, I used gffread with the NCBI Crassostrea virginica (Eastern oyster) genome, GCF_002022765.2_C_virginica-3.0_genomic.fna, along with the genes BED file C_virginica-3.0_Gnomon_genes.bed (available on the Genomic Resources Handbook page). All work was run on Raven in the following Jupyter Notebook:
20230726-cvir-genes_bed-to-fasta.ipynb (NB Viewer)
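For anyone unfamiliar with what a BED-to-FastA extraction does, here is a minimal Python sketch of the general idea. It is not what the notebook ran (that used gffread), and the function names `read_fasta` and `bed_to_fasta` and the toy sequences are made up for illustration. It assumes a tab-delimited BED file with 0-based, half-open coordinates, and it reverse-complements minus-strand features when a strand column is present.

```python
# Illustrative only: the notebook used gffread. This pure-Python sketch just
# shows the underlying idea of cutting BED intervals out of a genome FastA.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Parse FastA lines into {sequence_id: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Keep only the ID (text before the first whitespace).
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def bed_to_fasta(genome, bed_lines):
    """Yield (header, sequence) for each BED record.

    BED coordinates are 0-based and half-open, so they map directly onto
    Python slices. Minus-strand features are reverse-complemented when a
    strand column (column 6) is present.
    """
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        yield name, seq


if __name__ == "__main__":
    # Toy data, not taken from the C. virginica genome.
    genome = read_fasta([">chr1 toy", "ACGTACGTAA", "CCGGTT"])
    bed = ["chr1\t0\t4\tgeneA\t0\t+", "chr1\t4\t10\tgeneB\t0\t-"]
    for header, seq in bed_to_fasta(genome, bed):
        print(f">{header}\n{seq}")
```

This loads the whole genome into memory, which is fine for a toy example but is one reason to prefer a dedicated, indexed tool for a real genome.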
RESULTS
Now, on to BLASTing.
Output folder:
20230726-cvir-genes_bed-to-fasta
FastA
GCF_002022765.2_C_virginica-3.0-genes.fasta (408M)
- MD5: a0546fd42642673d80b3071089a6711b
FastA Index
GCF_002022765.2_C_virginica-3.0-genes.fasta.fai (1.5M)
- MD5: e69ecc217c2e695a6dab7e599984d592
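If you download these files, it is worth confirming the checksums before using them. Below is a minimal Python sketch for doing that. The helper names `md5sum` and `verify` are hypothetical, and it assumes the two files sit in the current working directory. The expected values are the MD5s listed above.

```python
import hashlib
import os


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(expected, directory="."):
    """Map each filename to True (match), False (mismatch), or None (missing)."""
    results = {}
    for filename, checksum in expected.items():
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            results[filename] = None
        else:
            results[filename] = md5sum(path) == checksum.lower()
    return results


# Checksums as listed in this post.
EXPECTED = {
    "GCF_002022765.2_C_virginica-3.0-genes.fasta": "a0546fd42642673d80b3071089a6711b",
    "GCF_002022765.2_C_virginica-3.0-genes.fasta.fai": "e69ecc217c2e695a6dab7e599984d592",
}

if __name__ == "__main__":
    for name, status in verify(EXPECTED).items():
        label = {True: "OK", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```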