While identifying differentially expressed transcripts (DETs) and genes (DEGs) for our Crassostrea virginica (Eastern oyster) RNAseq/DNA methylation comparison of changes across sex and ocean acidification conditions (https://github.com/epigeneticstoocean/2018_L18-adult-methylation), I realized that the DEG tables I was generating had inflated gene counts because the analysis (and, in turn, the genome coordinates) was tied to transcripts. Any gene with multiple transcripts was counted once per transcript, and the analysis listed only transcript coordinates, not gene coordinates.
To get gene-only coordinates, I needed a BED file to merge with the DEG data. As it turns out, we didn't have an existing BED file with just gene coordinates and gene names, so I used GFFutils to extract only the gene features from the NCBI GCF_002022765.2 GFF (links to NCBI directory - not directly to GFF file).
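For readers who want to see the idea without opening the notebook, here is a minimal plain-Python sketch of pulling gene features out of a GFF3 file and writing them as BED rows. It is not the GFFutils workflow used in the notebook. The function names, the six-column BED layout, the choice of the `gene` attribute (falling back to `Name`, then `ID`) for the name column, and the example records are all assumptions for illustration.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Turn a GFF3 attribute string (key=value;key=value) into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def gff_genes_to_bed(lines):
    """Yield (chrom, start, end, name, score, strand) for each gene feature."""
    for line in lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Keep only well-formed rows whose feature type is exactly "gene".
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: prefer the "gene" attribute, then "Name", then "ID".
        name = attrs.get("gene") or attrs.get("Name") or attrs.get("ID", ".")
        # GFF3 coordinates are 1-based inclusive; BED is 0-based half-open,
        # so only the start needs to shift by one.
        start = int(cols[3]) - 1
        end = int(cols[4])
        yield (cols[0], start, end, name, 0, cols[6])


# Made-up records for illustration only; these are not real annotation lines.
example_gff = [
    "##gff-version 3\n",
    "chrA\tGnomon\tgene\t101\t500\t.\t+\t.\tID=gene-geneA;Name=geneA;gene=geneA\n",
    "chrA\tGnomon\tmRNA\t101\t500\t.\t+\t.\tID=rna-1;Parent=gene-geneA;gene=geneA\n",
    "chrA\tGnomon\tmRNA\t101\t450\t.\t+\t.\tID=rna-2;Parent=gene-geneA;gene=geneA\n",
    "chrB\tGnomon\tgene\t2000\t2600\t.\t-\t.\tID=gene-geneB;Name=geneB;gene=geneB\n",
]

bed_rows = list(gff_genes_to_bed(example_gff))
for row in bed_rows:
    print("\t".join(str(field) for field in row))
```

In the example, the two mRNA records for the same gene are dropped and each gene appears once, which is the property needed to avoid counting a gene once per transcript when merging with the DEG data.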
All the data wrangling is documented in the Jupyter Notebook below.
Jupyter Notebook (GitHub):
RESULTS
Output folder:
BED file (1.7MB): 20211209_cvir_gff-to-bed/20211209_cvir_GCF_002022765.2_genes.bed
MD5 checksum: c8f203de591c0608b96f4299c0f847dc
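If you download the BED file, the checksum above can be compared against a locally computed one. The following is a small sketch using Python's standard `hashlib`; the helper name `md5_of_file` and the local file path in the usage comment are assumptions.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the file was saved to the current directory):
# md5_of_file("20211209_cvir_GCF_002022765.2_genes.bed") == "c8f203de591c0608b96f4299c0f847dc"
```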
The resulting BED file was renamed to C_virginica-3.0_Gnomon_genes.bed for consistency, added to the common storage location for C.virginica genome tracks (http://eagle.fish.washington.edu/Cvirg_tracks/; file: https://eagle.fish.washington.edu/Cvirg_tracks/C_virginica-3.0_Gnomon_genes.bed), and added to the Roberts Lab Handbook - Genomic Resources page.