Metagenomics - Taxonomic Diversity from Geoduck Water with BLASTp and Krona plots

We’re working on getting the metagenomics sequencing project written up as a manuscript and Steven asked me to provide an overview of the taxonomic makeup of our metagenome assembly in this GitHub Issue.

I previously assembled all of the sequencing data into a single assembly (i.e. did not assemble by experimental treatments):

Subsequently, I ran some gene prediction software to help refine the assembly into a more conservative representation, in hopes of getting a more realistic view of biologically relevant DNA (i.e. analyzing sequenced DNA that actually has putative functions, as opposed to random eDNA that may have been floating around in the water):

For getting taxonomic info, I took the MetaGeneMark proteins FastA file and ran BLASTp against the NCBI SwissProt database (v5) to get taxonomic IDs. See this Jupyter Notebook (GitHub):

This was followed up by using Krona to plot the data in an interactive fashion, according to NCBI taxonomic ID abundance (see Results below).
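As a rough illustration of what "taxonomic ID abundance" means here, the sketch below tallies tax IDs from BLAST tabular output. It is not the code from the notebook. It assumes tab-delimited output with the subject tax ID in the 13th column (as produced, for example, by `-outfmt "6 std staxids"`), keeps only the first hit listed for each query, and uses a hypothetical function name (`count_taxids`).

```python
from collections import Counter


def count_taxids(blast_lines, taxid_column=12):
    """Tally tax IDs from BLAST tabular lines, one hit per query.

    Assumptions (not taken from the original notebook):
      - lines are tab-delimited BLAST tabular output
      - the tax ID sits at zero-based index `taxid_column`
      - the first line seen for a query is treated as its best hit
    """
    seen_queries = set()
    counts = Counter()
    for line in blast_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip malformed lines that lack the tax ID column
        if len(fields) <= taxid_column:
            continue
        query_id = fields[0]
        # Only count the first hit reported for each query
        if query_id in seen_queries:
            continue
        seen_queries.add(query_id)
        # BLAST can report several tax IDs separated by ";"; keep the first
        taxid = fields[taxid_column].split(";")[0]
        counts[taxid] += 1
    return counts
```

The resulting counts are the kind of per-tax-ID abundances that a Krona plot displays; the exact Krona invocation is left out here.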

Here’s how the sample names break down:

Sample   Developmental Stage (days post-fertilization)   pH Treatment
MG1      13                                              8.2
MG2      17                                              8.2
MG3      6                                               7.1
MG5      10                                              8.2
MG6      13                                              7.1
MG7      17                                              7.1


Output folder:

As a brief overview, the initial Megahit assembly generated:

  • 2,276,153 contigs.

MetaGeneMark predicted:

  • 3,296,610 genes.

BLASTp resulted in:

  • 1,346,325 SwissProt matches
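To put those three numbers in proportion, here is a small arithmetic sketch. It assumes that each SwissProt match corresponds to one distinct predicted gene, which the counts above do not confirm, so treat the match fraction as approximate.

```python
# Counts reported above
contigs = 2_276_153
predicted_genes = 3_296_610
swissprot_matches = 1_346_325

# Average number of MetaGeneMark gene predictions per Megahit contig
genes_per_contig = predicted_genes / contigs

# Fraction of predicted genes with a SwissProt match
# (assumes one match per predicted gene; this is an assumption)
match_fraction = swissprot_matches / predicted_genes

print(f"{genes_per_contig:.2f} predicted genes per contig")
print(f"{match_fraction:.1%} of predicted genes with a SwissProt match")
```

Under that assumption, this works out to roughly 1.4 predicted genes per contig and about 41% of predicted genes with a SwissProt match.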

The Krona plot provides a pretty nice way to view the breakdown of the data and, as such, I won’t provide a written summary of how it all shakes out.

Next, for curiosity’s sake, I’ll run BLASTn and see how things compare.