Data Analysis - C.virginica MBD Sequencing Coverage

A while ago, Steven tasked me with assessing some questions related to the sequencing coverage we get doing MBD-BSseq in this GitHub issue. At the heart of it all was really to try to get an idea of how much usable data we actually get when we do an MBD-BSseq project. Yaamini happened to have done an MBD-BSseq project relatively recently, and it’s one she’s actively working on analyzing, so we went with that data set.

Data was initially trimmed:

Subsequently, the data was concatenated, subset, and aligned using Bismark:

Today, I finally found the time to dedicate to evaluating alignment coverage of each of the Bismark sequence subsets. It was done in a Jupyter Notebook and solely with Bash/Python! I used this as project as an excuse to dive into using Python a bit more, instead of using R. For what I needed to accomplish, I just felt like this approach was simpler (instead of creating an R project and all of that).


Plot of the various “Percent Methylation” coverages in each of the Bismark subsets. The lines/bars are redundant. I added the lines to provide a better visual guide to see how the percent methylation coverage changes from one Bismark subset to the next. A quick guide to X axis labeling:

  • half_avg_reads: Subset of 50% of the average number of read pairs.

  • avg_reads: Subset of the average number of read pairs.

  • half_total_reads: Subset of 50% of the total number of read pairs.

  • total_reads: All read pairs.

Bar/line plot showing differences in percent CpG coverage of each Bismark subset

Summary Table
Bismark Subset No. of Read pairs Trimmed read length (bp) No. of Bases (Gbp) Genome size (Gbp) Genome coverage (x fold) Mean CpG Coverage Percent CpG Coverage Percent 5x CpG Coverage Percent 10x CpG Coverage
total_reads_CpG_coverage 275914272 80 22.07 0.684741 32.24 7.3 45.3 14.7 9.1
half_total_reads_CpG_coverage 137957136 80 11.04 0.684741 16.12 3.5 33.5 9.2 5.7
avg_reads_CpG_coverage 27591427 80 2.21 0.684741 3.22 0.7 15.3 3.2 1.8
half_avg_reads_CpG_coverage 13795713 80 1.10 0.684741 1.61 0.5 11.2 2.3 1.2

After looking at this a bit, it’s surprising how little coverage we actually end up with. This is data for all of the samples combined (n = 10), 100bp PE sequencing. This helps inform us on how much sequencing we actually need to do in order to achieve a target coverage.

However, this analysis doesn’t tell us anything about methylated CpG coverage (although, it could if I tweaked things a tiny bit). Yaamini has been working on that, so she should have those answers. Her analysis and my analysis will likely be used to contribute to our BS-seq decision tree.