In getting things prepared for the manuscript we’re writing about the Olympia oyster genotype-by-sequencing data from BGI, I felt we needed to provide a FastQC analysis of the raw data (since these two files are what we submitted to the NCBI short read archive) to provide support for the Technical Validation section of the manuscript.
Below, is the Jupyter notebook I used to run the FastQC analysis on the two files. I’ve embedded for quick viewing, but it might be easier to view the notebook via the GitHub link.
Well, I realized that running FastQC on the raw data might not reveal anything all too helpful. The reason for this is that the adaptor and barcode sequences are still present on all the reads. This will lead to over-representation of these sequences in all of the samples, which, in turn, will skew FastQC’s intepretation of the read qualities. For comparison, I’ll run FastQC on the demultiplexed data provided by BGI and see what the FastQC report looks like on trimmed data.
However, I’ll need to discuss with Steven about whether or not providing the FastQC analysis is worthwhile as part of the “technical validation” aspect of the manuscript. I guess it can’t hurt to provide it, but I’m not entirely sure that the FastQC report provides any real information regarding the quality of the sequencing reads that we received…
Jupyter notebook (GitHub): 20170301_docker_fastqc_nondemultiplexed_bgi_oly_gbs.ipynb