Data Management - Tracking O.lurida FASTQ File Corruption

UPDATE 20170104 - These two corrupt files have been replaced with non-corrupt files.


Sean identified an issue with one of the original FASTQ files provided to use by BGI. Additionally, Steven had (unknowingly) identified the same corrupt file, as well as a second corrupt file in the set of FASTQ files. The issue is discussed here: https://github.com/sr320/LabDocs/issues/334

Steven noticed the two files when he ran the program FASTQC and two files generated no output (but no error message!).

The two files in question are:

  • 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz

  • 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz

This post is an attempt to document where things went wrong, but having glanced through this data a bit already, it won’t provide any answers.

I originally downloaded the data on 20160127 to my home folder on Owl (this is detailed in the Jupyter notebook in that post) and generated/compared MD5 checksum values. The values matched at that time.

So, let’s investigate a bit further…

Launch Docker container

<code>docker run - p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:owl_web -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/jupyter_nbs -it 0ba43904567e</code>

The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files accessible to the Docker container.

Once the container was started, started Jupyter Notebook with the following command inside the Docker container:

jupyter notebook

This command is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.

Jupyter notebook file: 20161117_docker_oly_genome_fastq_corruption.ipynb

I’ve embedded the notebook below, but it’s much easier to view (there are many lengthy commands/filenames that wrap lines in the embedded version below) the actual file linked above.