In my pursuit to identify which contigs/scaffolds of our “C.bairdi” genome assembly from 20200917 correspond to interesting taxa, based on taxonomic assignments produced by MEGAN6 on 20200928, I used MEGAN6 to extract taxa-specific reads from cbai_genome_v1.01 on 20201007. That output is only available in FastA format. Since I want the original reads in FastQ format, I will take the sequence IDs from each FastA index file and provide them to seqtk to extract the matching FastQ reads for each sample and corresponding taxon.
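For illustration, here is a minimal Python sketch of that ID-list step. It is not the code from the notebook. It assumes the FastA index is a standard samtools-style .fai file (tab-separated, sequence name in the first column), and the function names and file paths are hypothetical. The seqtk command is only built as a string, not executed.

```python
import shlex
from pathlib import Path


def read_ids_from_fai(fai_path):
    """Return sequence IDs (first tab-separated column) from a FastA index."""
    ids = []
    with open(fai_path) as fai:
        for line in fai:
            line = line.rstrip("\n")
            if line:  # ignore blank lines
                ids.append(line.split("\t")[0])
    return ids


def write_id_list(ids, list_path):
    """Write one read ID per line, a name-list format seqtk subseq accepts."""
    Path(list_path).write_text("".join(f"{read_id}\n" for read_id in ids))


def seqtk_subseq_command(fastq_path, list_path, out_fastq):
    """Build (but do not run) a shell command line for `seqtk subseq`."""
    return (
        f"seqtk subseq {shlex.quote(str(fastq_path))} "
        f"{shlex.quote(str(list_path))} > {shlex.quote(str(out_fastq))}"
    )
```

With a hypothetical index file named taxon.fasta.fai, one would call read_ids_from_fai on it, pass the result to write_id_list, and then run the string returned by seqtk_subseq_command in a shell.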

This was run on my personal computer (mephisto) and documented in a Jupyter Notebook:

Jupyter Notebook (GitHub):

20201013_mephisto_cbai_seqtk_megan-fastq-read-extractions.ipynb

RESULTS

Output folders/files:

FastQ files end with the .fq suffix.
The ID list supplied to seqtk ends with the suffix seqtk-read-id-list. It is a simple text file.

201002558-2729-Q7 (Hematodinium-free C.bairdi muscle)

20201013_201002558-2729-Q7_megan-reads/

6129-403-26-Q7 (Hematodinium-infected C.bairdi hemolymph)

20201013_6129-403-26-Q7_megan-reads/

Next up, I’ll use Minimap2 to map these reads to cbai_genome_v1.01.fasta (18MB). Once that’s complete, I want to see which contigs/scaffolds from the Flye assembly on 20200917 have reads from each taxon mapped to them.
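As a rough sketch of how that last summary might be tallied, the Python below counts distinct reads per contig from PAF records, which is Minimap2's default output format (read name in column 1, target name in column 6). Nothing here has been run on the real data. The function names and the taxon labels in the usage note are hypothetical.

```python
from collections import defaultdict


def reads_per_contig(paf_lines):
    """Map each target (contig) name to the set of read names aligned to it."""
    hits = defaultdict(set)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines (PAF has 12 mandatory columns)
        read_name, contig_name = fields[0], fields[5]
        hits[contig_name].add(read_name)
    return hits


def contig_taxa_table(paf_by_taxon):
    """Given {taxon: iterable of PAF lines}, return {contig: {taxon: n_reads}}."""
    table = defaultdict(dict)
    for taxon, paf_lines in paf_by_taxon.items():
        for contig, reads in reads_per_contig(paf_lines).items():
            # Count each read once per contig, even if it has several alignments.
            table[contig][taxon] = len(reads)
    return dict(table)
```

Passing something like {"taxon_A": open("taxon_A.paf"), "taxon_B": open("taxon_B.paf")} to contig_taxa_table would give, for each contig, the number of distinct reads from each taxon that mapped to it.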

Admittedly, I’m not entirely sure where I’m going with this, or if there’s even a point any more. However, it’s an interesting exercise in bioinformatics stuff (new tools/software, data “munging” practice).