In my effort to identify which contigs/scaffolds of our “C.bairdi” genome assembly from 20200917 correspond to taxa of interest, based on the taxonomic assignments produced by MEGAN6 on 20200928, I used MEGAN6 to extract taxa-specific reads from cbai_genome_v1.01 on 20201007. That output is only available in FastA format. Since I want the original reads in FastQ format, I will take the FastA sequence IDs (from the FastA index file) and provide them to seqtk to extract the FastQ reads for each sample and corresponding taxa.
This was run on my personal computer (mephisto) and documented in a Jupyter Notebook:
Jupyter Notebook (GitHub):
FastQ files end with the
The ID list supplied to seqtk ends with the suffix seqtk-read-id-list. It is a simple text file.
201002558-2729-Q7 (Hematodinium-free C.bairdi muscle)
6129-403-26-Q7 (Hematodinium-infected C.bairdi hemolymph)
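The extraction step described above (FastA index IDs handed to seqtk) can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the function names are made up for the example, and it assumes the first column of each `.fai` line matches the read name in the original FastQ file and that `seqtk` is on the `PATH`. The only seqtk feature it relies on is `seqtk subseq <in.fq> <name.lst>`, which prints the reads whose names appear in the list.

```python
import subprocess


def write_read_id_list(fai_path, id_list_path):
    """Copy the first column (sequence ID) of a FastA index to a one-ID-per-line text file."""
    count = 0
    with open(fai_path) as fai, open(id_list_path, "w") as out:
        for line in fai:
            if not line.strip():
                continue  # skip blank lines
            read_id = line.split("\t", 1)[0].strip()
            out.write(read_id + "\n")
            count += 1
    return count


def seqtk_subseq_command(fastq_path, id_list_path):
    """Build the seqtk command that pulls the listed reads out of a FastQ file."""
    return ["seqtk", "subseq", str(fastq_path), str(id_list_path)]


def extract_fastq_reads(fastq_path, id_list_path, out_fastq_path):
    """Run seqtk (assumed to be installed) and write the matching reads to out_fastq_path."""
    with open(out_fastq_path, "w") as out:
        subprocess.run(
            seqtk_subseq_command(fastq_path, id_list_path),
            stdout=out,
            check=True,  # raise if seqtk exits with an error
        )
```

In use, `write_read_id_list` would be called once per sample/taxon FastA index to produce the corresponding seqtk-read-id-list file, and `extract_fastq_reads` would then be pointed at that sample's original FastQ file.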
Next up, use Minimap2 to map these reads to cbai_genome_v1.01.fasta (18MB). After that’s complete, I want to see which contigs/scaffolds from the 20200917 Flye assembly have reads from each taxon mapped to them.
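One way the "which contigs have reads mapped to them" question could be answered is by tallying Minimap2's default PAF output, where column 1 is the read name, column 6 is the target (contig) name, and column 12 is the mapping quality. The sketch below is only an assumption about how that might look; the function name and the `min_mapq` filter are hypothetical, and the Minimap2 run itself (and any preset suited to the read type) is left out.

```python
from collections import defaultdict


def reads_per_contig(paf_path, min_mapq=0):
    """Count distinct reads mapped to each contig in a PAF file."""
    contig_reads = defaultdict(set)
    with open(paf_path) as paf:
        for line in paf:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 12:
                continue  # not a complete PAF record
            read_id = fields[0]      # column 1: query (read) name
            contig = fields[5]       # column 6: target (contig) name
            mapq = int(fields[11])   # column 12: mapping quality
            if mapq >= min_mapq:
                # A set means a read with several alignments to one contig counts once.
                contig_reads[contig].add(read_id)
    return {contig: len(reads) for contig, reads in contig_reads.items()}
```

Running this once per taxon-specific PAF file would give a contig-to-read-count table for each taxon, which could then be compared across taxa.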
Admittedly, I’m not entirely sure where I’m going with this, or if there’s even a point any more. However, it’s an interesting exercise in bioinformatics stuff (new tools/software, data “munging” practice).