Data Wrangling - Arthropoda and Alveolata Taxonomic RNAseq FastQ Extractions

After using MEGAN6 to extract Arthropoda and Alveolata reads from our RNAseq data on 20200114 (for reference, these include RNAseq data using a newly established “shorthand”: 2018, 2019), I realized that the FastA headers were incomplete and did not distinguish between paired reads. Here’s an example:

R1 FastQ header:

@A00147:37:HG2WLDMXX:1:1101:5303:1000 1:N:0:AGGCGAAG+AGGCGAAG

R2 FastQ header:

@A00147:37:HG2WLDMXX:1:1101:5303:1000 2:N:0:AGGCGAAG+AGGCGAAG

However, the reads extracted via MEGAN have FastA headers like this:


Those are a set of paired reads, but there’s no way to distinguish between R1/R2. This may not be an issue, but I’m not sure how downstream programs (i.e. Trinity) will handle duplicate FastA IDs as inputs. To avoid any headaches, I’ve decided to parse out the corresponding FastQ reads which have the full header info.

Here’s a brief rundown of the approach:

  1. Create list of unique read headers from MEGAN6 FastA files.

  2. Use list with seqtk program to pull out corresponding FastQ reads from the trimmed FastQ R1 and R2 files.

The entire procedure is documented in a Jupyter Notebook below.

Jupyter notebook (GitHub):


Output folders:

We now have to distinct sets of RNAseq reads to create separate transcriptome assemblies from C.bairdi (Arhtropoda) and Hematodinium (Alveolata)! Will get de novo assemblies with Trinity going on Mox.