Data Wrangling - Arthropoda and Alveolata Day and Treatment Taxonomic RNAseq FastQ Extractions

After using MEGAN6 to extract Arthropoda and Alveolata reads from our RNAseq data on 20200114, I had then extracted taxonomic-specific reads and aggregated each into basic Read 1 and Read 2 FastQs to simplify transcriptome assembly for C.bairdi and for Hematodinium. That was fine and all, but wasn’t fully thought through.

For gene expression analysis, I need the FastQs based on infection status and sample days. So, I need to modify the read extraction procedure to parse reads based on those conditions. I could’ve/should’ve done this originally, as I could’ve just assembled the transcriptome from the FastQs I’m going to generate now. Oh well.

For reference, these include RNAseq data using a newly established “shorthand”: 2019)

As a reminder, the reason I’m doing this is that I realized that the FastA headers were incomplete and did not distinguish between paired reads. Here’s an example:

R1 FastQ header:

@A00147:37:HG2WLDMXX:1:1101:5303:1000 1:N:0:AGGCGAAG+AGGCGAAG

R2 FastQ header:

@A00147:37:HG2WLDMXX:1:1101:5303:1000 2:N:0:AGGCGAAG+AGGCGAAG

However, the reads extracted via MEGAN have FastA headers like this:


Those are a set of paired reads, but there’s no way to distinguish between R1/R2. This may not be an issue, but I’m not sure how downstream programs (i.e. Trinity) will handle duplicate FastA IDs as inputs. To avoid any headaches, I’ve decided to parse out the corresponding FastQ reads which have the full header info.

Anyway, here’s a brief rundown of the approach:

  1. Create list of unique read headers from MEGAN6 FastA files.

  2. Use list with seqtk program to pull out corresponding FastQ reads from the trimmed FastQ R1 and R2 files.

The entire procedure is documented in a Jupyter Notebook below.

Jupyter notebook (GitHub):


Output folders:

We now have two distinct sets of RNAseq reads from C.bairdi (Arhtropoda) and Hematodinium (Alveolata), split by infection status and sample day! Will get some gene expression analysis going.

Also of note, and this is a nice bit of confirmation, there are no reads present in the Hematodinium extractions in either of the uninfected samples (D12 or D26). So, the only comparisons to be performed for them will be comparing D12 vs D26.