Data Wrangling - Arthropoda and Alveolata D26 Pool RNAseq FastQ Extractions

After using MEGAN6 to extract Arthropoda and Alveolata reads from our RNAseq data on 20200114, I extracted the taxonomic-specific reads and aggregated each set into basic Read 1 and Read 2 FastQs to simplify transcriptome assembly for C.bairdi and for Hematodinium. That was fine and all, but it wasn’t fully thought through.

For completeness, I realized that I had not run this taxonomic extraction on the 2018 RNAseq data.

For reference, these only include RNAseq data using a newly established “shorthand”: 2018

As a reminder, the reason I’m doing this is that I realized the FastA headers of the MEGAN6-extracted reads were incomplete and did not distinguish between paired reads. Here’s an example, starting with the headers in the original FastQ files:

R1 FastQ header:

@A00147:37:HG2WLDMXX:1:1101:5303:1000 1:N:0:AGGCGAAG+AGGCGAAG

R2 FastQ header:

@A00147:37:HG2WLDMXX:1:1101:5303:1000 2:N:0:AGGCGAAG+AGGCGAAG

However, the reads extracted via MEGAN have FastA headers like this:


Those are a set of paired reads, but there’s no way to distinguish between R1/R2. This may not be an issue, but I’m not sure how downstream programs (e.g. Trinity) will handle duplicate FastA IDs as inputs. To avoid any headaches, I’ve decided to parse out the corresponding FastQ reads, which have the full header info.
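To make the R1/R2 difference concrete, here is a minimal Python sketch that splits the two FastQ headers shown above into a read ID and a read number. The function name `split_fastq_header` is hypothetical, and the sketch assumes only the Illumina-style header layout seen in those two examples.

```python
def split_fastq_header(header):
    """Split an Illumina-style FastQ header into (read_id, read_number).

    Assumes the layout of the example headers:
    '@<read id> <read number>:<filtered>:<control>:<index>'
    """
    read_id, _, comment = header.lstrip("@").partition(" ")
    # The first colon-delimited field of the comment is the read number (1 or 2).
    read_number = comment.split(":")[0] if comment else None
    return read_id, read_number


r1 = split_fastq_header("@A00147:37:HG2WLDMXX:1:1101:5303:1000 1:N:0:AGGCGAAG+AGGCGAAG")
r2 = split_fastq_header("@A00147:37:HG2WLDMXX:1:1101:5303:1000 2:N:0:AGGCGAAG+AGGCGAAG")

# Same read ID, different read number: only the comment field separates R1 from R2.
print(r1[0] == r2[0], r1[1], r2[1])
```

If a header carries only the read ID, the function returns `None` for the read number. That is the situation that makes paired reads indistinguishable.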

Anyway, here’s a brief rundown of the approach:

  1. Create a list of unique read headers from the MEGAN6 FastA files.

  2. Use that list with seqtk to pull out the corresponding FastQ reads from the trimmed FastQ R1 and R2 files.
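The two steps above can be sketched in plain Python. This is an illustration only, not the code from the notebook: the function names are hypothetical, the second function is a pure-Python stand-in for what `seqtk subseq` does with a list of read names, and it assumes that the read ID is the first whitespace-delimited token of each FastA header and that the FastQ files are uncompressed with four lines per record.

```python
def read_ids_from_fasta(fasta_path):
    """Step 1: collect the unique read IDs from a FastA file's header lines."""
    ids = set()
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Assumption: the read ID is the first token after '>'.
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def extract_fastq_reads(fastq_path, wanted_ids, out_path):
    """Step 2: copy FastQ records whose read ID is in wanted_ids.

    Stand-in for `seqtk subseq`; assumes uncompressed four-line records.
    Returns the number of records written.
    """
    written = 0
    with open(fastq_path) as fastq, open(out_path, "w") as out:
        while True:
            record = [fastq.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            fields = record[0][1:].split()
            if fields and fields[0] in wanted_ids:
                # The full original header (including 1:N:0:... or 2:N:0:...) is kept.
                out.writelines(record)
                written += 1
    return written
```

Running `extract_fastq_reads` once against the trimmed R1 file and once against the trimmed R2 file, with the same ID set, would yield matching R1/R2 FastQs whose headers keep the full read information.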

The entire procedure is documented in a Jupyter Notebook below.

Jupyter notebook (GitHub):


Output folders:

We now have two distinct sets of RNAseq reads from C.bairdi (Arthropoda) and Hematodinium (Alveolata).

I’ll use these to supplement/update our existing species-specific transcriptomes, since it takes very little time/effort to generate them and run them through the assembly/annotation pipeline.