In preparation for FastQC and trimming of the E5 coral sRNA-seq data, I noticed that my “default” trimming settings didn’t produce the results I expected. Specifically, since these are sRNAs and the NEBNext® Multiplex Small RNA Library Prep Set for Illumina (PDF) protocol indicates that the sRNAs should be ~21-30bp, it seemed odd that I was still ending up with read lengths of 150bp. So, I ran a couple of quick trimming comparisons on a single pair of sRNA FastQs, to use as examples for getting feedback on how trimming should proceed.
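As a quick sanity check outside of FastQC, read lengths can be tallied directly from a FastQ. The following is only a minimal sketch: the function name `read_length_counts` and the example reads are made up for illustration, and it assumes a standard four-lines-per-record FastQ.

```python
from collections import Counter
from typing import Iterable


def read_length_counts(fastq_lines: Iterable[str]) -> Counter:
    """Count how many reads have each length in a 4-line-per-record FastQ."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        # The sequence is the second line of every 4-line record.
        if i % 4 == 1:
            counts[len(line.rstrip("\n"))] += 1
    return counts


# Tiny in-memory example (made-up reads, not E5 data).
example = [
    "@read1\n", "ACGT" * 5 + "A\n", "+\n", "I" * 21 + "\n",
    "@read2\n", "ACGTACGTAC\n", "+\n", "I" * 10 + "\n",
    "@read3\n", "T" * 21 + "\n", "+\n", "I" * 21 + "\n",
]
counts = read_length_counts(example)
for length, n in sorted(counts.items()):
    print(length, n)
```

For a gzipped FastQ, the same function should work on a file handle opened with `gzip.open(path, "rt")`, where `path` is whatever file is being checked.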
Trimming was done with flexbar. As an aside, I might begin using this trimmer instead of fastp going forward. fastp has some odd “quirks” in its order of operations that sometimes require two rounds of trimming. Also, it’s annoying that fastp limits the number of threads to 16; flexbar has no such limitation. Perhaps this is moot, as I’m not sure whether there’s truly a performance increase. The biggest trade-off, though, is that
fastp automatically generates HTML reports for trimming, which include pre- and post-trimming plots/data. These are very useful and are also interpreted by MultiQC.
This was all done on Raven using a Jupyter Notebook.
Jupyter Notebook (GitHub):
Jupyter Notebook (NB Viewer):
MultiQC Report (HTML)
Adapter Trim Only FastQC Reports (HTML)
Adapter and 50bp length trim FastQC Reports (HTML)
Let’s take a brief look at the data:
Adapter trimming only
FastQC of adapter trim only still shows read lengths of 150bp. Additionally, the bulk of the 3’ end of the reads shows extensive poly-G signals. Admittedly, flexbar doesn’t have a default poly-G trimming option. However, using fastp, which does have a poly-G trimming option, still showed similar results (data not shown - not comparing trimmers, just highlighting persistence of long reads).
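To make the idea of poly-G tail trimming concrete, here is a rough sketch of one naive approach. It is not how fastp or any other trimmer implements it. The function name `trim_poly_g_tail`, the `min_run` threshold of 10, and the example read are all assumptions for illustration. It only handles an uninterrupted run of G bases at the 3’ end.

```python
from typing import Tuple


def trim_poly_g_tail(seq: str, qual: str, min_run: int = 10) -> Tuple[str, str]:
    """Drop a trailing run of G bases when it is at least min_run long."""
    # rstrip removes every G at the 3' end, including a genuine final G
    # that happens to sit directly before the poly-G run.
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    if run_length >= min_run:
        return stripped, qual[:len(stripped)]
    # Short G runs are left alone, since they may be real sequence.
    return seq, qual


# Made-up 150bp read: a 21bp insert followed by 129 G bases.
insert = "ACGTACGTACGTACGTACGTA"
seq = insert + "G" * (150 - len(insert))
qual = "I" * len(seq)

trimmed_seq, trimmed_qual = trim_poly_g_tail(seq, qual)
print(len(seq), len(trimmed_seq))
```

Because this version tolerates no mismatches, a single non-G base inside the tail would stop the trimming early. Real trimmers are presumably more forgiving.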
Adapter and length trimming
FastQC after adapter trimming plus trimming reads to a length of 50bp (from the 3’ end). As expected, length trimming left no reads longer than 50bp, which also removed the poly-G sequence. The plots also show an increase in heterogeneity (i.e. more drastic spikes) after ~30bp. This is probably expected, as the NEBNext® Multiplex Small RNA Library Prep Set for Illumina (PDF) manual indicates that miRNAs should be ~21bp and piRNAs ~31bp. Thus, the sequence beyond that point could be something else.
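For reference, length trimming from the 3’ end can be sketched as below. This assumes that “trimming to 50bp” means cutting bases off the 3’ end of longer reads rather than discarding those reads. The function name `truncate_3prime` and the example reads are hypothetical.

```python
from typing import Tuple


def truncate_3prime(seq: str, qual: str, max_len: int = 50) -> Tuple[str, str]:
    """Keep only the first max_len bases (5' end) and matching qualities."""
    return seq[:max_len], qual[:max_len]


# Made-up reads: one 150bp read and one 21bp read.
long_seq, long_qual = "A" * 150, "I" * 150
short_seq, short_qual = "ACGT" * 5 + "A", "I" * 21

cut_seq, cut_qual = truncate_3prime(long_seq, long_qual)
kept_seq, kept_qual = truncate_3prime(short_seq, short_qual)
print(len(cut_seq), len(kept_seq))
```

Reads already at or below the cutoff pass through unchanged, so genuine ~21-31bp sRNAs would not be affected by a 50bp limit.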
Will share with E5 group to get feedback.