sRNA-seq Paired Read Merging Exploratios - Coral E5 Deep Dive
2024
E5
sRNA-seq
merge
bbmerge
fastp
Author
Sam White
Published
February 2, 2024
INTRO
Our previous sRNA-seq data analysis for this project revealed that the R2 reads didn’t seem to match with anything (specifically, no clade assignments when using miRTrace), which seemed odd. Since the insert sizes should be small for these libraries and the read lengths kind of long (150bp PE), we’d expect the resulting reads to overlap. And, in turn, R2 reads should be reverse complements of R1 reads (and vice versa). After discussing things with Azenta (company which performed library construction and sequencing), they indicated that their in-house pipeline discards the R2 reads because they come from the opposite strand and most miRNAs are found on the R1 strand.
It seemed like a bit of waste to discard all the data that exists in the R2 reads, so I wanted to explore merging the reads to see if we could preserve the R2 data. Additionally, I came across XICRA (GitHub repo) (Sanchez Herrero et al. 2021) which describes using paired-end sequence data to identify miRNA isoforms (isomiRs), suggesting that we should try to retain the R2 reads to improve our miRNA analyses.
I decided to explore merging/trimming using two different sets of software:
BBMerge seemed like a good tool, as it’s specifically designed to handle merging of Illumina PE sequencing reads. Plus, it’s part of the BBTols suite, which has a number of other tools, and packages like this usually have nice cross-compatibility between the different tools.
fastp has both trimming and merging built into the tools itself, which I hadn’t previously realized (but, had never explored, since this isn’t how we usually work with PE sequencing data). I was simply going to use fastp for trimming, as it’s the trimmer thta I’m most familiar with. I honestly thought I’d trim with fastp, and then merge with BBMerge, but the fact that fastp had it built-in could make my life easier.
So, with all of that out of the way, let’s start the analysis. This will only use a single set of PE sequencing reads, as an initial exploration. We’ll decide on the best approach and then use that approach to run all the sequencing data through the chosen “pipeline.”
SETUP
Create a Bash variables file
This allows usage of Bash variables across R Markdown chunks.
{echo"#### Assign Variables ####"echo""echo"## Data URL"echo'export raw_reads_url="https://owl.fish.washington.edu/nightingales/P_evermanni/30-852430235/"'echo"# Data directories"echo'export fastqc_out_dir="./fastqc"'echo'export raw_reads_dir="./raw"'echo'export trimmed_fastqs_dir="./trimmed-reads"'echo""echo"## NEB nebnext-small-rna-library-prep-set-for-illumina adapters"echo'export first_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"'echo'export second_adapter="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT"'echo""echo"# Input/output files"echo'export NEB_adapters_fasta=NEB-adapters.fasta'echo""echo"# Paths to programs"echo'export programs_dir="/home/shared"'echo'export bbmerge="${programs_dir}/bbmap-39.06/bbmerge.sh"'echo'export fastp="${programs_dir}/fastp"'echo'export fastqc="${programs_dir}/FastQC-0.12.1/fastqc"'echo'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'echo""echo"# Set number of CPUs to use"echo'export threads=46'echo""echo"# Initialize arrays"echo'export trimmed_fastqs_array=()'}> .bashvarscat .bashvars
#### Assign Variables ####
## Data URL
export raw_reads_url="https://owl.fish.washington.edu/nightingales/P_evermanni/30-852430235/"
# Data directories
export fastqc_out_dir="./fastqc"
export raw_reads_dir="./raw"
export trimmed_fastqs_dir="./trimmed-reads"
## NEB nebnext-small-rna-library-prep-set-for-illumina adapters
export first_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
export second_adapter="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT"
# Input/output files
export NEB_adapters_fasta=NEB-adapters.fasta
# Paths to programs
export programs_dir="/home/shared"
export bbmerge="${programs_dir}/bbmap-39.06/bbmerge.sh"
export fastp="${programs_dir}/fastp"
export fastqc="${programs_dir}/FastQC-0.12.1/fastqc"
export multiqc=/home/sam/programs/mambaforge/bin/multiqc
# Set number of CPUs to use
export threads=46
# Initialize arrays
export trimmed_fastqs_array=()
total 1.8G
-rw-r--r-- 1 sam sam 798 May 17 2023 checksums.md5
-rw-r--r-- 1 sam sam 899M May 17 2023 sRNA-POR-79-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 916M May 17 2023 sRNA-POR-79-S1-TP2_R2_001.fastq.gz
Verify raw read checksums
# Load bash variables into memorysource .bashvarscd"${raw_reads_dir}"# Checksums file contains other files, so this just looks for the sRNAseq files.grep"sRNA-POR-79" checksums.md5 |md5sum--check
sRNA-POR-79-S1-TP2_R1_001.fastq.gz: OK
sRNA-POR-79-S1-TP2_R2_001.fastq.gz: OK
Create adapters FastA
# Load bash variables into memorysource .bashvarsecho"Creating adapters FastA."echo""adapter_count=0# Check for adapters file first# Then create adapters file if doesn't existif[-f"${NEB_adapters_fasta}"];thenecho"${NEB_adapters_fasta} already exists. Nothing to do."elsefor adapter in"${first_adapter}""${second_adapter}"doadapter_count=$((adapter_count+1))printf">%s\n%s\n""adapter_${adapter_count}""${adapter}"done>>"${NEB_adapters_fasta}"fiecho""echo"Adapters FastA:"echo""cat"${NEB_adapters_fasta}"echo""
fastp has many built-in functions, including before/after trimming plots, and read 1 and read 2 merging.
Note
Setting --overlap_len_require 17 forces fastp to look at reads starting at base 17, instead of the default 30 and results in more accurate analyses, since these are sRNA-seq libraries where the expected/desired read lengths will be 17 - 30bp in length.
Paired-end Trimming
The paired end trimming will examine how some of the trimming options impact sequencing read lengths, quality, and subsequent merging.
WARNING: fastp uses up to 16 threads although you specified 46
Read1 before filtering:
total reads: 19762559
total bases: 2964383850
Q20 bases: 2604771220(87.8689%)
Q30 bases: 2427856243(81.9009%)
Read2 before filtering:
total reads: 19762559
total bases: 2964383850
Q20 bases: 2619314070(88.3595%)
Q30 bases: 2408732242(81.2557%)
Read1 after filtering:
total reads: 16734573
total bases: 489646406
Q20 bases: 480139354(98.0584%)
Q30 bases: 466862298(95.3468%)
Read2 after filtering:
total reads: 16734573
total bases: 492483642
Q20 bases: 484765577(98.4328%)
Q30 bases: 469308001(95.2941%)
Filtering result:
reads passed filter: 33469146
reads failed due to low quality: 232788
reads failed due to too many N: 0
reads failed due to too short: 5823184
reads with adapter trimmed: 30461598
bases trimmed due to adapters: 2013836548
Duplication rate: 10.4184%
Insert size peak (evaluated by paired-end reads): 29
JSON report: fastp-adapters-polyG.json
HTML report: fastp-adapters-polyG.html
/home/shared/fastp --in1 ./raw/sRNA-POR-79-S1-TP2_R1_001.fastq.gz --in2 ./raw/sRNA-POR-79-S1-TP2_R2_001.fastq.gz --out1 ./trimmed-reads/fastp-adapters-polyG.R1.fq.gz --out2 ./trimmed-reads/fastp-adapters-polyG.R2.fq.gz --adapter_fasta NEB-adapters.fasta --trim_poly_g --overlap_len_require 17 --thread 46 --html fastp-adapters-polyG.html --json fastp-adapters-polyG.json --report_title fastp-adapters-polyG-only
fastp v0.23.2, time used: 31 seconds
PolyG Trimming Max Length 31
This can’t be run because we know from the polyG only trimming, that most read lengths are >89bp. As such, it will end up discarding all the reads…
Adapters & PolyG Trimming Max Length 31
Note
The max length is based on the fastp insert peak size from the adapter and polyG trimming results, and previous evaluation of mean read lengths via FastQC and MultiQC.
WARNING: fastp uses up to 16 threads although you specified 46
Read1 before filtering:
total reads: 19762559
total bases: 2964383850
Q20 bases: 2604771220(87.8689%)
Q30 bases: 2427856243(81.9009%)
Read2 before filtering:
total reads: 19762559
total bases: 2964383850
Q20 bases: 2619314070(88.3595%)
Q30 bases: 2408732242(81.2557%)
Read1 after filtering:
total reads: 13190855
total bases: 339210880
Q20 bases: 335529961(98.9149%)
Q30 bases: 327446980(96.532%)
Read2 after filtering:
total reads: 13190855
total bases: 338558013
Q20 bases: 335504115(99.098%)
Q30 bases: 325352134(96.0994%)
Filtering result:
reads passed filter: 26381710
reads failed due to low quality: 232788
reads failed due to too many N: 0
reads failed due to too short: 5354066
reads failed due to too long: 7556554
reads with adapter trimmed: 30461598
bases trimmed due to adapters: 2013836548
Duplication rate: 10.4184%
Insert size peak (evaluated by paired-end reads): 29
JSON report: fastp-adapters-polyG-31bp.json
HTML report: fastp-adapters-polyG-31bp.html
/home/shared/fastp --in1 ./raw/sRNA-POR-79-S1-TP2_R1_001.fastq.gz --in2 ./raw/sRNA-POR-79-S1-TP2_R2_001.fastq.gz --out1 ./trimmed-reads/fastp-adapters-polyG-31bp.R1.fq.gz --out2 ./trimmed-reads/fastp-adapters-polyG-31bp.R2.fq.gz --adapter_fasta NEB-adapters.fasta --trim_poly_g --overlap_len_require 17 --length_limit 31 --thread 46 --html fastp-adapters-polyG-31bp.html --json fastp-adapters-polyG-31bp.json --report_title fastp-adapters-polyG-31bp
fastp v0.23.2, time used: 31 seconds
Adapters & PolyG Trimming Max Length 31 With Merge
WARNING: fastp uses up to 16 threads although you specified 46
Read1 before filtering:
total reads: 19762559
total bases: 2964383850
Q20 bases: 2604771220(87.8689%)
Q30 bases: 2427856243(81.9009%)
Read2 before filtering:
total reads: 19762559
total bases: 2964383850
Q20 bases: 2619314070(88.3595%)
Q30 bases: 2408732242(81.2557%)
Merged and filtered:
total reads: 12259576
total bases: 324356716
Q20 bases: 320917135(98.9396%)
Q30 bases: 313148708(96.5445%)
Filtering result:
reads passed filter: 26506174
reads failed due to low quality: 179320
reads failed due to too many N: 0
reads failed due to too short: 5774946
reads failed due to too long: 7064678
reads with adapter trimmed: 30461598
bases trimmed due to adapters: 2013836548
reads corrected by overlap analysis: 484376
bases corrected by overlap analysis: 766038
Duplication rate: 10.4184%
Insert size peak (evaluated by paired-end reads): 29
Read pairs merged: 12259576
% of original read pairs: 62.0344%
% in reads after filtering: 100%
JSON report: fastp-adapters-polyG-31bp-merged.json
HTML report: fastp-adapters-polyG-31bp-merged.html
/home/shared/fastp --in1 ./raw/sRNA-POR-79-S1-TP2_R1_001.fastq.gz --in2 ./raw/sRNA-POR-79-S1-TP2_R2_001.fastq.gz --adapter_fasta NEB-adapters.fasta --trim_poly_g --overlap_len_require 17 --length_limit 31 --merge --merged_out ./trimmed-reads/fastp-adapters-polyG-31bp-merged.fq.gz --thread 46 --html fastp-adapters-polyG-31bp-merged.html --json fastp-adapters-polyG-31bp-merged.json --report_title fastp-adapters-polyG-31bp-merged
fastp v0.23.2, time used: 44 seconds
Single-end Trimming - R1 Reads Only
Using just the R1 reads, per Azenta’s recommendation.
Detecting adapter sequence for read1...
>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
WARNING: fastp uses up to 16 threads although you specified 46
Read1 before filtering:
total reads: 19762559
total bases: 2964383850
Q20 bases: 2604771220(87.8689%)
Q30 bases: 2427856243(81.9009%)
Read1 after filtering:
total reads: 16984592
total bases: 540469705
Q20 bases: 522006375(96.5838%)
Q30 bases: 504657126(93.3738%)
Filtering result:
reads passed filter: 16984592
reads failed due to low quality: 30923
reads failed due to too many N: 1
reads failed due to too short: 2747043
reads with adapter trimmed: 18545632
bases trimmed due to adapters: 1166285709
Duplication rate (may be overestimated since this is SE data): 29.621%
JSON report: fastp-R1-auto_adapters-polyG.json
HTML report: fastp-R1-auto_adapters-polyG.html
/home/shared/fastp --in1 ./raw/sRNA-POR-79-S1-TP2_R1_001.fastq.gz --out1 ./trimmed-reads/fastp-R1-auto_adapters-polyG.fq.gz --trim_poly_g --thread 46 --html fastp-R1-auto_adapters-polyG.html --json fastp-R1-auto_adapters-polyG.json --report_title fastp-R1-auto_adapters-polyG
fastp v0.23.2, time used: 30 seconds
Auto-adapters & PolyG Trimming Max Length 31
Note
The max length is based on the fastp insert peak size from the adapter and polyG trimming results, and previous evaluation of mean read lengths via FastQC and MultiQC.
Detecting adapter sequence for read1...
>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
WARNING: fastp uses up to 16 threads although you specified 46
Read1 before filtering:
total reads: 19762559
total bases: 2964383850
Q20 bases: 2604771220(87.8689%)
Q30 bases: 2427856243(81.9009%)
Read1 after filtering:
total reads: 12801426
total bases: 328647991
Q20 bases: 325069951(98.9113%)
Q30 bases: 317201339(96.517%)
Filtering result:
reads passed filter: 12801426
reads failed due to low quality: 30923
reads failed due to too many N: 1
reads failed due to too short: 2747043
reads failed due to too long: 4183166
reads with adapter trimmed: 18545632
bases trimmed due to adapters: 1166285709
Duplication rate (may be overestimated since this is SE data): 29.621%
JSON report: fastp-R1-auto_adapters-polyG-31bp.json
HTML report: fastp-R1-auto_adapters-polyG-31bp.html
/home/shared/fastp --in1 ./raw/sRNA-POR-79-S1-TP2_R1_001.fastq.gz --out1 ./trimmed-reads/fastp-R1-31bp-auto_adapters-polyG.fq.gz --trim_poly_g --length_limit 31 --thread 46 --html fastp-R1-auto_adapters-polyG-31bp.html --json fastp-R1-auto_adapters-polyG-31bp.json --report_title fastp-R1-auto_adapters-polyG-31bp
fastp v0.23.2, time used: 30 seconds
BBMerge
Raw reads
BBMerge documentations actually indicates that merging performs best using raw, untrimmed reads, so we’ll try that out first…
Running BBMerge on just raw reads yielded poor results (>99% of reads could not be merged). Additionally, the average insert size was determined to be 131.5bp, which is not going to be useful for sRNA-seq.
I know that there’s a very large stretch of polyG sequence in the reads (likely due to the small insert size and “long” read length - Gs get added as a “placeholder” base each cycle that exceeds the insert size), so let’s try merging reads after polyG sequence has been trimmed.
The results from using polyG-trimmed reads are certainly much better than just merging raw reads (12,649,853 reads could be joined this time). Mean insert size is now determined to be 30bp, which is what we might expect. However, the insert range (17 - 289) is still concerning, as reads >30bp likely aren’t usable. Adapter sequences are still present, which are certainly contributing to some of the inability to merge, as well as the large read lengths. So, let’s try merging reads that have been trimmed of both polyG and adapter sequences.
Well, trimming adapters certainly made a difference, as well! Now, 15,533,284 reads get joined (as opposed to 12,649,853 when using just polyG-trimmed reads). The average insert size changes very little, but, oddly, the insert range remains the same (17 - 289).
For kicks, let’s throw the reads that have been trimmed to 31bp into BBMerge and see what happens.
This resulted in 12,346,728 reads being joined. This is on par with the 12,649,853 reads joined when using just polyG-trimmed reads. Additionally, we see the average insert size is 26.3bp, which is even more inline with what we’d expect for miRNAs (22-24bp), and the insert range is also inline with what we’d expect for these libraries (17 - 35bp). However, this is about 20% fewer reads than the 15,533,284 reads when using polyG- and adapter-trimmed reads.
Maybe I’ll test out the subsequent downstream analyses using these different merging results…
FastQC
# Load bash variables into memorysource .bashvars# Populate array with FastQ filesfastq_array=(./trimmed-reads/*.fq.gz)# Pass array contents to new variablefastqc_list=$(echo"${fastq_array[*]}")# Run FastQC# NOTE: Do NOT quote ${fastqc_list}${fastqc}\--threads ${threads}\--quiet \--outdir ./ \${fastqc_list}
Bushnell, Brian, Jonathan Rood, and Esther Singer. 2017. “BBMerge Accurate Paired Shotgun Read Merging via Overlap.” Edited by Patrick Jon Biggs. PLOS ONE 12 (10): e0185056. https://doi.org/10.1371/journal.pone.0185056.
Chen, Shifu. 2023. “Ultrafast One-Pass FASTQ Data Preprocessing, Quality Control, and Deduplication Using Fastp.”iMeta 2 (2). https://doi.org/10.1002/imt2.107.
Chen, Shifu, Yanqing Zhou, Yaru Chen, and Jia Gu. 2018. “Fastp: An Ultra-Fast All-in-One FASTQ Preprocessor.”Bioinformatics 34 (17): i884–90. https://doi.org/10.1093/bioinformatics/bty560.
Sanchez Herrero, Jose Francisco, Raquel Pluvinet, Antonio Luna de Haro, and Lauro Sumoy. 2021. “Paired-End Small RNA Sequencing Reveals a Possible Overestimation in the isomiR Sequence Repertoire Previously Reported from Conventional Single Read Data Analysis.”BMC Bioinformatics 22 (1). https://doi.org/10.1186/s12859-021-04128-1.