Trimming/FastQC/MultiQC - P.generosa EPI FastQs with FASTP on Mox

Steven noticed that the M-Bias plots generated by Bismark from these files was a little wonky and asked that I try trimming them a bit more. The files were originally quality/adaptor trimmed with TrimGalore! on 20180516.

I decided to use some trimming software called fastp, since it had a lot of options (including trimming without the need for quality filtering/trimming, as well as being multi-threaded).

Looking at the M-Bias plots and the original FastQC assessment, I opted to trim the first 20bp from the 5’ end of all reads.

I followed this up with FastQC and MultiQC.

Use the following Bash script to initiate file transfer to Mox and call the SBATCH script:

#!/bin/bash

## Rsync P.generosa EPI files and then run SBATCH script for hard trimming.

rsync -av --progress owl:/volume1/web/Athaliana/20180516_geoduck_trimgalore_rrbs/*.fq.gz .


sbatch 20190923_pgen_fastp_EPI_trimming.sh

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=pgen_fastp_trimming_EPI
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20190923_pgen_fastp_EPI_trimming


# This script is called by 20190923_pgen_EPI_rsync.sh. That script transfers the FastQ files
# to the working directory from: https://owl.fish.washington.edu/Athaliana/20180516_geoduck_trimgalore_rrbs/


# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# Set number of CPUs to use
threads=28

# Set number of nucleotides to hard trim
num_nucs_trim=20


# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp


## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
R1_names_array=()
R2_names_array=()

# Create array of fastq R1 files
for fastq in *R1*.gz
do
  fastq_array_R1+=("${fastq}")
done

# Create array of fastq R2 files
for fastq in *R2*.gz
do
  fastq_array_R2+=("${fastq}")
done


# Create array of sample names
## Uses awk to parse out sample name from filename
for R1_fastq in *R1*.gz
do
  R1_names_array+=($(echo "${R1_fastq}" | awk -F"." '{print $1}'))
done

# Create array of sample names
## Uses awk to parse out sample name from filename
for R2_fastq in *R2*.gz
do
  R2_names_array+=($(echo "${R2_fastq}" | awk -F"." '{print $1}'))
done

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in *.gz
do
  echo "${fastq}" >> fastq.list.txt
done

# Run fastp on files and trim set number of nucleotides from 5' end of reads
for index in "${!fastq_array_R1[@]}"
do
  R1_sample_name=$(echo "${R1_names_array[index]}")
	R2_sample_name=$(echo "${R2_names_array[index]}")
	${fastp} \
	--in1 "${fastq_array_R1[index]}" \
	--in2 "${fastq_array_R2[index]}" \
	--disable_quality_filtering \
	--disable_length_filtering \
	--disable_adapter_trimming \
	--trim_front1 ${num_nucs_trim} \
	--trim_front2 ${num_nucs_trim} \
	--thread ${threads} \
	--out1 "${R1_sample_name}".20bp-trim.fq.gz \
	--out2 "${R2_sample_name}".20bp-trim.fq.gz
	# Remove original FastQ files
	rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done

RESULTS

This was pretty quick, ~2.75hrs:

fastp runtime screencap

Output folder:

fastp report (HTML):

MultiQC report (HTML):

Firstly, it turns out that fastp is pretty awesome! It automatically generates a summary HTML file with before and after trimming data, similar to that of FastQC. Had I realized this, I might not have bothered with FastQC…

Anyway, with that being said, this is how the before/after look for Read 1s via the fastp HTML:

BEFORE

fastp sequence quality plots before


AFTER

fastp sequence quality plots before