Steven asked me to trim (GitHub Issue) Ronit’s WGBS sequencing data we received on 20201110, according to Bismark guidelines for libraries made with the ZymoResearch Pico MethylSeq Kit.

I trimmed the files using fastp.

The trimming trims adapters and 10bp from both the 5’ and 3’ ends of each read.

I previously ran a trimming where I trimmed only from the 5’ end. Reading the Bismark documentation more carefully, the documentation suggests that a user “should probably perform 3’ trimming”. So, I’m doing that here.

Job was run on Mox.

SBATCH script (GitHub):

20201202_cgig_fastp-10bp-5-3-prime_ronit-ploidy-wgbs.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=20201202_cgig_fastp-10bp-5-3-prime_ronit-ploidy-wgbs
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201202_cgig_fastp-10bp-5-3-prime_ronit-ploidy-wgbs


### Fastp trimming of Ronit's ploidy WGBS.

### Trims adapters, 10bp from 5' and 3' ends of reads

### Trimming is performed according to recommendation for use with Bismark
### for libraries created using ZymoResearch Pico MethylSeq Kit:
### https://github.com/FelixKrueger/Bismark/blob/master/Docs/README.md#ix-notes-about-different-library-types-and-commercial-kits


### Expects input filenames to be in format: zr3534_3_R1.fq.gz


###################################################################################
# These variables need to be set by user

## Assign Variables

# Set number of CPUs to use
threads=27

# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
raw_reads_dir=/gscratch/srlab/sam/data/C_gigas/wgbs
fastq_checksums=raw_fastq_checksums.md5

# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
R1_names_array=()
R2_names_array=()


# Programs associative array
declare -A programs_array
programs_array=(
[fastp]="${fastp}" \
[multiqc]="${multiqc}"
)


###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${raw_reads_dir}"zr3534*.fq.gz .

# Create arrays of fastq R1 files and sample names
for fastq in *R1.fq.gz
do
  fastq_array_R1+=("${fastq}")
    R1_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[_.]"; OFS = "_"} {print $1, $2, $3}')")
done

# Create array of fastq R2 files
for fastq in *R2.fq.gz
do
  fastq_array_R2+=("${fastq}")
    R2_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[_.]"; OFS = "_"} {print $1, $2, $3}')")
done


# Run fastp on files
# Trim 10bp from 5' from each read
# Adds JSON report output for downstream usage by MultiQC
for index in "${!fastq_array_R1[@]}"
do
  R1_sample_name=$(echo "${R1_names_array[index]}")
    R2_sample_name=$(echo "${R2_names_array[index]}")
    ${fastp} \
    --in1 ${fastq_array_R1[index]} \
    --in2 ${fastq_array_R2[index]} \
    --detect_adapter_for_pe \
  --detect_adapter_for_pe \
  --trim_front1 10 \
  --trim_front2 10 \
  --trim_tail1 10 \
  --trim_tail2 10 \
    --thread ${threads} \
    --html "${R1_sample_name}".fastp-trim."${timestamp}".report.html \
    --json "${R1_sample_name}".fastp-trim."${timestamp}".report.json \
    --out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
    --out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz

    # Generate md5 checksums for newly trimmed files
    {
        md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz
        md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
    } >> "${trimmed_checksums}"

  # Create list of fastq files used in analysis
  # Create MD5 checksum for reference
  echo "${fastq_array_R1[index]}" >> input.fastq.list.txt
  echo "${fastq_array_R2[index]}" >> input.fastq.list.txt
  md5sum "${fastq_array_R1[index]}" >> ${fastq_checksums}
  md5sum "${fastq_array_R2[index]}" >> ${fastq_checksums}

    # Remove original FastQ files
    rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done

# Run MultiQC
${multiqc} .


# Capture program options
for program in "${!programs_array[@]}"
do
    {
  echo "Program options for ${program}: "
    echo ""
  # Handle samtools help menus
  if [[ "${program}" == "samtools_index" ]] \
  || [[ "${program}" == "samtools_sort" ]] \
  || [[ "${program}" == "samtools_view" ]]
  then
    ${programs_array[$program]}
  fi
    ${programs_array[$program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
} &>> program_options.log || true

  # If MultiQC is in programs_array, copy the config file to this directory.
  if [[ "${program}" == "multiqc" ]]; then
    cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
  fi
done


# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

RESULTS

Runtime was actually faster than just the 10bp 5’ trimming from the other day; just over 2hrs:

NOTE: The report files (MultiQC and fastp) all suffer from a naming error, but do contain data for both read 1 (R1) and read 2 (R2).

Output folder:

20201202_cgig_fastp-10bp-5-3-prime_ronit-ploidy-wgbs/
MultiQC report (HTML; open with web browser):
- 20201202_cgig_fastp-10bp-5-3-prime_ronit-ploidy-wgbs/multiqc_report.html
[fastp] Reports (HTML; open in web browser):
Trimmed FastQ files (gzipped):
- zr3534_10_R1.fastp-trim.20201202.fq.gz (4.0G)
  - MD5: a16658d4c034361963a6def0ff266189
- zr3534_10_R2.fastp-trim.20201202.fq.gz (4.0G)
  - MD5: 8a994705dada067c8440aba9fe9d23f4
- zr3534_1_R1.fastp-trim.20201202.fq.gz (3.9G)
  - MD5: 9c0a247865d2c4f508f285c7835c09e4
- zr3534_1_R2.fastp-trim.20201202.fq.gz (4.0G)
  - MD5: aa4b80519c65404867c91aa037bb7aa0
- zr3534_2_R1.fastp-trim.20201202.fq.gz (3.9G)
  - MD5: c0bb9cf83cec7d2e52c1e43300da6d4e
- zr3534_2_R2.fastp-trim.20201202.fq.gz (4.0G)
  - MD5: 4208221b6ad68f1f75c031a4b376a68d
- zr3534_3_R1.fastp-trim.20201202.fq.gz (4.0G)
  - MD5: 5362e40e1a021655116d6419bcfd0e8c
- zr3534_3_R2.fastp-trim.20201202.fq.gz (4.1G)
  - MD5: df82bdae0560fbb879dbcb1820072df5
- zr3534_4_R1.fastp-trim.20201202.fq.gz (4.0G)
  - MD5: c26217872bb0c67b7fc2c117aa455f6c
- zr3534_4_R2.fastp-trim.20201202.fq.gz (3.9G)
  - MD5: d59224bf610b1cf55515fe19b0d3acc0
- zr3534_5_R1.fastp-trim.20201202.fq.gz (4.3G)
  - MD5: e585a649d232f24ac12d272cff970eaf
- zr3534_5_R2.fastp-trim.20201202.fq.gz (4.4G)
  - MD5: 876bcdc27fed414a3d130a0061973fac
- zr3534_6_R1.fastp-trim.20201202.fq.gz (3.8G)
  - MD5: 3347544555a559f8e5b263d71635f525
- zr3534_6_R2.fastp-trim.20201202.fq.gz (4.0G)
  - MD5: b57dc21e6d77f1b838398b8bccad6d73
- zr3534_7_R1.fastp-trim.20201202.fq.gz (3.5G)
  - MD5: 79c2fd1f561254546a2472a5576d0d1d
- zr3534_7_R2.fastp-trim.20201202.fq.gz (3.5G)
  - MD5: 8f46aa267de3f0cbd43636220187a034
- zr3534_8_R1.fastp-trim.20201202.fq.gz (5.0G)
  - MD5: df74f3d43e3e9c695f7cb2f5aca4dedb
- zr3534_8_R2.fastp-trim.20201202.fq.gz (4.6G)
  - MD5: 88ed7c05649cf08e0cf74ce5db5bdb2a
- zr3534_9_R1.fastp-trim.20201202.fq.gz (3.8G)
  - MD5: c95c26e4b9cc9cc09265ddff41a9b32f
- zr3534_9_R2.fastp-trim.20201202.fq.gz (3.8G)
  - MD5: f3b82d620ebda578a5e70314bcf2bcdb