Steven asked me to trim (GitHub Issue) Ronit’s WGBS sequencing data we received on 20201110, according to Bismark
guidelines for libraries made with the ZymoResearch Pico MethylSeq Kit.
I trimmed the files using fastp
.
The trimming trims adapters and 10bp from both the 5’ and 3’ ends of each read.
I previously ran a trimming where I trimmed only from the 5’ end. Reading the Bismark
documentation more carefully, the documentation suggests that a user “should probably perform 3’ trimming”. So, I’m doing that here.
Job was run on Mox.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=20201202_cgig_fastp-10bp-5-3-prime_ronit-ploidy-wgbs
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201202_cgig_fastp-10bp-5-3-prime_ronit-ploidy-wgbs
### Fastp trimming of Ronit's ploidy WGBS.
### Trims adapters, 10bp from 5' and 3' ends of reads
### Trimming is performed according to recommendation for use with Bismark
### for libraries created using ZymoResearch Pico MethylSeq Kit:
### https://github.com/FelixKrueger/Bismark/blob/master/Docs/README.md#ix-notes-about-different-library-types-and-commercial-kits
### Expects input filenames to be in format: zr3534_3_R1.fq.gz
###################################################################################
# These variables need to be set by user
## Assign Variables
# Set number of CPUs to use
threads=27
# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
raw_reads_dir=/gscratch/srlab/sam/data/C_gigas/wgbs
fastq_checksums=raw_fastq_checksums.md5
# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc
## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
R1_names_array=()
R2_names_array=()
# Programs associative array
declare -A programs_array
programs_array=(
[fastp]="${fastp}" \
[multiqc]="${multiqc}"
)
###################################################################################
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Capture date
timestamp=$(date +%Y%m%d)
# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${raw_reads_dir}"zr3534*.fq.gz .
# Create arrays of fastq R1 files and sample names
for fastq in *R1.fq.gz
do
fastq_array_R1+=("${fastq}")
R1_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[_.]"; OFS = "_"} {print $1, $2, $3}')")
done
# Create array of fastq R2 files
for fastq in *R2.fq.gz
do
fastq_array_R2+=("${fastq}")
R2_names_array+=("$(echo "${fastq}" | awk 'BEGIN {FS = "[_.]"; OFS = "_"} {print $1, $2, $3}')")
done
# Run fastp on files
# Trim 10bp from 5' from each read
# Adds JSON report output for downstream usage by MultiQC
for index in "${!fastq_array_R1[@]}"
do
R1_sample_name=$(echo "${R1_names_array[index]}")
R2_sample_name=$(echo "${R2_names_array[index]}")
${fastp} \
--in1 ${fastq_array_R1[index]} \
--in2 ${fastq_array_R2[index]} \
--detect_adapter_for_pe \
--detect_adapter_for_pe \
--trim_front1 10 \
--trim_front2 10 \
--trim_tail1 10 \
--trim_tail2 10 \
--thread ${threads} \
--html "${R1_sample_name}".fastp-trim."${timestamp}".report.html \
--json "${R1_sample_name}".fastp-trim."${timestamp}".report.json \
--out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
--out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
# Generate md5 checksums for newly trimmed files
{
md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz
md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
} >> "${trimmed_checksums}"
# Create list of fastq files used in analysis
# Create MD5 checksum for reference
echo "${fastq_array_R1[index]}" >> input.fastq.list.txt
echo "${fastq_array_R2[index]}" >> input.fastq.list.txt
md5sum "${fastq_array_R1[index]}" >> ${fastq_checksums}
md5sum "${fastq_array_R2[index]}" >> ${fastq_checksums}
# Remove original FastQ files
rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done
# Run MultiQC
${multiqc} .
# Capture program options
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
# Handle samtools help menus
if [[ "${program}" == "samtools_index" ]] \
|| [[ "${program}" == "samtools_sort" ]] \
|| [[ "${program}" == "samtools_view" ]]
then
${programs_array[$program]}
fi
${programs_array[$program]} -h
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
# If MultiQC is in programs_array, copy the config file to this directory.
if [[ "${program}" == "multiqc" ]]; then
cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
fi
done
# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
RESULTS
Runtime was actually faster than just the 10bp 5’ trimming from the other day; just over 2hrs:
NOTE: The report files (MultiQC
and fastp
) all suffer from a naming error, but do contain data for both read 1 (R1) and read 2 (R2).
Output folder:
MultiQC
report (HTML; open with web browser):[
fastp
] Reports (HTML; open in web browser):Trimmed FastQ files (gzipped):
zr3534_10_R1.fastp-trim.20201202.fq.gz (4.0G)
- MD5:
a16658d4c034361963a6def0ff266189
- MD5:
zr3534_10_R2.fastp-trim.20201202.fq.gz (4.0G)
- MD5:
8a994705dada067c8440aba9fe9d23f4
- MD5:
zr3534_1_R1.fastp-trim.20201202.fq.gz (3.9G)
- MD5:
9c0a247865d2c4f508f285c7835c09e4
- MD5:
zr3534_1_R2.fastp-trim.20201202.fq.gz (4.0G)
- MD5:
aa4b80519c65404867c91aa037bb7aa0
- MD5:
zr3534_2_R1.fastp-trim.20201202.fq.gz (3.9G)
- MD5:
c0bb9cf83cec7d2e52c1e43300da6d4e
- MD5:
zr3534_2_R2.fastp-trim.20201202.fq.gz (4.0G)
- MD5:
4208221b6ad68f1f75c031a4b376a68d
- MD5:
zr3534_3_R1.fastp-trim.20201202.fq.gz (4.0G)
- MD5:
5362e40e1a021655116d6419bcfd0e8c
- MD5:
zr3534_3_R2.fastp-trim.20201202.fq.gz (4.1G)
- MD5:
df82bdae0560fbb879dbcb1820072df5
- MD5:
zr3534_4_R1.fastp-trim.20201202.fq.gz (4.0G)
- MD5:
c26217872bb0c67b7fc2c117aa455f6c
- MD5:
zr3534_4_R2.fastp-trim.20201202.fq.gz (3.9G)
- MD5:
d59224bf610b1cf55515fe19b0d3acc0
- MD5:
zr3534_5_R1.fastp-trim.20201202.fq.gz (4.3G)
- MD5:
e585a649d232f24ac12d272cff970eaf
- MD5:
zr3534_5_R2.fastp-trim.20201202.fq.gz (4.4G)
- MD5:
876bcdc27fed414a3d130a0061973fac
- MD5:
zr3534_6_R1.fastp-trim.20201202.fq.gz (3.8G)
- MD5:
3347544555a559f8e5b263d71635f525
- MD5:
zr3534_6_R2.fastp-trim.20201202.fq.gz (4.0G)
- MD5:
b57dc21e6d77f1b838398b8bccad6d73
- MD5:
zr3534_7_R1.fastp-trim.20201202.fq.gz (3.5G)
- MD5:
79c2fd1f561254546a2472a5576d0d1d
- MD5:
zr3534_7_R2.fastp-trim.20201202.fq.gz (3.5G)
- MD5:
8f46aa267de3f0cbd43636220187a034
- MD5:
zr3534_8_R1.fastp-trim.20201202.fq.gz (5.0G)
- MD5:
df74f3d43e3e9c695f7cb2f5aca4dedb
- MD5:
zr3534_8_R2.fastp-trim.20201202.fq.gz (4.6G)
- MD5:
88ed7c05649cf08e0cf74ce5db5bdb2a
- MD5:
zr3534_9_R1.fastp-trim.20201202.fq.gz (3.8G)
- MD5:
c95c26e4b9cc9cc09265ddff41a9b32f
- MD5:
zr3534_9_R2.fastp-trim.20201202.fq.gz (3.8G)
- MD5:
f3b82d620ebda578a5e70314bcf2bcdb
- MD5: