INTRO
This notebook is part of the ceasmallr (GitHub repo) project, comparing DNA methylation in Crassostrea virginica (Eastern oyster) larvae and zygotes, via whole-genome bisulfite sequencing (WGBS).
Trimming was previously performed on 20241115 (Notebook entry).
Subsequently, the trimmed reads needed to aligned to the Crassostrea virginica (Eastern oyster) genome, using Bismark
. Alignments were initially performed on Raven, but were done using trimmed reads containing errors from 20220829 (Notebook entry), so needed to be re-trimmed and realigned. Additionally, the alignments took nearly two weeks! So, we explored a different approach to speed things up by utilizing node arrays on the Univ. of Washington’s HPC, Klone (Hyak).
The code originally run is here, and is still useful for those without access to HPC resources:
- 02.00-bismark-bowtie2-alignment.Rmd (GitHub repo)
METHODS
When launching the SLURM job, you can initiate the array to determine the appropriate number of nodes by using Bash calculations.
The file fastq_pairs.txt
has to exist before execution!
sbatch --array=0-$(($$(wc -l < fastq_pairs.txt) - 1)) 02.02-bismark-SLURM-job.sh
This approach required two scripts:
SLURM script
#!/bin/bash #SBATCH --job-name=bismark_job_array #SBATCH --account=coenv #SBATCH --partition=cpu-g2 #SBATCH --output=bismark_job_%A_%a.out #SBATCH --error=bismark_job_%A_%a.err #SBATCH --ntasks=1 #SBATCH --cpus-per-task=20 #SBATCH --mem=100G #SBATCH --time=72:00:00 ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/gitrepos/ceasmallr/output/02.01-bismark-bowtie2-alignment-SLURM-array/ # Execute Roberts Lab bioinformatics container # Binds home directory # Binds /gscratch directory # Directory bindings allow outputs to be written to the hard drive. # Executes Bismark alignment using 02.01-bismark-bowtie2-alignment-SLURM-array.sh script. # To execture this SLURM script as an array, start the script with the following command: # sbatch --array=0-$(($$(wc -l < fastq_pairs.txt) - 1)) 02.02-bismark-SLURM-job.sh # IMPORTANT: Requires fastq_pairs.txt to exist prior to submission! apptainer exec \ "$PWD" \ --home \ --bind /mmfs1/home/ \ --bind /gscratch \ /gscratch/srlab/sr320/srlab-bioinformatics-container-586bf21.sif /gscratch/scrubbed/samwhite/gitrepos/ceasmallr/code/02.01-bismark-bowtie2-alignment-SLURM-array.sh
Job script:
- 02.01-bismark-bowtie2-alignment-SLURM-array.sh (GitHub repo)
#!/bin/bash # This script is designed to be called by a SLURM script which # runs this script across an array of HPC nodes. ################################################################################### # These variables need to be set by user ## Assign Variables # INPUT FILES repo_dir="/gscratch/scrubbed/samwhite/gitrepos/ceasmallr" trimmed_fastqs_dir="${repo_dir}/output/00.00-trimming-fastp" bisulfite_genome_dir="${repo_dir}/data/Cvirginica_v300" # OUTPUT FILES output_dir_top="${repo_dir}/output/02.01-bismark-bowtie2-alignment-SLURM-array" # PARAMETERS bowtie2_min_score="L,0,-0.6" # CPU threads # Bismark already spawns multiple instances and additional threads are multiplicative." bismark_threads=15 ################################################################################### # Print name of container echo "Running in Apptainer container: ${APPTAINER_CONTAINER}" echo "" # Make output directory, if it doesn't exist mkdir --parents "02.01-bismark-bowtie2-alignment-SLURM-array"${output_dir_top}"" ## CREATE LIST OF PAIRED READS ## cd "${trimmed_fastqs_dir}" if [[ -f "${output_dir_top}"/fastq_pairs.txt ]]; then rm "${output_dir_top}"/fastq_pairs.txt fi # Find all _R1_ files and match them with their corresponding _R2_ files for R1_file in *_R1_*.fq.gz; do R2_file="${R1_file/_R1_/_R2_}" if [[ -f "$R2_file" ]]; then echo "$R1_file $R2_file" >> "${output_dir_top}"/fastq_pairs.txt else echo "Warning: No matching R2 file for $R1_file" fi done ## SET ARRAY TASKS ## cd "${output_dir_top}" # Get the FastQ file pair for this task pair=$(sed -n "${SLURM_ARRAY_TASK_ID}p" fastq_pairs.txt) R1=$(echo $pair | awk '{print $1}') R2=$(echo $pair | awk '{print $2}') # Check if R1 and R2 are not empty if [ -z "$R1" ] || [ -z "$R2" ]; then echo "Error: R1 or R2 is empty. Exiting." exit 1 fi # Get just the sample name (excludes the _R[12]_001*) sample_name="${R1%%_*}" # Check if sample_name is not empty if [ -z "$sample_name" ]; then echo "Error: sample_name is empty. Exiting." exit 1 fi ## RUN BISMARK ALIGNMENTS ## bismark \ ${bisulfite_genome_dir} \ --genome "${bowtie2_min_score}" \ --score_min "${bismark_threads}" \ --parallel \ --non_directional \ --gzip "${bismark_threads}" \ -p ${trimmed_fastqs_dir}/${R1} \ -1 ${trimmed_fastqs_dir}/${R2} \ -2 "${output_dir_top}" \ --output_dir > "${output_dir_top}"/${sample_name}-${SLURM_ARRAY_TASK_ID}-bismark_summary.txt 2
RESULTS
Outputs
The expected outputs will be:
*_R1_001.fastp-trim_bismark_bt2_PE_report.txt
: A text file summarizing the alignment input and results. Despite theR1
naming, these reports are based on paired reads; theR1
naming is a quirk of Bismark.*_R1_001.fastp-trim_bismark_bt2_pe.bam
:A BAM alignment. Despite theR1
naming, these BAMs are paired reads; theR1
naming is a quirk of Bismark.bismark_summary.txt
: An overall summary report of the alignment process. Essentially, this is all of the individual*report.txt
files combined into a single file.multiqc_report.html
: A summary report of the alignment results generated by MultiQC, in HTML format.
Due to the large file sizes of BAMS, these cannot be hosted in the ceasmallr GitHub repo. As such these files are available for download here: