INTRO

This notebook is part of the coral E5 timeseries_molecular project (GitHub repo). To have some preliminary data for a presentation, Steven wanted alignments of the A.pulchra UNTRIMMED reads (GitHub Issue). However, while I was getting the code set up and tweaked, I started trimming in the meantime, so I just ended up using the trimmed reads. To speed things up, I performed the alignments with Bismark on the Hyak computing cluster, running a SLURM array of nodes under the coenv account. This is technically frowned upon, but since this was during a holiday week and there seemed to be no other activity, I just went for it!

METHODS

Important

When launching the SLURM job, the array size can be calculated automatically with Bash command substitution: wc -l counts the lines in fastq_pairs.txt (one FastQ pair per line), so one array task is created per pair.

The file fastq_pairs.txt has to exist before execution!

Here’s how the array jobs were launched:

sbatch --array=1-$(wc -l < ../output/01.20-D-Apul-WGBS-trimming-fastp-FastQC-MultiQC/fastq_pairs.txt) 02.20-D-Apul-WGBS-alignment-SLURM-job.sh
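
The requirement that fastq_pairs.txt exist before submission can be checked with a small guard. This is a minimal sketch and not part of the original workflow: the function name array_spec is hypothetical, the FastQ names are made up, and it assumes the pairs file holds one whitespace-separated R1/R2 pair per line.

```shell
#!/bin/bash
# Hypothetical helper (not in the original repo): print a SLURM --array
# range covering every line of a pairs file, or fail if the file is
# missing or empty.
array_spec() {
  local pairs_file="$1"
  # -s is true only if the file exists and is larger than zero bytes
  if [ ! -s "${pairs_file}" ]; then
    echo "Error: ${pairs_file} is missing or empty." >&2
    return 1
  fi
  local n
  # Reading via redirection makes wc print only the count, no filename
  n=$(wc -l < "${pairs_file}")
  # Strip whitespace padding that some wc implementations add
  n="${n//[[:space:]]/}"
  echo "1-${n}"
}

# Demonstration with a temporary file containing made-up FastQ names
demo_pairs=$(mktemp)
printf '%s\n' \
  "sampleA_R1_001.fq.gz sampleA_R2_001.fq.gz" \
  "sampleB_R1_001.fq.gz sampleB_R2_001.fq.gz" \
  "sampleC_R1_001.fq.gz sampleC_R2_001.fq.gz" > "${demo_pairs}"

demo_spec=$(array_spec "${demo_pairs}")
echo "sbatch --array=${demo_spec} 02.20-D-Apul-WGBS-alignment-SLURM-job.sh"
rm -f "${demo_pairs}"
```

Note that wc -l counts newline characters, so the last line of the pairs file needs a trailing newline to be counted as an array task.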

This approach required two scripts:

Job script
- 02.20-D-Apul-WGBS-alignment-SLURM_array-bismark.sh (GitHub repo; commit 0474649)
  - This script is executed by each node so that each node operates on unique FastQ pairs.

#!/bin/bash

# This script is designed to be called by a SLURM script which
# runs this script across an array of HPC nodes.

### IMPORTANT ###

# INPUT FILES
repo_dir="/gscratch/srlab/sam/gitrepos/urol-e5/timeseries_molecular"
trimmed_fastqs_dir="${repo_dir}/D-Apul/output/01.20-D-Apul-WGBS-trimming-fastp-FastQC-MultiQC"
bisulfite_genome_dir="${repo_dir}/D-Apul/data"

# OUTPUT FILES
output_dir_top="${repo_dir}/D-Apul/output/02.20-D-Apul-WGBS-alignment-SLURM_array-bismark"

# PARAMETERS
bowtie2_min_score="L,0,-0.6"

# CPU threads
# Bismark already spawns multiple instances and additional threads are multiplicative.
bismark_threads=5

###################################################################################


## SET ARRAY TASKS ##
cd "${output_dir_top}"

# Get the FastQ file pair for this task
# the `p` sets the line number to process
# which corresponds to the array task ID
pair=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${trimmed_fastqs_dir}/fastq_pairs.txt")

echo "Contents of pair:"
echo "${pair}"
echo ""

# Split the pair on whitespace: field 1 is R1, field 2 is R2
R1_file=$(echo "${pair}" | awk '{print $1}')
R2_file=$(echo "${pair}" | awk '{print $2}')

# Get just the sample name (excludes the _R[12]_001*)
sample_name=$(echo "$R1_file" | awk -F"_" '{print $1}')

# Check if R1_file and R2_file are not empty
if [ -z "$R1_file" ] || [ -z "$R2_file" ]; then
  echo "Error: R1_file or R2_file is empty. Exiting."
  exit 1
fi

# Check if sample_name is not empty
if [ -z "$sample_name" ]; then
  echo "Error: sample_name is empty. Exiting."
  exit 1
fi

echo "Contents of sample_name: ${sample_name}"
echo ""


## RUN BISMARK ALIGNMENTS ##
# STDERR is redirected to a per-sample, per-task summary file.
bismark \
--genome "${bisulfite_genome_dir}" \
--score_min "${bowtie2_min_score}" \
--parallel "${bismark_threads}" \
--non_directional \
--gzip \
-p "${bismark_threads}" \
-1 "${trimmed_fastqs_dir}/${R1_file}" \
-2 "${trimmed_fastqs_dir}/${R2_file}" \
--output_dir "${output_dir_top}" \
2> "${output_dir_top}/${sample_name}-${SLURM_ARRAY_TASK_ID}-bismark_summary.txt"
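
The per-task selection logic in the job script can be tried without SLURM. The following is a minimal sketch under stated assumptions: the FastQ names are made up, SLURM_ARRAY_TASK_ID is set by hand to simulate one array task, and a temporary file stands in for fastq_pairs.txt.

```shell
#!/bin/bash
# Stand-in for fastq_pairs.txt with made-up file names (one pair per line)
pairs_file=$(mktemp)
printf '%s\n' \
  "sampleA_R1_001.fq.gz sampleA_R2_001.fq.gz" \
  "sampleB_R1_001.fq.gz sampleB_R2_001.fq.gz" \
  "sampleC_R1_001.fq.gz sampleC_R2_001.fq.gz" > "${pairs_file}"

# Simulate the task ID that SLURM would set for the second array task
SLURM_ARRAY_TASK_ID=2

# sed -n suppresses default output; "<N>p" prints only line N
pair=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${pairs_file}")

# awk splits on whitespace: field 1 is R1, field 2 is R2
R1_file=$(echo "${pair}" | awk '{print $1}')
R2_file=$(echo "${pair}" | awk '{print $2}')

# Everything before the first underscore is taken as the sample name
sample_name=$(echo "${R1_file}" | awk -F"_" '{print $1}')

echo "Task ${SLURM_ARRAY_TASK_ID}: ${sample_name} -> ${R1_file} ${R2_file}"
rm -f "${pairs_file}"
```

Because the sample name is everything before the first underscore, sample IDs that themselves contain underscores would be truncated by this approach.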

SLURM script
- 02.20-D-Apul-WGBS-alignment-SLURM-job.sh (GitHub repo; commit 9d9b9b4)
  - This is the SLURM script which controls the SLURM job parameters.

#!/bin/bash
#SBATCH --job-name=bismark_job_array
#SBATCH --account=coenv
#SBATCH --partition=cpu-g2
#SBATCH --output=bismark_job_%A_%a.out
#SBATCH --error=bismark_job_%A_%a.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=100G
#SBATCH --time=72:00:00
## Turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/srlab/sam/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/02.20-D-Apul-WGBS-alignment-SLURM_array-bismark/

# Execute Roberts Lab bioinformatics container
# Binds home directory
# Binds /gscratch directory
# Directory bindings allow outputs to be written to the hard drive.

# Executes Bismark alignment using the 02.20-D-Apul-WGBS-alignment-SLURM_array-bismark.sh script.

# To execute this SLURM script as an array, start the script with the following command:

# sbatch --array=1-$(wc -l < ../output/01.20-D-Apul-WGBS-trimming-fastp-FastQC-MultiQC/fastq_pairs.txt) 02.20-D-Apul-WGBS-alignment-SLURM-job.sh

# IMPORTANT: Requires fastq_pairs.txt to exist prior to submission!
apptainer exec \
--home "$PWD" \
--bind /mmfs1/home/ \
--bind /gscratch \
/gscratch/srlab/sr320/srlab-bioinformatics-container-586bf21.sif \
/gscratch/srlab/sam/gitrepos/urol-e5/timeseries_molecular/D-Apul/code/02.20-D-Apul-WGBS-alignment-SLURM_array-bismark.sh
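
The job script passes bismark_threads=5 to both --parallel and -p, while the SLURM script requests 20 CPUs per task. The sketch below is a rough back-of-the-envelope estimate, not a measurement. It assumes that each Bismark instance launches four Bowtie 2 processes for a non-directional library and that each of those uses the -p thread count. Both assumptions should be checked against the Bismark documentation for the version in the container, and the variable names other than bismark_threads are illustrative.

```shell
#!/bin/bash
# Values taken from the two scripts above
bismark_threads=5   # used for both --parallel and -p in the job script
cpus_per_task=20    # from #SBATCH --cpus-per-task

# ASSUMPTION: a non-directional run starts 4 Bowtie 2 processes
# per Bismark instance
bowtie2_per_instance=4

# Estimate: instances x Bowtie 2 processes per instance x threads per process
estimated_threads=$(( bismark_threads * bowtie2_per_instance * bismark_threads ))

echo "Estimated Bowtie 2 threads: ${estimated_threads}"
echo "CPUs requested per task:    ${cpus_per_task}"

if [ "${estimated_threads}" -gt "${cpus_per_task}" ]; then
  echo "Estimate exceeds the CPU request; threads would share cores."
fi
```

If the estimate holds, lowering one of the two Bismark values or raising --cpus-per-task would bring the thread count closer to the CPU request.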