Yaamini asked me to run the epidiverse/snp pipeline (GitHub Issue) on her Haws Crassostrea gigas (Pacific oyster) Hawaii bisulfite sequencing BAMs for SNP identification.

This was run using BAMs found here:

https://gannet.fish.washington.edu/spartina/project-oyster-oa/Haws/bismark-2/r3644*.deduplicated.sorted.bam

Genome FastA was a version of the cgigas_uk_roslin_v1 genome in which Yaamini appended the mitochondrial sequences:

cgigas_uk_roslin_v1_genomic-mito.fa (FastA; 626MB)

As part of this, I decided to mess around with the EpiDiverse/snp base config file to try to speed things up a bit. I modified it to try to use maximum CPUs and memory (28 and 500GB, respectively) for each step, while running on Mox. I’ll run a duplicate using the original base config file to compare runtimes. The modified config file is linked in the RESULTS section below.

UPDATE 20221215: The comparison job using the base config runs much faster!

As mentioned, the job was run on Mox.

SBATCH script (GitHub):

20221214-cgig-nextflow-epidiverse-snp-haws-hawaii.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=20221214-cgig-nextflow-epidiverse-snp-haws-hawaii
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=12-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20221214-cgig-nextflow-epidiverse-snp-haws-hawaii

# Run EpiDiverse/snp on C.gigas Bismark BAMs generated by Yaamini for Haws Hawaii project.
# Requires a FastA file with extension: .fa
# Requires a FastA index file to be in same directory as FastA.

###################################################################################
# These variables need to be set by user

## Directory with BAM(s)
bams_dir="/gscratch/scrubbed/samwhite/data/C_gigas/BSseq"

## Location of EpiDiverse/snp pipeline directory
epi_snp="/gscratch/srlab/programs/epidiverse-pipelines/snp"

## FastA file is required to end with .fa
## Requires FastA index file to be present in same directory as FastA
genome_fasta="/gscratch/srlab/sam/data/C_gigas/genomes/cgigas_uk_roslin_v1_genomic-mito.fa"

## Location of Nextflow
nextflow="/gscratch/srlab/programs/nextflow-21.10.6-all"

## Specify desired/needed version of Nextflow
nextflow_version="20.07.1"


###################################################################################


# Exit script if a command fails
set -e

# Load Anaconda
# Unknown why this is needed, but Anaconda will not run if this line is not included.
. "/gscratch/srlab/programs/anaconda3/etc/profile.d/conda.sh"

# Activate EpiDiverse/snp conda environment
conda activate epidiverse-snp_env

# Copy config file to this directory (for tracking what was used)
cp "${epi_snp}"/config/base-srlab_500GB_node.config .


## Run EpiDiverse/snp
NXF_VER=${nextflow_version} \
${nextflow} run \
${epi_snp} \
-config ./base-srlab_500GB_node.config \
--input ${bams_dir} \
--reference ${genome_fasta} \
--variants \
--clusters

###################################################################################
# Capture program options
# NOTE: programs_array is not declared in this script, so this block is skipped.
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help
    # All other programs: use the standard -h help flag
    else
      ${programs_array[$program]} -h
    fi
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
  } &>> program_options.log || true

    # If MultiQC is in programs_array, copy the config file to this directory.
    if [[ "${program}" == "multiqc" ]]; then
      cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
    fi
  done
  echo "Finished logging programs options."
  echo ""
fi


# Document programs in PATH (primarily for program version ID)
echo "Logging system \$PATH..."
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
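# Print a 10-dash separator, then end the line so the first PATH entry starts on its own line
printf "%0.s-" {1..10}
echo ""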
echo "${PATH}" | tr : \\n
} >> system_path.log
echo "Finished logging system \$PATH."
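
The script comments note two input requirements: the FastA must end in .fa and its index must sit in the same directory. Below is a minimal sketch of a pre-flight check that could run before launching Nextflow. The function name check_genome_fasta is hypothetical, and it assumes the index follows the samtools faidx naming convention (the FastA filename plus .fai). I have not confirmed that this is the exact name the pipeline looks for.

```shell
#!/bin/bash
# Hypothetical helper: verify the genome FastA requirements noted in the script
# (".fa" extension, index in the same directory).
# ASSUMPTION: the index is named "<fasta>.fai", as produced by `samtools faidx`.
check_genome_fasta() {
  local fasta="$1"

  # Requirement 1: file name must end with .fa
  if [[ "${fasta}" != *.fa ]]; then
    echo "ERROR: FastA must end with .fa: ${fasta}" >&2
    return 1
  fi

  # The FastA itself must exist and be non-empty
  if [[ ! -s "${fasta}" ]]; then
    echo "ERROR: FastA missing or empty: ${fasta}" >&2
    return 1
  fi

  # Requirement 2: index must be alongside the FastA
  if [[ ! -s "${fasta}.fai" ]]; then
    echo "ERROR: FastA index not found: ${fasta}.fai" >&2
    echo "Create it with: samtools faidx ${fasta}" >&2
    return 1
  fi

  echo "OK: ${fasta}"
}

# Example usage (in the SBATCH script, before the Nextflow command):
# check_genome_fasta "${genome_fasta}"
```

Because the script uses set -e, a failed check would stop the job before any compute time is spent on the pipeline itself.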

RESULTS

Runtime was ~31hrs, which was longer than I expected, given that I had modified the base config file to use more resources per task:

Screencapture showing runtime of 20221214-cgig-nextflow-epidiverse-snp-haws-hawaii job on Mox
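
To compare this run against the base config run without relying on screenshots, the elapsed times could be pulled from SLURM and converted to seconds. This is only a sketch: the function name slurm_elapsed_to_seconds and the job_id variable are hypothetical, and it assumes elapsed times arrive in the usual D-HH:MM:SS, HH:MM:SS, or MM:SS forms.

```shell
#!/bin/bash
# Hypothetical helper: convert a SLURM elapsed-time string to whole seconds.
# ASSUMPTION: input looks like D-HH:MM:SS, HH:MM:SS, or MM:SS.
slurm_elapsed_to_seconds() {
  local elapsed="$1"
  local days=0 hours=0 minutes=0 seconds=0

  # Split off the optional "days-" prefix
  if [[ "${elapsed}" == *-* ]]; then
    days="${elapsed%%-*}"
    elapsed="${elapsed#*-}"
  fi

  # Split the remaining fields on ":"
  local -a fields
  IFS=':' read -r -a fields <<< "${elapsed}"
  case "${#fields[@]}" in
    3) hours="${fields[0]}"; minutes="${fields[1]}"; seconds="${fields[2]}" ;;
    2) minutes="${fields[0]}"; seconds="${fields[1]}" ;;
    *) echo "Unrecognized elapsed time: $1" >&2; return 1 ;;
  esac

  # 10# forces base 10 so zero-padded values like "08" are not parsed as octal
  echo $(( 10#${days} * 86400 + 10#${hours} * 3600 + 10#${minutes} * 60 + 10#${seconds} ))
}

# Example usage (job_id is a placeholder for a real SLURM job ID):
# elapsed=$(sacct -j "${job_id}" -X --noheader --format=Elapsed | tr -d ' ')
# slurm_elapsed_to_seconds "${elapsed}"
```

Running this for both job IDs gives two integers that can be compared directly.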

Output folder:

20221214-cgig-nextflow-epidiverse-snp-haws-hawaii/snps/
- Modified config file (text):
  - 20221214-cgig-nextflow-epidiverse-snp-haws-hawaii/base-srlab_500GB_node.config
- Variant Call Format (VCF) files and index files:
  - 20221214-cgig-nextflow-epidiverse-snp-haws-hawaii/snps/vcf/
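
For a quick sanity check of the VCF output, something like the following could tally variant records per file. It is a rough sketch: count_vcf_records is a hypothetical helper, and it assumes the files in the vcf/ directory are gzip/bgzip-compressed and named *.vcf.gz, which I have not verified here.

```shell
#!/bin/bash
# Hypothetical helper: print "<file><TAB><n>" for each VCF in a directory,
# where n is the number of non-header (variant) lines.
# ASSUMPTION: VCFs are gzip/bgzip-compressed and named *.vcf.gz.
count_vcf_records() {
  local vcf_dir="$1"
  local vcf n
  for vcf in "${vcf_dir}"/*.vcf.gz; do
    # Skip the unexpanded glob pattern when no files match
    [[ -e "${vcf}" ]] || continue
    # Header lines start with "#"; every other line is one variant record.
    # grep exits non-zero when the count is 0, so "|| true" keeps set -e happy.
    n=$(gzip -dc "${vcf}" | grep -vc '^#' || true)
    printf '%s\t%s\n' "$(basename "${vcf}")" "${n}"
  done
}

# Example usage:
# count_vcf_records 20221214-cgig-nextflow-epidiverse-snp-haws-hawaii/snps/vcf
```

A file reporting 0 records would be worth a closer look before any downstream analysis.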
