INTRO
Steven and Laura tasked me with running BWA alignments on the Pacific cod heatwave data (GitHub Issue) that Laura had previously received. Laura also previously trimmed that data, but I don’t have a link to the code she used to do that.
This is part of the RobertsLab/project-cod-temperature
(GitHub).
The alignment process was run on Raven and took weeks! I opted for this approach, instead of running the process on Hyak using arrays of nodes, because I thought I’d save time by quickly launching the job with simpler code on Raven. Running on Hyak would have required adding software to our Singularity container, re-building it, testing it, and then adapting code for node arrays. As it turns out, I should have spent a day or two doing that: the process would’ve been significantly faster split across multiple compute nodes!
Code below was knitted from 14.00-heatwave-genetics-trimmed-bwa-alignments.Rmd
(GitHub)
1 BACKGROUND
This Rmd file will perform the following:
Create BWA genome index using BWA (Li 2013), if needed.
Align trimmed FastQs (trimming previously performed by Laura Spencer) to the G. macrocephalus genome (NCBI GCF_031168955.1_ASM3116895v1) using BWA (Li 2013).
Use picard (“Picard Toolkit” 2019) to remove duplicates.
Use bamUtil (Jun et al. 2015) to remove overlaps.
Use samtools (Danecek et al. 2021) to compute depth.
2 INPUTS
- Genome FastA
- BWA genome FastA index
3 OUTPUTS
- Deduplicated, sorted, clipped BAM files
- samtools depth file
- BAM index files
- samtools stats file for BWA alignments
- samtools flagstat file for BWA alignments
NOTE: Due to the size of the output files, they are not included in the repository. Instead, you can find them here:
4 VARIABLES
# DIRECTORIES
top_output_dir <- file.path("..", "output")
output_dir <- file.path(top_output_dir, "14.00-heatwave-genetics-trimmed-bwa-alignments")
data_dir <- file.path("..", "data")
trimmed_reads_dir <- file.path(data_dir, "trimmed-fastqs-heatwave-genetics")

# PROGRAMS
programs_dir <- file.path("", "home", "shared")
bamUtil_dir <- file.path(programs_dir, "bamUtil-1.0.15", "bin")
bwa <- file.path(programs_dir, "bwa-0.7.17", "bwa")
conda_env_name <- c("base")
conda_path <- c("/home/sam/programs/mambaforge/condabin/conda")
mean_depths <- c("/home/shared/16TB_HDD_01/sam/gitrepos/WGSfqs-to-genolikelihoods/mean_cov_ind.py")
multiqc <- file.path("", "home", "sam", "programs", "mambaforge", "bin", "multiqc")
picard_dir <- file.path(programs_dir, "picard-3.4.0-6-g80ca98e2a", "build", "libs")
picard <- file.path(picard_dir, "picard.jar")
R1_suffix <- "_trimmed_clipped_R1_paired.fq.gz"
R2_suffix <- "_trimmed_clipped_R2_paired.fq.gz"
samtools_dir <- file.path(programs_dir, "samtools-1.12")
samtools <- file.path(samtools_dir, "samtools")

# FILES
genome_fasta <- file.path(data_dir, "GCF_031168955.1_ASM3116895v1_genomic.fna")
bwa_index <- file.path(data_dir, "bwa-GCF_031168955.1_ASM3116895v1_genomic")

# THREADS
threads <- "40"

# FORMATTING
line <- "-----------------------------------------------"
# Export these as environment variables for bash chunks.
Sys.setenv(
bamUtil_dir = bamUtil_dir,
bwa = bwa,
bwa_index = bwa_index,
data_dir = data_dir,
genome_fasta = genome_fasta,
line = line,
mean_depths = mean_depths,
multiqc = multiqc,
output_dir = output_dir,
picard_dir = picard_dir,
picard = picard,
top_output_dir = top_output_dir,
R1_suffix = R1_suffix,
R2_suffix = R2_suffix,
samtools = samtools,
threads = threads,
trimmed_reads_dir = trimmed_reads_dir
)
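When the Rmd is knitted, the Sys.setenv() call above makes these values visible to the bash chunks below as ordinary environment variables. A minimal illustration of what a bash chunk sees (the value here is set manually to mimic the knitted environment):

```shell
# Illustrative only: mimic one variable exported by Sys.setenv() above
export threads="40"
echo "Aligning with ${threads} threads"
# -> Aligning with 40 threads
```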
5 Load Conda environment
This is needed to have access to Python 3.
If this is successful, the first line of output should show the Python being used, e.g.:
python: /home/sam/programs/mambaforge/envs/mirmachine_env/bin/python
use_condaenv(condaenv = conda_env_name, conda = conda_path)
# Check successful env loading
py_config()
python: /home/sam/programs/mambaforge/bin/python
libpython: /home/sam/programs/mambaforge/lib/libpython3.10.so
pythonhome: /home/sam/programs/mambaforge:/home/sam/programs/mambaforge
version: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
numpy: /home/sam/programs/mambaforge/lib/python3.10/site-packages/numpy
numpy_version: 1.24.3
NOTE: Python version was forced by use_python() function
6 ALIGNMENTS
6.1 Create BWA Index
# Check if BWA index file exists
if [[ ! -f "${bwa_index}.bwt" ]]; then
echo "BWA index file not found. Creating BWA index..."
${bwa} index -p ${bwa_index} ${genome_fasta}
else
echo "BWA index file already exists. Skipping index creation."
fi
6.2 Check FastQs
ls -1 "${trimmed_reads_dir}/"*.fq.gz | wc -l
6.3 Align with BWA
This step will produce sorted BAM files.
Alignment is run with the -M option, which marks shorter split hits as secondary (for Picard compatibility).
samtools view is run with the -F 4 flag, which excludes unmapped reads.
See https://broadinstitute.github.io/picard/explain-flags.html and https://www.htslib.org/doc/samtools-view.html for more explanation.
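These two settings can be sanity-checked with plain bash arithmetic, since SAM FLAG values are just bit fields (the FLAG values below are illustrative):

```shell
# Illustrative only: decoding the SAM FLAG bits relevant to this pipeline.
#   0x4   = read unmapped (excluded by `samtools view -F 4`)
#   0x100 = secondary alignment (how `bwa mem -M` marks shorter split hits)
for flag in 99 4 355; do
  status="mapped"
  (( flag & 0x4 )) && status="unmapped (dropped by -F 4)"
  (( flag & 0x100 )) && status="${status}, secondary"
  echo "FLAG ${flag}: ${status}"
done
# -> FLAG 99: mapped
# -> FLAG 4: unmapped (dropped by -F 4)
# -> FLAG 355: mapped, secondary
```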
# Make output directory, if it doesn't exist
mkdir --parents ${output_dir}
# Declare arrays
R1_array=()
R2_array=()
# Populate arrays
R1_array=("${trimmed_reads_dir}"/*"${R1_suffix}")
R2_array=("${trimmed_reads_dir}"/*"${R2_suffix}")
# Run BWA
time \
for fastq in "${!R1_array[@]}"
do
# Remove path from FastQ
sample_no_path=${R1_array[fastq]##*/}
# Capture just the sample name (string before first `_`)
sample=${sample_no_path%%_*}
echo "${fastq}"
echo "${sample}"
# Run BWA
${bwa} mem \
-M \
-t ${threads} \
${bwa_index} \
${R1_array[fastq]} \
${R2_array[fastq]} \
2> ${output_dir}/${sample}-bwa.stderr \
| samtools view -@ 20 -bS -F 4 \
| samtools sort -@ 20 -o ${output_dir}/${sample}.sorted.bam
done
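The sample-name extraction inside the loop relies on bash parameter expansion; here is how it behaves on a made-up FastQ path:

```shell
# Illustrative only: how the sample name is derived in the loop above.
# The example path is made up.
fastq="/home/example/trimmed/MHW001_trimmed_clipped_R1_paired.fq.gz"

sample_no_path=${fastq##*/}   # strip everything through the last "/" -> filename only
sample=${sample_no_path%%_*}  # strip from the first "_" to the end -> sample name

echo "${sample_no_path}"  # -> MHW001_trimmed_clipped_R1_paired.fq.gz
echo "${sample}"          # -> MHW001
```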
7 MARK DUPLICATES
7.1 Set read groups
Downstream analysis with Picard requires BAM files to have read groups, but these were not set earlier.
for bam in ${output_dir}/*.sorted.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture just the sample name (string before first `.`)
sample=${sample_no_path%%.*}
# Check read group
echo ""
echo "Read group for ${bam} before samtools:"
${samtools} view -H ${bam} | grep '@RG'
echo ""
${samtools} addreplacerg \
-r "@RG\tID:RG1\tSM:${sample}" \
-o ${output_dir}/${sample}-RG.sorted.bam \
--threads ${threads} \
${bam}
echo "Read group for ${output_dir}/${sample}-RG.sorted.bam:"
${samtools} view -H ${output_dir}/${sample}-RG.sorted.bam | grep '@RG'
echo ""
done
7.2 Picard MarkDuplicates
Use Picard to mark duplicates
for bam in ${output_dir}/*RG.sorted.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture just the sample name (string before first `.`)
sample=${sample_no_path%%.*}
java -jar ${picard} \
MarkDuplicates \
I=${bam} \
O=${output_dir}/${sample}.sorted.dedup.bam \
M=${output_dir}/${sample}-picard_dups.log \
VALIDATION_STRINGENCY=SILENT \
REMOVE_DUPLICATES=true \
2> ${output_dir}/${sample}-picard_dups.stderr
done
8 CLIP OVERLAPS
for bam in ${output_dir}/*.sorted.dedup.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture just the sample name (string before first `.`)
sample=${sample_no_path%%.*}
# Run BamUtils to clip overlaps
${bamUtil_dir}/bam clipOverlap \
--in ${bam} \
--out ${output_dir}/${sample}.sorted.dedup.clipped.bam \
--stats \
2> ${output_dir}/${sample}.sorted.dedup.clipped-bamUtil.stats
done
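Clipping overlaps matters for the depth calculations that follow: where R1 and R2 of a pair overlap, the overlapped bases would otherwise be counted twice by samtools depth. A toy calculation with made-up numbers:

```shell
# Toy arithmetic (made-up numbers) showing why overlaps are clipped
# before computing depth.
read_length=150
overlap=50
unclipped=$(( 2 * read_length ))           # 300 bases contributed by the pair
clipped=$(( 2 * read_length - overlap ))   # 250 distinct genomic bases covered
echo "pair contributes ${unclipped} bases unclipped, ${clipped} after clipOverlap"
# -> pair contributes 300 bases unclipped, 250 after clipOverlap
```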
9 DEPTHS
9.1 samtools depth
for bam in ${output_dir}/*.sorted.dedup.clipped.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture filename without .bam (string before last `.`)
sample=${sample_no_path%.*}
# Run samtools to compute per-base depth
${samtools} depth \
-aa \
${bam} \
| cut -f 3 \
| gzip \
> ${output_dir}/${sample}.depth.gz
done
9.2 Mean depths
Uses the mean_cov_ind.py Python script from https://github.com/letimm/WGSfqs-to-genolikelihoods/tree/main
cd ${output_dir}
for depth_file in *-RG.sorted.dedup.clipped.depth.gz
do
${mean_depths} \
--infile ${depth_file} \
--outfile mean-coverage.tab
done
head mean-coverage.tab
MHW001-RG.sorted.dedup.clipped.depth.gz 1.990901406579994
MHW002-RG.sorted.dedup.clipped.depth.gz 2.4325807984518892
MHW003-RG.sorted.dedup.clipped.depth.gz 10.735153890842785
MHW004-RG.sorted.dedup.clipped.depth.gz 8.894250537520717
MHW005-RG.sorted.dedup.clipped.depth.gz 3.3115601503387313
MHW006-RG.sorted.dedup.clipped.depth.gz 2.671577399430869
MHW007-RG.sorted.dedup.clipped.depth.gz 8.092257537557492
MHW008-RG.sorted.dedup.clipped.depth.gz 7.779096441098513
MHW009-RG.sorted.dedup.clipped.depth.gz 11.925061778018813
MHW010-RG.sorted.dedup.clipped.depth.gz 12.765145869286487
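As a sanity check, a per-sample mean can be recomputed directly from one of the gzipped depth files with awk. This assumes mean_cov_ind.py computes a simple mean of the single per-base depth column (the input file below is made up):

```shell
# Create a tiny, made-up depth file in the same single-column format as above
depth_file="example.depth.gz"
printf '3\n5\n0\n4\n' | gzip > "${depth_file}"

# Mean per-base depth: sum of column 1 divided by number of positions
gzip -dc "${depth_file}" \
| awk -v fname="${depth_file}" '{ sum += $1; n++ } END { printf "%s\t%.2f\n", fname, sum / n }'
# -> example.depth.gz	3.00
```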
10 INDEX BAMS
cd ${output_dir}
for bam in *.sorted.dedup.clipped.bam
do
${samtools} index \
${bam} \
--threads ${threads}
done
11 STATS FILES
Generate samtools stats
and samtools flagstat
for BWA alignments.
for bam in ${output_dir}/*-RG.sorted.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture just the sample name (string before first `.`)
sample=${sample_no_path%%.*}
# samtools stats
${samtools} stats \
${bam} \
-@ ${threads} \
> ${output_dir}/"${sample_no_path}.stats"
# samtools flagstat
${samtools} flagstat \
${bam} \
-@ ${threads} \
> ${output_dir}/"${sample_no_path}.flagstat"
done
11.1 MultiQC on Picard MarkDuplicates and samtools stats/flagstat files
cd ${output_dir}
${multiqc} --interactive .
/// MultiQC 🔍 | v1.14
| multiqc | MultiQC Version v1.30 now available!
| multiqc | Search path : /home/shared/16TB_HDD_01/sam/gitrepos/RobertsLab/project-cod-temperature/output/14.00-heatwave-genetics-trimmed-bwa-alignments
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4131/4131
| picard | Found 344 MarkDuplicates reports
| samtools | Found 344 stats reports
| samtools | Found 344 flagstat reports
| multiqc | Compressing plot data
| multiqc | Report : multiqc_report.html
| multiqc | Data : multiqc_data
| multiqc | MultiQC complete