INTRO
Steven and Laura tasked me with running BWA alignments on the Pacific cod heatwave data (GitHub Issue) that Laura had previously received. Laura also previously trimmed that data, but I don’t have a link to the code she used to do that.
This is part of the RobertsLab/project-cod-temperature
(GitHub).
The alignment process was run on Raven and took weeks! I opted for this approach, instead of running the process on Hyak using arrays of nodes, because I thought I’d save time by quickly launching the job with simpler code on Raven. Running on Hyak would have required adding software to our Singularity container, re-building it, testing it, and then adapting code for node arrays. As it turns out, I should have spent a day or two doing that: the process would’ve been significantly faster split across multiple compute nodes!
Code below was knitted from 14.00-heatwave-genetics-trimmed-bwa-alignments.Rmd
(GitHub)
1 BACKGROUND
This Rmd file will perform the following:
Create BWA genome index using BWA (Li 2013), if needed.
Align trimmed FastQs (trimming previously performed by Laura Spencer) to the G. macrocephalus genome (NCBI GCF_031168955.1_ASM3116895v1) using BWA (Li 2013).
Use picard (“Picard Toolkit” 2019) to remove duplicates.
Use bamUtil (Jun et al. 2015) to remove overlaps.
Use samtools (Danecek et al. 2021) to compute depth.
2 INPUTS
- Genome FastA
- BWA genome FastA index
3 OUTPUTS
- Deduplicated, sorted, clipped BAM files
- samtools depth file
- BAM index files
- samtools stats file for BWA alignments
- samtools flagstat file for BWA alignments
NOTE: Due to the size of the output files, they are not included in the repository. Instead, you can find them here:
4 VARIABLES
# DIRECTORIES
top_output_dir <- file.path("..", "output")
output_dir <- file.path(top_output_dir, "14.00-heatwave-genetics-trimmed-bwa-alignments")
data_dir <- file.path("..", "data")
trimmed_reads_dir <- file.path(data_dir, "trimmed-fastqs-heatwave-genetics")

# PROGRAMS
programs_dir <- file.path("", "home", "shared")
bamUtil_dir <- file.path(programs_dir, "bamUtil-1.0.15", "bin")
bwa <- file.path(programs_dir, "bwa-0.7.17", "bwa")
conda_env_name <- c("base")
conda_path <- c("/home/sam/programs/mambaforge/condabin/conda")
mean_depths <- c("/home/shared/16TB_HDD_01/sam/gitrepos/WGSfqs-to-genolikelihoods/mean_cov_ind.py")
multiqc <- file.path("", "home", "sam", "programs", "mambaforge", "bin", "multiqc")
picard_dir <- file.path(programs_dir, "picard-3.4.0-6-g80ca98e2a", "build", "libs")
picard <- file.path(picard_dir, "picard.jar")
R1_suffix <- "_trimmed_clipped_R1_paired.fq.gz"
R2_suffix <- "_trimmed_clipped_R2_paired.fq.gz"
samtools_dir <- file.path(programs_dir, "samtools-1.12")
samtools <- file.path(samtools_dir, "samtools")

# FILES
genome_fasta <- file.path(data_dir, "GCF_031168955.1_ASM3116895v1_genomic.fna")
bwa_index <- file.path(data_dir, "bwa-GCF_031168955.1_ASM3116895v1_genomic")

# THREADS
threads <- "40"

# FORMATTING
line <- "-----------------------------------------------"
# Export these as environment variables for bash chunks.
Sys.setenv(
bamUtil_dir = bamUtil_dir,
bwa = bwa,
bwa_index = bwa_index,
data_dir = data_dir,
genome_fasta = genome_fasta,
line = line,
mean_depths = mean_depths,
multiqc = multiqc,
output_dir = output_dir,
picard_dir = picard_dir,
picard = picard,
top_output_dir = top_output_dir,
R1_suffix = R1_suffix,
R2_suffix = R2_suffix,
samtools = samtools,
threads = threads,
trimmed_reads_dir = trimmed_reads_dir
)
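When the Rmd is knitted, the Sys.setenv() call above makes these values visible to the bash chunks below as ordinary environment variables. A minimal illustration of what a bash chunk sees (the value here is set manually to mimic the knitted environment):

```shell
# Illustrative only: mimic one variable exported by Sys.setenv() above
export threads="40"
echo "Aligning with ${threads} threads"
# -> Aligning with 40 threads
```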
5 Load Conda environment
This is needed to have access to Python 3.
If this is successful, the first line of output should show the Python being used, e.g.:
python: /home/sam/programs/mambaforge/envs/mirmachine_env/bin/python
use_condaenv(condaenv = conda_env_name, conda = conda_path)
# Check successful env loading
py_config()
python: /home/sam/programs/mambaforge/bin/python
libpython: /home/sam/programs/mambaforge/lib/libpython3.10.so
pythonhome: /home/sam/programs/mambaforge:/home/sam/programs/mambaforge
version: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
numpy: /home/sam/programs/mambaforge/lib/python3.10/site-packages/numpy
numpy_version: 1.24.3
NOTE: Python version was forced by use_python() function
6 ALIGNMENTS
6.1 Create BWA Index
# Check if BWA index file exists
if [[ ! -f "${bwa_index}.bwt" ]]; then
echo "BWA index file not found. Creating BWA index..."
${bwa} index -p ${bwa_index} ${genome_fasta}
else
echo "BWA index file already exists. Skipping index creation."
fi
6.2 Check FastQs
ls -1 "${trimmed_reads_dir}/"*.fq.gz | wc -l
6.3 Align with BWA
This step will produce sorted BAM files.
Alignment is run with the -M option, which marks shorter split hits as secondary (for Picard compatibility).
samtools view is run with the -F 4 flag, which excludes unmapped reads.
See https://broadinstitute.github.io/picard/explain-flags.html and https://www.htslib.org/doc/samtools-view.html for more explanation.
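These two settings can be sanity-checked with plain bash arithmetic, since SAM FLAG values are just bit fields (the FLAG values below are illustrative):

```shell
# Illustrative only: decoding the SAM FLAG bits relevant to this pipeline.
#   0x4   = read unmapped (excluded by `samtools view -F 4`)
#   0x100 = secondary alignment (how `bwa mem -M` marks shorter split hits)
for flag in 99 4 355; do
  status="mapped"
  (( flag & 0x4 )) && status="unmapped (dropped by -F 4)"
  (( flag & 0x100 )) && status="${status}, secondary"
  echo "FLAG ${flag}: ${status}"
done
# -> FLAG 99: mapped
# -> FLAG 4: unmapped (dropped by -F 4)
# -> FLAG 355: mapped, secondary
```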
# Make output directory, if it doesn't exist
mkdir --parents ${output_dir}
# Declare arrays
R1_array=()
R2_array=()
# Populate arrays
R1_array=("${trimmed_reads_dir}"/*"${R1_suffix}")
R2_array=("${trimmed_reads_dir}"/*"${R2_suffix}")
# Run BWA
time \
for fastq in "${!R1_array[@]}"
do
# Remove path from FastQ
sample_no_path=${R1_array[fastq]##*/}
# Capture just the sample name (string before first `_`)
sample=${sample_no_path%%_*}
echo "${fastq}"
echo "${sample}"
# Run BWA
${bwa} mem \
-M \
-t ${threads} \
${bwa_index} \
${R1_array[fastq]} \
${R2_array[fastq]} \
2> ${output_dir}/${sample}-bwa.stderr \
| samtools view -@ 20 -bS -F 4 \
| samtools sort -@ 20 -o ${output_dir}/${sample}.sorted.bam
done
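The sample-name extraction inside the loop relies on bash parameter expansion; here is how it behaves on a made-up FastQ path:

```shell
# Illustrative only: how the sample name is derived in the loop above.
# The example path is made up.
fastq="/home/example/trimmed/MHW001_trimmed_clipped_R1_paired.fq.gz"

sample_no_path=${fastq##*/}   # strip everything through the last "/" -> filename only
sample=${sample_no_path%%_*}  # strip from the first "_" to the end -> sample name

echo "${sample_no_path}"  # -> MHW001_trimmed_clipped_R1_paired.fq.gz
echo "${sample}"          # -> MHW001
```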
7 MARK DUPLICATES
7.1 Set read groups
Downstream analysis with Picard requires BAM files to have read groups, but these were not set earlier.
for bam in ${output_dir}/*.sorted.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture just the sample name (string before first `.`)
sample=${sample_no_path%%.*}
# Check read group
echo ""
echo "Read group for ${bam} before samtools:"
${samtools} view -H ${bam} | grep '@RG'
echo ""
${samtools} addreplacerg \
-r "@RG\tID:RG1\tSM:${sample}" \
-o ${output_dir}/${sample}-RG.sorted.bam \
--threads ${threads} \
${bam}
echo "Read group for ${output_dir}/${sample}-RG.sorted.bam:"
${samtools} view -H ${output_dir}/${sample}-RG.sorted.bam | grep '@RG'
echo ""
done
7.2 Picard MarkDuplicates
Use Picard to mark duplicates
for bam in ${output_dir}/*RG.sorted.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture just the sample name (string before first `.`)
sample=${sample_no_path%%.*}
java -jar ${picard} \
MarkDuplicates \
I=${bam} \
O=${output_dir}/${sample}.sorted.dedup.bam \
M=${output_dir}/${sample}-picard_dups.log \
VALIDATION_STRINGENCY=SILENT \
REMOVE_DUPLICATES=true \
2> ${output_dir}/${sample}-picard_dups.stderr
done
8 CLIP OVERLAPS
for bam in ${output_dir}/*.sorted.dedup.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture just the sample name (string before first `.`)
sample=${sample_no_path%%.*}
# Run BamUtils to clip overlaps
${bamUtil_dir}/bam clipOverlap \
--in ${bam} \
--out ${output_dir}/${sample}.sorted.dedup.clipped.bam \
--stats \
2> ${output_dir}/${sample}.sorted.dedup.clipped-bamUtil.stats
done
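Clipping overlaps matters for the depth calculations that follow: where R1 and R2 of a pair overlap, the overlapped bases would otherwise be counted twice by samtools depth. A toy calculation with made-up numbers:

```shell
# Toy arithmetic (made-up numbers) showing why overlaps are clipped
# before computing depth.
read_length=150
overlap=50
unclipped=$(( 2 * read_length ))           # 300 bases contributed by the pair
clipped=$(( 2 * read_length - overlap ))   # 250 distinct genomic bases covered
echo "pair contributes ${unclipped} bases unclipped, ${clipped} after clipOverlap"
# -> pair contributes 300 bases unclipped, 250 after clipOverlap
```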
9 DEPTHS
9.1 samtools depth
for bam in ${output_dir}/*.sorted.dedup.clipped.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture filename without .bam (string before last `.`)
sample=${sample_no_path%.*}
# Run samtools to compute per-base depth
${samtools} depth \
-aa \
${bam} \
| cut -f 3 \
| gzip \
> ${output_dir}/${sample}.depth.gz
done
9.2 Mean depths
Uses the mean_cov_ind.py Python script from https://github.com/letimm/WGSfqs-to-genolikelihoods/tree/main
cd ${output_dir}
for depth_file in *-RG.sorted.dedup.clipped.depth.gz
do
${mean_depths} \
--infile ${depth_file} \
--outfile mean-coverage.tab
done
head mean-coverage.tab
MHW001-RG.sorted.dedup.clipped.depth.gz 1.990901406579994
MHW002-RG.sorted.dedup.clipped.depth.gz 2.4325807984518892
MHW003-RG.sorted.dedup.clipped.depth.gz 10.735153890842785
MHW004-RG.sorted.dedup.clipped.depth.gz 8.894250537520717
MHW005-RG.sorted.dedup.clipped.depth.gz 3.3115601503387313
MHW006-RG.sorted.dedup.clipped.depth.gz 2.671577399430869
MHW007-RG.sorted.dedup.clipped.depth.gz 8.092257537557492
MHW008-RG.sorted.dedup.clipped.depth.gz 7.779096441098513
MHW009-RG.sorted.dedup.clipped.depth.gz 11.925061778018813
MHW010-RG.sorted.dedup.clipped.depth.gz 12.765145869286487
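As a sanity check, a per-sample mean can be recomputed directly from one of the gzipped depth files with awk. This assumes mean_cov_ind.py computes a simple mean of the single per-base depth column (the input file below is made up):

```shell
# Create a tiny, made-up depth file in the same single-column format as above
depth_file="example.depth.gz"
printf '3\n5\n0\n4\n' | gzip > "${depth_file}"

# Mean per-base depth: sum of column 1 divided by number of positions
gzip -dc "${depth_file}" \
| awk -v fname="${depth_file}" '{ sum += $1; n++ } END { printf "%s\t%.2f\n", fname, sum / n }'
# -> example.depth.gz	3.00
```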
10 INDEX BAMS
cd ${output_dir}
for bam in *.sorted.dedup.clipped.bam
do
${samtools} index \
${bam} \
--threads ${threads}
done
11 STATS FILES
Generate samtools stats
and samtools flagstat
for BWA alignments.
for bam in ${output_dir}/*-RG.sorted.bam
do
# Remove path from BAM
sample_no_path=${bam##*/}
# Capture just the sample name (string before first `.`)
sample=${sample_no_path%%.*}
# samtools stats
${samtools} stats \
${bam} \
-@ ${threads} \
> ${output_dir}/"${sample_no_path}.stats"
# samtools flagstat
${samtools} flagstat \
${bam} \
-@ ${threads} \
> ${output_dir}/"${sample_no_path}.flagstat"
done
11.1 MultiQC on Picard MarkDuplicates and samtools stats/flagstat files
cd ${output_dir}
${multiqc} --interactive .
/// MultiQC 🔍 | v1.14
| multiqc | MultiQC Version v1.30 now available!
| multiqc | Search path : /home/shared/16TB_HDD_01/sam/gitrepos/RobertsLab/project-cod-temperature/output/14.00-heatwave-genetics-trimmed-bwa-alignments
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4131/4131
| picard | Found 344 MarkDuplicates reports
| samtools | Found 344 stats reports
| samtools | Found 344 flagstat reports
| multiqc | Compressing plot data
| multiqc | Report : multiqc_report.html
| multiqc | Data : multiqc_data
| multiqc | MultiQC complete