Read Mapping - C.bairdi 201002558-2729-Q7 and 6129-403-26-Q7 Taxa-Specific NanoPore Reads to cbai_genome_v1.01.fasta Using Minimap2 on Mox

After extracting FastQ reads using seqtk on 20201013 from the various taxa I had been interested in, the next thing needed doing was mapping reads to the cbai_genome_v1.01 “genome” assembly from 20200917. I found that Minimap2 will map long reads (e.g. NanoPore), in addition to short reads, so I decided to give that a rip.

Minimap2 was run on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20201014__cbai_minimap_nanopore-megan6-taxa-reads
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=15-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201014_cbai_minimap_nanopore-megan6-taxa-reads


###################################################################################
# These variables need to be set by user

## Assign Variables

# CPU threads to use
threads=27

# Genome FastA path
genome_fasta=/gscratch/srlab/sam/data/C_bairdi/genomes/cbai_genome_v1.01.fasta


# Paths to programs
minimap2="/gscratch/srlab/programs/minimap2-2.17_x64-linux/minimap2"
samtools="/gscratch/srlab/programs/samtools-1.10/samtools"



# Programs array
declare -A programs_array
programs_array=(
[minimap2]="${minimap2}" \
[samtools_sort]="${samtools} sort" \
[samtools_view]="${samtools} view"
)



###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

# Loop through each FastQ
for fastq in *.fq
do

  # Parse out sample name
  sample=$(echo "${fastq}" | awk -F"_" '{print $2}')

  # Caputure taxa
  taxa=$(echo "${fastq}" | awk -F"_" '{print $3}')

  # Capture filename prefix
  prefix="${timestamp}_${sample}_${taxa}"

  # Run Minimap2 with Oxford NanoPore Technologies (ONT) option
  # Using SAM output format (-a option)
  ${programs_array[minimap2]} \
  -ax map-ont \
  ${genome_fasta} \
  ${fastq} \
  | ${programs_array[samtools_sort]} --threads ${threads} \
  -O sam \
  > "${prefix}".sorted.sam


  # Capture FastA checksums for verification ()
  echo "Generating checksum for ${fastq}"
  md5sum "${fastq}" > fastq_checksums.md5
  echo "Finished generating checksum for ${fastq}"
  echo ""


done

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# Capture program options
## Note: Trinity util/support scripts don't have options/help menus
for program in "${!programs_array[@]}"
do
	{
  echo "Program options for ${program}: "
	echo ""
	${programs_array[$program]} --help
	echo ""
	echo ""
	echo "----------------------------------------------"
	echo ""
	echo ""
} &>> program_options.log || true
done

RESULTS

Although there aren’t that many total number of sequences to map, I was still surprised at how quick this was; ~5mins:

Minimap2 runtime on Mox for all taxa read mapping

Output folder and files:

I left the output files as SAM files (instead of compressing them to the standard BAM format) so that they would be human readable. I haven’t actually explored SAM/BAM manipulation too much in the past and felt that this was a good excuse and being able to view the SAM files without the need to use samtools seemed easier. Also, I knew these would be relatively small files, so compressing them wasn’t a huge priority.

Next up, messing around with these SAM files and identifying contigs/scaffolds where these various reads map.