After extracting FastQ reads using seqtk
on 20201013 from the various taxa I had been interested in, the next thing needed doing was mapping reads to the cbai_genome_v1.01
“genome” assembly from 20200917. I found that Minimap2 will map long reads (e.g. NanoPore), in addition to short reads, so I decided to give that a rip.
Minimap2 was run on Mox.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=20201014__cbai_minimap_nanopore-megan6-taxa-reads
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=15-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201014_cbai_minimap_nanopore-megan6-taxa-reads
###################################################################################
# These variables need to be set by user
## Assign Variables
# CPU threads to use
threads=27
# Genome FastA path
genome_fasta=/gscratch/srlab/sam/data/C_bairdi/genomes/cbai_genome_v1.01.fasta
# Paths to programs
minimap2="/gscratch/srlab/programs/minimap2-2.17_x64-linux/minimap2"
samtools="/gscratch/srlab/programs/samtools-1.10/samtools"
# Programs array
declare -A programs_array
programs_array=(
[minimap2]="${minimap2}" \
[samtools_sort]="${samtools} sort" \
[samtools_view]="${samtools} view"
)
###################################################################################
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Capture date
timestamp=$(date +%Y%m%d)
# Loop through each FastQ
for fastq in *.fq
do
# Parse out sample name
sample=$(echo "${fastq}" | awk -F"_" '{print $2}')
# Caputure taxa
taxa=$(echo "${fastq}" | awk -F"_" '{print $3}')
# Capture filename prefix
prefix="${timestamp}_${sample}_${taxa}"
# Run Minimap2 with Oxford NanoPore Technologies (ONT) option
# Using SAM output format (-a option)
${programs_array[minimap2]} \
-ax map-ont \
${genome_fasta} \
${fastq} \
| ${programs_array[samtools_sort]} --threads ${threads} \
-O sam \
> "${prefix}".sorted.sam
# Capture FastA checksums for verification ()
echo "Generating checksum for ${fastq}"
md5sum "${fastq}" > fastq_checksums.md5
echo "Finished generating checksum for ${fastq}"
echo ""
done
# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
# Capture program options
## Note: Trinity util/support scripts don't have options/help menus
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
${programs_array[$program]} --help
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
done
RESULTS
Although there aren’t that many total number of sequences to map, I was still surprised at how quick this was; ~5mins:
Output folder and files:
I left the output files as SAM files (instead of compressing them to the standard BAM format) so that they would be human readable. I haven’t actually explored SAM/BAM manipulation too much in the past and felt that this was a good excuse and being able to view the SAM files without the need to use samtools
seemed easier. Also, I knew these would be relatively small files, so compressing them wasn’t a huge priority.
Next up, messing around with these SAM files and identifying contigs/scaffolds where these various reads map.