In continuing to further improve our geoduck genome annotation, I’m attempting to figure out why Scaffold 1 of our assembly doesn’t have any annotations. As part of that I’ve decided to perform a series of genome comparisons and see how they match up, with an emphasis on Scaffold 1, using MUMmer (v4) (specifically, nucmer for nucleotide comparisons). This software is specifically designed to do this type of comparison.

Basically, MUMmer will take a query genome assembly (Pgenerosa_v074 in this case), align it to a reference genome, and determine contig similarities/differences. So, this should provide further insight on what is happening in Pgenerosa_v074 Scaffold 1, when compared to related (or different) species’ genomes.

INPUT FILES:

Query:

Pgenerosa_v074.fa

Reference genome (Homo sapiens - humans):

GCA_000001405.28_GRCh38.p13_genomic.fna.gz (NCBI)

This was run using MUMmer v4 on Mox using the SBATCH script below.

SBATCH script (GitHub):

20190807_pgen_mummer_pgen-v074_hsap-ncbi.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=mummer_pgen074
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=4-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190807_pgen_mummer_pgen-v074_hsap-ncbi

# Exit if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing

module load icc_19-ompi_3.1.2

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log

### Set variables
# CPUs to use
threads=28

# Filename prefix
prefix="pgen-v074_hsap-ncbi"
pga1_coords="PGA_scaffold1.coords.txt"

# Program paths
nucmer="/gscratch/srlab/programs/mummer-4.0.0beta2/nucmer"
show_coords="/gscratch/srlab/programs/mummer-4.0.0beta2/show-coords"

# C.gigas NCBI FastA
hsap_fasta="/gscratch/srlab/sam/data/H_sapiens/genomes/GCA_000001405.28_GRCh38.p13_genomic.fna"

# P.generosa Pgenerosa_v074 FastA
pgen_v074_fasta="/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v074.fa"

### Run MUMmer (nucmer)
# Compares pgen_v074 (query) to hsap-ncbi (reference)
"${nucmer}" \
--prefix="${prefix}" \
--threads="${threads}" \
"${hsap_fasta}" \
"${pgen_v074_fasta}"

# Parse nucmer delta output into more userfriendly format
# -b useful for syteny - merges overlapping alignments
# -c show percent coverage info
# -q option sorts by query
"${show_coords}" \
-b \
-c \
-q \
"${prefix}".delta \
> "${prefix}".coords.txt

# Parse out PGA_scaffold1__77_contigs__length_89643857
head -n 5 "${prefix}".coords.txt > "${pga1_coords}"
grep "PGA_scaffold1__77_contigs__length_89643857" "${prefix}".coords.txt >> "${pga1_coords}"

RESULTS

This took ~3.5hrs.

pgen-v074 vs hsap-ncbi MUMmer runtime screencap

Output folder:

20190807_pgen_mummer_pgen-v074_hsap-ncbi/

MUMmer delta file (text):

20190807_pgen_mummer_pgen-v074_hsap-ncbi/pgen-v074_hsap-ncbi.delta

Delta file format is explained here: http://mummer.sourceforge.net/manual/#nucmeroutput

MUMmer coordinates file (text):

20190807_pgen_mummer_pgen-v074_hsap-ncbi/pgen-v074_hsap-ncbi.coords.txt

MUMmer PGA_scaffold1 coordinates file (text):

20190807_pgen_mummer_pgen-v074_hsap-ncbi/PGA_scaffold1.coords.txt

The coordinates files have a header like this:

[S1]	[E1]	[S2]	[E2]	[LEN 1]	[LEN 2]	[COV R]	[COV Q]	[TAGS]

The column labels represent the following:

[S1]    Start of the alignment region in the reference sequence.

[E1]    End of the alignment region in the reference sequence.

[S2]    Start of the alignment region in the query sequence.

[E2]    End of the alignment region in the query sequence.

[LEN 1] Length of the alignment region in the reference sequence,
    measured in nucleotides.

[LEN 2] Length of the alignment region in the query sequence, measured in nucleotides.

[LEN R] Length of the reference sequence.

[LEN Q] Length of the query sequence.

[COV R] Percent coverage of the alignment on the reference sequence, calculated as [LEN 1] / [LEN R].

[COV Q] Percent coverage of the alignment on the query sequence, calculated as [LEN 2] / [LEN Q].

I’m currently exploring some options for visualizing this data:

Dot: This is the current forerunner, due to simplicity and ease of understanding. Will probably be the fastest way to get some sort of information about these genome comparisons.
CIRCOS: This is very cool and designed for just this type of analysis. However, seems like the learning curve might be a bit steep (i.e. time consuming). I’ll probably play around with this at some point.
MUMmerplots with ggplot2: Uses R to reproduce the dotplots that can be generated with MUMmer. I’ve messed with this and, although easy to plug my data in to, the results seems to be confusing. May need more time with it, though.