In continuing to further improve our geoduck genome annotation, I’m attempting to figure out why Scaffold 1 of our assembly doesn’t have any annotations. As part of that I’ve decided to perform a series of genome comparisons and see how they match up, with an emphasis on Scaffold 1, using MUMmer 3.23 (specifically, promer
for protein level comparisons). This software is specifically designed to do this type of comparison.
Basically, MUMmer promer
will take a query genome assembly (Pgenerosa_v074 in this case), translate it into all six reading frames, align it to a reference genome, and determine contig similarities/differences. So, this should provide further insight on what is happening in Pgenerosa_v074 Scaffold 1, when compared to related (or different) species’ genomes.
I’ve previously run nucmer
(for nucleotide comparisons) and thought it might be a good idea to compare these sequences at the protein level, since there is usually a greater degree of sequence conservation at the protein level than at the nucleotide level - thus an increased possibility of alignment across these different species.
INPUT FILES:
Query:
Reference genome (Saccostrea glomerata - Sydney rock oyster):
This was run using MUMmer v3.23 on Mox using the SBATCH script below.
NOTE: The previous nucleotide comparison using nucmer
was performed using MUMmer4 nucmer
- however, MUMmer4 promer
throws an error every time I tried to run it (even after attempting to patch the promer.pl
script).
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=promer_pgen074_vs_sglo-ncbi
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20190813_pgen_mummer_promer_pgen-v074_sglo-ncbi
# Exit if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2
# SegFault fix?
export THREADS_DAEMON_MODEL=1
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log
### Set variables
# Filename prefix
prefix="pgen-v074_sglo-ncbi"
pga1_coords="PGA_scaffold1.coords.txt"
# Program paths
promer="/gscratch/srlab/programs/MUMmer3.23/promer"
show_coords="/gscratch/srlab/programs/MUMmer3.23/show-coords"
# M.glomerata NCBI FastA
sglo_fasta="/gscratch/srlab/sam/data/S_glomerata/genomes/GCA_003671525.1_Sgl1.0_genomic.fna"
# P.generosa Pgenerosa_v074 FastA
pgen_v074_fasta="/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v074.fa"
### Run MUMmer (promer)
# Compares pgen_v074 (query) to sglo-ncbi (reference)
"${promer}" \
-p "${prefix}" \
"${sglo_fasta}" \
"${pgen_v074_fasta}"
# Parse promer delta output into more userfriendly format
# -b useful for syteny - merges overlapping alignments
# -c show percent coverage info
# -q option sorts by query
"${show_coords}" \
-b \
-c \
-q \
"${prefix}".delta \
> "${prefix}".coords.txt
# Parse out PGA_scaffold1__77_contigs__length_89643857
head -n 5 "${prefix}".coords.txt > "${pga1_coords}"
grep "PGA_scaffold1__77_contigs__length_89643857" "${prefix}".coords.txt >> "${pga1_coords}"
RESULTS
This took a bit over a 2hrs:
Output folder:
MUMmer delta file (text):
Delta file format is explained here: http://mummer.sourceforge.net/manual/#nucmeroutput
MUMmer coordinates file (text):
MUMmer PGA_scaffold1 coordinates file (text):
The coordinates files have a header like this:
[S1] | [E1] | [S2] | [E2] | [LEN 1] | [LEN 2] | [COV R] | [COV Q] | [TAGS] |
---|
The column labels represent the following:
[S1] Start of the alignment region in the reference sequence.
[E1] End of the alignment region in the reference sequence.
[S2] Start of the alignment region in the query sequence.
[E2] End of the alignment region in the query sequence.
[LEN 1] Length of the alignment region in the reference sequence,
measured in nucleotides.
[LEN 2] Length of the alignment region in the query sequence, measured in nucleotides.
[LEN R] Length of the reference sequence.
[LEN Q] Length of the query sequence.
[COV R] Percent coverage of the alignment on the reference sequence, calculated as [LEN 1] / [LEN R].
[COV Q] Percent coverage of the alignment on the query sequence, calculated as [LEN 2] / [LEN Q].
I’m currently exploring some options for visualizing this data:
Dot: This is the current forerunner, due to simplicity and ease of understanding. Will probably be the fastest way to get some sort of information about these genome comparisons.
CIRCOS: This is very cool and designed for just this type of analysis. However, seems like the learning curve might be a bit steep (i.e. time consuming). I’ll probably play around with this at some point.
MUMmerplots with ggplot2: Uses R to reproduce the dotplots that can be generated with MUMmer. I’ve messed with this and, although easy to plug my data in to, the results seems to be confusing. May need more time with it, though.