Gene Prediction - HiSeqX Metagenomics from Geoduck Water Using MetaGeneMark on Mox

After assembline the metagenomic data yesterday, I needed to predict some genes. I did this using MetaGeneMark (v.3.38) and ran it on Mox.

Input FastA(2.2GB):

SBATCH script (text):


#!/bin/bash
## Job Name
#SBATCH --job-name=busco
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=4-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190103_metagenomics_geo_metagenemark

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing

module load icc_19-ompi_3.1.2

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log

# Variables
gmhmmp=/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/gmhmmp
mgm_mod=/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/MetaGeneMark_v1.mod
assembly_fasta=/gscratch/scrubbed/samwhite/outputs/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa
nuc_out=20190103-mgm-nucleotides.fa
gff_out=20190103-mgm.gff3
prot_out=20190103-mgm-proteins.fa

# Run MetaGeneMark
## Specifying the following:
### -a : output predicted proteins
### -A : write predicted proteins to designated file
### -d : output predicted nucleotides
### -D : write predicted protein to designated file
### -f 3 : Output format in GFF3
### -m : Model file (supplied with software)
### -o : write GFF3 to designated file
${gmhmmp} \
-a \
-A ${prot_out} \
-d \
-D ${nuc_out} \
-f 3 \
-m ${mgm_mod} \
${assembly_fasta} \
-o ${gff_out}

This will output predicted genes, both nucleotides and proteins, as FastA files, and a GFF3 file.


RESULTS

Whoa! This was ridiclously fast! It completed in ~5 minutes!

Output folder:

Nucleotide FastA (1.6GB):

Protein FastA (727MB):

GFF3 File (1.3GB):

A cursory glance at the FastA files (grep -c ">" fasta) indicate a total of 3,296,610 genes predicted.

Now, for some annotations using BLASTn and/or BLASTp…