Metagenome Assembly - P.generosa Water Sample Trimmed HiSeqX Data Using Megahit on Mox

Attempting an assembly of our geoduck metagenomics HiSeqX data using Megahit (v.1.1.4) on a Mox node.

Assembly was run using default settings.

Assembly coverage was assessed using the following BBMap (v.38.34) scripts:

  • bbwrap.sh (alignments)

  • pileup.sh (coverage)

SBATCH script is linked here and pasted in full below:

#!/bin/bash
## Job Name
#SBATCH --job-name=megahit
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190102_metagenomics_geo_megahit

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing

module load icc_19-ompi_3.1.2

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# variables
fastq_dir=/gscratch/srlab/sam/data/metagenomics/P_generosa
megahit=/gscratch/srlab/programs/megahit_v1.1.4_LINUX_CPUONLY_x86_64-bin/megahit
bbmap_dir=/gscratch/srlab/programs/bbmap_38.34

## Initialize arrays
fastq_array_R1=()
fastq_array_R2=()

# Create array of fastq R1 files
for fastq in ${fastq_dir}/*R1*.gz; do
  fastq_array_R1+=("${fastq}")
done

# Create array of fastq R2 files
for fastq in ${fastq_dir}/*R2*.gz; do
  fastq_array_R2+=("${fastq}")
done

# Create comma-separated list of input files
R1_fastq_list=$(IFS=,; echo "${fastq_array_R1[*]}")
R2_fastq_list=$(IFS=,; echo "${fastq_array_R2[*]}")


# Run Megahit using paired-end reads
${megahit} \
-1 ${R1_fastq_list} \
-2 ${R2_fastq_list} \
--num-cpu-threads 28

# Determine coverage
## Align reads with BBmap BBwrap
${bbmap_dir}/bbwrap.sh \
ref=megahit_out/final.contigs.fa \
in1=${R1_fastq_list} \
in2=${R2_fastq_list} \
out=aln.sam.gz

## Output contig coverage
${bbmap_dir}/pileup.sh \
in=aln.sam.gz \
out=coverage.txt
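A note on the comma-separated list construction in the script above: Megahit expects its paired-end inputs as comma-separated lists, and the `$(IFS=,; echo "${array[*]}")` idiom joins a bash array with commas inside a subshell, so the parent shell's default IFS (and its word splitting) is untouched. A standalone demo with made-up filenames:

```shell
# Join a bash array with commas, as done for R1_fastq_list/R2_fastq_list above.
# Filenames here are made up for illustration.
files=(s1_R1.fastq.gz s2_R1.fastq.gz)

# Setting IFS inside the $( ) subshell scopes the change to that one command:
joined=$(IFS=,; echo "${files[*]}")

echo "${joined}"   # s1_R1.fastq.gz,s2_R1.fastq.gz
```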

RESULTS

This took a little less than a day to run:

Output folder:

Assembly (FastA):

Alignment (Gzipped SAM):

Coverage file (txt):

This assembly yielded 2,276,153 contigs (grep -c ">" final.contigs.fa).

Here are some alignment stats from the BBMap pileup.sh command (these can also be found at the end of the slurm output file (text)):

Reads:                               	163236894
Mapped reads:                        	143703254
Mapped bases:                        	16437148607
Ref scaffolds:                       	2276153
Ref bases:                           	2243059881

Percent mapped:                      	88.034
Percent proper pairs:                	76.768
Average coverage:                    	7.328
Standard deviation:                  	27.892
Percent scaffolds with any coverage: 	70.54
Percent of reference bases covered:  	56.07
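If it's useful later, per-contig coverage can be pulled out of coverage.txt with awk. In my understanding of pileup.sh's tab-delimited output, the first column is the contig ID and the second is Avg_fold (mean per-base coverage) — worth verifying against the header line of the actual file. A minimal sketch using a simplified, space-delimited stand-in table:

```shell
# Stand-in for pileup.sh's coverage.txt (simplified; the real file is
# tab-delimited with more columns -- check its header before relying on
# these column positions):
cat > coverage_demo.txt <<'EOF'
#ID Avg_fold Length Covered_percent
k141_1 15.2 2500 99.8
k141_2 2.1 800 54.0
EOF

# Skip the header row, keep contigs with mean coverage >= 10x:
awk 'NR > 1 && $2 >= 10 {print $1, $2}' coverage_demo.txt
```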

Also ran Quast (v.4.5) to assess the assembly:

python \
/home/sam/software/quast-4.5/quast.py \
--threads=20 \
--min-contig=100 \
--labels=meta_megahit_assembly \
/mnt/gannet/Atumefaciens/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa

Quast output folder:

Quast report (HTML):

Quast report (text):

Quast report.txt:

All statistics are based on contigs of size >= 100 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    meta_megahit_assembly
# contigs (>= 0 bp)         2276153              
# contigs (>= 1000 bp)      496414               
# contigs (>= 5000 bp)      41075                
# contigs (>= 10000 bp)     12255                
# contigs (>= 25000 bp)     2797                 
# contigs (>= 50000 bp)     864                  
Total length (>= 0 bp)      2243059881           
Total length (>= 1000 bp)   1342986697           
Total length (>= 5000 bp)   487449312            
Total length (>= 10000 bp)  293688660            
Total length (>= 25000 bp)  154694058            
Total length (>= 50000 bp)  89125138             
# contigs                   2276153              
Largest contig              2035937              
Total length                2243059881           
GC (%)                      39.32                
N50                         1413                 
N75                         624                  
L50                         308641               
L75                         932685               
# N's per 100 kbp           0.00    
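The N50 of 1,413 bp reflects how fragmented a metagenome assembly like this is. As a sanity check on Quast's numbers, N50 (the contig length at which at least 50% of assembled bases sit in contigs that long or longer) can be computed straight from the FastA; a minimal awk sketch on a toy FastA:

```shell
# Toy FastA standing in for final.contigs.fa (contig lengths 10, 6, 4):
cat > contigs_demo.fa <<'EOF'
>c1
AAAAAAAAAA
>c2
AAAAAA
>c3
AAAA
EOF

# 1) Emit one length per contig, 2) sort longest-first, 3) walk the sorted
# lengths until the running total reaches half the assembly size.
n50=$(awk '/^>/ {if (seq) print length(seq); seq = ""; next} {seq = seq $0}
           END {if (seq) print length(seq)}' contigs_demo.fa \
  | sort -rn \
  | awk '{total += $1; lens[NR] = $1}
         END {half = total / 2; run = 0;
              for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) {print lens[i]; exit}
              }}')

echo "${n50}"   # 10
```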

Now, should probably BLASTn the contigs, identify ORFs, then translate and BLASTp those ORFs.
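Those next steps might look something like the sketch below. None of this was run; database names and thread counts are placeholders, and Prodigal (in its metagenome mode) is just one common choice of gene caller — it handles the translation step itself via its -a output:

```shell
# Hypothetical sketch -- tools, databases, and paths are placeholders.

# Predict ORFs; -p meta enables Prodigal's metagenome mode,
# -d writes nucleotide ORFs, -a writes their protein translations:
prodigal -p meta \
-i megahit_out/final.contigs.fa \
-d orfs.fna \
-a orfs.faa

# BLASTn the contigs against nt:
blastn -query megahit_out/final.contigs.fa -db nt \
-outfmt 6 -evalue 1e-10 -num_threads 28 \
-out contigs_blastn.tab

# BLASTp the translated ORFs against nr:
blastp -query orfs.faa -db nr \
-outfmt 6 -evalue 1e-10 -num_threads 28 \
-out orfs_blastp.tab
```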