Metagenomics - BLASTx of Individual Water Sample MEGAHIT Assemblies on Mox

After a meeting on this project yesterda, we decided to try a few things to continue with various approaches to assessing the metagenome. One of the approaches is to run BLASTx on the individual water sample MEGAHIT assemblies from 20190327 and obtain taxonomy info for them, so that’s what I did here.

Samples breakdown like so, for reference:

  • pH=7.1: MG3, MG6, MG7

  • pH=8.2: MG1, MG2, MG5

SBATCH script (GitHub):

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=blastx_metagenomics
    ## Allocation Definition
    #SBATCH --account=coenv
    #SBATCH --partition=coenv
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (days-hours:minutes:seconds format)
    #SBATCH --time=25-00:00:00
    ## Memory per node
    #SBATCH --mem=120G
    ##turn on e-mail notification
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=samwhite@uw.edu
    ## Specify the working directory for this job
    #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190516_metagenomics_pgen_blastx

    # Exit script if any command fails
    set -e

    # Load Python Mox module for Python module availability

    module load intel-python3_2017

    # Document programs in PATH (primarily for program version ID)

    date >> system_path.log
    echo "" >> system_path.log
    echo "System PATH for $SLURM_JOB_ID" >> system_path.log
    echo "" >> system_path.log
    printf "%0.s-" {1..10} >> system_path.log
    echo "${PATH}" | tr : \\n >> system_path.log


    threads=28

    # Paths to programs
    blast_dir="/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin"
    blastx="${blast_dir}/blastx"


    # Paths to blastdbs
    blastdb_dir="/gscratch/srlab/blastdbs/ncbi-sp-v5"
    blast_db="${blastdb_dir}/swissprot_v5"


    # Input files
    fasta_dir="/gscratch/srlab/sam/data/metagenomics/P_generosa/assemblies"


    ## Inititalize arrays
    fasta_array=()
    names_array=()

    # Export BLAST database directory
    export BLASTDB=${blastdb_dir}

    # Create array of FastA files
    # Create array of sample names
    ## Uses parameter substitution to strip leading path from filename
    for fasta in ${fasta_dir}/MG*.fa
    do
      fasta_array+=(${fasta})
      names_array+=($(echo "${fasta#${fasta_dir}/}" | awk -F"." '{print $1}'))
    done

    # Loop through arrays to customize sample names
    # and run BLASTx on each FastA
    for index in "${!names_array[@]}"
    do
      # Loops through sample names and appends appropriate treatment to each sample name
      sample_name=$(echo "${names_array[index]}")
      if [ "${sample_name}" == "MG1" ] \
      || [ "${sample_name}" == "MG2" ] \
      || [ "${sample_name}" == "MG5" ]
      then
        sample_name="${sample_name}"_pH82
      else
        sample_name="${sample_name}"_pH71
      fi

      # Create list of input FastA files
      echo "${fasta}" >> input.fasta.list.txt

      # Run BLASTx on each FastA
      ${blastx} \
      -query "${fasta_array[index]}" \
      -db ${blast_db} \
      -max_hsps 1 \
      -max_target_seqs 1 \
      -outfmt "6 std staxid ssciname" \
      -evalue 1e-10 \
      -num_threads ${threads} \
      > "${sample_name}".blastx.outfmt6
    done

RESULTS

Actually took longer than I thought it would (almost six days - essentially one day per sample):

screencap of job runtime

Output folder:

BLAST output files (Output format 6, plus taxids and scientific names):

I’ll tackle some analysis of this data in a different notebook entry.