Ran BUSCO on our completed annotation of the P.generosa v071 genome (GFF) (subset of sequences >10kbp). See this notebook entry for genome annotation info. This provides a nice metric on how “complete” a genome assembly (or transcriptome) is. Additionally, BUSCO is tied in with Augustus for gene prediction and generates ab initio gene models. With that said, since I just want to evaluate the completeness of this particular genome assembly, I’ll be using the annotated genome generated through two rounds of SNAP gene prediction. Otherwise, I’d use the initial MAKER annotations to generate an Augustus gene model that could be used in conjuction with the SNAP models (I’ll likely do this at a later date).
Firstly, I needed a FastA as input for BUSCO, so I extracted the FastA from the GFF with the following script:
#!/bin/env bash
# Script to extract FastA sequences from GFF3 (specifically, those produced by MAKER)
# User needs to set GFF path and desired output file name
#-----------------------------------
# Set path to GFF
gff=/gscratch/scrubbed/samwhite/outputs/20190213_geoduck_maker_genome_annotation/Pgenerosa_v071_genome_snap02.all.renamed.putative_function.domain_added.gff
# Set path to desired FastA file output
fasta_out=/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v071_genome_snap02.all.renamed.fasta
#-----------------------------------
# Determine total number of lines (records) in GFF
total_records=$(wc -l < ${gff})
# Determine line number of FastA demarcation
fasta_id_line=$(grep -n "##FASTA" ${gff} cut -d":" -f1)
# Add "1" to the fasta_id_line to establish starting line of first FastA record
begin_fastas_line=$(( ${fasta_id_line + 1 }))
# Print all lines from beginning of FastA records to the end of the file
awk 'BEGIN{min=${begin_fastas_line};max=total_records} \
{if (NR>=min) {if (NR<=max) print}}' ${gff} \
> ${fasta_out}
Then, I ran BUSCO using two different species with the same BUSCO database (metazoa_odb9
):
- Human
- fly (BUSCO default)
I simply did this to get an idea of what impact species selection might have on BUSCO analysis.
Here are the two SBATCH scripts submitted to Mox for each of the two species.
Human SBATCH script:
#!/bin/bash
## Job Name
#SBATCH --job-name=busco
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=15-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190228_pgen_busco_metazoa_augustus
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2
# SegFault fix?
export THREADS_DAEMON_MODEL=1
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log
# Establish variables for more readable code
## Input files and settings
base_name=Pgenerosa_v071_genome_snap02.all.maker
busco_db=/gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9
genome_fasta=/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v071_genome_snap02.all.renamed.fasta
genome_index=${genome_fasta}.fai
augustus_species=human
## Save working directory
wd=$(pwd)
## Set program paths
augustus_bin=/gscratch/srlab/programs/Augustus-3.3.2/bin
augustus_scripts=/gscratch/srlab/programs/Augustus-3.3.2/scripts
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin/bedtools
blast_dir=/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin
busco=/gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py
hmm_dir=/gscratch/srlab/programs/hmmer-3.2.1/src
samtools=/gscratch/srlab/programs/samtools-1.9/samtools
## Augustus configs
augustus_dir=${wd}/augustus
augustus_config_dir=${augustus_dir}/config
augustus_orig_config_dir=/gscratch/srlab/programs/Augustus-3.3.2/config
## BUSCO configs
busco_config_default=/gscratch/srlab/programs/busco-v3/config/config.ini.default
busco_config_ini=${wd}/config.ini
# Export BUSCO config file location
export BUSCO_CONFIG_FILE="${busco_config_ini}"
# Export Augustus variable
export PATH="${augustus_bin}:$PATH"
export PATH="${augustus_scripts}:$PATH"
export AUGUSTUS_CONFIG_PATH="${augustus_config_dir}"
# Copy BUSCO config file
cp ${busco_config_default} ${busco_config_ini}
# Make Augustus directory if it doesn't exist
if [ ! -d ${augustus_dir} ]; then
mkdir --parents ${augustus_dir}
fi
# Copy Augustus config directory
cp --preserve -r ${augustus_orig_config_dir} ${augustus_dir}
# Edit BUSCO config file
## Set paths to various programs
### The use of the % symbol sets the delimiter sed uses for arguments.
### Normally, the delimiter that most examples use is a slash "/".
### But, we need to expand the variables into a full path with slashes, which screws up sed.
### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^;cpu/ s/1/28/" "${busco_config_ini}"
sed -i "/^tblastn_path/ s%tblastn_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^makeblastdb_path/ s%makeblastdb_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^augustus_path/ s%augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^etraining_path/ s%etraining_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^gff2gbSmallDNA_path/ s%gff2gbSmallDNA_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^new_species_path/ s%new_species_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^optimize_augustus_path/ s%optimize_augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^hmmsearch_path/ s%hmmsearch_path = /home/osboxes/BUSCOVM/hmmer/hmmer-3.1b2-linux-intel-ia32/binaries/%path = ${hmm_dir}%" "${busco_config_ini}"
# Run BUSCO/Augustus training
${busco} \
--in ${genome_fasta} \
--out ${base_name} \
--lineage_path ${busco_db} \
--mode genome \
--cpu 28 \
--long \
--species ${augustus_species} \
--tarzip \
--augustus_parameters='--progress=true'
Fly SBATCH script:
#!/bin/bash
## Job Name
#SBATCH --job-name=busco
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=15-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190301_pgen_busco_metazoa_fly_augustus
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2
# SegFault fix?
export THREADS_DAEMON_MODEL=1
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log
# Establish variables for more readable code
## Input files and settings
base_name=Pgenerosa_v071_genome_snap02.all.maker
busco_db=/gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9
genome_fasta=/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v071_genome_snap02.all.renamed.fasta
genome_index=${genome_fasta}.fai
augustus_species=fly
## Save working directory
wd=$(pwd)
## Set program paths
augustus_bin=/gscratch/srlab/programs/Augustus-3.3.2/bin
augustus_scripts=/gscratch/srlab/programs/Augustus-3.3.2/scripts
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin/bedtools
blast_dir=/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin
busco=/gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py
hmm_dir=/gscratch/srlab/programs/hmmer-3.2.1/src
samtools=/gscratch/srlab/programs/samtools-1.9/samtools
## Augustus configs
augustus_dir=${wd}/augustus
augustus_config_dir=${augustus_dir}/config
augustus_orig_config_dir=/gscratch/srlab/programs/Augustus-3.3.2/config
## BUSCO configs
busco_config_default=/gscratch/srlab/programs/busco-v3/config/config.ini.default
busco_config_ini=${wd}/config.ini
# Export BUSCO config file location
export BUSCO_CONFIG_FILE="${busco_config_ini}"
# Export Augustus variable
export PATH="${augustus_bin}:$PATH"
export PATH="${augustus_scripts}:$PATH"
export AUGUSTUS_CONFIG_PATH="${augustus_config_dir}"
# Copy BUSCO config file
cp ${busco_config_default} ${busco_config_ini}
# Make Augustus directory if it doesn't exist
if [ ! -d ${augustus_dir} ]; then
mkdir --parents ${augustus_dir}
fi
# Copy Augustus config directory
cp --preserve -r ${augustus_orig_config_dir} ${augustus_dir}
# Edit BUSCO config file
## Set paths to various programs
### The use of the % symbol sets the delimiter sed uses for arguments.
### Normally, the delimiter that most examples use is a slash "/".
### But, we need to expand the variables into a full path with slashes, which screws up sed.
### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^;cpu/ s/1/28/" "${busco_config_ini}"
sed -i "/^tblastn_path/ s%tblastn_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^makeblastdb_path/ s%makeblastdb_path = /usr/bin/%path = ${blast_dir}%" "${busco_config_ini}"
sed -i "/^augustus_path/ s%augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^etraining_path/ s%etraining_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/bin/%path = ${augustus_bin}%" "${busco_config_ini}"
sed -i "/^gff2gbSmallDNA_path/ s%gff2gbSmallDNA_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^new_species_path/ s%new_species_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^optimize_augustus_path/ s%optimize_augustus_path = /home/osboxes/BUSCOVM/augustus/augustus-3.2.2/scripts/%path = ${augustus_scripts}%" "${busco_config_ini}"
sed -i "/^hmmsearch_path/ s%hmmsearch_path = /home/osboxes/BUSCOVM/hmmer/hmmer-3.1b2-linux-intel-ia32/binaries/%path = ${hmm_dir}%" "${busco_config_ini}"
# Run BUSCO/Augustus training
${busco} \
--in ${genome_fasta} \
--out ${base_name} \
--lineage_path ${busco_db} \
--mode genome \
--cpu 28 \
--long \
--species ${augustus_species} \
--tarzip \
--augustus_parameters='--progress=true'
RESULTS
Both runs took ~16hrs each to complete.
Human output folder:
Fly output folder:
Human short summary:
# BUSCO version is: 3.0.2
# The lineage dataset is: metazoa_odb9 (Creation date: 2016-02-13, number of species: 65, number of BUSCOs: 978)
# To reproduce this run: python /gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py -i /gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v071_genome_snap02.all.renamed.fasta -o Pgenerosa_v071_genome_snap02.all.maker -l /gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9/ -m genome -c 28 --long -z -sp human --augustus_parameters '--progress=true'
#
# Summarized benchmarking in BUSCO notation for file /gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v071_genome_snap02.all.renamed.fasta
# BUSCO was run in mode: genome
C:76.3%[S:73.3%,D:3.0%],F:5.2%,M:18.5%,n:978
746 Complete BUSCOs (C)
717 Complete and single-copy BUSCOs (S)
29 Complete and duplicated BUSCOs (D)
51 Fragmented BUSCOs (F)
181 Missing BUSCOs (M)
978 Total BUSCO groups searched
Fly short summary:
# BUSCO version is: 3.0.2
# The lineage dataset is: metazoa_odb9 (Creation date: 2016-02-13, number of species: 65, number of BUSCOs: 978)
# To reproduce this run: python /gscratch/srlab/programs/busco-v3/scripts/run_BUSCO.py -i /gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v071_genome_snap02.all.renamed.fasta -o Pgenerosa_v071_genome_snap02.all.maker -l /gscratch/srlab/sam/data/databases/BUSCO/metazoa_odb9/ -m genome -c 28 --long -z -sp fly --augustus_parameters '--progress=true'
#
# Summarized benchmarking in BUSCO notation for file /gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v071_genome_snap02.all.renamed.fasta
# BUSCO was run in mode: genome
C:76.6%[S:73.3%,D:3.3%],F:5.2%,M:18.2%,n:978
749 Complete BUSCOs (C)
717 Complete and single-copy BUSCOs (S)
32 Complete and duplicated BUSCOs (D)
51 Fragmented BUSCOs (F)
178 Missing BUSCOs (M)
978 Total BUSCO groups searched
Overall, the species selection seems to have minimal impact on the analysis. Both sets of analyses yield nearly identical results, however, fly yields three additional complete BUSCOs, but they’re duplicated.
This version of the genome suggests it is ~76% complete. I’m curious how the original genome assembly (v070) will compare.