I previously converting our C.bairdi NanoPre sequencing data from the raw Fast5 format to FastQ format for our three sets of data:

Before proceeding with assembly and/or trying to tease out taxonomic differences (the C.bairdi-6129_403_26 is from an individual infected with Hematodinium), I want to get an idea of how the data looks. So, I’ve decided to process the sequencing_summary.txt file from each Fast5 conversion with NanoPlot. This software spits out some tables and some very nice visualizations to help get a better idea of how the sequencing runs look. This was run on Mox.

SBATCH script (GitHub):

20200914_cbai_nanoplot_nanopore-data.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_nanoplot_nanopore-data
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200914_cbai_nanoplot_nanopore-data




###################################################################################
# These variables need to be set by user

# Load Anaconda
# Uknown why this is needed, but Anaconda will not run if this line is not included.
. "/gscratch/srlab/programs/anaconda3/etc/profile.d/conda.sh"


# Activate the NanoPlot Anaconda environment
conda activate nanoplot_env

# Set number of CPUs to use
threads=28

# Paths to reads
raw_reads_dir_array=(
"/gscratch/srlab/sam/data/C_bairdi/DNAseq/ont_FAL58500_04bb4d86_20102558-2729" \
"/gscratch/srlab/sam/data/C_bairdi/DNAseq/ont_FAL58500_94244ffd_20102558-2729" \
"/gscratch/srlab/sam/data/C_bairdi/DNAseq/ont_FAL86873_d8db260e_cbai_6129_403_26"
)

# Paths to programs
nanoplot=NanoPlot


###################################################################################


# Exit script if any command fails
set -e


# Capture this directory
wd=$(pwd)

# Inititalize array
programs_array=()


# Programs array
programs_array=("${nanoplot}")


# Loop through NanoPore data directories
# to run NanoPlot, FastQC, and MultiQC
for directory in "${!raw_reads_dir_array[@]}"
do


  # Capture NanoPore directory name
  dir_name=${raw_reads_dir_array[directory]##*/}

  # Make new directory and change to that directory
  mkdir "${dir_name}" && cd "$_"

  current_dir=$(pwd)


  # Run NanoPlot
  ## Sets readtype to 1D (default)
  ## Shows N50 on histograms
  ## Analysis perfomred using the sequencing summary file generated by
  ## guppy when converting from Fast5 to FastQ
  ${programs_array[nanoplot]} \
  --threads ${threads} \
  --outdir ${current_dir} \
  --readtype 1D \
  --N50 \
  --summary "${raw_reads_dir_array[directory]}"/sequencing_summary.txt

  # Change back to working directory
  cd "${wd}"

done


# Capture program options
for program in "${!programs_array[@]}"
do
    {
  echo "Program options for ${programs_array[program]}: "
    echo ""
    ${programs_array[program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
} &>> program_options.log || true
done

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

RESULTS

Very fast, 50s:

NanoPlot cumulative runtime for all three C.bairdi NanoPore data sets

Output folder:

20200914_cbai_nanoplot_nanopore-data/

Each of the NanoPore NanoPlot analyses are in individual folders, linked below. Although I’ve only linked to the full report (in HTML format) for each, the plots and data present in the full report are also available in individual images/files within those folders.

C.bairdi-20102558-2729_Run-01 (FAL58500_94244ffd)

Output folder:

20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_94244ffd_20102558-2729/

Full report (HTML):

20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_94244ffd_20102558-2729/NanoPlot-report.html

General summary
Active channels	491.0
Mean read length	821.6
Mean read quality	10.1
Median read length	533.0
Median read quality	10.6
Number of reads	78,720.0
Read length N50	1,254.0
Total bases	64,672,912.0

Number, percentage and megabases of reads above quality cutoffs:

Quality_cutoff	reads(count)	percentage	bases(count)
>Q5	73315	(93.1%)	60.5Mb
>Q7	67186	(85.3%)	56.5Mb
>Q10	46724	(59.4%)	41.4Mb
>Q12	20961	(26.6%)	20.6Mb
>Q15	966	(1.2%)	0.8Mb

Top 5 highest mean basecall quality scores and their read lengths:

Rank	mean_quality	read_length(bp)	bases(count)
1	22	(682)	60.5Mb
2	19.8	(471)	56.5Mb
3	19.1	(414)	41.4Mb
4	19.1	(1539)	20.6Mb
5	19	(234)	0.8Mb

Top 5 longest reads and their mean basecall quality score:

Rank	read_length(bp)	mean_quality_score
1	62629	(3.3)
2	51173	(3.5)
3	45909	(3.0)
4	44689	(3.4)
5	42633	(3.0)

FAL58500_94244ffd NanoPlot read density plot

C.bairdi-20102558-2729_Run-02 (FAL58500_04bb4d86)

Output folder:

20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_04bb4d86_20102558-2729/

Full report (HTML):

20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_04bb4d86_20102558-2729/NanoPlot-report.html

GENERAL SUMMARY
Active channels	424.0
Mean read length	845.5
Mean read quality	8.7
Median read length	521.0
Median read quality	9.3
Number of reads	21,519.0
Read length N50	1,406.0
Total bases	18,195,027.0

Number, percentage and megabases of reads above quality cutoffs:

Quality_cutoff	reads(count)	percentage	bases(count)
>Q5	17207	(80.0%)	15.4Mb
>Q7	14864	(69.1%)	13.9Mb
>Q10	8797	(40.9%)	8.9Mb
>Q12	3257	(15.1%)	3.6Mb
>Q15	121	(0.6%)	0.1Mb

Top 5 highest mean basecall quality scores and their read lengths:

Rank	mean_quality	read_length(bp)
1	18.3	(323)
2	17.8	(1064)
3	17.4	(201)
4	17.3	(761)
5	16.9	(642)

Top 5 longest reads and their mean basecall quality score:

Rank	read_length(bp)	mean_quality_score
1	31170	(3.4)
2	28158	(3.2)
3	27016	(3.3)
4	24304	(3.6)
5	15552	(2.9)

FAL58500_04bb4d86 NanoPlot read density plot

C.bairdi-6129_403_26 (FAL86873_d8db260e)

Output folder:

20200914_cbai_nanoplot_nanopore-data/ont_FAL86873_d8db260e_cbai_6129_403_26/

Full report (HTML):

20200914_cbai_nanoplot_nanopore-data/ont_FAL86873_d8db260e_cbai_6129_403_26/NanoPlot-report.html

GENERAL SUMMARY
Active channels	503
Mean read length	2254.5
Mean read quality	10.9
Median read length	965
Median read quality	11.2
Number of reads	506495
Read length N50	5232
Total bases	1141890358

Number, percentage and megabases of reads above quality cutoffs:

Quality_cutoff	reads(count)	percentage	bases(count)
>Q5	493598	(97.5%)	1124.9Mb
>Q7	465033	(91.8%)	1075.5Mb
>Q10	343296	(67.8%)	880.9Mb
>Q12	186709	(36.9%)	568.7Mb
>Q15	7394	(1.5%)	17.9Mb

Top 5 highest mean basecall quality scores and their read lengths:

Rank	mean_quality	read_length(bp)
1	20.6	(980)
2	20.2	(1228)
3	19.7	(902)
4	19.6	(466)
5	19.6	(757)

Top 5 longest reads and their mean basecall quality score:

Rank	read_length(bp)	mean_quality_score
1	58854	(11.8)
2	57315	(10.8)
3	51351	(11.5)
4	51326	(13.7)
5	49825	(12.9)

FAL86873_d8db260e NanoPlot read density plot

First, we should compare the two C.bairdi-20102558-2729 runs (degraded input DNA, no Hematodinium infection), as these were the same sample, run on the same flowcell.

Basically, the second run on the flowcell performs worse in every metric: fewer reads, lower quality scores, shorter read lengths, etc. This is despite the fact that this flowcell was run for nearly 4.5x longer period of time (72hrs vs. 16hrs). This isn’t terribly surprising, as the second run started with only 414 available pores, which is well below the ONT-suggested minimum of 800 pores for a “good” flowcell, but it is interesting to see the actual impacts on sequencing that this has.

The C.bairdi-6129_403_26 run (high quality input DNA, with Hematodinium infection) is a better example of what to expect when using non-degraded DNA. Every metric is substantially better than the C.bairdi-20102558-2729 runs.

Now, how to proceed? First, I think I’ll just get an assembly made, utilizing all three sets of data, despite the presence of Hematodinium in the C.bairdi-6129_403_26 run. I’ll tackle the assembly using Flye, as it seems straightforward, is specifically designed for long read assembly, and has built-in assembly “polishing”. This will be a start.

In regards to teasing out Hematodinium sequences, I’ll probably go through the DIAMOND BLASTx -> MEGAN6 process that I’ve used throughout our C.baird RNAseq project for transcriptome assemblies. MEGAN6 can handle long reads and makes taxonomic read extraction straightforward.