Steven asked me to try running Blob Tool Kit to identify potential contaminating sequence in our Panopea generosa (Pacific geoduck) genome assembly (v1.0). In preparation for running Blob Tool Kit, I needed to trim the 10x Genomics FastQ data used by Phase Genomics. Files were trimmed using fastp
on Mox.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=20210401_pgen_fastp_10x-genomics
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210401_pgen_fastp_10x-genomics
### Fastp 10x Genomics data used for P.generosa genome assembly by Phase Genomics.
### In preparation for use in BlobToolKit
### Expects input filenames to be in format: *.fastq.gz
###################################################################################
# These variables need to be set by user
## Assign Variables
# Set number of CPUs to use
threads=40
# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
raw_reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/DNAseq
fastq_checksums=raw_fastq_checksums.md5
# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc
## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
# Programs associative array
declare -A programs_array
programs_array=(
[fastp]="${fastp}" \
[multiqc]="${multiqc}"
)
###################################################################################
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Capture date
timestamp=$(date +%Y%m%d)
# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${raw_reads_dir}"*.fastq.gz .
# Create arrays of fastq R1 files and sample names
for fastq in *R1*.fastq.gz
do
fastq_array_R1+=("${fastq}")
done
# Create array of fastq R2 files
for fastq in *R2*.fastq.gz
do
fastq_array_R2+=("${fastq}")
done
# Run fastp on files
# Trim 10bp from 5' from each read
# Adds JSON report output for downstream usage by MultiQC
for index in "${!fastq_array_R1[@]}"
do
# Remove .fastq.gz from end of file names
R1_sample_name=$(echo "${fastq_array_R1[index]}" | sed 's/.fastq.gz//')
R2_sample_name=$(echo "${fastq_array_R2[index]}" | sed 's/.fastq.gz//')
# Get sample name without R1/R2 labels
sample_name=$(echo "${fastq_array_R1[index]}" | sed 's/_R[12].*//')
echo ""
echo "fastp started on ${sample_name} FastQs."
# Run fastp
# Specifies reports in HTML and JSON formats
${fastp} \
--in1 ${fastq_array_R1[index]} \
--in2 ${fastq_array_R2[index]} \
--detect_adapter_for_pe \
--thread ${threads} \
--html "${sample_name}".fastp-trim."${timestamp}".report.html \
--json "${sample_name}".fastp-trim."${timestamp}".report.json \
--out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
--out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
echo "fastp completed on ${sample_name} FastQs"
echo ""
# Generate md5 checksums for newly trimmed files
{
md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz
md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
} >> "${trimmed_checksums}"
# Create MD5 checksum for reference
{
md5sum "${fastq_array_R1[index]}"
md5sum "${fastq_array_R2[index]}"
} >> ${fastq_checksums}
# Remove original FastQ files
rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done
# Run MultiQC
${programs_array[multiqc]} .
###################################################################################
# Capture program options
echo "Logging program options..."
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
# Handle samtools help menus
if [[ "${program}" == "samtools_index" ]] \
|| [[ "${program}" == "samtools_sort" ]] \
|| [[ "${program}" == "samtools_view" ]]
then
${programs_array[$program]}
# Handle DIAMOND BLAST menu
elif [[ "${program}" == "diamond" ]]; then
${programs_array[$program]} help
# Handle NCBI BLASTx menu
elif [[ "${program}" == "blastx" ]]; then
${programs_array[$program]} -help
fi
${programs_array[$program]} -h
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
# If MultiQC is in programs_array, copy the config file to this directory.
if [[ "${program}" == "multiqc" ]]; then
cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
fi
done
# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
echo "Finished logging system PATH"
RESULTS
There’s a lot of data, so this took a pretty long time. Just under 7hrs!
Output folder:
20210401_pgen_fastp_10x-genomics/
List of input FastQ files (text):
MultiQC
Report (HTML; open in browser):Trimmed FastQ list/checksums (text):
Alrighty, I will use these trimmed FastQ files in a subsequent run of Blob Tool Kit with out Panopea generosa (Pacific geoduck) v1.0 genome assembly and see what we find!