After receiving the P.verrucosa RNA-seq data from Danielle Becker (Hollie Putnam’s Lab, Univ. of Rhode Island), I noticed that the trimmed reads didn’t appear to actually be trimmed. There was still adapter contamination (solely in R2 reads - suggesting the detect_adapter_for_pe
option had been omitted from the fastp
command?), but the reads had an average read length of 150bp - except when looking at the adapter content report!!??.
If they had been trimmed, then the average read length should be shorter than 150bp… Oddly, I experienced this with other coral sequencing data from the Putnam Lab back in January.
So, with all of this info, I decided to trim the raw FastQs using fastp
, and perform a 20bp 5’ end hard trim to all reads like I did with the other coral sequencing data from the Putnam Lab back in January.
Job was run on Mox using fastp
, FastQC
, and MultiQC
.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=20230215-pver-fastqc-fastp-multiqc-E5-RNAseq
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20230215-pver-fastqc-fastp-multiqc-E5-RNAseq
### FastQC and fastp trimming E5 coral project P.verrucosa RNA-seq data from Danielle Becker.
### fastp expects input FastQ files to be in format: *R[12]_001.fastq.gz
### E.g. C17_R1_001.fastq.gz
###################################################################################
# These variables need to be set by user
## Assign Variables
# Set FastQ filename patterns
fastq_pattern='*.fastq.gz'
R1_fastq_pattern='*R1_001.fastq.gz'
R2_fastq_pattern='*R2_001.fastq.gz'
# Set number of CPUs to use
threads=40
# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
fastq_checksums=input_fastq_checksums.md5
# FastQC output directory
output_dir=$(pwd)
# Data directories
reads_dir=/gscratch/srlab/sam/data/hputnam-Becker_E5/Becker_RNASeq/data/raw
## Inititalize arrays
raw_fastqs_array=()
R1_names_array=()
R2_names_array=()
# Paths to programs
fastp=/gscratch/srlab/programs/fastp.0.23.1
fastqc=/gscratch/srlab/programs/fastqc_v0.11.9/fastqc
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc
# Programs associative array
declare -A programs_array
programs_array=(
[fastqc]="${fastqc}"
[fastp]="${fastp}" \
[multiqc]="${multiqc}"
)
###################################################################################
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Capture date
timestamp=$(date +%Y%m%d)
# Sync raw FastQ files to working directory
echo ""
echo "Transferring files via rsync..."
rsync --archive --verbose \
"${reads_dir}"${fastq_pattern} .
echo ""
echo "File transfer complete."
echo ""
### Run FastQC ###
### NOTE: Do NOT quote raw_fastqc_list
# Create array of trimmed FastQs
raw_fastqs_array=(${fastq_pattern})
# Pass array contents to new variable as space-delimited list
raw_fastqc_list=$(echo "${raw_fastqs_array[*]}")
echo "Beginning FastQC on raw reads..."
echo ""
# Run FastQC
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${output_dir} \
${raw_fastqc_list}
echo "FastQC on raw reads complete!"
echo ""
### END FASTQC ###
# Create arrays of fastq R1 files and sample names
# Do NOT quote R1_fastq_pattern variable
for fastq in ${R1_fastq_pattern}
do
fastq_array_R1+=("${fastq}")
# Use parameter substitution to remove all text up to and including first "_" from
# right side of string.
R1_names_array+=("${fastq%_*}")
done
# Create array of fastq R2 files
# Do NOT quote R2_fastq_pattern variable
for fastq in ${R2_fastq_pattern}
do
fastq_array_R2+=("${fastq}")
# Use parameter substitution to remove all text up to and including first "_" from
# right side of string.
R2_names_array+=("${fastq%_*}")
done
# Create MD5 checksums for raw FastQs
for fastq in ${fastq_pattern}
do
echo "Generating checksum for ${fastq}"
md5sum "${fastq}" | tee --append ${fastq_checksums}
echo ""
done
### RUN FASTP ###
# Run fastp on files
# Adds JSON report output for downstream usage by MultiQC
# Trims 20bp from 5' end of all reads
# Trims poly G, if present
# Uses parameter substitution (e.g. ${R1_sample_name%%_*})to rm the _R[12] for report names.
echo "Beginning fastp trimming."
echo ""
for index in "${!fastq_array_R1[@]}"
do
R1_sample_name="${R1_names_array[index]}"
R2_sample_name="${R2_names_array[index]}"
${fastp} \
--in1 ${fastq_array_R1[index]} \
--in2 ${fastq_array_R2[index]} \
--detect_adapter_for_pe \
--trim_poly_g \
--trim_front1 20 \
--trim_front2 20 \
--thread ${threads} \
--html "${R1_sample_name%%_*}".fastp-trim."${timestamp}".report.html \
--json "${R1_sample_name%%_*}".fastp-trim."${timestamp}".report.json \
--out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
--out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
# Generate md5 checksums for newly trimmed files
{
md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz
md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
} >> "${trimmed_checksums}"
# Remove original FastQ files
echo ""
echo " Removing ${fastq_array_R1[index]} and ${fastq_array_R2[index]}."
rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done
echo ""
echo "fastp trimming complete."
echo ""
### END FASTP ###
### RUN FASTQC ###
### NOTE: Do NOT quote ${trimmed_fastqc_list}
# Create array of trimmed FastQs
trimmed_fastq_array=(*fastp-trim*.fq.gz)
# Pass array contents to new variable as space-delimited list
trimmed_fastqc_list=$(echo "${trimmed_fastq_array[*]}")
# Run FastQC
echo "Beginning FastQC on trimmed reads..."
echo ""
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${output_dir} \
${trimmed_fastqc_list}
echo ""
echo "FastQC on trimmed reads complete!"
echo ""
### END FASTQC ###
### RUN MULTIQC ###
echo "Beginning MultiQC..."
echo ""
${multiqc} .
echo ""
echo "MultiQC complete."
echo ""
### END MULTIQC ###
####################################################################
# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
echo "Logging program options..."
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
# Handle samtools help menus
if [[ "${program}" == "samtools_index" ]] \
|| [[ "${program}" == "samtools_sort" ]] \
|| [[ "${program}" == "samtools_view" ]]
then
${programs_array[$program]}
# Handle DIAMOND BLAST menu
elif [[ "${program}" == "diamond" ]]; then
${programs_array[$program]} help
# Handle NCBI BLASTx menu
elif [[ "${program}" == "blastx" ]]; then
${programs_array[$program]} -help
fi
${programs_array[$program]} -h
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
# If MultiQC is in programs_array, copy the config file to this directory.
if [[ "${program}" == "multiqc" ]]; then
cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
fi
done
fi
# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
RESULTS
Runtime was Just under two hours:
Output folder:
- 20230215-pver-fastqc-fastp-multiqc-E5-RNAseq/
Trimmed FastQs (FastQ)
- All trimmed FastQs follow this pattern:
*fastp-trim.20230215.fq.gz
- All trimmed FastQs follow this pattern:
Trimmed FastQ MD5 checksums (text)
MultiQC Report (HTML)
- 20230215-pver-fastqc-fastp-multiqc-E5-RNAseq/multiqc_report.html
- NOTE: The report is a bit confusing due to the fact that it has summarized the follwing in a single report:
- raw (e.g.
C17_R[12]_001
) - trimmed (e.g.
C17_R[12]
) - fastp (e.g.
C17
)
- raw (e.g.
Okay, the trimmed FastQs look better - no adapter contamination in R1 or R2 reads, and no “bumpy stuff” at 5’ ends of reads. Plus, average read length is 130bp, reflecting the trimming that was supposed to take place. Next steps will be to align the trimmed reads to the P.verrucosa genomes and the endosymbiont genome(s) in order to seperate reads matching to each genome.