In continued attempts to get a grasp on the geoduck transcriptome size, I decided to “compress” our various assemblies by clustering similar transcripts in each assembly in to a single “representative” transcript, using CD-Hit-est. Settings use to run it were taken from the Trinity FAQ regarding “too many transcripts”.
A bash script was used to rsync
files to Mox and then execute the SBATCH script.
Bash script (GitHub):
#!/usr/bin/bash
# Script to retrieve geoduck Trinity assemblies
# Assemblies will be used in SBATCH script called at end of this script.
# Script needs to be run within same directory as SBATCH script.
# Exit if any command fails
set -e
# Set rsync remote path
gannet="gannet:/volume2/web/Atumefaciens"
owl="owl:/volume1/web/Athaliana"
# Create array of directories for storing Trinity assemblies
assembly_dirs_array=(
/gscratch/srlab/sam/data/P_generosa/transcriptomes/20180827_assembly
/gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia
/gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad
/gscratch/srlab/sam/data/P_generosa/transcriptomes/heart
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124
/gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99)
# Array of Trinity assemblies remote paths for rysnc-ing
assemblies_array=(
20180827_trinity_geoduck_RNAseq/Trinity.fasta
20190409_trinity_pgen_ctenidia_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_gonad_RNAseq/trinity_out_dir/Trinity.fasta
20190215_trinity_geoduck_heart_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI115_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI116_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI123_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI124_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI99_RNAseq/trinity_out_dir/Trinity.fasta)
# Retrieve FastA files via rsync
for index in "${!assemblies_array[@]}"
do
# Remove everything after first slash
assembly=$(echo "${assemblies_array[index]%%/*}")
echo "Preparing to download ${assembly}..."
if [ "${assembly}" = "20180827_trinity_geoduck_RNAseq" ]; then
echo "Now syncing ${assembly} to ${assembly_dirs_array[index]}"
rsync \
--archive \
--progress \
"${owl}/${assemblies_array[index]}" \
"${assembly_dirs_array[index]}"
else
echo "Now syncing ${assembly} to ${assembly_dirs_array[index]}"
rsync \
--archive \
--progress \
"${gannet}/${assemblies_array[index]}" \
"${assembly_dirs_array[index]}"
fi
done
# Start SBATCH script to run CD-Hit on all transcriptome assemblies
sbatch 20190729_cdhit-est_pgen_transcriptomes.sh
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=cdhit_pgen
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190729_cdhit-est_pgen_transcriptomes
# This script is called by 20190729_cdhit_pgen_trinity_assemblies.sh.
# That script uses rsync to transfer files to Mox via the login node.
# This is required because Mox execute nodes don't have internet access.
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log
# Set CPU threads
threads=27
# Program paths
cd_hit_est="/gscratch/srlab/programs/cd-hit-v4.8.1-2019-0228/cd-hit-est"
# Create assembly paths array
assembly_dirs_array=(
/gscratch/srlab/sam/data/P_generosa/transcriptomes/20180827_assembly
/gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia
/gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad
/gscratch/srlab/sam/data/P_generosa/transcriptomes/heart
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124
/gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99)
# Run cd-hit-est on each assembly
for index in "${!assembly_dirs_array[@]}"
do
# Store individual sample name by removing
# everything up to and including the last slash in path
sample_name=$(echo "${assembly_dirs_array[index]##*/}")
# Run cd-hit-est
"${cd_hit_est}" \
-o "${sample_name}".cdhit \
-c 0.98 \
-i "${assembly_dirs_array[index]}"/Trinity.fasta \
-p 1 \
-d 0 \
-b 3 \
-T "${threads}" \
-M 0
done
RESULTS
This was quick (~30mins)!:
Output folder:
The output from CD-Hit is a FastA file. A quick grep -c ">"
on the output files suggests that there were many “transcripts” that couldn’t be clustered together. That’s confirmed when taking a look at the end of the .cdhit.clstr files. Here’s an example (tail -n 25 ctenidia.cdhit.clst
):
>Cluster 325771
0 189nt, >TRINITY_DN58755_c0_g1_i5... *
>Cluster 325772
0 189nt, >TRINITY_DN25471_c0_g1_i8... *
>Cluster 325773
0 187nt, >TRINITY_DN7576_c0_g1_i14... *
>Cluster 325774
0 186nt, >TRINITY_DN26627_c0_g1_i8... *
>Cluster 325775
0 186nt, >TRINITY_DN41216_c0_g1_i3... *
>Cluster 325776
0 185nt, >TRINITY_DN34387_c0_g1_i1... *
>Cluster 325777
0 185nt, >TRINITY_DN6164_c1_g1_i10... *
>Cluster 325778
0 184nt, >TRINITY_DN7464_c1_g2_i3... *
>Cluster 325779
0 184nt, >TRINITY_DN2395_c0_g1_i1... *
>Cluster 325780
0 184nt, >TRINITY_DN557_c0_g1_i3... *
>Cluster 325781
0 182nt, >TRINITY_DN53036_c0_g1_i3... *
>Cluster 325782
0 180nt, >TRINITY_DN12928_c1_g1_i8... *
Each of these transcripts from the assembly are very short (<200bp) and they do not cluster with any other transcripts (likely due to their short length). So, these are retained as individual transcripts, Although I haven’t taken the time to assess things, there are probably thousands (maybe more?) of these transcripts that are small and don’t cluster.
Anyway, I ran this to provide additional information on our transcriptome assemblies, not really to explore the results of CD-Hit in-depth. I’m putting together a summary post of our assemblies that will compile a variety of “stats”; hopefully providing some insight on a more realistic assessment of the number of transcripts present in the geoduck transcriptome.