I’d previously assembled hemat_transcriptome_v1.0.fasta
on 20200122, hemat_transcriptome_v1.5.fasta
on 20200408, extracted hemat_transcriptome_v2.1.fasta
from an existing FastA on 20200605, as well as extracted hemat_transcriptome_v3.1.fasta
on 20200605.
All of the above transcriptomes were assembled with different combinations of the crab RNAseq data we generated. Here’s a link to an overview of the various assemblies (including the two generated today):
- hemat_transcriptome_comp (Google Sheet)
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=hemat_trinity_v1.6_v1.7
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=15-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210308_hemat_trinity_v1.6_v1.7
# Script to generate Hematodinium Trinity transcriptome assemblies:
# v1.6 libaries: 2018 2019 2020-GW 2020-UW
# v1.7 libraries: 2018 2019 2020-UW
# See corresponding FastQ list for each assembly to see FastQ used in each assembly.
###################################################################################
# These variables need to be set by user
# Assign Variables
script_path=/gscratch/scrubbed/samwhite/outputs/20210308_hemat_trinity_v1.6_v1.7/20210308_hemat_trinity_v1.6_v1.7.sh
reads_dir=/gscratch/srlab/sam/data/Hematodinium/RNAseq
transcriptomes_dir=/gscratch/srlab/sam/data/Hematodinium/transcriptomes
threads=28
# Carrot needed to limit grep to line starting with #SBATCH
# Avoids grep-ing the command below.
max_mem=$(grep "^#SBATCH --mem=" ${script_path} | awk -F [=] '{print $2}')
# Paths to programs
trinity_dir="/gscratch/srlab/programs/trinityrnaseq-v2.9.0"
samtools="/gscratch/srlab/programs/samtools-1.10/samtools"
# Array of the various comparisons to evaluate
# Each condition in each comparison should be separated by a "-"
transcriptomes_array=(
"${transcriptomes_dir}"/hemat_transcriptome_v1.6.fasta \
"${transcriptomes_dir}"/hemat_transcriptome_v1.7.fasta
)
# Programs array
declare -A programs_array
programs_array=(
[samtools_faidx]="${samtools} faidx" \
[trinity]="${trinity_dir}/Trinity" \
[trinity_stats]="${trinity_dir}/util/TrinityStats.pl" \
[trinity_gene_trans_map]="${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl" \
[trinity_fasta_seq_length]="${trinity_dir}/util/misc/fasta_seq_length.pl"
)
###################################################################################
# Exit script if any command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Set working directory
wd=$(pwd)
# Loop through each transcriptome
for transcriptome in "${!transcriptomes_array[@]}"
do
## Inititalize arrays
R1_array=()
R2_array=()
reads_array=()
# Variables
R1_list=""
R2_list=""
trinity_out_dir=""
transcriptome_name="${transcriptomes_array[$transcriptome]##*/}"
assembly_stats="${transcriptome_name}_assembly_stats.txt"
trinity_out_dir="${transcriptome_name}_trinity_out_dir"
# v1.6 libraries: 2018 2019 2020-GW 2020-UW
if [[ "${transcriptome_name}" == "hemat_transcriptome_v1.6.fasta" ]]; then
reads_array=("${reads_dir}"/*megan*.fq)
# Create array of fastq R1 files
R1_array=("${reads_dir}"/*megan*R1.fq)
# Create array of fastq R2 files
R2_array=("${reads_dir}"/*megan*R2.fq)
# v.17 libraries: 2018 2019 2020-UW
elif [[ "${transcriptome_name}" == "hemat_transcriptome_v1.7.fasta" ]]; then
reads_array=("${reads_dir}"/20200[145][13][189]*megan*.fq)
# Create array of fastq R1 files
R1_array=("${reads_dir}"/20200[145][13][189]*megan*R1.fq)
# Create array of fastq R2 files
R2_array=("${reads_dir}"/20200[145][13][189]*megan*R2.fq)
fi
# Create checksum list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
md5sum "${reads_array[@]}" >> "${transcriptome_name}".fastq-checksums.md5
# Create comma-separated lists of FastQ reads
R1_list=$(echo "${R1_array[@]}" | tr " " ",")
R2_list=$(echo "${R2_array[@]}" | tr " " ",")
if [[ "${transcriptome_name}" == "hemat_transcriptome_v1.6.fasta" ]]; then
# Run Trinity without stranded RNAseq option
${programs_array[trinity]} \
--seqType fq \
--max_memory ${max_mem} \
--CPU ${threads} \
--output ${trinity_out_dir} \
--left "${R1_list}" \
--right "${R2_list}"
else
# Run Trinity with stranded RNAseq option
${programs_array[trinity]} \
--seqType fq \
--max_memory ${max_mem} \
--CPU ${threads} \
--output ${trinity_out_dir} \
--SS_lib_type RF \
--left "${R1_list}" \
--right "${R2_list}"
fi
# Rename generic assembly FastA
mv "${trinity_out_dir}"/Trinity.fasta "${trinity_out_dir}"/"${transcriptome_name}"
# Assembly stats
${programs_array[trinity_stats]} "${trinity_out_dir}"/"${transcriptome_name}" \
> "${assembly_stats}"
# Create gene map files
${programs_array[trinity_gene_trans_map]} \
"${trinity_out_dir}"/"${transcriptome_name}" \
> "${trinity_out_dir}"/"${transcriptome_name}".gene_trans_map
# Create sequence lengths file (used for differential gene expression)
${programs_array[trinity_fasta_seq_length]} \
"${trinity_out_dir}"/"${transcriptome_name}" \
> "${trinity_out_dir}"/"${transcriptome_name}".seq_lens
# Create FastA index
${programs_array[samtools_faidx]} \
"${trinity_out_dir}"/"${transcriptome_name}"
# Copy files to transcriptomes directory
rsync -av \
"${trinity_out_dir}"/"${transcriptome_name}"* \
${transcriptomes_dir}
# Capture FastA checksums for verification
cd "${trinity_out_dir}"
echo "Generating checksum for ${transcriptome_name}"
md5sum "${transcriptome_name}" > "${transcriptome_name}".checksum.md5
echo "Finished generating checksum for ${transcriptome_name}"
echo ""
cd ${wd}
done
# Capture program options
echo "Logging program options..."
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
# Handle samtools help menus
if [[ "${program}" == "samtools_index" ]] \
|| [[ "${program}" == "samtools_sort" ]] \
|| [[ "${program}" == "samtools_view" ]]
then
${programs_array[$program]}
# Handle DIAMOND BLAST menu
elif [[ "${program}" == "diamond" ]]; then
${programs_array[$program]} help
# Handle NCBI BLASTx menu
elif [[ "${program}" == "blastx" ]]; then
${programs_array[$program]} -help
fi
${programs_array[$program]} -h
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
# If MultiQC is in programs_array, copy the config file to this directory.
if [[ "${program}" == "multiqc" ]]; then
cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
fi
done
echo ""
echo "Finished logging program options."
echo ""
echo ""
echo "Logging system PATH."
# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
echo "Finished logging system PATH"
RESULTS
Just under 5hrs to run both assemblies:
Output folder:
hemat_transcriptome_v1.6
Input FastQ list (text):
FastA (4.5MB):
FastA Index (text):
The following sets of files are useful for downstream gene expression and annotation using Trinity.
Trinity FastA Gene Trans Map (text):
Trinity FastA Sequence Lengths (text):
Assembly stats:
## Counts of transcripts, etc.
################################
Total trinity 'genes': 5395
Total trinity transcripts: 6176
Percent GC: 50.22
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 1889
Contig N20: 1490
Contig N30: 1212
Contig N40: 1015
Contig N50: 868
Median contig length: 582
Average contig: 700.74
Total assembled bases: 4327777
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 1818
Contig N20: 1423
Contig N30: 1160
Contig N40: 978
Contig N50: 835
Median contig length: 548
Average contig: 669.49
Total assembled bases: 3611911
hemat_transcriptome_v1.7
Input FastQ list (text):
FastA (1.9MB):
FastA Index (text):
The following sets of files are useful for downstream gene expression and annotation using Trinity.
Trinity FastA Gene Trans Map (text):
Trinity FastA Sequence Lengths (text):
Assembly stats:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 3331
Total trinity transcripts: 3538
Percent GC: 50.78
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 1346
Contig N20: 1008
Contig N30: 820
Contig N40: 688
Contig N50: 588
Median contig length: 407
Average contig: 506.42
Total assembled bases: 1791711
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 1250
Contig N20: 959
Contig N30: 782
Contig N40: 661
Contig N50: 560
Median contig length: 392
Average contig: 488.48
Total assembled bases: 1627114