I previously assembled and annotated a P. generosa ctenidia transcriptome (20190318) using only the HiSeq data from our Illumina collaboration. This was an oversight, as I didn’t realize that we also had NovaSeq RNAseq data. So, I’ve initiated another de novo assembly using Trinity, incorporating both sets of data.
NovaSeq data had been previously trimmed.
Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic option.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=trin_ctenidia
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_ctenidia_RNAseq
# Exit script if a command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "" >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log
# User-defined variables
reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/ctenidia
threads=28
assembly_stats=assembly_stats.txt
# Paths to programs
trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
## Initialize arrays
R1_array=()
R2_array=()
# Variables for R1/R2 lists
R1_list=""
R2_list=""
# Create array of fastq R1 files
R1_array=(${reads_dir}/*_R1_*.gz)
# Create array of fastq R2 files
R2_array=(${reads_dir}/*_R2_*.gz)
# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in ${reads_dir}/*.gz
do
echo ${fastq##*/} >> fastq.list.txt
done
# Create comma-separated lists of FastQ reads
R1_list=$(echo ${R1_array[@]} | tr " " ",")
R2_list=$(echo ${R2_array[@]} | tr " " ",")
# Run Trinity
${trinity_dir}/Trinity \
--trimmomatic \
--seqType fq \
--max_memory 120G \
--CPU ${threads} \
--left \
${R1_list} \
--right \
${R2_list}
# Assembly stats
${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \
> ${assembly_stats}
# Create gene map files
${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
trinity_out_dir/Trinity.fasta \
> trinity_out_dir/Trinity.fasta.gene_trans_map
# Create FastA index
${samtools} faidx \
trinity_out_dir/Trinity.fasta
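The script builds the --left and --right lists from two separate globs, so Trinity only gets correctly paired files if both globs expand in the same order and neither is missing a mate. Below is a minimal sketch of a pre-flight check that could run before Trinity. The function name `check_pairs` is hypothetical and not part of the original script, and it assumes the `_R1_`/`_R2_` naming seen in the list of input FastQs.

```bash
#!/bin/bash
# Hypothetical helper (not in the original script): confirm that the R1 and R2
# file lists pair up one-to-one.
# Usage: check_pairs "r1a,r1b" "r2a,r2b"   (comma-separated, as passed to Trinity)
check_pairs() {
  local r1_files r2_files i r1_name r2_name
  IFS=',' read -r -a r1_files <<< "$1"
  IFS=',' read -r -a r2_files <<< "$2"

  # Different counts means at least one mate file is missing
  if [ "${#r1_files[@]}" -ne "${#r2_files[@]}" ]; then
    echo "Mismatched counts: ${#r1_files[@]} R1 vs ${#r2_files[@]} R2" >&2
    return 1
  fi

  for i in "${!r1_files[@]}"; do
    # Strip the leading path, then compare everything before _R1_/_R2_
    # (assumed to be the sample/lane prefix shared by both mates)
    r1_name="${r1_files[$i]##*/}"
    r2_name="${r2_files[$i]##*/}"
    if [ "${r1_name%%_R1_*}" != "${r2_name%%_R2_*}" ]; then
      echo "Unpaired at position ${i}: ${r1_name} vs ${r2_name}" >&2
      return 1
    fi
  done
}
```

In the script above, this could be called as `check_pairs "${R1_list}" "${R2_list}"` just before the Trinity command. With `set -e` active, a mismatch would stop the job before any assembly time is spent. The check compares only the prefix before `_R1_`/`_R2_` because the trimmed NovaSeq files carry different suffixes on each mate (`_val_1` vs `_val_2`).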
RESULTS
This took ~12 hrs to complete.
I’ll pass this along to Steven/Christian, since this was done for Christian to use in some long non-coding RNA (lncRNA) analyses.
I’ll also probably just take this through the annotation pipeline, since it’s neither difficult nor time-consuming.
Output folder:
Trinity FastA:
Trinity FastA index file:
Trinity Gene Trans Map file:
Assembly stats (text):
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 216248
Total trinity transcripts: 349773
Percent GC: 35.70
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 5199
Contig N20: 3549
Contig N30: 2617
Contig N40: 1927
Contig N50: 1387
Median contig length: 400
Average contig: 785.61
Total assembled bases: 274785010
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 4118
Contig N20: 2682
Contig N30: 1852
Contig N40: 1291
Contig N50: 892
Median contig length: 338
Average contig: 612.80
Total assembled bases: 132517038
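For reference, contig N50 is the length at which contigs of that length or longer hold at least half of all assembled bases. Below is a minimal sketch that recomputes it from the FastA index, which stores each sequence's length in column 2. The function name `n50_from_fai` is hypothetical, and this is not the code TrinityStats.pl uses, so treat it as a sanity check on the "ALL transcript contigs" value rather than a replacement.

```bash
#!/bin/bash
# Hypothetical helper: compute contig N50 from a samtools faidx index (.fai).
# Column 2 of a .fai file holds each sequence's length in bases.
n50_from_fai() {
  local fai="$1"
  # Sort contig lengths longest-first, then walk down the list until the
  # running total reaches half of all assembled bases.
  cut -f 2 "${fai}" | sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      running = 0
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running >= total / 2) { print len[i]; exit }
      }
    }'
}
```

Assuming the index sits next to the assembly, as it does in the script above, it could be run as `n50_from_fai trinity_out_dir/Trinity.fasta.fai`.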
List of input FastQs (text):
Geoduck-ctenidia-RNA-1_S3_L001_R1_001.fastq.gz
Geoduck-ctenidia-RNA-1_S3_L001_R2_001.fastq.gz
Geoduck-ctenidia-RNA-2_S11_L002_R1_001.fastq.gz
Geoduck-ctenidia-RNA-2_S11_L002_R2_001.fastq.gz
Geoduck-ctenidia-RNA-3_S19_L003_R1_001.fastq.gz
Geoduck-ctenidia-RNA-3_S19_L003_R2_001.fastq.gz
Geoduck-ctenidia-RNA-4_S27_L004_R1_001.fastq.gz
Geoduck-ctenidia-RNA-4_S27_L004_R2_001.fastq.gz
Geoduck-ctenidia-RNA-5_S35_L005_R1_001.fastq.gz
Geoduck-ctenidia-RNA-5_S35_L005_R2_001.fastq.gz
Geoduck-ctenidia-RNA-6_S43_L006_R1_001.fastq.gz
Geoduck-ctenidia-RNA-6_S43_L006_R2_001.fastq.gz
Geoduck-ctenidia-RNA-7_S51_L007_R1_001.fastq.gz
Geoduck-ctenidia-RNA-7_S51_L007_R2_001.fastq.gz
Geoduck-ctenidia-RNA-8_S59_L008_R1_001.fastq.gz
Geoduck-ctenidia-RNA-8_S59_L008_R2_001.fastq.gz
NR012_S1_L001_R1_001_val_1_val_1.fq.gz
NR012_S1_L001_R2_001_val_2_val_2.fq.gz
NR012_S1_L002_R1_001_val_1_val_1.fq.gz
NR012_S1_L002_R2_001_val_2_val_2.fq.gz