I previously assembled and annotated P.generosa gonad transcriptome (20190318) using just our HiSeq data from our Illumina collaboration. This was a an oversight, as I didn’t realize that we also had NovaSeq RNAseq data. So, I’ve initiated another de novo assembly using Trinity incorporating both sets of data.
NovaSeq data had been previously trimmed.
Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic
option.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=trin_gonad
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_gonad_RNAseq
# Exit script if a command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log
# User-defined variables
reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/gonad
threads=28
assembly_stats=assembly_stats.txt
# Paths to programs
trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
## Inititalize arrays
R1_array=()
R2_array=()
# Variables for R1/R2 lists
R1_list=""
R2_list=""
# Create array of fastq R1 files
R1_array=(${reads_dir}/*_R1_*.gz)
# Create array of fastq R2 files
R2_array=(${reads_dir}/*_R2_*.gz)
# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in ${reads_dir}/*.gz
do
echo ${fastq##*/} >> fastq.list.txt
done
# Create comma-separated lists of FastQ reads
R1_list=$(echo ${R1_array[@]} | tr " " ",")
R2_list=$(echo ${R2_array[@]} | tr " " ",")
# Run Trinity
${trinity_dir}/Trinity \
--trimmomatic \
--seqType fq \
--max_memory 120G \
--CPU ${threads} \
--left \
${R1_list} \
--right \
${R2_list}
# Assembly stats
${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \
> ${assembly_stats}
# Create gene map files
${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
trinity_out_dir/Trinity.fasta \
> trinity_out_dir/Trinity.fasta.gene_trans_map
# Create FastA index
${samtools} faidx \
trinity_out_dir/Trinity.fasta
RESULTS
Output folder:
Trinity FastA:
Trinity FastA index file:
Trinity Gene Trans Map file:
Assembly stats (text):
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 151263
Total trinity transcripts: 198748
Percent GC: 36.21
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3270
Contig N20: 2036
Contig N30: 1343
Contig N40: 908
Contig N50: 640
Median contig length: 315
Average contig: 522.32
Total assembled bases: 103810995
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 2267
Contig N20: 1319
Contig N30: 873
Contig N40: 626
Contig N50: 472
Median contig length: 288
Average contig: 439.32
Total assembled bases: 66453310
List of input FastQs (text):
Geoduck-gonad-RNA-1_S1_L001_R1_001.fastq.gz
Geoduck-gonad-RNA-1_S1_L001_R2_001.fastq.gz
Geoduck-gonad-RNA-2_S9_L002_R1_001.fastq.gz
Geoduck-gonad-RNA-2_S9_L002_R2_001.fastq.gz
Geoduck-gonad-RNA-3_S17_L003_R1_001.fastq.gz
Geoduck-gonad-RNA-3_S17_L003_R2_001.fastq.gz
Geoduck-gonad-RNA-4_S25_L004_R1_001.fastq.gz
Geoduck-gonad-RNA-4_S25_L004_R2_001.fastq.gz
Geoduck-gonad-RNA-5_S33_L005_R1_001.fastq.gz
Geoduck-gonad-RNA-5_S33_L005_R2_001.fastq.gz
Geoduck-gonad-RNA-6_S41_L006_R1_001.fastq.gz
Geoduck-gonad-RNA-6_S41_L006_R2_001.fastq.gz
Geoduck-gonad-RNA-7_S49_L007_R1_001.fastq.gz
Geoduck-gonad-RNA-7_S49_L007_R2_001.fastq.gz
Geoduck-gonad-RNA-8_S57_L008_R1_001.fastq.gz
Geoduck-gonad-RNA-8_S57_L008_R2_001.fastq.gz
NR006_S3_L001_R1_001_val_1_val_1.fq.gz
NR006_S3_L001_R2_001_val_2_val_2.fq.gz
NR006_S3_L002_R1_001_val_1_val_1.fq.gz
NR006_S3_L002_R2_001_val_2_val_2.fq.gz