Ran a de novo assembly on our HiSeq and NovaSeq data from Hollie’s juvenile EPI 116 sample “super low OA”. This was done for Christian to use in some long, non-coding RNA (lncRNA) analysis.
NovaSeq data had been previously trimmed.
Trimming of the HiSeq data was performed via Trinity, using the --trimmomatic
option.
SBATCH script (GitHub):
#!/bin/bash
## Job Name
#SBATCH --job-name=trin_epi116
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190409_trinity_pgen_EPI116_RNAseq
# Exit script if a command fails
set -e
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log
# User-defined variables
reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/epi_116
threads=28
assembly_stats=assembly_stats.txt
# Paths to programs
trinity_dir="/gscratch/srlab/programs/Trinity-v2.8.3"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
## Inititalize arrays
R1_array=()
R2_array=()
# Variables for R1/R2 lists
R1_list=""
R2_list=""
# Create array of fastq R1 files
R1_array=(${reads_dir}/*_R1_*.gz)
# Create array of fastq R2 files
R2_array=(${reads_dir}/*_R2_*.gz)
# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in ${reads_dir}/*.gz
do
echo ${fastq##*/} >> fastq.list.txt
done
# Create comma-separated lists of FastQ reads
R1_list=$(echo ${R1_array[@]} | tr " " ",")
R2_list=$(echo ${R2_array[@]} | tr " " ",")
# Run Trinity
${trinity_dir}/Trinity \
--trimmomatic \
--seqType fq \
--max_memory 120G \
--CPU ${threads} \
--left \
${R1_list} \
--right \
${R2_list}
# Assembly stats
${trinity_dir}/util/TrinityStats.pl trinity_out_dir/Trinity.fasta \
> ${assembly_stats}
# Create gene map files
${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
trinity_out_dir/Trinity.fasta \
> trinity_out_dir/Trinity.fasta.gene_trans_map
# Create FastA index
${samtools} faidx \
trinity_out_dir/Trinity.fasta
RESULTS
Output folder:
Trinity FastA:
Trinity FastA index file:
Trinity Gene Trans Map file:
Assembly stats (text):
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 268476
Total trinity transcripts: 434877
Percent GC: 36.27
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 6036
Contig N20: 4167
Contig N30: 3076
Contig N40: 2287
Contig N50: 1663
Median contig length: 433
Average contig: 882.19
Total assembled bases: 383643405
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 4752
Contig N20: 3103
Contig N30: 2179
Contig N40: 1532
Contig N50: 1050
Median contig length: 366
Average contig: 679.12
Total assembled bases: 182327573
List of input FastQs (text):
Geoduck-juvenile-OA-exposure-RNA-EPI-116-1_S5_L001_R1_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-1_S5_L001_R2_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-2_S13_L002_R1_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-2_S13_L002_R2_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-3_S21_L003_R1_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-3_S21_L003_R2_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-4_S29_L004_R1_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-4_S29_L004_R2_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-5_S37_L005_R1_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-5_S37_L005_R2_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-6_S45_L006_R1_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-6_S45_L006_R2_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-7_S53_L007_R1_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-7_S53_L007_R2_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-8_S61_L008_R1_001.fastq.gz
Geoduck-juvenile-OA-exposure-RNA-EPI-116-8_S61_L008_R2_001.fastq.gz
NR005_S4_L001_R1_001_val_1_val_1.fq.gz
NR005_S4_L001_R2_001_val_2_val_2.fq.gz
NR005_S4_L002_R1_001_val_1_val_1.fq.gz
NR005_S4_L002_R2_001_val_2_val_2.fq.gz