INTRO
In preparation for RNA-seq analysis for urol-e5/timeseries_molecular (GitHub repo) project, I previously indexed the genome using HISAT2
on 20250307.
This notebook details the RNA-seq alignment using HISAT2
and StringTie
to generate DESeq2-formatted count matrices. These will be linked on the timeseries_molecular
Wiki Page.
The contents below are from markdown knitted from 02.20-E-Peve-RNAseq-alignment-HiSat2.Rmd
(commit 3963266
).
1 Background
This notebook will align trimmed P.evermanni RNA-seq data to the P.evermanni genome using HISAT2 (Kim et al. 2019). Follwed by StringTie (Pertea et al. 2015, 2016) for transcript assembly/identification and count matrices for downstream expression analysis with DESeq2 and/or \[Ballgown\](https://github.com/alyssafrazee/ballgown.
Since the BAM files produced by this notebook are too large for GitHub, they can be accessed on our server here:
Input(s)
- Trimmed FastQ files, with format:
<colone_ID>-<timepoint>_*fastp-trim.fq.gz
- HISAT2 genome index:
Porites_evermanni_v1
- Genome GTF:
Porites_evermanni_validated.gtf
- Sample metadata:
M-multi-species/data/rna_metadata.csv
Outputs:
Primary:
checksums.md5
: MD5 checksum for all files in this directory. Excludes subdirectories.peve-gene_count_matrix.csv
: Gene count matrix for use in DESeq2.peve-transcript_count_matrix.csv
: Transcript count matrix for use in DESeq2.peve-transcript_count_matrix_with_gene_ids.csv
: Transcript count matrix which includes corresponding gene IDs.prepDE-sample_list.txt
: Sample file list provided as input to StringTie for DESeq2 count matrix generation. Also serves as documentation of which files were used for this step.Porites_evermanni_v1.stringtie.gtf
: Canonical StringTie GTF file compiled from all individual sample GTFs.sorted-bams-merged.bam
: Merged (and sorted) BAM consisting of all individual sample BAMs.sorted-bams-merged.bam.bai
: BAM index file. Useful for visualizing assemblies in IGV.sorted_bams.list
: List file needed for merging of BAMS with samtools. Also serves as documentation of which files were used for this step.multiqc_report.html
: MultiQC report aggregating all individual HISAT2 alignment stats and samtools flagstats.gtf_list.txt
: List file needed for merging of GTF files with StringTie. Also serves as documentation of which files were used for this step.
Individuals:
Each subdirectory is labelled based on sample name and each contains individual HISAT2 alignment and StringTie output files.
<sample_name>_checksums.md5
: MD5 checksums for all files in the directory.*.ctab
: Data tables formatted for import into Ballgown.<sample_name>.cov_refs.gtf
: StringTie genome reference sequnce coverage GTF.<sample_name>.gtf
: StringTie GTF.<sample_name>.sorted.bam
: HISAT2 assembly BAM.<sample_name>.sorted.bam.bai
: BAM index file. Useful for visualizing assemblies in IGV.<sample_name>-hisat2_output.flagstat
: samtools flagstat output file.<sample_name>_hisat2.stats
: HISAT2 assembly stats.input_fastqs_checksums.md5
: MD5 checksums of files used as input for assembly. Primarily serves as documentation to track/verify which files were actually used.
2 Create a Bash variables file
This allows usage of Bash variables across R Markdown chunks.
{
echo "#### Assign Variables ####"
echo ""
echo "# Data directories"
echo 'export timeseries_dir=/home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular'
echo 'export genome_dir="${timeseries_dir}/E-Peve/data"'
echo 'export genome_index_dir="${timeseries_dir}/E-Peve/output/02.10-E-Peve-RNAseq-genome-index-HiSat2"'
echo 'export output_dir_top="${timeseries_dir}/E-Peve/output/02.20-E-Peve-RNAseq-alignment-HiSat2"'
echo 'export trimmed_fastqs_dir="${timeseries_dir}/E-Peve/output/01.00-E-Peve-RNAseq-trimming-fastp-FastQC-MultiQC"'
echo ""
echo "# Location of Hisat2 index files"
echo "# Must keep variable name formatting, as it's used by HiSat2"
echo 'export HISAT2_INDEXES="${genome_index_dir}"'
echo "# Input files"
echo 'export genome_index_name="Porites_evermanni_v1"'
echo 'export genome_gff="${genome_dir}/Porites_evermanni_validated.gff3"'
echo 'export genome_fasta="${genome_dir}/Porites_evermanni_v1.fa"'
echo 'export transcripts_gtf="${genome_dir}/Porites_evermanni_validated.gtf"'
echo "# Output files"
echo 'export gtf_list="${output_dir_top}/gtf_list.txt"'
echo 'export merged_bam="${output_dir_top}/sorted-bams-merged.bam"'
echo ""
echo "# Paths to programs"
echo 'export programs_dir="/home/shared"'
echo 'export hisat2_dir="${programs_dir}/hisat2-2.2.1"'
echo 'export hisat2="${hisat2_dir}/hisat2"'
echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'
echo 'export samtools="${programs_dir}/samtools-1.12/samtools"'
echo 'export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"'
echo 'export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"'
echo ""
echo "# Set FastQ filename patterns"
echo "export R1_fastq_pattern='*_R1_*.fq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fq.gz'"
echo "export trimmed_fastq_pattern='*fastp-trim.fq.gz'"
echo ""
echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""
echo "# Set average read length - for StringTie prepDE.py"
echo 'export read_length=125'
echo ""
echo "## Initialize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo "declare -A sample_timepoint_map"
echo ""
echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[hisat2]="${hisat2}" \'
echo '[multiqc]="${multiqc}" \'
echo '[prepDE]="${prepDE}" \'
echo '[samtools]="${samtools}" \'
echo '[stringtie]="${stringtie}" \'
echo ")"
echo ""
echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars
cat .bashvars
#### Assign Variables ####
# Data directories
export timeseries_dir=/home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular
export genome_dir="${timeseries_dir}/E-Peve/data"
export genome_index_dir="${timeseries_dir}/E-Peve/output/02.10-E-Peve-RNAseq-genome-index-HiSat2"
export output_dir_top="${timeseries_dir}/E-Peve/output/02.20-E-Peve-RNAseq-alignment-HiSat2"
export trimmed_fastqs_dir="${timeseries_dir}/E-Peve/output/01.00-E-Peve-RNAseq-trimming-fastp-FastQC-MultiQC"
# Location of Hisat2 index files
# Must keep variable name formatting, as it's used by HiSat2
export HISAT2_INDEXES="${genome_index_dir}"
# Input files
export genome_index_name="Porites_evermanni_v1"
export genome_gff="${genome_dir}/Porites_evermanni_validated.gff3"
export genome_fasta="${genome_dir}/Porites_evermanni_v1.fa"
export transcripts_gtf="${genome_dir}/Porites_evermanni_validated.gtf"
# Output files
export gtf_list="${output_dir_top}/gtf_list.txt"
export merged_bam="${output_dir_top}/sorted-bams-merged.bam"
# Paths to programs
export programs_dir="/home/shared"
export hisat2_dir="${programs_dir}/hisat2-2.2.1"
export hisat2="${hisat2_dir}/hisat2"
export multiqc=/home/sam/programs/mambaforge/bin/multiqc
export samtools="${programs_dir}/samtools-1.12/samtools"
export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"
export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"
# Set FastQ filename patterns
export R1_fastq_pattern='*_R1_*.fq.gz'
export R2_fastq_pattern='*_R2_*.fq.gz'
export trimmed_fastq_pattern='*fastp-trim.fq.gz'
# Set number of CPUs to use
export threads=40
# Set average read length - for StringTie prepDE.py
export read_length=125
## Initialize arrays
export fastq_array_R1=()
export fastq_array_R2=()
export R1_names_array=()
export R2_names_array=()
declare -A sample_timepoint_map
# Programs associative array
declare -A programs_array
programs_array=(
[hisat2]="${hisat2}" \
[multiqc]="${multiqc}" \
[prepDE]="${prepDE}" \
[samtools]="${samtools}" \
[stringtie]="${stringtie}" \
)
# Print formatting
export line="--------------------------------------------------------"
3 Align reads using HISAT2
This requires usage of the rna_metadata.csv
This step has a lengthy, semi-complex workflow:
- Parse
rna_metadata.csv
for P.evermanni sample names and time point. This info will be used for downstream file naming and to assing the time point to the read group (SM:
) in the alignment file. - Loop through all samples and perform individual alignments using HISAT2.
- HISAT2 output is piped to through multiple samtools tools: flagstat (stats aggregation), sort (creates/sorts BAM), index (creates BAM index). Piping saves time and disk space, by avoiding the generation of large SAM files.
- Loop continues and runs StringTie on sorted BAM file to produce individual GTF file.
- Loop continues and adds GTF path/filename to a list file, which will be used downstream.
# Load bash variables into memory
source .bashvars
# Make output directories, if they don't exist
mkdir --parents "${output_dir_top}"
# Change to ouput directory
cd "${output_dir_top}"
# Create associative array with sample and timepoint
metadata="../../../M-multi-species/data/rna_metadata.csv"
# Declare the array
declare -A sample_timepoint_map
# Read the metadata file line by line
while IFS=',' read -r sample_number sample_name plate well_number azenta_sample_name colony_id timepoint sample_type species_strain SampleBuffer; do
# Check if the species is "Porites evermanni"
if [[ "${species_strain}" == "Porites evermanni" ]]; then
# Add the Azenta sample name as the key and Timepoint as the value in the associative array
sample_timepoint_map["${colony_id}-${timepoint}"]="${timepoint}"
fi
done < <(tail -n +2 "${metadata}") # Skip the header
## Populate trimmed reads arrays
fastq_array_R1=("${trimmed_fastqs_dir}"/${R1_fastq_pattern})
fastq_array_R2=("${trimmed_fastqs_dir}"/${R2_fastq_pattern})
############## BEGIN HISAT2 ALIGNMENTS ##############
# Loop through array using sample names (e.g. <colony_ID>-<timepoint>)
for sample in "${!sample_timepoint_map[@]}"
do
# Create and switch to dedicated sample directory
mkdir --parents "${sample}" && cd "$_"
# Create HISAT2 list of fastq R1 files
# and generated MD5 checksums file.
for fastq in "${fastq_array_R1[@]}"
do
# Parse sample name from FastQ filename
fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}')
# Process matching FastQ file, based on sample name
if [ "${fastq_sample}" == "${sample}" ]; then
# Generate checksum/list of input files used
md5sum "${fastq}" >> input_fastqs_checksums.md5
# Create comma-separated lists of FastQs for HISAT2
printf -v joined_R1 '%s,' "${fastq}"
fastq_list_R1=$(echo "${joined_R1%,}")
fi
done
# Create HISAT2 list of fastq R1 files
# and generated MD5 checksums file.
for fastq in "${fastq_array_R2[@]}"
do
# Parse sample name from FastQ filename
fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}')
# Process matching FastQ file, based on sample name
if [ "${fastq_sample}" == "${sample}" ]; then
# Generate checksum/list of input files used
md5sum "${fastq}" >> input_fastqs_checksums.md5
# Create comma-separated lists of FastQs for HISAT2
printf -v joined_R2 '%s,' "${fastq}"
fastq_list_R2=$(echo "${joined_R2%,}")
fi
done
# HISAT2 alignments
# Sets read group info (RG) using samples array
"${programs_array[hisat2]}" \
-x "${genome_index_name}" \
-1 "${fastq_list_R1}" \
-2 "${fastq_list_R2}" \
--threads "${threads}" \
--rg-id "${sample}" \
--rg "SM:""${sample_timepoint_map[$sample]}" \
2> "${sample}"_hisat2.stats \
| tee >(${programs_array[samtools]} flagstat - > "${sample}"-hisat2_output.flagstat) \
| ${programs_array[samtools]} sort - -@ "${threads}" -O BAM \
| tee "${sample}".sorted.bam \
| ${programs_array[samtools]} index - "${sample}".sorted.bam.bai
# Run stringtie on alignments
# Uses "-B" option to output tables intended for use in Ballgown
# Uses "-e" option; recommended when using "-B" option.
# Limits analysis to only reads alignments matching reference.
"${programs_array[stringtie]}" "${sample}".sorted.bam \
-p "${threads}" \
-o "${sample}".gtf \
-G "${genome_gff}" \
-C "${sample}.cov_refs.gtf" \
-B \
-e
# Add GTFs to list file, only if non-empty
# Identifies GTF files that only have header
gtf_lines=$(wc -l < "${sample}".gtf )
if [ "${gtf_lines}" -gt 2 ]; then
echo "$(pwd)/${sample}.gtf" >> "${gtf_list}"
fi
# Generate checksums
find ./ -type f -not -name "*.md5" -exec md5sum {} \; > ${sample}_checksums.md5
# Move up to orig. working directory
cd ..
done
3.1 Review HISAT2 Output
View the resulting directory structure of resulting from the HISAT2/StringTie process.
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Display HISAT2 output directory structure
# with directory (--du) and file sizes (-h)
tree --du -h
[1.6G] [01;34m.[0m
├── [ 592] [00mchecksums.md5[0m
├── [5.3K] [00mgtf_list.txt[0m
├── [301K] [01;34mmultiqc_data[0m
│ ├── [5.0K] [00mmultiqc_bowtie2.txt[0m
│ ├── [ 307] [00mmultiqc_citations.txt[0m
│ ├── [262K] [00mmultiqc_data.json[0m
│ ├── [3.1K] [00mmultiqc_general_stats.txt[0m
│ ├── [4.1K] [00mmultiqc.log[0m
│ ├── [7.7K] [00mmultiqc_samtools_flagstat.txt[0m
│ └── [ 15K] [00mmultiqc_sources.txt[0m
├── [1.1M] [00mmultiqc_report.html[0m
├── [4.6M] [00mpeve-gene_count_matrix.csv[0m
├── [4.3M] [00mpeve-transcript_count_matrix.csv[0m
├── [5.0M] [00mpeve-transcript_count_matrix_with_gene_ids.csv[0m
├── [ 42M] [01;34mPOR-216-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-216-TP1_checksums.md5[0m
│ ├── [ 450] [00mPOR-216-TP1-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-216-TP1_hisat2.stats[0m
│ ├── [1.2M] [00mPOR-216-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-216-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-216-TP2_checksums.md5[0m
│ ├── [ 449] [00mPOR-216-TP2-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-216-TP2_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-216-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-216-TP3[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-216-TP3_checksums.md5[0m
│ ├── [ 449] [00mPOR-216-TP3-hisat2_output.flagstat[0m
│ ├── [ 636] [00mPOR-216-TP3_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-216-TP3.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-216-TP4[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-216-TP4_checksums.md5[0m
│ ├── [ 450] [00mPOR-216-TP4-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-216-TP4_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-216-TP4.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-236-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-236-TP1_checksums.md5[0m
│ ├── [ 449] [00mPOR-236-TP1-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-236-TP1_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-236-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-236-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-236-TP2_checksums.md5[0m
│ ├── [ 450] [00mPOR-236-TP2-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-236-TP2_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-236-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-245-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-245-TP1_checksums.md5[0m
│ ├── [ 449] [00mPOR-245-TP1-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-245-TP1_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-245-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-245-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-245-TP2_checksums.md5[0m
│ ├── [ 449] [00mPOR-245-TP2-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-245-TP2_hisat2.stats[0m
│ ├── [1.0M] [00mPOR-245-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-245-TP3[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-245-TP3_checksums.md5[0m
│ ├── [ 450] [00mPOR-245-TP3-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-245-TP3_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-245-TP3.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-245-TP4[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-245-TP4_checksums.md5[0m
│ ├── [ 450] [00mPOR-245-TP4-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-245-TP4_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-245-TP4.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-260-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-260-TP1_checksums.md5[0m
│ ├── [ 450] [00mPOR-260-TP1-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-260-TP1_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-260-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-260-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-260-TP2_checksums.md5[0m
│ ├── [ 449] [00mPOR-260-TP2-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-260-TP2_hisat2.stats[0m
│ ├── [1.0M] [00mPOR-260-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-260-TP3[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-260-TP3_checksums.md5[0m
│ ├── [ 449] [00mPOR-260-TP3-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-260-TP3_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-260-TP3.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-260-TP4[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-260-TP4_checksums.md5[0m
│ ├── [ 449] [00mPOR-260-TP4-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-260-TP4_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-260-TP4.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-262-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-262-TP1_checksums.md5[0m
│ ├── [ 449] [00mPOR-262-TP1-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-262-TP1_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-262-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-262-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-262-TP2_checksums.md5[0m
│ ├── [ 449] [00mPOR-262-TP2-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-262-TP2_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-262-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-262-TP3[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-262-TP3_checksums.md5[0m
│ ├── [ 450] [00mPOR-262-TP3-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-262-TP3_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-262-TP3.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-262-TP4[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOR-262-TP4_checksums.md5[0m
│ ├── [ 449] [00mPOR-262-TP4-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-262-TP4_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-262-TP4.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-69-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-69-TP1_checksums.md5[0m
│ ├── [ 450] [00mPOR-69-TP1-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-69-TP1_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-69-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-69-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-69-TP2_checksums.md5[0m
│ ├── [ 450] [00mPOR-69-TP2-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-69-TP2_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-69-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-69-TP3[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-69-TP3_checksums.md5[0m
│ ├── [ 450] [00mPOR-69-TP3-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-69-TP3_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-69-TP3.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-69-TP4[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-69-TP4_checksums.md5[0m
│ ├── [ 449] [00mPOR-69-TP4-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-69-TP4_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-69-TP4.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-72-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-72-TP1_checksums.md5[0m
│ ├── [ 449] [00mPOR-72-TP1-hisat2_output.flagstat[0m
│ ├── [ 637] [00mPOR-72-TP1_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-72-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-72-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-72-TP2_checksums.md5[0m
│ ├── [ 449] [00mPOR-72-TP2-hisat2_output.flagstat[0m
│ ├── [ 636] [00mPOR-72-TP2_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-72-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-72-TP3[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-72-TP3_checksums.md5[0m
│ ├── [ 449] [00mPOR-72-TP3-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-72-TP3_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-72-TP3.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-72-TP4[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-72-TP4_checksums.md5[0m
│ ├── [ 451] [00mPOR-72-TP4-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-72-TP4_hisat2.stats[0m
│ ├── [1.2M] [00mPOR-72-TP4.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-73-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-73-TP1_checksums.md5[0m
│ ├── [ 450] [00mPOR-73-TP1-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOR-73-TP1_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-73-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-73-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-73-TP2_checksums.md5[0m
│ ├── [ 450] [00mPOR-73-TP2-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-73-TP2_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-73-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-73-TP3[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-73-TP3_checksums.md5[0m
│ ├── [ 450] [00mPOR-73-TP3-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-73-TP3_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-73-TP3.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-73-TP4[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-73-TP4_checksums.md5[0m
│ ├── [ 449] [00mPOR-73-TP4-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOR-73-TP4_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-73-TP4.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-74-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-74-TP1_checksums.md5[0m
│ ├── [ 449] [00mPOR-74-TP1-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-74-TP1_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-74-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-74-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-74-TP2_checksums.md5[0m
│ ├── [ 449] [00mPOR-74-TP2-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOR-74-TP2_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-74-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-74-TP3[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-74-TP3_checksums.md5[0m
│ ├── [ 450] [00mPOR-74-TP3-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-74-TP3_hisat2.stats[0m
│ ├── [1.2M] [00mPOR-74-TP3.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-74-TP4[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-74-TP4_checksums.md5[0m
│ ├── [ 450] [00mPOR-74-TP4-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-74-TP4_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-74-TP4.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-83-TP1[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-83-TP1_checksums.md5[0m
│ ├── [ 449] [00mPOR-83-TP1-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-83-TP1_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-83-TP1.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-83-TP2[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-83-TP2_checksums.md5[0m
│ ├── [ 450] [00mPOR-83-TP2-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-83-TP2_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-83-TP2.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-83-TP3[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-83-TP3_checksums.md5[0m
│ ├── [ 450] [00mPOR-83-TP3-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-83-TP3_hisat2.stats[0m
│ ├── [1.1M] [00mPOR-83-TP3.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 42M] [01;34mPOR-83-TP4[0m
│ ├── [2.8M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.3M] [00mi2t.ctab[0m
│ ├── [ 11M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOR-83-TP4_checksums.md5[0m
│ ├── [ 450] [00mPOR-83-TP4-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOR-83-TP4_hisat2.stats[0m
│ ├── [1.2M] [00mPOR-83-TP4.sorted.bam.bai[0m
│ └── [4.1M] [00mt_data.ctab[0m
├── [ 43M] [00mPorites_evermanni_v1.stringtie.gtf[0m
├── [5.8K] [00mprepDE-sample_list.txt[0m
├── [3.3K] [00mREADME.md[0m
├── [1.3K] [00msorted_bams.list[0m
└── [8.6M] [00msorted-bams-merged.bam.bai[0m
3.2G used in 40 directories, 398 files
3.2 MultiQC alignment rates
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
${programs_array[multiqc]} \
\
--interactive .
4 Merge sorted BAMs
Merge all BAMs to singular BAM for use in transcriptome assembly later, if needed.
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
## Create list of sorted BAMs for merging
find . -name "*sorted.bam" > sorted_bams.list
## Merge sorted BAMs
${programs_array[samtools]} merge \
\
-b sorted_bams.list ${merged_bam} \
${threads}
--threads
## Index merged BAM
${programs_array[samtools]} index ${merged_bam}
5 Create combined GTF
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Create singular transcript file, using GTF list file
"${programs_array[stringtie]}" --merge \
"${gtf_list}" \
"${threads}" \
-p "${genome_gff}" \
-G "${genome_index_name}".stringtie.gtf -o
6 Create DESeq2 Count Matrices
# Load bash variables into memory
source .bashvars
# Change to output directory
cd "${output_dir_top}"
# Check if prepDE-sample_list.txt exists
if [ -f prepDE-sample_list.txt ]; then
gtf_lines=$(wc -l < gtf_list.txt)
prepde_lines=$(wc -l < prepDE-sample_list.txt)
if [ "$gtf_lines" -eq "$prepde_lines" ]; then
echo "prepDE-sample_list.txt exists and line count matches gtf_list.txt. Skipping sample list creation."
else
echo "prepDE-sample_list.txt exists but line count does not match. Regenerating sample list."
rm prepDE-sample_list.txt
while read -r line
do
sample_no_path=${line##*/}
sample=${sample_no_path%.*}
echo ${sample} ${line}
done < gtf_list.txt >> prepDE-sample_list.txt
fi
else
echo "prepDE-sample_list.txt does not exist. Creating sample list."
while read -r line
do
sample_no_path=${line##*/}
sample=${sample_no_path%.*}
echo ${sample} ${line}
done < gtf_list.txt >> prepDE-sample_list.txt
fi
# Create count matrices for genes and transcripts
# Compatible with import to DESeq2
python3 "${programs_array[prepDE]}" \
\
--input=prepDE-sample_list.txt \
-g peve-gene_count_matrix.csv \
-t peve-transcript_count_matrix.csv ${read_length} --length=
prepDE-sample_list.txt exists and line count matches gtf_list.txt. Skipping sample list creation.
/home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py3:69: SyntaxWarning: invalid escape sequence '\-'
RE_COVERAGE=re.compile('cov "([\-\+\d\.]+)"')
/home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py3:72: SyntaxWarning: invalid escape sequence '\-'
RE_GFILE=re.compile('\-G\s*(\S+)') #assume filepath without spaces..
7 Add Gene IDs to Transcript Count Matrix
Create an enhanced version of the transcript count matrix that includes the corresponding gene ID for each transcript.
# Load bash variables into memory
source .bashvars
# Change to output directory
cd "${output_dir_top}"
# Extract transcript_id to ref_gene_id mapping from GTF file
echo "Extracting transcript to gene ID mappings from GTF file..."
grep -E $'\ttranscript\t' "${genome_index_name}.stringtie.gtf" | \
awk -F'\t' '{
# Extract transcript_id
match($9, /transcript_id "([^"]+)"/, transcript_arr)
transcript_id = transcript_arr[1]
# Extract ref_gene_id
match($9, /ref_gene_id "([^"]+)"/, gene_arr)
ref_gene_id = gene_arr[1]
# Print mapping if both IDs found
if (transcript_id && ref_gene_id) {
print transcript_id "\t" ref_gene_id
}
}' > transcript_gene_mapping.txt
# Check if mapping file was created successfully
if [ ! -s transcript_gene_mapping.txt ]; then
echo "Error: Failed to create transcript-gene mapping file"
exit 1
fi
echo "Found $(wc -l < transcript_gene_mapping.txt) transcript-gene mappings"
# Create enhanced transcript count matrix with gene IDs
echo "Creating enhanced transcript count matrix with gene IDs..."
# Read header from original transcript count matrix
head -1 peve-transcript_count_matrix.csv > temp_header.txt
# Add ref_gene_id to header (insert after transcript_id column)
sed 's/transcript_id,/transcript_id,ref_gene_id,/' temp_header.txt > peve-transcript_count_matrix_with_gene_ids.csv
# Process data rows
tail -n +2 peve-transcript_count_matrix.csv | while IFS=',' read -r transcript_id rest_of_line; do
# Look up gene ID for this transcript
gene_id=$(grep "^${transcript_id}[[:space:]]" transcript_gene_mapping.txt | cut -f2)
# If no gene ID found, use empty field
if [ -z "$gene_id" ]; then
gene_id=""
fi
# Output line with gene ID inserted after transcript ID
echo "${transcript_id},${gene_id},${rest_of_line}"
done >> peve-transcript_count_matrix_with_gene_ids.csv
# Generate summary statistics
echo ""
echo "Summary of enhanced transcript count matrix:"
original_lines=$(wc -l < peve-transcript_count_matrix.csv)
enhanced_lines=$(wc -l < peve-transcript_count_matrix_with_gene_ids.csv)
echo "Original transcript count matrix lines: $original_lines"
echo "Enhanced transcript count matrix lines: $enhanced_lines"
# Count transcripts with and without gene ID mappings
transcripts_with_genes=$(tail -n +2 peve-transcript_count_matrix_with_gene_ids.csv | awk -F',' '$2 != ""' | wc -l)
transcripts_without_genes=$(tail -n +2 peve-transcript_count_matrix_with_gene_ids.csv | awk -F',' '$2 == ""' | wc -l)
echo "Number of transcripts with gene ID mappings: $transcripts_with_genes"
echo "Number of transcripts without gene ID mappings: $transcripts_without_genes"
# Display first few lines of enhanced matrix
echo ""
echo "First 5 rows of enhanced transcript count matrix (first 5 columns):"
head -6 peve-transcript_count_matrix_with_gene_ids.csv | cut -d',' -f1-5
# Clean up temporary files
rm temp_header.txt transcript_gene_mapping.txt
echo ""
echo "Enhanced transcript count matrix saved as: peve-transcript_count_matrix_with_gene_ids.csv"
Extracting transcript to gene ID mappings from GTF file...
Found 40389 transcript-gene mappings
Creating enhanced transcript count matrix with gene IDs...
Summary of enhanced transcript count matrix:
Original transcript count matrix lines: 40390
Enhanced transcript count matrix lines: 40390
Number of transcripts with gene ID mappings: 40389
Number of transcripts without gene ID mappings: 0
First 5 rows of enhanced transcript count matrix (first 5 columns):
transcript_id,ref_gene_id,POR-216-TP1,POR-216-TP2,POR-216-TP3
mrna-00031,gene-Peve_00000032,29,75,63
mrna-00116,gene-Peve_00000122,0,1,0
mrna-00008,gene-Peve_00000008,0,0,0
mrna-00111,gene-Peve_00000117,13,13,0
mrna-00013,gene-Peve_00000013,105,114,164
Enhanced transcript count matrix saved as: peve-transcript_count_matrix_with_gene_ids.csv
8 Generate checksums
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Uses find command to avoid passing
# directory names to the md5sum command.
find . -maxdepth 1 -type f -exec md5sum {} + \
| tee --append checksums.md5