INTRO
In preparation for RNA-seq analysis for urol-e5/timeseries_molecular (GitHub repo) project, I previously indexed the genome using HISAT2
earlier today.
This notebook details the RNA-seq alignment using HISAT2
and StringTie
to generate DESeq2-formatted count matrices. These will be linked on the timeseries_molecular
Wiki Page.
The contents below are from markdown knitted from 02.20-F-Ptua-RNAseq-alignment-HiSat2.Rmd
(commit ac67faa7
).
1 Background
This notebook will align trimmed P.tuahiniensis RNA-seq data to the P.tuahiniensis genome using HISAT2 (Kim et al. 2019). Follwed by StringTie (Pertea et al. 2015, 2016) for transcript assembly/identification and count matrices for downstream expression analysis with DESeq2 and/or \[Ballgown\](https://github.com/alyssafrazee/ballgown.
Since the BAM files produced by this notebook are too large for GitHub, they can be accessed on our server here:
Input(s)
- Trimmed FastQ files, with format:
<colone_ID>-<timepoint>_*fastp-trim.fq.gz
- HISAT2 genome index:
Pocillopora_meandrina_HIv1.assembly
- Genome GTF:
Porites_evermanni_validated.gtf
- Sample metadata:
M-multi-species/data/rna_metadata.csv
Outputs:
Primary:
checksums.md5
: MD5 checksum for all files in this directory. Excludes subdirectories.ptua-gene_count_matrix.csv
: Gene count matrix for use in DESeq2.ptua-transcript_count_matrix.csv
: Transcript count matrix for use in DESeq2.ptua-transcript_count_matrix_with_gene_ids.csv
: Transcript count matrix which includes corresponding gene IDs.prepDE-sample_list.txt
: Sample file list provided as input to StringTie for DESeq2 count matrix generation. Also serves as documentation of which files were used for this step.Pocillopora_meandrina_HIv1.assembly.stringtie.gtf
: Canonical StringTie GTF file compiled from all individual sample GTFs.sorted-bams-merged.bam
: Merged (and sorted) BAM consisting of all individual sample BAMs.sorted-bams-merged.bam.bai
: BAM index file. Useful for visualizing assemblies in IGV.sorted_bams.list
: List file needed for merging of BAMS with samtools. Also serves as documentation of which files were used for this step.multiqc_report.html
: MultiQC report aggregating all individual HISAT2 alignment stats and samtools flagstats.gtf_list.txt
: List file needed for merging of GTF files with StringTie. Also serves as documentation of which files were used for this step.
Individuals:
Each subdirectory is labelled based on sample name and each contains individual HISAT2 alignment and StringTie output files.
<sample_name>_checksums.md5
: MD5 checksums for all files in the directory.*.ctab
: Data tables formatted for import into Ballgown.<sample_name>.cov_refs.gtf
: StringTie genome reference sequnce coverage GTF.<sample_name>.gtf
: StringTie GTF.<sample_name>.sorted.bam
: HISAT2 assembly BAM.<sample_name>.sorted.bam.bai
: BAM index file. Useful for visualizing assemblies in IGV.<sample_name>-hisat2_output.flagstat
: samtools flagstat output file.<sample_name>_hisat2.stats
: HISAT2 assembly stats.input_fastqs_checksums.md5
: MD5 checksums of files used as input for assembly. Primarily serves as documentation to track/verify which files were actually used.
2 Create a Bash variables file
This allows usage of Bash variables across R Markdown chunks.
{
echo "#### Assign Variables ####"
echo ""
echo "# Data directories"
echo 'export timeseries_dir=/home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular'
echo 'export genome_dir="${timeseries_dir}/F-Ptua/data"'
echo 'export genome_index_dir="${timeseries_dir}/F-Ptua/output/02.10-F-Ptua-RNAseq-genome-index-HiSat2"'
echo 'export output_dir_top="${timeseries_dir}/F-Ptua/output/02.20-F-Ptua-RNAseq-alignment-HiSat2"'
echo 'export trimmed_fastqs_dir="${timeseries_dir}/F-Ptua/output/01.00-F-Ptua-RNAseq-trimming-fastp-FastQC-MultiQC"'
echo ""
echo "# Location of Hisat2 index files"
echo "# Must keep variable name formatting, as it's used by HiSat2"
echo 'export HISAT2_INDEXES="${genome_index_dir}"'
echo "# Input files"
echo 'export genome_index_name="Pocillopora_meandrina_HIv1.assembly"'
echo 'export genome_gff="${genome_dir}/Pocillopora_meandrina_HIv1.genes-validated.gff3"'
echo 'export genome_fasta="${genome_dir}/Pocillopora_meandrina_HIv1.assembly.fasta"'
echo 'export transcripts_gtf="${genome_dir}/Pocillopora_meandrina_HIv1.genes-validated.gtf"'
echo "# Output files"
echo 'export gtf_list="${output_dir_top}/gtf_list.txt"'
echo 'export merged_bam="${output_dir_top}/sorted-bams-merged.bam"'
echo ""
echo "# Paths to programs"
echo 'export programs_dir="/home/shared"'
echo 'export hisat2_dir="${programs_dir}/hisat2-2.2.1"'
echo 'export hisat2="${hisat2_dir}/hisat2"'
echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'
echo 'export samtools="${programs_dir}/samtools-1.12/samtools"'
echo 'export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"'
echo 'export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"'
echo ""
echo "# Set FastQ filename patterns"
echo "export R1_fastq_pattern='*_R1_*.fq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fq.gz'"
echo "export trimmed_fastq_pattern='*fastp-trim.fq.gz'"
echo ""
echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""
echo "# Set average read length - for StringTie prepDE.py"
echo 'export read_length=125'
echo ""
echo "## Initialize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo "declare -A sample_timepoint_map"
echo ""
echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[hisat2]="${hisat2}" \'
echo '[multiqc]="${multiqc}" \'
echo '[prepDE]="${prepDE}" \'
echo '[samtools]="${samtools}" \'
echo '[stringtie]="${stringtie}" \'
echo ")"
echo ""
echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars
cat .bashvars
#### Assign Variables ####
# Data directories
export timeseries_dir=/home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular
export genome_dir="${timeseries_dir}/F-Ptua/data"
export genome_index_dir="${timeseries_dir}/F-Ptua/output/02.10-F-Ptua-RNAseq-genome-index-HiSat2"
export output_dir_top="${timeseries_dir}/F-Ptua/output/02.20-F-Ptua-RNAseq-alignment-HiSat2"
export trimmed_fastqs_dir="${timeseries_dir}/F-Ptua/output/01.00-F-Ptua-RNAseq-trimming-fastp-FastQC-MultiQC"
# Location of Hisat2 index files
# Must keep variable name formatting, as it's used by HiSat2
export HISAT2_INDEXES="${genome_index_dir}"
# Input files
export genome_index_name="Pocillopora_meandrina_HIv1.assembly"
export genome_gff="${genome_dir}/Pocillopora_meandrina_HIv1.genes-validated.gff3"
export genome_fasta="${genome_dir}/Pocillopora_meandrina_HIv1.assembly.fasta"
export transcripts_gtf="${genome_dir}/Pocillopora_meandrina_HIv1.genes-validated.gtf"
# Output files
export gtf_list="${output_dir_top}/gtf_list.txt"
export merged_bam="${output_dir_top}/sorted-bams-merged.bam"
# Paths to programs
export programs_dir="/home/shared"
export hisat2_dir="${programs_dir}/hisat2-2.2.1"
export hisat2="${hisat2_dir}/hisat2"
export multiqc=/home/sam/programs/mambaforge/bin/multiqc
export samtools="${programs_dir}/samtools-1.12/samtools"
export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"
export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"
# Set FastQ filename patterns
export R1_fastq_pattern='*_R1_*.fq.gz'
export R2_fastq_pattern='*_R2_*.fq.gz'
export trimmed_fastq_pattern='*fastp-trim.fq.gz'
# Set number of CPUs to use
export threads=40
# Set average read length - for StringTie prepDE.py
export read_length=125
## Initialize arrays
export fastq_array_R1=()
export fastq_array_R2=()
export R1_names_array=()
export R2_names_array=()
declare -A sample_timepoint_map
# Programs associative array
declare -A programs_array
programs_array=(
[hisat2]="${hisat2}" \
[multiqc]="${multiqc}" \
[prepDE]="${prepDE}" \
[samtools]="${samtools}" \
[stringtie]="${stringtie}" \
)
# Print formatting
export line="--------------------------------------------------------"
3 Align reads using HISAT2
This requires usage of the rna_metadata.csv
This step has a lengthy, semi-complex workflow:
- Parse
rna_metadata.csv
for P.tuahiniensis sample names and time point. This info will be used for downstream file naming and to assing the time point to the read group (SM:
) in the alignment file. - Loop through all samples and perform individual alignments using HISAT2.
- HISAT2 output is piped to through multiple samtools tools: flagstat (stats aggregation), sort (creates/sorts BAM), index (creates BAM index). Piping saves time and disk space, by avoiding the generation of large SAM files.
- Loop continues and runs StringTie on sorted BAM file to produce individual GTF file.
- Loop continues and adds GTF path/filename to a list file, which will be used downstream.
# Load bash variables into memory
source .bashvars
# Make output directories, if they don't exist
mkdir --parents "${output_dir_top}"
# Change to ouput directory
cd "${output_dir_top}"
# Create associative array with sample and timepoint
metadata="../../../M-multi-species/data/rna_metadata.csv"
# Declare the array
declare -A sample_timepoint_map
# Read the metadata file line by line
while IFS=',' read -r sample_number sample_name plate well_number azenta_sample_name colony_id timepoint sample_type species_strain SampleBuffer; do
# Check if the species is "Pocillopora tuahiniensis"
if [[ "${species_strain}" == "Pocillopora tuahiniensis" ]]; then
# Add the Azenta sample name as the key and Timepoint as the value in the associative array
sample_timepoint_map["${colony_id}-${timepoint}"]="${timepoint}"
fi
done < <(tail -n +2 "${metadata}") # Skip the header
## Populate trimmed reads arrays
fastq_array_R1=("${trimmed_fastqs_dir}"/${R1_fastq_pattern})
fastq_array_R2=("${trimmed_fastqs_dir}"/${R2_fastq_pattern})
############## BEGIN HISAT2 ALIGNMENTS ##############
# Loop through array using sample names (e.g. <colony_ID>-<timepoint>)
for sample in "${!sample_timepoint_map[@]}"
do
# Create and switch to dedicated sample directory
mkdir --parents "${sample}" && cd "$_"
# Create HISAT2 list of fastq R1 files
# and generated MD5 checksums file.
for fastq in "${fastq_array_R1[@]}"
do
# Parse sample name from FastQ filename
fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}')
# Process matching FastQ file, based on sample name
if [ "${fastq_sample}" == "${sample}" ]; then
# Generate checksum/list of input files used
md5sum "${fastq}" >> input_fastqs_checksums.md5
# Create comma-separated lists of FastQs for HISAT2
printf -v joined_R1 '%s,' "${fastq}"
fastq_list_R1=$(echo "${joined_R1%,}")
fi
done
# Create HISAT2 list of fastq R1 files
# and generated MD5 checksums file.
for fastq in "${fastq_array_R2[@]}"
do
# Parse sample name from FastQ filename
fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}')
# Process matching FastQ file, based on sample name
if [ "${fastq_sample}" == "${sample}" ]; then
# Generate checksum/list of input files used
md5sum "${fastq}" >> input_fastqs_checksums.md5
# Create comma-separated lists of FastQs for HISAT2
printf -v joined_R2 '%s,' "${fastq}"
fastq_list_R2=$(echo "${joined_R2%,}")
fi
done
# HISAT2 alignments
# Sets read group info (RG) using samples array
"${programs_array[hisat2]}" \
-x "${genome_index_name}" \
-1 "${fastq_list_R1}" \
-2 "${fastq_list_R2}" \
--threads "${threads}" \
--rg-id "${sample}" \
--rg "SM:""${sample_timepoint_map[$sample]}" \
2> "${sample}"_hisat2.stats \
| tee >(${programs_array[samtools]} flagstat - > "${sample}"-hisat2_output.flagstat) \
| ${programs_array[samtools]} sort - -@ "${threads}" -O BAM \
| tee "${sample}".sorted.bam \
| ${programs_array[samtools]} index - "${sample}".sorted.bam.bai
# Run stringtie on alignments
# Uses "-B" option to output tables intended for use in Ballgown
# Uses "-e" option; recommended when using "-B" option.
# Limits analysis to only reads alignments matching reference.
"${programs_array[stringtie]}" "${sample}".sorted.bam \
-p "${threads}" \
-o "${sample}".gtf \
-G "${genome_gff}" \
-C "${sample}.cov_refs.gtf" \
-B \
-e
# Add GTFs to list file, only if non-empty
# Identifies GTF files that only have header
gtf_lines=$(wc -l < "${sample}".gtf )
if [ "${gtf_lines}" -gt 2 ]; then
echo "$(pwd)/${sample}.gtf" >> "${gtf_list}"
fi
# Generate checksums
find ./ -type f -not -name "*.md5" -exec md5sum {} \; > ${sample}_checksums.md5
# Move up to orig. working directory
cd ..
done
3.1 Review HISAT2 Output
View the resulting directory structure of resulting from the HISAT2/StringTie process.
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Display HISAT2 output directory structure
# with directory (--du) and file sizes (-h)
tree --du -h
[4.3G] [01;34m.[0m
├── [ 607] [00mchecksums.md5[0m
├── [5.5K] [00mgtf_list.txt[0m
├── [308K] [01;34mmultiqc_data[0m
│ ├── [5.1K] [00mmultiqc_bowtie2.txt[0m
│ ├── [ 307] [00mmultiqc_citations.txt[0m
│ ├── [268K] [00mmultiqc_data.json[0m
│ ├── [3.2K] [00mmultiqc_general_stats.txt[0m
│ ├── [4.1K] [00mmultiqc.log[0m
│ ├── [7.8K] [00mmultiqc_samtools_flagstat.txt[0m
│ └── [ 16K] [00mmultiqc_sources.txt[0m
├── [1.1M] [00mmultiqc_report.html[0m
├── [113M] [01;34mPOC-201-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-201-TP1_checksums.md5[0m
│ ├── [ 13M] [00mPOC-201-TP1.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-201-TP1.gtf[0m
│ ├── [ 449] [00mPOC-201-TP1-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-201-TP1_hisat2.stats[0m
│ ├── [628K] [00mPOC-201-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [113M] [01;34mPOC-201-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-201-TP2_checksums.md5[0m
│ ├── [ 13M] [00mPOC-201-TP2.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-201-TP2.gtf[0m
│ ├── [ 449] [00mPOC-201-TP2-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOC-201-TP2_hisat2.stats[0m
│ ├── [680K] [00mPOC-201-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [105M] [01;34mPOC-201-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-201-TP3_checksums.md5[0m
│ ├── [6.1M] [00mPOC-201-TP3.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-201-TP3.gtf[0m
│ ├── [ 448] [00mPOC-201-TP3-hisat2_output.flagstat[0m
│ ├── [ 635] [00mPOC-201-TP3_hisat2.stats[0m
│ ├── [427K] [00mPOC-201-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [111M] [01;34mPOC-219-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-219-TP1_checksums.md5[0m
│ ├── [ 12M] [00mPOC-219-TP1.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-219-TP1.gtf[0m
│ ├── [ 448] [00mPOC-219-TP1-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-219-TP1_hisat2.stats[0m
│ ├── [637K] [00mPOC-219-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [109M] [01;34mPOC-219-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-219-TP2_checksums.md5[0m
│ ├── [9.8M] [00mPOC-219-TP2.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-219-TP2.gtf[0m
│ ├── [ 448] [00mPOC-219-TP2-hisat2_output.flagstat[0m
│ ├── [ 635] [00mPOC-219-TP2_hisat2.stats[0m
│ ├── [580K] [00mPOC-219-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [102M] [01;34mPOC-219-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-219-TP3_checksums.md5[0m
│ ├── [4.2M] [00mPOC-219-TP3.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-219-TP3.gtf[0m
│ ├── [ 442] [00mPOC-219-TP3-hisat2_output.flagstat[0m
│ ├── [ 634] [00mPOC-219-TP3_hisat2.stats[0m
│ ├── [364K] [00mPOC-219-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [108M] [01;34mPOC-219-TP4[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-219-TP4_checksums.md5[0m
│ ├── [8.5M] [00mPOC-219-TP4.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-219-TP4.gtf[0m
│ ├── [ 448] [00mPOC-219-TP4-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOC-219-TP4_hisat2.stats[0m
│ ├── [512K] [00mPOC-219-TP4.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [114M] [01;34mPOC-222-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-222-TP1_checksums.md5[0m
│ ├── [ 14M] [00mPOC-222-TP1.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-222-TP1.gtf[0m
│ ├── [ 449] [00mPOC-222-TP1-hisat2_output.flagstat[0m
│ ├── [ 637] [00mPOC-222-TP1_hisat2.stats[0m
│ ├── [664K] [00mPOC-222-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [110M] [01;34mPOC-222-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-222-TP2_checksums.md5[0m
│ ├── [ 11M] [00mPOC-222-TP2.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-222-TP2.gtf[0m
│ ├── [ 449] [00mPOC-222-TP2-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOC-222-TP2_hisat2.stats[0m
│ ├── [624K] [00mPOC-222-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [114M] [01;34mPOC-222-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-222-TP3_checksums.md5[0m
│ ├── [ 14M] [00mPOC-222-TP3.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-222-TP3.gtf[0m
│ ├── [ 449] [00mPOC-222-TP3-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-222-TP3_hisat2.stats[0m
│ ├── [717K] [00mPOC-222-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [108M] [01;34mPOC-222-TP4[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-222-TP4_checksums.md5[0m
│ ├── [9.2M] [00mPOC-222-TP4.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-222-TP4.gtf[0m
│ ├── [ 448] [00mPOC-222-TP4-hisat2_output.flagstat[0m
│ ├── [ 636] [00mPOC-222-TP4_hisat2.stats[0m
│ ├── [549K] [00mPOC-222-TP4.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [112M] [01;34mPOC-255-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-255-TP1_checksums.md5[0m
│ ├── [ 13M] [00mPOC-255-TP1.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-255-TP1.gtf[0m
│ ├── [ 449] [00mPOC-255-TP1-hisat2_output.flagstat[0m
│ ├── [ 637] [00mPOC-255-TP1_hisat2.stats[0m
│ ├── [607K] [00mPOC-255-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [110M] [01;34mPOC-255-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-255-TP2_checksums.md5[0m
│ ├── [ 10M] [00mPOC-255-TP2.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-255-TP2.gtf[0m
│ ├── [ 448] [00mPOC-255-TP2-hisat2_output.flagstat[0m
│ ├── [ 635] [00mPOC-255-TP2_hisat2.stats[0m
│ ├── [592K] [00mPOC-255-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [104M] [01;34mPOC-255-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-255-TP3_checksums.md5[0m
│ ├── [5.2M] [00mPOC-255-TP3.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-255-TP3.gtf[0m
│ ├── [ 448] [00mPOC-255-TP3-hisat2_output.flagstat[0m
│ ├── [ 639] [00mPOC-255-TP3_hisat2.stats[0m
│ ├── [513K] [00mPOC-255-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [112M] [01;34mPOC-255-TP4[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-255-TP4_checksums.md5[0m
│ ├── [ 13M] [00mPOC-255-TP4.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-255-TP4.gtf[0m
│ ├── [ 448] [00mPOC-255-TP4-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-255-TP4_hisat2.stats[0m
│ ├── [650K] [00mPOC-255-TP4.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [113M] [01;34mPOC-259-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-259-TP1_checksums.md5[0m
│ ├── [ 13M] [00mPOC-259-TP1.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-259-TP1.gtf[0m
│ ├── [ 449] [00mPOC-259-TP1-hisat2_output.flagstat[0m
│ ├── [ 635] [00mPOC-259-TP1_hisat2.stats[0m
│ ├── [605K] [00mPOC-259-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [113M] [01;34mPOC-259-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-259-TP2_checksums.md5[0m
│ ├── [ 14M] [00mPOC-259-TP2.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-259-TP2.gtf[0m
│ ├── [ 449] [00mPOC-259-TP2-hisat2_output.flagstat[0m
│ ├── [ 637] [00mPOC-259-TP2_hisat2.stats[0m
│ ├── [660K] [00mPOC-259-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [108M] [01;34mPOC-259-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-259-TP3_checksums.md5[0m
│ ├── [8.8M] [00mPOC-259-TP3.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-259-TP3.gtf[0m
│ ├── [ 448] [00mPOC-259-TP3-hisat2_output.flagstat[0m
│ ├── [ 640] [00mPOC-259-TP3_hisat2.stats[0m
│ ├── [587K] [00mPOC-259-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [113M] [01;34mPOC-259-TP4[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 400] [00minput_fastqs_checksums.md5[0m
│ ├── [ 601] [00mPOC-259-TP4_checksums.md5[0m
│ ├── [ 13M] [00mPOC-259-TP4.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-259-TP4.gtf[0m
│ ├── [ 448] [00mPOC-259-TP4-hisat2_output.flagstat[0m
│ ├── [ 637] [00mPOC-259-TP4_hisat2.stats[0m
│ ├── [644K] [00mPOC-259-TP4.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [111M] [01;34mPOC-40-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-40-TP1_checksums.md5[0m
│ ├── [ 11M] [00mPOC-40-TP1.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-40-TP1.gtf[0m
│ ├── [ 449] [00mPOC-40-TP1-hisat2_output.flagstat[0m
│ ├── [ 635] [00mPOC-40-TP1_hisat2.stats[0m
│ ├── [585K] [00mPOC-40-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [111M] [01;34mPOC-40-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-40-TP2_checksums.md5[0m
│ ├── [ 11M] [00mPOC-40-TP2.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-40-TP2.gtf[0m
│ ├── [ 449] [00mPOC-40-TP2-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-40-TP2_hisat2.stats[0m
│ ├── [599K] [00mPOC-40-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [112M] [01;34mPOC-40-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-40-TP3_checksums.md5[0m
│ ├── [ 13M] [00mPOC-40-TP3.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-40-TP3.gtf[0m
│ ├── [ 449] [00mPOC-40-TP3-hisat2_output.flagstat[0m
│ ├── [ 637] [00mPOC-40-TP3_hisat2.stats[0m
│ ├── [646K] [00mPOC-40-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [111M] [01;34mPOC-40-TP4[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-40-TP4_checksums.md5[0m
│ ├── [ 11M] [00mPOC-40-TP4.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-40-TP4.gtf[0m
│ ├── [ 449] [00mPOC-40-TP4-hisat2_output.flagstat[0m
│ ├── [ 637] [00mPOC-40-TP4_hisat2.stats[0m
│ ├── [592K] [00mPOC-40-TP4.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [111M] [01;34mPOC-42-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-42-TP1_checksums.md5[0m
│ ├── [ 11M] [00mPOC-42-TP1.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-42-TP1.gtf[0m
│ ├── [ 449] [00mPOC-42-TP1-hisat2_output.flagstat[0m
│ ├── [ 636] [00mPOC-42-TP1_hisat2.stats[0m
│ ├── [580K] [00mPOC-42-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [111M] [01;34mPOC-42-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-42-TP2_checksums.md5[0m
│ ├── [ 12M] [00mPOC-42-TP2.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-42-TP2.gtf[0m
│ ├── [ 448] [00mPOC-42-TP2-hisat2_output.flagstat[0m
│ ├── [ 636] [00mPOC-42-TP2_hisat2.stats[0m
│ ├── [610K] [00mPOC-42-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [110M] [01;34mPOC-42-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-42-TP3_checksums.md5[0m
│ ├── [ 11M] [00mPOC-42-TP3.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-42-TP3.gtf[0m
│ ├── [ 448] [00mPOC-42-TP3-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-42-TP3_hisat2.stats[0m
│ ├── [584K] [00mPOC-42-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [102M] [01;34mPOC-42-TP4[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-42-TP4_checksums.md5[0m
│ ├── [3.5M] [00mPOC-42-TP4.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-42-TP4.gtf[0m
│ ├── [ 448] [00mPOC-42-TP4-hisat2_output.flagstat[0m
│ ├── [ 635] [00mPOC-42-TP4_hisat2.stats[0m
│ ├── [455K] [00mPOC-42-TP4.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [ 99M] [01;34mPOC-52-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 20M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-52-TP1_checksums.md5[0m
│ ├── [1.6M] [00mPOC-52-TP1.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-52-TP1.gtf[0m
│ ├── [ 442] [00mPOC-52-TP1-hisat2_output.flagstat[0m
│ ├── [ 634] [00mPOC-52-TP1_hisat2.stats[0m
│ ├── [379K] [00mPOC-52-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [110M] [01;34mPOC-52-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-52-TP2_checksums.md5[0m
│ ├── [ 10M] [00mPOC-52-TP2.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-52-TP2.gtf[0m
│ ├── [ 448] [00mPOC-52-TP2-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-52-TP2_hisat2.stats[0m
│ ├── [582K] [00mPOC-52-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [111M] [01;34mPOC-52-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-52-TP3_checksums.md5[0m
│ ├── [ 12M] [00mPOC-52-TP3.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-52-TP3.gtf[0m
│ ├── [ 448] [00mPOC-52-TP3-hisat2_output.flagstat[0m
│ ├── [ 636] [00mPOC-52-TP3_hisat2.stats[0m
│ ├── [558K] [00mPOC-52-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [114M] [01;34mPOC-52-TP4[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-52-TP4_checksums.md5[0m
│ ├── [ 14M] [00mPOC-52-TP4.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-52-TP4.gtf[0m
│ ├── [ 448] [00mPOC-52-TP4-hisat2_output.flagstat[0m
│ ├── [ 636] [00mPOC-52-TP4_hisat2.stats[0m
│ ├── [631K] [00mPOC-52-TP4.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [113M] [01;34mPOC-53-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-53-TP1_checksums.md5[0m
│ ├── [ 13M] [00mPOC-53-TP1.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-53-TP1.gtf[0m
│ ├── [ 449] [00mPOC-53-TP1-hisat2_output.flagstat[0m
│ ├── [ 635] [00mPOC-53-TP1_hisat2.stats[0m
│ ├── [637K] [00mPOC-53-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [112M] [01;34mPOC-53-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-53-TP2_checksums.md5[0m
│ ├── [ 12M] [00mPOC-53-TP2.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-53-TP2.gtf[0m
│ ├── [ 448] [00mPOC-53-TP2-hisat2_output.flagstat[0m
│ ├── [ 635] [00mPOC-53-TP2_hisat2.stats[0m
│ ├── [590K] [00mPOC-53-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [113M] [01;34mPOC-53-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-53-TP3_checksums.md5[0m
│ ├── [ 13M] [00mPOC-53-TP3.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-53-TP3.gtf[0m
│ ├── [ 449] [00mPOC-53-TP3-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-53-TP3_hisat2.stats[0m
│ ├── [636K] [00mPOC-53-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [108M] [01;34mPOC-53-TP4[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-53-TP4_checksums.md5[0m
│ ├── [9.4M] [00mPOC-53-TP4.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-53-TP4.gtf[0m
│ ├── [ 448] [00mPOC-53-TP4-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-53-TP4_hisat2.stats[0m
│ ├── [520K] [00mPOC-53-TP4.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [112M] [01;34mPOC-57-TP1[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-57-TP1_checksums.md5[0m
│ ├── [ 12M] [00mPOC-57-TP1.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-57-TP1.gtf[0m
│ ├── [ 449] [00mPOC-57-TP1-hisat2_output.flagstat[0m
│ ├── [ 635] [00mPOC-57-TP1_hisat2.stats[0m
│ ├── [602K] [00mPOC-57-TP1.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [107M] [01;34mPOC-57-TP2[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 12M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-57-TP2_checksums.md5[0m
│ ├── [7.5M] [00mPOC-57-TP2.cov_refs.gtf[0m
│ ├── [ 55M] [00mPOC-57-TP2.gtf[0m
│ ├── [ 449] [00mPOC-57-TP2-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-57-TP2_hisat2.stats[0m
│ ├── [539K] [00mPOC-57-TP2.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [114M] [01;34mPOC-57-TP3[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-57-TP3_checksums.md5[0m
│ ├── [ 14M] [00mPOC-57-TP3.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-57-TP3.gtf[0m
│ ├── [ 449] [00mPOC-57-TP3-hisat2_output.flagstat[0m
│ ├── [ 637] [00mPOC-57-TP3_hisat2.stats[0m
│ ├── [692K] [00mPOC-57-TP3.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [113M] [01;34mPOC-57-TP4[0m
│ ├── [2.4M] [00me2t.ctab[0m
│ ├── [ 21M] [00me_data.ctab[0m
│ ├── [2.0M] [00mi2t.ctab[0m
│ ├── [ 13M] [00mi_data.ctab[0m
│ ├── [ 398] [00minput_fastqs_checksums.md5[0m
│ ├── [ 595] [00mPOC-57-TP4_checksums.md5[0m
│ ├── [ 13M] [00mPOC-57-TP4.cov_refs.gtf[0m
│ ├── [ 56M] [00mPOC-57-TP4.gtf[0m
│ ├── [ 449] [00mPOC-57-TP4-hisat2_output.flagstat[0m
│ ├── [ 638] [00mPOC-57-TP4_hisat2.stats[0m
│ ├── [636K] [00mPOC-57-TP4.sorted.bam.bai[0m
│ └── [5.8M] [00mt_data.ctab[0m
├── [ 57M] [00mPocillopora_meandrina_HIv1.assembly.stringtie.gtf[0m
├── [5.9K] [00mprepDE-sample_list.txt[0m
├── [5.1M] [00mptua-gene_count_matrix.csv[0m
├── [5.1M] [00mptua-transcript_count_matrix.csv[0m
├── [6.6M] [00mptua-transcript_count_matrix_with_gene_ids.csv[0m
├── [3.3K] [00mREADME.md[0m
├── [1.4K] [00msorted_bams.list[0m
└── [8.3M] [00msorted-bams-merged.bam.bai[0m
8.5G used in 41 directories, 486 files
3.2 MultiQC alignment rates
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
${programs_array[multiqc]} \
\
--interactive .
4 Merge sorted BAMs
Merge all BAMs to singular BAM for use in transcriptome assembly later, if needed.
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
## Create list of sorted BAMs for merging
find . -name "*sorted.bam" > sorted_bams.list
## Merge sorted BAMs
${programs_array[samtools]} merge \
\
-b sorted_bams.list ${merged_bam} \
${threads}
--threads
## Index merged BAM
${programs_array[samtools]} index ${merged_bam}
5 Create combined GTF
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Create singular transcript file, using GTF list file
"${programs_array[stringtie]}" --merge \
"${gtf_list}" \
"${threads}" \
-p "${genome_gff}" \
-G "${genome_index_name}".stringtie.gtf -o
6 Create DESeq2 Count Matrices
# Load bash variables into memory
source .bashvars
# Change to output directory
cd "${output_dir_top}"
# Check if prepDE-sample_list.txt exists
if [ -f prepDE-sample_list.txt ]; then
gtf_lines=$(wc -l < gtf_list.txt)
prepde_lines=$(wc -l < prepDE-sample_list.txt)
if [ "$gtf_lines" -eq "$prepde_lines" ]; then
echo "prepDE-sample_list.txt exists and line count matches gtf_list.txt. Skipping sample list creation."
else
echo "prepDE-sample_list.txt exists but line count does not match. Regenerating sample list."
rm prepDE-sample_list.txt
while read -r line
do
sample_no_path=${line##*/}
sample=${sample_no_path%.*}
echo ${sample} ${line}
done < gtf_list.txt >> prepDE-sample_list.txt
fi
else
echo "prepDE-sample_list.txt does not exist. Creating sample list."
while read -r line
do
sample_no_path=${line##*/}
sample=${sample_no_path%.*}
echo ${sample} ${line}
done < gtf_list.txt >> prepDE-sample_list.txt
fi
# Create count matrices for genes and transcripts
# Compatible with import to DESeq2
python3 "${programs_array[prepDE]}" \
\
--input=prepDE-sample_list.txt \
-g ptua-gene_count_matrix.csv \
-t ptua-transcript_count_matrix.csv ${read_length} --length=
prepDE-sample_list.txt exists and line count matches gtf_list.txt. Skipping sample list creation.
/home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py3:69: SyntaxWarning: invalid escape sequence '\-'
RE_COVERAGE=re.compile('cov "([\-\+\d\.]+)"')
/home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py3:72: SyntaxWarning: invalid escape sequence '\-'
RE_GFILE=re.compile('\-G\s*(\S+)') #assume filepath without spaces..
7 Add Gene IDs to Transcript Count Matrix
Create an enhanced version of the transcript count matrix that includes the corresponding gene ID for each transcript.
# Load bash variables into memory
source .bashvars
# Change to output directory
cd "${output_dir_top}"
# Extract transcript_id to ref_gene_id mapping from GTF file
echo "Extracting transcript to gene ID mappings from GTF file..."
grep -E $'\ttranscript\t' "${genome_index_name}.stringtie.gtf" | \
awk -F'\t' '{
# Extract transcript_id
match($9, /transcript_id "([^"]+)"/, transcript_arr)
transcript_id = transcript_arr[1]
# Extract ref_gene_id
match($9, /ref_gene_id "([^"]+)"/, gene_arr)
ref_gene_id = gene_arr[1]
# Print mapping if both IDs found
if (transcript_id && ref_gene_id) {
print transcript_id "\t" ref_gene_id
}
}' > transcript_gene_mapping.txt
# Check if mapping file was created successfully
if [ ! -s transcript_gene_mapping.txt ]; then
echo "Error: Failed to create transcript-gene mapping file"
exit 1
fi
echo "Found $(wc -l < transcript_gene_mapping.txt) transcript-gene mappings"
# Create enhanced transcript count matrix with gene IDs
echo "Creating enhanced transcript count matrix with gene IDs..."
# Read header from original transcript count matrix
head -1 ptua-transcript_count_matrix.csv > temp_header.txt
# Add ref_gene_id to header (insert after transcript_id column)
sed 's/transcript_id,/transcript_id,ref_gene_id,/' temp_header.txt > ptua-transcript_count_matrix_with_gene_ids.csv
# Process data rows
tail -n +2 ptua-transcript_count_matrix.csv | while IFS=',' read -r transcript_id rest_of_line; do
# Look up gene ID for this transcript
gene_id=$(grep "^${transcript_id}[[:space:]]" transcript_gene_mapping.txt | cut -f2)
# If no gene ID found, use empty field
if [ -z "$gene_id" ]; then
gene_id=""
fi
# Output line with gene ID inserted after transcript ID
echo "${transcript_id},${gene_id},${rest_of_line}"
done >> ptua-transcript_count_matrix_with_gene_ids.csv
# Generate summary statistics
echo ""
echo "Summary of enhanced transcript count matrix:"
original_lines=$(wc -l < ptua-transcript_count_matrix.csv)
enhanced_lines=$(wc -l < ptua-transcript_count_matrix_with_gene_ids.csv)
echo "Original transcript count matrix lines: $original_lines"
echo "Enhanced transcript count matrix lines: $enhanced_lines"
# Count transcripts with and without gene ID mappings
transcripts_with_genes=$(tail -n +2 ptua-transcript_count_matrix_with_gene_ids.csv | awk -F',' '$2 != ""' | wc -l)
transcripts_without_genes=$(tail -n +2 ptua-transcript_count_matrix_with_gene_ids.csv | awk -F',' '$2 == ""' | wc -l)
echo "Number of transcripts with gene ID mappings: $transcripts_with_genes"
echo "Number of transcripts without gene ID mappings: $transcripts_without_genes"
# Display first few lines of enhanced matrix
echo ""
echo "First 5 rows of enhanced transcript count matrix (first 5 columns):"
head -6 ptua-transcript_count_matrix_with_gene_ids.csv | cut -d',' -f1-5
# Clean up temporary files
rm temp_header.txt transcript_gene_mapping.txt
echo ""
echo "Enhanced transcript count matrix saved as: ptua-transcript_count_matrix_with_gene_ids.csv"
Extracting transcript to gene ID mappings from GTF file...
Found 31840 transcript-gene mappings
Creating enhanced transcript count matrix with gene IDs...
Summary of enhanced transcript count matrix:
Original transcript count matrix lines: 31841
Enhanced transcript count matrix lines: 31841
Number of transcripts with gene ID mappings: 31840
Number of transcripts without gene ID mappings: 0
First 5 rows of enhanced transcript count matrix (first 5 columns):
transcript_id,ref_gene_id,POC-201-TP1,POC-201-TP2,POC-201-TP3
mrna-Pocillopora_meandrina_HIv1___RNAseq.g20905.t1,gene-Pocillopora_meandrina_HIv1___RNAseq.g20905.t1,54,170,18
mrna-Pocillopora_meandrina_HIv1___RNAseq.g20902.t1,gene-Pocillopora_meandrina_HIv1___RNAseq.g20902.t1,957,759,143
mrna-Pocillopora_meandrina_HIv1___RNAseq.g20903.t1,gene-Pocillopora_meandrina_HIv1___RNAseq.g20903.t1,249,452,381
mrna-Pocillopora_meandrina_HIv1___TS.g25664.t1,gene-Pocillopora_meandrina_HIv1___TS.g25664.t1,365,191,13
mrna-Pocillopora_meandrina_HIv1___RNAseq.g20907.t1,gene-Pocillopora_meandrina_HIv1___RNAseq.g20907.t1,187,435,137
Enhanced transcript count matrix saved as: ptua-transcript_count_matrix_with_gene_ids.csv
8 Generate checksums
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Uses find command to avoid passing
# directory names to the md5sum command.
find . -maxdepth 1 -type f -exec md5sum {} + \
| tee --append checksums.md5