Note

This notebook entry is knitted from urol-e5/timeseries_molecular/D-Apul/code/02.20-D-Apul-RNAseq-alignment-HiSat2.Rmd (GitHub), commit 5398f74.

1 INTRODUCTION

This notebook aligns trimmed A.pulchra RNA-seq data to the A.pulchra genome using HISAT2 (Kim et al. 2019). Alignment is followed by StringTie (Pertea et al. 2015, 2016) for transcript assembly/identification and generation of count matrices for downstream expression analysis with DESeq2 and/or Ballgown (https://github.com/alyssafrazee/ballgown).

Since the BAM files produced by this notebook are too large for GitHub, they can be accessed on our server here:

https://gannet.fish.washington.edu/Atumefaciens/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2/

Input(s)

Trimmed FastQ files, with format: *fastp-trim.fq.gz
HISAT2 genome index: Apulchra-genome
Genome GTF: Apulchra-genome.gtf
Sample metadata: M-multi-species/data/rna_metadata.csv

Outputs:

Primary:
- checksums.md5: MD5 checksum for all files in this directory. Excludes subdirectories.
- apul-gene_count_matrix.csv: Gene count matrix for use in DESeq2.
- apul-transcript_count_matrix.csv: Transcript count matrix for use in DESeq2.
- apul-transcript_count_matrix_with_gene_ids.csv: Transcript count matrix which includes corresponding gene IDs.
- prepDE-sample_list.txt: Sample file list provided as input to StringTie for DESeq2 count matrix generation. Also serves as documentation of which files were used for this step.
- Apulchra-genome.stringtie.gtf: Canonical StringTie GTF file compiled from all individual sample GTFs.
- sorted-bams-merged.bam: Merged (and sorted) BAM consisting of all individual sample BAMs.
- sorted-bams-merged.bam.bai: BAM index file. Useful for visualizing assemblies in IGV.
- sorted_bams.list: List file needed for merging of BAMS with samtools. Also serves as documentation of which files were used for this step.
- multiqc_report.html: MultiQC report aggregating all individual HISAT2 alignment stats and samtools flagstats.
- gtf_list.txt: List file needed for merging of GTF files with StringTie. Also serves as documentation of which files were used for this step.
Individuals:

Each subdirectory is labelled based on sample name and each contains individual HISAT2 alignment and StringTie output files.

<sample_name>_checksums.md5: MD5 checksums for all files in the directory.
*.ctab: Data tables formatted for import into Ballgown.
<sample_name>.cov_refs.gtf: StringTie genome reference sequence coverage GTF.
<sample_name>.gtf: StringTie GTF.
<sample_name>.sorted.bam: Sorted HISAT2 alignment BAM.
<sample_name>.sorted.bam.bai: BAM index file. Useful for visualizing assemblies in IGV.
<sample_name>-hisat2_output.flagstat: samtools flagstat output file.
<sample_name>_hisat2.stats: HISAT2 alignment stats.
input_fastqs_checksums.md5: MD5 checksums of files used as input for assembly. Primarily serves as documentation to track/verify which files were actually used.

2 Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export timeseries_dir=/home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular'
echo 'export genome_dir="${timeseries_dir}/D-Apul/data"'
echo 'export genome_index_dir="${timeseries_dir}/D-Apul/output/02.10-D-Apul-RNAseq-genome-index-HiSat2"'
echo 'export output_dir_top="${timeseries_dir}/D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2"'
echo 'export trimmed_fastqs_dir="${timeseries_dir}/D-Apul/output/01.00-D-Apul-RNAseq-trimming-fastp-FastQC-MultiQC"'
echo 'export trimmed_reads_url="https://gannet.fish.washington.edu/Atumefaciens/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/01.00-D-Apul-RNAseq-trimming-fastp-FastQC-MultiQC"'
echo ""

echo "# Location of Hisat2 index files"
echo "# Must keep variable name formatting, as it's used by HiSat2"
echo 'export HISAT2_INDEXES="${genome_index_dir}"'


echo "# Input files"
echo 'export exons="${output_dir_top}/Apulchra-genome_hisat2_exons.tab"'
echo 'export genome_index_name="Apulchra-genome"'
echo 'export genome_gff="${genome_dir}/Apulchra-genome.gff"'
echo 'export genome_fasta="${genome_dir}/Apulchra-genome.fa"'
echo 'export splice_sites="${output_dir_top}/Apulchra-genome_hisat2_splice_sites.tab"'
echo 'export transcripts_gtf="${genome_dir}/Apulchra-genome.gtf"'

echo "# Output files"
echo 'export gtf_list="${output_dir_top}/gtf_list.txt"'
echo 'export merged_bam="${output_dir_top}/sorted-bams-merged.bam"'
echo ""

echo "# Paths to programs"
echo 'export programs_dir="/home/shared"'
echo 'export hisat2_dir="${programs_dir}/hisat2-2.2.1"'

echo 'export hisat2="${hisat2_dir}/hisat2"'

echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'

echo 'export samtools="${programs_dir}/samtools-1.12/samtools"'

echo 'export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"'
echo 'export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"'

echo ""

echo "# Set FastQ filename patterns"
echo "export R1_fastq_pattern='*_R1_*.fq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fq.gz'"
echo "export trimmed_fastq_pattern='*fastp-trim.fq.gz'"
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=47'
echo ""

echo "# Set average read length - for StringTie prepDE.py"
echo 'export read_length=125'
echo ""


echo "## Initialize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo "declare -A sample_timepoint_map"
echo ""

echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[hisat2]="${hisat2}" \'
echo '[multiqc]="${multiqc}" \'
echo '[prepDE]="${prepDE}" \'
echo '[samtools]="${samtools}" \'
echo '[stringtie]="${stringtie}" \'
echo ")"
echo ""

echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars

cat .bashvars

#### Assign Variables ####

# Data directories
export timeseries_dir=/home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular
export genome_dir="${timeseries_dir}/D-Apul/data"
export genome_index_dir="${timeseries_dir}/D-Apul/output/02.10-D-Apul-RNAseq-genome-index-HiSat2"
export output_dir_top="${timeseries_dir}/D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2"
export trimmed_fastqs_dir="${timeseries_dir}/D-Apul/output/01.00-D-Apul-RNAseq-trimming-fastp-FastQC-MultiQC"
export trimmed_reads_url="https://gannet.fish.washington.edu/Atumefaciens/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/01.00-D-Apul-RNAseq-trimming-fastp-FastQC-MultiQC"

# Location of Hisat2 index files
# Must keep variable name formatting, as it's used by HiSat2
export HISAT2_INDEXES="${genome_index_dir}"
# Input files
export exons="${output_dir_top}/Apulchra-genome_hisat2_exons.tab"
export genome_index_name="Apulchra-genome"
export genome_gff="${genome_dir}/Apulchra-genome.gff"
export genome_fasta="${genome_dir}/Apulchra-genome.fa"
export splice_sites="${output_dir_top}/Apulchra-genome_hisat2_splice_sites.tab"
export transcripts_gtf="${genome_dir}/Apulchra-genome.gtf"
# Output files
export gtf_list="${output_dir_top}/gtf_list.txt"
export merged_bam="${output_dir_top}/sorted-bams-merged.bam"

# Paths to programs
export programs_dir="/home/shared"
export hisat2_dir="${programs_dir}/hisat2-2.2.1"
export hisat2="${hisat2_dir}/hisat2"
export multiqc=/home/sam/programs/mambaforge/bin/multiqc
export samtools="${programs_dir}/samtools-1.12/samtools"
export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"
export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"

# Set FastQ filename patterns
export R1_fastq_pattern='*_R1_*.fq.gz'
export R2_fastq_pattern='*_R2_*.fq.gz'
export trimmed_fastq_pattern='*fastp-trim.fq.gz'

# Set number of CPUs to use
export threads=47

# Set average read length - for StringTie prepDE.py
export read_length=125

## Initialize arrays
export fastq_array_R1=()
export fastq_array_R2=()
export R1_names_array=()
export R2_names_array=()
declare -A sample_timepoint_map

# Programs associative array
declare -A programs_array
programs_array=(
[hisat2]="${hisat2}" \
[multiqc]="${multiqc}" \
[prepDE]="${prepDE}" \
[samtools]="${samtools}" \
[stringtie]="${stringtie}" \
)

# Print formatting
export line="--------------------------------------------------------"
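
Since later chunks call every tool through the paths stored in .bashvars, it can help to confirm those paths before starting a long alignment run. The following is a minimal sketch and is not part of the original notebook; the function name check_programs is hypothetical.

```shell
# Hypothetical helper (not in the original notebook): confirm that every
# path passed as an argument points to an executable file.
check_programs() {
  missing=0
  for program_path in "$@"; do
    if [ ! -x "${program_path}" ]; then
      # Report each problem path, but keep checking the rest
      echo "Missing or not executable: ${program_path}" >&2
      missing=1
    fi
  done
  # Returns 0 if all paths are executable, 1 otherwise
  return "${missing}"
}
```

After `source .bashvars`, a chunk could call `check_programs "${programs_array[@]}" || exit 1` to stop early if any program is missing.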

If needed, download trimmed RNA-seq.

Change eval=FALSE to eval=TRUE to execute the next two chunks to download RNA-seq and then verify MD5 checksums.

# Load bash variables into memory
source .bashvars

# Make output directory if it doesn't exist
mkdir --parents ${trimmed_fastqs_dir}

# Run wget to retrieve FastQs and MD5 files
wget \
--directory-prefix ${trimmed_fastqs_dir} \
--recursive \
--no-check-certificate \
--continue \
--cut-dirs 3 \
--no-host-directories \
--no-parent \
--quiet \
--accept="*fastp-trim*,*.md5" \
"${trimmed_reads_url}"

ls -lh "${trimmed_fastqs_dir}"

Verify trimmed read checksums

# Load bash variables into memory
source .bashvars

cd "${trimmed_fastqs_dir}"

# Verify checksums
for file in *.md5
do
  md5sum --check "${file}"
done
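
The loop above prints the result for each file but continues even if a checksum fails. As a hedged sketch (not part of the original notebook), the hypothetical function verify_md5_files below wraps the same `md5sum --check` call and returns a non-zero status if any file fails, so a chunk can stop before alignment begins. It assumes GNU coreutils md5sum and that it is run from the directory containing the FastQ files, since paths inside .md5 files are resolved relative to the working directory.

```shell
# Hypothetical helper: run md5sum --check on each given .md5 file and
# return non-zero if any file fails verification.
verify_md5_files() {
  failed=0
  for md5_file in "$@"; do
    # --quiet suppresses the "OK" line for files that verify successfully
    if ! md5sum --check --quiet "${md5_file}"; then
      echo "Checksum verification failed: ${md5_file}" >&2
      failed=1
    fi
  done
  return "${failed}"
}
```

A chunk could then use `verify_md5_files *.md5 || exit 1` in place of the plain loop.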

3 Align reads using HISAT2

3.1 HISAT2 Alignment

This step requires the sample metadata file, rna_metadata.csv.

This step has a lengthy, semi-complex workflow:

Parse rna_metadata.csv for A.pulchra sample names and time point. This info will be used for downstream file naming and to assign the time point to the read group (SM:) in the alignment file.
Loop through all samples and perform individual alignments using HISAT2.
HISAT2 output is piped through multiple samtools tools: flagstat (stats aggregation), sort (creates/sorts BAM), index (creates BAM index). Piping saves time and disk space by avoiding the generation of large SAM files.
Loop continues and runs StringTie on sorted BAM file to produce individual GTF file.
Loop continues and adds GTF path/filename to a list file, which will be used downstream.
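
The alignment chunk in this section derives the sample name and time point from each FastQ filename with awk. The sketch below shows the same idea using only shell parameter expansion. It is illustrative rather than the notebook's method: the function names and the example filename are hypothetical, and it assumes sample names have the form colony-number-timepoint (e.g. ACR-139-TP1) followed by an underscore.

```shell
# Hypothetical helpers for parsing a trimmed FastQ filename such as
# ACR-139-TP1_R1_001.fastp-trim.fq.gz (example name is an assumption).

parse_sample() {
  # Strip the directory, then drop everything from the first underscore on
  fastq_name=${1##*/}
  printf '%s\n' "${fastq_name%%_*}"
}

parse_timepoint() {
  # The time point is assumed to be the last hyphen-delimited field
  # of the sample name (ACR-139-TP1 -> TP1)
  sample_name=$(parse_sample "$1")
  printf '%s\n' "${sample_name##*-}"
}
```

For a name with exactly three hyphen-delimited fields, the last field is the same value the awk call selects with `$3`.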

# Load bash variables into memory
source .bashvars

# Make output directories, if they don't exist
mkdir --parents "${output_dir_top}"

# Change to output directory
cd "${output_dir_top}"

## Populate trimmed reads arrays
fastq_array_R1=("${trimmed_fastqs_dir}"/${R1_fastq_pattern})
fastq_array_R2=("${trimmed_fastqs_dir}"/${R2_fastq_pattern})

############## BEGIN HISAT2 ALIGNMENTS ##############

for filepath in "${fastq_array_R1[@]}"; do
    filename=${filepath##*/}      # Strip path
    sample=$(echo "$filename" | awk -F"_" '{print $1}')
    timepoint=$(echo "$filename" | awk -F"[_-]" '{print $3}')

    
  # Create and switch to dedicated sample directory
  mkdir --parents "${sample}" && cd "$_"
    
  # Create HISAT2 list of fastq R1 files
  # and generate MD5 checksums file.
  for fastq in "${fastq_array_R1[@]}"
  do

    
    # Parse sample name from FastQ filename
    fastq_sample=$(echo "${fastq##*/}" | awk -F"_" '{print $1}')
    

    
    # Process matching FastQ file, based on sample name
    if [ "${fastq_sample}" == "${sample}" ]; then
      
      # Generate checksum/list of input files used
      md5sum "${fastq}" >> input_fastqs_checksums.md5
      
      # Create comma-separated lists of FastQs for HISAT2
      printf -v joined_R1 '%s,' "${fastq}"
      fastq_list_R1=$(echo "${joined_R1%,}")
    fi
  done

  # Create HISAT2 list of fastq R2 files
  # and append to MD5 checksums file.
  for fastq in "${fastq_array_R2[@]}"
  do
    # Parse sample name from FastQ filename
    fastq_sample=$(echo "${fastq##*/}" | awk -F"_" '{print $1}')
    
    # Process matching FastQ file, based on sample name
    if [ "${fastq_sample}" == "${sample}" ]; then
      
      # Generate checksum/list of input files used
      md5sum "${fastq}" >> input_fastqs_checksums.md5

      # Create comma-separated lists of FastQs for HISAT2
      printf -v joined_R2 '%s,' "${fastq}"
      fastq_list_R2=$(echo "${joined_R2%,}")
    fi
  done



  # HISAT2 alignments
  # Sets read group info (RG) using samples array
  "${programs_array[hisat2]}" \
  -x "${genome_index_name}" \
  -1 "${fastq_list_R1}" \
  -2 "${fastq_list_R2}" \
  --threads "${threads}" \
  --rg-id "${sample}" \
  --rg "SM:""${timepoint}" \
  2> "${sample}"_hisat2.stats \
  | tee >(${programs_array[samtools]} flagstat - > "${sample}"-hisat2_output.flagstat) \
  | ${programs_array[samtools]} sort - -@ "${threads}" -O BAM \
  | tee "${sample}".sorted.bam \
  | ${programs_array[samtools]} index - "${sample}".sorted.bam.bai
  
  
  # Run stringtie on alignments
  # Uses "-B" option to output tables intended for use in Ballgown
  # Uses "-e" option; recommended when using "-B" option.
  # Limits analysis to read alignments matching the reference.
  "${programs_array[stringtie]}" "${sample}".sorted.bam \
  -p "${threads}" \
  -o "${sample}".gtf \
  -G "${genome_gff}" \
  -C "${sample}.cov_refs.gtf" \
  -B \
  -e 
  
  
  # Add GTFs to list file, only if non-empty
  # Identifies GTF files that only have header
  gtf_lines=$(wc -l < "${sample}".gtf )
  if [ "${gtf_lines}" -gt 2 ]; then
    echo "$(pwd)/${sample}.gtf" >> "${gtf_list}"
  fi 

  # Generate checksums
  find ./ -type f -not -name "*.md5" -exec md5sum {} \; > ${sample}_checksums.md5
  # Move up to orig. working directory
  cd ..

done
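
In the loop above, `printf -v` receives one path per call, so each sample ends up with a single R1 file and a single R2 file in its list. If a sample ever had several FastQ files per read direction, HISAT2's -1 and -2 options accept a comma-separated list. The following is a minimal sketch of building such a list; the function name join_by_comma is hypothetical and not part of the original notebook.

```shell
# Hypothetical helper: join all arguments with commas, the list format
# HISAT2 accepts for its -1 and -2 options.
join_by_comma() {
  # printf repeats the format for each argument: "a,b,c,"
  joined=$(printf '%s,' "$@")
  # Remove the single trailing comma
  printf '%s\n' "${joined%,}"
}
```

Inside the sample loop, the matching paths could be collected in an array and passed as `fastq_list_R1=$(join_by_comma "${matching_R1[@]}")`, where matching_R1 is also a hypothetical name.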

3.2 Review HISAT2 Output

View the directory structure resulting from the HISAT2/StringTie process.

# Load bash variables into memory
source .bashvars

# Change to output directory
cd "${output_dir_top}"

# Display HISAT2 output directory structure
# with directory (--du) and file sizes (-h)
tree --du -h

[138G]  .
├── [1.6G]  ACR-139-TP1
│   ├── [ 601]  ACR-139-TP1_checksums.md5
│   ├── [7.6M]  ACR-139-TP1.cov_refs.gtf
│   ├── [ 34M]  ACR-139-TP1.gtf
│   ├── [ 450]  ACR-139-TP1-hisat2_output.flagstat
│   ├── [ 636]  ACR-139-TP1_hisat2.stats
│   ├── [1.6G]  ACR-139-TP1.sorted.bam
│   ├── [811K]  ACR-139-TP1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  ACR-139-TP2
│   ├── [ 601]  ACR-139-TP2_checksums.md5
│   ├── [8.2M]  ACR-139-TP2.cov_refs.gtf
│   ├── [ 34M]  ACR-139-TP2.gtf
│   ├── [ 449]  ACR-139-TP2-hisat2_output.flagstat
│   ├── [ 637]  ACR-139-TP2_hisat2.stats
│   ├── [1.7G]  ACR-139-TP2.sorted.bam
│   ├── [819K]  ACR-139-TP2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  ACR-139-TP3
│   ├── [ 601]  ACR-139-TP3_checksums.md5
│   ├── [6.3M]  ACR-139-TP3.cov_refs.gtf
│   ├── [ 34M]  ACR-139-TP3.gtf
│   ├── [ 449]  ACR-139-TP3-hisat2_output.flagstat
│   ├── [ 637]  ACR-139-TP3_hisat2.stats
│   ├── [1.5G]  ACR-139-TP3.sorted.bam
│   ├── [827K]  ACR-139-TP3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.9G]  ACR-139-TP4
│   ├── [ 601]  ACR-139-TP4_checksums.md5
│   ├── [6.9M]  ACR-139-TP4.cov_refs.gtf
│   ├── [ 34M]  ACR-139-TP4.gtf
│   ├── [ 449]  ACR-139-TP4-hisat2_output.flagstat
│   ├── [ 637]  ACR-139-TP4_hisat2.stats
│   ├── [1.8G]  ACR-139-TP4.sorted.bam
│   ├── [853K]  ACR-139-TP4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.1G]  ACR-145-TP1
│   ├── [ 601]  ACR-145-TP1_checksums.md5
│   ├── [5.4M]  ACR-145-TP1.cov_refs.gtf
│   ├── [ 34M]  ACR-145-TP1.gtf
│   ├── [ 450]  ACR-145-TP1-hisat2_output.flagstat
│   ├── [ 639]  ACR-145-TP1_hisat2.stats
│   ├── [2.1G]  ACR-145-TP1.sorted.bam
│   ├── [985K]  ACR-145-TP1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.7G]  ACR-145-TP2
│   ├── [ 601]  ACR-145-TP2_checksums.md5
│   ├── [6.3M]  ACR-145-TP2.cov_refs.gtf
│   ├── [ 34M]  ACR-145-TP2.gtf
│   ├── [ 449]  ACR-145-TP2-hisat2_output.flagstat
│   ├── [ 636]  ACR-145-TP2_hisat2.stats
│   ├── [1.6G]  ACR-145-TP2.sorted.bam
│   ├── [735K]  ACR-145-TP2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.9G]  ACR-145-TP3
│   ├── [ 601]  ACR-145-TP3_checksums.md5
│   ├── [3.0M]  ACR-145-TP3.cov_refs.gtf
│   ├── [ 34M]  ACR-145-TP3.gtf
│   ├── [ 449]  ACR-145-TP3-hisat2_output.flagstat
│   ├── [ 640]  ACR-145-TP3_hisat2.stats
│   ├── [1.8G]  ACR-145-TP3.sorted.bam
│   ├── [795K]  ACR-145-TP3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.9G]  ACR-145-TP4
│   ├── [ 601]  ACR-145-TP4_checksums.md5
│   ├── [3.6M]  ACR-145-TP4.cov_refs.gtf
│   ├── [ 34M]  ACR-145-TP4.gtf
│   ├── [ 449]  ACR-145-TP4-hisat2_output.flagstat
│   ├── [ 638]  ACR-145-TP4_hisat2.stats
│   ├── [1.8G]  ACR-145-TP4.sorted.bam
│   ├── [789K]  ACR-145-TP4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.0G]  ACR-150-TP1
│   ├── [ 601]  ACR-150-TP1_checksums.md5
│   ├── [3.0M]  ACR-150-TP1.cov_refs.gtf
│   ├── [ 34M]  ACR-150-TP1.gtf
│   ├── [ 449]  ACR-150-TP1-hisat2_output.flagstat
│   ├── [ 640]  ACR-150-TP1_hisat2.stats
│   ├── [2.0G]  ACR-150-TP1.sorted.bam
│   ├── [807K]  ACR-150-TP1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.7G]  ACR-150-TP2
│   ├── [ 601]  ACR-150-TP2_checksums.md5
│   ├── [7.5M]  ACR-150-TP2.cov_refs.gtf
│   ├── [ 34M]  ACR-150-TP2.gtf
│   ├── [ 449]  ACR-150-TP2-hisat2_output.flagstat
│   ├── [ 637]  ACR-150-TP2_hisat2.stats
│   ├── [1.6G]  ACR-150-TP2.sorted.bam
│   ├── [796K]  ACR-150-TP2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  ACR-150-TP3
│   ├── [ 601]  ACR-150-TP3_checksums.md5
│   ├── [8.0M]  ACR-150-TP3.cov_refs.gtf
│   ├── [ 34M]  ACR-150-TP3.gtf
│   ├── [ 449]  ACR-150-TP3-hisat2_output.flagstat
│   ├── [ 637]  ACR-150-TP3_hisat2.stats
│   ├── [1.6G]  ACR-150-TP3.sorted.bam
│   ├── [798K]  ACR-150-TP3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  ACR-150-TP4
│   ├── [ 601]  ACR-150-TP4_checksums.md5
│   ├── [7.1M]  ACR-150-TP4.cov_refs.gtf
│   ├── [ 34M]  ACR-150-TP4.gtf
│   ├── [ 449]  ACR-150-TP4-hisat2_output.flagstat
│   ├── [ 637]  ACR-150-TP4_hisat2.stats
│   ├── [1.7G]  ACR-150-TP4.sorted.bam
│   ├── [829K]  ACR-150-TP4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  ACR-173-TP1
│   ├── [ 601]  ACR-173-TP1_checksums.md5
│   ├── [4.9M]  ACR-173-TP1.cov_refs.gtf
│   ├── [ 34M]  ACR-173-TP1.gtf
│   ├── [ 451]  ACR-173-TP1-hisat2_output.flagstat
│   ├── [ 637]  ACR-173-TP1_hisat2.stats
│   ├── [1.8G]  ACR-173-TP1.sorted.bam
│   ├── [964K]  ACR-173-TP1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  ACR-173-TP2
│   ├── [ 601]  ACR-173-TP2_checksums.md5
│   ├── [8.0M]  ACR-173-TP2.cov_refs.gtf
│   ├── [ 34M]  ACR-173-TP2.gtf
│   ├── [ 449]  ACR-173-TP2-hisat2_output.flagstat
│   ├── [ 635]  ACR-173-TP2_hisat2.stats
│   ├── [1.5G]  ACR-173-TP2.sorted.bam
│   ├── [759K]  ACR-173-TP2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  ACR-173-TP3
│   ├── [ 601]  ACR-173-TP3_checksums.md5
│   ├── [5.1M]  ACR-173-TP3.cov_refs.gtf
│   ├── [ 34M]  ACR-173-TP3.gtf
│   ├── [ 449]  ACR-173-TP3-hisat2_output.flagstat
│   ├── [ 637]  ACR-173-TP3_hisat2.stats
│   ├── [1.7G]  ACR-173-TP3.sorted.bam
│   ├── [824K]  ACR-173-TP3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.5G]  ACR-173-TP4
│   ├── [ 601]  ACR-173-TP4_checksums.md5
│   ├── [4.0M]  ACR-173-TP4.cov_refs.gtf
│   ├── [ 34M]  ACR-173-TP4.gtf
│   ├── [ 449]  ACR-173-TP4-hisat2_output.flagstat
│   ├── [ 636]  ACR-173-TP4_hisat2.stats
│   ├── [1.5G]  ACR-173-TP4.sorted.bam
│   ├── [718K]  ACR-173-TP4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.4G]  ACR-186-TP1
│   ├── [ 601]  ACR-186-TP1_checksums.md5
│   ├── [2.3M]  ACR-186-TP1.cov_refs.gtf
│   ├── [ 33M]  ACR-186-TP1.gtf
│   ├── [ 449]  ACR-186-TP1-hisat2_output.flagstat
│   ├── [ 637]  ACR-186-TP1_hisat2.stats
│   ├── [1.4G]  ACR-186-TP1.sorted.bam
│   ├── [635K]  ACR-186-TP1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  ACR-186-TP2
│   ├── [ 601]  ACR-186-TP2_checksums.md5
│   ├── [7.0M]  ACR-186-TP2.cov_refs.gtf
│   ├── [ 34M]  ACR-186-TP2.gtf
│   ├── [ 449]  ACR-186-TP2-hisat2_output.flagstat
│   ├── [ 636]  ACR-186-TP2_hisat2.stats
│   ├── [1.5G]  ACR-186-TP2.sorted.bam
│   ├── [795K]  ACR-186-TP2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.4G]  ACR-186-TP3
│   ├── [ 601]  ACR-186-TP3_checksums.md5
│   ├── [3.5M]  ACR-186-TP3.cov_refs.gtf
│   ├── [ 34M]  ACR-186-TP3.gtf
│   ├── [ 449]  ACR-186-TP3-hisat2_output.flagstat
│   ├── [ 636]  ACR-186-TP3_hisat2.stats
│   ├── [1.3G]  ACR-186-TP3.sorted.bam
│   ├── [705K]  ACR-186-TP3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.7G]  ACR-186-TP4
│   ├── [ 601]  ACR-186-TP4_checksums.md5
│   ├── [6.7M]  ACR-186-TP4.cov_refs.gtf
│   ├── [ 34M]  ACR-186-TP4.gtf
│   ├── [ 449]  ACR-186-TP4-hisat2_output.flagstat
│   ├── [ 637]  ACR-186-TP4_hisat2.stats
│   ├── [1.6G]  ACR-186-TP4.sorted.bam
│   ├── [803K]  ACR-186-TP4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.5G]  ACR-225-TP1
│   ├── [ 601]  ACR-225-TP1_checksums.md5
│   ├── [4.8M]  ACR-225-TP1.cov_refs.gtf
│   ├── [ 34M]  ACR-225-TP1.gtf
│   ├── [ 449]  ACR-225-TP1-hisat2_output.flagstat
│   ├── [ 636]  ACR-225-TP1_hisat2.stats
│   ├── [1.5G]  ACR-225-TP1.sorted.bam
│   ├── [793K]  ACR-225-TP1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  ACR-225-TP2
│   ├── [ 601]  ACR-225-TP2_checksums.md5
│   ├── [6.1M]  ACR-225-TP2.cov_refs.gtf
│   ├── [ 34M]  ACR-225-TP2.gtf
│   ├── [ 450]  ACR-225-TP2-hisat2_output.flagstat
│   ├── [ 637]  ACR-225-TP2_hisat2.stats
│   ├── [1.7G]  ACR-225-TP2.sorted.bam
│   ├── [854K]  ACR-225-TP2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  ACR-225-TP3
│   ├── [ 601]  ACR-225-TP3_checksums.md5
│   ├── [7.0M]  ACR-225-TP3.cov_refs.gtf
│   ├── [ 34M]  ACR-225-TP3.gtf
│   ├── [ 449]  ACR-225-TP3-hisat2_output.flagstat
│   ├── [ 637]  ACR-225-TP3_hisat2.stats
│   ├── [1.8G]  ACR-225-TP3.sorted.bam
│   ├── [831K]  ACR-225-TP3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.4G]  ACR-225-TP4
│   ├── [ 601]  ACR-225-TP4_checksums.md5
│   ├── [4.2M]  ACR-225-TP4.cov_refs.gtf
│   ├── [ 34M]  ACR-225-TP4.gtf
│   ├── [ 449]  ACR-225-TP4-hisat2_output.flagstat
│   ├── [ 636]  ACR-225-TP4_hisat2.stats
│   ├── [1.4G]  ACR-225-TP4.sorted.bam
│   ├── [725K]  ACR-225-TP4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.0G]  ACR-229-TP1
│   ├── [ 601]  ACR-229-TP1_checksums.md5
│   ├── [7.9M]  ACR-229-TP1.cov_refs.gtf
│   ├── [ 34M]  ACR-229-TP1.gtf
│   ├── [ 450]  ACR-229-TP1-hisat2_output.flagstat
│   ├── [ 637]  ACR-229-TP1_hisat2.stats
│   ├── [1.9G]  ACR-229-TP1.sorted.bam
│   ├── [942K]  ACR-229-TP1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.7G]  ACR-229-TP2
│   ├── [ 601]  ACR-229-TP2_checksums.md5
│   ├── [6.9M]  ACR-229-TP2.cov_refs.gtf
│   ├── [ 34M]  ACR-229-TP2.gtf
│   ├── [ 449]  ACR-229-TP2-hisat2_output.flagstat
│   ├── [ 637]  ACR-229-TP2_hisat2.stats
│   ├── [1.6G]  ACR-229-TP2.sorted.bam
│   ├── [792K]  ACR-229-TP2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  ACR-229-TP3
│   ├── [ 601]  ACR-229-TP3_checksums.md5
│   ├── [7.5M]  ACR-229-TP3.cov_refs.gtf
│   ├── [ 34M]  ACR-229-TP3.gtf
│   ├── [ 449]  ACR-229-TP3-hisat2_output.flagstat
│   ├── [ 637]  ACR-229-TP3_hisat2.stats
│   ├── [1.5G]  ACR-229-TP3.sorted.bam
│   ├── [791K]  ACR-229-TP3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.9G]  ACR-229-TP4
│   ├── [ 601]  ACR-229-TP4_checksums.md5
│   ├── [2.5M]  ACR-229-TP4.cov_refs.gtf
│   ├── [ 34M]  ACR-229-TP4.gtf
│   ├── [ 449]  ACR-229-TP4-hisat2_output.flagstat
│   ├── [ 640]  ACR-229-TP4_hisat2.stats
│   ├── [1.9G]  ACR-229-TP4.sorted.bam
│   ├── [809K]  ACR-229-TP4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.9G]  ACR-237-TP1
│   ├── [ 601]  ACR-237-TP1_checksums.md5
│   ├── [4.5M]  ACR-237-TP1.cov_refs.gtf
│   ├── [ 34M]  ACR-237-TP1.gtf
│   ├── [ 449]  ACR-237-TP1-hisat2_output.flagstat
│   ├── [ 638]  ACR-237-TP1_hisat2.stats
│   ├── [1.8G]  ACR-237-TP1.sorted.bam
│   ├── [823K]  ACR-237-TP1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.5G]  ACR-237-TP2
│   ├── [ 601]  ACR-237-TP2_checksums.md5
│   ├── [5.7M]  ACR-237-TP2.cov_refs.gtf
│   ├── [ 34M]  ACR-237-TP2.gtf
│   ├── [ 449]  ACR-237-TP2-hisat2_output.flagstat
│   ├── [ 637]  ACR-237-TP2_hisat2.stats
│   ├── [1.4G]  ACR-237-TP2.sorted.bam
│   ├── [699K]  ACR-237-TP2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  ACR-237-TP3
│   ├── [ 601]  ACR-237-TP3_checksums.md5
│   ├── [8.4M]  ACR-237-TP3.cov_refs.gtf
│   ├── [ 34M]  ACR-237-TP3.gtf
│   ├── [ 449]  ACR-237-TP3-hisat2_output.flagstat
│   ├── [ 636]  ACR-237-TP3_hisat2.stats
│   ├── [1.5G]  ACR-237-TP3.sorted.bam
│   ├── [789K]  ACR-237-TP3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.0G]  ACR-237-TP4
│   ├── [ 601]  ACR-237-TP4_checksums.md5
│   ├── [4.6M]  ACR-237-TP4.cov_refs.gtf
│   ├── [ 34M]  ACR-237-TP4.gtf
│   ├── [ 449]  ACR-237-TP4-hisat2_output.flagstat
│   ├── [ 638]  ACR-237-TP4_hisat2.stats
│   ├── [2.0G]  ACR-237-TP4.sorted.bam
│   ├── [901K]  ACR-237-TP4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 400]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  [01;34mACR-244-TP1[0m
│   ├── [ 601]  [00mACR-244-TP1_checksums.md5[0m
│   ├── [4.1M]  [00mACR-244-TP1.cov_refs.gtf[0m
│   ├── [ 34M]  [00mACR-244-TP1.gtf[0m
│   ├── [ 448]  [00mACR-244-TP1-hisat2_output.flagstat[0m
│   ├── [ 637]  [00mACR-244-TP1_hisat2.stats[0m
│   ├── [1.7G]  [00mACR-244-TP1.sorted.bam[0m
│   ├── [648K]  [00mACR-244-TP1.sorted.bam.bai[0m
│   ├── [2.4M]  [00me2t.ctab[0m
│   ├── [ 15M]  [00me_data.ctab[0m
│   ├── [1.9M]  [00mi2t.ctab[0m
│   ├── [7.2M]  [00mi_data.ctab[0m
│   ├── [ 400]  [00minput_fastqs_checksums.md5[0m
│   └── [3.7M]  [00mt_data.ctab[0m
├── [2.0G]  [01;34mACR-244-TP2[0m
│   ├── [ 601]  [00mACR-244-TP2_checksums.md5[0m
│   ├── [7.3M]  [00mACR-244-TP2.cov_refs.gtf[0m
│   ├── [ 34M]  [00mACR-244-TP2.gtf[0m
│   ├── [ 450]  [00mACR-244-TP2-hisat2_output.flagstat[0m
│   ├── [ 638]  [00mACR-244-TP2_hisat2.stats[0m
│   ├── [1.9G]  [00mACR-244-TP2.sorted.bam[0m
│   ├── [809K]  [00mACR-244-TP2.sorted.bam.bai[0m
│   ├── [2.4M]  [00me2t.ctab[0m
│   ├── [ 15M]  [00me_data.ctab[0m
│   ├── [1.9M]  [00mi2t.ctab[0m
│   ├── [7.3M]  [00mi_data.ctab[0m
│   ├── [ 400]  [00minput_fastqs_checksums.md5[0m
│   └── [3.7M]  [00mt_data.ctab[0m
├── [2.1G]  [01;34mACR-244-TP3[0m
│   ├── [ 601]  [00mACR-244-TP3_checksums.md5[0m
│   ├── [8.5M]  [00mACR-244-TP3.cov_refs.gtf[0m
│   ├── [ 34M]  [00mACR-244-TP3.gtf[0m
│   ├── [ 450]  [00mACR-244-TP3-hisat2_output.flagstat[0m
│   ├── [ 638]  [00mACR-244-TP3_hisat2.stats[0m
│   ├── [2.1G]  [00mACR-244-TP3.sorted.bam[0m
│   ├── [1008K]  [00mACR-244-TP3.sorted.bam.bai[0m
│   ├── [2.4M]  [00me2t.ctab[0m
│   ├── [ 15M]  [00me_data.ctab[0m
│   ├── [1.9M]  [00mi2t.ctab[0m
│   ├── [7.4M]  [00mi_data.ctab[0m
│   ├── [ 400]  [00minput_fastqs_checksums.md5[0m
│   └── [3.7M]  [00mt_data.ctab[0m
├── [1.8G]  [01;34mACR-244-TP4[0m
│   ├── [ 601]  [00mACR-244-TP4_checksums.md5[0m
│   ├── [3.2M]  [00mACR-244-TP4.cov_refs.gtf[0m
│   ├── [ 34M]  [00mACR-244-TP4.gtf[0m
│   ├── [ 449]  [00mACR-244-TP4-hisat2_output.flagstat[0m
│   ├── [ 640]  [00mACR-244-TP4_hisat2.stats[0m
│   ├── [1.8G]  [00mACR-244-TP4.sorted.bam[0m
│   ├── [882K]  [00mACR-244-TP4.sorted.bam.bai[0m
│   ├── [2.4M]  [00me2t.ctab[0m
│   ├── [ 15M]  [00me_data.ctab[0m
│   ├── [1.9M]  [00mi2t.ctab[0m
│   ├── [7.2M]  [00mi_data.ctab[0m
│   ├── [ 400]  [00minput_fastqs_checksums.md5[0m
│   └── [3.7M]  [00mt_data.ctab[0m
├── [2.6G]  [01;34mACR-265-TP1[0m
│   ├── [ 601]  [00mACR-265-TP1_checksums.md5[0m
│   ├── [3.0M]  [00mACR-265-TP1.cov_refs.gtf[0m
│   ├── [ 34M]  [00mACR-265-TP1.gtf[0m
│   ├── [ 450]  [00mACR-265-TP1-hisat2_output.flagstat[0m
│   ├── [ 642]  [00mACR-265-TP1_hisat2.stats[0m
│   ├── [2.5G]  [00mACR-265-TP1.sorted.bam[0m
│   ├── [1002K]  [00mACR-265-TP1.sorted.bam.bai[0m
│   ├── [2.4M]  [00me2t.ctab[0m
│   ├── [ 15M]  [00me_data.ctab[0m
│   ├── [1.9M]  [00mi2t.ctab[0m
│   ├── [7.2M]  [00mi_data.ctab[0m
│   ├── [ 400]  [00minput_fastqs_checksums.md5[0m
│   └── [3.7M]  [00mt_data.ctab[0m
├── [1.9G]  [01;34mACR-265-TP2[0m
│   ├── [ 601]  [00mACR-265-TP2_checksums.md5[0m
│   ├── [6.4M]  [00mACR-265-TP2.cov_refs.gtf[0m
│   ├── [ 34M]  [00mACR-265-TP2.gtf[0m
│   ├── [ 450]  [00mACR-265-TP2-hisat2_output.flagstat[0m
│   ├── [ 637]  [00mACR-265-TP2_hisat2.stats[0m
│   ├── [1.8G]  [00mACR-265-TP2.sorted.bam[0m
│   ├── [856K]  [00mACR-265-TP2.sorted.bam.bai[0m
│   ├── [2.4M]  [00me2t.ctab[0m
│   ├── [ 15M]  [00me_data.ctab[0m
│   ├── [1.9M]  [00mi2t.ctab[0m
│   ├── [7.3M]  [00mi_data.ctab[0m
│   ├── [ 400]  [00minput_fastqs_checksums.md5[0m
│   └── [3.7M]  [00mt_data.ctab[0m
├── [2.0G]  [01;34mACR-265-TP3[0m
│   ├── [ 601]  [00mACR-265-TP3_checksums.md5[0m
│   ├── [5.7M]  [00mACR-265-TP3.cov_refs.gtf[0m
│   ├── [ 34M]  [00mACR-265-TP3.gtf[0m
│   ├── [ 450]  [00mACR-265-TP3-hisat2_output.flagstat[0m
│   ├── [ 638]  [00mACR-265-TP3_hisat2.stats[0m
│   ├── [1.9G]  [00mACR-265-TP3.sorted.bam[0m
│   ├── [940K]  [00mACR-265-TP3.sorted.bam.bai[0m
│   ├── [2.4M]  [00me2t.ctab[0m
│   ├── [ 15M]  [00me_data.ctab[0m
│   ├── [1.9M]  [00mi2t.ctab[0m
│   ├── [7.3M]  [00mi_data.ctab[0m
│   ├── [ 400]  [00minput_fastqs_checksums.md5[0m
│   └── [3.7M]  [00mt_data.ctab[0m
├── [2.0G]  [01;34mACR-265-TP4[0m
│   ├── [ 601]  [00mACR-265-TP4_checksums.md5[0m
│   ├── [3.9M]  [00mACR-265-TP4.cov_refs.gtf[0m
│   ├── [ 34M]  [00mACR-265-TP4.gtf[0m
│   ├── [ 449]  [00mACR-265-TP4-hisat2_output.flagstat[0m
│   ├── [ 640]  [00mACR-265-TP4_hisat2.stats[0m
│   ├── [1.9G]  [00mACR-265-TP4.sorted.bam[0m
│   ├── [849K]  [00mACR-265-TP4.sorted.bam.bai[0m
│   ├── [2.4M]  [00me2t.ctab[0m
│   ├── [ 15M]  [00me_data.ctab[0m
│   ├── [1.9M]  [00mi2t.ctab[0m
│   ├── [7.2M]  [00mi_data.ctab[0m
│   ├── [ 400]  [00minput_fastqs_checksums.md5[0m
│   └── [3.7M]  [00mt_data.ctab[0m
├── [ 35M]  [00mApulchra-genome.stringtie.gtf[0m
├── [5.1M]  [00mapul-gene_count_matrix.csv[0m
├── [5.2M]  [00mapul-transcript_count_matrix.csv[0m
├── [5.7M]  [00mapul-transcript_count_matrix_with_gene_ids.csv[0m
├── [ 633]  [00mchecksums.md5[0m
├── [5.7K]  [00mgtf_list.txt[0m
├── [316K]  [01;34mmultiqc_data[0m
│   ├── [5.3K]  [00mmultiqc_bowtie2.txt[0m
│   ├── [ 307]  [00mmultiqc_citations.txt[0m
│   ├── [275K]  [00mmultiqc_data.json[0m
│   ├── [3.3K]  [00mmultiqc_general_stats.txt[0m
│   ├── [4.1K]  [00mmultiqc.log[0m
│   ├── [8.0K]  [00mmultiqc_samtools_flagstat.txt[0m
│   └── [ 16K]  [00mmultiqc_sources.txt[0m
├── [1.1M]  [00mmultiqc_report.html[0m
├── [6.1K]  [00mprepDE-sample_list.txt[0m
├── [3.0K]  [00mREADME.md[0m
├── [1.4K]  [00msorted_bams.list[0m
├── [ 66G]  [00msorted-bams-merged.bam[0m
└── [ 13M]  [00msorted-bams-merged.bam.bai[0m

 209G used in 42 directories, 539 files

3.3 MultiQC alignment rates

# Load bash variables into memory
source .bashvars

# Change to output directory
cd "${output_dir_top}"

${multiqc} --interactive .

4 Merge sorted BAMs

Merge all sorted BAMs into a single BAM for use in later transcriptome assembly, if needed.

# Load bash variables into memory
source .bashvars

# Change to output directory
cd "${output_dir_top}"


## Create list of sorted BAMs for merging
find . -name "*sorted.bam" > sorted_bams.list

## Merge sorted BAMs
${programs_array[samtools]} merge \
-b sorted_bams.list \
${merged_bam} \
--threads ${threads}

## Index merged BAM
${programs_array[samtools]} index ${merged_bam}
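
Because the list file doubles as the record of what went into the merge, it can be worth checking it before running the merge itself. The following is a minimal sketch on toy files, not part of the original notebook: the sample directories `S1` and `S2`, the placeholder file contents, and the `listed` and `missing` variables are all hypothetical.

```shell
# Toy directory layout standing in for the real output directory (hypothetical)
workdir=$(mktemp -d)
mkdir -p "${workdir}/S1" "${workdir}/S2"
printf 'x' > "${workdir}/S1/S1.sorted.bam"
printf 'x' > "${workdir}/S2/S2.sorted.bam"

# Build the list the same way as above, sorted for a stable order
(cd "${workdir}" && find . -name "*sorted.bam" | sort > sorted_bams.list)

# Count listed entries and any that are missing or empty
missing=0
while read -r bam; do
    [ -s "${workdir}/${bam}" ] || missing=$((missing + 1))
done < "${workdir}/sorted_bams.list"
listed=$(wc -l < "${workdir}/sorted_bams.list" | tr -d ' ')

echo "listed=${listed} missing=${missing}"
rm -r "${workdir}"
```

If `listed` does not match the expected number of samples, or `missing` is non-zero, the merge input is incomplete.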

5 Create combined GTF

# Load bash variables into memory
source .bashvars

# Change to output directory
cd "${output_dir_top}"


# Create a single merged transcript file, using the GTF list file
"${programs_array[stringtie]}" --merge \
"${gtf_list}" \
-p "${threads}" \
-G "${genome_gff}" \
-o "${genome_index_name}".stringtie.gtf

6 Create DESeq2 Count Matrices

# Load bash variables into memory
source .bashvars

# Change to output directory
cd "${output_dir_top}"

# Check if prepDE-sample_list.txt exists
if [ -f prepDE-sample_list.txt ]; then
  gtf_lines=$(wc -l < gtf_list.txt)
  prepde_lines=$(wc -l < prepDE-sample_list.txt)
  if [ "$gtf_lines" -eq "$prepde_lines" ]; then
    echo "prepDE-sample_list.txt exists and line count matches gtf_list.txt. Skipping sample list creation."
  else
    echo "prepDE-sample_list.txt exists but line count does not match. Regenerating sample list."
    rm prepDE-sample_list.txt
    while read -r line
    do
      sample_no_path=${line##*/}
      sample=${sample_no_path%.*}
      echo ${sample} ${line}
    done < gtf_list.txt >> prepDE-sample_list.txt
  fi
else
  echo "prepDE-sample_list.txt does not exist. Creating sample list."
  while read -r line
  do
    sample_no_path=${line##*/}
    sample=${sample_no_path%.*}
    echo ${sample} ${line}
  done < gtf_list.txt >> prepDE-sample_list.txt
fi

# Create count matrices for genes and transcripts
# Compatible with import to DESeq2
python3 "${programs_array[prepDE]}" \
--input=prepDE-sample_list.txt \
-g apul-gene_count_matrix.csv \
-t apul-transcript_count_matrix.csv \
--length=${read_length}

prepDE-sample_list.txt exists and line count matches gtf_list.txt. Skipping sample list creation.
/home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py3:69: SyntaxWarning: invalid escape sequence '\-'
  RE_COVERAGE=re.compile('cov "([\-\+\d\.]+)"')
/home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py3:72: SyntaxWarning: invalid escape sequence '\-'
  RE_GFILE=re.compile('\-G\s*(\S+)') #assume filepath without spaces..
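
The sample list is built with two bash parameter expansions, which are easy to misread. The sketch below walks one path through them; the path itself is hypothetical and only shaped like an entry in gtf_list.txt.

```shell
# Hypothetical path, shaped like one line of gtf_list.txt
line="/path/to/output/ACR-237-TP1/ACR-237-TP1.gtf"

# "##*/" strips the longest prefix ending in "/", leaving the file name
sample_no_path="${line##*/}"

# "%.*" strips the shortest suffix starting with ".", dropping the last extension
sample="${sample_no_path%.*}"

# Same "sample path" layout that is written to prepDE-sample_list.txt
echo "${sample} ${line}"
```

Note that `%.*` removes only the final extension, so a file name containing additional dots would keep everything before the last one.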

7 Add Gene IDs to Transcript Count Matrix

Create an enhanced version of the transcript count matrix that includes the corresponding gene ID for each transcript.

# Load bash variables into memory
source .bashvars

# Change to output directory
cd "${output_dir_top}"

# Extract transcript_id to ref_gene_id mapping from GTF file
echo "Extracting transcript to gene ID mappings from GTF file..."

grep -E $'\ttranscript\t' "${genome_dir}/Apulchra-genome.stringtie.gtf" | \
awk -F'\t' '{
    # Extract transcript_id (the 3-argument form of match() requires gawk)
    match($9, /transcript_id "([^"]+)"/, transcript_arr)
    transcript_id = transcript_arr[1]
    
    # Extract ref_gene_id
    match($9, /ref_gene_id "([^"]+)"/, gene_arr)
    ref_gene_id = gene_arr[1]
    
    # Print mapping if both IDs found
    if (transcript_id && ref_gene_id) {
        print transcript_id "\t" ref_gene_id
    }
}' > transcript_gene_mapping.txt

# Check if mapping file was created successfully
if [ ! -s transcript_gene_mapping.txt ]; then
    echo "Error: Failed to create transcript-gene mapping file"
    exit 1
fi

echo "Found $(wc -l < transcript_gene_mapping.txt) transcript-gene mappings"

# Create enhanced transcript count matrix with gene IDs
echo "Creating enhanced transcript count matrix with gene IDs..."

# Read header from original transcript count matrix
head -1 apul-transcript_count_matrix.csv > temp_header.txt

# Add ref_gene_id to header (insert after transcript_id column)
sed 's/transcript_id,/transcript_id,ref_gene_id,/' temp_header.txt > apul-transcript_count_matrix_with_gene_ids.csv

# Process data rows
tail -n +2 apul-transcript_count_matrix.csv | while IFS=',' read -r transcript_id rest_of_line; do
    # Look up gene ID for this transcript
    gene_id=$(grep "^${transcript_id}[[:space:]]" transcript_gene_mapping.txt | cut -f2)
    
    # If no gene ID found, use empty field
    if [ -z "$gene_id" ]; then
        gene_id=""
    fi
    
    # Output line with gene ID inserted after transcript ID
    echo "${transcript_id},${gene_id},${rest_of_line}"
done >> apul-transcript_count_matrix_with_gene_ids.csv

# Generate summary statistics
echo ""
echo "Summary of enhanced transcript count matrix:"
original_lines=$(wc -l < apul-transcript_count_matrix.csv)
enhanced_lines=$(wc -l < apul-transcript_count_matrix_with_gene_ids.csv)
echo "Original transcript count matrix lines: $original_lines"
echo "Enhanced transcript count matrix lines: $enhanced_lines"

# Count transcripts with and without gene ID mappings
transcripts_with_genes=$(tail -n +2 apul-transcript_count_matrix_with_gene_ids.csv | awk -F',' '$2 != ""' | wc -l)
transcripts_without_genes=$(tail -n +2 apul-transcript_count_matrix_with_gene_ids.csv | awk -F',' '$2 == ""' | wc -l)

echo "Number of transcripts with gene ID mappings: $transcripts_with_genes"
echo "Number of transcripts without gene ID mappings: $transcripts_without_genes"

# Display first few lines of enhanced matrix
echo ""
echo "First 5 rows of enhanced transcript count matrix (first 5 columns):"
head -6 apul-transcript_count_matrix_with_gene_ids.csv | cut -d',' -f1-5

# Clean up temporary files
rm temp_header.txt transcript_gene_mapping.txt

echo ""
echo "Enhanced transcript count matrix saved as: apul-transcript_count_matrix_with_gene_ids.csv"

Extracting transcript to gene ID mappings from GTF file...
Found 44371 transcript-gene mappings
Creating enhanced transcript count matrix with gene IDs...

Summary of enhanced transcript count matrix:
Original transcript count matrix lines: 44372
Enhanced transcript count matrix lines: 44372
Number of transcripts with gene ID mappings: 44371
Number of transcripts without gene ID mappings: 0

First 5 rows of enhanced transcript count matrix (first 5 columns):
transcript_id,ref_gene_id,ACR-139-TP1,ACR-139-TP2,ACR-139-TP3
FUN_002326-T1,FUN_002326,3,2,3
FUN_002315-T1,FUN_002315,0,1,0
FUN_002316-T1,FUN_002316,0,0,0
FUN_002303-T1,FUN_002303,10,5,10
FUN_002304-T1,FUN_002304,2,1,1

Enhanced transcript count matrix saved as: apul-transcript_count_matrix_with_gene_ids.csv
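
The row-by-row loop above runs one `grep` per transcript, which rescans the mapping file for every line of the matrix. One possible alternative, shown here only as a sketch on toy data, is a single awk pass that loads the mapping into an array and then inserts the gene ID while streaming the matrix. The file names `mapping.tsv` and `counts.csv`, the IDs, and the counts are all hypothetical.

```shell
# Toy inputs standing in for the mapping file and the count matrix (hypothetical values)
workdir=$(mktemp -d)
printf 'FUN_000001-T1\tFUN_000001\nFUN_000002-T1\tFUN_000002\n' > "${workdir}/mapping.tsv"
printf 'transcript_id,S1,S2\nFUN_000001-T1,3,2\nFUN_000002-T1,0,1\nFUN_000003-T1,5,5\n' > "${workdir}/counts.csv"

# First file: load transcript -> gene map.
# Second file: rewrite the header, then insert the gene ID as column 2.
joined=$(awk '
    NR == FNR { split($0, f, "\t"); gene[f[1]] = f[2]; next }
    FNR == 1  { sub(/^transcript_id,/, "transcript_id,ref_gene_id,"); print; next }
    {
        comma = index($0, ",")
        id = substr($0, 1, comma - 1)
        # Unmapped transcripts get an empty gene ID field
        print id "," gene[id] "," substr($0, comma + 1)
    }
' "${workdir}/mapping.tsv" "${workdir}/counts.csv")

echo "${joined}"
rm -r "${workdir}"
```

This sketch assumes a non-empty mapping file (the notebook already exits when it is empty) and, if a transcript appears more than once in the mapping, keeps the last gene ID seen.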

8 Generate checksums

# Load bash variables into memory
source .bashvars

# Change to output directory
cd "${output_dir_top}"

# Uses find command to avoid passing
# directory names to the md5sum command.
find . -maxdepth 1 -type f -exec md5sum {} + \
| tee --append checksums.md5
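
Anyone downloading these files can check them against the checksum file with `md5sum --check`. The sketch below demonstrates the round trip on toy files, assuming GNU coreutils; the file names and the `verify_result` variable are hypothetical. It also excludes the checksum file from its own listing, a variation on the command above rather than what the notebook does.

```shell
# Toy files standing in for the real outputs (hypothetical)
workdir=$(mktemp -d)
printf 'example\n' > "${workdir}/a.txt"
printf 'another\n' > "${workdir}/b.txt"

# Write checksums for top-level files, skipping the checksum file itself
(cd "${workdir}" && find . -maxdepth 1 -type f ! -name checksums.md5 -exec md5sum {} + > checksums.md5)

# Verify from the same directory; --quiet prints only failures
if (cd "${workdir}" && md5sum --check --quiet checksums.md5); then
    verify_result="OK"
else
    verify_result="FAILED"
fi

echo "${verify_result}"
rm -r "${workdir}"
```

Verification has to run from the directory the checksums were generated in, because the recorded paths are relative (`./a.txt`).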

Kim, Daehwan, Joseph M. Paggi, Chanhee Park, Christopher Bennett, and Steven L. Salzberg. 2019. “Graph-Based Genome Alignment and Genotyping with HISAT2 and HISAT-Genotype.” Nature Biotechnology 37 (8): 907–15. https://doi.org/10.1038/s41587-019-0201-4.

Pertea, Mihaela, Daehwan Kim, Geo M Pertea, Jeffrey T Leek, and Steven L Salzberg. 2016. “Transcript-Level Expression Analysis of RNA-Seq Experiments with HISAT, StringTie and Ballgown.” Nature Protocols 11 (9): 1650–67. https://doi.org/10.1038/nprot.2016.095.

Pertea, Mihaela, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell, and Steven L Salzberg. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome from RNA-Seq Reads.” Nature Biotechnology 33 (3): 290–95. https://doi.org/10.1038/nbt.3122.