Posts - Page 2 of 145
Transcript Alignments - P.generosa RNA-seq Alignments for lncRNA Identification Using Hisat2 StringTie and gffcompare on Mox
This is a continuation of the process for identifying lncRNAs. I aligned the FastQs, which were trimmed earlier today, to our Panopea-generosa-v1.0 genome FastA using HISAT2. I used the HISAT2 genome index created on 20190723, which was built with options to identify exons and splice sites. The GFF used was from 20220323. StringTie was used to identify alternative transcripts, assign expression values, and create expression tables for use with ballgown. The job was run on Mox.
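As a rough illustration of the alignment and expression steps described above, here is a minimal Python sketch that only builds the command lines. The file names, index prefix, and thread count are hypothetical placeholders, and the options actually used in the Mox job may differ. The flags shown (`--dta`, `-x`, `-1`, `-2`, `-S` for HISAT2; `-G`, `-B`, `-o`, `-p` for StringTie) are standard options of those tools.

```python
import shlex


def hisat2_cmd(index_prefix, r1, r2, sam_out, threads=8):
    """Build a HISAT2 paired-end alignment command.

    --dta reports alignments tailored for transcript assemblers
    such as StringTie.
    """
    return [
        "hisat2", "--dta", "-p", str(threads),
        "-x", index_prefix,
        "-1", r1, "-2", r2,
        "-S", sam_out,
    ]


def stringtie_cmd(sorted_bam, annotation, gtf_out, threads=8):
    """Build a StringTie command.

    -G supplies the reference annotation (GTF or GFF3) and -B writes
    the ballgown *.ctab tables alongside the output GTF.
    StringTie expects a coordinate-sorted BAM, so the SAM from HISAT2
    has to be sorted first (for example with samtools sort).
    """
    return [
        "stringtie", sorted_bam,
        "-p", str(threads),
        "-G", annotation,
        "-B",
        "-o", gtf_out,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(hisat2_cmd("genome_index", "s1_R1.fq.gz", "s1_R2.fq.gz", "s1.sam")))
    print(shlex.join(stringtie_cmd("s1.sorted.bam", "genes.gff", "s1/s1.gtf")))
```

Building the commands as lists keeps them easy to inspect or log before handing them to `subprocess.run`.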
FastQ Trimming and QC - P.generosa RNA-seq Data from 20220323 on Mox
Addressing the update to this GitHub Issue regarding identifying Panopea generosa (Pacific geoduck) long non-coding RNAs (lncRNAs), I used the RNA-seq data from the Nextflow NF-Core RNAseq pipeline run on 20220323. Although that data was supposed to have been trimmed in the Nextflow NF-Core RNA-seq pipeline, the FastQC reports still show adapter contamination and some funky stuff happening at the 5’ end of the reads. So, I’ve opted to trim the “trimmed” files with fastp, using a hard 20bp trim at the 5’ end of all reads. FastQC and MultiQC were run before/after trimming. Job was run on Mox.
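To make the "hard 20bp trim at the 5’ end" concrete, here is a small Python sketch of what that operation does to a single FASTQ record. It is not how fastp is implemented, only an illustration. fastp exposes this kind of trim through its `--trim_front1` and `--trim_front2` options; whether those exact options were used in this job is an assumption.

```python
def trim_front(record, n=20):
    """Drop the first n bases, and their quality scores, from a FASTQ record.

    record is a 4-tuple: (header, sequence, separator, qualities).
    """
    header, seq, plus, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return (header, seq[n:], plus, qual[n:])


def trim_fastq_lines(lines, n=20):
    """Yield trimmed records from an iterable of FASTQ lines (4 per record)."""
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        yield trim_front(tuple(lines[i:i + 4]), n)
```

A read shorter than `n` bases comes back with an empty sequence. A real trimmer would normally drop such reads with a minimum-length filter.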
Data Wrangling - CEABIGR C.virginica Exon Expression Table
As part of the CEABIGR project (GitHub repo), Steven asked that I generate an exon expression table (GitHub Issue) where each row is a gene and the columns are the corresponding exons, filled with their expression value. For this, I planned on using the read count from the ballgown e_data.ctab files as an expression value.
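One way such a table could be assembled is sketched below in Python. This is not the approach used in the post. It assumes the standard ballgown table layout, where `e_data.ctab` has `e_id`, `start`, and `rcount` columns, `e2t.ctab` maps `e_id` to `t_id`, and `t_data.ctab` maps `t_id` to `gene_id`. The function name and the choice to order exons by genomic start are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict


def _read_tsv(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def exon_table(e_data_text, e2t_text, t_data_text):
    """Return {gene_id: [rcount of each exon, ordered by exon start]}."""
    # transcript ID -> gene ID
    t2gene = {row["t_id"]: row["gene_id"] for row in _read_tsv(t_data_text)}

    # exon ID -> set of genes whose transcripts contain that exon
    exon_genes = defaultdict(set)
    for row in _read_tsv(e2t_text):
        exon_genes[row["e_id"]].add(t2gene[row["t_id"]])

    # gene ID -> list of (exon start, read count)
    per_gene = defaultdict(list)
    for row in _read_tsv(e_data_text):
        for gene in exon_genes.get(row["e_id"], ()):
            per_gene[gene].append((int(row["start"]), float(row["rcount"])))

    return {gene: [count for _, count in sorted(exons)]
            for gene, exons in per_gene.items()}
```

Rows end up with different lengths because genes have different numbers of exons, so writing this out as a rectangular table needs padding. Ordering by start gives genomic order, which is the reverse of transcript order for genes on the minus strand.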
Daily Bits - April 2023
20230403
Data Wrangling - Append Gene Ontology Aspect to P.generosa Primary Annotation File
Steven tasked me with updating our P.generosa genome annotation file (GitHub Issue) a while back and I finally managed to get it all figured out. Although I wanted to perform most of this using the GSEABase package (PDF), as this package is geared towards storage/retrieval of gene set data, I eventually decided to abandon this approach due to the time it was taking and my lack of familiarity/understanding of how to manipulate objects in R. Despite that, GSEABase was still used because it makes identifying GOslims (IDs and Terms) very simple.
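For readers who want to see what looking up a GO aspect involves, here is a minimal Python sketch. It is a different technique from the R/GSEABase workflow in the post. It reads `[Term]` stanzas from Gene Ontology OBO text and maps each `id:` to its `namespace:` (the aspect). The semicolon-delimited GO field in `aspects_for` is an assumed annotation format, not the layout of the actual file.

```python
def go_aspects(obo_text):
    """Map each GO ID found in OBO-formatted text to its namespace (aspect)."""
    aspects = {}
    current_id = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas describe GO terms.
            in_term = (line == "[Term]")
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[len("id: "):]
        elif in_term and current_id and line.startswith("namespace: "):
            aspects[current_id] = line[len("namespace: "):]
    return aspects


def aspects_for(go_field, aspects, sep=";"):
    """Return the aspect for each GO ID in a delimited field (assumed format).

    IDs missing from the lookup are reported as 'unknown'.
    """
    ids = [go_id.strip() for go_id in go_field.split(sep) if go_id.strip()]
    return [aspects.get(go_id, "unknown") for go_id in ids]
```

This sketch ignores `alt_id` entries and obsolete terms, so IDs that have been merged or retired would come back as `unknown`.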
Data Received - Trimmed M.magister RNA-seq from NOAA
Sequencing Read Taxonomic Classification - P.verrucosa E5 RNA-seq Using DIAMOND BLASTx and MEGAN daa-meganizer on Mox
After some discussion with Steven at Science Hour last week regarding the handling of endosymbiont sequences in the E5 P.verrucosa RNA-seq data, Steven thought it would be interesting to run the RNA-seq reads through MEGAN6 just to see what the taxonomic breakdown looks like. We may or may not (probably not) separate reads based on taxonomy. In the meantime, we’ll still proceed with HISAT2 alignments to the respective genomes as a means to separate the endosymbiont reads from the P.verrucosa reads.
Data Wrangling - C.goreaui Genome GFF to GTF Using gffread
As part of getting these three coral species genome files (GitHub Issue) added to our Lab Handbook Genomic Resources page, I also need to get the coral endosymbiont sequence. After talking with Danielle Becker in Hollie Putnam’s Lab at Univ. of Rhode Island, she pointed me to the Cladocopium goreaui genome from Chen et al., 2022 available here. Access to the genome requires agreeing to some licensing provisions (primarily the requirement to cite the publication whenever the genome is used), so I will not be providing any public links to the file. In order to index the Cladocopium goreaui genome file (Cladocopium_goreaui_genome_fa) using HISAT2 for downstream isoform analysis using StringTie and ballgown, I need a corresponding GTF to also identify exon/intron splice sites. Since a GTF file is not available, but a GFF file is, I needed to convert the GFF to GTF. Used gffread to do this on my computer. Process is documented in Jupyter Notebook linked below.
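The conversion itself comes down to a single gffread call. The Python sketch below only builds that command, with hypothetical file names rather than the actual paths from the notebook. `-T` tells gffread to write GTF instead of GFF3, and `-o` names the output file.

```python
import shlex


def gffread_gtf_cmd(gff_in, gtf_out):
    """Build a gffread command that converts a GFF file to GTF.

    -T switches the output format to GTF; -o sets the output path.
    """
    return ["gffread", gff_in, "-T", "-o", gtf_out]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = gffread_gtf_cmd("genome.gff", "genome.gtf")
    print(shlex.join(cmd))
    # To actually run it (requires gffread on the PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

The resulting GTF can then be passed to HISAT2's `hisat2_extract_splice_sites.py` and `hisat2_extract_exons.py` helper scripts before building the index.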
Transcript Identification and Alignments - P.verrucosa RNA-seq with Pver_genome_assembly_v1.0 Using HiSat2 and Stringtie on Mox
After getting the RNA-seq data trimmed, it was time to perform alignments and determine expression levels of transcripts/isoforms using HISAT2 and StringTie, respectively. StringTie was set to output tables formatted for import into ballgown. After those two analyses were complete, I ran gffcompare, using the merged StringTie GTF and the input GFF3. I caught this in one of Danielle Becker’s scripts and thought it might be interesting. The analyses were run on Mox.
FastQ Trimming and QC - P.verrucosa RNA-seq Data from Danielle Becker in Hollie Putnam Lab Using fastp FastQC and MultiQC on Mox
After receiving the P.verrucosa RNA-seq data from Danielle Becker (Hollie Putnam’s Lab, Univ. of Rhode Island), I noticed that the trimmed reads didn’t appear to actually be trimmed. There was still adapter contamination (solely in R2 reads - suggesting the detect_adapter_for_pe option had been omitted from the fastp command?), but the reads had an average read length of 150bp - except when looking at the adapter content report!!??