Since we’ve generated a number of versions of the C. bairdi transcriptome, we’ve decided to compare them using various metrics. Here, I’ve compared the BUSCO scores generated for each transcriptome using BUSCO’s built-in plotting script. The script generates a stacked bar plot from all of the BUSCO short summary files it is given, and also writes out the R code used to generate the plot.
This was run on Mox with the following script (GitHub):
RESULTS
Output folder:
Here’s a table to help see which libraries contribute to each of the transcriptomes:
assembly_name | arthropoda_only(y/n) | library_01 | library_02 | library_03 | library_04 |
---|---|---|---|---|---|
cbai_transcriptome_v1.0.fasta | y | 2018 | 2019 | NA | NA |
cbai_transcriptome_v1.5.fasta | y | 2018 | 2019 | 2020-GW | NA |
cbai_transcriptome_v1.6.fasta | y | 2018 | 2019 | 2020-GW | 2020-UW |
cbai_transcriptome_v1.7.fasta | y | 2018 | 2019 | 2020-UW | NA |
cbai_transcriptome_v2.0.fasta | n | 2018 | 2019 | 2020-GW | 2020-UW |
cbai_transcriptome_v3.0.fasta | n | 2018 | 2019 | 2020-UW | NA |
Unsurprisingly, we see a high number of duplicated BUSCOs in these results. This is expected because we ran BUSCO on the full Trinity transcriptome FastAs, which include all isoforms for any given gene. Multiple isoforms of the same gene can each match the same BUSCO, which leads to a large increase in duplicated (and fragmented) BUSCOs.
Also, we see that transcriptomes v2.0 & v3.0 show the highest numbers of duplicated BUSCOs, compared with the other assemblies. This is likely because these two assemblies have not been subjected to taxonomic filtering (see the arthropoda_only column in the table above), so BUSCOs are probably being identified from multiple organisms (e.g. Hematodinium sp.) present in the samples.
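To put numbers next to the bars in the plot, the duplicated percentage can be pulled straight out of each BUSCO short summary file. Below is a minimal Python sketch, not part of the script that was run on Mox. It assumes the short summary contains BUSCO's usual one-line score notation (`C:..%[S:..%,D:..%],F:..%,M:..%,n:..`). The function name `parse_busco_scores` is hypothetical, and the example line uses made-up values purely for illustration; they are not results from these assemblies.

```python
import re

# BUSCO's one-line score notation, as found in short summary files.
# Example shape: C:90.0%[S:40.0%,D:50.0%],F:5.0%,M:5.0%,n:1066
SCORE_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_scores(text):
    """Return BUSCO percentages (C, S, D, F, M) and the BUSCO count (n).

    `text` is the full contents of a short summary file (assumed format).
    """
    match = SCORE_RE.search(text)
    if match is None:
        raise ValueError("no BUSCO score line found")
    scores = {key: float(val) for key, val in match.groupdict().items() if key != "n"}
    scores["n"] = int(match.group("n"))
    return scores


# Made-up example line (NOT real results from these transcriptomes).
example_summary = "\tC:90.0%[S:40.0%,D:50.0%],F:5.0%,M:5.0%,n:1066\n"
example_scores = parse_busco_scores(example_summary)
print(f"Duplicated: {example_scores['D']}% of {example_scores['n']} BUSCOs")
```

Running this over each short summary file would give a small table of duplicated percentages per assembly, which makes the v2.0/v3.0 versus taxonomically filtered comparison easier to state precisely.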
I’ll extract just the genes from each of the assemblies and re-run BUSCO and subsequent comparisons to see how they look.
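As an illustration of what "just the genes" could mean in practice, here is a minimal Python sketch that keeps one sequence (the longest isoform) per Trinity gene. This is an assumption about the approach, not necessarily how the extraction will actually be done. It relies on Trinity's standard transcript naming, where the isoform is the trailing `_i<number>` of IDs like `TRINITY_DN1000_c115_g5_i1`. The function names and the toy sequences are hypothetical.

```python
import re

# Trinity transcript IDs end in _i<N>; stripping that suffix gives the gene ID.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FastA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def longest_isoform_per_gene(records):
    """Map each Trinity gene ID to its longest (header, sequence) record."""
    best = {}
    for header, seq in records:
        transcript_id = header.split()[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Toy input (made-up sequences, for illustration only).
toy_fasta = [
    ">TRINITY_DN1_c0_g1_i1 len=4",
    "ACGT",
    ">TRINITY_DN1_c0_g1_i2 len=8",
    "ACGTACGT",
    ">TRINITY_DN2_c0_g1_i1 len=6",
    "GGGCCC",
]
genes = longest_isoform_per_gene(read_fasta(toy_fasta))
for gene_id, (header, seq) in genes.items():
    print(gene_id, len(seq))
```

Keeping one sequence per gene should remove isoform-driven duplication, so any duplicated BUSCOs that remain would be more informative about the assemblies themselves.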