Since we’ve generated a number of versions of the C. bairdi transcriptome, we’ve decided to compare them using various metrics. Here, I’ve compared the BUSCO scores generated for each transcriptome using BUSCO’s built-in plotting script. The script generates a stacked bar plot from all of the BUSCO short summary files it is given, and it also writes out the R code used to generate the plot.

This was run on Mox with the following script (GitHub):

busco_comparison_plotting.sh

RESULTS

Output folder:

20200528_cbai_transcriptome_busco_comparisons

Here’s a table to help see which libraries contribute to each of the transcriptomes:

assembly_name	arthropoda_only(y/n)	library_01	library_02	library_03	library_04
cbai_transcriptome_v1.0.fasta	y	2018	2019	NA	NA
cbai_transcriptome_v1.5.fasta	y	2018	2019	2020-GW	NA
cbai_transcriptome_v1.6.fasta	y	2018	2019	2020-GW	2020-UW
cbai_transcriptome_v1.7.fasta	y	2018	2019	2020-UW	NA
cbai_transcriptome_v2.0.fasta	n	2018	2019	2020-GW	2020-UW
cbai_transcriptome_v3.0.fasta	n	2018	2019	2020-UW	NA

Unsurprisingly, we see a high number of duplicated BUSCOs in these results. This is expected because BUSCO was run on the full Trinity transcriptome FastAs, which include all isoforms for any given gene. Multiple isoforms of the same gene can each match the same BUSCO, which leads to a large increase in duplicated (and fragmented) BUSCOs.

Also, we see that transcriptomes v2.0 & v3.0 show the highest numbers of duplicated BUSCOs compared with the other assemblies. This is likely because these two assemblies have not been subjected to taxonomic filtering (arthropoda_only = n in the table above), so BUSCOs are likely being identified from multiple organisms present in the samples (e.g. Hematodinium sp.).
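To put numbers on the duplication differences rather than reading them off the bar plot, the one-line score string in each BUSCO short summary file can be parsed directly. The following is only a minimal sketch: the function names (`parse_busco_summary`, `rank_by_duplication`) are hypothetical, and the assembly names and scores in `example` are placeholders for illustration, not the actual results from this comparison.

```python
import re

# Matches the one-line BUSCO score string found in a short summary file, e.g.
#   C:90.0%[S:40.0%,D:50.0%],F:5.0%,M:5.0%,n:100
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text):
    """Return the C/S/D/F/M percentages and n from BUSCO summary text."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))
    return scores


def rank_by_duplication(summaries):
    """Return assembly names sorted from most to least duplicated BUSCOs."""
    parsed = {name: parse_busco_summary(text) for name, text in summaries.items()}
    return sorted(parsed, key=lambda name: parsed[name]["D"], reverse=True)


# Placeholder values for illustration only; NOT the real results.
example = {
    "assembly_a": "C:90.0%[S:40.0%,D:50.0%],F:5.0%,M:5.0%,n:100",
    "assembly_b": "C:92.0%[S:30.0%,D:62.0%],F:4.0%,M:4.0%,n:100",
}

print(rank_by_duplication(example))
```

In practice, each value in the dictionary would be the text read from one assembly's short summary file.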

I’ll extract just the genes from each of the assemblies and re-run BUSCO and subsequent comparisons to see how they look.
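One possible way to reduce each assembly to a single sequence per gene is to keep only the longest isoform for each Trinity gene. This is just a sketch of that idea, not necessarily the approach that will be used here. It assumes Trinity-style transcript IDs (e.g. TRINITY_DN1_c0_g1_i2), where the gene ID is everything before the trailing `_i<number>`; the function names and the toy sequences are made up for illustration.

```python
import re

# Trinity transcript IDs end in _i<number>; stripping that suffix gives the gene ID.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FastA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(records):
    """Keep the longest isoform per gene; ties keep the first one seen."""
    best = {}
    for header, seq in records:
        transcript_id = header.split()[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Toy input for illustration only (hypothetical sequences).
toy_fasta = """\
>TRINITY_DN1_c0_g1_i1 len=8
ACGTACGT
>TRINITY_DN1_c0_g1_i2 len=12
ACGTACGTACGT
>TRINITY_DN2_c0_g1_i1 len=4
ACGT
""".splitlines()

genes = longest_isoform_per_gene(read_fasta(toy_fasta))
for gene_id, (header, seq) in genes.items():
    print(f">{header}\n{seq}")
```

For a real assembly, an open file handle can be passed to `read_fasta` in place of `toy_fasta`. Note that the longest isoform is not always the most biologically representative one, so this is a convenience heuristic rather than a definitive gene set.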