In our continuing quest to wrangle our geoduck transcriptome assemblies, I was tasked with compiling stats for each of them. The table below provides an overview of some stats for each assembly. Links within the table go to the notebook entries for the various methods from which the data were gathered. In general:
- Genes/Isoforms stats come directly from the Trinity assembly stats output file.
- transdecoder_pep is a count of headers in the Transdecoder FastA output file, transdecoder_pep.
- CD-Hit is a count of headers in the CD-Hit-est FastA output file.
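
The header counts described above can be reproduced with a few lines of Python. This is a minimal sketch, not the method actually used for the table; the function name `count_fasta_headers` is hypothetical, and it assumes a plain-text (uncompressed) FastA file in which every record starts with a `>` line.

```python
def count_fasta_headers(fasta_path):
    """Count FastA records by counting lines that begin with '>'.

    Assumes an uncompressed, plain-text FastA file (an assumption of
    this sketch, not something stated in the notebook entry).
    """
    with open(fasta_path, "r") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling `count_fasta_headers()` on a Transdecoder or CD-Hit-est FastA output file should give one number per assembly, comparable to the corresponding column in the table.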
Assembly | Genes | Isoforms | transdecoder_pep | CD-Hit |
---|---|---|---|---|
ctenidia | 216248 | | 72274 | 325783 |
gonad | 151263 | | 31706 | 189378 |
Juvenile (EPI 115) | 199765 | | 78149 | 297848 |
Juvenile (EPI 116) | 268476 | | 99089 | 408498 |
Juvenile (EPI 123) | 196131 | | 67398 | 284852 |
Juvenile (EPI 124) | 255277 | | 93285 | 395527 |
Larvae (EPI 99) | 249799 | | 77694 | 379210 |
MEANS | 219566 | 350642 | 74228 | 325871 |
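
As a quick sanity check on the MEANS row, the Genes mean can be recomputed from the per-assembly values in the table. This is only an illustrative sketch; the dictionary name `genes` is hypothetical and the values are copied straight from the Genes column above.

```python
from statistics import mean

# Genes counts copied from the table above (one entry per assembly).
genes = {
    "ctenidia": 216248,
    "gonad": 151263,
    "Juvenile (EPI 115)": 199765,
    "Juvenile (EPI 116)": 268476,
    "Juvenile (EPI 123)": 196131,
    "Juvenile (EPI 124)": 255277,
    "Larvae (EPI 99)": 249799,
}

# Round to the nearest whole number, as in the MEANS row.
genes_mean = round(mean(list(genes.values())))
print(genes_mean)  # 219566
```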
RESULTS
From this brief summary of the various assembly stats, the Transdecoder numbers are probably the most “realistic”, at least when it comes to the number of actual coding mRNAs encoded in the geoduck genome.
It’s also good to keep in mind that the Pgenerosa_v070 MAKER annotation identified 53,035 transcripts/proteins, while the Pgenerosa_v074 annotation. These numbers from the MAKER annotations do not take into account transcript isoforms…