Data Summary - P.generosa Transcriptome Assemblies Stats

In our continuing quest to wrangle the geoduck transcriptome assemblies we have, I was tasked with compiling stats for each of them. The table below provides an overview of some of those stats. Links within the table go to the notebook entries for the various methods from which the data were gathered. In general:

  • Genes/Isoforms stats come directly from the Trinity assembly stats output file.

  • transdecoder_pep is a count of headers in the Transdecoder FastA output file, transdecoder_pep.

  • CD-Hit is a count of headers in the CD-Hit-est FastA output file.
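As a rough illustration of the header counts described above, here is a minimal Python sketch that counts the lines starting with ">" in a FastA file. The function name and the file paths in the usage comment are hypothetical, and this is not necessarily how the counts in the linked notebook entries were generated.

```python
import gzip
from pathlib import Path


def count_fasta_headers(path):
    """Count sequence records in a FastA file by counting '>' header lines.

    Reads gzip-compressed files when the name ends in '.gz'.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Hypothetical usage (placeholder paths, not the real output file names):
# n_pep = count_fasta_headers("path/to/transdecoder_output.pep")
# n_cdhit = count_fasta_headers("path/to/cd-hit-est_output.fasta")
```

Each FastA record has exactly one header line, so the header count equals the number of sequences in the file.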

Assembly              Genes    Isoforms   transdecoder_pep   CD-Hit
ctenidia              216248   349773     72274              325783
gonad                 151263   198748     31706              189378
Juvenile (EPI 115)    199765   320691     78149              297848
Juvenile (EPI 116)    268476   434877     99089              408498
Juvenile (EPI 123)    196131   303568     67398              284852
Juvenile (EPI 124)    255277   421670     93285              395527
Larvae (EPI 99)       249799   425165     77694              379210
MEANS                 219566   350642     74228              325871
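The MEANS row can be reproduced from the per-assembly values. The sketch below copies the numbers from the table and rounds each column mean to the nearest whole number; the dictionary layout and variable names are just for illustration.

```python
# Per-assembly counts, in the order: Genes, Isoforms, transdecoder_pep, CD-Hit
columns = ["Genes", "Isoforms", "transdecoder_pep", "CD-Hit"]
assemblies = {
    "ctenidia":           [216248, 349773, 72274, 325783],
    "gonad":              [151263, 198748, 31706, 189378],
    "Juvenile (EPI 115)": [199765, 320691, 78149, 297848],
    "Juvenile (EPI 116)": [268476, 434877, 99089, 408498],
    "Juvenile (EPI 123)": [196131, 303568, 67398, 284852],
    "Juvenile (EPI 124)": [255277, 421670, 93285, 395527],
    "Larvae (EPI 99)":    [249799, 425165, 77694, 379210],
}

# Mean of each column across all assemblies, rounded to the nearest integer
means = {
    name: round(sum(row[i] for row in assemblies.values()) / len(assemblies))
    for i, name in enumerate(columns)
}

print(means)
# {'Genes': 219566, 'Isoforms': 350642, 'transdecoder_pep': 74228, 'CD-Hit': 325871}
```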


From this brief summary of the assembly stats, the Transdecoder numbers are probably the most “realistic”, at least as an estimate of the number of actual coding mRNAs encoded in the geoduck genome.

It’s also good to keep in mind that the Pgenerosa_v070 MAKER annotation identified 53,035 transcripts/proteins, while the Pgenerosa_v074 annotation. These numbers from the MAKER annotations do not take into account transcript isoforms…