UPDATE 20191203
Due to a bug in GenSAS, the final GFF generated by their system did not properly merge all of the repeats data. The repeats data below should be ignored! It appears that the other features data (e.g. CDS, exons, etc.) was unaffected by the bug.
Here’s the updated analysis: https://robertslab.github.io/sams-notebook/posts/2019/2019-12-03-Data-Wrangling—Renaming,-Splitting,-and-Feature-Counts-of-Updated-Pgenerosa_v074-GenSAS-Merged-GFF/
I’ll leave the original notebook entry below for posterity.
In preparation for a paper we’re writing, we needed some summary stats for Panopea-generosa-vv0.74.a4. This info will be compiled into a table for the manuscript. See our Genomic Resources wiki for more info on GFFs:
- Genomic Resources Wiki (GitHub)
Calculations were performed using Python in Jupyter Notebooks.
Genome Features Jupyter Notebook (GitHub):
Repeat Features Jupyter Notebook (GitHub):
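For quick reference, here is a minimal sketch of how per-feature length stats like the ones in the results can be computed from a GFF3. This is not the code from the notebooks: it uses only the Python standard library, the function name `feature_length_stats` and the example records are made up for illustration, and it assumes feature length is `end - start + 1` (GFF3 coordinates are 1-based and inclusive). The notebooks may have computed `seqlength` differently.

```python
import statistics


def feature_length_stats(gff_lines):
    """Summarize feature lengths (mean, min, median, max) from GFF3 lines.

    Assumption: length = end - start + 1, since GFF3 coordinates are
    1-based and inclusive. Hypothetical helper, not from the notebooks.
    """
    lengths = []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments/directives and anything without the 9 GFF3 columns
        if line.startswith("#") or len(fields) < 9:
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)
    return {
        "mean": statistics.mean(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Made-up records for illustration only (not real Panopea generosa data)
example_gff = [
    "##gff-version 3",
    "scaffold_1\tGenSAS\tmRNA\t101\t300\t.\t+\t.\tID=mRNA_1",
    "scaffold_1\tGenSAS\tmRNA\t1\t50\t.\t-\t.\tID=mRNA_2",
    "scaffold_2\tGenSAS\tmRNA\t1000\t1499\t.\t+\t.\tID=mRNA_3",
]

stats = feature_length_stats(example_gff)
for name, value in stats.items():
    print(f"{name:<8}{value}")
```

To run this on a real file, pass an open file handle instead of the list, e.g. `feature_length_stats(open("Panopea-generosa-vv0.74.a4.mRNA.gff3"))`.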
RESULTS
I’ve copied/pasted the summary data for each of the GFFs that were analyzed, for quick reference. I’ll get this compiled into a table of some sort for people to use for the manuscript.
Also, the repeats were split into individual GFFs by repeat type. Those GFFs can be found here:
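A rough sketch of how that kind of split could be done, for anyone who wants to reproduce it. Big assumption here: the attribute key `repeat_class` and values like `LINE/L2` are hypothetical, since the exact layout of the attributes column in the GenSAS repeats GFF isn't shown in this entry. The helper names are also made up. Adjust `attr_key` to whatever the real file uses.

```python
from collections import defaultdict


def parse_attributes(attr_field):
    """Parse a GFF3 attributes column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in attr_field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def split_by_repeat_type(gff_lines, attr_key="repeat_class"):
    """Group GFF3 records by repeat type.

    Assumption: the repeat type lives in a hypothetical attribute
    (attr_key), optionally as 'class/family'; only the class is used.
    """
    groups = defaultdict(list)
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments/directives and anything without the 9 GFF3 columns
        if line.startswith("#") or len(fields) < 9:
            continue
        value = parse_attributes(fields[8]).get(attr_key, "missing_attribute")
        repeat_type = value.split("/")[0]  # 'LINE/L2' -> 'LINE'
        groups[repeat_type].append(line.rstrip("\n"))
    return dict(groups)


# Made-up records for illustration only
example_repeats = [
    "##gff-version 3",
    "scaffold_1\tRepeatMasker\trepeat\t10\t250\t.\t+\t.\tID=r1;repeat_class=LINE/L2",
    "scaffold_1\tRepeatMasker\trepeat\t400\t520\t.\t-\t.\tID=r2;repeat_class=SINE/tRNA",
    "scaffold_2\tRepeatMasker\trepeat\t5\t900\t.\t+\t.\tID=r3;repeat_class=LINE/CR1",
]

groups = split_by_repeat_type(example_repeats)
for repeat_type, records in sorted(groups.items()):
    # One output file per type would follow the naming used in this post,
    # e.g. Panopea-generosa-vv0.74.a4.repeats.LINE.gff3
    print(repeat_type, len(records))
```

Writing each group to its own file is then just a loop over `groups` that opens one output file per key and writes the records back out, one per line.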
Genome Features
Panopea-generosa-vv0.74.a4.mRNA.gff3
-------------------------
mean 12903.649559
min 166.000000
median 5453.000000
max 283066.000000
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.gene.gff3
-------------------------
mean 10811.04461
min 166.00000
median 4464.00000
max 283066.00000
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.CDS.gff3
-------------------------
mean 201.476988
min 3.000000
median 133.000000
max 13221.000000
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.tRNA.gff3
-------------------------
mean 74.805659
min 53.000000
median 75.000000
max 314.000000
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.rRNA.gff3
-------------------------
mean 118.428571
min 113.000000
median 115.000000
max 138.000000
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.exon.gff3
-------------------------
mean 201.476988
min 3.000000
median 133.000000
max 13221.000000
Name: seqlength, dtype: float64
Repeats Features
Panopea-generosa-vv0.74.a4.repeats.LINE.gff3
-------------------------
percent 2.91
sum 27388849.00
mean 394.85
min 11.00
median 226.00
max 6604.00
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.repeats.Simple_repeat.gff3
-------------------------
percent 0.5
sum 4733271.0
mean 261.2
min 6.0
median 125.0
max 5981.0
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.repeats.Unknown.gff3
-------------------------
percent 29.09
sum 2.740281e+08
mean 1.991900e+02
min 1.100000e+01
median 1.440000e+02
max 6.574000e+03
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.repeats.LTR.gff3
-------------------------
percent 0.22
sum 2060084.00
mean 712.83
min 11.00
median 316.00
max 6541.00
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.repeats.RC.gff3
-------------------------
percent 0.02
sum 232303.00
mean 425.46
min 13.00
median 464.00
max 674.00
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.repeats.SINE.gff3
-------------------------
percent 0.65
sum 6133778.00
mean 155.69
min 11.00
median 164.00
max 934.00
Name: seqlength, dtype: float64
Panopea-generosa-vv0.74.a4.repeats.DNA.gff3
-------------------------
percent 0.91
sum 8602532.00
mean 407.82
min 11.00
median 247.00
max 7012.00
Name: seqlength, dtype: float64
-------------------------
Repeats composition of genome (percent): 34.3
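As a sanity check, the per-type percentages listed above do add up to the reported total. The sketch below redoes that arithmetic and also shows how a single `percent` value could be derived from a summed repeat length. The genome size isn't stated in this entry, so `genome_size_bp` is a hypothetical parameter and the numbers in the second example are placeholders, not real values.

```python
# Per-type percentages copied from the summaries above
repeat_percents = {
    "LINE": 2.91,
    "Simple_repeat": 0.5,
    "Unknown": 29.09,
    "LTR": 0.22,
    "RC": 0.02,
    "SINE": 0.65,
    "DNA": 0.91,
}

total_percent = round(sum(repeat_percents.values()), 2)
print(total_percent)  # 34.3


def percent_of_genome(total_bp, genome_size_bp):
    """Percent of the genome covered by features totalling total_bp.

    genome_size_bp is an assumption/hypothetical input; the assembly size
    is not given in this notebook entry.
    """
    return round(100 * total_bp / genome_size_bp, 2)


# Placeholder numbers only: 250 bp of repeats in a 1000 bp "genome"
print(percent_of_genome(250, 1000))  # 25.0
```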