Data Wrangling - Renaming, Splitting, and Feature Counts of Updated Pgenerosa_v074 GenSAS Merged GFF

In the final GFF from our GenSAS Pgenerosa_v074.a4 annotation, we noticed that no repeat motifs/sequences had been identified on Scaffold 01. All of the remaining scaffolds had repeat motifs, so something seemed amiss (see this GitHub Issue for more info).

I ended up contacting GenSAS and it turned out there was a bug on their end that led to this issue:

Hi Sam,

Thank you so much for your report. There was a bug and it has been fixed. Your gff3 files has been re-generated.

I generated a merged GFF after I “published” my annotation. I included RepeatModeler features in the merged GFF.

My genome has 18 chromosomes. All of them except one chromosome (name: PGA_scaffold1__77_contigs__length_89643857) has the expected repeats annotations present.

I looked at the individual RepeatMasker and RepeatModeler jobs, and both of those GFFs identified repeats on PGA_scaffold1__77_contigs__length_89643857.

Would you happen to have any ideas on why PGA_scaffold1__77_contigs__length_89643857 isn’t showing any repeat features in the merged GFF?

This is for my project Pgenerosa_v074.

So, now that I have the updated, final GFF, I want to re-run the steps that split the GFF into separate feature files, and regenerate the counts and sequence length stats for all features (including repeats).
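For readers who want the general idea without opening the notebook, here is a minimal sketch of splitting a GFF by feature type and computing per-feature counts and length stats. It is not the notebook's actual code: the function names (`split_gff_by_feature`, `length_stats`) and the example records are hypothetical. It assumes standard nine-column, tab-delimited GFF3 with 1-based inclusive coordinates.

```python
from collections import defaultdict
from statistics import mean, median


def split_gff_by_feature(lines):
    """Group GFF3 records by the feature type found in column 3."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A valid GFF3 record has nine tab-delimited columns.
        if len(fields) < 9:
            continue
        groups[fields[2]].append(fields)
    return dict(groups)


def length_stats(records):
    """Count and length summary; GFF coordinates are 1-based and inclusive."""
    lengths = [int(f[4]) - int(f[3]) + 1 for f in records]
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }


# Hypothetical records, for illustration only (not from the real GFF).
example = [
    "##gff-version 3",
    "scaffold_demo\tGenSAS\tgene\t1\t100\t.\t+\t.\tID=g1",
    "scaffold_demo\tGenSAS\tgene\t201\t500\t.\t-\t.\tID=g2",
    "scaffold_demo\tGenSAS\ttRNA\t600\t674\t.\t+\t.\tID=t1",
]

groups = split_gff_by_feature(example)
stats = {feature: length_stats(recs) for feature, recs in groups.items()}
for feature, summary in stats.items():
    print(feature, summary)
```

To produce separate feature files, each list in `groups` could be written back out by joining the fields with tabs, one file per feature type.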

Everything is documented in this Jupyter Notebook (GitHub):


Output folder:

I’ve copied/pasted the summary data for each of the GFFs that were analyzed, for quick reference.

I’ll double-check the numbers and update the manuscript as needed. Also, all files will be uploaded to the OSF repository for this paper.

Feature Count
CDS 236960
exon 236960
gene 34947
mRNA 38326
repeat_region 1676544
rRNA 8
tRNA 16889
mean       10811.04461
min          166.00000
median      4464.00000
max       283066.00000

mean       12903.649559
min          166.000000
median      5453.000000
max       283066.000000

mean       74.807745
min        53.000000
median     75.000000
max       314.000000

mean      117.125
min       108.000
median    115.000
max       138.000

mean        212.244974
min           6.000000
median      149.000000
max       10981.000000

mean        201.476988
min           3.000000
median      133.000000
max       13221.000000

percent 0.25
sum       2315583.00
mean          711.83
min            11.00
median        323.00
max          6541.00

percent 0.03
sum       258182.00
mean         429.59
min           13.00
median       464.00
max          674.00

percent 0.55
sum       5138701.00
mean          258.71
min             6.00
median        124.00
max          5981.00

percent 1.01
sum       9497156.00
mean          409.48
min            11.00
median        248.00
max          7012.00

percent 0.72
sum       6737909.00
mean          156.23
min            11.00
median        165.00
max           934.00

percent 3.19
sum       30035624.00
mean           395.53
min             11.00
median         226.00
max           6604.00

percent 32.04
sum       3.018520e+08
mean      1.998300e+02
min       1.100000e+01
median    1.450000e+02
max       1.098100e+04

Repeats composition of genome (percent): 37.79
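
As a quick sanity check, the total appears to be the sum of the seven per-category percent values listed above. The sketch below only re-adds those reported (already rounded) percentages. It does not recompute them from the base-pair sums, since that would require the genome size, which is not listed here.

```python
# Per-category percent values copied from the summaries above.
percents = [0.25, 0.03, 0.55, 1.01, 0.72, 3.19, 32.04]

# Round to two decimals to match the precision of the reported total.
total = round(sum(percents), 2)
print(total)
```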