Steven wanted transposable elements (TEs) in the Olympia oyster genome identified.
After some minor struggles, I was able to get RepeatMasker installed on on both of our Apple Xserves (emu & roadrunner; running Ubuntu 16.04LTS).
Genome used: pbjelly_sjw_01
I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:
Default settings (i.e. no species select - will use human genome).
Species = Crassostrea gigas (Pacific oyster)
Species = Crassostrea virginica (Eastern oyster)
Species = Ostrea lurida (Olympia oyster)
The idea was to get a sense of how the analyses would differ with species specifications. However, it’s likely that the only species setting that will make any difference will be Run #2 (Crassostrea gigas).
The reason I say this is that RepeatMasker has a built in tool to query which species are available in the RepBase database (e.g.):
<code>RepeatMasker-4.0.7/util/queryRepeatDatabase.pl -species "crassostrea virginica" -stat</code>
Here’s a very brief overview of what that yields:
Crassotrea gigas: 792 specific repeats
Crassostrea virginica: 4 Crassostrea virginica specific repeats
Ostrea lurida: 0 Ostrea lurida specific repeats
All runs were performed on roadrunner.
All commands were documented in a Jupyter Notebook (GitHub):
_NOTE: RepeatMasker writes the desired output files (.out, .cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename._
RESULTS:
RUN 1 (default settings - human genome)
Output folder:
Summary table (text):
Output table (GFF):
SUMMARY TABLE
<code>
==================================================
file name: jelly.out.fasta
sequences: 696946
total length: 1253001795 bp (1172226648 bp excl N/X-runs)
GC level: 36.51 %
bases masked: 20002806 bp ( 1.71 %)
==================================================
number of length percentage
elements* occupied of sequence
--------------------------------------------------
SINEs: 17794 1061170 bp 0.09 %
ALUs 363 31340 bp 0.00 %
MIRs 1166 92129 bp 0.01 %
LINEs: 4456 888114 bp 0.08 %
LINE1 976 103929 bp 0.01 %
LINE2 813 82891 bp 0.01 %
L3/CR1 699 63627 bp 0.01 %
LTR elements: 1187 199118 bp 0.02 %
ERVL 155 15828 bp 0.00 %
ERVL-MaLRs 200 20737 bp 0.00 %
ERV_classI 379 42833 bp 0.00 %
ERV_classII 66 6896 bp 0.00 %
DNA elements: 2290 196866 bp 0.02 %
hAT-Charlie 190 15468 bp 0.00 %
TcMar-Tigger 732 37473 bp 0.00 %
Unclassified: 101 12946 bp 0.00 %
Total interspersed repeats: 2358214 bp 0.20 %
Small RNA: 5954 433422 bp 0.04 %
Satellites: 366 55705 bp 0.00 %
Simple repeats: 310641 14322152 bp 1.22 %
Low complexity: 47381 2844279 bp 0.24 %
==================================================
* most repeats fragmented by insertions or deletions
have been counted as one element
Runs of >=20 X/Ns in query were excluded in % calcs
The query species was assumed to be homo sapiens
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
run with rmblastn version 2.6.0+
</code>
RUN 2 (species - Crassostrea gigas)
Output folder:
Summary table (text):
Output table (GFF):
SUMMARY TABLE
<code>
==================================================
file name: jelly.out.fasta
sequences: 696946
total length: 1253001795 bp (1172226648 bp excl N/X-runs)
GC level: 36.51 %
bases masked: 160759267 bp ( 13.71 %)
==================================================
number of length percentage
elements* occupied of sequence
--------------------------------------------------
Retroelements 213132 69887654 bp 5.96 %
SINEs: 2374 311974 bp 0.03 %
Penelope 171792 57862186 bp 4.94 %
LINEs: 195605 63430615 bp 5.41 %
CRE/SLACS 0 0 bp 0.00 %
L2/CR1/Rex 731 357995 bp 0.03 %
R1/LOA/Jockey 0 0 bp 0.00 %
R2/R4/NeSL 13 11377 bp 0.00 %
RTE/Bov-B 8085 1948581 bp 0.17 %
L1/CIN4 0 0 bp 0.00 %
LTR elements: 15153 6145065 bp 0.52 %
BEL/Pao 2119 955773 bp 0.08 %
Ty1/Copia 101 75372 bp 0.01 %
Gypsy/DIRS1 11776 4815361 bp 0.41 %
Retroviral 0 0 bp 0.00 %
DNA transposons 256292 35689117 bp 3.04 %
hobo-Activator 19847 2059651 bp 0.18 %
Tc1-IS630-Pogo 43269 6806311 bp 0.58 %
En-Spm 0 0 bp 0.00 %
MuDR-IS905 0 0 bp 0.00 %
PiggyBac 7935 1060296 bp 0.09 %
Tourist/Harbinger 9503 887332 bp 0.08 %
Other (Mirage, 0 0 bp 0.00 %
P-element, Transib)
Rolling-circles 0 0 bp 0.00 %
Unclassified: 174943 38299211 bp 3.27 %
Total interspersed repeats: 143875982 bp 12.27 %
Small RNA: 280 78768 bp 0.01 %
Satellites: 7383 1362194 bp 0.12 %
Simple repeats: 278809 12982714 bp 1.11 %
Low complexity: 44078 2622506 bp 0.22 %
==================================================
* most repeats fragmented by insertions or deletions
have been counted as one element
Runs of >=20 X/Ns in query were excluded in % calcs
The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
run with rmblastn version 2.6.0+
</code>
RUN 3 (species - Crassostrea virginica)
Output folder:
Summary table (text):
Output table (GFF):
SUMMARY TABLE
<code>
==================================================
file name: jelly.out.fasta
sequences: 696946
total length: 1253001795 bp (1172226648 bp excl N/X-runs)
GC level: 36.51 %
bases masked: 39598953 bp ( 3.38 %)
==================================================
number of length percentage
elements* occupied of sequence
--------------------------------------------------
Retroelements 63882 10327611 bp 0.88 %
SINEs: 63882 10327611 bp 0.88 %
Penelope 0 0 bp 0.00 %
LINEs: 0 0 bp 0.00 %
CRE/SLACS 0 0 bp 0.00 %
L2/CR1/Rex 0 0 bp 0.00 %
R1/LOA/Jockey 0 0 bp 0.00 %
R2/R4/NeSL 0 0 bp 0.00 %
RTE/Bov-B 0 0 bp 0.00 %
L1/CIN4 0 0 bp 0.00 %
LTR elements: 0 0 bp 0.00 %
BEL/Pao 0 0 bp 0.00 %
Ty1/Copia 0 0 bp 0.00 %
Gypsy/DIRS1 0 0 bp 0.00 %
Retroviral 0 0 bp 0.00 %
DNA transposons 9433 2307292 bp 0.20 %
hobo-Activator 0 0 bp 0.00 %
Tc1-IS630-Pogo 0 0 bp 0.00 %
En-Spm 0 0 bp 0.00 %
MuDR-IS905 0 0 bp 0.00 %
PiggyBac 0 0 bp 0.00 %
Tourist/Harbinger 0 0 bp 0.00 %
Other (Mirage, 0 0 bp 0.00 %
P-element, Transib)
Rolling-circles 0 0 bp 0.00 %
Unclassified: 51558 9836468 bp 0.84 %
Total interspersed repeats: 22471371 bp 1.92 %
Small RNA: 64164 10406776 bp 0.89 %
Satellites: 10 5985 bp 0.00 %
Simple repeats: 298612 14185090 bp 1.21 %
Low complexity: 47510 2866522 bp 0.24 %
==================================================
* most repeats fragmented by insertions or deletions
have been counted as one element
Runs of >=20 X/Ns in query were excluded in % calcs
The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
run with rmblastn version 2.6.0+
</code>
RUN 4 (species - Ostrea lurida)
Output folder:
Summary table (text):
Output table (GFF):
SUMMARY TABLE
<code>
==================================================
file name: jelly.out.fasta
sequences: 696946
total length: 1253001795 bp (1172226648 bp excl N/X-runs)
GC level: 36.51 %
bases masked: 17617763 bp ( 1.50 %)
==================================================
number of length percentage
elements* occupied of sequence
--------------------------------------------------
Retroelements 0 0 bp 0.00 %
SINEs: 0 0 bp 0.00 %
Penelope 0 0 bp 0.00 %
LINEs: 0 0 bp 0.00 %
CRE/SLACS 0 0 bp 0.00 %
L2/CR1/Rex 0 0 bp 0.00 %
R1/LOA/Jockey 0 0 bp 0.00 %
R2/R4/NeSL 0 0 bp 0.00 %
RTE/Bov-B 0 0 bp 0.00 %
L1/CIN4 0 0 bp 0.00 %
LTR elements: 0 0 bp 0.00 %
BEL/Pao 0 0 bp 0.00 %
Ty1/Copia 0 0 bp 0.00 %
Gypsy/DIRS1 0 0 bp 0.00 %
Retroviral 0 0 bp 0.00 %
DNA transposons 0 0 bp 0.00 %
hobo-Activator 0 0 bp 0.00 %
Tc1-IS630-Pogo 0 0 bp 0.00 %
En-Spm 0 0 bp 0.00 %
MuDR-IS905 0 0 bp 0.00 %
PiggyBac 0 0 bp 0.00 %
Tourist/Harbinger 0 0 bp 0.00 %
Other (Mirage, 0 0 bp 0.00 %
P-element, Transib)
Rolling-circles 0 0 bp 0.00 %
Unclassified: 3 189 bp 0.00 %
Total interspersed repeats: 189 bp 0.00 %
Small RNA: 282 79165 bp 0.01 %
Satellites: 10 5985 bp 0.00 %
Simple repeats: 313082 14662647 bp 1.25 %
Low complexity: 47785 2878201 bp 0.25 %
==================================================
* most repeats fragmented by insertions or deletions
have been counted as one element
Runs of >=20 X/Ns in query were excluded in % calcs
The query species was assumed to be ostrea lurida
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
run with rmblastn version 2.6.0+
</code>