cbai_genome_v1.0.fasta with our NanoPore Q7 reads on 20200917 and noticed that there were numerous sequences that were much shorter than the 500bp minimum length that the assembler (Flye) was supposed to spit out. I created an Issue on the Flye GitHub page to find out why. The developer responded, determined that it was an issue with the assembly polisher, and said that sequences <500bp could be safely ignored.
So, I’ve decided to subset the
cbai_genome_v1.0.fasta to exclude all sequences <1000bp, as that seems like a more reasonable minimum length for potential genes. I did not run this in a Jupyter Notebook, due to the brevity of the commands. Here are the commands, using
faidx --size-range 1000,1000000000 cbai_genome_v1.0.fasta > cbai_genome_v1.01.fasta
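For readers without the `faidx` command-line tool, the same kind of minimum-length filter can be roughly sketched in plain shell and awk. This is only an illustration, not the command that produced `cbai_genome_v1.01.fasta`: the function name `filter_fasta_min_length` is hypothetical, it applies no upper length bound, and it writes each kept sequence on a single line instead of preserving the original line wrapping.

```shell
# Hypothetical helper (not part of faidx): keep FastA records whose
# sequence length is at least the given minimum.
# Usage: filter_fasta_min_length MIN_LENGTH INPUT.fasta > OUTPUT.fasta
filter_fasta_min_length() {
  awk -v min="$1" '
    # Print the buffered record if it meets the minimum length.
    function emit() {
      if (header != "" && length(seq) >= min + 0) {
        print header
        print seq
      }
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { emit(); header = $0; seq = ""; next }
    # Sequence lines are joined so wrapped records are measured whole.
    { seq = seq $0 }
    END { emit() }
  ' "$2"
}

# Example (assumes the assembly is in the current directory):
# filter_fasta_min_length 1000 cbai_genome_v1.0.fasta > subset.fasta
```

Like the `1000` lower bound used above, the comparison is inclusive, so a sequence of exactly the minimum length is kept.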
Index new FastA
samb@mephisto:~/data/C_bairdi/genomes$ sort -nk2,2 cbai_genome_v1.01.fasta.fai | head
contig_4272 1000 15642836 60 61
contig_4503 1000 16422183 60 61
contig_4429 1001 16145927 60 61
contig_1038 1002 230201 60 61
contig_1691 1005 1716551 60 61
contig_2992 1005 7322005 60 61
contig_3284 1006 9674445 60 61
contig_1810 1008 2050977 60 61
contig_408 1008 15069716 60 61
contig_1616 1009 1549839 60 61
Subsetting looks like it worked.
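The eyeball check of the sorted index can also be reduced to a single number. A minimal sketch, assuming a standard `.fai` index in which the second column is the sequence length; the function name `min_fai_length` is hypothetical:

```shell
# Hypothetical helper: print the smallest sequence length listed in a
# FastA index (.fai). Column 2 of a .fai file holds the length in bases.
min_fai_length() {
  awk 'NR == 1 || ($2 + 0) < min { min = $2 + 0 }
       END { if (NR > 0) print min }' "$1"
}

# Example (assumes the index is in the current directory):
# min_fai_length cbai_genome_v1.01.fasta.fai
```

If the subsetting worked as intended, the value printed for `cbai_genome_v1.01.fasta.fai` should be no smaller than 1000.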
Looking at sequence counts in FastAs:
samb@mephisto:~/data/C_bairdi/genomes$ for file in *.fasta; do grep --with-filename -c ">" $file; done
cbai_genome_v1.01.fasta:2431
cbai_genome_v1.0.fasta:3294
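From the counts above, the subsetting dropped 3294 - 2431 = 863 sequences. The same difference can be computed directly from the two files. This is a small sketch with hypothetical function names, assuming every record starts with a `>` header line:

```shell
# Hypothetical helper: count FastA records by counting header lines.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Hypothetical helper: number of records removed between an original
# FastA (first argument) and its filtered subset (second argument).
removed_records() (
  before=$(count_fasta_records "$1")
  after=$(count_fasta_records "$2")
  echo $(( before - after ))
)

# Example (assumes both files are in the current directory):
# removed_records cbai_genome_v1.0.fasta cbai_genome_v1.01.fasta
```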
Any future work with C.bairdi genome assemblies will be with
cbai_genome_v1.01.fasta (until a better assembly comes along).
All files were copied to our genomic databank on Owl.
See our Genomic Resources wiki (GitHub) for a more concise overview.