Data Wrangling - Subsetting cbai_genome_v1.0 Assembly with faidx

I previously assembled cbai_genome_v1.0.fasta from our NanoPore Q7 reads on 20200917 and noticed numerous sequences that were much shorter than the 500bp minimum length that the assembler (Flye) was expected to output. I created an Issue on the Flye GitHub page to find out why. The developer responded that it was an issue with the assembly polisher and that sequences <500bp could be safely ignored.

So, I’ve decided to subset cbai_genome_v1.0.fasta to exclude all sequences <1000bp, as that seems like a more reasonable minimum length for potential genes. Since the commands are so brief, I did not run this in a Jupyter Notebook. Here are the commands, using faidx:

≥1kbp subsetting

faidx --size-range 1000,1000000000 cbai_genome_v1.0.fasta > cbai_genome_v1.01.fasta
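As a cross-check that doesn't depend on faidx, the same length filter can be roughly sketched with awk. This is not the command that was run above; `filter_fasta_min_len` is a hypothetical helper name, and unlike faidx it writes each sequence on a single, unwrapped line. The lower bound is inclusive, so 1000bp sequences are kept.

```shell
# Hypothetical helper (not part of faidx): keep only FASTA records whose
# sequence length is >= min_len. Sequences are written unwrapped.
# Usage: filter_fasta_min_len <min_len> <input.fasta>
filter_fasta_min_len() {
  awk -v min="$1" '
    # Print the buffered record if it meets the minimum length.
    function emit() {
      if (header != "" && length(seq) >= min + 0) {
        print header
        print seq
      }
    }
    /^>/ { emit(); header = $0; seq = ""; next }  # new record starts
    { seq = seq $0 }                              # join wrapped sequence lines
    END { emit() }                                # flush the last record
  ' "$2"
}
```

For example, `filter_fasta_min_len 1000 cbai_genome_v1.0.fasta | grep -c "^>"` should report the same number of sequences as the faidx-subsetted file, assuming the input is a plain, uncompressed FastA.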

Index new FastA

faidx cbai_genome_v1.01.fasta

samb@mephisto:~/data/C_bairdi/genomes$ sort -nk2,2 cbai_genome_v1.01.fasta.fai | head

contig_4272 1000    15642836    60  61
contig_4503 1000    16422183    60  61
contig_4429 1001    16145927    60  61
contig_1038 1002    230201  60  61
contig_1691 1005    1716551 60  61
contig_2992 1005    7322005 60  61
contig_3284 1006    9674445 60  61
contig_1810 1008    2050977 60  61
contig_408  1008    15069716    60  61
contig_1616 1009    1549839 60  61

Subsetting looks like it worked.
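The same check can be done without eyeballing the sorted index: the second column of a .fai file is the sequence length, so the smallest value in that column should be at least 1000. A minimal sketch, assuming a standard whitespace-delimited .fai; `fai_min_length` is a hypothetical helper name.

```shell
# Hypothetical helper: print the smallest sequence length (column 2)
# found in a FASTA index (.fai) file. Prints nothing for an empty file.
# Usage: fai_min_length <file.fai>
fai_min_length() {
  awk '
    NR == 1 || $2 + 0 < min { min = $2 + 0 }  # track the running minimum
    END { if (NR > 0) print min }
  ' "$1"
}
```

Going by the sorted output above, `fai_min_length cbai_genome_v1.01.fasta.fai` should print 1000.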

Looking at sequence counts in FastAs:

samb@mephisto:~/data/C_bairdi/genomes$ for file in *.fasta; do grep --with-filename -c ">" $file; done

cbai_genome_v1.01.fasta:2431
cbai_genome_v1.0.fasta:3294

MD5 checksums

5a08d8b0651484e3ff75fcf032804596 cbai_genome_v1.01.fasta
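Anyone who downloads the file can compare it against the checksum above. A minimal sketch, assuming GNU coreutils `md5sum` is available (macOS ships `md5` instead); `verify_md5` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed (exit 0) only if the file's MD5 checksum
# matches the expected value.
# Usage: verify_md5 <expected_md5> <file>
verify_md5() {
  # md5sum prints "<checksum>  <filename>"; keep only the checksum.
  [ "$(md5sum "$2" | awk '{ print $1 }')" = "$1" ]
}
```

Usage would look something like `verify_md5 5a08d8b0651484e3ff75fcf032804596 cbai_genome_v1.01.fasta && echo "checksum OK"`.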

Any future work with C. bairdi genome assemblies will use cbai_genome_v1.01.fasta (until a better assembly comes along).

All files were copied to our genomic databank on Owl.

See our Genomic Resources wiki (GitHub) for a more concise overview.