INTRO
Steven asked me (GitHub Issue) to generate a DIAMOND BLAST database for use with his class, as well as our lab, since we didn’t currently have one set up.
MATERIALS & METHODS
Downloaded NCBI NR FastA from here:
https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
Confirmed MD5 sum was okay (not shown)
Renamed the unzipped FastA to match download date:
ncbi-nr-20250429.fasta
This is a huge FastA file. Unzipped, it will be ~350GB in size!
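The unzip-and-rename step is not shown above. A minimal sketch of one way to do it, assuming gunzip is available; the unzip_dated helper is hypothetical and not the command that was actually run:

```shell
# Hypothetical helper: decompress a gzipped FastA into ncbi-nr-<YYYYMMDD>.fasta.
# $1 = path to the .gz file
# $2 = optional date stamp (defaults to today's date)
unzip_dated() {
  gz="$1"
  stamp="${2:-$(date +%Y%m%d)}"
  out="ncbi-nr-${stamp}.fasta"
  # -c writes to stdout, so the original archive is left untouched
  gunzip -c "$gz" > "$out" && printf '%s\n' "$out"
}

# Example (commented out; needs nr.gz in the current directory):
# unzip_dated nr.gz 20250429
```

Keeping the archive means both the compressed and the ~350GB unzipped file sit on disk at the same time, so check free space first.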
Downloaded NCBI Taxonomy info from here:
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
Confirmed MD5 sum was okay (not shown)
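Neither MD5 check is shown. A minimal sketch of how such a check could look, assuming the matching .md5 file (e.g. nr.gz.md5 or taxdmp.zip.md5) was downloaded alongside each archive and that md5sum is installed; verify_md5 is a hypothetical helper name:

```shell
# Hypothetical helper: compare a file's MD5 against the hash stored in a .md5 file.
# $1 = downloaded file
# $2 = .md5 file whose first whitespace-separated field is the expected hash
verify_md5() {
  expected=$(awk '{print $1}' "$2")
  actual=$(md5sum "$1" | awk '{print $1}')
  # Return status is 0 only when the two hashes are identical
  [ "$expected" = "$actual" ]
}

# Example (commented out; needs the downloaded files):
# verify_md5 nr.gz nr.gz.md5 && echo "nr.gz OK"
# verify_md5 taxdmp.zip taxdmp.zip.md5 && echo "taxdmp.zip OK"
```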
Before creating the DIAMOND BLAST database, I had to manipulate some of the NCBI files first.
Had to restore the superkingdom rank designation in the NCBI Taxonomy nodes.dmp, as NCBI recently renamed this rank to domain and DIAMOND does not yet have a release that handles this change in nomenclature.
sed -i 's/domain/superkingdom/g' nodes.dmp
And, I actually had to replace all the new ranks (“domain”, “realm”, “acellular root”, and “cellular root”) in nodes.dmp and names.dmp with “superkingdom” (figured out due to this comment in this GitHub issue).
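The exact command used for the full set of ranks is not recorded here. A minimal sketch of what it might look like, extending the sed call above to all four rank names; normalize_ranks is a hypothetical helper. Like the original sed, the pattern is unanchored, so it would also rewrite any taxon name that happens to contain one of these words:

```shell
# Hypothetical helper: rewrite the new NCBI rank names to "superkingdom" in each file given.
normalize_ranks() {
  for f in "$@"; do
    # One alternation handles all four ranks in a single pass.
    # The regex engine takes the leftmost match, so "acellular root" is replaced
    # as a whole and never turned into "asuperkingdom".
    sed -E 's/(acellular root|cellular root|domain|realm)/superkingdom/g' "$f" > "$f.tmp" \
      && mv "$f.tmp" "$f"
  done
}

# Example (commented out; edits the files in place, so keep a backup):
# normalize_ranks nodes.dmp names.dmp
```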
Make the Database
Ran this command to make the database:
/home/shared/diamond-2.1.8 makedb \
--in ncbi-nr-20250429.fasta \
--db ncbi-nr-20250429 \
--taxonmap prot.accession2taxid \
--taxonnodes nodes.dmp \
--taxonnames names.dmp \
--threads 40
The --db ncbi-nr-20250429 simply specifies the output file name to be used for the resulting database.
RESULTS
The final database is located here:
/home/shared/16TB_HDD_01/sam/databases/blastdbs/ncbi-nr-20250429.dmnd
The database is 349GB!
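Since the database was built with the taxonomy files, searches against it should be able to report taxonomy for each hit. A minimal sketch of such a search, assuming a protein query; the function name, query file, output file, and the chosen output columns are hypothetical, while staxids and sscinames are DIAMOND output fields that depend on taxonomy being present in the database:

```shell
# Path to the DIAMOND binary used above; override by setting DIAMOND beforehand.
DIAMOND="${DIAMOND:-/home/shared/diamond-2.1.8}"
DB="/home/shared/16TB_HDD_01/sam/databases/blastdbs/ncbi-nr-20250429.dmnd"

# Hypothetical helper: blastp a protein FastA against the NR database and
# write tabular output with taxonomy IDs and scientific names per hit.
# $1 = query protein FastA
# $2 = output file
run_taxonomic_blastp() {
  "$DIAMOND" blastp \
    --db "$DB" \
    --query "$1" \
    --out "$2" \
    --outfmt 6 qseqid sseqid pident evalue bitscore staxids sscinames \
    --threads 40
}

# Example (commented out; needs a real query file):
# run_taxonomic_blastp query.faa query-vs-nr.tsv
```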
SUMMARY
Overall, the entire process of downloading, unzipping, and creating the database took nearly 6hrs.