I performed the assembly on Hyak (Klone), UW’s high-performance computing cluster. I used our Apptainer (Singularity) container to run the job (srlab-R4.4-bioinformatics-container-e5bcfea.sif), which includes hifiasm version 0.25.0.
Below is the rendered markdown from 13.1-hifiasm-genome-assembly-lean.Rmd. The run was very fast; it took only a day and a half, vs. ~7 days for the Flye assembly.
PRIMARY OUTPUTS
hifiasm produces three assemblies: a primary assembly and two haplotype-resolved assemblies. It writes them only as GFA graph files, so I converted each one to FastA using gfatools. Links to the resulting FastA files are below.
Primary assembly:
FastA (3.5GB): https://gannet.fish.washington.edu/gitrepos/project-lake-trout/output/13.1-hifiasm-genome-assembly-lean/pb-hifiasm-lean-assembly.fa
FastA index: https://gannet.fish.washington.edu/gitrepos/project-lake-trout/output/13.1-hifiasm-genome-assembly-lean/pb-hifiasm-lean-assembly.fa.fai
Haplotype 1:
FastA (3.2GB): https://gannet.fish.washington.edu/gitrepos/project-lake-trout/output/13.1-hifiasm-genome-assembly-lean/pb-hifiasm-lean-hap1-assembly.fa
FastA index: https://gannet.fish.washington.edu/gitrepos/project-lake-trout/output/13.1-hifiasm-genome-assembly-lean/pb-hifiasm-lean-hap1-assembly.fa.fai
Haplotype 2:
FastA (2.7GB): https://gannet.fish.washington.edu/gitrepos/project-lake-trout/output/13.1-hifiasm-genome-assembly-lean/pb-hifiasm-lean-hap2-assembly.fa
FastA index: https://gannet.fish.washington.edu/gitrepos/project-lake-trout/output/13.1-hifiasm-genome-assembly-lean/pb-hifiasm-lean-hap2-assembly.fa.fai
After this, I’ll compare the two assemblies (Flye vs. hifiasm) using QUAST, which will provide a more comprehensive set of assembly statistics and metrics.
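In the meantime, a few basic statistics can be pulled straight from the FastA index files linked above. This is a minimal sketch, not a replacement for QUAST. It assumes the standard `.fai` layout (contig name in column 1, sequence length in column 2), and the function name `fai_summary` is hypothetical; it was not part of the original notebook.

```shell
# Minimal sketch: contig count, total length, and N50 from a FastA index.
# Assumption: standard .fai layout, tab-delimited, with length in column 2.
fai_summary() {
  local fai="$1"
  # Sort contig lengths from longest to shortest, then accumulate them
  # until half of the total assembly length is reached (the N50 contig).
  cut -f2 "${fai}" | sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      half = total / 2
      n50 = 0
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (cum >= half) { n50 = len[i]; break }
      }
      printf "contigs=%d total_bp=%.0f n50=%.0f\n", NR, total, n50
    }'
}
```

After downloading an index file, it could be called as `fai_summary pb-hifiasm-lean-assembly.fa.fai`, and likewise for the hap1 and hap2 indexes.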
1 BACKGROUND
Use hifiasm (GitHub) (Cheng et al. 2021, 2022, 2024) to assemble PacBio reads for the S. namaycush lean ecotype.
Due to large file sizes, the outputs will not be on GitHub. They may be found here instead:
    # OUTPUT DIRECTORY
    data_dir <- "/mmfs1/gscratch/scrubbed/samwhite/gitrepos/RobertsLab/project-lake-trout/data/pacbio-lean"
    output_dir <- "../output/13.1-hifiasm-genome-assembly-lean"

    # PROGRAMS
    gfatools <- c("/srlab/programs/gfatools/gfatools")
    hifiasm <- c("/srlab/programs/miniforge3-24.7.1-0/envs/hifiasm-0.25.0_env/bin/hifiasm")

    # SETTINGS
    threads <- "40"

    # Export these as environment variables for bash chunks.
    Sys.setenv(
      data_dir = data_dir,
      gfatools = gfatools,
      hifiasm = hifiasm,
      output_dir = output_dir,
      threads = threads
    )

    # Make output directory, if it doesn't exist
    mkdir --parents "${output_dir}"

    cd "${output_dir}"

    # Create an array of .fastq.gz files
    fastqs=(${data_dir}/*.fastq.gz)

    # Print the result (optional, for verification)
    ## newline-delimited format
    echo "List of FastQs used for assembly:"
    printf "%s\n" "${fastqs[@]}" \
    | tee input_fastqs.txt

    # Run hifiasm assembly
    "${hifiasm}" \
    -o pb-hifiasm-lean-assembly.asm \
    -t "${threads}" \
    "${fastqs[@]}" \
    > pb-hifiasm-lean-assembly.log 2>&1
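Once the run finishes, a quick check that the expected contig graphs exist can catch a failed or truncated run before the conversion step. The sketch below assumes hifiasm's default output naming in HiFi-only mode (`<prefix>.bp.p_ctg.gfa`, `<prefix>.bp.hap1.p_ctg.gfa`, `<prefix>.bp.hap2.p_ctg.gfa`). The function name `check_hifiasm_gfas` is hypothetical and was not part of the original notebook.

```shell
# Minimal sketch: confirm the three contig graphs exist and are non-empty.
# Assumption: hifiasm's default HiFi-only naming, "<prefix>.bp.*.p_ctg.gfa".
check_hifiasm_gfas() {
  local prefix="$1"
  local missing=0
  local suffix
  for suffix in bp.p_ctg.gfa bp.hap1.p_ctg.gfa bp.hap2.p_ctg.gfa; do
    if [ -s "${prefix}.${suffix}" ]; then
      echo "found:   ${prefix}.${suffix}"
    else
      echo "missing: ${prefix}.${suffix}" >&2
      missing=1
    fi
  done
  # Return non-zero if any expected graph file is absent or empty.
  return "${missing}"
}
```

With the `-o` prefix used in this run, the call would be `check_hifiasm_gfas pb-hifiasm-lean-assembly.asm`, run from the output directory.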
3 GFATOOLS
3.1 Primary contigs to FastA
This will extract a FastA file from the graph files output by hifiasm.
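The conversion can be sketched as follows. gfatools provides a `gfa2fa` subcommand for this. The sketch instead uses the awk one-liner suggested in the hifiasm documentation, so it runs without gfatools installed, though its line wrapping may differ from gfatools output. The function name `gfa_to_fasta` is hypothetical, and the GFA file names in the comments assume hifiasm's default output naming for the `-o` prefix used above. These are not the exact commands from the original notebook.

```shell
# Minimal sketch: write each GFA segment ("S" line) as a FastA record.
# Assumption: sequences are stored inline in column 3, as hifiasm writes them.
gfa_to_fasta() {
  local gfa="$1"
  local fasta="$2"
  awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "${gfa}" > "${fasta}"
}

# Hypothetical usage for the primary assembly (input name is an assumption):
#   gfa_to_fasta pb-hifiasm-lean-assembly.asm.bp.p_ctg.gfa pb-hifiasm-lean-assembly.fa
#
# With gfatools itself, the call would look like:
#   "${gfatools}" gfa2fa pb-hifiasm-lean-assembly.asm.bp.p_ctg.gfa > pb-hifiasm-lean-assembly.fa
```

The same call would be repeated for the hap1 and hap2 graphs to produce the two haplotype FastA files.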
Cheng, Haoyu, Mobin Asri, Julian Lucas, Sergey Koren, and Heng Li. 2024. “Scalable Telomere-to-Telomere Assembly for Diploid and Polyploid Genomes with Double Graph.” Nature Methods 21 (6): 967–70. https://doi.org/10.1038/s41592-024-02269-8.
Cheng, Haoyu, Gregory T. Concepcion, Xiaowen Feng, Haowen Zhang, and Heng Li. 2021. “Haplotype-Resolved de Novo Assembly Using Phased Assembly Graphs with Hifiasm.” Nature Methods 18 (2): 170–75. https://doi.org/10.1038/s41592-020-01056-5.
Cheng, Haoyu, Erich D. Jarvis, Olivier Fedrigo, et al. 2022. “Haplotype-Resolved Assembly of Diploid Genomes Without Parental Data.” Nature Biotechnology 40 (9): 1332–35. https://doi.org/10.1038/s41587-022-01261-x.