Bypassing the CPU for NVMe-to-GPU transfers is clever. The bottleneck for running large models locally has always been the memory hierarchy; this essentially treats the NVMe drive as extended VRAM with direct DMA.
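For anyone wondering what the direct DMA path looks like in practice: assuming the mechanism is something like NVIDIA's GPUDirect Storage (cuFile), which is my guess rather than anything stated in the post, the core read looks roughly like this. File path, offset, and chunk size are placeholders.

```cpp
// Rough sketch: DMA a chunk of a weights file from NVMe straight into VRAM
// with NVIDIA's cuFile (GPUDirect Storage). No bounce buffer in host RAM.
// Path, offset, and chunk size below are made-up placeholders.
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    cuFileDriverOpen();                                   // init the GDS driver

    int fd = open("/data/model/weights.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t descr{};                                // describe the fd to cuFile
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    descr.handle.fd = fd;
    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);

    const size_t chunk = 64ull << 20;                     // 64 MiB per read
    void* devPtr = nullptr;
    cudaMalloc(&devPtr, chunk);                           // destination lives in VRAM
    cuFileBufRegister(devPtr, chunk, 0);                  // register the buffer for DMA

    // NVMe controller -> GPU memory; the CPU only orchestrates the request.
    ssize_t n = cuFileRead(fh, devPtr, chunk, /*file_offset=*/0, /*devPtr_offset=*/0);
    std::printf("read %zd bytes into VRAM\n", n);

    cuFileBufDeregister(devPtr);
    cudaFree(devPtr);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```

The point is that the read lands in VRAM without a staging copy in host memory, so a model loop can keep streaming layer weights off the SSD while the GPU computes.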
I wonder how this compares to Apple's unified memory approach on M-series chips for similar workloads. The M4 Max can fit 70B models entirely in memory without any offloading tricks, though at lower throughput than a 3090.
Would be interesting to see comparative benchmarks: this NVMe approach on a 3090 vs M4 Max native, especially for batch inference where the NVMe latency might be amortized.
NVMe drives are much, much slower than RAM, especially unified/soldered RAM.
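Back-of-envelope, using rough public spec figures rather than measurements from this project: decode is roughly memory-bandwidth-bound, and each generated token touches essentially every weight once, so tokens/s is capped near bandwidth divided by model size.

```cpp
// Rough bandwidth-bound tokens/s ceiling; all figures are approximate
// spec-sheet numbers, not benchmarks of this project.
#include <cstdio>

int main() {
    const double model_bytes = 40e9;    // ~70B params at ~4-bit quantization
    const double nvme_bw     = 7e9;     // PCIe 4.0 x4 NVMe, sequential read
    const double m4max_bw    = 546e9;   // M4 Max unified memory (Apple spec)
    const double rtx3090_bw  = 936e9;   // RTX 3090 GDDR6X

    // tokens/s <= bandwidth / bytes touched per token
    std::printf("weights on NVMe: <= %.2f tok/s\n", nvme_bw    / model_bytes);
    std::printf("M4 Max unified:  <= %.1f tok/s\n", m4max_bw   / model_bytes);
    std::printf("3090 VRAM:       <= %.1f tok/s\n", rtx3090_bw / model_bytes);
    return 0;
}
```

Which is also why the batch-inference point above matters: batching amortizes the same weight read across many sequences, so the NVMe ceiling hurts less.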
To be fair, llama.cpp has had this feature for over a year now. It only applies to GGUF, though.
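If anyone wants to try that: mmap is llama.cpp's default, so a quantized GGUF gets demand-paged from disk instead of fully loaded up front. Something like the following, where the binary name and model filename are examples and will vary by build:

```
# mmap is on by default; pass --no-mmap to force a full load into RAM instead.
./llama-cli -m ./models/llama-3-70b-q4_k_m.gguf -ngl 20 -p "hello"
```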
I've got an M3; I'll test it on Metal and see how it goes.