Show HN: Run 500B+ Parameter LLMs Locally on a Mac Mini

Hi HN, I built OpenGraviton, an open-source AI inference engine that pushes the limits of running extremely large LLMs on consumer hardware. By combining 1.58-bit ternary quantization, dynamic sparsity with Top-K pruning and MoE routing, and mmap-based layer streaming, OpenGraviton can run models far larger than your system RAM—even on a Mac Mini. Early benchmarks: TinyLlama-1.1B drops from ~2GB (FP16) to ~0.24GB with ternary quantization. At 140B scale, models that normally require ~280GB fit within ~35GB packed. Optimized for Apple Silicon with Metal + C++ tensor unpacking, plus speculative decoding for faster generation. Check benchmarks, architecture, and details here: https://opengraviton.github.io GitHub: https://github.com/opengraviton This project isn’t just about squeezing massive models onto tiny hardware—it’s about democratizing access to giant LLMs without cloud costs. Feedback, forks, and ideas are very welcome!

github.com

6 points

fatihturker

14 hours ago


3 comments

deflator an hour ago

Fascinating. I don't understand the technical terms, but running a big coding agent locally is a dream of mine, so I thank you for your efforts!

ryanholtdev 5 hours ago

Running a Mac Mini M4 as a home server for a bunch of automation scripts right now. The mmap-based layer streaming is the part I'm most curious about -- how does latency look when you're streaming layers from disk mid-inference? I'd expect throughput to degrade sharply once you exceed unified memory, but maybe the Top-K sparsity masks enough of the weight accesses that it's not as bad as sequential streaming would be. What's the actual tokens/sec at 140B scale on the base Mac Mini config?