TurboQuant is a restricted version of EDEN quantization (NeurIPS 21, ICML 22). It lacks the optimal scale derivations, which makes the TurboQuant variant considerably less accurate than those works. We show this thoroughly in a new note at https://arxiv.org/abs/2604.18555.
We were the first to introduce post-rotation distribution-aware quantization in 2021. This was later implemented in many fields, including federated learning, vector retrieval, databases, inference engines, and KV-cache.
It would be appropriate to receive credit for this. Furthermore, it is baffling to see the name "TurboQuant" repeated in this context, considering the many works published from 2021 onwards.
The blog post mentioned above essentially guides you through EDEN quantization but ultimately settles on a sub-optimal MSE-minimizing version and an unbiasing trick. This trick often costs a full bit more than DRIVE/EDEN requires to achieve the same results using the unbiasing scale shown in the original 2021 paper.
https://docs.vllm.ai/en/v0.20.0/api/vllm/model_executor/laye...
`vllm.model_executor.layers.quantization.turboquant`
> The technique implemented here consists of the scalar case of the HIGGS quantization method (Malinovskii et al., "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", NAACL 2025; preprint arXiv:2411.17525): rotation + optimized grid + optional re-normalization, applied to KV cache compression. A first application of this approach to KV-cache compression is in "Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models" (Shutova et al., ICML 2025; preprint arXiv:2501.19392). Both these references pre-date the TurboQuant paper (Zandieh et al., ICLR 2026).
For those who want to :popcorn-meme: the drama, there's some great comments on the peer review of the TurboQuant paper: https://openreview.net/forum?id=tO3ASKZlok
Thanks a lot for pointing this out. I will update this explainer to properly add the prior literature so that there is a proper attribution.
Thanks for the quick response and for being willing to update the explainer. I really appreciate the clarification.
I wonder how often this happens in practice - by "this", I mean industry/LLM world not noticing* some research until a bigger player repeats it with louder PR.
(*hopefully I didn't misunderstand the situation)
Ask Jürgen Schmidhuber
If we go only by the cases that have been publicly known it already happens all the time. Lots of patents are a race to register by multiple parties too and it's rarely done fairly.
https://arxiv.org/abs/2604.18555
"This note clarifies the relationship between the recent TurboQuant work and the earlier DRIVE (NeurIPS 2021) and EDEN (ICML 2022) schemes. DRIVE is a 1-bit quantizer that EDEN extended to any bits per coordinate; we refer to them collectively as EDEN. First, TurboQuant is a special case of EDEN obtained by fixing EDEN's scalar scale parameter to . EDEN supports both biased and unbiased quantization, each optimized by a different (chosen via methods described in the EDEN works). The fixed choice used by TurboQuant is generally suboptimal, although the optimal for biased EDEN converges to as the dimension grows; accordingly TurboQuant approaches EDEN's behavior for large . Second, TurboQuant combines a biased -bit EDEN step with an unbiased 1-bit QJL quantization of the residual. It is suboptimal in three ways: (1) its -bit step uses the suboptimal ; (2) its 1-bit unbiased residual quantization has worse MSE than (unbiased) 1-bit EDEN; (3) chaining a biased -bit step with a 1-bit unbiased residual step is inferior to unbiasedly quantizing the input directly with -bit EDEN. Third, some of the analysis in the TurboQuant work mirrors that of the EDEN works: both exploit the connection between random rotations and the shifted Beta distribution, use the Lloyd-Max algorithm, and note that Randomized Hadamard Transforms can replace uniform random rotations. Experiments support these claims: biased EDEN (with optimized ) is more accurate than TurboQuant, and unbiased EDEN is markedly more accurate than TurboQuant, often by more than a bit (e.g., 2-bit EDEN beats 3-bit TurboQuant). We also repeat all accuracy experiments from the TurboQuant paper, showing that EDEN outperforms it in every setup we have tried."
Are you guys going to follow up with a paper showing EDEN results match or beat turboquant for needle in a haystack benchmarks?
The note includes extensive experiments and reproduces many of the figures from the TurboQuant paper in our Section 5. Honestly, I think our case is pretty clear-cut as is. I am not sure what the overhead for those specific benchmarks would be, but we will look into it.
(In any case, I want to emphasize that TurboQuant quantizer is a private case of EDEN)
with the amount of traction this has gotten... coming with a clear set of experiments even on arxiv paper would be of great help to showcase your improvements. And if they're easily reproducible, they could get integrated in the mainstream inference engines as well, as the main point here is compression with little degradation.
When you use TurboQuant, you are essentially using the EDEN quantizer under a different name applied to KV-cache.
Both EDEN and its 1-bit variant have been implemented in PyTorch, JAX, and TensorFlow across numerous open-source libraries and are used in various applications. I am currently writing a blog post that will document these in detail.
EDEN defines a scale parameter, S, for which we suggest specific optimal values for both biased and unbiased versions. As shown in the note I shared, these values lead to clear empirical improvements. Consequently, users who rely on the less optimal S value and the unbiasing method popularized by TurboQuant will generally see inferior results compared to those using EDEN with the optimal scale values suggested in our original papers.