This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.
Pardon my simplistic question, but when you say rotation, you're essentially talking about diagonalization, aren't you?
So is it more compact to store the diagonal matrix and the new basis?
In this context, the rotation is for spreading energy and ensuring predictable coordinate distributions rather than diagonalization; it makes coordinate-wise quantization much more computationally efficient, though it throws away learnable structure.
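To make that intuition concrete, here's a toy sketch (my own example, not code from any of the papers discussed) of how a random rotation spreads a spiky vector's energy so that even a crude per-coordinate sign quantizer does better than it would on the raw vector. Real systems use fast structured rotations (e.g. randomized Hadamard) instead of a dense QR-derived matrix; the dense one here is just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# A "spiky" vector: almost all the energy sits in a few coordinates,
# which is a worst case for a uniform per-coordinate quantizer.
x = np.zeros(d)
x[:4] = 10.0

# Random orthogonal rotation via QR (a stand-in for the fast structured
# rotations used in practice; only the seed needs to be carried along).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = Q @ x  # energy is now spread out; coordinates look roughly Gaussian

# 1-bit-per-coordinate quantizer: keep only signs plus one shared scale.
scale_rot = np.abs(y).mean()
x_hat = Q.T @ (scale_rot * np.sign(y))  # decode by undoing the rotation

# Same quantizer applied naively, without rotating first.
scale_naive = np.abs(x).mean()
x_naive = scale_naive * np.sign(x)

err_rotated = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
err_naive = np.linalg.norm(x - x_naive) / np.linalg.norm(x)
```

With the rotation, the coordinates match the "well-known distribution" the quantizer implicitly assumes, so `err_rotated` comes out well below `err_naive`.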
ah ok, so intuitively it's like minimizing the error when replacing the values with a well-known distribution. So all you need to carry along is the rotation and the assumption that there is some amount of loss.
There are papers that try to quantize angles associated with weights because angles have a more uniform distribution. I haven't read this specific paper, but it looks like it uses a similar trick at a glance.
I just today learned about Multi-Head Latent Attention, which is also sort of a way of compressing the KV cache. Can someone explain how this new development relates to MHLA?
Multi-Head Latent Attention (MLA) is a redesigned attention mechanism that produces lower-dimensional KV-cache entries. Vector quantization stores KV-cache entries using a small number of bits per dimension while ensuring that the resulting attention scores don't change too much. So MLA needs to be part of the model from the beginning of training, whereas VQ can be retrofitted afterwards, and you could also combine the two.
MLA makes it so the keys and values used are a function of a smaller latent vector you cache, instead of caching a full key and value for each token. KV cache quantization reduces the size of the values in the cache by using fewer bits to store each one. These two approaches operate on different parts of the process, so they can be used in combination. For example, you can quantize the latents that are stored for MLA.
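A toy sketch of that combination (the shapes and the `W_down` projection are made-up placeholders, not any real model's dimensions): MLA caches one small latent per token, and a simple int4 quantizer then shrinks each stored latent further:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: MLA caches a 64-dim latent per token instead of
# full per-head keys and values derived from the 1024-dim hidden state.
d_model, d_latent, n_tokens = 1024, 64, 8

# In real MLA this down-projection is learned; random here for illustration.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)

h = rng.standard_normal((n_tokens, d_model))  # hidden states
latents = h @ W_down                          # what MLA actually caches

def quantize_int4(v):
    """Retrofit 4-bit quantization: map values into [-7, 7] with one scale."""
    scale = np.abs(v).max() / 7.0
    q = np.round(v / scale).astype(np.int8)
    return q, scale

# Per-token cache cost drops from 64 floats to 64 nibbles plus one scale.
q, s = quantize_int4(latents[0])
recovered = q * s
```

The point is just that the quantizer never touches the attention mechanism itself; it only changes how the already-small latents are stored, which is why the two compose.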
If they didn't cite your paper that's bullshit.
But if they read your paper closely enough that they invited you to a talk, that probably means they were far enough along toward independently inventing it that they would have done so anyway, and wanted to chat with someone who was already doing the thing they were doing. Good ideas tend to reveal themselves to anyone who is aware of the problem.
To be clear, I am not claiming they stole an idea; they have done significant independent research. However, a specific part, the treatment of rotation with bias correction, relates to prior work, and it would be appropriate to have that recognized.
That's rationalizing like crazy. If they knew about it they should have cited it.
Doesn't matter, you should still cite. It's basic manners in science.
Exactly, that's why the section is called "Related Work".
The earlier paper was from 2021!
> But if they read your paper closely enough that they invited you to a talk, that probably means they were far enough along toward independently inventing it
That's more than a stretch. They likely invited them because someone thought the abstract sounded interesting, or something like that.
Schmidhuber'd