So, what happens when you test it on 11 digit numbers? I don’t mean that as a gotcha or “LOL dumb transformer” snark. More like, does the accuracy start to drop as you add digits? Or instead, maybe it’s the transformer equivalent of a stack overflow and it outputs a picture of a burning spoon or something?
And for that matter, what’s it do with 9 digit numbers? Like, is it more accurate with them, or are these little guys mainly good at adding numbers with exactly 10 digits?
Basically, are the failures modes a gentle increase in inaccuracy, or spectacle failure outside their parameters?
One major limitation of the LLM architecture is that even the failure mode varies unpredictably between inputs.
The set of 11-digit numbers with any given failure mode (or even successful output) has no discernable pattern, merely whatever randomness the training process baked into the model.
You can't predict ahead of time when they will fail spectacularly, nor draw a clear boundary around the failure cases. And early major example of this were the "glitch tokens" introduced into most LLMs by training on reddit data.
But there is an "in general"/"average failure rate across all inputs of a given size" answer: LLMs performance drops off a cliff once the input reaches too much complexity. (A "┐" shaped curve) In contrast to humans, where you can ask a child to add two N-digit numbers and the error rate will be approximately linear to N.
Most humans struggle to compute 10 digit stuff. They use tools instead. Can LLM learn to use calculator? Sorry if that is a stupid question. Maybe brains are not well suited for calculations natively.
Depends on how the transformer has been trained. If it has seen 11 digit examples while training it might work, else the input will be out of distribution and it will respond with a nonsensical number.
For instance the current high score model (311 params [0]), when given 12345678900 + 1, responds with 96913456789.
An interesting experiment would be: what's the minimum number of parameters required to handle unbounded addition (without offloading it to tool calls).
Of course memory constraints would preclude such an experiment. And so a sensible proxy would be: what kind of neural-net architecture and training would allow a model to handle numbers lengths it hasn't been trained on. I suspect, this may be not be possible.