For those claiming they rigged it: do you have any concrete evidence? What if the models have just gotten really good?
I just asked Gemini Pro to generate an SVG of an octopus dunking a basketball and it did a great job. Not even the Deep Think model. Then I did "generate an svg of raccoon at a beach drinking a beer". You can go try this out yourself. Ask it to generate anything you want in SVG. Use your imagination.
Rant: This is why AI is going to take over; folks are not even trying in the least.
> What if the models have just gotten really good?
Kagi Assistant remains my main way of interacting with AI. One of its benefits is you're encouraged to try different models.
The heterogeneity in competence, particularly per unit of time, is growing rapidly. If I'm extrapolating image-creation capabilities from Claude, I'm going to underestimate what Gemini can do without fuckery. Likewise, if I'm using Grok all day, Gemini and Claude will seem unbelievably competent when it comes to deep research.
Every bit of improvement in AI ability will have its corresponding denial phrase. Some people still think AI can't generate the correct number of fingers today.
Why frame it as rigging? I assume they would teach the models to improve on tasks the public finds interesting. Then we just have to come up with more challenges for them.
Simon has a private set of SVG tests he uses as well. He said that the private ones were just as impressive.
> For those claiming they rigged it.
I don't think they "rigged" it, but that part might have been given a bit more of a push, since it's been going on for a very long time now.
Another benchmark is going on at [0]. It's pretty interesting. A perfect-scoring model "borks" in the next iteration, for example.
> Rant: This is why AI is going to take over; folks are not even trying in the least.
It might be drawing things all right, at least in some cases. I seldom use it, and only when my hours-long research doesn't take me to the place I want, and guess what? AI can't go there, either. It hallucinates things, makes up stuff, etc. For a couple of things I asked, it managed to find a single reference, and it was the thing I was looking for, so it rarely works in my case.
Rant: This is why people are delusional. They test the happy path and claim it knows all the paths, and then some.
and it will be folks using AI taking over for at least a while...
Some people try, most people don't.
AI makes doing almost anything easier for the people who do.
Despite the prophesied near-term obliteration of white collar work, I've never felt luckier to work in software.
Everyone should have their own private evals for models. If I ask a question and a model flat-out gets it wrong, I will sometimes put it in my test question bank.
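For anyone who wants to try this, here is a minimal sketch of what such a question bank could look like. Everything in it is an assumption for illustration: the JSON file layout, the names `add_question` and `run_bank`, and especially `ask`, which stands in for whatever function you use to send a prompt to a model and get text back. The grading is a crude substring check, not a real evaluation method.

```python
import json
from pathlib import Path
from typing import Callable


def load_bank(path: Path) -> list[dict]:
    """Return the saved questions, or an empty list if the bank does not exist yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def add_question(path: Path, prompt: str, expected: str) -> None:
    """Save a question a model got wrong, plus a string a correct answer must contain."""
    bank = load_bank(path)
    bank.append({"prompt": prompt, "expected": expected})
    path.write_text(json.dumps(bank, indent=2), encoding="utf-8")


def run_bank(path: Path, ask: Callable[[str], str]) -> list[str]:
    """Ask every saved question and return the prompts the model still gets wrong.

    `ask` is a hypothetical callable (prompt in, answer text out); plug in
    your own client for whichever model you are testing.
    """
    failures = []
    for item in load_bank(path):
        answer = ask(item["prompt"])
        # Crude grading (an assumption): case-insensitive substring match.
        if item["expected"].lower() not in answer.lower():
            failures.append(item["prompt"])
    return failures
```

Keeping the file local and out of public repos is the point: questions that never get published are much harder for anyone to train against. For things like SVG drawings, where there is no expected string, you would have to eyeball the output instead.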