blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...
edit: biggest benchmark changes from 3 pro:
arc-agi-2 score went from 31.1% -> 77.1%
apex-agents score went from 18.4% -> 33.5%
Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though i dont know what arc-agi-2 actually tests
Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspect of such a large improvement, especially since the improvement on other benchmarks is not of the same order.
Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.
Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?
I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?
The fact that ARC-AGI has public and semi-private in addition to private datasets might explain it: https://arcprize.org/arc-agi/2/#dataset-structure
I assume all the frontier models are benchmaxxing, so it would make sense
Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.
The touted SVG improvements make me excited for animated pelicans.
I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj
The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but this is definitely a significant leap in complexity.
Looks great!
Here's what I got from Gemini Pro on gemini.google.com, it thought for under a minute...might you have been using AI studio? https://jsbin.com/zopekaquga/edit?html,output
It does say 3.1 in the Pro dropdown box in the message sending component.
The blog post includes a video showcasing the improvements. Looks really impressive: https://blog.google/innovation-and-ai/models-and-research/ge...
I imagine they're also benchgooning on SVG generation
My perennial joke is as soon as that got on HN front page Google went and hired some interns and they spend a 100% of the time on pelicans.
SVG is an under-rated use case for LLMs because it gives you the scalability of vector graphics along with CSS-style interactivity (hover effects, animations, transitions, etc.).
How about STL files for 3d printing pelicans!