https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):
- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving and you don't compare the score against a human average but against the second best human solution
- The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning if a human took 10 steps to solve it and the model 100 steps then the model gets a score of 1% ((10/100)^2)
- 100% just means that all levels are solvable. The 1% number uses uses completely different and extremely skewed scoring based on the 2nd best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's just assume that the median human solves about 60% of puzzles (ik not quite right). If the median human takes 1.5x more steps than your 2nd fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom 10% guy, who maybe solves 30% of levels, but they take 3x more steps to solve it. this guy would get a score of 3%
- The scoring is designed so that even if AI performs on a human level it will score below 100%
- No harness at all and very simplistic prompt
- Models can't use more than 5X the steps that a human used
- Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"
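A quick sketch of the per-level math being described (hypothetical numbers; assuming each level scores (baseline actions / your actions)^2 against the 2nd-best human count, capped at 1, with unsolved levels scoring 0):

    # Rough sketch of the per-level scoring critique above, hypothetical numbers only.
    # Assumption: score = (baseline_actions / actions_taken)^2, capped at 1.0,
    # where baseline = 2nd-best human action count and unsolved levels score 0.

    def level_score(solved: bool, baseline_actions: int, actions_taken: int) -> float:
        if not solved:
            return 0.0
        return min(1.0, (baseline_actions / actions_taken) ** 2)

    # Model taking 100 actions where the baseline human took 10 -> 1%
    print(level_score(True, 10, 100))        # 0.01

    # Hypothetical median human: solves ~60% of levels at 1.5x the baseline actions
    print(0.6 * level_score(True, 10, 15))   # ~0.267

    # Hypothetical bottom-10% human: solves ~30% of levels at 3x the baseline actions
    print(0.3 * level_score(True, 10, 30))   # ~0.033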
Those are supposed to be issues? After reading your list my impression of ARC-AGI has gone up rather than down. All of those things seem like the right way to go about this.
No, those aren't issues. But it's good to know the meaning of those numbers we get. For example, 25% is about the average human level (on this category of problems). 100% is either top human level or superhuman level or the information-theoretically optimal level.
Sure, but aim for the stars and you hit the moon, right? Like, fundamentally, who cares? For the purpose of an AGI benchmark I'd argue you'd rather err on the side of something being more intelligent and counting it as less intelligent than vice versa.
Yeah I'm quite surprised as to how all of those are supposed to be considered problems. They all make sense to me if we're trying to judge whether these tools are AGI, no?
I think that any logic-based test that your average human can "fail" (aka, score below 50%) is not exactly testing for whether something is AGI or not. Though I suppose it depends on your definition of AGI (and whether all humans, or at least your average human, is considered AGI under that definition).
If I had a puzzle I really needed solved, then I would not ask a rando on the street, I would ask someone I know is really good at puzzles.
My point is: For AGI to be useful, it really should be able to perform at the top 10% or better level for as many professions as possible (ideally all of them).
An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.
> An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.
Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.
It seems they don't test for that, since they use the second-best human solution as a baseline.
And that's the right way to go. When computers were about to become superhuman at chess, few people cared that they could beat random people for many years prior to that. They cared when Kasparov was dethroned.
Remember, the point here is marketing as well as science. And the results speak for themselves. After all, you remember Deep Blue, and not the many runners-up that tried. The only reason you remember is because it beat Kasparov.
> The only reason you remember is because it beat Kasparov
There is an additional fascinating aspect to these matches, in that Kasparov obviously knew he was facing a computer, and decided to play a number of sub-optimal openings because he hoped they might confound the computer's opening book.
It's not at all clear Deep Blue would have eked out the rematch victory had Kasparov respected it as an opponent, in the way he did various human grandmasters at the time.
This is supposed to test for AGI, not ASI. ARC-AGI (later labelled "1") was supposed to detect AGI with a test that is easy for humans, not top humans.
> Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.
Humans without a clinically recognized mental disability are generally capable of some kind of skilled labor. The "general" part of intelligence is independent of, but sufficient for, any such special application.
- [deleted]
The issue here is that people have different definitions of AGI. Going by the description, getting 100% on this benchmark would be more than AGI and would qualify as ASI (Artificial Superintelligence), not just AGI.
If you only outdo humans 50% of the time you're never going to get consensus on if you've qualified. Whereas outdoing 90% of humans on 90% of all the most difficult tasks we could come up with is going to be difficult to argue against.
This benchmark is only one such task. After this one there's still the rest of that 90% to go.
Beating humans isn't anywhere near sufficient to qualify as ASI. That's an entirely different league with criteria that are even more vague.
Even dumb humans are considered to have general intelligence. If the bar is having to outdo the median human, then 50% of humans don't have general intelligence.
Not true. We don't have a good definition for intelligence - it's very much an "I'll know it when I see it" sort of thing.
Frontier models are reliably providing high undergraduate to low graduate level customized explanations of highly technical topics at this point. Yet I regularly catch them making errors that a human never would and which betray a fatal lack of any sort of mental model. What are we supposed to make of that?
It's an exceedingly weird situation we find ourselves in. These models can provide useful assistance to literal mathematicians yet simultaneously show clear evidence of lacking some sort of reasoning the details of which I find difficult to articulate. They also can't learn on the job whatsoever. Is that intelligence? Probably. But is it general? I don't think so, at least not in the sense that "AGI" implies to me.
Once humanity runs out of examples that reliably trip them up, I'll agree that they're "general" to the same extent that humans are, regardless of whether we've figured out the secrets behind things such as cohesive world models, self-awareness, active learning during operation, and theory of mind.
> Not true.
It's certainly true. By definition. If the bar for general intelligence is being smarter than the median human, 50% of people won't reach the threshold for general intelligence. (And if the bar is beating the median in every cognitive test, then a much smaller fraction of people would qualify.)
People don't have a consistent definition of AGI, and the definitions have changed over the past couple years, but I think most people have settled on it meaning at least as smart as humans in every cognitive area. But that has to be compared to dumb people, not median. We don't want to say that regular people don't have general intelligence.
You are using terms like "smart" and "dumb" as if they have universally-accepted definitions. You can make up as many definitions of intelligence as you like (I would argue that is a sign of intelligence) but using those terms is certainly going to lead to circular reasoning.
> Yet I regularly catch them making errors that a human never would
I have yet to see a "error" that modern frontier models make that I could not imagine a human making - average humans are way more error prone than the kind of person who posts here thinks, because the social sorting effects of intelligence are so strong you almost never actually interact with people more than a half standard deviation away. (The one exception is errors in spatial reasoning with things humans are intimately familiar with - for example, clothing - because LLMs live in literary space, not physics space, and only know about these things secondhand)
> and which betray a fatal lack of any sort of mental model.
This has not been a remotely credible claim for at least the past six months, and it seemed obviously untrue for probably a year before then. They clearly do have a mental model of things, it's just not one that maps cleanly to the model of a human who lives in 3D space. In fact, their model of how humans interact is so good that you forget that you're talking to something that has to infer rather than intuit how the physical world works, and then attribute failures of that model to not having one.
> you almost never actually interact with people more than a half standard deviation away
I wasn't talking about the average person there but rather those who could also craft the high undergrad to low grad level explanations I referred to.
> This has not been a remotely credible claim for at least the past six months
Well, it's happened to me within the past six months (actually within the past month), so I don't know what you want from me. I wasn't claiming that they never exhibit evidence of a mental model (can't prove a negative anyhow). There are cases where they have rendered a detailed explanation to me, yet it contained mistakes that you simply could not make if you had a working mental model of the subject matching the level of the explanation provided (IMO, obviously). Imagine a toddler spewing a quantum mechanics textbook at you but then uttering something completely absurd that reveals an inherent lack of understanding; not a minor slip-up but a fundamental lack of comprehension. Like I said, it's really weird and I'm not sure what to make of it nor how to properly articulate the details.
I'm aware it's not a rigorous claim. I have no idea how you'd go about characterizing the phenomenon.
How much of this is expectation-setting by the heights models reach? I.e., if we could assess a consistent floor of model performance in a vacuum, would we say it's better at "AGI" than the bottom 0.1% of humans?
I think you are getting caught up on the intelligence part. That is the easy part, since AGI doesn't have to be intelligent, it just has to be intelligence. If you look at early chess AIs, you will see that they were very weak compared to even a beginner human. The level of intelligence does not matter for a chess bot to be considered AI. It is that it is emulating intelligence that makes it AI.
> But is it general? I don't think so
I would consider it general, since I can take any problem I can think of and the AI will make an attempt to solve it. Actually solving it is not a requirement for AGI. Being able to solve it just makes it smarter than an AGI that can't. You can trip up chess AIs, but that doesn't stop them from being AI. So why apply that standard to AGI?
How am I getting caught up on it? I acknowledged that I think frontier models qualify as intelligent but disputed the "general" part. In fact for quite a few years now there have been many non-frontier models that I also consider intelligent within a very narrow domain.
I think Stockfish reasonably qualifies as superhuman AI but not even remotely "general". Similarly AlphaFold.
> Actually solving it is not a requirement for AGI.
I think I see what you're trying to get at but taken as worded that can't possibly be right. Otherwise a dumb-as-a-brick automaton that made an "attempt" to tackle whatever you put in front of it would qualify as AGI.
> Otherwise a dumb-as-a-brick automaton that made an "attempt" to tackle whatever you put in front of it would qualify as AGI.
I would agree as long as there is a general mechanism to represent problems. It is AGI, but would perform poorly on benchmarks compared to better AGI.
I’d be hesitant to call that ASI if it’s pretty obvious how you’d write a regular old program to solve it.
It's not that simple, since each problem is supposed to be distinct and different enough that no single program can solve multiple of them properly. No problem spec is provided either, if I understand correctly, so you can't simply ask an LLM to generate code without doing other things.
A human can sit down to play a game with unknown rules and write a spec as he goes. If a model can't even figure out to attempt that, let alone succeed at it, then it most certainly isn't an example of "general" intelligence.
> A human can sit down to play a game with unknown rules and write a spec as he goes.
Some humans can. Many, if not most humans cannot. A significant enough fraction of humans have trouble putting together Ikea furniture that there are memes about its difficulty. You're vastly overestimating the capabilities of the average human. Working in tech puts you in probably the top ~1-5% of capability to intuit and understand rules, but it distorts your intuition of what a "reasonable" baseline for that is.
Yes, I am aware. However an idealized human can do so. Analogously, there are plenty of humans that can't run an 8 minute mile but if your bipedal robot is physically incapable of ever doing that then it isn't reasonable to claim having achieved human level athletic performance. When it can compete in every Olympic event you can claim human level performance at athletics in general.
If the model can't generalize to arbitrary tasks on its own without any assistance then it doesn't qualify as a general intelligence. AGI to my mind means meeting or exceeding idealized human performance on the vast majority of arbitrary tasks that are cherrypicked to be particularly challenging.
It's not obvious at all. And I would say pretty much impossible without using machine learning. Even for ARC-AGI-1 there is no GOFAI program that scores high.
- [deleted]
People are still debating whether these models exhibit any kind of intelligence and any kind of thinking. Setting the bar higher than necessary is welcome, but at this point I'm pretty sure everyone's opinions are set in stone.
There's a single true definition of AGI: open the Wikipedia page about AGI, but through an archive.org snapshot from 10 years ago.
All the rest is bullshit made up by LLM labs to make it seem like they hit AGI by dumbing down its definition.
https://web.archive.org/web/20150108000749/https://en.wikipe...
In retrospect, it seems obvious that we hit AGI by a reasonable "at least as intelligent as some humans" definition when o3 came out, and everything since then has been goalpost moving by people who have higher and higher bars for which percentile human they would be willing to employ (or consider intellectually capable). People should really just use the term "ASI" when their definition of AGI excludes the majority of humans.
Edit: Here's the guy who coined the term saying we're already there. Everything else is arguing over definitions.
https://x.com/mgubrud/status/2036262415634153624
> Well, Lars, I INVENTED THE TERM and I say we have achieved AGI. Current models perform at roughly high-human level in command of language and general knowledge, but work thousands of times faster than us. Still some major deficiencies remain but they're falling fast.
> They all make sense to me if we're trying to judge whether these tools are AGI, no?
As long as the mean and median human scores are clearly communicated, the scoring is fine. I think the human scores above would surprise people at first glance, even if they make sense once you think about it, so there's an argument to be made that scores can be misleading.
"No harness at all" might be an issue, though, as these types of benchmarks are often gamed, and then models perform great on them without actually being better models.
They are severe problems if your income is tied to LLM hype generation.
We're at the point where LLMs and coding agents are supposed to do higher-level work. It makes sense to benchmark them against top human performance, rather than average human performance, because at specialized tasks, average human performance isn't enough.
The issues you described seem like they're actually strengths of the benchmark.
> No harness at all and very simplistic prompt
TBF, that's basically what the Kaggle competition is for. Take whatever they do, plug in a SotA LLM, and it should do better than whatever people can do with limited GPUs and open models.
Defining the baseline human is always a bit arbitrary. The median human is illiterate and also dead.
It actually makes sense. For any task it is completely trivial for anyone to become better than 80% of humans, and still easy to be better than 95%. The only problem is motivation, not intelligence.
If anything this makes the test much harder for the LLM to get high scores and that makes the scores they’re getting all that much more impressive.
The scores they're getting are on the order of 0-1% for this ARC-AGI-3 benchmark.
Francois here. The scoring metric design choices are detailed in the technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf - the metric is meant to discount brute-force attempts and to reward solving harder levels instead of the tutorial levels. The formula is inspired by the SPL metric from robotics navigation, it's pretty standard, not a brand new thing.
We tested ~500 humans over 90 minute sessions in SF, with $115-$140 show up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers.
Each game was seen by 10 people. They were fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second best action count, which is considerably less than an optimal first-play (even the #1 human action count is much less than optimal). It is very achievable, and most people on this board would significantly outperform it.
Try the games yourself if you want to get a sense of the difficulty.
> Models can't use more than 5X the steps that a human used
These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.
> No harness at all and very simplistic prompt
This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible."
...
"We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems.
...
"Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools."
If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can choose to bring out these tools.
Suppose you construct a Mechanical Turk AI who plays ARC-AGI-3 by, for each task, randomly selecting one of the human players who attempted it, and scoring them as an AI taking those same actions would be scored. What score does this Turk get? It must be <100% since sometimes the random human will take more steps than the second best, but without knowing whether it's 90% or 50% it's very hard for me to contextualize AI scores on this benchmark.
The people recruited weren’t experts. I can imagine it’s straightforward to find humans (such as those that play many video games) that can score >100% on this benchmark.
So, if you look at the way the scoring works, 100% is the max. For each task, you get full credit if you solve in a number of steps less than or equal to the baseline. If you solve it with more steps, you get points off. But each task is scored independently, and you can't "make up" for solving one slowly by solving another quickly.
Like suppose there were only two tasks, each with a baseline of solving in 100 steps. You come along and solve one in only 50 steps, and the other in 200 steps. You might hope that since you solved one twice as quickly as the baseline and the other twice as slowly, those would balance out and you'd get full credit. Instead, your scores are 1.0 for the first task and 0.25 (scoring is quadratic) for the second task, and your total benchmark score is a mere 0.625.
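A minimal sketch of that two-task example (assuming per-task score = min(1, baseline/steps)^2, averaged across tasks; the real formula in the technical report is SPL-inspired and weights levels, so treat this as illustrative only):

    def task_score(baseline_steps: int, steps_taken: int) -> float:
        # Full credit at or under the baseline, quadratic penalty above it
        return min(1.0, baseline_steps / steps_taken) ** 2

    tasks = [(100, 50), (100, 200)]    # (baseline, your steps) per task
    scores = [task_score(b, s) for b, s in tasks]
    print(scores)                      # [1.0, 0.25] -- no "making up" across tasks
    print(sum(scores) / len(scores))   # 0.625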
The purpose is to benchmark both generality and intelligence. "Making up for" a poor score on one test with an excellent score on another would be the opposite of generality. There's a ceiling based on how consistent the performance is across all tasks.
Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not.
I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.
I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard / mouse / screen like our human testers? ;-)
(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)
This counterpoint doesn't address the issue, and I would argue that it is partially bad faith.
Yes, making it to the test center is significantly harder, but in fact the humans could have solved it from their home PC instead, and performed the exact same. However, if they were given the same test as the LLMs, forbidden from input beyond JSON, they would have failed. And although buying robots to do the test is unfeasible, giving LLMs a screenshot is easy.
Without visual input for LLMs in a benchmark that humans are asked to solve visually, you are not comparing apples to apples. In fact, LLMs are given a different and significantly harder task, and in a benchmark that is so heavily weighted against the top human baseline, the benchmark starts to mean something extremely different. Essentially, if LLMs eventually match human performance on this benchmark, this will mean that they in fact exceed human performance by some unknown factor, seeing as human JSON performance is not measured.
Personally, this hugely decreased my enthusiasm for the benchmark. If your benchmark is to be a North star to AGI, labs should not be steered towards optimizing superhuman JSON parsing skills. It is much more interesting to steer them towards visual understanding, which is what will actually lead the models out into the world.
I just realized that this also means that the benchmark is in practice unverified by third parties, as the tasks are not verified to be solvable through the JSON interface. Essentially there is no guarantee that it is even possible to understand how to complete every task optimally through the JSON interface alone.
I assume you did not develop the puzzles by visualizing JSON yourselves, and so there might be non obvious information that is lost in translation to JSON. Until humans optimally solve all the puzzles without ever having seen the visual version, there is no guarantee that this is even possible to do.
I think the only viable solution here is to release a version of the benchmark with a vision only harness. Otherwise it is impossible to interpret what LLM progress on this benchmark actually means.
Well, yes, and would hand even more of an advantage to humans. My point is that designing a test around human advantages seems odd and orthogonal to measuring AGI.
The whole point of AGI is "general" intelligence, and for that intelligence to be broadly useful it needs to exist within the context of a human-centric world.
Does this mean blind people are not intelligent?
Blind people do function within the context of a human-centric world, though, so they would qualify as intelligent.
Yes, but they use various "harnesses" to do so (guide dogs, text-to-speech software, assistance of other humans when needed...). Why can't AI?
Then why deny it a harness it can also use in a human centric world?
There is no general purpose harness.
General intelligence does not entail owning retinas.
Denying a proper eyesight harness is like trying to construct a speech-to-text model that makes transcripts from air pressure values measured 16k times per second, while the human ear does frequency-power measurement and frequency binning due to its physical construction.
The human testers were provided with their customary inputs, as were the LLMs. I don't see the issue.
I guess it could be interesting to provide alternative versions that made available various representations of the same data. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.
The issue is that ARC AGI 3 specifically forbids harnesses that humans get to use.
- [deleted]
So what? Are you suggesting that an agent exhibiting genuine AGI will be tripped up by having to ingest json rather than rgb pixels? LLMs are largely trained on textual data so json is going to be much closer to whatever native is for them.
But by all means, give the agents access to an API that returns pixel data. However I fully expect that would reduce performance rather than increase it.
Because it is. Opus 4.6 jumps from 0.0% to 97.1% when given visual input.
That's impressive. I'm also a bit surprised - I wouldn't have expected it to be trained much at all on that sort of visual input task. I think I'd be similarly surprised to learn that a frontier model was particularly good at playing retro videogames or actuating a robot for example.
However, if it can't figure out to render the json to a visual on its own does it really qualify as AGI? I'd still say the benchmark is doing its job here. Granted it's not a perfectly even playing field in that case but I think the goal is to test for progress towards AGI as opposed to hosting a fair tournament.
> However, if it can't figure out to render the json to a visual on its own does it really qualify as AGI? I'd still say the benchmark is doing its job here.
Can you render a serialized JSON text blob to a visual with your brain only? The model can't do anything better than this - no harness means no tools at all, no way to e.g. implement a visualizer in whatever programming language and run it.
Why don't human testers receive the same JSON text blob and no visualizer? It's like giving human testers a harness (a playable visualizer) but deliberately crippling it for the model.
Huh. I thought it wasn't supposed to receive any instructions tailored to the task but I didn't understand it to be restricted from accessing truly general tools such as programming languages. To do otherwise is to require pointless hoop jumping as frontier models inevitably get retrained to play games using a json (or other arbitrary) representation at which point it will be natural for them and the real test will begin.
This is my understanding as well; I thought tools were allowed.
Source? I haven't seen anything like that for ARC-AGI performance.
Also, if it makes that big of a difference, then make a renderer for your agent that looks like the web page, have it solve the puzzles in the graphical interface, and funnel the results to the API. I guarantee you won't get better performance, because the AGI is going to have to "understand" that the raw data can be represented as a 2D matrix regardless of whether it gets a 2D matrix of pixels or a 2D matrix of enumerated values in JSON. If anything, that makes it a more difficult problem for an AI system that "speaks" in tokens.
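Something like this would be the renderer intermediary (untested sketch; the frame format and color palette are assumptions, treating a frame as a JSON 2D grid of small ints):

    import json
    from PIL import Image

    # Hypothetical palette; the real cell-value -> color mapping would come from the game UI.
    PALETTE = [(0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64), (255, 220, 0),
               (170, 170, 170), (240, 18, 190), (255, 133, 27), (127, 219, 255), (135, 12, 37)]

    def render_frame(frame_json: str, cell_px: int = 16) -> Image.Image:
        grid = json.loads(frame_json)                 # assumed: 2D list of small ints
        h, w = len(grid), len(grid[0])
        img = Image.new("RGB", (w * cell_px, h * cell_px))
        for y, row in enumerate(grid):
            for x, value in enumerate(row):
                color = PALETTE[value % len(PALETTE)]
                # paint one cell as a cell_px x cell_px block
                img.paste(color, (x * cell_px, y * cell_px, (x + 1) * cell_px, (y + 1) * cell_px))
        return img

    render_frame('[[0, 1, 2], [3, 4, 5]]').save("frame.png")

Feed the resulting image to the model's vision input instead of the raw JSON and see whether the score actually moves.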
That score is in the arc technical paper [1]. It's the full benchmark score using this harness [2] (which is just open code with read, grep, bash tools).
This is already a solved benchmark. That's why the scoring is so convoluted and a self-proclaimed agent benchmark won't allow basic agent tools. ARC has always been a bit of a nothingburger of a benchmark, but this takes the cake.
[1] https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
> For example, in a variant of environment TR87, Opus 4.6 scores 0.0% with no harness and 97.1% with the Duke harness (12), yet in environment BP35, Opus 4.6 scores 0.0% under both configuration
This is with a harness that has been designed to tackle "a small set of public environments: ls20, ft09, and vc33" (of the arc-agi-3 challenge), yet it looks like it does not solve the full arc-agi-3 benchmark, just some of them.
The harness was designed with the preview, but no it was still tested on the full public set in that environment. You can run the benchmark in different 'environments' though it's unclear what the difference between them is.
>We then tested the harnesses on the full public set (which researchers did not have access to at the time)
My sense is that a powerful enough AI would have the sense to think something like "ah, this sounds like a video game! Let me code up an interactive GUI, test it for myself, then use it to solve these puzzles..." and essentially self-harness (the way you would if you were reading a geometry problem, by drawing it out on paper).
Yeah, but that's literally above ASI, let alone AGI. The average human scores <1% on this bench; Opus scores 97.1% when given actual vision access, which means AGI was achieved long ago.
> opus scores 97.1% when given an actual vision access
Do you have a source for this? I would be very curious to see how top models do with vision.
No, there is no source for this. Opus is scoring around 1%, just like all the other frontier models. It would be fairly trivial to add a renderer intermediary, and if it improved scores to 97+%... then you would get a huge cut of the $2 million. The assertion that Opus gets 97% if you just give it a GUI is completely bogus.
- [deleted]
I tried ls20 and it was surprisingly fun! Just from a game design POV, these are very well made.
Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt).
Agree 100%. I want to be able to see how many actions it took me. And it would be good if it were possible to see how well I'm doing compared to other humans, i.e. what is my percentile.
While I think all of your design choices are defensible, I do think you should release the full human baseline data. The second best action count is fine, but other choices are reasonable as well.
> If a harness is needed, it can make its own. If tools are needed, it can choose to bring out these tools.
If I understand correctly, the model can carry only very limited memory between tests, so it looks like it's not really possible for the model to specialize itself under these assumptions.
There's a very simple solution to this problem here. Instead of wink-wink-nudge-nudge implying that 100% is 'human baseline', calculate the median human score from the data you already have and put it on that chart.
It's below 1%, lmao
where did you get this 1%?
Something that I don't understand after reading the technical report is: Why is having access to a python interpreter as part of the harness not allowed (like the Duke harness), but using one hidden behind the model API (as a built-in tool) considered kosher?
The Duke harness was specifically designed for these puzzles, that's why they don't want to measure it.
My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox"), is that there's no way to prevent it.
But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not ARC-AGI specific. In that case, the models should be benchmarked by prompting through Claude Code and Codex, rather than through the API (as from the API we only expect raw LLM output, and no tool use).
OpenAI does have Python execution behind its general-purpose API, but it has to be enabled with a flag, so I don't think it was used.
Don't you see the massive problem with requiring visual input? Are blind people not intelligent because they cannot solve ARC-AGI-3 without a "harness"?
A theoretical text-only superintelligent LLM could prove the Riemann hypothesis but fail ARC-AGI-3 and won't even be AGI according to this benchmark...
Think of it as spatial input, not visual. Blind people do have spatial inputs, and high spatial intelligence.
Well, it would be AGI if you could connect a camera to it to solve it, similar to how blind people would be able to solve it if you restored their eyesight. But if the lack of vision is a fundamental limitation of their architecture, then it seems more fair not to call them AGI.
People blind from birth literally lack the neural circuits to comprehend visual data. Are they not intelligent?
I think I can confidently say they are not visually intelligent at all.
If you were phrasing things to quantify intelligence, you would have a visual intelligence pillar. And they would not pass that pillar. It doesn't make them dysfunctional or stupid, but visual intelligence is a key part of human intelligence.
Visual intelligence is a near-meaningless term, as it's almost entirely dependent on spatial intelligence. The visually impaired do have high spatial intelligence; I wouldn't be surprised if their spatial intelligence is actually higher on average than that of people without visual impairment.
I think they don't actually lack them, or lack only a small fraction (their brains are ≈99% like a normal human brain), such that if they were an AI model, they could be fairly trivially upgraded with vision capability.
Maybe this is a neither-confirm-nor-deny thing, but are there systems in place or design decisions made that are meant to surface attempts at benchmark optimization (benchmaxxing), outside of just having private sets? Something like a heuristic anti-cheat, I suppose.
Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain.
There are no tricks. Our approach to reducing the impact of targeting (without fully eliminating it) is described in the paper.
Are you prompting the models through their APIs, which are not designed to use tools or harnesses? Or do the "system prompt" results come from prompting into the applications (i.e. claude code, or codex, or even the web front-ends)?
Off topic, but I have been following your Twitter for a while and your posts specifically about the nature of intelligence have been a great read.
New benchmark idea: 20 questions of guess-the-number (1-10), with different answers. We run this on 10,000 humans and take the best score. Then we take 50 AI attempts, but take the worst attempt for "worst-case-scenario robustness" or so. We also discard questions where the human failed but the AI passed because uhhh reasons... Then we also take the final relative score to the power of 100 so that the benchmark punishes bad answers or sum. Good benchmark?
This is a gross misrepresentation of the scoring process.
"Very simplistic prompt" is the absolute and total core of this and the thing that ensures validity of the whole exercise.
If you are trying to measure GENERAL intelligence then it needs to be general.
Like the other ARC-AGI challenges, it was never necessary to reach 100% to be at human level. The benchmark score is stretched out so that the benchmark takes more time to saturate, that's it.
The current SotA models are still very far from your hypothetical “average human” with a score of 3%. So the benchmark is indeed useful to help the field progress (which is the entire point of ARC-AGI benchmarks).
Lol, basically we're saying AI isn't AI if we utilize the strength of computers (being able to compute). There's no reason why AGI should have to be as "sample efficient" as humans if it can achieve the same result in less time.
Let's say an agent needs to do 10 brain surgeries on a human to remove a tumor and a human doctor can do it in a single surgery. I would prefer the human.
"steps" are important to optimize if they have negative externalities.
It's kind of the point? To test AI where it's weak instead of where it's strong.
"Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.
ARC has always had that problem but for this round, the score is just too convoluted to be meaningful. I want to know how well the models can solve the problem. I may want to know how 'efficient' they are, but really I don't care if they're solving it in reasonable clock time and/or cost. I certainly do not want them jumbled into one messy convoluted score.
'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it unlike the above 2 but it's just incredibly silly to me to think we should be directly comparing something like that with entities operating in wildly different substrates.
If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from 'solving only a tiny fraction of problems' to "solving everything correctly but with more 'reasoning steps' than the best human scores." Literally wildly different implications. What use is a score like that?
The measurement metric is in-game steps. Unlimited reasoning between steps is fine.
This makes sense to me. Most actions have some cost associated, and as another poster stated it's not interesting to let models brute-force a solution with millions of steps.
Same thing in this case. No utility, and just as arbitrary. None of the issues with the score change.
Models do not brute force solutions in that manner. If they did, we'd wait the lifetimes of several universes before we could expect a significant result.
Regardless, since there's a 5x step cutoff, 'brute forcing with millions of steps' was never on the table.
The metric is very similar to cost. It seems odd to justify one and not the other.
Cost has utility in the real world and this doesn't. That's the only reason I would tolerate thinking about cost, and even then, I would never bundle it into the same score as the intelligence, because that's just silly.
It's an interesting point but I too find it questionable. Humans operate differently than machines. We don't design CPU benchmarks around how humans would approach a given computation. It's not entirely obvious why we would do it here (but it might still be a good idea, I am curious).
I think your logic isn't sound: wouldn't we want an "intelligence" to solve problems efficiently rather than brute-force with a million monkeys? There's definitely a limit to compute, the same way there's a limit to how much oil we can use, etc.
In theory, sure, if I can throw a million monkeys and ramble my way into a problem solution, it doesn't matter how I got there. In practice, though, every attempt has a direct and indirect impact on the externalities. You can argue those externalities are minor, but the largesse of money going to data centers suggests otherwise.
Lastly, humans use way less energy to solve these in fewer steps, so of course it matters when you throw kilowatts at something that takes milliwatts to solve.
> Lastly, humans use way less energy to solve these in fewer steps,
Not if you count all the energy that was necessary to feed, shelter, and keep the human at his preferred temperature so that he can sit in front of a computer and solve the problem.
OK, but that's the same for building a data center.
Try again.
Yes, especially when considering that a datacenter needed the energy of a great many people to be built.
A single human is indeed more efficient, way more flexible, and actually just general intelligence.
Oh, and who provided the 'food' for the models?
...
People who write stuff like the poster above you... are bizarro. Absolutely bizarro. Did the LLM manifest itself into existence? Wtf.
Edit: just got confirmation of the bizarro-ness after looking at his YouTube.