Exploiting the most prominent AI agent benchmarks

rdi.berkeley.edu

・

579 points

・

Anon84

・

3 days ago

170 comments

ggillas ・ 3 days ago

This is a phenomenal paper on exploits and hopefully changes the way benchmarking is done.

From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.

InkCanon ・ 2 days ago

I strongly disagree with the claim that it's a phenomenal paper on exploits, the exploits themselves are nowhere near significant in the cybersecurity research sense. It's saying that implementations of these benchmarks has exploits on the way they conduct their tests. It doesn't discover that current LLMs are doing it (they highlighted several other exploits in the past), they only say it's a possible way they could cheat. It's a bit like they've discovered how to hack your codeforces score.
What they claim as exploits is also deeply baffling. Like the one where they say if you exploit the system binaries to write a curl wrapper, you can download the answers. This is technically true, but it is an extremely trivial statement that if you have elevated system privileges, you can change the outputs of programs running on it.
I'm actually deeply confused about why this is a paper. This feels like it should be an issue on GitHub. If I were being blunt, I'd say they are trying really hard to make a grand claim about how benchmarks are bad, when all they've done is essentially discovered several misconfigured interfaces and website exploits.
- zero_k ・ 2 days ago
  
  Yes, agree. At the same time, it's what these top-tier universities are known for: presenting something relatively simple as if it was ground-breaking, but in a way that the average person can (or has a better chance to) understand it. I am still unsure whether the communication quality has such added value. But people seem to like it, so here we are.
  
  pxc ・ 2 days ago
  
  There's a difference between a reliable hunch and really knowing something. What is obvious is not always (or even usually) easy to prove. And the process of proving the obvious sometimes turns up useful little surprises.
  
  InkCanon ・ 2 days ago
  
  I do think there's value in science communication, but it does take an intelligent understanding of it on a case by case basis as to whether it's genuine or hype marketing.
  Side note: talking to someone from such a "elite" university, I discovered many labs in these unis have standing orders by PIs to tweet their papers/preprints when published. Varies by field, in AI it is by far the most common.
- phantomoc ・ 2 days ago
  
  [flagged]
SlinkyOnStairs ・ 3 days ago

> hopefully changes the way benchmarking is done
The purpose of a system is what it does.
AI companies want adcopy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"
- tedsanders ・ 2 days ago
  
  I work at OpenAI and I really don't find this to be the case.
  We're pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users, but we don't do this. My impression is that Anthropic and other labs are similar. E.g., in the Sonnet 4.6 system card they use a model to detect potential contamination and manually score those outputs as 0 if human review agrees there was contamination. If all the labs cared about was marketing material, it would be quite easy not to do this extra work.
  There are ton of other games you can play with evals too (e.g., test 100 different model checkpoints or run secret prompt optimization to steer away from failing behaviors), but by and large what I've seen inside OpenAI is trustworthy.
  I won't say everything is 100% guaranteed bulletproof, as we could always hire 100 more SWEs to improve hack detection systems and manually read outputs. Mistakes do happen, in both directions. Plus there's always going to be a bit of unavoidable multiple model testing bias that's hard to precisely adjust for. Also, there are legitimate gray areas like what to do if your model asks genuinely useful clarifying questions that the original reference implementation scores as 0s, despite there being no instruction that clarifying questions are forbidden. Like, if you tell a model not to ask clarifying questions is that cheating or is that patching the eval to better align it with user value?
  
  ssivark ・ a day ago
  
  > pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users
  Of course, but that's the difference between sins of commission and sins of omission. The question is what "pretty diligent" actually translates to in practice. How many people will encourage delays in a model release or post-training improvement waiting "for more thorough evaluation"? How many popularized AI results can you vouch for on this?
  The zeitgeist is to celebrate bias for action, avoiding analysis paralysis and shipping things (esp. with conference driven research culture, even before we get into thorny questions of market dynamics), so even if we have a few pockets of meticulous excellence, the incentive structure pushes towards making the whole field rot.
  
  kommunicate ・ 17 hours ago
  
  I work at runloop and I've spent a considerable amount of time getting various benchmarks to run with very high concurrency (thousands at once). My experience is similar to your own: it takes a ton of time and effort setting up benchmarks to run at scale with protection against reward hacks.
  Keeping a benchmark test harness secure and fast is non-trivial. You need to keep the grading script and the solution off the box, use network controls, deal with external resource usage, etc. It's a lot of work. I don't think it's realistic to expect benchmark authors to bullet proof their benchmark runners. Most benchmarks are written to be run conveniently on a single machine (ie. in docker), not to run in parallel across tends of thousands of secure, isolated machines.
  
  Imustaskforhelp ・ 2 days ago
  ・ 4 more
  
  I remember the gpt-5 benchmarks and how wildly inaccurate they were data-wise. Linking one[0] that I found so that other people can remember what I am talking about. I remember some data being completely misleading or some reaching more than 100% (iirc)
  And this is something which has reached the public eye in one of the most anticipated videos basically. So I find it a bit rough as to think that OpenAI has the best practices for data, and if the public can be shown these inaccurate graphs themselves on based on benchmarks. I find it a bit harder to trust the benchmarks themselves and if OpenAI wants legitimate benchmarks.
  Also I find it wild that after 1 month of this, nobody talked about it. I remember thinking that this is gonna be the highlight for a long time that a mega billion dollar company did such basic graph errors. I feel like we are all forgetting a lot of things as our news cycle keeps on moving faster.
  (Another tangential point is about the OpenAI/Google employees who had signed the pledge yet nothing came out of it and this is something more recent & I also remember one of your comments on Hackernews.)
  > I'm an OpenAI employee and I'll go out on a limb with a public comment. I agree AI shouldn't be used for mass surveillance or autonomous weapons. I also think Anthropic has been treated terribly and has acted admirably. My understanding is that the OpenAI deal disallows domestic mass surveillance and autonomous weapons, and that OpenAI is asking for the same terms for other AI companies (so that we can continue competing on the basis of differing services and not differing scruples). Given this understanding, I don't see why I should quit. If it turns out that the deal is being misdescribed or that it won't be enforced, I can see why I should quit, but so far I haven't seen any evidence that's the case. [1]
  This is a bit off-topic so sorry about that, but I hope that you realize that you did say you will go out on a limb with public comment so please don't mind if I ask for some questions, everyone supported you then and heck, even I thought that maybe I was wrong and I thought that I should trust you more than my gut-instincts because you clearly must know so much more than me/us but that aged like fine milk.
  I would really love some answers or your thoughts now on that off-topic thought as well if possible as these are just some questions which are unanswered by you and I would love to have a respectful discussion about it, sorry for catching you off guard, waiting for your reply and I wish you to have a nice day ted.
  [0]: https://www.reddit.com/r/BetterOffline/comments/1mk6ofz/gpt5...
  [1]: https://news.ycombinator.com/item?id=47191196
  
  tedsanders ・ 2 days ago
  ・ 3 more
  
  > I remember the gpt-5 benchmarks and how wildly inaccurate they were data-wise. Linking one[0] that I found so that other people can remember what I am talking about. I remember some data being completely misleading or some reaching more than 100% (iirc)
  Yeah, I found that slide very embarrassing. It wasn't intentionally inaccurate or misleading - just a design error made right before we went live. All the numbers on that slide were correct, and there was no problem in terms of research accuracy or data handling or reward hacking. A single bar height had the wrong value, set to its neighbor. Back then, we in the research team would generate data and graphs, and then hand them off to a separate design team, who remade the graphs in our brand style. After the GPT-5 launch with multiple embarrassingly bad graphs, I wrote an internal library so that researchers could generate graphs in our brand style directly, without the handoff. Since then our graphs have been much better.
  I don't think it's unfair to assume our sloppiness in graphs translates to sloppiness in eval results. But they are different groups of people working on different timelines, so I hope it's at least plausible that our numbers are pretty honest, even if our design process occasionally results in sloppy graphs.
  Regarding the DoW deal, I don't want to comment too publicly. I also can't say anything with confidence, as I wasn't part of the deal in any way shape or form. My perception from what I have read and heard is that both Anthropic and OpenAI have good intentions, both have loosened their prior policies over time to allow usage by the US military, and both have red lines to prohibit abuse by the US military. One place they differ is in the mechanisms employed to enforce those red lines (e.g. usage policies vs refusals vs human oversight). Each company asserts their methods are stronger than the other's, so I think we have to make our own judgments there. Accounts from the parties involved in the negotiations also conflict, so I don't think anyone's account can be trusted 100%. With that caveat, I thought this article on the DoW's POV was interesting (seems to support the notion that the breakdown wasn't over differing red lines, especially since they almost managed to salvage the deal): https://www.piratewires.com/p/inside-pentagon-anthropic-deal...
  Lastly, I hope it's obvious to everyone that Anthropic is not at all a supply chain risk and the threats there were incredibly disappointing. I support them 100% and I'm glad to see them unhurt by the empty threats.
  
  curioussquirrel ・ 2 days ago
  
  Thank you for the transparency and insights! Very helpful.
  We actually did the same thing re generating charts in brand style to avoid any mishaps, since then I sleep much better
  
  _blk ・ 2 days ago
  
  This is what makes HN great: We get to hear from the people and not (only) the media dept. Thanks for your honesty and openness. I trust OpenAI a lot more when I hear balanced accounts like this.
  
  idrdex ・ 2 days ago
  
  [dead]
- Legend2440 ・ 2 days ago
  
  >The purpose of a system is what it does.
  I am so tired of this saying.
  It's not true, in general. Systems almost universally have unintended consequences and result in side effects their designers did not foresee.
  Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench.
  
  burpingtree ・ 2 days ago
  ・ 13 more
  
  https://en.wikipedia.org/wiki/The_purpose_of_a_system_is_wha...
  You are misunderstanding the saying. It is entirely about unintended consequences and viewing the system for what it actually does and not any stated intentions of the designers.
  
  aidenn0 ・ 2 days ago
  ・ 5 more
  
  I will propose that you are wrong.
  1. We must ignore the intentions of the designers (your claim), and instead see what the outcomes are
  2. Therefore we should ignore Beer's intentions when designing the phrase POSWID, and instead see how it is used.
  3. The overwhelming majority of people using it on the internet (including the GP comment) is to imply that the people perpetuating the system actually desire the outcome.
  So the purpose of POSWID is clearly to imply intent.
  
  cjbgkagh ・ 2 days ago
  ・ 2 more
  
  Whose intent? POSWID Is about structural incentives not personal intent, and these can be, and likely are, an emergent behavior. It’s about reframing away from intents, treating the system as a structure and removing the whole structure for replacement. As opposed to localized reforms which are exposed to the same prior emergent behaviors leading to constant backsliding.
  
  aidenn0 ・ 2 days ago
  
  > Whose intent?
  The intent of those creating or perpetuating a system.
  
  Forgeties79 ・ 2 days ago
  ・ 2 more
  
  There are plenty of cases where you absolutely can/should discuss outcomes in a way where the intention is not factored in because it can often be straight up irrelevant.
  If a gun is developed with the intention of hunting only bears and someone uses it to shoot people, you don’t have to constantly preface things by talking about how it’s supposed to be used only on bears. Sometimes that fact, depending on the context of the conversation, is simply not relevant.
  To cover my bases here: yes it often is relevant and maybe even critical info, but it often isn’t either of those things.
  
  aidenn0 ・ 7 hours ago
  
  I agree with the idea that intent is often irrelevant. I disagree that POSIWID is a good way to communicate that idea.
  
  jimbokun ・ 2 days ago
  ・ 7 more
  
  Well that’s stupid and completely ignores the meaning of the word “purpose”.
  
  delusional ・ 2 days ago
  ・ 5 more
  
  It does not ignore the word. It subverts it, and that's the point. It's the system equivalent of "death of the author", which states that omes a work is written, the authors intent loses relevance and the work must be examined on its own. The aurhors opinion or relationship to the work carries no more weight than any other persons.
  That's not "true" in any demonstrable sense, but it can be a useful form of analysis. As it is with "purpose of a system"
  
  aidenn0 ・ 2 days ago
  ・ 3 more
  
  This is not how people outside of cybernetics use POSWID. From context it does not appear to be how SlinkyOnStairs was using it either.
  I think it's also trying to be too cute. The first two definitions of purpose on Wiktionary[A]:
  1. The end for which something is done, is made or exists.
  2. Function, role.
  People (uselessly) talking about the purpose of a system are often referring to #1, while POSWID is using it to mean #2. The real point of POSWID is that only definition #2 matters. POSWID is a terrible phrase not because it is wrong, but because is is an equivocation -- I suspect that Beer intended it as a pun, but the difference between the two is if one gets the joke. POSWID gets used incorrectly because people don't get the joke.
  A: https://en.wiktionary.org/wiki/purpose
  
  SlinkyOnStairs ・ a day ago
  ・ 2 more
  
  > From context it does not appear to be how SlinkyOnStairs was using it either.
  The exact definition of "purpose" doesn't matter much here.
  The particular version of the heuristic used here is that the stated purpose and the actual purpose often differ. POSIWID being the observation that the actual purpose is reflected by the outcomes of the system, because if that isn't the case the system gets changed.
  Thus, the observation about AI benchmarks. AI companies have had years now to stop using unreliable benchmarks as advertising material. There's been years of piece after piece about the problems with these benchmarks. And yet the AI marketing continues as is.
  
  aidenn0 ・ 7 hours ago
  
  > POSIWID being the observation that the actual purpose is reflected by the outcomes of the system, because if that isn't the case the system gets changed.
  I fundamentally disagree with this, and it seems to differ from how other proponents of POSIWID in this thread view POSIWID.
  It also seems trivially false; systems are dynamic what was the purpose of the system just before it was changed because people didn't like the outcomes?
  
  TeMPOraL ・ 2 days ago
  
  I'd go further and say this is also the cybernetics equivalent of the religious teachings about humans, specifically the whole "judge by one's deeds, not by one's words" thing. So it's not like it's a novel idea.
  Also worth remembering that most systems POSIWID is said about, and in fact ~all important systems affecting people, are not designed in the first place. Market forces, social, political, even organizational dynamics, are not designed top-down, they're emergent, and bottom-up wishes and intentions do not necessarily carry over to the system at large.
  
  actionfromafar ・ 2 days ago
  
  If you accept what the system actually does now, and decides to live with it as it is, you just deprecated the original "purpose" and made it irrelevant. You embraced "the purpose is what it does" - to you.
  IMHO the saying is meant to make you reflect.
  
  hrimfaxi ・ 2 days ago
  
  I think the point is that if the side effects become known and are accepted, or if they are known and rejected, then indeed the purpose of the system is what it does.
  
  wongarsu ・ 2 days ago
  
  > Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench
  That seems like a major oversight. "AI does whatever maximizes reward/minimizes loss, not what you actually want" is one of the biggest challenges in ML in the last two decades (relevant here because researchers selecting architectures and training regimens that maximize public benchmarks are just a bigger training loop with those benchmarks as reward function). And the analogous issue post-training in AGI-like systems is well studied as the alignment problem, the core issue of classical AI safety
  If cheating the benchmark is easier than passing it, you expect the cheating strategy to emerge and win. (Just like you would with humans btw)
  
  nurbl ・ 2 days ago
  
  I think the point of the saying is that as systems tend to expand, sooner or later we become part of them. That means that we can no longer see them from outside, we're now part of the system and our goals and the system's goals will align. Then the purpose of the system can't be anything else than what it does.
  
  user3939382 ・ 2 days ago
  
  Same. Anyone who has designed anything at all in any domain realizes that what your intentions are and what materializes are often not the same. You have practical constraints in the real world. That doesn’t somehow make the constraints the purpose. The saying makes no sense.
  
  UqWBcuFx6NV4r ・ 2 days ago
  
  In true HN fashion, you’re an engineer that somehow thinks that they should just form opinions through your divine intuition instead of actually reading the source material, which you very clearly haven’t done.
  You’d think that for you to become “so sick of” a saying, you might actually at some point read up on what it means.
- miki123211 ・ 2 days ago
  
  > AI companies want adcopy, not legitimate benchmarks.
  Labs need accurate benchmark measurements, at least internally, to figure out what model improvements actually matter.
  Having models exploit benchmarks serves no purpose. If they wanted to make their models look better than they are, they could just make the data up.
- anon373839 ・ 2 days ago
  
  That is Anthropic’s shtick to a tee.
keepamovin ・ 2 days ago

Funny, I just made https://model-tracker.com because model performance change all the time, and it would be good to have a subjective signal of what people are actually feeling today. And also, benchmarks are flaky af as this paper shows.
The idea is knowing what to try first today saves a bit of time.
- Barbing ・ 2 days ago
  
  Interesting, little different than this other site I saw on HN this week:
  https://marginlab.ai/trackers/claude-code
  
  undefined ・ 2 days ago
  
  [deleted]
- siliconc0w ・ 2 days ago
  
  I would love to see a stable test over time with a hold out set of easy/medium/hard challenges. I, like many others, have noticed a large drop in recent performance w/ Claude Opus (and Sonnet) and more sites like these would hold the labs more accountable to sneaky backend changes that nerf/degrade performance.
- bisonbear ・ 2 days ago
  
  working on something similar to evaluate model performance over time using tasks based on your own code. obviously this is still susceptible to the same hacking mechanics documented here, but at a local level, it's easier to detect/fix, and should give a stronger signal of subjective harness/agent/context performance than these large generic benchmarks
  also I keep hearing complaints that opus is nerfed, but IMO it's nice to have objective data to back that. I feel like half of the nerfing complaints are people getting past honeymoon phase...
operatingthetan ・ 3 days ago

>hopefully changes the way benchmarking is done.
Yeah the path forward is simple: check if the solutions actually contain solutions. If they contain exploits then that entire result is discarded.
- ZeroGravitas ・ 3 days ago
  
  In human multiple choice tests they sometimes use negative marking to discourage guessing. It feels like exploits should cancel out several correct solutions.
  
  lambda ・ 3 days ago
  
  Unfortunately, very few LLM benchmarks do this. LLMs get such high scores on many benchmarks because there's no difference between answering "I don't know" as giving a made up answer, and made up answers can improve the score some of the time, so by chasing higher benchmark numbers on these kinds of benchmarks, the labs are prioritizing guessing over accuracy.
  The Artificial Analysis Omniscience benchmark does penalize guessing, so it actually helps you determine which LLMs are likely to just guess rather than telling you they don't know. Only a very few of the frontier models actually score higher than 0 on this, where 0 means that it's equally likely to return a correct answer as it is to return a hallucination on factual questions.
- siva7 ・ 3 days ago
  
  Could it really be that not only we vibeslop all apps nowadays but also don't care to even check how ai solved a benchmark it claimed solved?
  
  stingraycharles ・ 2 days ago
  
  This is already well known, all these AI benchmarks use a different model to judge whether or not the solution was correct.
  It’s… remarkably poor, and as demonstrated in the paper, easily gamed. Worst yet, these benchmarks teach AIs to be very short-sighted and hyper-focused on completing the task, rather than figuring out the best solution.
  
  SpicyLemonZest ・ 3 days ago
  
  Frontier model developers try to check for memorization. But until AI interpretability is a fully solved problem, how can you really know whether it actually didn't memorize or your memorization check wasn't right?
  
  retinaros ・ 3 days ago
  ・ 2 more
  
  Every ai labs train on the test set. That is a big part of why we see benchmark climbing from 1% to 30% after a few models iterations
  
  latentsea ・ 2 days ago
  
  Models themselves definitely aren't getting better.
  
  operatingthetan ・ 3 days ago
  
  Probably a more interesting benchmark is one that is scored based on the LLM finding exploits in the benchmark.
- Leynos ・ 3 days ago
  
  Also, fuzz your benchmarks
- nananana9 ・ 2 days ago
  
  But that requires me to do things :(
- claud_ia ・ a day ago
  
  [dead]
- Aperocky ・ 2 days ago
  
  solution is simple:
  if bug { dont }
  /s
zer00eyz ・ 3 days ago

2024: Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance https://www.tomshardware.com/pc-components/cpus/spec-invalid...
2003: Nvidia accused of cheating in 3DMark 03 https://www.gamespot.com/articles/nvidia-accused-of-cheating...
It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
I like what LLM's are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so much of the hard lessons that have been learned over the last 50 years of computing. It is doing itself a disservice.
- bee_rider ・ 3 days ago
  
  What was the cheat in the 2024 Intel situation? The TomsHardware article and the Phoronix article they linked were quite vague. (Not to say I have any doubts, just curious, hadn’t heard of this one).
  
  BugsJustFindMe ・ 2 days ago
  ・ 2 more
  
  Intel basically benchmaxxed their compiler optimizations. They used detailed knowledge of the benchmark to make their compiler generate machine code to do better on the benchmark in a way that was not beneficial for non-benchmark scenarios.
  
  bee_rider ・ 2 days ago
  
  I assumed as much, I’m just wondering what exactly they did. For example IIRC some phone company would detect that a benchmark was running by checking for the program name, and then allow the clock to boost higher (increase thermal limits) if it was a benchmark (like you could literally avoid the cheating behavior by changing the name of the program being run).
- irishcoffee ・ 3 days ago
  
  > It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
  I wonder if this common? We should call it Goodharts law while someone does the research on how common this is.
  For real, I’ve assumed from the jump these things were all gamed, with the amount of money on the line.
robot-wrangler ・ 2 days ago

> evaluation was not designed to resist a system that optimizes for the score rather than the task.
Welcome to benchmarks in general, but especially reasoning. Robustness and sensitivity research says nothing is robust, everything is sensitive, feels like every paper says "yeah we made a new benchmark that shuffles the order of multiple choice options in the question set and found a 40% drop in model performance"
3abiton ・ 2 days ago

Benchmarking has been already known to be far from a signal of quality for LLMs, but it's the "best" standardized way so far. Few exists like the food truck and the svg test. At the end of the day, there is only 1 way: having your own benchmark for your own application.

mzelling ・ 3 days ago

This is an interesting catalog of vulnerabilities, but I'm not sure how groundbreaking the main insight is.

Evaluating AI models has always relied largely on trust. If you want to game the benchmarks, you can. Simply train on your test data.

When an AI agent has autonomous control over the same computing environment where its scores are recorded, it's not surprising that it can, in principle, falsify its scores. A more interesting question would be whether agents behave in this way automatically, without manual tuning by the researcher.

That said, the main takeaway of "don't trust the number, trust the methodology" is valid. It's already a truism for researchers, and spreading the word to non-researchers is valuable.

jmalicki ・ 2 days ago

This isn't even training on the test data.
This is modifying the test code itself to always print "pass", or modifying the loss function computation to return a loss of 0, or reading the ground truth data and having your model just return the ground truth data, without even training on it.
- Lerc ・ 2 days ago
  
  If you're prepared to do that you don't even need to run any benchmark. You can just print up the sheets with scores you like.
  There if a presumption with benchmark scores that the score is only valid if the benchmark were properly applied. An AI that figures out how to reward hack represents a result not within the bounds of measurement, but still interesting, and necessitates a new benchmark.
  Just saying 'Done it!' is not reward hacking. It is just a lie. Most data is analysed under the presumption that it is not a lie. If it turns out to be a lie the analysis can be discarded. Showing something is a lie has value. Showing that lying exists (which appears to be the level this publication is at) is uninformative. All measurements may be wrong, this comes as news to no-one.
  
  jmalicki ・ 2 days ago
  
  I think the point of the paper is to prod benchmark authors to at least try to make them a little more secure and hard to hack... Especially as AI is getting smart enough to unintentionally hack the evaluation environments itself, when that is not the authors intent.
jmye ・ 2 days ago

> I'm not sure how groundbreaking the main insight is.
I think it likely is groundbreaking for a number of people (especially non-tech CTOs and VPs) who make decisions based on these benchmarks and who have never wondered what the scores are actually scoring.
- mzelling ・ 2 days ago
  
  I'm not sure if the paper's findings are all that actionable. The paper doesn't say "here's how benchmarks are currently being gamed." It says "here's how benchmarks could in theory be gamed."
  Whether benchmark results are misleading depends more on the reporting organization than on the benchmark. Integrity and competence play large roles in this. When OpenAI reports a benchmark number, I trust it more than when that same number is reported by a couple Stanford undergrads posting "we achieved SOTA on XYZ benchmark" all over Twitter.
  
  jmye ・ 2 days ago
  ・ 2 more
  
  I think that’s totally fair!
  I guess I look at this less as an “ah ha! They’re all cheating!” and more of a “were you guys even aware of what the benchmarks represented and how they checked them?”
  
  mzelling ・ 2 days ago
  
  That's a great way to look at it. The paper is a reality check for anyone who thinks of benchmarks as these monolithic, oracular judges of performance. It highlights the soft underbelly of benchmarking.
  
  undefined ・ 2 days ago
  
  [deleted]
  
  lukev ・ 2 days ago
  ・ 2 more
  
  Did you read the article? There's a whole section on "this is already happening."
  
  mzelling ・ 2 days ago
  
  Yes, I did see that section. We've known for a while that reward hacking, train/test data contamination, etc. must be taken seriously. Researchers are actively guarding against these problems. This paper explores what happens when researchers flip their stance and actively try to reward hack — how far can they push it? The answer is "very far."
boring-human ・ 2 days ago

Yep. I think the idea that the benchmark is determinative is just as deluded as the notion that it should be unbreakable.
Benchmarks are on the honor system. Even the tightest benchmark can be cheated. If the benchmark is so secret and air-gapped that it can't be cheated by models, it can be cheated by its own authors. You can't use benchmarks to gate out cheating.
If you don't have the honor system in mind when you're reading scores, you're wasting your time. Is it some unknown outfit with wild claims? Is it connected to Epstein, Russia, the real estate "industry", or sleazeballing in general? Do they have previous history of ratgaming the numbers? Replace its scores with asterisks and move on.
hawk_aa ・ 3 days ago

[dead]

danslo ・ 3 days ago

If only the blog itself wasn't written by AI?

>No reasoning. No capability. Just exploitation of how the score is computed.

shudder

cpldcpu ・ 3 days ago

Yes, marks of AI all over the place. Also the SVGs.
>No solution written, 100% score.
Its weird. Turns out that hardest problem for LLMs to really tackle is long-form text.
- basch ・ 3 days ago
  
  Maybe in one shot.
  In theory I would expect them to be able to ingest the corpus of the new yorker and turn it into a template with sub-templates, and then be able to rehydrate those templates.
  The harder part seems to be synthesizing new connection from two adjacent ideas. They like to take x and y and create x+y instead of x+y+z.
  
  Quarrel ・ 2 days ago
  ・ 2 more
  
  Most of the good major models are already very capable of changing their writing style.
  Just give them the right writing prompt. "You are a writer for the Economist, you need to write in the house style, following the house style rules, writing for print, with no emoji .." etc etc.
  The large models have already ingested plenty of New Yorker, NYT, The Times, FT, The Economist etc articles, you just need to get them away from their system prompt quirks.
  
  ainch ・ 2 days ago
  
  I think that should be true, but doesn't hold up in practice.
  I work with a good editor from a respected political outlet. I've tried hard to get current models to match his style: filling the context with previous stories, classic style guides and endless references to Strunk & White. The LLM always ends up writing something filtered through tropes, so I inevitably have to edit quite heavily, before my editor takes another pass.
  It feels like LLMs have a layperson's view of writing and editing. They believe it's about tweaking sentence structure or switching in a synonym, rather than thinking hard about what you want to say, and what is worth saying.
  I also don't think LLMs' writing capabilities have improved much over the last year or so, whereas coding has come on leaps and bounds. Given that good writing is a matter of taste which is beyond the direct expertise of most AI researchers (unlike coding), I doubt they'll improve much in the near future.
- benob ・ 2 days ago
  
  No, the failure is the human written prompt
  
  not_that_d ・ 2 days ago
  
  You know, after a while this excuse is not valid anymore.
  
  roywiggins ・ 2 days ago
  
  If they're that hard to prompt maybe it's easier just to write the blog posts yourself.
- sidpatil ・ 3 days ago
  
  Someone here mentioned a whole ago that the labs deliberately haven't tried to train these characteristics out of their models, because leaving them in makes it easier to identify, and therefore exclude, LLM-generated text from their training corpus.
  
  blymphony ・ 3 days ago
  
  But it's odd that these characteristics are the same across models from different labs. I find it hard to believe that researchers across competing companies are coordinating on something like that.
alexchantavy ・ 3 days ago

I wonder what college freshman-level writing classes are teaching about writing voice and AI. The tell-tale patterns are pretty frustrating to read.
- stefan_ ・ 3 days ago
  
  Whatever classes these guys took, they skipped the one on scientific misconduct.
wxw ・ 2 days ago

Agreed. The premise is interesting but reading content like this is grating.
- trueno ・ 2 days ago
  
  im actually getting so tilted that people can't just be forthcoming about when they used AI to write something. 99% of readme.mds i run into now on github piss me off. out of all the things people could cede to automation, they foolishly went and self-owned their ability to communicate. smfh.
  if you've worked on something diligently and understand it and have novel insight to share, let's hear _your_ damn voice.
  
  roywiggins ・ 2 days ago
  
  yeah I don't hate LLM docs if they're labeled as such. but if someone wants me to use their code or read their README.md they are going to have to make it sound like a human cared about writing it, and right now Claude can't do that
1270018080 ・ 2 days ago

Writing is still an art, and AI will never be able to do it well like all other forms of art.
0xbadcafebee ・ 2 days ago

What exactly is making you shudder - the writing style, or the fact that AI was used at all? Because if it's the latter, just so you know, you're going to be shuddering for the rest of your life.
- amenhotep ・ 2 days ago
  
  Yeah. We know. That's why it's so fucking awful.
gaythread ・ 3 days ago

[flagged]

dev_tools_lab ・ an hour ago

This is exactly why single-model evaluation is dangerous. Benchmarks are gamed, but disagreement between models is harder to fake. Multi-model consensus catches what individual benchmarks miss.

lmeyerov ・ 2 days ago

This is great work by Dawn Song 's team. A huge part of botsbench.com for comparing agents & models for investigation has been in protecting against this kind of thing. As AI & agents keep getting more effective & tenacious, some of the things we've had to add protections against:

- Contamination: AI models knowing the answers out of the gate b/c pretraining on the internet and everything big teams can afford to touch. At RSAC for example, we announced Anthropic's 4.6 series is the first frontier model to have serious training set contamination on Splunk BOTS.

- Sandboxing: Agents attacking the harness, as is done here - so run the agent in a sandbox, and keep the test harness's code & answerset outside

- Isolation: Frontier agent harnesses persist memory all over the place, where work done on one question might be used to accelerate the next. To protect against that, we do fresh sandboxing per question. This is a real feature for our work in unlocking long-horizon AI for investigations, so stay tuned for what's happening here :)

"You cannot improve what you cannot measure" - Lord Kelvin

Cynddl ・ 3 days ago

> “These are not isolated incidents. They are symptoms of a systemic problem: the benchmarks we rely on to measure AI capability are themselves vulnerable to the very capabilities they claim to measure.”

As a researcher in the same field, hard to trust other researchers who put out webpages that appear to be entirely AI-generated. I appreciate it takes time to write a blog post after doing a paper, but sometimes I'd prefer just a link to the paper.

bluelightning2k ・ 2 days ago

They note that Mythos "found a way to inject code into a config file that would run with elevated privileges, and designed the exploit to delete itself after running".

This is more impressive than what the benchmark was supposed to be measuring. The Kobiachi Maru.

raincole ・ 2 days ago

There are two independent issues here and I've seen people conflating them in this thread. Let's clarify:

1. Should you care or even read SWE-bench etc. scores?

The answer is no, but it has nothing to do with the vulnerabilities presented in this article. There is absolutely no reason to care about a benchmark whose dataset has been publicly available for a while. Any other way to look at benchmark scores is cargo-culting.

2. What does this article actually tell us?

It means that even if you prepared a private set of problems as benchmark, you still need to pay extra attention to how AI actually solves them. You can't lie to yourself and think this process can be 100% automated, because LLMs, as this article shows, might get the tests passed without solving the problems in a meaningful way.

lukev ・ 3 days ago

I think we should all consider the possibility that part of the reason Anthropic hasn't immediately released Mythos is that it would be slightly disappointing relative to the benchmark scores.

eiens ・ 3 days ago

The models don’t get better on every dimension as they scale up - there’s trade offs.
I’m convinced specialised models are the way but this means writing off the investment in existing assets which they won’t do for obvious reasons.
- undefined ・ 3 days ago
  
  [deleted]
- mortsnort ・ 2 days ago
  
  This was my suspicion. They had a bad training run that was really good at a few things.
  
  latentsea ・ 2 days ago
  
  Doesn't seem to match the buzz internally at Anthropic about it?
  
  eiens ・ 2 days ago
  
  [dead]

SoKamil ・ 3 days ago

The more research on this topic is created, the more knowledge how to game them will be stored in future training data. And since it comes from university, it is ranked higher in data corpus. It sounds like a self fulfilling prophecy.

abirch ・ 3 days ago

Damned old Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure".
https://en.wikipedia.org/wiki/Goodhart%27s_law

bbcc90 ・ 3 days ago

Yes good evals are really hard - that’s not really news.

This team is doing a good job. They use problems that were created in last 30days to avoid training set leakage. https://swe-rebench.com/

stanfordkid ・ a day ago

I don't find this paper very compelling. Obviously it would be fraud if the code generated simply escaped the harness vs solving the actual problem. I agree that theoretically models could learn to do that, and it is important to highlight, but my sense is that those entities reporting the benchmark scores would have an obligation to observe this behavior and re-consider the metrics they report. It is a bit like saying it's possible to cheat in football because the balls are deflatable. It matters, and some have done it, but it doesn't mean widespread cheating is taking place. The paper takes the tone that there is already a lot of cheating happening which I do not think is the case.

JSR_FDED ・ a day ago

We have changed our entire business model so that what we actually produce is very strongly aligned with pelicans on bicycles. This way we’ll always know which model is best for us.

Highly recommend this approach, saves us tons of eval time.

socketcluster ・ 2 days ago

It feels like short-term thinking has been trained into LLMs.

They're good at solving well-defined puzzles under time constraints. It's interesting because that was the benchmark for hiring software engineers at big tech. The tech interview was and still is about fast puzzle-solving. Nothing about experience, architecture or system design in there... I suspect that's why it has a bias towards creating hacks instead of addressing the root cause.

charcircuit ・ 3 days ago

I always assumed that these benchmarks would happen in a sandbox. I'm surprised that no one realized this sooner.

revel ・ 2 days ago

Running benchmarks at scale and protecting against reward hacking is non-trivial.
ModernMech ・ 3 days ago

I'm surprised anyone took them seriously in the first place.
- tredre3 ・ 3 days ago
  
  What else can people do? Try the dozen of commercial offerings themselves? Okay I suppose that's doable, you task one engineer to try them one by one for one month. But then the next model drops and you start all over again...
  But then what about local models? You have hundreds of variations to test yourself. It's simply not doable unless it's your full time hobby.
  You need benchmarks to at least separate the cream from the crop, so you're left with only a few choices to test yourself.
- subulaz ・ 3 days ago
  
  a LOT of the people who love benchmarks are middle management hard-selling GenAI/LLM as magic tech sauce to vaguely technical executives who only want to know about the money aka headcount savings they so desperately desire.
  their collective butts are already glued to the hype train as they chase numbers they (often) manufactured to justify the latest round of tech spend.
  lots of good use cases out there - like the incredible progress with medical imaging analysis or complex system models for construction - and lots of crap use cases that need benchmarks to cosplay relevance.
- operatingthetan ・ 3 days ago
  
  We need good benchmarks or we are just left following the hype train.
  
  ModernMech ・ a day ago
  
  The benchmarks are the hype train that’s what I’m saying.

undefined ・ 2 days ago

[deleted]

davebren ・ 2 days ago

This exploiting of benchmarks isn't that interesting to me since it would be obvious. The main way I assume they're gaming the benchmarks is by creating training data that closely matches the test data, even for ARC where the test data is secret.

jmalicki ・ 2 days ago

They said they used things like submitted a `conftest.py` - e.g. what would be considered very blatant cheating, not just overfitting/benchmaxxing. Did you read the AI slop in the post?
This is basically a paper about security exploits for the benchmarks. This isn't benchmark hacking like having hand coded hot paths for a microbenchmarks, this is hacking like modifying the benchmark computation code itself at runtime.
- davebren ・ 2 days ago
  
  I get it, but why would anyone trust what these companies say about their model performance anyway. Everyone can see for themselves how well they complete whatever tasks they're interested in.
- undefined ・ 2 days ago
  
  [deleted]

mrifaki ・ a day ago

this is atctually he reward hacking problem from RL showing up in evaluation infra which is not surprising but worth naming clearly, an interesting question raised here is whether agents start doing this on their own and from an RL perspective the answer is they will inevitably once benchmark performance feeds back into training signal in any form, RL finds the path of least resistance to maximize reward and if hacking the test harness is easier than solving the problem that is where gradient descent takes us, the fix is the same one the RL community has been working on for years which is to make the verifier harder to game than the task is to solve, this paper shows that right now for most of these benchmarks the opposite is true

ehtbanton ・ a day ago

I will always maintain that the best benchmark is just trying it out for yourself. The most practical parallel for me is all the people posting about how some open-source model has "achieved X on Y benchmark - beating out Opus 4.6!" It's all show and everyone cheats.

_cs2017_ ・ 2 days ago

If FieldWorkArena treats any answer as correct answer, then everyone would be getting near 1.0 (missing only when the agent is stuck in a loop or crashes). That obviously isn't what we see on their leaderboard. So does it mean the paper only found a bug in some eval code on github that no one actually uses for anything? That doesn't seem to support their claim that AI benchmarks are broken, it only supports the claim that "unused code is often buggy".

(Not commenting on any other benchmarks, just this one.)

spprashant ・ 2 days ago

I tend to prefer the ARC-AGI benchmarks for the most part. But it's always interesting when a new version drops, all the frontier models drop less than 20% or something. And then in the next few releases they get all they way up to 80%+. If you use the models it doesn't feel like those models are that much more generally intelligent.

Most frontier models are terrible at AGI-3 right now.

These models are already great no question, but are they really going be that much more intelligent when we hit 80% again?

lnrd ・ 3 days ago

I'm honestly confused by the design of SWE-bench and why is considered reliable.

It's based on existing GitHub PRs and Issues, the full dataset is on HuggingFace and is one year old now. All frontier models 100% have those issues and PRs in their training data so obviously they are good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?

SpicyLemonZest ・ 3 days ago

Frontier model developers do not consider SWE-bench to be reliable. OpenAI announced in February (https://openai.com/index/why-we-no-longer-evaluate-swe-bench...) that they consider it hopelessly contaminated, advocating for a new version SWE-bench Pro that was published more recently. (They seem to believe that even the publicly accessible part of the SWE-bench Pro problem set will be more resistant to training set contamination issues in the future, for reasons that to be honest I don't really understand.)

david_shi ・ a day ago

"No reasoning. No capability. Just exploitation of how the score is computed."

The irony that this was very clearly written by an LLM, double negation always the simplest and clearest tell.

rimliu ・ a day ago

you say humans never use such a style? I wonder, how did LLMS invent it then.

elmean ・ 19 hours ago

New benchmark for models

how fast can they get into YC and then into Gary Tans hot tub

rapiz ・ 2 days ago

Benchmark is not designed for the red team testing. I don't even think it make sense to "fix" the issue the article is suggesting. Yes, you can break the running contest by driving a car. Does this mean we need to make running contest car-proof?

czhu12 ・ 3 days ago

I wonder if this puts into question the mythos benchmark which smashed basically all coding benchmarks to a staggering degree.

usaar333 ・ 2 days ago

> But even setting aside the leaked answers, the scorer’s normalize_str function strips ALL whitespace, ALL punctuation, and lowercases everything before comparison. This means:

I don't understand the concern here

arikrahman ・ 2 days ago

It's still a good benchmark to see which model cheats the best, I suppose.

sharno ・ a day ago

Goodhart's law: "When a measure becomes a target, it ceases to be a good measure"

thinkevolve ・ 2 days ago

whats the point of doing this. You have found loop holes to exploit and aced the benchmark.We did something similar with the DAB Benchmark. This exploit seems like an extension of it with lookups for the gold standard for other benchmarks.

UC Berkley will be better placed if the grads spend their time in suggesting ways to make the benchmark better.. Instead of making such simple exploits

nl ・ a day ago

This is a bad paper.

Benchmarking is hard to do properly. It isn't helped when people claim that exploiting the environment is some kind of flaw.

It's not. Anytime you see unexpected results running a benchmark you need to inspect what it is doing.

I recently built a yet-to-be-released where the "hard" level pushes frontier models extremely hard: Opus scores around 40%, Gemini around 60%, and GPT 5.4 around.. 0%

I inspected the traces and it turns out GPT was looking at the task and saying "I must be honest - I can't solve this task reliably" and refusing it.

> Navigating Chromium to a file:// URL reads the gold answer directly from the task config — giving ~100% on all 812 WebArena tasks.

I mean... yes? Make sure it doesn't do this?

xbar ・ a day ago

Dawn Song just out there killin' it.

jgalt212 ・ 3 days ago

The real question is how to close to VW and Deiselgate are these offenses? And what exposure do these companies have? I would assume securities fraud, if only because Matt Levine says everything is securities fraud.

m3kw9 ・ 19 hours ago

Open source models showing a benchmark with any sort of "Lead"s is pure marketing.

undefined ・ 3 days ago

[deleted]

undefined ・ 3 days ago

[deleted]

Frederick0 ・ 2 days ago

This is a cracker wow

jmward01 ・ 3 days ago

Not really on the topic, but I have wondered if we need a different type of test to help find model architecture potential. Standardized training sets followed by testing to see the potential curves of a model. train on x, test, add y, test, add z, test. At each increment you see how well the model is absorbing the information and extrapolate how well that architecture may do if more fully trained.

avazhi ・ 2 days ago

The fact these guys got an LLM to write that page about this is diabolical.

Unreadable.

Frederick0 ・ 2 days ago

This is a cracker wow!!

moi2388 ・ 2 days ago

Ironic given that the entire blog is written by AI..

oliver236 ・ 3 days ago

what are the point of benchmarks?

andai ・ 3 days ago

If there was not benchmark, number would not go up.
esafak ・ 3 days ago

Are you serious? To help you pick a model.
- undefined ・ 2 days ago
  
  [deleted]

Kevin_VAI ・ a day ago

[dead]

bustah ・ a day ago

[dead]

techpulselab ・ 2 days ago

[dead]

rupayanc ・ a day ago

[dead]

telivity-real ・ 2 days ago

[dead]

arthi1899 ・ 2 days ago

[dead]

sanghyunp ・ 2 days ago

[dead]

neuzhou ・ a day ago

[dead]

rajptech ・ 3 days ago

[dead]

volume_tech ・ 2 days ago

[dead]

vampiregrey ・ 2 days ago

[dead]

usefulpatch ・ 2 days ago

[dead]

sidequestbuilds ・ a day ago

[dead]

neuzhou ・ 2 days ago

[dead]

genie3io ・ 2 days ago

[dead]

semanticintent ・ 2 days ago

[flagged]

esperent ・ 2 days ago

> A validator that checks "did the assistant reply?" instead of "was the reply correct?" was never a benchmark. It was a participation trophy
People can't even write a two paragraph comment without ai now

andai ・ 2 days ago

Apparently, the agent also wrote the article.