While I think there's significant AI "offloading" in writing, the article's methodology relies on "AI-detectors," which reads like PR for Pangram. I don't need to explain why AI detectors are mostly bullshit and harmful for people who have never used LLMs. [1]
1: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
I am not sure if you are familiar with Pangram (co-founder here) but we are a group of research scientists who have made significant progress in this problem space. If your mental model of AI detectors is still GPTZero or the ones that say the declaration of independence is AI, then you probably haven't seen how much better they've gotten.
This paper by economists from the University of Chicago found zero false positives across 1,992 human-written documents and over 99% recall in detecting AI documents. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5407424
Nothing signals an invalid benchmark like a zero false positive rate. Seemingly it is pre-2020 text versus a few models' reworkings of that text. I can see this model falling apart in many real-world scenarios. Yes, LLMs use strange language if left to their own devices, and that can surely be detected. But a 0% false positive rate under all circumstances? Implausible.
Our benchmarks of public datasets put our FPR roughly around 1 in 10,000. https://www.pangram.com/blog/all-about-false-positives-in-ai...
Find me a clean public dataset with no AI involvement and I will be happy to report Pangram's false positive rate on it.
Max, there are two problems I see with your comment.
1) the paper didn't show a 0% FNR. I mean tables 4, 7, and B.2 are pretty explicit. It's not hard to figure out from the others either.
2) a 0% error rate requires some pretty serious assumptions to be true. For that type of result not to be incredibly suspect, there has to be zero noise in the data, the analysis, and every step in between. I do not see that being true of the mentioned dataset.
Even high scores are suspect. Generalizing the previous point: a score is suspect if it is higher than the noise level allows. Can you truly attest that this condition holds?
I suspect that you're introducing data leakage. I haven't looked into your training and data enough to determine how that's happening, but you'll probably need a pretty deep analysis, as leakage is really easy to sneak in, and it can do so in non-obvious ways. A very common one is tuning hyperparameters on test results: you don't have to pass data to pass information. Another sly way for this to happen is that the test set isn't sufficiently disjoint from the training set. If the perturbation is too small, then you aren't testing generalization, you're testing a slightly noisy training set (and your training should be introducing noise to help regularize, so you end up just measuring your training performance).
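For what it's worth, here is a minimal sketch of the kind of train/test-overlap audit I mean, flagging test documents that are near-duplicates of training documents via character 5-gram Jaccard similarity. The 0.5 threshold and the toy documents are made up for illustration; a real audit would use something like MinHash/LSH at scale.

    # Flag test documents that look like lightly perturbed copies of training documents.
    def shingles(text: str, n: int = 5) -> set[str]:
        text = " ".join(text.lower().split())          # normalize whitespace and case
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

    def jaccard(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    def leaked_pairs(train_docs, test_docs, threshold=0.5):
        train_sets = [shingles(d) for d in train_docs]
        for j, t in enumerate(test_docs):
            t_set = shingles(t)
            for i, s in enumerate(train_sets):
                sim = jaccard(t_set, s)
                if sim >= threshold:
                    yield (i, j, sim)                  # (train index, test index, similarity)

    train = ["The quick brown fox jumps over the lazy dog."]
    test = ["The quick brown fox jumped over a lazy dog."]
    print(list(leaked_pairs(train, test)))             # a high-similarity pair means a suspicious split

Anything that trips a check like this isn't testing generalization; it's re-measuring the training set.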
Your numbers are too good and that's suspect. You need a lot more evidence to suggest they mean what you want them to mean.
I enjoyed this thoughtful write-up. It's a vitally important area for good, transparent work to be done.
> Nothing points out that the benchmark is invalid like a zero false positive rate
You’re punishing them for claiming to do a good job. If they truly are doing a bad job, surely there is a better criticism you could provide.
No one is punishing anyone. They just make an implausible claim. That is it.
It looks like about 5% of human texts from your paper are marked as mixed, and roughly 5-10% of AI texts are marked as mixed:

    EditLens (Ours) confusion matrix (rows = true label, columns = predicted label)

                  Predicted
                  Human     Mix      AI
    True Human     1770     111       0
    True Mix        265    1945      28
    True AI           0     186    1695

I guess I don't see that this is much better than what's come before, using your own paper.
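Spelling out the per-class error rates from that table (my own arithmetic from the counts above, not a figure reported in the paper):

    # Rows are true labels, columns are predicted Human/Mix/AI counts from the table above.
    matrix = {
        "Human": [1770, 111, 0],
        "Mix":   [265, 1945, 28],
        "AI":    [0, 186, 1695],
    }
    labels = ["Human", "Mix", "AI"]
    for label, row in matrix.items():
        total = sum(row)
        wrong = total - row[labels.index(label)]   # everything off the diagonal
        print(f"{label}: {wrong}/{total} misclassified = {100 * wrong / total:.1f}%")
    # Human: 111/1881 = 5.9%, Mix: 293/2238 = 13.1%, AI: 186/1881 = 9.9%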
Edit: this is an irresponsible Nature news article, too - we should see a graph of this detector over the past ten years to see how much of this ‘deluge’ is algorithmic error
It is not wise to brag about your product when the GP is pointing out that the article "reads like PR for Pangram", regardless of whether AI detectors are reliable or not.
I would say it's important to hold off on the moralizing until after showing visible effort to reflect on the substance of the exchange, which in this case is whether it's fair to assert that the detection methodology employed here shares the flaws of the familiar online AI checkers. That's a substantive and rebuttable point, and all the meaningful action in the conversation is embedded in those details.
In this case, several important distinctions are drawn, including openness about criteria, about properties like "perplexity" and "burstiness" being tested for, and an explanation of why earlier detectors incorrectly claim the Declaration of Independence is AI-generated (it's ubiquitous). So a lot of important distinctions are being drawn that testify to the credibility of the model, which has to matter to you if you're going to start moralizing.
How do you discern between papers "completely fabricated" by AI vs. edited by AI for grammar?
The response would be more helpful if it directly addresses the arguments in posts from that search result.
There are dozens of first-generation AI detectors and they all suck. I'm not going to defend them. Most of them use perplexity-based methods, which are a decent separator of AI and human text (80-90%) but have flaws that can't be overcome and high FPRs on ESL text.
https://www.pangram.com/blog/why-perplexity-and-burstiness-f...
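To make the contrast concrete, here is a minimal sketch of that perplexity-style scoring, assuming GPT-2 as the reference model and an arbitrary 30.0 cutoff; it illustrates the first-generation baseline being criticized, not how Pangram works.

    # Score text by how predictable it is to a small reference LM;
    # low perplexity gets labeled "AI-like". This is the flawed baseline approach.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss        # mean token cross-entropy
        return float(torch.exp(loss))

    def crude_verdict(text: str, cutoff: float = 30.0) -> str:
        return "AI-like" if perplexity(text) < cutoff else "human-like"

    print(crude_verdict("We hold these truths to be self-evident..."))

Ubiquitous, highly fluent text (like the Declaration of Independence) has very low perplexity under any reference model, which is exactly where this approach produces its famous false positives.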
Pangram is fundamentally different technology: it's a large deep-learning-based model trained on hundreds of millions of human and AI examples. Some people see a dozen failed attempts at a problem as proof that the problem is impossible, but I would like to remind you that basically every major and minor technology was preceded by failed attempts.
Some people see a dozen extremely profitable, extremely destructive attempts at a problem as proof that the problem is not a place for charitable interpretation.
And you don't think a dozen basically-scam attempts built around the technology justify extreme scepticism?
huh?
GAN... Just feed the output of your detector back into the LLM while it's learning. At the end of the day the problem is impossible, but we're not there yet.
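A toy sketch of that feedback loop, with a frozen stand-in "detector" and a tiny stand-in "generator" (made-up two-layer networks, nothing like a real LLM or any actual detector); a full GAN would also keep retraining the detector in the same loop:

    # The generator is optimized purely to push the frozen detector's P(AI) toward 0.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Stand-in "detector": maps a 16-dim text embedding to P(AI-written).
    detector = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
    for p in detector.parameters():
        p.requires_grad_(False)                        # frozen: only its score is used

    # Stand-in "generator": maps random noise to a fake "embedding".
    generator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
    opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

    for step in range(2000):
        fake = generator(torch.randn(64, 8))
        p_ai = detector(fake)                          # detector's belief the sample is AI
        loss = nn.functional.binary_cross_entropy(p_ai, torch.zeros_like(p_ai))
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        print(f"mean P(AI) after training: {detector(generator(torch.randn(256, 8))).mean():.3f}")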
Can your software detect which LLMs most likely generated a text?
Pangram is trained on this task as well to add additional signal during training, but it's only ~90% accurate, so we don't show the prediction in public-facing results.
Are you concerned with your product being used to improve AI to be less detectable?
> Are you concerned with your product being used to improve AI to be less detectable?
The big AI providers don't have any obvious incentive to do this. If it happens 'naturally' in the pursuit of quality then sure, but explicitly training for stealth is a brand concern in the same way that offering a fully uncensored model would be.
Smaller providers might do this (again, in the same way they now offer uncensored models), but they occupy a minuscule fraction of the market and will be a generation or two behind the leaders.
They don't have an incentive to make their AIs better? If your product can genuinely detect AI writing, of course they would use it to make their models sound more human. The biggest criticism of AI right now is how robotic and samey it sounds.
It's definitely going to be a back and forth - model providers like OpenAI want their LLMs to sound human-like. But this is the battle we signed up for, and we think we're more nimble and can iterate faster to stay one step ahead of the model providers.
That sounds extremely naive but good luck!
I thought the author was attempting to highlight the hypocrisy of using an AI to detect other uses of AI, as if one was a good use, and the other bad.
Hi Max! Thank you for updating my mental model of AI detectors.
I was, with total certainty, under the impression that detecting AI-written text is an impossible-to-solve problem. I think that's because it's just so deceptively intuitive to believe that "for every detector, there'll just be a better LLM, and it'll never stop."
I had recently published a macOS app called Pudding to help humans prove they wrote a text mainly under the assumption that this problem can't be solved with measurable certainty and traditional methods.
Now I'm of course a bit sad that the problem (and hence my solution) can be solved much more directly. But, hey, I fell in love with the problem, so I'm super impressed with what y'all are accomplishing at and with Pangram!
I see the bullshit part continues on the PR side as well, not just in the product.
AI detectors are only harmful if you use them to convict people; it isn't harmful to gather statistics like this. They didn't find many AI-written papers, just AI-written peer reviews, which is what you would expect, since not many people would generate their whole paper submission while peer review is thankless work.
If you have a bullshit measure that determines some phenomenon (e.g. crime) to happen in some area, you will become biased to expect it in that area. It wrongly creates a spotlight effect by which other questionable measures are used to do the actual conviction ("Look! We found an em dash!").
I think there is a funny bit of mental gymnastics that goes on here sometimes, definitely. LLM skeptics (which I'm not saying the Pangram folks are in particular) would say: "LLMs are unreliable and therefore useless; they produce slop at great cost to the environment and other people." But if a study comes out that confirms their biases and uses an LLM in the process, or if they themselves use an LLM to identify -- or in many cases just validate their preconceived notion -- that something was drafted using an LLM, then all of a sudden things are above board.