While I think there's significant AI "offloading" in writing, the article's methodology relies on "AI-detectors," which reads like PR for Pangram. I don't need to explain why AI detectors are mostly bullshit and harmful for people who have never used LLMs. [1]
1: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
I am not sure if you are familiar with Pangram (co-founder here) but we are a group of research scientists who have made significant progress in this problem space. If your mental model of AI detectors is still GPTZero or the ones that say the declaration of independence is AI, then you probably haven't seen how much better they've gotten.
This paper by economists from the University of Chicago found zero false positives across 1,992 human-written documents and over 99% recall in detecting AI documents. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5407424
Nothing signals an invalid benchmark like a zero false positive rate. Seemingly it is pre-2020 text versus a few models' reworkings of that text. I can see this model falling apart in many real-world scenarios. Yes, LLMs use strange language if left to their own devices, and that can surely be detected. But a 0% false positive rate under all circumstances? Implausible.
Our benchmarks of public datasets put our FPR roughly around 1 in 10,000. https://www.pangram.com/blog/all-about-false-positives-in-ai...
Find me a clean public dataset with no AI involvement and I will be happy to report Pangram's false positive rate on it.
Max, there are two problems I see with your comment.
1) the paper didn't show a 0% FNR. I mean tables 4, 7, and B.2 are pretty explicit. It's not hard to figure out from the others either.
2) a 0% error rate requires some pretty serious assumptions to be true. For that type of result not to be incredibly suspect, there has to be zero noise in the data, the analysis, and every step in between. I do not see that being true of the mentioned dataset.
Even high scores are suspect. Generalizing the previous point: a score is suspect if it is higher than the noise level allows. Can you truly attest that this condition holds?
I suspect that you're introducing data leakage. I haven't looked into your training and data enough to determine how that's happening, but you'll probably need a pretty deep analysis, as leakage is really easy to sneak in, and it can do so in non-obvious ways. A very common one is tuning hyperparameters on test results: you don't have to pass data to pass information. Another sly way for this to happen is that the test set isn't sufficiently disjoint from the training set. If the perturbation is too small, then you aren't testing generalization, you're testing a slightly noisy training set (and your training should be introducing noise to help regularize, so you end up just measuring your training performance).
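For what it's worth, here is a minimal sketch of the kind of train/test-overlap audit I mean, flagging test documents that are near-duplicates of training documents via character 5-gram Jaccard similarity. The 0.5 threshold and the toy documents are made up for illustration; a real audit would use something like MinHash/LSH at scale.

    # Flag test documents that look like lightly perturbed copies of training documents.
    def shingles(text: str, n: int = 5) -> set[str]:
        text = " ".join(text.lower().split())          # normalize whitespace and case
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

    def jaccard(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    def leaked_pairs(train_docs, test_docs, threshold=0.5):
        train_sets = [shingles(d) for d in train_docs]
        for j, t in enumerate(test_docs):
            t_set = shingles(t)
            for i, s in enumerate(train_sets):
                sim = jaccard(t_set, s)
                if sim >= threshold:
                    yield (i, j, sim)                  # (train index, test index, similarity)

    train = ["The quick brown fox jumps over the lazy dog."]
    test = ["The quick brown fox jumped over a lazy dog."]
    print(list(leaked_pairs(train, test)))             # a high-similarity pair means a suspicious split

Anything that trips a check like this isn't testing generalization; it's re-measuring the training set.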
Your numbers are too good and that's suspect. You need a lot more evidence to suggest they mean what you want them to mean.
I enjoyed this thoughtful write-up. It's a vitally important area for good, transparent work to be done.
> Nothing points out that the benchmark is invalid like a zero false positive rate
You’re punishing them for claiming to do a good job. If they truly are doing a bad job, surely there is a better criticism you could provide.
No one is punishing anyone. They just make an implausible claim. That is it.
It looks like about 5% of human texts from your paper are marked as mixed, and roughly 5-10% of AI texts are marked as mixed:

    EditLens (Ours) confusion matrix (rows = true label, columns = predicted label)

                  Predicted
                  Human     Mix      AI
    True Human     1770     111       0
    True Mix        265    1945      28
    True AI           0     186    1695

I guess I don't see that this is much better than what's come before, using your own paper.
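Spelling out the per-class error rates from that table (my own arithmetic from the counts above, not a figure reported in the paper):

    # Rows are true labels, columns are predicted Human/Mix/AI counts from the table above.
    matrix = {
        "Human": [1770, 111, 0],
        "Mix":   [265, 1945, 28],
        "AI":    [0, 186, 1695],
    }
    labels = ["Human", "Mix", "AI"]
    for label, row in matrix.items():
        total = sum(row)
        wrong = total - row[labels.index(label)]   # everything off the diagonal
        print(f"{label}: {wrong}/{total} misclassified = {100 * wrong / total:.1f}%")
    # Human: 111/1881 = 5.9%, Mix: 293/2238 = 13.1%, AI: 186/1881 = 9.9%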
Edit: this is an irresponsible Nature news article, too - we should see a graph of this detector over the past ten years to see how much of this ‘deluge’ is algorithmic error
It is not wise to brag about your product when the GP is pointing out that the article "reads like PR for Pangram", regardless of whether AI detectors are reliable or not.
I would say it's important to hold off on the moralizing until after showing visible effort to reflect on the substance of the exchange, which in this case is whether it's fair to assert that the detection methodology employed here shares the flaws of the familiar online AI checkers. That's a substantive and rebuttable point, and all the meaningful action in the conversation is embedded in those details.
In this case, several important distinctions are drawn, including openness about criteria, about properties like "perplexity" and "burstiness" being tested for, and an explanation of why earlier detectors incorrectly claim the Declaration of Independence is AI-generated (it's ubiquitous). So a lot of important distinctions are being drawn that testify to the credibility of the model, which has to matter to you if you're going to start moralizing.
How do you discern between papers "completely fabricated" by AI vs. edited by AI for grammar?
The response would be more helpful if it directly addresses the arguments in posts from that search result.
There are dozens of first-generation AI detectors and they all suck. I'm not going to defend them. Most of them use perplexity-based methods, which are a decent separator of AI and human text (80-90%) but have flaws that can't be overcome and high FPRs on ESL text.
https://www.pangram.com/blog/why-perplexity-and-burstiness-f...
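To make the contrast concrete, here is a minimal sketch of that perplexity-style scoring, assuming GPT-2 as the reference model and an arbitrary 30.0 cutoff; it illustrates the first-generation baseline being criticized, not how Pangram works.

    # Score text by how predictable it is to a small reference LM;
    # low perplexity gets labeled "AI-like". This is the flawed baseline approach.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss        # mean token cross-entropy
        return float(torch.exp(loss))

    def crude_verdict(text: str, cutoff: float = 30.0) -> str:
        return "AI-like" if perplexity(text) < cutoff else "human-like"

    print(crude_verdict("We hold these truths to be self-evident..."))

Ubiquitous, highly fluent text (like the Declaration of Independence) has very low perplexity under any reference model, which is exactly where this approach produces its famous false positives.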
Pangram is fundamentally different technology: it's a large deep-learning-based model trained on hundreds of millions of human and AI examples. Some people see a dozen failed attempts at a problem as proof that the problem is impossible, but I would like to remind you that basically every major and minor technology was preceded by failed attempts.
Some people see a dozen extremely profitable, extremely destructive attempts at a problem as proof that the problem is not a place for charitable interpretation.
And you don't think a dozen basically-scam attempts built around the technology justify extreme scepticism?
huh?
GAN... Just feed the output of your detector back into the LLM while it's learning. At the end of the day the problem is impossible, but we're not there yet.
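A toy sketch of that feedback loop, with a frozen stand-in "detector" and a tiny stand-in "generator" (made-up two-layer networks, nothing like a real LLM or any actual detector); a full GAN would also keep retraining the detector in the same loop:

    # The generator is optimized purely to push the frozen detector's P(AI) toward 0.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Stand-in "detector": maps a 16-dim text embedding to P(AI-written).
    detector = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
    for p in detector.parameters():
        p.requires_grad_(False)                        # frozen: only its score is used

    # Stand-in "generator": maps random noise to a fake "embedding".
    generator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
    opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

    for step in range(2000):
        fake = generator(torch.randn(64, 8))
        p_ai = detector(fake)                          # detector's belief the sample is AI
        loss = nn.functional.binary_cross_entropy(p_ai, torch.zeros_like(p_ai))
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        print(f"mean P(AI) after training: {detector(generator(torch.randn(256, 8))).mean():.3f}")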
Can your software detect which LLMs most likely generated a text?
Pangram is trained on this task as well to add additional signal during training, but it's only ~90% accurate, so we don't show the prediction in public-facing results.
Are you concerned with your product being used to improve AI to be less detectable?
> Are you concerned with your product being used to improve AI to be less detectable?
The big AI providers don't have any obvious incentive to do this. If it happens 'naturally' in the pursuit of quality then sure, but explicitly training for stealth is a brand concern in the same way that offering a fully uncensored model would be.
Smaller providers might do this (again, in the same way they now offer uncensored models), but they occupy a minuscule fraction of the market and will be a generation or two behind the leaders.
They don't have an incentive to make their AIs better? If your product can genuinely detect AI writing, of course they would use it to make their models sound more human. The biggest criticism of AI right now is how robotic and samey it sounds.
It's definitely going to be a back and forth - model providers like OpenAI want their LLMs to sound human-like. But this is the battle we signed up for, and we think we're more nimble and can iterate faster to stay one step ahead of the model providers.
That sounds extremely naive but good luck!
I thought the author was attempting to highlight the hypocrisy of using an AI to detect other uses of AI, as if one was a good use, and the other bad.
Hi Max! Thank you for updating my mental model of AI detectors.
I was, with total certainty, under the impression that detecting AI-written text is an impossible-to-solve problem. I think that's because it's just so deceptively intuitive to believe that "for every detector, there'll just be a better LLM, and it'll never stop."
I had recently published a macOS app called Pudding to help humans prove they wrote a text mainly under the assumption that this problem can't be solved with measurable certainty and traditional methods.
Now I'm of course a bit sad that the problem (and hence my solution) can be solved much more directly. But, hey, I fell in love with the problem, so I'm super impressed with what y'all are accomplishing at and with Pangram!
I see the bullshit part continues on the PR side as well, not just in the product.
AI detectors are only harmful if you use them to convict people; it isn't harmful to gather statistics like this. They didn't find many AI-written papers, just AI-written peer reviews, which is what you would expect, since not many people would generate their whole paper submission while peer review is thankless work.
If you have a bullshit measure that determines some phenomenon (e.g. crime) to happen in some area, you will become biased to expect it in that area. It wrongly creates a spotlight effect by which other questionable measures are used to do the actual conviction ("Look! We found an em dash!").
I think there is a funny bit of mental gymnastics that goes on here sometimes, definitely. LLM skeptics (which I'm not saying the Pangram folks are in particular) would say: "LLMs are unreliable and therefore useless; they produce slop at great cost to the environment and other people." But if a study comes out that confirms their biases and uses an LLM in the process, or if they themselves use an LLM to identify -- or in many cases just validate their preconceived notion -- that something was drafted using an LLM, then all of a sudden things are above board.