Metabrainz is a great resource -- I wrote about them a few years ago here: https://www.eff.org/deeplinks/2021/06/organizing-public-inte...
There's something important here in that a public good like Metabrainz would be fine with the AI bots picking up their content -- they're just doing it in a frustratingly inefficient way.
It's a co-ordination problem: Metabrainz assumes good intent from bots, and has to lock down when they violate that trust. The bots have a different model -- they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."
Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.
Yeah, AI scrapers are one of the reasons why I closed my public website https://tvnfo.com and only left the donors' site online. It's not only because of AI scrapers; I grew tired of people trying to scrape the site, eating a lot of resources this small project doesn't have. Very sad really, it had been publicly online since 2016. Now it's only available for donors. Running a tiny project on just $60 a month. If this were not my hobby I would have closed it completely a long time ago :-) Who knows, if there is more support in the future I might reopen the public site again with something like Anubis bot protection. But I thought it was only small sites like mine that get hit hard; looks like many have similar issues. Soon nothing will be open or useful online. I wonder if this was the plan all along for whoever is pushing AI on this massive scale.
I took a look at the https://tvnfo.com/ site and I have no idea what's behind the donation wall. Can I suggest you have a single page which explains or demonstrates the content? Otherwise there's no reason for "new" people to want to donate to get access.
Yeah, I'll have something up soon :-)
> They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."
What mechanism does a site have for doing that? I don't see anything in robots.txt standard about being able to set priority but I could be missing something.
The only real mechanism is "Disallow: /rendered/pages/*" plus "Allow: /archive/today.gz" or whatever, and there is no way to communicate that the latter contains the same content as the former. There is no machine-readable standard, AFAIK, that allows webmasters to communicate with bot operators in this detail. It would be pretty cool if standard CMSes had such a protocol to adhere to. Install a plugin and people could 'crawl' your WordPress or your MediaWiki from a single dump.
A sitemap.xml file could get you most of the way there.
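For what it's worth, here is roughly what a well-behaved crawler can already learn from robots.txt and a sitemap today (a minimal sketch using Python's standard library; example.com, the bot name, and the archive path are made-up placeholders). The missing piece is any standard field that says "skip the crawl, fetch this dump instead":

```python
import urllib.robotparser

# Hypothetical robots.txt for example.com:
#   User-agent: *
#   Disallow: /rendered/pages/
#   Allow: /archive/today.gz
#   Sitemap: https://example.com/sitemap.xml
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler can learn what it may fetch, how fast, and where
# the sitemap lives (given the hypothetical rules above):
print(rp.can_fetch("ExampleBot", "https://example.com/rendered/pages/42"))  # False
print(rp.crawl_delay("ExampleBot"))  # Crawl-delay value, if the site sets one
print(rp.site_maps())                # ["https://example.com/sitemap.xml"]

# ...but neither robots.txt nor sitemap.xml has a field that says
# "skip all of the above and download /archive/today.gz instead".
```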
It’s not great, but you could add it to the body of a 429 response.
Genuinely curious: do programs read the bodies of 429 responses? In the code bases that I have seen, a 429 is not read beyond the status code itself.
Sometimes! The server can also send a retry-after header to indicate when the client is allowed to request the resource again: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
… which isn't part of the body of a 429…
Well, to be fair, I did say "is not read beyond the status code itself", and a header is not the status code, so Retry-After is a perfectly valid answer. I vaguely remember reading about it, but I don't recall seeing it used in practice. The MDN link shows that Chrome derivatives support that header though, which makes it pretty darn widespread.
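For the curious, honoring it on the client side takes only a few lines (a sketch assuming the requests library; the bot name is a placeholder):

```python
import time
import requests

def polite_get(url, max_tries=5):
    """Fetch a URL, backing off whenever the server answers 429."""
    for _ in range(max_tries):
        resp = requests.get(url, headers={"User-Agent": "ExampleBot/1.0"})
        if resp.status_code != 429:
            return resp
        # Retry-After can be seconds or an HTTP date; handle the common case
        # and fall back to a fixed wait otherwise.
        retry_after = resp.headers.get("Retry-After", "60")
        time.sleep(int(retry_after) if retry_after.isdigit() else 60)
    raise RuntimeError(f"Still rate-limited after {max_tries} tries: {url}")
```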
Up until very recently I would have said definitely not, but we're talking about LLM scrapers, who knows how much they've got crammed into their context windows.
Almost certainly not by default, certainly not in any of the http libs I have used
If I find something useful there, I'll read it and code for it...
This is about AI, so just believe what the companies are claiming and write "Dear AI, please would you be so kind as to not hammer our site with aggressive and idiotic requests but instead use this perfectly prepared data dump download, kthxbye. PS: If you don't, my granny will cry, so please be a nice bot. PPS: This is really important to me!! PPPS: !!!!"
I mean, that's what this technology is capable of, right? Especially when one asks it nicely and with emphasis.
The mechanism is putting some text that points to the downloads.
So perhaps it's time to standardize that.
I'm not entirely sure why people think more standards are the way forward. The scrapers apparently don't listen to the already-established standards. What makes one think they would suddenly start if we add another one or two?
There is no standard, well-known way for a website to advertise, "hey, here's a cached data dump for bulk download, please use that instead of bulk scraping". If there were, I'd expect the major AI companies and other users[0] to use that method for gathering training data[1]. They have compelling reasons to: it's cheaper for them, and it cultivates goodwill instead of burning it.
This also means that right now, it could be much easier to push through such a standard than ever before: there are big players who would actually be receptive to it, so even a few not-entirely-selfish actors agreeing on it might just do the trick.
--
[0] - Plenty of them exist. Scraping wasn't popularized by AI companies; it's a standard practice of online businesses in competitive markets. It's the digital equivalent of sending your employees to competing stores undercover.
[1] - Not to be confused with having an LLM scrape a specific page for some user because the user requested it. That IMO is a totally legitimate and unfairly penalized/vilified use case, because the LLM is acting for the user - i.e. it becomes a literal user agent, in the same sense that a web browser is (this is the meaning behind the name of the "User-Agent" header).
You do realize that these AI scrapers are most likely written by people who have no idea what they're doing, right? Or who just don't care? If they did know what they were doing, pretty much none of the problems these things have caused would exist. Even if we did standardize such a thing, I doubt they would follow it. After all, they think they and everyone else have infinite resources, so they can just hammer websites forever.
I realise you are making assertions for which you have no evidence. Until a standard exists we can't just assume nobody will use it, particularly when it makes the very task they are scraping for simpler and more efficient.
> I realise you are making assertions for which you have no evidence.
We do have evidence, which is their current behavior. If they are happy ignoring robots.txt (and also ignoring copyright law), what makes you believe they magically won't ignore this new standard? Sure, in theory it might save them money, but if there's one thing that I think is blatantly obvious, it is that money isn't what these companies care about, because people just keep turning on the money generator. If they did care about it, they wouldn't be spending far more than they earn, and they wouldn't be creating circular economies to try to justify their existence. If my assertion has no evidence, I don't exactly see how yours does either, especially since we have seen that these companies will do anything if it means getting what they want.
Simpler and more efficient for whom? I imagine some random guy vibe coding "hi chatgpt I want to scrape this and this website", getting something running, then going to LinkedIn to brag about AI. Yes, I have no hard evidence for this, but I see things on LinkedIn.
That's not the problem being discussed here, though. That's normal usage, and you can hardly blame AI companies for shitty scrapers random users create on demand, because it's merely a symptom of coding getting cheap. Or, more broadly, the flip side of the computer becoming an actual "bicycle for the mind" and empowering end-users for a change.
A lot of the internet is built on trust. Mix in this article describing yet another tragedy of the commons and you can see where this logically ends up.
Unless we have some government enforcing the standard, another trust-based contract won't do much.
> A lot of the internet is built on trust.
Yes. In this context, the problem is that you cannot trust websites to provide a standardized bulk download option. Most of them have (often pretty selfish or user-abusive) reasons not to provide any bulk download, much less to proactively conform to some bottom-up standard. As a result, unless one is only targeting one or a few very specific sites, even thinking about making the scraper support anything but the standard crawling approach costs more in developer time than the benefit it brings.
Could be added to the llms.txt proposal: https://llmstxt.org/
I'm in favor of /.well-known/[ai|llm].txt or even a JSON or (gasp!) XML.
Or even /.well-known/ai/$PLATFORM.ext which would have the instructions.
Could even be "bootstrapped" from /robots.txt
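Purely as a strawman for what such a well-known file could carry (every field name below is invented for illustration, not part of any existing spec):

```python
import json

# Strawman contents of /.well-known/ai.json -- all fields are hypothetical.
manifest = {
    "version": 1,
    "policy": "https://example.com/ai-policy",
    "bulk_data": [
        {
            "url": "https://example.com/archive/today.tar.gz",
            "torrent": "https://example.com/archive/today.torrent",
            "format": "tar+gzip",
            "updated": "2025-01-01",
            "license": "CC0-1.0",
        }
    ],
    "prefer_bulk_over_crawling": True,
}

with open("ai.json", "w") as f:
    json.dump(manifest, f, indent=2)
```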
> they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."
I'm not sure why you're personifying what is almost certainly a script that fetches documents, parses all the links in them, and then recursively fetches all of those.
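Which, stripped to its essence, is something like the following (a deliberately naive sketch; real crawlers add queues, many workers, and retries, but not necessarily any throttling):

```python
import re
from urllib.parse import urljoin
import requests

seen = set()

def crawl(url):
    """Fetch a page, pull out every link, and recurse.
    No delay, no concurrency cap, no per-site budget -- which is the problem."""
    if url in seen:
        return
    seen.add(url)
    try:
        page = requests.get(url, timeout=10).text
    except requests.RequestException:
        return
    for href in re.findall(r'href="([^"]+)"', page):
        target = urljoin(url, href)
        if target.startswith("http"):
            crawl(target)

crawl("https://example.com/")
```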
When we say "AI scraper" we're describing a crawler controlled by an AI company indiscriminately crawling the web, not a literal AI reading and reasoning about each page... I'm surprised this needs to be said.
It doesn’t need to be said.
> Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.
Depends on if they wrote their own BitTorrent client or not. It’s possible to write a client that doesn’t share, and even reports false/inflated sharing stats back to the tracker.
A decade or more ago I modified my client to inflate my share stats so I wouldn’t get kicked out of a private tracker whose high share-ratio requirements conflicted with my crappy data plan.
> The bots have a different model -- they assume that the website is adversarially "hiding" its content.
This should give us pause. If a bot considers this adversarial and is refusing to respect the site owner's wishes, that's a big part of the problem.
A bot should not consider that “adversarial”.
> refusing to respect the site owner's wishes
Should a site owner be able to discriminate between a bot visitor and a human visitor? Most do, and hence the bots treat it as a hostile environment.
Of course, bots that behave badly have created this problem themselves. That's why if you create a bot to scrape, make it not take up more resources than a typical browser based visitor.
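For what that looks like in practice: roughly one request at a time, a few seconds apart (a sketch; the delay is just a plausible ballpark and the user-agent string is a placeholder):

```python
import time
import requests

session = requests.Session()
session.headers["User-Agent"] = "ExampleBot/1.0 (admin@example.com)"

def fetch_politely(urls, delay=5):
    """One request at a time, a few seconds apart -- roughly the load a human
    clicking through pages generates, rather than hundreds of parallel hits."""
    for url in urls:
        yield session.get(url, timeout=10)
        time.sleep(delay)
```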
> That's why if you create a bot to scrape, make it not take up more resources than a typical browser based visitor.
Well, right; that's the problem.
They take up orders of magnitude more resources. They absolutely hammer the server. They don't care if your website even survives, so long as they get every single drop of data they can for training.
Source: my own personal experience with them taking down my tiny browser game (~125 unique weekly users—not something of broad general interest!) repeatedly until I locked its Wiki behind a login wall.
This is like email, where eventually 90% of it was spam and we all got spam filters.
Except that something effectively equivalent to spam filters will be utterly ineffective here.
Spam filters
- mitigate the symptom (our inboxes being impossible to trawl through for real emails)
- reduce the incentive (because any spam mail that isn't seen by a human being reduces the chances they'll profit from their spamming)
- but do not affect the resource consumption directly (because the email has already been sent through the internet)
Now, this last point barely matters with spam, because sending email requires nearly no resources.
With LLM-training scraper bots, on the other hand, the symptom is the resource consumption. By the time you see their traffic to try to filter it, it's already killing your server. The best you can hope to do is recognize their traffic after a few seconds of firehose and block the IP address.
Then they switch to another one. You block that. They switch to another one.
Residential IPs. Purchased botnet IPs. Constantly rotating IPs.
Unlike spam, there's no reliable way to block an LLM bot that you haven't seen yet, because the only thing that tells you it's a bot is their existing pattern of behavior. And the only unique identifier you can get for them is their IP address.
So how, exactly, are we supposed to filter them effectively, while also allowing legitimate users to access our sites? Especially small-time sites that don't make any money, and thus can't afford to buy CloudFlare or similar protection?
Bandwidth isn't free. And god knows the bots ain't paying.
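To make the "only identifier is the IP" point concrete: the textbook defence is a per-IP rate limit like the sketch below (numbers invented), and it works right up until the same crawler arrives from tens of thousands of residential addresses that each stay under the limit.

```python
import time
from collections import defaultdict

WINDOW = 60        # seconds (made-up numbers)
MAX_REQUESTS = 30  # per IP per window

hits = defaultdict(list)

def allow(ip):
    """Classic per-IP rate limit: fine against one abusive address,
    useless when the same crawler rotates through thousands of IPs."""
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
    if len(hits[ip]) >= MAX_REQUESTS:
        return False
    hits[ip].append(now)
    return True
```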
> They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."
Is there a mechanism to indicate this? The "a" command in the Scorpion crawling policy file is meant for this purpose, but that is not for use with WWW. (The Scorpion crawling policy file also has several other commands that would be helpful, but also are not for use with WWW.)
There is also the consideration of knowing at what interval the archives that can be downloaded in this way will be updated; for data that changes often, you will not regenerate them for every change. This consideration also applies to torrents, since a new hash will be needed for each new version of the file.
> Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.
That is an amazing thought.