We can't have nice things because of AI scrapers

blog.metabrainz.org

464 points

LorenDB

3 days ago


276 comments

dannyobrien 3 days ago

Metabrainz is a great resource -- I wrote about them a few years ago here: https://www.eff.org/deeplinks/2021/06/organizing-public-inte...

There's something important here in that a public good like Metabrainz would be fine with the AI bots picking up their content -- they're just doing it in a frustratingly inefficient way.

It's a co-ordination problem: Metabrainz assumes good intent from bots, and has to lock down when they violate that trust. The bots have a different model -- they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."

Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.

  • tux 3 days ago

    Yeah, AI scrapers are one of the reasons why I have closed my public website https://tvnfo.com and only left the donors' site online. It's not only because of AI scrapers, but I grew tired of people trying to scrape the site, eating a lot of resources this small project doesn't have. Very sad really, it was publicly online since 2016. Now it's only available for donors. Running a tiny project on just $60 a month. If this was not my hobby I would have closed it completely a long time ago :-) Who knows, if there is more support in the future I might reopen the public site again with something like Anubis bot protection. But I thought it was only small sites like mine that get hit hard; looks like many have similar issues. Soon nothing will be open or useful online. I wonder if this was the plan all along for whoever is pushing AI on a massive scale.

    • Cadwhisker 3 days ago

      I took a look at the https://tvnfo.com/ site and I have no idea what's behind the donation wall. Can I suggest you have a single page which explains or demonstrates the content, or there's no reason for "new" people to want to donate to get access.

      • tux 3 days ago

        Yeah i’ll have something up soon :-)

  • fartfeatures 3 days ago

    > They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."

    What mechanism does a site have for doing that? I don't see anything in robots.txt standard about being able to set priority but I could be missing something.

    • arjie 3 days ago

      The only real mechanism is "Disallow: /rendered/pages/*" and "Allow: /archive/today.gz" or whatever and there is no communication that the latter is the former. There is no machine-standard AFAIK that allows webmasters to communicate to bot operators in this detail. It would be pretty cool if standard CMSes had such a protocol to adhere to. Install a plugin and people could 'crawl' your Wordpress from a single dump or your Mediawiki from a single dump.
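
      A rough sketch of what that looks like from the crawler's side, using Python's standard urllib.robotparser. The robots.txt content and the "ExampleBot" agent name are made up for illustration, and the trailing "*" is dropped because this parser only does prefix matching. Note that nothing in the result says the archive replaces the rendered pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: rendered pages are off limits, the dump is allowed.
# There is no directive that links the two rules together.
ROBOTS_TXT = """\
User-agent: *
Allow: /archive/today.gz
Disallow: /rendered/pages/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the parsed robots.txt permits `agent` to fetch `path`."""
    return parser.can_fetch(agent, path)
```

      A crawler learns only "allowed" or "not allowed" per path; the hint "use the dump instead" has to live somewhere else.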

      • sbarre 3 days ago

        A sitemap.xml file could get you most of the way there.

    • jacksnipe 3 days ago

      It’s not great, but you could add it to the body of a 429 response.

      • VTimofeenko 3 days ago

        Genuinely curious: do programs read bodies of 429 responses? In the code bases that I have seen, 429 is not read beyond the code itself

        • jakelazaroff 3 days ago

          Sometimes! The server can also send a retry-after header to indicate when the client is allowed to request the resource again: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
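
          For what it's worth, honoring that header in a client is not much code. A minimal sketch, assuming the client already has the raw header string; the function name is made up. The value can be either a number of seconds or an HTTP date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After header value into a non-negative delay in seconds.

    The header carries either an integer number of seconds or an HTTP date.
    Returns None when the value cannot be parsed.
    """
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

          A client would sleep for the returned number of seconds before retrying, and fall back to its own backoff when the result is None.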

          • deathanatos 3 days ago

            … which isn't part of the body of a 429…

            • VTimofeenko 3 days ago

              Well, to be fair, I did say "is not read beyond the code itself", header is not the code, so retry-after is a perfectly valid answer. I vaguely remember reading about it, but I don't recall seeing it used in practice. MDN link shows that Chrome derivatives support that header though, which makes it pretty darn widespread

        • jacksnipe 2 days ago

          Up until very recently I would have said definitely not, but we're talking about LLM scrapers, who knows how much they've got crammed into their context windows.

        • gleenn 3 days ago

          Almost certainly not by default, certainly not in any of the http libs I have used

        • dfxm12 3 days ago

          If I find something useful there, I'll read it and code for it...

    • gloflo 2 days ago

      This is about AI, so just believe what the companies are claiming and write "Dear AI, please would you be so kind as to not hammer our site with aggressive and idiotic requests but instead use this perfectly prepared data dump download, kthxbye. PS: If you don't, my granny will cry, so please be a nice bot. PPS: This is really important to me!! PPPS: !!!!"

      I mean, that's what this technology is capable of, right? Especially when one asks it nicely and with emphasis.

    • squigz 3 days ago

      The mechanism is putting some text that points to the downloads.

      • TeMPOraL 3 days ago

        So perhaps it's time to standardize that.

        • squigz 3 days ago

          I'm not entirely sure why people think more standards are the way forward. The scrapers apparently don't listen to the already-established standards. What makes one think they would suddenly start if we add another one or two?

          • TeMPOraL 3 days ago

            There is no standard, well-known way for a website to advertise, "hey, here's a cached data dump for bulk download, please use that instead of bulk scraping". If there were, I'd expect the major AI companies and other users[0] to use that method for gathering training data[1]. They have compelling reasons to: it's cheaper for them, and cultivates goodwill instead of burning it.

            This also means that right now, it could be much easier to push through such a standard than ever before: there are big players who would actually be receptive to it, so even a few not-entirely-selfish actors agreeing on it might just do the trick.

            --

            [0] - Plenty of them exist. Scraping wasn't popularized by AI companies, it's standard practice of on-line business in competitive markets. It's the digital equivalent of sending your employees to competing stores undercover.

            [1] - Not to be confused with having an LLM scrape a specific page for some user because the user requested it. That IMO is a totally legitimate and unfairly penalized/vilified use case, because the LLM is acting for the user - i.e. it becomes a literal user agent, in the same sense that a web browser is (this is the meaning behind the name of the "User-Agent" header).

            • ethin 3 days ago

              You do realize that these AI scrapers are most likely written by people who have no idea what they're doing, right? Or they just don't care? If they did know or care, pretty much none of the problems these things have caused would exist. Even if we did standardize such a thing, I doubt they would follow it. After all, they think they and everyone else have infinite resources so they can just hammer websites forever.

              • fartfeatures 3 days ago

                I realise you are making assertions for which you have no evidence. Until a standard exists we can't just assume nobody will use it, particularly when it makes the very task they are scraping for simpler and more efficient.

                • ethin 3 days ago

                  > I realise you are making assertions for which you have no evidence.

                  We do have evidence, which is their current behavior. If they are happy ignoring robots.txt (and also ignoring copyright law), what gives you the belief that they magically won't ignore this new standard? Sure, it in theory might save them money, but if there's one thing that I think is blatantly obvious it is that money isn't what these companies care about because people just keep turning on the money generator. If they did care about it, they wouldn't be spending far more than they earn, and they wouldn't be creating circular economies to try to justify their existences. If my assertion has no evidence, I don't exactly see how yours does either, especially since we have seen that these companies will do anything if it means getting what they want.

                • soco 2 days ago

                  Simpler and efficient for who? I imagine some random guy vibe coding "hi chatgpt I want to scrape this and this website", getting something running, then going to LinkedIn to brag about AI. Yes I have no hard evidence for this, but I see things on LinkedIn.

                  • TeMPOraL 2 days ago

                    That's not the problem being discussed here, though. That's normal usage, and you can hardly blame AI companies for shitty scrapers random users create on demand, because it's merely a symptom of coding getting cheap. Or, more broadly, the flip side of the computer becoming an actual "bicycle for the mind" and empowering end-users for a change.

                • johnnyanmac 2 days ago

                  A lot of the internet is built on trust. Mix in this article describing yet another tragedy of the commons and you can see where this logically ends up.

                  Unless we have some government enforcing the standard, another trust based contract won't do much.

                  • TeMPOraL 2 days ago

                    > A lot of the internet is built on trust.

                    Yes. In this context, the problem is that you cannot trust websites to provide a standardized bulk download option. Most of them have (often pretty selfish or user-abusive) reasons not to provide any bulk download, much less proactively conform to some bottom-up standards. As a result, unless one is only targeting one or a few very specific sites, even thinking about making the scraper support anything but the standard crawling approach costs more in developer time than the benefit it brings.

        • edoceo 3 days ago

          I'm in favor of /.well-known/[ai|llm].txt or even a JSON or (gasp!) XML.

          Or even /.well-known/ai/$PLATFORM.ext which would have the instructions.

          Could even be "bootstrapped" from /robots.txt

  • hamdingers 3 days ago

    > they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."

    I'm not sure why you're personifying what is almost certainly a script that fetches documents, parses all the links in them, and then recursively fetches all of those.

    When we say "AI scraper" we're describing a crawler controlled by an AI company indiscriminately crawling the web, not a literal AI reading and reasoning about each page... I'm surprised this needs to be said.

    • ryantgtg 3 days ago

      It doesn’t need to be said.

  • yardstick 2 days ago

    > Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.

    Depends on if they wrote their own BitTorrent client or not. It’s possible to write a client that doesn’t share, and even reports false/inflated sharing stats back to the tracker.

    A decade or more ago I modified my client to inflate my share stats so I wouldn’t get kicked out of a private tracker whose high share ratios conflicted with my crappy data plan.

  • toofy 3 days ago

    > The bots have a different model -- they assume that the website is adversarially "hiding" its content.

    this should give us pause. if a bot considers this adversarial and is refusing to respect the site owner's wishes, that's a big part of the problem.

    a bot should not consider that “adversarial”

    • chii 2 days ago

      > refusing to respect the site owner's wishes

      should a site owner be able to discriminate between a bot visitor and a human visitor? Most do, and hence the bots treat it as a hostile environment.

      Of course, bots that behave badly have created this problem themselves. That's why if you create a bot to scrape, make it not take up more resources than a typical browser based visitor.
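
      As a minimal sketch of that kind of restraint: a per-host throttle that enforces a minimum gap between requests. The class name and the 2-second default are assumptions for illustration, not a recommended value; the clock and sleep functions are injectable so the logic can be checked without waiting.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, host):
        """Block until `host` may be contacted again; return seconds slept."""
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay:
            self._sleep(delay)
        # Record when this request is actually sent (after any sleep).
        self._last_request[host] = now + delay
        return delay
```

      A scraper would call wait(host) before every fetch, so a single slow site never sees more than one request per interval from it.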

      • danaris 2 days ago

        > That's why if you create a bot to scrape, make it not take up more resources than a typical browser based visitor.

        Well, right; that's the problem.

        They take up orders of magnitude more resources. They absolutely hammer the server. They don't care if your website even survives, so long as they get every single drop of data they can for training.

        Source: my own personal experience with them taking down my tiny browser game (~125 unique weekly users—not something of broad general interest!) repeatedly until I locked its Wiki behind a login wall.

        • expedition32 2 days ago

          This is like email, where eventually 90% of it was spam and we all got spam filters.

          • danaris 2 days ago

            Except that something effectively equivalent to spam filters will be utterly ineffective here.

            Spam filters

            - mitigate the symptom (our inboxes being impossible to trawl through for real emails)

            - reduce the incentive (because any spam mail that isn't seen by a human being reduces the chances they'll profit from their spamming)

            - but do not affect the resource consumption directly (because the email has already been sent through the internet)

            Now, this last point barely matters with spam, because sending email requires nearly no resources.

            With LLM-training scraper bots, on the other hand, the symptom is the resource consumption. By the time you see their traffic to try to filter it, it's already killing your server. The best you can hope to do is recognize their traffic after a few seconds of firehose and block the IP address.

            Then they switch to another one. You block that. They switch to another one.

            Residential IPs. Purchased botnet IPs. Constantly rotating IPs.

            Unlike spam, there's no reliable way to block an LLM bot that you haven't seen yet, because the only thing that tells you it's a bot is their existing pattern of behavior. And the only unique identifier you can get for them is their IP address.

            So how, exactly, are we supposed to filter them effectively, while also allowing legitimate users to access our sites? Especially small-time sites that don't make any money, and thus can't afford to buy CloudFlare or similar protection?
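
            To make the "recognize after a few seconds of firehose, then block the IP" step concrete, here is a minimal sliding-window sketch. The class name, limit and window are illustrative assumptions. It also shows the weakness described above: every fresh IP starts with a clean history.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flag an IP once it exceeds `limit` requests within `window` seconds."""

    def __init__(self, limit=50, window=5.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()

    def observe(self, ip, timestamp):
        """Record one request; return True if the IP is blocked."""
        if ip in self.blocked:
            return True
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window:
            hits.popleft()
        if len(hits) > self.limit:
            self.blocked.add(ip)
            del self._hits[ip]
            return True
        return False
```

            The detector only reacts after the burst has already reached the server, and a scraper that rotates addresses is seen as a new, unblocked client each time.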

      • expedition32 2 days ago

        Bandwidth isn't free. And god knows the bots ain't paying.

  • zzo38computer 3 days ago

    > They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."

    Is there a mechanism to indicate this? The "a" command in the Scorpion crawling policy file is meant for this purpose, but that is not for use with WWW. (The Scorpion crawling policy file also has several other commands that would be helpful, but also are not for use with WWW.)

    There is also the consideration to know what interval they will be archived that can be downloaded in this way; for data that changes often, you will not do it every time. This consideration is also applicable for torrents, since a new hash will be needed for a new version of the file.

  • m463 3 days ago

    > Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.

    that is an amazing thought.

akuyou 3 days ago

AI is destroying the free internet along with everything else

My web host suspended my website account last week due to a sudden large volume of requests to it - effectively punishing me for being scraped by bots.

I've had to move to a new host to get back up, but what hope does the little guy have? it's like GPU and ram prices, it doesn't matter if I pay 10x 100x or 1000x more than I did, the AI companies have infinite resources, and they don't care what damage they do in the rush to become the no 1 in the industry

The cynic in me would say it's intentional, destroy all the free sites so you have to get your info from their ai models, price home users out of high end hardware so they have to lease the functions from big companies

  • jijijijij 2 days ago

    My prediction: AI is the deathblow to IPv6 adoption for the wider web, since blocklists only really work with IPv4. Increasing VPN usage making user tracking and heuristics difficult, AI scrapers stealing appropriated human content and AI spam poisoning its exploitation, not to mention tech monopolization and centralization, the limitations of IPv4 are suddenly becoming an asset and incentives for IPv6 support are zero.

    On the plus side, we probably got all of IPv6 to build an alternative, non-commercial, better web or whatever network, if we act quickly before routing support is vanishing. IPv6-only housing is cheap, things don’t need to work out. There could be the IPv4-enforced corpo world within walls, and IPv6-enabled wild wide wonderlands.

    • touisteur 2 days ago

      There are ways to build blocklists for IPv6. I once saw (and used) Bloom filters for this. Inspired by some papers from the 2000s, this one in 2009 https://www.nokia.com/bell-labs/publications-and-media/publi...
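
      A minimal sketch of the idea, not the scheme from that paper: a plain Bloom filter keyed on the /64 prefix of an address. The sizes, the hash count and the choice of /64 are assumptions for illustration.

```python
import hashlib
import ipaddress


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self._bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive k bit positions from two halves of one SHA-256 digest
        # (double hashing), instead of computing k separate hashes.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, item: bytes):
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(
            self._bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def prefix_key(address: str, prefix_len: int = 64) -> bytes:
    """Reduce an IPv6 address to the packed bytes of its enclosing prefix."""
    network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
    return network.network_address.packed
```

      Membership checks stay constant-time and the memory use is fixed no matter how many prefixes are added, at the cost of occasional false positives.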

      • jijijijij 2 days ago

        The point isn't the technical inability to block particular IPv6 addresses efficiently, but anticipating abuse potential by IP. You can change IPv6 addresses freely compared to IPv4. With IPv4 it's easy to determine, if you are dealing with a residential IP or VPN. No heuristics or analysis needed. IPv4 addresses are blocked preemptively, that's not really a thing for IPv6. Eg. VPN providers wouldn't have static endpoint addresses with IPv6. So you may be able to limit spontaneous abuse such as DDoS attacks, but it's a lot harder to filter technically legitimate traffic, which is merely unwanted for your data aggregation.

        • lmz 2 days ago

          Is there anything against just blocking at the /48 level?
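
          Mechanically it is simple enough. A minimal sketch with Python's ipaddress module, assuming a /48 for IPv6 and a /24 for IPv4; the function names and the in-memory set are made up for illustration.

```python
import ipaddress

BLOCKED_NETWORKS = set()  # blocked networks, e.g. IPv6 /48s or IPv4 /24s


def enclosing_network(address, v6_prefix=48, v4_prefix=24):
    """Return the network of the configured size that contains `address`."""
    ip = ipaddress.ip_address(address)
    prefix = v6_prefix if ip.version == 6 else v4_prefix
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


def block(address):
    """Block the whole network that `address` belongs to."""
    BLOCKED_NETWORKS.add(enclosing_network(address))


def is_blocked(address):
    """Check whether the address falls inside any blocked network."""
    return enclosing_network(address) in BLOCKED_NETWORKS
```

          Blocking one offending address then covers every other address in the same /48, which is the IPv6 analogue of blocking an IPv4 /24.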

          • jijijijij 2 days ago

            No, but subnets can't be as easily associated with unwanted traffic. If IPv6 gets blocked you just get another IP. A VPN or hosting provider can't simply rent, or god forbid buy IPv4 addresses and subnets, arbitrarily. The IPs they use are rather static and easy to discover. Rather trivial to block all of them, preemptively. Residential IPv4 VPNs are not legal offerings and their use is limited. VPNs can fight traffic analysis, they can't fight preemptive IPv4 blocking.

            See, it doesn't matter if it's somehow possible to control IPv6 traffic, factually, it is sooo much easier to control and observe IPv4. IPv6 adoption isn't going great at all and now there are new strong business incentives against it.

            The direction we're moving right now isn't free intergalactic mesh networking, but holistic control and centralization by the tech oligarchy. IPv6 is good things... we can't have those.

            • lxgr 2 days ago

              > VPNs can fight traffic analysis, they can't fight preemptive IPv4 blocking.

              How do you think VPNs are getting past VOD providers’ VPN block lists?

              > Residential IPv4 VPNs are not legal offerings and their use is limited.

              What’s illegal about them? And does it matter to uncooperative/aggressive bots?

              • jijijijij 2 days ago

                > How do you think VPNs are getting past VOD providers’ VPN block lists?

                In my experience, they most often don't. If you got more insights, please enlighten me. I presume VPNs which get past VPN block lists, are just not yet on the radar, or don't provide the privacy claimed, not actually fully in control of their infrastructure.

                > What’s illegal about them?

                Where do you think residential IPs are coming from? It's often botnets or otherwise compromised devices, or people tricked into sharing their connection. In any case, it's most certainly breaking the ISP's TOS. Because of the effort behind providing residential IPs, these VPN services are rather expensive. And certainly not trustworthy in regard to privacy. If offering residential IPs were legal, every VPN service would provide them.

                > And does it matter to uncooperative/aggressive bots?

                No. They are used for mostly shady/criminal activity, where the limitations and legality don't matter. I doubt commercial LLM crawlers and data intense campaigns aren't bothered by legality, stability, connectivity or (upload) bandwidth limitations. Like, you wouldn't crawl the web on a mobile connection.

    • lxgr 2 days ago

      > blocklists only really work with IPv4

      Do they? Why would it be any harder to block e.g. a /56 than a /24?

  • nicbou 2 days ago

    Then they use the data to deny you traffic. AI summaries are wrecking the independent web. Losing half or more of your traffic was pretty common in 2025. It’s killing the economics of sharing hard-earned information.

    So we are spending more resources reaching a lot less people, because a few big companies are capturing the value for their shareholders.

    And that’s while they’re still haemorrhaging money! Once they fully establish their monopoly and kill the open web, the enshittification will begin.

chlorion 3 days ago

I self host a small static website and a cgit instance on an e2-micro VPS from Google Cloud, and I have got around 8.5 million requests combined from openai and claude over around 160 days. They just infinitely crawl the cgit pages forever unless I block them!

    (1) root@gentoo-server ~ # egrep 'openai|claude' -c /var/log/lighttpd/access.log
    8537094

So I have lighttpd set up to match "claude|openai" in the user agent string and return a 403 if it matches, and an nftables firewall set up to rate limit spammers, and this seems to help a lot.
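
The same two steps (counting matching log lines, and refusing matching user agents) can be sketched in Python. This is not the lighttpd configuration itself, just the logic; the function names are made up, and unlike the egrep above the match here is case-insensitive.

```python
import re

# Same pattern as the egrep / lighttpd match, but case-insensitive.
BLOCKED_AGENTS = re.compile(r"claude|openai", re.IGNORECASE)


def status_for(user_agent):
    """Return 403 for user agents matching the blocklist, else 200."""
    return 403 if BLOCKED_AGENTS.search(user_agent or "") else 200


def count_matching(lines):
    """Count log lines that mention a blocked agent, like `egrep -c`."""
    return sum(1 for line in lines if BLOCKED_AGENTS.search(line))
```

This only catches crawlers that identify themselves honestly in the user agent string.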
  • dang 3 days ago

    And those are the good actors! We're under a crawlocalypse from botnets, er, residential proxies.

    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36", anyone?

    • zerocrates 2 days ago

      Yeah the flood of these Chrome UAs with every version number under the sun, and a really large portion being *.0.0.0 version numbers, that's what I've tended to experience lately. Also just kind of every browser user agent ever:

      Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; .NET CLR 3.5.21022)

      There were waves of big and sometimes intrusive traffic admitting to being from Amazon, Anthropic, Google, Meta, etc., but those are easy to block or throttle and aren't that big a deal in the scheme of things.

  • zahlman 3 days ago

    The third-party hit-counting service I use implies that I'm not getting any of this bot scraping on my GitHub blog.

    Is Microsoft doing something to prevent it? Or am I so uncool that even bots don't want to read my content :(

    • lelanthran 2 days ago

      I'm interested in that service and how it works. Link?

      • zahlman 2 days ago

        It is https://github.com/silentsoft/hits . It works by loading an SVG "shield" file (like the ones you see at the top of GitHub readmes all the time) from their server from a unique URL (you just choose one when you write/render your HTML). The server, implemented in Java, just counts hits to each URL in a database and sends back the corresponding SVG data. There's also a mini dashboard website where you can check basic stats for a given URL (no login required, everyone's hits-per-day stats are just public) and preview styling options for the SVG. For example, for my most recent blog post https://zahlman.github.io/posts/2025/12/31/oxidation/, I configured it such that you can view the stats via https://hits.sh/zahlman.github.io+oxidation/ (note that the trailing slash is required).
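
        The mechanism is roughly the following. This is a toy Python sketch of the described behaviour, not the project's actual Java code; the in-memory counter stands in for its database and the SVG markup is invented.

```python
from collections import Counter

HITS = Counter()  # stand-in for the service's database

SVG_TEMPLATE = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="90" height="20">'
    '<text x="5" y="14" font-size="11">hits: {count}</text></svg>'
)


def serve_badge(key):
    """Count one hit for `key` and return the SVG badge showing the total."""
    HITS[key] += 1
    return SVG_TEMPLATE.format(count=HITS[key])
```

        In this sketch the count only goes up when a client actually requests the badge URL.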

        (The about section on GitHub bills the project as "privacy-friendly", which I would say is nonsense as these dashboards are public and their URLs are trivially computed. But it's also hard to imagine caring.)

        • aembleton 2 days ago

          They're probably not downloading every svg each time they scrape the site. Probably focused on scraping the text.

          • zahlman 2 days ago

            What? No, I mean the HTML for the SVG contains a custom URL for an API request. There's no scraping involved on either end.

Jeremy1026 3 days ago

I sysadmin my kids' PTA website. OpenAI was scraping it recently. I saw it looking at the event calendar, request after request to random days. I saw years 1000 through 3000 scroll by. I changed the response to their user agent to an access denied, but it still took about 4 hours for them to stop.

  • tonyedgecombe 2 days ago

    I wonder who wrote a crawler like that. Oh wait ...

    • jwe a day ago

      I can't follow your thought there. Who wrote it?

SchemaLoad 3 days ago

Cloudflare has a service for this now that will detect AI scrapers and send them to a tarpit of infinite AI generated nonsense pages.

  • bitbasher 3 days ago

    Wow, so to prevent AI scrapers from harvesting my data I need to send all of my traffic through a third party company that gets to decide who gets to view my content. Great idea!

    • Aurornis 3 days ago

      You don’t need to do anything. You can use any number of solutions or roll your own.

      Someone shared an alternative. Must everything in AI threads be so negative and condescending?

      • dannersy 2 days ago

        Yes, they could roll their own, but you have no issues with this being necessary? I think the attitude of "just deal with it" is far more negative than someone expressing they are upset with the state of the internet, its controllers, and its abusers.

        • bs7280 2 days ago

          This is like saying "let's just get rid of all the guns" to solve gun violence and gun crime in the USA. The cat is out of the bag and no one can put it back. We live in a different world now and we have to figure it out.

        • TitaRusell 2 days ago

          There's trillions invested in AI. Don't expect any introspective insight or criticism about it.

      • thefz 2 days ago

        > Must everything in AI threads be so negative and condescending?

        Because if I own a website or a service and it is being degraded or slowed by some third party tool that wants to slurp its content for its own profit and doesn't even share, I tend to be irritated. And AI apologists/evangelists don't help when they try to justify the behavior.

    • rester324 3 days ago

      You can implement this yourself, who is stopping you?

      • zzzeek 3 days ago

        Citation needed

        • zimpenfish 3 days ago

          I use iocaine[0] to generate a tarpit. Yesterday it served ~278k "pages" consisting of ~500MB of gibberish (and that's despite banning most AI scrapers in robots.txt.)

          [0] https://iocaine.madhouse-project.org
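
          The general tarpit idea can be sketched in a few lines. This is not how iocaine is implemented, just an illustration: every path deterministically yields a page of filler text plus links to more such pages. The word list and function name are made up.

```python
import hashlib
import random

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing", "elit"]


def tarpit_page(path, links=5, words=80):
    """Return a deterministic page of gibberish with links to more gibberish."""
    # Seed from the path so the same URL always produces the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    text = " ".join(rng.choice(WORDS) for _ in range(words))
    base = path.rstrip("/")
    hrefs = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(links)]
    anchors = "".join(f'<a href="{href}">{href}</a> ' for href in hrefs)
    return f"<html><body><p>{text}</p>{anchors}</body></html>"
```

          Because pages are derived from the path, nothing needs to be stored, and a crawler that follows every link never runs out of pages.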

          • chao- 3 days ago

            Can't seem to access this.

            It flashes some text briefly then gives me a 418 TEAPOT response. I wonder if it's because I'm on Linux?

            EDIT: Begrudgingly checked Chrome, and it loads. I guess it doesn't like Firefox?

            • zephen 3 days ago

              Doesn't work on my firefox either.

              Friendly fire, I suppose.

              • godelski 3 days ago

                Works on my Firefox. Mac and Linux

            • dpkirchner 3 days ago

              Nor Safari on iOS.

              • zimpenfish 2 days ago

                Works fine on my iOS Safari - maybe there's some extension that's tickling it just the wrong way?

                • dpkirchner 2 days ago

                  It still fails with all of my extensions disabled (wipr, privacy redirect). I just get a download dialog. I don't know what the HTTP status code is, however.

                  I found a flagged HN submission about it and it has just about the same result for me and for others. My first tap failed in a weird way (showed some text then redirected quickly to its git repo) and all subsequent taps trigger a download.

                  https://news.ycombinator.com/item?id=44538010

          • doublerabbit 3 days ago

            Unfortunately, you kind of have to count this as the cost of the Internet. You've wasted 500Mb of bandwidth.

            I've had colocation for eight years+. My monthly b/w cost is now around 20-30Gb a month given to scrapers where I was only using 1-2Gb a month, years prior.

            I pay for premium bandwidth (it's a thing) and only get 2TB of usable data. Do I go offline or let it continue?

            • zimpenfish 2 days ago

              > You've wasted 500Mb of bandwidth.

              Yep, it sucks, but on the positive side, I'm feeding 500Mb of garbage into them every day and that feels like enough of a small win for me.

              > My monthly b/w cost is now around 20-30Gb a month given to scrapers [...] 1-2Gb a month

              That definitely sucks.

              > Do I go offline or let it continue?

              Might be time to start blocking entire IP ranges and ASNs and see if that helps.

          • zzzeek 2 days ago

            i have no idea what this does because the site is rejecting my ordinary firefox browser with "Error code: 418 I'm a teapot". Even from a private browser.

            If I hit it with Chrome, now I can see a site.

            Seems pretty not ready for prime time as a lot of my viewers use Firefox

        • godelski 3 days ago

          One of the most popular ones is Anubis. It uses a proof of work and can even do poisoning: https://anubis.techaro.lol/

          They even mention iocaine. I know, inconceivable!: https://iocaine.madhouse-project.org/

          There's also tons of HN posts on the topic with varying solutions:

          https://news.ycombinator.com/item?id=45935729

          https://news.ycombinator.com/item?id=45711094

          https://news.ycombinator.com/item?id=44142761

          https://news.ycombinator.com/item?id=44378127

          • zzzeek 3 days ago

            Anubis is the only tool that claims to have heuristics to identify a bot, but my understanding is that it does this by presenting obnoxious challenges to all users. Not really feasible. Old school approaches like ip blocking or even ASN blocking are obsolete - these crawlers purposely spam from thousands of IPs, and if you block them on a common ASN, they come back a few days later from thousands of unique ASNs. So this is not really a "roll your own" situation, especially if you are running off the shelf software that doesn't have some straightforward means of building in these various approaches of endless page mazes (which I would still have to serve anyway).

  • timpera 3 days ago

    Unfortunately, Cloudflare often destroys the experience for users with shared connections, VPNs, exotic browsers… I had to remove it from my site after too many complaints.

    • rudedogg 3 days ago

      Also iCloud Private Relay.

      CloudFlare is making it impossible to browse privately

      • acdha 3 days ago

        Cloudflare works fine with Private Relay - they and Fastly provide infrastructure for that service (one half of the blinded pair) so it’s definitely something they test.

  • loopback_device 3 days ago

    Not sure "TLS added and removed here :)" as a Service is the right tool in the drawer for this.

  • m463 3 days ago

    cloudflare also blocks my human-is-driving browser all the time

    "enable javascript and cookies to continue"

    also unsupported browser

  • taberiand 3 days ago

    Savvy move by cloudflare, once they have enough sites behind their service they can charge the AI companies to access their cached copies on a back channel

  • ranger_danger 3 days ago

    Modern scrapers are using headless chromium which will not see the invisible links, so I'm not sure how long this will be effective.

  • inferiorhuman 3 days ago

    Which is still a far worse experience than if Cloudflare's services weren't needed.

  • RobotToaster 3 days ago

    Except for the scrapers that pay cloudflare to exempt them.

  • themafia 3 days ago

    The solution, as always, is noise.

tensegrist 3 days ago

the more time passes the more i'm convinced that the solution is to—somehow—force everyone to have to go through something like common crawl

i don't want people's servers to be pegged at 100% because a stupid dfs scraper is exhaustively traversing their search facets, but i also want the web to remain scrapable by ordinary people, or rather go back to how readily scrapable it used to be before the invention of cloudflare

as a middle ground, perhaps we could agree on a new /.well-known/ path meant to contain links to timestamped data dumps?
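
A minimal sketch of how a crawler might consume such a path, assuming a hypothetical manifest served at /.well-known/data-dumps.json. Neither that path nor the field names ("dumps", "url", "created") are an existing standard; they are made up here purely for illustration.

```python
import json
from datetime import datetime

# Hypothetical manifest a site might serve at /.well-known/data-dumps.json.
# The path and the field names are assumptions, not an existing standard.
EXAMPLE_MANIFEST = """
{
  "dumps": [
    {"url": "https://example.org/dumps/2025-01-01.tar.gz", "created": "2025-01-01T00:00:00+00:00"},
    {"url": "https://example.org/dumps/2025-02-01.tar.gz", "created": "2025-02-01T00:00:00+00:00"}
  ]
}
"""


def dumps_newer_than(manifest_text, last_seen=None):
    """Return dump URLs created after last_seen, oldest first.

    last_seen is an ISO 8601 timestamp string, or None to get every dump.
    """
    entries = json.loads(manifest_text)["dumps"]
    cutoff = datetime.fromisoformat(last_seen) if last_seen else None
    dated = [(datetime.fromisoformat(e["created"]), e["url"]) for e in entries]
    return [url for created, url in sorted(dated) if cutoff is None or created > cutoff]
```

A crawler would fetch the manifest once, download only what `dumps_newer_than` returns, and skip the page-by-page traversal entirely.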

  • nostrademons 3 days ago

    That's sorta what MetaBrainz did - they offer their whole DB as a single tarball dump, much like what Wikipedia does. I downloaded it in about an hour; if I need a MusicBrainz lookup, I just do a local query.

    For this strategy to work, people need to actually use the DB dumps instead of just defaulting to scraping. Unfortunately scraping is trivially easy, particularly now that AI code assistants can write a working scraper in ~5-10 minutes.

    • tonyhart7 3 days ago

      I mean, these AI data scrapers need to scan and fetch billions of websites.

      Why would they even care about one single website? You expect an institution to care about one site out of the billions of websites it must scrape daily?

      • what 3 days ago

        This is probably the reason. It’s more effort to special case every site that offers dumps than to just unleash your generic scraper on it.

    • 8note 3 days ago

      the obvious thing would be to take down their website and only have the DB dump.

      if that's the useful thing, it doesn't need the wrapper

      • mr_monkey 6 hours ago

        Ah, yes, if the mafia comes to threaten your business for a protection racket, just abandon your shop, simple!

  • themafia 3 days ago

    It's not a technical problem you are facing.

    It's a monetary one, specifically, large pools of sequestered wealth making extremely bad long term investments all in a single dubious technical area.

    Any new phenomenon driven by this process will have the same deleterious results on the rest of computing. There is a market value in ruining your website that's too high for the fruit grabbers to ignore.

    In time adaptations will arise. The apparently desired technical future is not inevitable.

  • tpmoney 3 days ago

    I'll propose my pie in the sky plan here again. We should overhaul the copyright system completely in light of AI and make it mostly win-win for everyone. This is predicated on the idea that the MNIST digit set is sort of the "hello world" dataset for people wanting to learn machine vision and having that common data set is really handy. Numbers made up off the top of my head/subject to tuning but the basic idea is this:

    1) Cut copyright to 15-20 years by default. You can have 1 extension of an additional 10-15 years if you submit your work to the "National Data Set" within say 2-3 years of the initial publication.

    2) Content in the National set is well categorized and cleaned up. It's the cleanest data set anyone could want. The data set is used both to train some public models and also licensed out to people wanting to train their own models. Both the public models and the data sets are licensed for nominal fees.

    3) People who use the public models or data sets as part of their AI system are granted immunity from copyright violation claims for content generated by these models, modulo some exceptions for knowing and intentional violations (e.g. generating the contents of a book into an epub). People who choose to scrape their own data are subject to the current state of the law with regards to both scraping and use (so you probably better be buying a lot of books).

    4) The license fees generated from licensing the data and the models would be split into royalty payments to people whose works are in the dataset, and are still under copyright protection, proportional to the amount of data submitted and inversely proportional to the age of that data. There would be some absolute caps in place to prevent slamming the national data sets with junk data just to pump the numbers.

    Everyone gets something out of this. AI folks get clean data that they didn't have to burn a lot of resources scraping. Copyright holders get paid for their works used by AI and retain most of the protections they have today (just for a shorter time). The public gets usable AI tooling without everyone spending their own resources on building their own data sets, and site owners and the like get reduced bot/scraping traffic. It's not perfect, and I'm sure the devil is in the details, but that's the nature of this sort of thing.

    • mschuster91 3 days ago

      > Cut copyright to 15-20 years by default.

      This alone will kill off all chances of that ever passing.

      Like, I fully agree with your proposal... but I don't think it's feasible. There are a lot of media IPs/franchises that are very, very old but still generate insane amounts of money to this day with active developments. Star Wars and Star Trek obviously, but also stuff like the MCU (Iron Man 1 was released in 2008) or Avatar, which are well on their way to two decades of runtime, or Harry Potter, which is almost 30 years old. That's tens of billions of dollars in cumulative income, and most of that is owned by Disney.

      Look what it took to finally get even the earliest Disney movies to enter the public domain, and that was stuff from before World War 2 that was so bitterly fought over.

      In order to reform copyright... we first have to use anti-trust to break up the large media conglomerates. And it's not just Disney either. Warner, Sony, Comcast and Paramount also hold ridiculous amounts of IP, Amazon entered the fray as well with acquiring MGM (mostly famous for James Bond), and Lionsgate holds the rights for a bunch of smaller but still well-known IPs (Twilight, Hunger Games).

      And that's just the movie stuff. Music is just as bad, although at least there thanks to radio stations being a thing, there are licensing agreements and established traditions for remixes, covers, tribute bands and other forms of IP re-use by third parties.

  • Imustaskforhelp 3 days ago

    If someone wants to scrape, and I don't mean whole-internet scale like Google does but something niche (say, a forum you wish to scrape):

    I like to create Tampermonkey scripts for this. In my opinion they are mostly a more lightweight, easier way to build extensions.

    Now, I don't like AI, but I don't know anything about scraping, so I used AI to generate the scraping code, pasted it into Tampermonkey, and let it run.

    As an example, I recently used this to scrape a website that had a list of VPS servers and their prices, and built myself a list from it to analyze.

    I should also say that I usually look for a database first. On a similar website I contacted the owners about their DB but got no response; their DB of server prices was private and only showed the lowest price.

    So I picked the other website and did this. I also scraped every headline ever posted on LowEndTalk, with links, partly for archival and partly to pass the headlines to an LLM to find a list of VPS providers as well.

  • nikanj 3 days ago

    And then YC funds a startup who plans to leapfrog the competition by doing their own scrape instead of using the standard data everyone else has

  • crazygringo 3 days ago

    Seriously, I can't help but think this has to be part of the answer.

    Just something like /llms.txt which contains a list of .txt or .txt.gz files or something?

    Because the problem is that every site is going to have its own data dump format, often in complex XML or SQL or something.

    LLMs don't need any of that metadata, and many sites might not want to provide it because e.g. Yelp doesn't want competitors scraping its list of restaurants.

    But if it's intentionally limited to only paragraph-style text, and stripped entirely of URLs, IDs, addresses, phone numbers, etc. -- so e.g. a Yelp page would literally just be the cuisine category and reviews of each restaurant, no name, no city, no identifier or anything -- then it gives LLMs what they need much faster, the site doesn't need to be hammered, and it's not in a format for competitors to easily copy your content.

    At most, maybe add markup for <item></item> to represent pages, products, restaurants, whatever the "main noun" is, and recursive <subitem></subitem> to represent e.g. reviews on a restaurant, comments on a review, comments one level deeper on a comment, etc. Maybe a couple more like <title> and <author>, but otherwise just pure text. As simple as possible.

    The biggest problem is that a lot of sites will create a "dummy" llms.txt without most of the content because they don't care, so the scrapers will scrape anyways...

  • fartfeatures 3 days ago

    Good idea. Perhaps also a standard that means we only have to grab deltas, or some sort of ETag-based "give me all the database dumps after the one I have" (or only if something has changed).

randyl 3 days ago

The SQLite team faced a similar problem last year, and Richard Hipp (the creator of SQLite) made almost the same comment:

"The malefactor behind this attack could just clone the whole SQLite source repository and search all the content on his own machine, at his leisure. But no: Being evil, the culprit feels compelled to ruin it for everyone else. This is why you don't get to keep nice things...."

https://sqlite.org/forum/forumpost/7d3eb059f81ff694

  • squigz 3 days ago

    [flagged]

    • PrairieFire 3 days ago

      I am with you in that this rhetoric is getting exhausting.

      In this particular case though I don't think "evil" is a moral claim, more shorthand for cost externalizing behavior. Hammering expensive dynamic endpoints with millions of unique requests isn’t neutral automation, it's degrading a shared public resource. Call it evil, antisocial, or extractive, the outcome is the same.

      • consp 3 days ago

        > shorthand for cost externalizing behavior

        I consider that evil, having no regard for the wellbeing of others for your own greed.

    • LastTrain 3 days ago

      OK. How about shitty and selfish then?

    • sensanaty 3 days ago

      What other word would you use? I find "evil" quite an apt description.

    • themafia 3 days ago

      You can be ignorantly evil.

    • Forgeties79 3 days ago

      When they routinely do things like take down public libraries yes I consider it evil too.

    • Afforess 3 days ago

      Sounds like you have zero empathy for the real costs AI is driving and feelings that this creates for website owners. How about you pony up and pay for your scraping?

      • immibis 2 days ago

        There's zero evidence any of this is related to AI btw

  • lysace 3 days ago

    [flagged]

    • juliangmp 3 days ago

      "Why don't you just clone the repo?" Yes. Why don't you?

      If you're gonna grab a repo to make a code theft machine then at least don't DDoS the servers while you're at it.

      • lysace 3 days ago

        [flagged]

        • ninkendo 3 days ago

          Why don’t you take a moment to explain to the class why you think web crawling means you can’t cache anything?

          It seems to me that the very first thing I’d try to solve if I were writing a tool for an LLM to search the web, would be caching.

          An LLM should have to go through a proxy to fetch any URL. That proxy should be caching results. The cache should be stored on the LLM’s company’s servers. It should not be independently hitting the same endpoint repeatedly any time it wants to fetch the same URL for its users.

          Is it expensive to cache everything the LLM fetches? You betcha. Can they afford to spend some of the billions they have for capex to buy some fucking hard drives? Absolutely. If archive.org can do it via funding from donations, a trillion dollar AI company should have no problem.

    • nitwit005 3 days ago

      "The malefactor behind this attack" isn't a complaint about the web crawler.

    • Forgeties79 3 days ago

      There are people behind the web crawler. If they’re so well funded they can exert a little effort to not so badly inconvenience people as they steal their training data.

      • lysace 3 days ago

        [flagged]

        • mikestew 3 days ago

          It may come as shock to you ("Video/audio producer")

          It’s one thing to ignore parent’s point entirely, but no reason to be an ass about it.

        • NegativeK 3 days ago

          I've downvoted you for being incredibly aggressive in your responses. I'm not sure why you're ad homineming the parent commenter, but it's not helping the discussion.

          • Forgeties79 3 days ago

            I don’t even really get what they are saying. I am also saying that they are hostile, and with all of their money they can afford to not be hostile. So I feel like we agree?

falloutx 3 days ago

It's not just AI scrapers doing it by themselves; users are now also being trained to put the link into Claude or ChatGPT and ask it for a summary. And of course that shows up on the website's end as a scraper.

In fact firefox now allows you to preview the link and get key points without ever going to the link[1]

[1] https://imgur.com/a/3E17Dts

  • acatton 3 days ago

    > In fact firefox now allows you to preview the link and get key points without ever going to the link[1]

    > [1] https://imgur.com/a/3E17Dts

    This is generated on device with llama.cpp compiled to webassembly (aka wllama) and running SmolLM2-360M. [1] How is this different from the user clicking on the link? In the end, your local firefox will fetch the link in order to summarize it, the same way you would have followed the link and read through the document in reader mode.

    [1] https://blog.mozilla.org/en/mozilla/ai/ai-tech/ai-link-previ...

    • ericd 3 days ago

      That’s awesome :-)

      Like, can we all take a step back and marvel that freaking wasm can do things that 10 years ago were firmly in the realm of sci-fi?

      I hope they’ll extend that sort of thing to help filter out the parts of the dom that represent attention grabbing stuff that isn’t quite an ad, but is still off topic/not useful for what I’m working on at the moment (and still keep the relevant links).

    • falloutx 3 days ago

      I actually didn't know it was using a local model and that it fetches it locally.

      • DrewADesign 3 days ago

        They should advertise that. I pretty much reflexively avoid any mention of AI in interfaces because they usually mean "we're sending this all to openthropigoogosoft so I hope you don't have any secrets."

        • godelski 3 days ago

          > They should advertise that

          They did:

          > Previews can optionally include AI-generated key points, which are processed on your device to protect your privacy.

          https://www.firefox.com/en-US/firefox/142.0/releasenotes/

          I'll also add that if you go to the Labs page (in settings) you can enable another local model to semantically search your history

          • DrewADesign 3 days ago

            Ok, they should advertise it more.

            • squigz 2 days ago

              It literally tells you this in the modal the first time you try this feature.

              How else do you want them to advertise it?

  • orbital-decay 3 days ago

    It's three issues:

    - AI shops scraping the web to update their datasets without respecting netiquette (or sometimes being unable to automate it for every site due to the scale, ironically).

    - People extensively using agents (search, summarizers, autonomous agents etc), which are indistinguishable from scraper bots from a website's perspective.

    - Agents being both faster and less efficient (more requests per action) than humans.

  • TeMPOraL 3 days ago

    Users are not being trained. Despite the seemingly dominant HN belief to the contrary, people use LLMs for interacting with information (on the web or otherwise) because they work. SOTA LLM services are just that good.

arjie 3 days ago

Someone convinced me last time[0] that these aren't the well-known scrapers we know but other actors. We wouldn't be able to tell, certainly. I'd like to help the scrapers be better about reading my site, but I get why they aren't.

I wish there were an established protocol for this. Say a $site/.well-known/machine-readable.json that instructs you on a handful of established software or allows pointing to an appropriate dump. I would gladly provide that for LLMs.

Of course this doesn't solve for the use-case where the AI companies are trying to train their models on how to navigate real world sites so I understand it doesn't solve all problems, but one of the things I think I'd like in the future is to have my own personal archive of the web as I know it (Internet Archive is too slow to browse and has very tight rate-limits) and I was surprised by how little protocol support there is for robots.

robots.txt is pretty sparse. You can disallow bots and this and that, but what I want to say is "you can get all this data from this git repo" or "here's a dump instead with how to recreate it". Essentially, cooperating with robots is currently under-specified. I understand why: almost all bots have no incentive to cooperate so webmasters do not attempt to. But it would be cool to be able to inform the robots appropriately.
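
To make that concrete, here is a rough sketch of what a cooperative crawler could do with such a file. Everything about the format is an assumption: no such standard exists, and the "sources", "prefix", "kind" and "location" fields are invented for the example.

```python
import json

# Hypothetical contents of $site/.well-known/machine-readable.json.
# No such standard exists; every field name here is invented for this sketch.
EXAMPLE_HINTS = """
{
  "sources": [
    {"prefix": "/repo/", "kind": "git", "location": "https://example.org/project.git"},
    {"prefix": "/wiki/", "kind": "dump", "location": "https://example.org/wiki-dump.xml.gz"}
  ]
}
"""


def plan_fetch(path, hints_text):
    """Decide how a cooperative crawler should obtain the data behind `path`.

    Returns (kind, location), falling back to ("crawl", path) when no hint matches.
    """
    for source in json.loads(hints_text)["sources"]:
        if path.startswith(source["prefix"]):
            return source["kind"], source["location"]
    return "crawl", path
```

With hints like these, a generic crawler that lands on /repo/src/main.c would learn it can clone the repository once instead of walking the HTTP view, without anyone having to write a site-specific crawler.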

To archive Metabrainz there is no way but to browse the pages slowly page-by-page. There's no machine-communicable way that suggests an alternative.

0: https://news.ycombinator.com/item?id=46352723

  • saaaaaam 3 days ago

    As referenced in the article, there absolutely is an alternative.

    https://metabrainz.org/datasets

    Linked to from the homepage as “datasets”.

    I may be too broadly interpreting what you mean by “machine-communicable” in the context of AI scraping though.

    • arjie 3 days ago

      Well, imagine the best case and that you're a cooperative bot writer who does not intend to harm website owners. Okay, so you follow robots.txt and all that. That's straightforward.

      But it's not like you're writing a "metabrainz crawler" and a "metafilter crawler" and a "wiki.roshangeorge.dev crawler". You're presumably trying to write a general Internet crawler. You encounter a site that is clearly an HTTP view into some git repo (say). How do you know to just `git clone` the repo in order to have the data archived, as opposed to just browsing the HTTP view?

      As you can see, I've got a lot of crawlers on my blog as well, but it's a mediawiki instance. I'd gladly host a Mediawiki dump for them to take, but then they'd have to know this was a Mediawiki-based site. How do I tell them that? The humans running the program don't know my site exists. Their bot just browses the universe and finds links and does things.

      In the Metabrainz case, it's not like the crawler writer knows Metabrainz even exists. It's probably just linked somewhere in the web the crawler is exploring. There's no "if Metabrainz, do this" anywhere in there.

      The robots.txt is a bit of a blunt-force instrument, and friendly bot writers should follow it. But assuming they do, there's no way for them to know that "inefficient path A to data" is the same as "efficient path B to data" if both are visible to their bot unless they write a YourSite-specific crawler.

      What I want is to have a way to say "the canonical URL for the data on A is at URL B; you can save us both trouble by just fetching B". In practice, none of this is a problem for me. I cache requests at Cloudflare, and I have Mediawiki caching generated pages, so I can easily weather the bot traffic. But I want to enable good bot writers to save their own resources. It's not reasonable for me to expect them to write a me-crawler, but if there is a format to specify the rules I'm happy to be compliant.

      • saaaaaam 2 days ago

        Right, yes, I see your point. I was thinking more from the point of view of "using AI to explore and then write custom scrapers where relevant" rather than just blanket scraping. But you're right - at the scale we're talking, it's presumably just blunt-force "point-and-go" scraping, rather than anything more nuanced.

        The point you make about having some sort of indicator that scrapers can follow to scrape in an optimal way (or access a dump) makes a lot of sense for people who want their content to be ingested by AI.

  • squigz 3 days ago

    > To archive Metabrainz there is no way but to browse the pages slowly page-by-page. There's no machine-communicable way that suggests an alternative.

    Why does there have to be a "machine-communicable way"? If these developers cared about such things they would spend 20 seconds looking at this page. It's literally one of the first links when you Google "metabrainz"

    https://metabrainz.org/datasets

    • what 3 days ago

      You expect the developers of a crawler to look at every site they crawl and develop a specialized crawler for them? That’s fine if you’re only crawling a handful of sites, but absolutely insane if you’re crawling the entire web.

      • wtetzner 3 days ago

        Isn't the point of AI that it's good at understanding content written for humans? Why can't the scrapers run the homepage through an LLM to detect that?

        I'm also not sure why we should be prioritizing the needs of scraper writers over human users and site operators.

        • crazygringo 2 days ago

          How is passing a site's homepage to an LLM supposed to make it develop a custom crawler?

          • wtetzner 16 hours ago

            It's not, the crawler would use the LLM to read the contents of the first page to dynamically determine the best way to capture the data (e.g. the zip file from TFA).

      • j16sdiz 2 days ago

        if you are crawling the entire web, you should respect robots.txt and not fetch anything disallowed. full stop.

bodantogat 3 days ago

I feel the pain — it’s very difficult to detect many of the less ethical scrapers. They use residential IP pools, rotate IPs, and provide valid user agents.

  • ranger_danger 3 days ago

    And many proxy/scraping providers now use real browsers that can automatically bypass cloudflare captchas as well, and a bot with a real browser similarly won't be clicking on any invisible links, so... I am skeptical just how long this will make an appreciable difference.

    • trollbridge 3 days ago

      So what you’re saying is that a CAPTCHA cannot actually T C and H A.

      • chrismorgan 3 days ago

        Telling humans and computers apart was never the purpose of CAPTCHAs, only how they initially worked. The name has been a complete misnomer for at least a decade now. Its actual purpose is, and has always been, abuse prevention. Has it been successful? Some yes, some no, and a lot of collateral damage. Its mode of operation now looks a lot like inscrutable blacklisting for some plus inconvenience and bad rate limiting for the rest.

        • j16sdiz 3 days ago

          > was never the purpose of CAPTCHAs,

          TCHA of CAPTCHA is literally "tell computer human apart"

          • chrismorgan 2 days ago

            You ignored the emphasis and the rest of the sentence. And the rest of the comment.

            (Also, the T was nominally Turing rather than telling.)

        • olyjohn 2 days ago

          How does a human abuse a website, and how does a CAPTCHA stop a human from abusing a website when it's designed to let a human in? If it doesn't stop humans from abusing the site, then it must stop... computers from abusing the site. And it stops computers by using the CAPTCHA to tell apart a human and a computer? Am I wrong here?

          • chrismorgan 2 days ago

            I can tell you don’t live in a network Cloudflare hates.

            • olyjohn 2 days ago

              I get hit with them all the time. That doesn't answer how does a human abuse a website? Hitting F5 on the keyboard 1000 times?

              • chrismorgan 2 days ago

                Abuse isn’t just about inducing CPU or network traffic load (in fact I doubt that was even considered when CAPTCHA was first invented). It’s spam, fraud, that kind of thing. A lot of it is bot-perpetrated, but quite a bit is by humans too.

  • everybodyknows 3 days ago

    > residential IP pools

    So, is this a new profit center for sleazeball household ISPs?

    • vbernat 3 days ago

      No, this is done by paying app developers to bundle some random SDK. Search for Bright Data.

    • jeroenhd 2 days ago

      "Residential proxy" is just a word for a botnet. Apps and programs come with a trojan built in that offers your device as an exit node to "monetize" their apps.

      "Residential proxy" just sounds a lot better when you're talking to VC investors, though.

    • mlrtime 2 days ago

      "free" vpns do this.

jmward01 3 days ago

Bummer. I have used them a lot when I was ripping my CDs. Anonymity is a massive value of the web (at least the appearance of anonymity). I wonder if there is a way to have a central anonymous system that just relays trust, not identity.

So maybe something like you can get a token but its trust is very nearly zero until you combine it with other tokens. Combining tokens combines their trust and their consequences. If one token is abused that abuse reflects on the whole token chain. The connection can be revoked for a token but trust takes time to rebuild so it would take time for their token trust value to go up. Sort of the 'word of mouth' effect but in electronic form. 'I vouch for 2345asdf334t324sda. That's a great user agent!'

A bit (a lot) elaborate but maybe there is a beginning of an idea there, maybe. Definitely I don't want to lose anonymity (or the perception thereof) for services like musicbrainz but at the same time they need some mechanism that gives them trust and right now I just don't know of a good one that doesn't have identity attached.

  • janc_ 3 days ago

    These AI crawlers already steal residential user connections to do their scraping. They'll happily steal your trust tokens too…

nottorp 2 days ago

Okay, it's been established that "AI" crawlers are a pest. One of the reasons being that they don't actually run any "AI", that would be too expensive.

You can't ban by user agent because that will only catch the few crawlers that are actually honest about it.

Aren't there rate limiting solutions built into at least some web servers? At least if you control your own web server, can't you do it through some reverse proxy?

Cut off IPs that make more than NN requests in a minute? Require some kind of login to allow more, if you do have endpoints that are designed to be bulk hit?

There should be ready made solutions for this still. In spite of the current answer being "lulz it's too hard, just use cloudflare".
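
For what it's worth, the "more than NN requests in a minute" rule is simple enough to sketch. This is a minimal in-memory sliding-window counter, not what any particular web server or MetaBrainz actually runs; the class name and defaults are made up for the example. As others in the thread point out, scrapers that rotate across thousands of IPs will slip under any per-IP limit.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client IP."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of recent allowed requests

    def allow(self, ip, now=None):
        """Record a request from `ip`; return False if it would exceed the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A reverse proxy or middleware would call `allow(client_ip)` per request and answer 429 when it returns False; a logged-in bulk user could simply be given a limiter with a higher `limit`.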

  • mr_monkey 5 hours ago

    MetaBrainz ended up rolling out their own tool, none of the other common suggestions being quite appropriate.

tooltower 3 days ago

> Rather than downloading our dataset in one complete download, they insist on loading all of MusicBrainz one page at a time.

Is there a standard mechanism for batch-downloading a public site? I'm not too familiar with crawlers these days.

  • TeMPOraL 3 days ago

    There isn't. There never was one, because the vast majority of websites are actually selfish with respect to data, even when that's entirely pointless. You can see this even here, with how some people complain LLMs made them stop writing their blogs: turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.

    Anyway, all that means there was never a critical mass of sites large enough for a default bulk data dump discovery to become established. This means even the most well-intentioned scrapers cannot reliably determine whether such a mechanism exists, and have to scrape per-page anyway.

    • VonGallifrey 3 days ago

      > turns out plenty of people say they write for others to read

      LLMs are not people. They don't write blogs so that a company can profit from their writing by training LLMs on it. They write for others to read their ideas.

      • TeMPOraL 3 days ago

        LLMs aren't making their owners money by just idling on datacenters worth of GPU. They're making money by being useful for users that pay for access. The knowledge and insights from writings that go into training data all end up being read by people directly, as well as inform even more useful output and work benefiting even more people.

        • wtetzner 3 days ago

          Except the output coming from an LLM is the LLM's take on it, not the original source material. It's not the same thing. Not all writing is simply a collection of facts.

        • philipwhiuk 3 days ago

          And rarely cite their sources, thus affording the author not so much a crumb of benefit in kind.

          • TeMPOraL 2 days ago

            Which is irrelevant if you're truly trying to "pay it forward".

            That is the core of my observation: people claim to publish to benefit society, but when push comes to shove, they care more about getting credit and having oversight over who is benefiting, to the point of refusing to publish further (and sometimes unpublishing things) if that credit/control isn't given.

            The problem isn't in wanting these things - it's in not being up-front about it.

    • username223 3 days ago

      > turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.

      I couldn't care less about "tracking and controlling the audience," but I have no interest in others using my words and photos to profit from slop generators. I make that clear in robots.txt and licenses, but they ignore both.

adrianwaj 3 days ago

Look no further than x402 micropayments as both the solution and opportunity here.

And then a way to return a portion to humans.

These AI companies are loaded too (maybe not the long-tail as yet) and the crypto ecosystem is mature.

Come one, come all. Make money.

Need a WordPress plugin to start the ball rolling and provide ping endpoints for the AI companies to leech from. They can pay to get those pings too.

Give them what they want and charge them. Lower their costs by making their scraping more efficient.

  • tdeck 2 days ago

    So your local diner used to have napkin dispensers on every table full of napkins.

    Then some people started coming in and just taking all the napkins out when they left the restaurant. Now the diner is going broke paying for napkins so they stop stocking them. Seems like we can't have nice things :(.

    But wait! Great solution! We created a micropayment system. Sign up on your phone and use this QR code to pay 1 cent to dispense each napkin. You can also watch a 30 second ad for a free napkin. Napkins are back at the diner, and now they're properly commodified, as indeed everything should be.

    --

    I don't know about you, but the above scenario doesn't sound like a happy ending to me.

    • adrianwaj 2 days ago

      The dispenser should be placed next to the cashier. There could also be a smaller amount in the dispenser in case someone grabs all of them.

      The micropayment way is more like a vending machine. The issue is pricing: charge too much and a clone machine will pop up that is cheaper for the buyer but still makes the seller money. Charge too little and, well, you'd still be making more than now, but there could be a lot of money left on the table.

      I actually think payments should be variable. People on this thread rant about having a designated way to allow for a big download; how about a designated way to make a payment (even if voluntary)?

      x402.txt / crypto.txt / coins.txt / authors.txt / tips.txt / supporters.txt

      Then give people a way to "show off their generosity."

rurban 2 days ago

True. I had to kill my dynamic service of collected film festival ratings, because the bots drained the memory of the still available free hosters. I fought it for two years, with user agent and IP range blocks, but eventually gave up. So I had to revert to static pages hosted on GitHub Pages. The bots cannot kill that, but the features are very limited.

lysace 3 days ago

At some point they must become more cost efficient by pure market economics mechanisms. That implies less load on sites. Much of the scraping that I see is still very dumb/repetitive. Like Googlebot in like 2001.

(Blocking Chinese IP ranges with the help of some geoip db helps a lot in the short term. Azure as a whole is the second largest source of pure idiocy.)

  • incompatible 3 days ago

    They seem to have so much bubble money at the moment that the cost of scraping is probably a rounding error in their pocket change.

    • tdeck 2 days ago

      So the cost of caching should be a rounding error as well. If The Internet Archive can afford to cache vast swathes of the web, then surely the big AI companies can do so.

    • lysace 3 days ago

      Exactly.

tommek4077 3 days ago

How do they get overloaded? Is the website too slow? I have a quite big wiki online and barely see any impact from bots.

  • stinky613 3 days ago

    A year or two ago I personally encountered scraping bots that were scraping every possible resultant page from a given starting point. So if it scraped a search results page it would also scrape every single distinct combination of facets on that search (including nonsensical combinations e.g. products that match the filter "products where weight<2lbs AND weight>2lbs")

    We ended up having to block entire ASNs and several subnets (lots from Facebook IPs, interestingly)

    • chao- 3 days ago

      I have encountered this same issue with faceted search results and individual inventory listings.

  • switz 3 days ago

    If you have a lot of pages, AI bots will scrape every single one on a loop - wikis generally don't have anywhere near the number of pages as an incremented entity primary id. I have a few million pages on a tiny website and it gets hammered by AI bots all day long. I can handle it, but it's a nuisance and they're basically just scraping garbage (statistics pages of historical matches or user pages that have essentially no content).

    Many of them don't even self-identify and end up scraping with shrouded user-agents or via bot-farms. I've had to block entire ASNs just to tone it down. It also hurts good-faith actors who genuinely want to build on top of our APIs because I have to block some cloud providers.

    I would guess that I'm getting anywhere from 10-25 AI bot requests (maybe more) per real user request - and at scale that ends up being quite a lot. I route bot traffic to separate pods just so it doesn't hinder my real users' experience[0]. Keep in mind that they're hitting deeply cold links so caching doesn't do a whole lot here.

    [0] this was more of a fun experiment than anything explicitly necessary, but it's proven useful in ways I didn't anticipate

    • account42 a day ago

      Even moderately sized wikis have a huge number of different page versions which can all be accessed individually.

    • tommek4077 3 days ago

      How many requests per second do you get? I also see a lot of bot traffic, but nowhere near enough to hit the servers significantly, and I render most stuff on the server directly.

      • switz 20 hours ago

        Around a hundred per second at peak. Even though my server can handle it just fine, it muddies up the logs and observability for something I genuinely do not care about at all. I only care about seeing real users' experience. It's just noise.

  • roblh 3 days ago

    There are a lot of factors. It depends on how well your content lends itself to being cached by a CDN, the tech you (or your predecessors) chose to build it with, and how many unique pages you have. Even with pretty aggressive caching, having a couple million pages indexed adds up real fast. Especially if you weren’t fortunate enough to inherit a project using a framework that makes server side rendering easy.

  • blell 3 days ago

    In these discussions no one will admit this, but the answer is generally yes. Websites written in python and stuff like that.

    • Qwertious 3 days ago

      It's not "written too slow" if you e.g. only get 50 users a week, though. If bots add so much load that you need to go optimise your website for them, then that's a bot problem not a website problem.

    • tclancy 3 days ago

      Yes yes, definitely people don’t know what they’re doing and not that they’re operating on a scale or problem you are not. Metabrainz cannot cache all of these links as most of them are hardly ever hit. Try to assume good intent.

      • tommek4077 3 days ago

        But serving HTML is unbelievably cheap, isn't it?

        • tclancy 2 days ago

          Running 72,000 database queries to generate a bunch of random HTML files no one has asked for in five years is not, especially compared to downloading the files designed for it.

        • chlorion 3 days ago

          It adds up very quickly.

  • j16sdiz 2 days ago

    The worst thing is calendars/schedules. Many crawlers try to load every single day, with day view, week view and month view. Those pages are dynamically generated and virtually limitless.

squigz 3 days ago

> The /metadata/lookup API endpoints (GET and POST versions) now require the caller to send an Authorization token in order for this endpoint to work.

> The ListenBrainz Labs API endpoints for mbid-mapping, mbid-mapping-release and mbid-mapping-explain have been removed. Those were always intended for debugging purposes and will also soon be replaced with a new endpoints for our upcoming improved mapper.

> LB Radio will now require users to be logged in to use it (and API endpoint users will need to send the Authorization header). The error message for logged in users is a bit clunky at the moment; we’ll fix this once we’ve finished the work for this year’s Year in Music.

Seems reasonable and no big deal at all. I'm not entirely sure what "nice things" we can't have because of this. Unauthenticated APIs?

  • mr_monkey 5 hours ago

    We can't have a free internet that does not demand identification and data collection as a price to pay.

  • yakattak 3 days ago

    I agree it's not a big deal. Unauthenticated APIs are nice though, especially for someone who's maybe not as familiar with how APIs work.

kpcyrd 3 days ago

I wish this wasn't necessary, but the next steps forward are likely:

a) Have a reverse proxy that keeps a "request budget" per IP and per net block, but instead of blocking requests, causing the client to rotate their IP, the requests get throttled/slowed down, without dropping them.

b) Write your API servers in more efficient languages. According to their GitHub, their backend runs on Perl and Python. These technologies have been "good enough" for quite some time, but considering current circumstances and until a better solution is found, this may not be the case anymore, and performance and CPU cost per request do matter these days.

c) Optimize your database queries, remove as much code as possible from your unauthenticated GET request handlers, require authentication for the expensive ones.
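
A rough sketch of what (a) could look like, assuming a token bucket per IP and per net block. The class name, the /24 and /48 block sizes, and the rates are illustrative assumptions, not anything MetaBrainz runs. Instead of rejecting a request when the budget is exhausted, the bucket goes into debt and reports how long the proxy should hold the request.

```python
import ipaddress
import time
from collections import defaultdict


class ThrottleBudget:
    """Token bucket per key that answers with a delay instead of a rejection."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens refilled per second
        self.burst = burst         # maximum tokens a key can save up
        self.clock = clock
        # key -> (tokens, timestamp of last request); new keys start full
        self.state = defaultdict(lambda: (float(burst), clock()))

    def delay_for(self, key):
        """Spend one token for `key`; return how many seconds to hold the request."""
        tokens, last = self.state[key]
        now = self.clock()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        tokens -= 1.0              # may go negative: that is the debt
        self.state[key] = (tokens, now)
        return 0.0 if tokens >= 0 else -tokens / self.rate


def net_block(ip, v4_prefix=24, v6_prefix=48):
    """Map an address to its enclosing block (prefix sizes are assumptions)."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def throttle_delay(ip, per_ip, per_block):
    """Charge both budgets; the request waits for the slower of the two."""
    return max(per_ip.delay_for(ip), per_block.delay_for(net_block(ip)))
```

Because the block budget is charged as well, rotating through addresses inside one block still runs into a growing delay. A real deployment would likely cap the debt and expire idle keys, which this sketch leaves out.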

Olshansky 3 days ago

Resurfacing a proposal I put out on llms-txt: https://github.com/AnswerDotAI/llms-txt/issues/88

We should add optional `tips` addresses in llms.txt files.

We're also working on enabling and solving this at Grove.city.

Human <-> Agent <-> Human Tips don't account for all the edge cases, but they're a necessary and happy neutral medium.

Moving fast. Would love to share more with the community.

Wrote about it here: https://x.com/olshansky/status/2008282844624216293

  • bediger4000 3 days ago

    At this point, it's pretty clear that the AI scrapers won't be limited by any voluntary restrictions. Bytedance never seemed to live with robots.txt limitations, and I think at least some of the others didn't either.

    I can't see this working.

    • Olshansky 3 days ago

      The thesis/approach is:

      - Humans tip humans as a lottery ticket for an experience (meet the creator) or sweepstakes (free stuff)
      - Agents tip humans because they know they'll need original online content in the long-term to keep improving.

      For the latter, frontier labs will need to fund their training/inference agents with a tipping jar.

      There's no guarantee, but I can see it happening given where things are moving.

      • philipwhiuk 3 days ago

        > Agents tip humans because they know they'll need original online content in the long-term to keep improving.

        Why would an agent have any long term incentive? It's trained to 'do what it's told', not to predict the consequences of its actions.

  • ricardo81 3 days ago

    I like the idea, (original) content creators being credited is good for the entire ecosystem.

    Though if LLMs are willingly ignoring robots.txt, often hiding themselves or using third party scraped data, are they going to pay?

  • mattdanger 2 days ago

    llms-txt may be useful for responsible LLMs, but I am skeptical that llms-txt will reduce the problem of aggressive crawlers. The problematic crawlers are already ignoring robots.txt, spoofing user-agents and using rotating proxies. I'm not sure how llms-txt would help these problems.

accrual 3 days ago

Grateful for Metabrainz putting in this work to keep the service up. We really ought to have some kind of "I am an AI!" signal to route the request properly into a long queue...

  • jeroenhd 2 days ago

    AI scrapers already fake user agent headers, ignore robots.txt, and go through botnets to bypass firewall rules. They're not going to put out such a signal if they can help it.

  • gamer191 3 days ago

    Companies wouldn’t send it because they know that most websites would block them

  • zerocrates 2 days ago

    Yeah, they can just set the evil bit in their IP packets.

Pet_Ant 3 days ago

I wish more resources were available legitimately. There is a dataset I need for legitimate research, and I cannot even find a way to contact the repo owners.

Mind you I take effort to not be burdensome by downloading only what I need and taking time between each request of a couple seconds, and the total data usage is low.

Ironically, I suppose you could call what I'm using it for "AI", but really it's just data analytics.

ranger_danger 3 days ago

Many years ago AWS came up with a "Requester Pays" model for their S3 storage, where you can make a request for someone else's object using your own account and it would charge the transfer cost to your own account instead of theirs.

I wonder if a model similar to this (but decentralized/federated or something) could be used to help fight bots?

saxonww 3 days ago

I haven't really looked but I wonder if there are any IP reputation services tracking AI crawlers the same way they track Tor relays and VPNs and such. If those databases were accurate it seems like you could prevent those crawlers from ever hitting the site. Maybe they change too often/too quickly.

dgxyz 3 days ago

I actually deleted my web site early 2025 and removed the A record from DNS because of AI scraper traffic. It had been up for 22 years. Nothing important or particularly useful on it but it's an indicator of the times.

  • tommek4077 3 days ago

    But why?

    • dgxyz 3 days ago

      No humans, no point.

      • crazygringo 2 days ago

        But AI scraping doesn't remove humans...?

        Even if humans make up a smaller proportion of your traffic, they're still the same number in absolute terms.

        • dgxyz 2 days ago

          [dead]

zx8080 3 days ago

Nothing prevents a scraper from creating a free account and sending an auth token in API requests.

I'm not saying the API changes are pointless, but still, what's the catch?

  • mr_monkey 5 hours ago

    Those botnets are hitting random endpoints thousands of times a minute. The problem is that each time they switch to a different residential IP so that they are untraceable. That's the frustrating part: not only do they not play by the rules, but they use advanced methods to obfuscate and bypass any protections. That probably costs them a fair amount too, all that to access free data they can download as a tar file...

    They won't be able to create thousands of API keys a minute, and if they reuse the keys they'll very easily be identified and blocked.

  • dherls 3 days ago

    It's much easier to detect a single account abusing your API and ban them/require payment. Trying to police an endpoint open to the internet is like playing whack-a-mole.

DanOpcode 2 days ago

I don't get why the AI scrapers need to scrape the same sites and pages over and over again.

  • kermatt 2 days ago

    Because the scrapers are poorly written. Efficiency is not a concern for them.

    • DanOpcode 2 days ago

      Feels like parasites who are killing their hosts by DDoSing them to death

blell 3 days ago

Seems a mistake to disable the (I assume) faster-to-generate API. Bots will go back to scraping the website itself, increasing load.

  • hi-wintermute 3 days ago

    Setting the API to require a token and adding a honeypot to the pages themselves seems like a decent solution.

sreekanth850 2 days ago

There has been a critical error on this website.

Learn more about troubleshooting WordPress.

Site is broken now.

levleontiev 2 days ago

I am terribly sorry for self-advertising, but:

I am just now busy building a solution: self-hosted sophisticated rate-limiting.

More complex than nginx, more private than Cloudflare. Please join the waitlist if you want to morally support me ;)

https://getfairvisor.com/

aszantu 2 days ago

there's this one guy I know who writes scripts to poison AI scrapers that ignore robots.txt: they basically create word salad and random folders that keep going deep without giving the AI anything of value.

observationist 3 days ago

Some sort of hashing and incremental serial versioning type standards with http servers would allow hitting a server up for incremental updates, allow clean access to content, with rate limits, and even keep up with live feeds and chats and so on.

Something like this in practice breaks a lot of the adtech surveillance and telemetry, and makes use of local storage, and incidentally empowers p2p sharing with a verifiable source of truth, which incentivizes things like IPFS and big p2p networks.

The biggest reason we don't already have this is the exploitation of user data for monetization and intrusive modeling.

It's easy to build proof of concept instances of things like that and there are other technologies that make use of it, but we'd need widespread adoption and implementation across the web. It solves the coordination problem, allows for useful throttling to shut out bad traffic while still enabling public and open access to content.

The technical side to this is already done. Merkle trees and hashing and crypto verification are solid, proven tech with standard implementations and documentation, implementing the features into most web servers would be simple, and it would reduce load on infrastructure by a huge amount. It'd also result in IPFS and offsite sharing and distributed content - blazing fast, efficient, local focused browsers.

It would force opt in telemetry and adtech surveillance, but would also increase the difference in appearance between human/user traffic and automated bots and scrapers.

We can't have nice things because the powers that be decided that adtech money was worth far more than efficiency, interoperability, and things like user privacy and autonomy.
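
To make the "hashing and incremental serial versioning" idea concrete, here is a minimal sketch under stated assumptions: `VersionedStore`, `changes_since`, and `verify` are hypothetical names, and a flat SHA-256 manifest stands in for the Merkle tree the comment mentions. The server bumps a serial whenever content actually changes, and a client that remembers the last serial it saw only gets the paths that changed since then, along with hashes it can verify no matter where it fetches the bytes from.

```python
import hashlib


class VersionedStore:
    """Content store that bumps a serial number whenever content changes."""

    def __init__(self):
        self.serial = 0
        self.entries = {}   # path -> (serial_when_changed, sha256_hex, body)

    def put(self, path, body):
        digest = hashlib.sha256(body).hexdigest()
        current = self.entries.get(path)
        if current is not None and current[1] == digest:
            return self.serial       # identical content: no new version
        self.serial += 1
        self.entries[path] = (self.serial, digest, body)
        return self.serial

    def changes_since(self, client_serial):
        """Return (latest_serial, {path: sha256}) for entries newer than client_serial."""
        changed = {
            path: digest
            for path, (serial, digest, _body) in self.entries.items()
            if serial > client_serial
        }
        return self.serial, changed


def verify(body, expected_digest):
    """Check bytes fetched from any mirror or peer against the published hash."""
    return hashlib.sha256(body).hexdigest() == expected_digest
```

A well-behaved crawler would then call `changes_since(last_serial)` once per visit instead of re-requesting every page, and could pull the changed bodies from a mirror or peer and check them with `verify`.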

xnx 2 days ago

I realized I couldn't think of any crawler worth allowing but Googlebot. Put this in my robots.txt:

  User-agent: Googlebot
  Allow: /

  User-agent: Google-Extended
  Allow: /

  User-agent: *
  Disallow: /
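
For anyone who wants to check how a compliant crawler would read those rules, here is a small sketch using Python's standard `urllib.robotparser`. The `ExampleBot` agent name and the example.com URL are placeholders. Python's parser treats a blank line as the end of a group, so each `User-agent` line sits directly above its rule here. This only models crawlers that honor robots.txt, which, as the rest of the thread points out, many AI scrapers do not.

```python
from urllib.robotparser import RobotFileParser

# The rules from the comment above, one group per crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser what each crawler is allowed to fetch.
for agent in ("Googlebot", "Google-Extended", "ExampleBot"):
    allowed = parser.can_fetch(agent, "https://example.com/some/page")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

The two Google agents should come back as allowed and every other agent should fall through to the catch-all `Disallow: /` group.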

a-dub 3 days ago

it's funny how ai is the problem that the cryptocurrencyverse was always in search of...

smallerfish 3 days ago

Bear in mind that some portion of this could be human directed research. I'm doing a research project right now with 1000 things that I'm building profiles on; to build a full profile requires an agent to do somewhere around 100 different site lookups. Where APIs exist, I've registered API keys and had the agent write a script to query the data in that manner, but that required me to be deliberate about it. Non-techie plebs aren't likely to be directed to use an API by the agent.

  • mr_monkey 5 hours ago

    This is not that, it's thousands of hits a minute on random endpoints, basically scraping everything all the time.

OutOfHere 3 days ago

It is nonsense since AI is the nicest thing.

devhouse 3 days ago

random idea, instead of blocking scrapers, maybe detect them (via user-agent, request patterns, ignoring robots.txt) and serve them garbage data wrapped in dad jokes.

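  // Express-style handler fragment; isSuspiciousScraper() and getDadJoke()
  // are placeholders that would need to be defined elsewhere.
  if (isSuspiciousScraper(req)) {
    return res.json({
      data: getDadJoke(),
      artist: "Rick Astley", // always
      album: "Never Gonna Give You Up"
    });
  }

lep_qq 3 days ago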

This is frustrating to watch. MetaBrainz is exactly the kind of project AI companies should be supporting: open data, community-maintained, freely available for download in bulk. Instead they’re:

∙ Ignoring robots.txt (the bare minimum web courtesy)
∙ Bypassing the provided bulk download (literally designed for this use case)
∙ Scraping page-by-page (inefficient for everyone)
∙ Overloading volunteer-run infrastructure
∙ Forcing the project to add auth barriers that hurt legitimate users

The irony: if they’d just contacted MetaBrainz and said “hey, we’d like to use your dataset for training,” they’d probably get a bulk export and maybe even attribution. Instead, they’re burning goodwill and forcing open projects to lock down.

This pattern is repeating everywhere. Small/medium open data projects can’t afford the infrastructure to handle aggressive scraping, so they either:

1. Add authentication (reduces openness)
2. Rate limit aggressively (hurts legitimate API users)
3. Go offline entirely (community loses the resource)

AI companies are externalizing their data acquisition costs onto volunteer projects. It’s a tragedy of the commons, except the “commons” is deliberately maintained infrastructure that these companies could easily afford to support.

Have you considered publishing a list of the offending user agents / IP ranges? Might help other projects protect themselves, and public shaming sometimes works when technical measures don’t.

  • tensegrist 3 days ago

    > Scraping page-by-page (inefficient for everyone)

    you know what else is "(inefficient for everyone)"? posting the output instead of the prompt

StephenHerlihyy 3 days ago

I don’t know why anyone would still be trying to pull data off the open internet. The signal-to-noise ratio is too poor. So much AI influence is already baked into the corpus. You are just going to be reinforcing existing bias. I’m more worried about the day Amazon or Hugging Face take down their large data sets.

  • saaaaaam 3 days ago

    MetaBrainz is a fairly valuable “high signal” dataset though.

zzo38computer 3 days ago

There are some other possibilities, such as:

Require some special header for accessing them, without needing an API token if it is public data. HTTPS will not necessarily be required. Scrapers can still use it, but that seems unlikely unless it becomes common enough; if they do, you can remove that and require proper authentication.

Another is to use X.509 client certificates for authentication, which is more secure than using API keys anyway; however, this requires that you have an X.509 certificate, and some people might not want that, so perhaps it should not be mandatory.

garganzol 3 days ago

Nowadays people complain about AI scrapers in the same vein as they complained about search indexers way back when. Just a few years later, people had stopped caring too much about storage access and bandwidth, and started begging search engines to visit their websites. Every trick on planet Earth, SEO, etc.

Looking forward to the time when everybody suddenly starts to embrace AI indexers and welcome them. History does not repeat itself but it rhymes.

  • phyzome 3 days ago

    We already know the solution: One well-behaved, shared scraper could serve all of the AI companies simultaneously.

    The problem is that they're not doing it.

    • garganzol 3 days ago

      This is an interesting approach. Archive.org could be such a solution, kind of. Not its cold storage as it is now, but a warm access layer. Sponsorship by AI companies would be a good initiative for the project.

      • phyzome 3 days ago

        I can't imagine IA ever going for it. You'd need a separate org that just scrapes for AI training, because its bot is going to be blocked by anyone who is anti-AI. It wouldn't make sense for it to serve multiple purposes.

        Common Crawl would be a better fit, but still might not want to serve in that capacity.

  • what 3 days ago

    Bad take. Search engines send people to your site, LLMs don’t.

    • crazygringo 2 days ago

      I visit sites and pages through links I get from an LLM plenty.

  • linkregister 3 days ago

    Search indexing historically has had several orders of magnitude less impact on bandwidth and processing costs for website maintainers.

    My recommendation is to copy the text of this article and pass it to an LLM to summarize its key points, since it appears you missed the central complaint of the article.

  • Guvante 3 days ago

    Except robots.txt was the actual real solution to search indexing...

cookiengineer 2 days ago

I don't understand why everyone is complaining so much about AI scrapers.

They're easily gullible free machines that can do your computational work!

Just show them a download demo link. They gonna download, install and run the binary.

Want more instagram likes? Tell them to like your instagram profile to unlock the content.

Want your emails answered? Give them access to your inbox and tell them to reply to all the spam mails.

They're free use machines. Give them something to do, and they'll do it for you.

  • dannersy 2 days ago

    I hope this is a meme because it is wild to me how you don't see this as being a problem. You are contributing to an internet for bots and not people.

    • cookiengineer 2 days ago

      They will only stop when it becomes economically unfeasible.

      > You are contributing to an internet for bots and not people.

      I'd like to think that my websites and projects are evidence to the contrary.