There may actually be some utility here. LLM agents refuse to traverse the links. Tested with gemini-3-pro, gpt-5.2, and opus 4.5.
edit: gpt-oss 20B & 120B both eagerly visit it.
I wish this had come a day earlier.
There's currently a "show your personal site" post at the top of HN [1] with 1500+ comments. I wonder how many of those sites are or will be hammered by AI bots in the next few days to steal/scrape content.
If this could be used as a temporary guard against AI bots, that thread would have been a good opportunity to test it out.
AI bots (or clients claiming to be one) appear quite fast on new sites, at least that's what I saw recently in a few places. They probably monitor Certificate Transparency logs; you won't hide just by not linking to the site. Unless you're OK with staying in the shadow of plain HTTP.
Get a wildcard cert and use it behind a reverse proxy.
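The reason that helps: CT logs only record the names on the certificate, so a wildcard hides individual hostnames. Rough sketch of the kind of lookup a crawler can run against crt.sh's public JSON endpoint (one of several ways to consume CT data; the domain is a placeholder):

    import json
    import urllib.request

    # Query crt.sh (a public Certificate Transparency search) for every
    # name that has appeared on a cert for a domain. With a wildcard cert
    # only "*.example.com" shows up, so individual subdomains stay hidden.
    def ct_hostnames(domain: str) -> set[str]:
        url = f"https://crt.sh/?q=%25.{domain}&output=json"
        with urllib.request.urlopen(url, timeout=30) as resp:
            entries = json.loads(resp.read())
        names = set()
        for entry in entries:
            for name in entry.get("name_value", "").splitlines():
                names.add(name.strip().lower())
        return names

    print(ct_hostnames("example.com"))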
Okay, but then what? Host your sites on something other than 'www' or '*', exclude them from search engines, and never link to them? Then, for the few people who do resolve these subdomains, you just have to hope they don't do it through a DNS server owned by a company with an AI product (like Google, Microsoft, or Amazon)?
I really don't know how you're supposed to shield your content from AI without also shielding it from humanity.
Don't have any index pages or heavy cross-linking between pages.
None of that matters. AI bots can still figure out how to navigate the website.
The biggest problem I have seen with AI scraping is that they blindly try every possible combination of URLs once they find your site and blast it 100 times per second for each page they can find.
They don't respect robots.txt, they don't care about your sitemap, they don't bother caching; they just mindlessly churn away, effectively a DDoS.
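For contrast, a rough sketch of what a well-behaved crawler is supposed to do before fetching anything (toy example; example.com and the user agent are placeholders):

    import time
    import urllib.request
    import urllib.robotparser

    # The steps the worst scrapers skip: read robots.txt, honor the
    # crawl delay, and cache pages instead of re-fetching them forever.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    cache: dict[str, bytes] = {}

    def polite_fetch(url: str, user_agent: str = "ExampleBot") -> bytes | None:
        if not rp.can_fetch(user_agent, url):
            return None                      # disallowed by robots.txt
        if url in cache:
            return cache[url]                # don't hit the origin again
        time.sleep(rp.crawl_delay(user_agent) or 1)   # pace the requests
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=30) as resp:
            cache[url] = resp.read()
        return cache[url]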
Google at least played nice.
And that is why things like Anubis exist, and why people flock to Cloudflare and all the other tried-and-true methods of blocking bots.
I don't see how that is possible. The web site is a disconnected graph with a lot of components. If they get hold of a URL, maybe that gets them to a few other pages, but not all of them. Most of the pages on my personal site are .txt files with no outbound links, for that matter. Nothing to navigate.
how? if you don't have a default page and index listings are disabled, how can they derive page names?
I posted my site on the thread.
My site is hosted on Cloudflare and I trust its protection way more than a flavor-of-the-month method. This probably won't be patched anytime soon, but I'd rather have some people click my link than avoid it (along with the AI) because it looks fishy :)
I've been considering how feasible it would be to build a modern version of the Low Orbit Ion Cannon denial-of-service tool by having various LLMs hammer sites until they break. I'm sure anything important already has Cloudflare-style DDoS mitigation, so maybe it's not as effective. Still, I think it's only a matter of time before someone figures it out.
There have been several amplification attacks using various protocols for DDOS too...
Yeah, I meant using it as an experiment to compare two different links (or domains), not as a solution to evade bot traffic.
Still, I think it would be interesting to know if anybody noticed a visible spike in bot traffic (especially AI) after sharing their site in that thread.
I didn't: no traffic before sharing, none since.
funny. what's your site?
FYI, Cloudflare protection doesn't mean much nowadays if someone is even slightly determined to scrape the site.
Unless you mean DDoS protection; that one helps for sure.
Glad I’m not the only one who felt icky seeing that post.
I agree. My tinfoil-hat signal told me this was the perfect way to ask people for bespoke, hand-crafted content, which of course AI will love to slurp up to keep feeding the bear.
Not producing or publishing creative works out of fear that someone will find them and build on top of them is such a strange position to me, especially on a site that has its cultural basis in hacker culture.
AI has driven a lot of people mad and not just its end users.
I think that something specifically intended for this, like Anubis, is a much better option.
Anubis flatly refuses me access to several websites when I'm accessing them with a normal Chromium with JS enabled and whatnot, from a mainstream, typical OS, just with aggressive anti-tracking settings.
Not sure if that's the intended use case. At least Cloudflare politely asks for a CAPTCHA.
What do you mean "refuses"? The worst it should do is serve up a high difficulty proof of work. Unless it gained new capabilities recently?
Are you sure the block isn't due to the authors of those websites using some other tool in addition?
I thought Anubis was solely proof of work, so I'm very curious as to what's going on here.
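For reference, the core mechanism is roughly this (a toy sketch of the sha256 proof-of-work idea, not Anubis's actual code): the server hands your browser a random challenge plus a difficulty, and client-side JS grinds out a nonce whose hash starts with enough zero bits.

    import hashlib
    import secrets

    # Toy proof-of-work in the Anubis spirit: find a nonce so that
    # sha256(challenge + nonce) starts with `difficulty` zero bits.
    # Cheap for the server to verify, a bit of work for each visitor,
    # very expensive for a scraper doing it millions of times.
    def solve(challenge: str, difficulty: int) -> int:
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if int(digest, 16) >> (256 - difficulty) == 0:
                return nonce
            nonce += 1

    def verify(challenge: str, nonce: int, difficulty: int) -> bool:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        return int(digest, 16) >> (256 - difficulty) == 0

    challenge = secrets.token_hex(16)
    nonce = solve(challenge, difficulty=16)
    assert verify(challenge, nonce, 16)

So a flat block with JS enabled sounds like something in front of or alongside Anubis, not the proof of work itself.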
Of course, the downside is that people might not even see your site at all because they’re afraid to click on that suspicious link.
The site should add a reverse lookup: provide both the poison and the antidote.
Bitly does that: just add '+' to the Bitly URL (probably other shorteners too).
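For shorteners without a '+' trick, you can usually peek at the destination without actually visiting it by stopping at the first redirect (sketch; the URL is a placeholder, and some shorteners chain more than one hop):

    import urllib.parse
    from http.client import HTTPSConnection

    # Ask the shortener where a link goes, but stop at the redirect
    # instead of following it: the Location header is the destination.
    def peek_destination(short_url: str) -> str | None:
        parts = urllib.parse.urlsplit(short_url)
        conn = HTTPSConnection(parts.netloc, timeout=15)
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().getheader("Location")

    print(peek_destination("https://bit.ly/xxxxxxx"))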
How is AI viewing content any different from Google? I don’t even use Google anymore because it’s so filled with SEO trash as to be useless for many things.
Try hosting a cgit server on a 1u server in your bedroom and you'll see why.
LLM-led scraping might not, as it requires an LLM to make a choice to kick it off, but crawling for training data is unlikely to be affected.
Sounds like a useful signal for people building custom agents or models. Being able to control whether automated systems follow a link via metadata is an interesting lever, especially given how inconsistent current model heuristics are.
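Sketch of what that could look like on the agent side: honor rel="nofollow" (an existing hint) plus some opt-out attribute; there's no standard for this yet, so the attribute name below is made up.

    from html.parser import HTMLParser

    # Hypothetical agent-side link filter: collect hrefs, but skip any
    # the page marks as off-limits to automation. rel="nofollow" exists
    # today; "data-noagent" is an invented stand-in for a future signal.
    class LinkFilter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.allowed: list[str] = []

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            attrs = dict(attrs)
            href = attrs.get("href")
            rel = (attrs.get("rel") or "").lower().split()
            if not href or "nofollow" in rel or "data-noagent" in attrs:
                return
            self.allowed.append(href)

    parser = LinkFilter()
    parser.feed('<a href="/a">ok</a> <a rel="nofollow" href="/b">no</a>')
    print(parser.allowed)  # ['/a']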
I can confirm Mistral refuses to traverse the links.