AI web scrapers: a data point
Wednesday, June 4, 2025
Comments: 9 (latest 7 hours later)
Tagged: ai, llms, web, ifarchive
We all know that the Web is currently under attack by AI companies trying to turn scraped data into venture capital. I'd link to the early article I saw sounding the alarm, but I can't find it because there are hundreds of search hits on "ai bot scraper problems". I guess this article (arstechnica, March) was a big one.
This hit home for me when IFWiki started to show intermittent errors from server load. The server admins for IFTF and IFWiki are currently looking into solutions for that, so I will say no more about it. (I'm not the IFTF tech guy any more!)
However, I am still the IF Archive guy, so I took a look at its logs. Turns out the Archive is getting hammered in the same way. It's just not causing any problems. The IF Archive is entirely static files (except for the search widget). Cloudflare over Apache on static files can handle this load without breaking a sweat.
But I spent a bit of time analyzing the log data. Here's 15 hours of user-agent strings from yesterday:
| count | from |
|---|---|
| 111784 | hits total |
| 48050 | Scrapy |
| 16211 | GPTBot |
| 15097 | (misc strings containing "bot") |
| 11782 | ClaudeBot |
| 7530 | Amazonbot |
| 4377 | (no user-agent string) |
That leaves 8737 hits that are either human or at least vaguely bothering to pass as human. Not great! I'm not differentiating between LLM scrapers and old-fashioned search crawlers here, but it's obviously mostly LLM stuff.
(Note that Cloudflare is set up to cover a lot of the site, but not the index pages, which change frequently. I know this is imperfect practice but it's been okay. Apache and static pages! Hits served out of Cloudflare's cache never reach these logs, so the above numbers are representative of our traffic, but they're not all of our traffic.)
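For the curious, the tallying itself is nothing fancy. Here is a minimal sketch of the sort of thing I mean (not my actual script), assuming Apache's combined log format, where the user-agent is the last quoted field on each line:

```python
# Minimal sketch (not my actual script): tally user-agent strings from an
# Apache combined-format access log, where the user-agent is the last
# quoted field on each line. The filename is made up.
from collections import Counter

counts = Counter()
with open("access.log") as log:
    for line in log:
        # each line ends with ... "referer" "user-agent"
        parts = line.rstrip().rsplit('"', 2)
        ua = parts[1] if len(parts) == 3 and parts[1] else "(no user-agent string)"
        counts[ua] += 1

print("total hits:", sum(counts.values()))
print("unique user-agent strings:", len(counts))
for ua, n in counts.most_common(10):
    print(f"{n:8d}  {ua}")
```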
But the interesting thing is, they're mostly not bothering to pass as human. Half of them openly identify as Scrapy, which is an open-source scraping framework. So the question is, how much of this traffic is well-behaved? I know the common wisdom is "none of it", but it's worth checking, right?
So I added a simple robots.txt file that explicitly blocked Scrapy and a couple of the other top user-agents. What do we find? Another 15-hour block, one day later:
| count | from |
|---|---|
| 146434 | hits total |
| 38009 | (one common version of Safari) |
| 34123 | Scrapy |
| 26825 | (randomized versions of Safari) |
| 9243 | Amazonbot |
| 5374 | (no user-agent string) |
| 4079 | ClaudeBot |
| 699 | GPTBot |
When I say "randomized version of Safari", I mean like
Mozilla/5.0 (Linux; U; Android 15; zh-CN; V2364A Build/AP3A.240905.015.A2) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/123.0.6312.80 Quark/7.8.0.751 Mobile Safari/537.36
...but with the build numbers randomized and other strings jammed in.
In other words, a lot of these bots are checking for a robots.txt file. When they see one, they jumble up their user-agent string and keep going. Yesterday's user-agent list had 641 unique strings; today's had almost 18000. It would be hilarious if it weren't assholes destroying the Internet for speculative profit.
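(If you wanted to lump that whole family back into one bucket instead of 18000 "unique" agents, a rough normalization would do it. Purely a sketch; the patterns below are guessed from the example above.)

```python
import re

# Purely illustrative: collapse the randomized bits of those "Safari"
# user-agents (Android build IDs, dotted version numbers) so the whole
# family counts as one bucket instead of thousands of "unique" agents.
def normalize_ua(ua: str) -> str:
    ua = re.sub(r"Build/[A-Za-z0-9.]+", "Build/X", ua)  # Android build IDs
    ua = re.sub(r"\d+(\.\d+)+", "N", ua)                # dotted version numbers
    return ua
```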
The numbers also imply a lot of bots that don't do this: they ignore the robots.txt entirely. (Which is what I expected.) Looks like Scrapy and Amazonbot are the most prone to ignoring it. In contrast, GPTBot and ClaudeBot dropped way off.
(Going by the user-agent strings! I have no reason to think "ClaudeBot" hits are really from the company Anthropic, or "GPTBot" from OpenAI. I haven't made any attempt to geolocate IP addresses.)
I guess you could ask why the bots don't always randomize or hide their user-agent strings. Maybe the CPU cost of string randomization is noticeable at the scale they're running.
As I said, the IF Archive doesn't have a load problem right now. So I don't need to change anything. The robots.txt file was an experiment. The only practical change (since yesterday) is that total hits went up 30%. No clue if the robots.txt caused that, but it certainly didn't help any, so I deleted it.
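(For the record, the file was nothing fancy. Think something on the order of this sketch, with the user-agent names taken from the first table; it's not the exact text I used.)

```
# Not the exact file, but on this order; user-agent names taken from the
# first table above.
User-agent: Scrapy
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Amazonbot
Disallow: /
```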
The numbers imply a possible strategy where you don't use robots.txt, but instead configure your server to block the worst user-agent strings. This isn't a simple fix, though. I gather that if you throw a 403 error, the bots will retry with different strategies until they get through. So you need to provide some fake-real content. I haven't tried this.
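(The blocking half is easy enough to sketch in Apache mod_rewrite terms. This is untested, the user-agent list is just pulled from the tables above, and the decoy page is left as an exercise.)

```apache
# Untested sketch: match the worst user-agent strings (names pulled from the
# tables above) and quietly serve a small static decoy page instead of a 403.
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/decoy\.html$
RewriteCond %{HTTP_USER_AGENT} (Scrapy|GPTBot|ClaudeBot|Amazonbot) [NC]
RewriteRule .* /decoy.html [L]
```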
At the high end, this turns into the "AI labyrinth" strategy, which gets a lot of attention these days. I am faintly skeptical -- it seems like an arms race which will waste CPU time on both sides. I don't have AI VC money to burn on that race. However, I haven't tried any of those solutions. We might; Cloudflare is pushing such a feature, and like I said, we use Cloudflare for some things. We shall discuss it.
The other anti-bot strategy that comes up is the client-side proof-of-work challenge, like Anubis. (Aka "force your browser to solve sudokus for heroin access.") That one is going to hose me in particular, because I do a lot of browsing with Javascript turned off. And the IF Archive is committed to serving content without requiring Javascript. That's not an IFTF policy, though -- other services have different requirements.
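(For anyone unfamiliar with the idea: the challenge is hashcash-style. The sketch below is not Anubis's actual protocol, just the general concept, and in Python rather than the browser-side Javascript.)

```python
# Not Anubis's actual protocol, just the hashcash-style idea behind these
# challenges: the client must burn CPU to find a nonce whose hash clears a
# difficulty bar, and the server can verify the answer almost for free.
import hashlib
import itertools

def solve(challenge: str, difficulty: int = 20) -> int:
    """Find a nonce such that SHA-256(challenge + nonce) has `difficulty`
    leading zero bits. Slow on purpose."""
    target = 1 << (256 - difficulty)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 20) -> bool:
    """Cheap server-side check of a submitted nonce."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty))
```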
We'll see if the proof-of-work strategy gets widely adopted. The Anubis docs say "In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin." Again, I have no practical experience here.
More mitigation news as it happens.
Comments from Bluesky
Comments from Mastodon
@zarfeblong Hm, I thought I had replied to this, but now it's not showing up. Anyway, on my website, the AI scraper hits outnumber genuine visitors by about 100:1. The server is clearly struggling; it isn't built for this kind of load. I seriously wonder why they keep doing it. They must all have grabbed everything I have thousands of times by now.
@mr_creosote I wonder if there’s a “number go up” factor. You don’t make money by training an AI model, you make money by giving presentations to the investors with charts saying “We added 500 zigabytes of training data this month!”
@zarfeblong Then why not just make up numbers for the charts and leave us alone?
@mr_creosote @zarfeblong My non-expert expectation is that it's a very wide "they". Every dingaling anywhere in the world can come up with an idea that starts with "First, we set up and train an LLM", and the world is a very big place.
@zarfeblong I wonder if it's feasible to use CSS text obfuscation: https://shkspr.mobi/blog/2023/02/how-to-password-protect-a-static-html-page-with-no-js/
(JS for supporting screen readers, pure CSS for supporting sighted users with JS disabled)
Right now JS sudoku is unfortunately the best deterrent and user-agent blocks are the second best.
@zarfeblong .... 15 hours!?!?!
I'm trying to figure out if
a.) Wordpress has automated stuff to protect against bots and it is triggering (I know it does against spam pretty well, but that's comments, not visits)
b.) Wordpress is getting that many hits but it doesn't bother to show me
c.) My site just isn't on the radar of AI stuff (I know I've had the very occasional visit from chatgpt.com, so it can't be zero, but maybe it's them scraping from somewhere else that has linked me)
@jdyer My personal site and blog get almost none of this bot traffic. I don’t know what the difference is.