AI web scrapers: a data point
Wednesday, June 4, 2025
Comments: 9 (latest 7 hours later)
Tagged: ai, llms, web, ifarchive
We all know that the Web is currently under attack by AI companies trying to turn scraped data into venture capital. I'd link to the early article I saw sounding the alarm, but I can't find it because there are hundreds of search hits on "ai bot scraper problems". I guess this article (arstechnica, March) was a big one.
This hit home for me when IFWiki started to show intermittent errors from server load. The server admins for IFTF and IFWiki are currently looking into solutions for that, so I will say no more about it. (I'm not the IFTF tech guy any more!)
However, I am still the IF Archive guy, so I took a look at its logs. Turns out the Archive is getting hammered in the same way. It's just not causing any problems. The IF Archive is entirely static files (except for the search widget). Cloudflare over Apache on static files can handle this load without breaking a sweat.
But I spent a bit of time analyzing the log data. Here's 15 hours of user-agent strings from yesterday:
| count | from |
|---|---|
| 111784 | hits total |
| 48050 | Scrapy |
| 16211 | GPTBot |
| 15097 | (misc strings containing "bot") |
| 11782 | ClaudeBot |
| 7530 | Amazonbot |
| 4377 | (no user-agent string) |
That leaves 8737 hits that are either human or at least vaguely bothering to pass as human. Not great! I'm not differentiating between LLM scrapers and old-fashioned search crawlers here, but it's obviously mostly LLM stuff.
(Note that Cloudflare is set up to cover a lot of the site, but not the index pages, which change frequently. I know this is imperfect practice but it's been okay. Apache and static pages! Hits served out of Cloudflare's cache never reach these logs, so the above numbers are representative of our traffic, but they're not all of our traffic.)
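For the curious, the tallying itself is nothing fancy. Here is a minimal sketch of the sort of thing I mean (not my actual script), assuming Apache's combined log format, where the user-agent is the last quoted field on each line:

```python
# Minimal sketch (not my actual script): tally user-agent strings from an
# Apache combined-format access log, where the user-agent is the last
# quoted field on each line. The filename is made up.
from collections import Counter

counts = Counter()
with open("access.log") as log:
    for line in log:
        # each line ends with ... "referer" "user-agent"
        parts = line.rstrip().rsplit('"', 2)
        ua = parts[1] if len(parts) == 3 and parts[1] else "(no user-agent string)"
        counts[ua] += 1

print("total hits:", sum(counts.values()))
print("unique user-agent strings:", len(counts))
for ua, n in counts.most_common(10):
    print(f"{n:8d}  {ua}")
```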
But the interesting thing is, they're mostly not bothering to pass as human. Half of them openly identify as Scrapy, which is an open-source scraping framework. So the question is, how much of this traffic is well-behaved? I know the common wisdom is "none of it", but it's worth checking, right?
So I added a simple robots.txt file that explicitly blocked Scrapy and a couple of the other top user-agents. What do we find? Another 15-hour block, one day later:
| count | from |
|---|---|
| 146434 | hits total |
| 38009 | (one common version of Safari) |
| 34123 | Scrapy |
| 26825 | (randomized versions of Safari) |
| 9243 | Amazonbot |
| 5374 | (no user-agent string) |
| 4079 | ClaudeBot |
| 699 | GPTBot |
When I say "randomized version of Safari", I mean like
Mozilla/5.0 (Linux; U; Android 15; zh-CN; V2364A Build/AP3A.240905.015.A2) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/123.0.6312.80 Quark/7.8.0.751 Mobile Safari/537.36
...but with the build numbers randomized and other strings jammed in.
In other words, a lot of these bots are checking for a robots.txt file. When they see one, they jumble up their user-agent string and keep going. Yesterday's user-agent list had 641 unique strings; today's had almost 18000. It would be hilarious if it weren't assholes destroying the Internet for speculative profit.
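(If you wanted to lump that whole family back into one bucket instead of 18000 "unique" agents, a rough normalization would do it. Purely a sketch; the patterns below are guessed from the example above.)

```python
import re

# Purely illustrative: collapse the randomized bits of those "Safari"
# user-agents (Android build IDs, dotted version numbers) so the whole
# family counts as one bucket instead of thousands of "unique" agents.
def normalize_ua(ua: str) -> str:
    ua = re.sub(r"Build/[A-Za-z0-9.]+", "Build/X", ua)  # Android build IDs
    ua = re.sub(r"\d+(\.\d+)+", "N", ua)                # dotted version numbers
    return ua
```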
The numbers also imply a lot of bots that don't do this: they ignore the robots.txt entirely. (Which is what I expected.) Looks like Scrapy and Amazonbot are the most prone to ignoring it. In contrast, GPTBot and ClaudeBot dropped way off.
(Going by the user-agent strings! I have no reason to think "ClaudeBot" hits are really from the company Anthropic, or "GPTBot" from OpenAI. I haven't made any attempt to geolocate IP addresses.)
I guess you could ask why the bots don't always randomize or hide their user-agent strings. Maybe the CPU cost of string randomization is noticeable at the scale they're running.
As I said, the IF Archive doesn't have a load problem right now. So I don't need to change anything. The robots.txt file was an experiment. The only practical change (since yesterday) is that total hits went up 30%. No clue if the robots.txt caused that, but it certainly didn't help any, so I deleted it.
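(For the record, the file was nothing fancy. Think something on the order of this sketch, with the user-agent names taken from the first table; it's not the exact text I used.)

```
# Not the exact file, but on this order; user-agent names taken from the
# first table above.
User-agent: Scrapy
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Amazonbot
Disallow: /
```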
The numbers imply a possible strategy where you don't use robots.txt, but instead configure your server to block the worst user-agent strings. This isn't a simple fix, though. I gather that if you throw a 403 error, the bots will retry with different strategies until they get through. So you need to provide some fake-real content. I haven't tried this.
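(The blocking half is easy enough to sketch in Apache mod_rewrite terms. This is untested, the user-agent list is just pulled from the tables above, and the decoy page is left as an exercise.)

```apache
# Untested sketch: match the worst user-agent strings (names pulled from the
# tables above) and quietly serve a small static decoy page instead of a 403.
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/decoy\.html$
RewriteCond %{HTTP_USER_AGENT} (Scrapy|GPTBot|ClaudeBot|Amazonbot) [NC]
RewriteRule .* /decoy.html [L]
```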
At the high end, this turns into the "AI labyrinth" strategy, which gets a lot of attention these days. I am faintly skeptical -- it seems like an arms race which will waste CPU time on both sides. I don't have AI VC money to burn on that race. However, I haven't tried any of those solutions. We might; Cloudflare is pushing such a feature, and like I said, we use Cloudflare for some things. We shall discuss it.
The other anti-bot strategy that comes up is the client-side proof-of-work challenge, like Anubis. (Aka "force your browser to solve sudokus for heroin access.") That one is going to hose me in particular, because I do a lot of browsing with Javascript turned off. And the IF Archive is committed to serving content without requiring Javascript. That's not an IFTF policy, though -- other services have different requirements.
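(For anyone unfamiliar with the idea: the challenge is hashcash-style. The sketch below is not Anubis's actual protocol, just the general concept, and in Python rather than the browser-side Javascript.)

```python
# Not Anubis's actual protocol, just the hashcash-style idea behind these
# challenges: the client must burn CPU to find a nonce whose hash clears a
# difficulty bar, and the server can verify the answer almost for free.
import hashlib
import itertools

def solve(challenge: str, difficulty: int = 20) -> int:
    """Find a nonce such that SHA-256(challenge + nonce) has `difficulty`
    leading zero bits. Slow on purpose."""
    target = 1 << (256 - difficulty)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 20) -> bool:
    """Cheap server-side check of a submitted nonce."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty))
```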
We'll see if the proof-of-work strategy gets widely adopted. The Anubis docs say "In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin." Again, I have no practical experience here.
More mitigation news as it happens.
Comments from Bluesky
Comments from Mastodon
@zarfeblong Hm, I thought I had replied to this, but now it's not showing up. Anyway, on my website, the AI scraper hits outnumber genuine visitors by about 100:1. The server is clearly struggling; it isn't built for this kind of load. I seriously wonder why they keep doing it. They must all have grabbed everything I have thousands of times by now.
@mr_creosote I wonder if there’s a “number go up” factor. You don’t make money by training an AI model, you make money by giving presentations to the investors with charts saying “We added 500 zigabytes of training data this month!”
@zarfeblong Then why not just make up numbers for the charts and leave us alone?
@mr_creosote @zarfeblong My non-expert expectation is that it's a very wide "they". Every dingaling anywhere in the world can come up with an idea that starts with "First, we set up and train an LLM", and the world is a very big place.
@zarfeblong I wonder if it's feasible to use CSS text obfuscation: https://shkspr.mobi/blog/2023/02/how-to-password-protect-a-static-html-page-with-no-js/
(JS for supporting screen readers, pure CSS for supporting sighted users with JS disabled)
Right now JS sudoku is unfortunately the best deterrent and user-agent blocks are the second best.
@zarfeblong .... 15 hours!?!?!
I'm trying to figure out if
a.) Wordpress has automated stuff to protect against bots and it is triggering (I know it does against spam pretty well, but that's comments, not visits)
b.) Wordpress is getting that many hits but it doesn't bother to show me
c.) My site just isn't on the radar of AI stuff (I know I've had the very occasional visit from chatgpt.com, so it can't be zero, but maybe it's them scraping from somewhere else that has linked me)
@jdyer My personal site and blog get almost none of this bot traffic. I don’t know what the difference is.