Download the whole IF Archive
Wednesday, February 21, 2024
Comments: 24 (latest April 10)
Tagged: if, interactive fiction, ifarchive, archiving
I help run the IF Archive. I have for, oh, about 25 years now.
It's not a demanding job. Mostly the server just runs itself. We have a cadre of volunteers who file the games and write up the descriptions. (Thank them!)
Occasionally we change out some of the underlying server configuration, like when we started using a CDN for load balancing. But that's, like, once every few years.
Low maintenance is great. However, it means that we don't respond to feature requests very quickly. Or at all, sometimes.
Here's one we've never had a good answer for:
Dear IF Archive: I would like to download all your files so I can play all the games. How do I do that? Love, Suzie.
(Simulated request on closed track. Real-life Suzies may vary in their IF enthusiasm.)
When people ask this question, we mostly shrug and point at web-scraper tools. It's not hard to find all the files by browsing the folders. Or you can look at Master-Index.xml, which lists every individual file in easy-to-parse form. (Well, "easy" -- that's 13 megabytes of XML right there. Sorry, I hadn't heard of JSON at that point.)
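If you'd rather script against the index than eyeball it, a streaming parse keeps the memory footprint down. Here's a rough Python sketch -- the element and child names in it are assumptions rather than the documented schema, so check the real file before trusting them:

```python
# Minimal sketch: stream-parse a large Master-Index.xml without loading the
# whole 13 MB document into memory at once. The "file" element and "path"
# child are assumed names -- verify against the real schema.
import xml.etree.ElementTree as ET

def list_archive_files(index_path="Master-Index.xml"):
    paths = []
    # iterparse yields elements as their end tags are seen, so we can
    # inspect each entry and then discard it.
    for _event, elem in ET.iterparse(index_path, events=("end",)):
        if elem.tag == "file":                # assumed element name
            path = elem.findtext("path")      # assumed child element
            if path:
                paths.append(path)
            elem.clear()                      # free the parsed subtree
    return paths

if __name__ == "__main__":
    print(len(list_archive_files()), "files listed in the index")
```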
For a while in the early 2000s we allowed a few people to use rsync to copy the Archive files and offer mirror servers. However, this was a serious hassle (early-2000s Linux firewall config? Not friendly) and it never entirely seemed worth the trouble. The CDN is much easier.
But then, how hard would it be to shove all the files into one big package and make that available for downloading? Disk space is cheap.
Long story short, on an experimental basis, we did that. The documentation is here, but it's short so I'll just play you the chorus.
If you want to download everything on the IF Archive in a single massive chunk, use this URL:
https://iftf-ifarchive-download.s3.amazonaws.com/ifarchive-all.tar.gz
That's 30 gigabytes, no foolin', which is why I haven't made that a hyperlink. If you grab that puppy, it should be on purpose. (And I don't particularly want automated web crawlers to grab it either. I mean, they will, but I'm not going to encourage them. AWS download cost is pennies but the pennies add up.)
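If you do grab it on purpose, a resumable download saves some grief when a 30 GB transfer dies partway through. Here's a rough Python sketch using the third-party requests library -- the only Archive-specific bit is the URL above:

```python
# Rough sketch of a resumable download (pip install requests). If the
# transfer is interrupted, rerunning the script continues from where the
# partial file left off via an HTTP Range request.
import os
import requests

URL = "https://iftf-ifarchive-download.s3.amazonaws.com/ifarchive-all.tar.gz"
DEST = "ifarchive-all.tar.gz"

def download(url=URL, dest=DEST, chunk_size=1024 * 1024):
    done = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={done}-"} if done else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        if resp.status_code == 416:   # range not satisfiable: file already complete
            return
        resp.raise_for_status()
        mode = "ab" if resp.status_code == 206 else "wb"   # 206 = server honored the range
        with open(dest, mode) as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)

if __name__ == "__main__":
    download()
```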
That 30 GB file is updated weekly. It will grow over time, of course, but it's hard to say how fast.
Availability is subject to future review, as they say. It's an experiment! We'll see how the AWS fees stack up against utility.
I realize that very few of my loyal readers will have need of this feature. The number of people who web-scrape the Archive out of personal interest (rather than, you know, being a web-scraper bot) is probably countable on a grue's molars. But if that's you, feel free to try this new one-stop freebie.
Speaking of which: Would it be useful to have another download link for recent files? Say, all files touched in the past 30 days. That wouldn't be too hard to arrange.
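(For the curious: on the server side, that job could be as small as the sketch below -- walk the tree, pick up anything with a recent modification time, tar it up. The paths here are invented for illustration; this is not how the Archive actually builds its packages.)

```python
# Rough sketch of a "recent files" bundle: collect everything modified in
# the last 30 days into a single tar.gz. The directory and output names
# are hypothetical, not the Archive's real layout.
import os
import tarfile
import time

ARCHIVE_ROOT = "/var/ifarchive/if-archive"   # hypothetical path
OUTPUT = "ifarchive-recent.tar.gz"
CUTOFF = time.time() - 30 * 24 * 3600         # 30 days ago

with tarfile.open(OUTPUT, "w:gz") as tar:
    for dirpath, _dirnames, filenames in os.walk(ARCHIVE_ROOT):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getmtime(full) >= CUTOFF:
                # Store paths relative to the archive root.
                tar.add(full, arcname=os.path.relpath(full, ARCHIVE_ROOT))
```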
Comments from Mastodon
@technomancy It’s been suggested, but is bittorrent a good solution for a file that’s updated weekly? Seems like that would just keep out-of-date versions in circulation.
@zarfeblong oh, I see; no you're right, that complicates things
It'd probably need to be broken up by year or something in which case the all-in-one-ness factor no longer works
Maybe an annual updated torrent for the history so far and a downloaded archive for the updates since then?
Or, terrible idea: put it all in a git repo so that people regularly updating will only get the new files.
@zarfeblong sweet! we had 30GiB burning a hole in our laptop
@irenes Heh.
I still think of 30 gig as *enormous*. Which is just to say I’m old.
@zarfeblong we totally had a thread a couple weeks ago with our last ill-advised archival download in which we compared the amount of space remaining on our drive after the download to the size of the 10 MiB hard drive that our father VERY UNWISELY bought in 1987 (we once looked up historical prices for the hardware)
@zarfeblong which is to say it feels big to us too, it's just few things seem like a better use of the space to us than IF games
@zarfeblong lol and our computer overheated as soon as we started the download :D
but anyway, in answer to your question on the post about whether incremental updates of some sort are helpful - for us personally, yes but not hugely so. our primary goal is archival, we'll probably just pull the full thing once a year.
@zarfeblong I was going to reply to the bit about making a recent-additions download with "You know what would be even more useful than that? A recent-additions RSS feed!" but then I looked and sure enough, there already is one. It took some poking around to find it, though.
@CarlMuckenhoupt We can mention that more explicitly! Thanks.
@zarfeblong "Oh cool, my buddy @sargent would be interested in thi-- OH, HE'S ALREADY A VOLUNTEER ADVISOR ON THE BOARD." 😅
@zarfeblong modarchive just makes yearly torrents, first one is "everything up to 2007"
@zarfeblong So how many molars does a grue have, anyway? Is the idea "very few" because they're all about the slavering fangs?
(Refrains from making a joke about an Adam Cadre of volunteers... er, oops)
@zarfeblong See I'd actually be interested in interactive fiction if I knew of a good free or paid client that works well with screen readers. Like seriously I've tried two of them and they both sucked so no.
@rooktallon I just saw someone discussing the Spatterlight client: https://tweesecake.social/@pixelate/111967043313834710
You might want to be aware of Club Floyd, in which people play IF cooperatively via a MUD (without really using any features of the MUD that aren't present in IRC). https://www.ifwiki.org/ClubFloyd is a good starting point to read about it; I think it's still active.
The reason I mention that is that they have similar requirements to what I'd expect a screen reader user would want, or possibly somewhat stricter ones. That said, the standard approach there does restrict the set of playable games.
Would it make sense to also have people seed that file as a torrent?
If so, it might make sense to compress it using `gzip --rsyncable`, so that subsequent versions are more similar to each other. I've roughly estimated (using first 5%) that this would result in ~1% larger files, so the cost is IMO very small.
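(For reference, a rough sketch of that pipeline: a plain tar piped through `gzip --rsyncable`, which resets the compressor state at intervals so small changes in the input don't ripple through the whole compressed file. This assumes a gzip build that has the flag -- recent GNU gzip does, though some older systems only have it as a distro patch. The paths are placeholders, not the Archive's real build process.)

```python
# Sketch: build an rsync-friendly tarball by piping tar through
# "gzip --rsyncable". The source directory and output name are
# placeholders, not the Archive's real build process.
import subprocess

SOURCE_DIR = "if-archive"          # placeholder input directory
OUTPUT = "ifarchive-all.tar.gz"

with open(OUTPUT, "wb") as out:
    tar = subprocess.Popen(["tar", "-cf", "-", SOURCE_DIR],
                           stdout=subprocess.PIPE)
    subprocess.run(["gzip", "--rsyncable", "-9"],
                   stdin=tar.stdout, stdout=out, check=True)
    tar.stdout.close()
    tar.wait()
```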
Comments from Andrew Plotkin
So, a quick interim report:
This blog post (and a similar announcement on the IF forum) generated a big spike in traffic, which cost about $60 in AWS bandwidth fees. However, that was brief -- just the first three days.
I've also been considering the suggestions above, and on the forum. Also Jason Scott jumped in with some advice -- like he knows anything about archiving files. (Rim shot.)
I like the idea of regenerating the giant file annually, instead of weekly. And then having a separate (much smaller) download link for "all files touched in the past year." I haven't set this up yet, though.
I am also reconsidering the idea of public rsync access. (On the main Archive machine, not S3.) Rsync isn't protected by the CDN, but it's not like there are rsync-scraping robots trawling the Internet. So probably it will be okay? I haven't set this up either, but I will give it a try.
Finally, many people mentioned Bittorrent. I don't think this is a sensible use case for Bittorrent. I mean, look at that access pattern: several enthusiasts grabbed it in the first three days, and then nothing. Is that going to be a well-distributed seed? Probably not.
More news when I make the additional config changes.
Further update:
The bandwidth cost has been irregular, but it totaled about $25 in March. I've decided the average is too high for the value this service provides. So I've disabled the link above. Sorry! I said it was an experiment.
We will proceed with the rsync plan.
@zarfeblong neat! seems like it could be a good fit for a torrent too if there's more interest in the future