A question for Google people -- Usenet dump?

Saturday, November 3, 2012

Comments: 9   (latest November 13)

Tagged: newsgroups, ifarchive, interactive fiction, archiving, if, usenet

Here's a question for the data-liberation people at Google. (I know some people who work for Google, but I'm not in contact with whoever can directly answer this.)

For some fifteen years, most of the online discussion about interactive fiction -- probably most of the IF discussion, period -- happened on two Usenet groups: rec.arts.int-fiction and rec.games.int-fiction.

We have archives of those discussions from 1992-1997 and some of 1999-2002. (See IFArchive directories for RAIF and RGIF.) Outside those ranges, we rely on Google and its Groups service -- as you can tell from my two links above.

Google Groups has historically been iffy about Usenet. It started by acquiring the Deja News post archive (which itself only started in 1995, and was not completely preserved). Google's Groups service was then built on top of that -- rather in the sense of a rhinoceros being built on top of an old rollerskate -- and its Usenet access dwindled in priority. Its indexing was famously gappy for many years, although Google fixed that a couple of years ago.

I could get into a long post about Google's treatment of Usenet and its long-term consequences, but that's not this post. My question: we, the IF community, would like to hold our own data here. What's the best way for me to get a complete dump of all messages posted to those two Usenet groups, ever?

Scraping through the Google Groups web interface is a way to do this, but it's not very good, for a few reasons. (a) Google tends to shut down automated trawlers after some number of requests. (b) I'd have to deal with an extra layer of content encoding, which leaves more room for the encoding to go wrong. (c) I don't know if Google's indexing is really complete, even now.

So it would be way better if some nice Google person could tap it at the source and send me a tar file. Or a DVD, or a hard drive, whatever. Anybody?

The qualifications:
  • Obviously there's no such thing as complete. I'll take whatever Google has, and merge it with the Archive records.
  • I mean all posts with either rec.arts.int-fiction or rec.games.int-fiction in the Newsgroups: line. I also want all the crossposts, including the off-topic ones, the ones troll-crossposted to a zillion irrelevant groups, all of them. Think Newsgroups:.*rec\.(arts|games)\.int-fiction.*
  • I think I want spam, too. Probably. It depends on how much spam there is. (Google's index lets through a lot of spam, but maybe there's a thousand times as much which it doesn't show.) Tell me if it's horrible, we'll discuss it.
  • Original post file format, if possible.
  • My intent is to take whatever I get, ball it up, and stick it on the Archive. Then (at some point, not necessarily soon) I will go through, cull out the off-topic trolls and spam, and post it as a nice browsable web site on the Archive. Or maybe somebody else will do that part. Collect data first; massage later.
  • This is a one-shot request, as discussion on those newsgroups has mostly (not entirely) ceased as of a couple of years ago. A dump from beginning-of-time through this month is fine. (The community has shifted to intfiction.org these days. Archiving the web forum is a separate topic, which I also have feelers out on.)
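
To make the Newsgroups: test above concrete, here is a rough Python sketch. It assumes each post is available as raw text with its original headers; the name is_if_post is made up for illustration, and none of this reflects how Google or the Archive actually stores messages. It checks only the Newsgroups: header, so a post that merely mentions a group name in its body is not selected.

```python
import re
from email.parser import HeaderParser

# Same pattern as in the list above, minus the "Newsgroups:" prefix,
# because we match against the header value only.
IF_GROUPS = re.compile(r"rec\.(arts|games)\.int-fiction")


def is_if_post(raw_message: str) -> bool:
    """Return True if the Newsgroups: header names either IF group.

    Hypothetical helper: crossposts count, however many other groups
    are listed alongside.
    """
    # Parse only the header block; the body is ignored.
    headers = HeaderParser().parsestr(raw_message, headersonly=True)
    newsgroups = str(headers.get("Newsgroups", ""))
    return IF_GROUPS.search(newsgroups) is not None
```

Like the pattern in the list, this is deliberately loose: it keeps crossposts and spam, which fits the collect-first, massage-later plan.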

If you can help, please comment here, or email me (erkyrath@eblong.com). Thanks.
