Anatomy of a bug hunt

Monday, June 4, 2012

Tagged: tldr, bugs, buffer overflow, programming, dreamhold


As my loyal followers know, I released Dreamhold on a Friday, and then spent most of Saturday trying to figure out why it was crashing for some players. Nobody ever wants a software release like that, but I did find the bug, so it's as much of a win as one can hope for. The bug fix just hit the App Store.

I thought it might be interesting to detail the whole sequence of events. Or maybe it'll be as dull as ditch dirt! I don't know. It seems like a good narrative to me, but I was dere, Charlie.


Friday, 5:30 pm: I get the notice that The Dreamhold has been released to the App Store. I immediately update my web sites, and tweet about it. Then I head off to a Game of Thrones party. (Last three episodes of season 1. We're the non-HBO-having crowd.)

I have my blog post mostly written up, but I don't have time to finish it before I have to leave. I figure I'll post it before I go to bed. Good thing, as it turns out.

7:30 pm: Around the time Drogo's manly chest is sliced up, I get a tweet with a screenshot of a Dreamhold crash. Oops. Several more soon arrive. I tweet that fact. Guess I know what I'm going to be doing the next day.

1:45 am: Home now, I finish up my blog post, adding the embarrassing admission of the bug.

I have not yet done any investigation, or even seen the problem myself. But I think, what could be different about the people who are afflicted? If it's crashing on startup, everyone should be in exactly the same state. Well, one thing might be different: some people are signed into Game Center, some aren't. (The game logs into Game Center on start up, so that it can notify achievements.)

So I pull out my phone, log out of GC, flip on "airplane mode" so that I can't possibly be logged in, force-quit Dreamhold, and relaunch it. Presto. Crashy-crashy, just like people said.

Now I have a handle on what might be happening -- at least I can reproduce the problem. I go to bed, vastly less worried than I was three hours ago.

Saturday, 11 am: I wake up to email from Christoph Ender, the author of the Z-machine interpreter engine I'm using. He says he thinks the problem might be in the "blockbuf" module, and he recommends that I comment it out -- that prevents the crash he's seeing in the iPhone Simulator.

Wait. What? The app crashes in the iPhone Simulator? I've never seen that. How is he doing that?

1:20 pm: Christoph explains that he just checked out a fresh copy of my IosFizmo source, created a new Xcode project, and ran it. It crashed immediately. I try this myself, and behold, it crashes immediately.

(Christoph is testing Colossal Cave, not Dreamhold, because that's the sample game that IosFizmo ships with. This means that he's not testing the Dreamhold-specific features like Game Center. But this is good! Demonstrating the bug with a simpler setup means I can debug with a simpler setup.)

Why haven't I seen this before? Well -- honesty time -- I did see it once, several weeks ago. I couldn't reproduce it, and it never happened on a real iPhone, so I eventually forgot about it. This is one of those things that you Really Shouldn't Ignore; I have no excuse. Anyhow, I can't ignore it now.

As with all flaky bugs, the first priority is to figure out a magic chicken dance to make the error appear. After a certain amount of screwing around, I settle on this: delete the app from the Simulator, quit the Simulator, launch the app. Boom. (Naturally I never did this during normal development.)

I have no idea why this works; nor do I care. When launching from a fresh start, the memory blocks line up in such a way as to demonstrate the crash. (This is why it's good to be able to reproduce it in the Simulator. Much easier to restart than a real iPhone.)

I shall repeat this dance about five hundred times over the next few hours.

2:00 pm: (Times from here out will be approximate. Also, interrupted by meals, games of Ascension, and general staring at the ceiling. The ceiling is an important work surface.)

You might think, Christoph already found the problem! Just get rid of the blockbuf module.

True -- when I do that, the problem goes away. But that doesn't mean it's fixed. Anything can make the memory blocks line up differently. (Like not being logged into Game Center, for example.) Until I find the bug, any change is just a way to ruin the magic chicken dance, not a true fix.

Instead, I turn off Fizmo's signal handler. "Signal 11" is a Unix signal; the OS sends it to kill a process when it tries to access a bad memory address. But, as everybody noticed, the app doesn't actually die. Instead, it prints a polite message about "signal 11" and waits for the player to quit.

Why? Because Fizmo is set up to catch signals and report errors this way. But, really, I don't want that. It seems like the graceful thing to do, but in fact the report tells me nothing useful. If the app just flat-out crashed, players wouldn't be any unhappier. Furthermore, if the app flat-out crashed, then the OS would have a crash log that might be forwarded to me. (If you have your device set to do that.) The signal-catching doesn't make debugging any easier either; like I said, a simpler setup is better.

So now I've got the app crashing, in the iPhone Simulator, and it's hooked up to a debugger so I can see where it dies. As it happens, it dies on this line of code:
... [win isKindOfClass:[GlkWindowGrid class]] ...

This is gibberish if you're not familiar with ObjectiveC -- sorry about that. Basically, it checks what kind of window a particular window ("win") is. It's straightforward code which should work (return YES or NO) for any valid reference.

Since it doesn't, then I know that at some previous time the "win" reference stopped being valid. Of course, I have no idea when that happened, much less why. This is why programmers hate memory errors. (And languages that permit them.)

2:30 pm: Time to investigate the blockbuf module. This is the part of Fizmo that remembers what the status line looks like.

Here's the joke: IosFizmo doesn't even use the blockbuf module. It has its own module to track the status line. (The "GlkWindowGrid" class mentioned above.) The Fizmo part of the software sets up blockbuf, and I never got around to turning it off. In fact, for several weeks I had "Disable blockbuf?" on my to-do list! But the one time I tried it, I ran into some picayune problem -- not a crash, I think it was a compiler error -- and so I decided that it was important for some reason. So I crossed it off my list with a big fat "whatever".

Sigh. If only.

Anyway, blockbuf is a simple piece of code. I go through and stick in tests that will alert me if any blockbuf function writes past the end of a memory block. Unfortunately, when I run the app, none of the alerts go off. So much for that theory.

3:00 pm: I know that the crash occurs while the game is trying to draw the status line. Maybe I can pin that down more precisely.

I go through my status-line code, putting in "isKindOfClass" calls every time the window is updated. These calls don't do anything -- I'm checking the kind of a window that I already know what kind it is, and then throwing away the answer. The point is to see when the call starts crashing.

This is tedious, but it seems to be crashing... at the same point it was before. That is, the first time the cursor (the printing location) is relocated. (To draw the status line, the game moves the cursor to the top left corner, and then starts printing text.)

3:30 pm: Okay, the heck with that. I'm going to put in an "isKindOfClass" call after every single Z-machine opcode. That'll pin it down.

It works! The app crashes -- after the very first opcode. Huh.

For good measure, I put in another "isKindOfClass" call before every single opcode. And indeed, the app crashes before the first opcode.

Well, that certainly narrows it down. Memory is getting corrupted in the interpreter setup code! I put in more calls in the middle of the setup sequence. Nope, it's earlier than that. I try the beginning of the sequence. Nope, the problem has already occurred by then.

This is not what I expected. This is really odd.

Then I realize that my test call itself has a bug. I haven't been tracking down the bug, I've been crashing the app with my own test code. Oops. Throw away everything I've learned in the past half-hour.

4:30 pm: My test code does the right thing now, so it's back to the salt mines.

The app appears to be crashing after a "print character" opcode. (In fact it's printing a space character.) This isn't too surprising. Printing means adding new data to memory, and that could be corrupting something if it isn't handled right. (If the app crashed after adding two numbers, or something like that, I'd be surprised.)

How shall I deal with this, though? I can't set up alerts on every single memory-write in the printing path. Half the code in the app deals with printing text, one way or another. (It's a text adventure engine.)

5:00 pm: Bright idea time. The app always crashes after the same printing opcode. It's at address 56790. (Or something like that. I don't remember the exact number.) Fine -- I'll stick in a line that pauses execution right before address 56790. And then I'll single-step through the app and see if it does anything funny.

(You wanted to know why the magic chicken dance is so important? I really need the app to crash exactly the same way every time I test it. Because of moments like this.)

So, delete the app (again), quit the Simulator (again), start it up...

5:30 pm: This is a lot of single-stepping.

Fizmo converts the space character to Unicode. (Which is the same character value as before, but it still has to go through the function.) No overflow there.

Fizmo passes the character off to my iOS printing code.

My code passes it off to the GlkWindowGrid class, which stores it in a new memory block. All properly handled. (As far as I can tell. I'm going to feel really stupid if memory gets squashed in front of my eyes and I miss it...)

My code checks to see if transcripting is on. (It's not.)

Back to Fizmo.

Fizmo hands the character over to blockbuf.

Blockbuf passes my overflow alert (remember the overflow alert?) without a peep.

Blockbuf goes to write the space character at coordinates (-1, 0). Hold it right there.

Why is the X coordinate less than zero? Actually, I don't even care. This is the problem spot. Also, I'm an idiot. Hours ago, I stuck in an alert that would go off if blockbuf tried to write after the end of its memory block. I forgot to check for writing before the beginning of the memory block. If I had done that, I would have found the problem immediately.

6:30 pm: I suppose I should figure out how blockbuf got into that state.

It's not hard to see, now that I look. The way I'm setting up the interpreter (on iOS) leads to Fizmo thinking that the window width is zero, at first. Then, right after that, it gets resized to the "real" width.

But before that happens, it hits a line that tries to ensure that the current X coordinate is non-negative, but less than the width. Well, there is no integer which is non-negative but less than zero. It checks the conditions in that order... so the second one wins, and X winds up as -1.

7:00 pm: Time to unwind all the alerts and debugging code I've piled into this thing.

In the end, the right plan was the first plan: I have to get rid of the blockbuffer module. (But now I know that will fix the problem, not just sweep it under a rug.) The changes for Dreamhold 1.0.1 are just two lines of code:
#define DISABLE_SIGNAL_HANDLERS 1
#define DISABLE_BLOCKBUFFER 1

7:30 pm: Test in the Simulator. Test on my iPhone. Test on my iPad. No crashes.

7:46 pm: BUG FOUND..

(Then of course it took a week for Apple to review it, because of their usual queue plus the holiday weekend. No, we don't need to get into App Store control issues here.)