Wanderings in the realm of diff

Sunday, February 9, 2025   (updated 2 days later)

Comments: 9 (plus live)   (latest 1 day later)

Tagged: diff, algorithms

So I was thinking about a text game idea where paragraphs of text change between a handful of possibilities as you make choices. Sort of a cycling choices effect.

But maybe instead of just replacing one block of text with another, I could do something visually neat. Like cross-fading. Or, how about this, maybe all the words could rearrange from one text into another. Some of them would move, some would appear, some would disappear...

= The    |  = The
         |  + rough
         |  + beast
         |  + ,
         |  + its
= hour   |  = hour
- is     | 
= come   |  = come
         |  + round
= at     |  = at
= last   |  = last
- ,      | 
- but    | 
- not    | 
- the    | 
- beast  | 
= ...    |  = ...

I'm treating punctuation marks as separate words for this example.

Well, that's just a diff function, right? Sure, you have to do all the animation work on top of that, which means text-layout work for the before-and-after. But the underlying logic is diff on two lists of words. Old-fashioned stuff. Same logic you see every day in version control systems.

("A diff command appeared in Version 6 AT&T UNIX," says my man page. That would have been 1975, if I'm reading the right timeline.)


Hang on. How does diff work anyway?

I don't think this problem ever came up in my undergrad CS classes. You'd think it would have. Digging into my undergrad algorithms textbook -- yes, I saved it, entirely as blog fodder -- I see some discussion of diff under the heading "The Longest Common Subsequence Problem". But I have no memory of it being discussed in class. I guess it's been a while.

(Aho/Hopcroft/Ullman 1983, p189-194. I took that algorithms class around 1991 -- from Danny Sleator, brother of William Sleator. Just one of those footnotes of history. I dimly recall finding the class annoying, but I passed.)

Anyhow. There must be a thousand sample diff implementations on the Net -- this is what stackoverflow is for. But in searching around, I ran across someone saying "Read Myers' 1986 paper, it's easy."

Funny idea, right? Go back to the source. 1986 postdates my textbook, if not my algo class, so I am half-justified in not remembering... Okay, let's take a look:

(The "official" doc reference link is to Springer, which of course is paywalled. It's a good thing people have saved accessible copies. Aaron Swartz's memory for a blessing.)

And then I sat down and wrote some Python, and now I have a simple diff function.

...I started to write up a blow-by-blow of the paper, but nah. That's not my point here. My point is simple: this wasn't that hard! The paper really is pretty simple; it has nice diagrams; it has some pseudocode. There's a lot of pages proving that the algorithm is optimal, but you don't need to follow the proof to implement the function.

You do have to follow the diagrams. The pseudocode is a bit shorthand; you have to know what's going on. I appreciate that, honestly. If I'd wanted a cut-and-paste solution, I would have stayed on stackoverflow. Instead, I have a rather messy chunk of code that I wrote and I know what all the pieces mean.

In a year of noisy AI shortcuts, it feels good to do the work.


If you want the cut-and-paste version anyhow, no shade. Check out this blog post by Robert Elder (2017). I may in fact wind up using his version, because it's got optimizations I'm missing.

Or I may strike out on my own. My text idea isn't quite a basic diff. Myers (and later VCS refinements) are trying to find a "shortest edit script", which is a sequence of deletes and inserts. See the chart at the top of this post.

But if I'm animating words moving around, then I really want a sequence of deletes, inserts, and moves. Again, look at that chart. The words "beast" and "," are deleted on the left and then inserted in a different position on the right. I'd really like those two words to stay on screen and just shift sideways.

I could fix up the Myers output to include moves, by finding words in common between the delete and insert lists. But that wouldn't be optimal. A move should count as cheaper than a delete-and-insert, and Myers doesn't account for that.

Upon further searching, I see references to:

But I have not yet read this paper!

As for the text game? Still just an idea.


Oh, speaking of footnotes. At the bottom of Myers page 1, and I swear I didn't mean to make this post topical:

* This work was supported in part by the National Science Foundation under Grant MCS82-10096.

I'm trying to distract myself from the world, here, so let's just let that one sit.


Addendum, Feb 11

I thought about this some more and decided that the standard diff approach isn't what I want.

I mentioned the "include moves" problem above. But more than that: in general we'll be dealing with chunks of text that are mostly different. We need to identify the words that exist in each chunk, and decide where each one moves to. But this overlap won't generaly be large. It's not like source control, where the common case is changing a few lines in a big file.

We may reasonably expect a phrase (sequence of words) to appear in both chunks. However -- thinking down to the animation level -- even if a phrase remains on screen, it probably won't appear at the same place on the screen. Any wording change before the common phrase will re-wrap it. So the words will still move around.

(Unless the common phrase is at the start of the block. We can special-case that.)

Therefore, my representation of diff(seqA, seqB) will look like three lists:

  • deletes (words in seqA which get deleted, because they're not in seqB)
  • adds (words in seqB which get added, because they're not in seqA)
  • moves (words shared between seqA and seqB which just change position)

Building the deletes and adds lists is easy; that's just set difference. For moves, a greedy algorithm seems workable. Say the word "foo" appears at positions [1, 3, 6] in seqA, and positions [2, 3] in seqB. The "foo" at 3 can stay at 3; knock the 3s from both lists. Now we have [1, 6] and [2]. It's cheaper to move the "foo" at 1 to 2. Then we're left with [6] and []; the "foo" at 6 is surplus so it becomes a delete.

You can do this index-list-matchup in O(N^2) time, where N is how many times a given word appears in the text -- exercise left to the reader. That's not going to be huge for passage-sized texts, so this should be workable.

As a concrete example: if seqA is XYXYXY and seqB is YXYXYX, this plan will cause each XY pair to swap places, which minimizes movement and also looks cool. A longest-common-substring plan will try to move the initial X to the end, and every other letter shifts back one place -- not as good.

If I want to really get into the weeds, I could minimize screen movement (in X,Y coordinates) rather than index-in-the-phrase movement. That might be fun.


Comments from Mastodon


Comments from Mastodon (live)

Please wait...