Zarf Updates: Wanderings in the realm of diff

Wanderings in the realm of diff

Sunday, February 9, 2025 (updated 2 days later)

Comments: 11 (latest February 18)

Tagged: diff, algorithms

So I was thinking about a text game idea where paragraphs of text change between a handful of possibilities as you make choices. Sort of a cycling choices effect.

But maybe instead of just replacing one block of text with another, I could do something visually neat. Like cross-fading. Or, how about this, maybe all the words could rearrange from one text into another. Some of them would move, some would appear, some would disappear...

= The    |  = The
         |  + rough
         |  + beast
         |  + ,
         |  + its
= hour   |  = hour
- is     | 
= come   |  = come
         |  + round
= at     |  = at
= last   |  = last
- ,      | 
- but    | 
- not    | 
- the    | 
- beast  | 
= ...    |  = ...

I'm treating punctuation marks as separate words for this example.

Well, that's just a diff function, right? Sure, you have to do all the animation work on top of that, which means text-layout work for the before-and-after. But the underlying logic is diff on two lists of words. Old-fashioned stuff. Same logic you see every day in version control systems.

("A diff command appeared in Version 6 AT&T UNIX," says my man page. That would have been 1975, if I'm reading the right timeline.)

Hang on. How does diff work anyway?

I don't think this problem ever came up in my undergrad CS classes. You'd think it would have. Digging into my undergrad algorithms textbook -- yes, I saved it, entirely as blog fodder -- I see some discussion of diff under the heading "The Longest Common Subsequence Problem". But I have no memory of it being discussed in class. I guess it's been a while.

(Aho/Hopcroft/Ullman 1983, p189-194. I took that algorithms class around 1991 -- from Danny Sleator, brother of William Sleator. Just one of those footnotes of history. I dimly recall finding the class annoying, but I passed.)

Anyhow. There must be a thousand sample diff implementations on the Net -- this is what stackoverflow is for. But in searching around, I ran across someone saying "Read Myers' 1986 paper, it's easy."

Funny idea, right? Go back to the source. 1986 postdates my textbook, if not my algo class, so I am half-justified in not remembering... Okay, let's take a look:

An O(ND) Difference Algorithm and Its Variations, Eugene W. Myers, 1986

(The "official" doc reference link is to Springer, which of course is paywalled. It's a good thing people have saved accessible copies. Aaron Swartz's memory for a blessing.)

And then I sat down and wrote some Python, and now I have a simple diff function.

...I started to write up a blow-by-blow of the paper, but nah. That's not my point here. My point is simple: this wasn't that hard! The paper really is pretty simple; it has nice diagrams; it has some pseudocode. There's a lot of pages proving that the algorithm is optimal, but you don't need to follow the proof to implement the function.

You do have to follow the diagrams. The pseudocode is a bit shorthand; you have to know what's going on. I appreciate that, honestly. If I'd wanted a cut-and-paste solution, I would have stayed on stackoverflow. Instead, I have a rather messy chunk of code that I wrote and I know what all the pieces mean.

In a year of noisy AI shortcuts, it feels good to do the work.

If you want the cut-and-paste version anyhow, no shade. Check out this blog post by Robert Elder (2017). I may in fact wind up using his version, because it's got optimizations I'm missing.

Or I may strike out on my own. My text idea isn't quite a basic diff. Myers (and later VCS refinements) are trying to find a "shortest edit script", which is a sequence of deletes and inserts. See the chart at the top of this post.

But if I'm animating words moving around, then I really want a sequence of deletes, inserts, and moves. Again, look at that chart. The words "beast" and "," are deleted on the left and then inserted in a different position on the right. I'd really like those two words to stay on screen and just shift sideways.

I could fix up the Myers output to include moves, by finding words in common between the delete and insert lists. But that wouldn't be optimal. A move should count as cheaper than a delete-and-insert, and Myers doesn't account for that.

Upon further searching, I see references to:

The String-to-String Correction Problem with Block Moves, Walter F. Tichy, 1983

But I have not yet read this paper!

As for the text game? Still just an idea.

Oh, speaking of footnotes. At the bottom of Myers page 1, and I swear I didn't mean to make this post topical:

* This work was supported in part by the National Science Foundation under Grant MCS82-10096.

I'm trying to distract myself from the world, here, so let's just let that one sit.

Addendum, Feb 11

I thought about this some more and decided that the standard diff approach isn't what I want.

I mentioned the "include moves" problem above. But more than that: in general we'll be dealing with chunks of text that are mostly different. We need to identify the words that exist in each chunk, and decide where each one moves to. But this overlap won't generaly be large. It's not like source control, where the common case is changing a few lines in a big file.

We may reasonably expect a phrase (sequence of words) to appear in both chunks. However -- thinking down to the animation level -- even if a phrase remains on screen, it probably won't appear at the same place on the screen. Any wording change before the common phrase will re-wrap it. So the words will still move around.

(Unless the common phrase is at the start of the block. We can special-case that.)

Therefore, my representation of diff(seqA, seqB) will look like three lists:

deletes (words in seqA which get deleted, because they're not in seqB)
adds (words in seqB which get added, because they're not in seqA)
moves (words shared between seqA and seqB which just change position)

Building the deletes and adds lists is easy; that's just set difference. For moves, a greedy algorithm seems workable. Say the word "foo" appears at positions [1, 3, 6] in seqA, and positions [2, 3] in seqB. The "foo" at 3 can stay at 3; knock the 3s from both lists. Now we have [1, 6] and [2]. It's cheaper to move the "foo" at 1 to 2. Then we're left with [6] and []; the "foo" at 6 is surplus so it becomes a delete.

You can do this index-list-matchup in O(N^2) time, where N is how many times a given word appears in the text -- exercise left to the reader. That's not going to be huge for passage-sized texts, so this should be workable.

As a concrete example: if seqA is XYXYXY and seqB is YXYXYX, this plan will cause each XY pair to swap places, which minimizes movement and also looks cool. A longest-common-substring plan will try to move the initial X to the end, and every other letter shifts back one place -- not as good.

If I want to really get into the weeds, I could minimize screen movement (in X,Y coordinates) rather than index-in-the-phrase movement. That might be fun.

Comments from Mastodon

Andrew Plotkin (February 9, 2025 at 1:35 PM):

Academics in my orbit will blink at me like superb owls, but for the average code monkey, it's not so obvious.

Nick (February 9, 2025 at 1:48 PM):

@zarfeblong this practice was hammered into us in my history and classics coursework and I still do it all the time. it's just so useful!

Tikitu (February 9, 2025 at 2:09 PM):

@zarfeblong you might find the references here interesting (besides Myers someone called Heckel from 1978). I got there because Apple’s UIKit has a diffing algorithm that handles moves, but it’s not open-source, and googling for replacements turned this one up.

Tikitu (February 10, 2025 at 3:06 PM):

@zarfeblong oh dear. Adding the url and not just saying “here” helps: https://github.com/ra1028/DifferenceKit

Andrew Plotkin (February 10, 2025 at 3:10 PM):

@tikitu Heh, thanks.

Justin Sherrill (February 9, 2025 at 9:55 PM):

@zarfeblong SCCS hit 50 years of age recently and there's a paper from the original author on the "weave" format here:
https://www.mrochkind.com/mrochkind/docs/SCCSretro2.pdf

There's a lot of discussion of this on the Unix History mailing list last December: https://www.tuhs.org/pipermail/tuhs/2024-December/ - I know this isn't what you are doing but to my superficial view this appears to relate.

Max (February 10, 2025 at 9:54 AM):

@zarfeblong A different off-the-shelf option (and a different algorithm to explore, too, in its documentation), if you stick with Python, Python is one of the few languages I know with a generalized diff tool in the standard library: https://docs.python.org/3/library/difflib.html

(Years ago I glued pygments and difflib together to prove to my own satisfaction that diffs of syntax highlighting token streams are better character-level diffs in speed and usefulness. https://github.com/WorldMaker/tokdiff)

Andrew Plotkin (February 10, 2025 at 10:16 AM):

@max Thanks, I had forgotten about that.

I am definitely not sticking with Python -- if this turns into an actual game, it will be on iOS or Unity or Godot or something, in whichever native language. But more examples is good, of course.

Max (February 10, 2025 at 12:33 PM):

@zarfeblong The 80s paper the documentation comments in difflib links to are also useful because it’s a hash-based algorithm which makes it really to implement with all sorts of crazy hashable data structures and also most of the good move-detection algorithms I’ve seen are hash-based. Hash-based approaches trade off that robustness by being generally “slower” in O-notation, but not necessarily in practice, esp. with move detection.

C. Scott Ananian (he/him) (February 17, 2025 at 8:55 PM):

@max @zarfeblong oh that might be useful for another purpose -- we need tree-oriented diff algorithms for HTML, and I'm guessing that having hashable subtrees might make that work. I'll have to check that out. (I've been surprised that tree-oriented diff algorithms aren't more common, but maybe it's just my literature search skills which are lacking.)

Max (February 17, 2025 at 9:03 PM):

@cscott @zarfeblong Back when I was playing with Python’s difflib I think I also did a spike turning trees into a tuple stream something like (parent, contents) and the output was fine enough.

I arrived at token stream diffing for a number of reasons that “well formed ASTs/trees” were an interesting red herring to good source control diffs. You generally want to be able to source control work in progress as well and syntax highlighting token streams are good at that.

Zarf Updates

Interactive fiction, narrative in games, and so on

Posts by

My links

Search (via DDG)

Blog archive

Tag archive

Previous post

Next post

Feeds

Games are for everyone