Wanderings in the realm of diff
Sunday, February 9, 2025 (updated 2 days later)
Comments: 9 (plus live) (latest 1 day later)
Tagged: diff, algorithms
So I was thinking about a text game idea where paragraphs of text change between a handful of possibilities as you make choices. Sort of a cycling choices effect.
But maybe instead of just replacing one block of text with another, I could do something visually neat. Like cross-fading. Or, how about this, maybe all the words could rearrange from one text into another. Some of them would move, some would appear, some would disappear...
= The | = The
| + rough
| + beast
| + ,
| + its
= hour | = hour
- is |
= come | = come
| + round
= at | = at
= last | = last
- , |
- but |
- not |
- the |
- beast |
= ... | = ...
I'm treating punctuation marks as separate words for this example.
Well, that's just a diff
function, right? Sure, you have to do all the animation work on top of that, which means text-layout work for the before-and-after. But the underlying logic is diff
on two lists of words. Old-fashioned stuff. Same logic you see every day in version control systems.
("A diff
command appeared in Version 6 AT&T UNIX," says my man
page. That would have been 1975, if I'm reading the right timeline.)
Hang on. How does diff
work anyway?
I don't think this problem ever came up in my undergrad CS classes. You'd think it would have. Digging into my undergrad algorithms textbook -- yes, I saved it, entirely as blog fodder -- I see some discussion of diff
under the heading "The Longest Common Subsequence Problem". But I have no memory of it being discussed in class. I guess it's been a while.
(Aho/Hopcroft/Ullman 1983, p189-194. I took that algorithms class around 1991 -- from Danny Sleator, brother of William Sleator. Just one of those footnotes of history. I dimly recall finding the class annoying, but I passed.)
Anyhow. There must be a thousand sample diff
implementations on the Net -- this is what stackoverflow is for. But in searching around, I ran across someone saying "Read Myers' 1986 paper, it's easy."
Funny idea, right? Go back to the source. 1986 postdates my textbook, if not my algo class, so I am half-justified in not remembering... Okay, let's take a look:
- An O(ND) Difference Algorithm and Its Variations, Eugene W. Myers, 1986
(The "official" doc reference link is to Springer, which of course is paywalled. It's a good thing people have saved accessible copies. Aaron Swartz's memory for a blessing.)
And then I sat down and wrote some Python, and now I have a simple diff
function.
...I started to write up a blow-by-blow of the paper, but nah. That's not my point here. My point is simple: this wasn't that hard! The paper really is pretty simple; it has nice diagrams; it has some pseudocode. There's a lot of pages proving that the algorithm is optimal, but you don't need to follow the proof to implement the function.
You do have to follow the diagrams. The pseudocode is a bit shorthand; you have to know what's going on. I appreciate that, honestly. If I'd wanted a cut-and-paste solution, I would have stayed on stackoverflow. Instead, I have a rather messy chunk of code that I wrote and I know what all the pieces mean.
In a year of noisy AI shortcuts, it feels good to do the work.
If you want the cut-and-paste version anyhow, no shade. Check out this blog post by Robert Elder (2017). I may in fact wind up using his version, because it's got optimizations I'm missing.
Or I may strike out on my own. My text idea isn't quite a basic diff
. Myers (and later VCS refinements) are trying to find a "shortest edit script", which is a sequence of deletes and inserts. See the chart at the top of this post.
But if I'm animating words moving around, then I really want a sequence of deletes, inserts, and moves. Again, look at that chart. The words "beast"
and ","
are deleted on the left and then inserted in a different position on the right. I'd really like those two words to stay on screen and just shift sideways.
I could fix up the Myers output to include moves, by finding words in common between the delete and insert lists. But that wouldn't be optimal. A move should count as cheaper than a delete-and-insert, and Myers doesn't account for that.
Upon further searching, I see references to:
- The String-to-String Correction Problem with Block Moves, Walter F. Tichy, 1983
But I have not yet read this paper!
As for the text game? Still just an idea.
Oh, speaking of footnotes. At the bottom of Myers page 1, and I swear I didn't mean to make this post topical:
* This work was supported in part by the National Science Foundation under Grant MCS82-10096.
I'm trying to distract myself from the world, here, so let's just let that one sit.
Addendum, Feb 11
I thought about this some more and decided that the standard diff
approach isn't what I want.
I mentioned the "include moves" problem above. But more than that: in general we'll be dealing with chunks of text that are mostly different. We need to identify the words that exist in each chunk, and decide where each one moves to. But this overlap won't generaly be large. It's not like source control, where the common case is changing a few lines in a big file.
We may reasonably expect a phrase (sequence of words) to appear in both chunks. However -- thinking down to the animation level -- even if a phrase remains on screen, it probably won't appear at the same place on the screen. Any wording change before the common phrase will re-wrap it. So the words will still move around.
(Unless the common phrase is at the start of the block. We can special-case that.)
Therefore, my representation of diff(seqA, seqB)
will look like three lists:
deletes
(words inseqA
which get deleted, because they're not inseqB
)adds
(words inseqB
which get added, because they're not inseqA)
moves
(words shared betweenseqA
andseqB
which just change position)
Building the deletes
and adds
lists is easy; that's just set difference. For moves
, a greedy algorithm seems workable. Say the word "foo"
appears at positions [1, 3, 6]
in seqA
, and positions [2, 3]
in seqB
. The "foo"
at 3 can stay at 3; knock the 3s from both lists. Now we have [1, 6]
and [2]
. It's cheaper to move the "foo"
at 1 to 2. Then we're left with [6]
and []
; the "foo"
at 6 is surplus so it becomes a delete
.
You can do this index-list-matchup in O(N^2) time, where N is how many times a given word appears in the text -- exercise left to the reader. That's not going to be huge for passage-sized texts, so this should be workable.
As a concrete example: if seqA
is XYXYXY
and seqB
is YXYXYX
, this plan will cause each XY
pair to swap places, which minimizes movement and also looks cool. A longest-common-substring plan will try to move the initial X
to the end, and every other letter shifts back one place -- not as good.
If I want to really get into the weeds, I could minimize screen movement (in X,Y coordinates) rather than index-in-the-phrase movement. That might be fun.
Comments from Mastodon
@zarfeblong this practice was hammered into us in my history and classics coursework and I still do it all the time. it's just so useful!
@zarfeblong you might find the references here interesting (besides Myers someone called Heckel from 1978). I got there because Apple’s UIKit has a diffing algorithm that handles moves, but it’s not open-source, and googling for replacements turned this one up.
@zarfeblong oh dear. Adding the url and not just saying “here” helps: https://github.com/ra1028/DifferenceKit
@zarfeblong SCCS hit 50 years of age recently and there's a paper from the original author on the "weave" format here:
https://www.mrochkind.com/mrochkind/docs/SCCSretro2.pdf
There's a lot of discussion of this on the Unix History mailing list last December: https://www.tuhs.org/pipermail/tuhs/2024-December/ - I know this isn't what you are doing but to my superficial view this appears to relate.
@zarfeblong A different off-the-shelf option (and a different algorithm to explore, too, in its documentation), if you stick with Python, Python is one of the few languages I know with a generalized diff tool in the standard library: https://docs.python.org/3/library/difflib.html
(Years ago I glued pygments and difflib together to prove to my own satisfaction that diffs of syntax highlighting token streams are better character-level diffs in speed and usefulness. https://github.com/WorldMaker/tokdiff)
@max Thanks, I had forgotten about that.
I am definitely not sticking with Python -- if this turns into an actual game, it will be on iOS or Unity or Godot or something, in whichever native language. But more examples is good, of course.
@zarfeblong The 80s paper the documentation comments in difflib links to are also useful because it’s a hash-based algorithm which makes it really to implement with all sorts of crazy hashable data structures and also most of the good move-detection algorithms I’ve seen are hash-based. Hash-based approaches trade off that robustness by being generally “slower” in O-notation, but not necessarily in practice, esp. with move detection.
Comments from Mastodon (live)
Please wait...
This comment thread exists on Mastodon. (Why is this?) To reply, paste this URL into your Mastodon search bar:
Academics in my orbit will blink at me like superb owls, but for the average code monkey, it's not so obvious.