Text editing differencing algorithm

Hi, I have met an algorithm issue about how to calculate the difference between two text content, I am trying to implement a text editor.

Please let me explain it in detail.

I am using the ropey crate as the data structure to store the text content of a text file in my project. In short, ropey provide a set of APIs let you directly updates the text content by lines and every chars on a line, with a great performance. The line index and char index on each line start from 0. Here we have:

struct Rope { ... }

And then I am trying to implement the undo, redo features. I want to create a structure Edit, it saves each editing operation from the user: inserting, deleting, changing.

struct Insert {
 line: usize, // Insert at this line
 char: usize, // Insert at this char of the line
 payload: String, // Inserted payload
}
struct Delete {
 line: usize, // Delete at this line
 char: usize, // Delete at this char of the line
 count: usize, // Deleted chars from the position
}
struct Change {
 line: usize, // Change at this line. Note: A change only changes 1 line at a time.
 start_char: usize, // Change start from this char of the line.
 end_char: usize, // Change end at this char of the line.
 payload: String // New payload
}
struct MultiLineChange {
 lines: BTreeMap<usize, Change>, // Multiple line change, maps from line to the change.
}
enum Edit {
 Insert(Insert),
 Delete(Delete),
 Change(Change),
 MultiLineChange(MultiLineChange),
}
struct EditHistory {
 edits: Vec<Edit>
}

So in my thinking, the EditHistory is a list of Edit (i.e. Vec<Edit>), the Rope is always the latest text content. So we can recover the text content before last N edits by applying the last N Edit operation reversely. This operation is a O(N) time complexity operation, we need to do N reverse editing on the current Rope.

Finally, here are my questions:

  1. Is there any algorithm, that can merge 2 or more continuously edit operations (&[Edit]) into 1 edit operation (Edit)? This can help me reduce some duplicated/repeated operations such as first add a char, then delete it.
  2. With 1 or more edit operations, we actually have two ropes Rope-latest (latest Rope after some edit operations) and Rope-snapshot (snapshotted/origin Rope before these edit operations). How can I map every line and every char per line from Rope-latest to Rope-snapshot, i.e. I want to query the char (line,char) of Rope-snapshot by Rope-latest and these edit operations, and also query in the reverse direction.
  3. Now we have two rope snapshots Rope-A and Rope-B, the B is updated from A after some editing operations. Is there any algorithm that can calculate the difference between them? The difference result should be represented by a single Edit structure above.

    It seems this question can vary based on different requirements:

    • Line-iteration compare: Go through all the lines on A and B, for the same line, just string-compare the two lines in A and B (I found some algorithms here: wu-diff-rs, lcs-diff-rs, diff.rs).
    • Some more advanced algorithms to minimize the edit payload just like git diff?

i don't know of any specific algoritms for that but i think it should be fairòy doable to detect ovrelaps between two close edits.
i would also suggest to take in consideration the possibility of changing what you consider to be an edit, a possible approach could be that once the users starts to edit you wait until they pause for a sec and consider all the changes done in that interval as a single action for your undo/redo logic, or you might find blocks like words and consider and edit complete when the cursor moves to a different word

2 Likes

a possible approach could be that once the users starts to edit you wait until they pause for a sec and consider all the changes done in that interval as a single action for your undo/redo logic, or you might find blocks like words and consider and edit complete when the cursor moves to a different word

Agree, that will be a great improvement and very suitable for this scenario

Another possible solution is to store an undo/redo stack as simple clones of the entire rope. Cloning a rope is extremely cheap and efficient. Whereas crates like imara-diff have already solved the diff generation problem efficiently, you'd need to temporarily flatten out your rope to a String in order to generate them.

1 Like

Thanks for sharing, will take a look!

Persistent collections are a great option for this, yes. Then you don't need to think about diffing live -- you can calculate a diff later if you really want one (maybe to merge local changes with a update that happened on disk) but the vast majority of the time you just change to a different buffer and success.

That also means you can trivially drop an entry in the undo/redo list and it still works, if you want to reduce the memory usage over time while still having long-term undo available. Like maybe you keep everything for really recent changes, but as they changes get older you do more and more debouncing -- for changes a day ago you don't need second-by-second changes in the stack still, but you do have hour-by-hour because there aren't that many of those.

Finger tree - Wikipedia is the one that comes to mind, but I'm sure there are lots of other good ones.

2 Likes