Method/algorithm to compress edits of data?

Is there an algorithm to compress multiple edits (insertions, deletions, updates) of records into a smaller set, throwing out the intermediate states? Google only gives me pages about data compression algorithms, which is a different thing.

Let's say I have a set of changes, in pseudocode:


1. update record id:123 set field1 = 'bla bla bla';
2. update record id:123 set field1 = 'ooooo';
3. update record id:123 set field1 = 'uuuuuuu';
4. update record id:123 set field2 = 456;

And I want to remove #1 and #2, keeping only the edits that take the data from the initial state to the final state.

Does this method have a name, or are there known algorithms for it? I could of course write such an optimization off the top of my head, but it looks like an O(n²) algorithm! Maybe there's existing knowledge on this?
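For what it's worth, the single-record case above doesn't need O(n²): a hash map keyed on (record id, field) gives last-write-wins compaction in one O(n) pass. A minimal sketch (the edit tuple shape is illustrative):

```python
# Last-write-wins compaction: one O(n) pass over the edit log.
# Each edit is (record_id, field, value); a later edit to the same
# (record_id, field) key simply overwrites the earlier one.

def compact(edits):
    last = {}  # (record_id, field) -> final value
    for record_id, field, value in edits:
        last[(record_id, field)] = value  # overwrite keeps only the last write
    return [(rid, f, v) for (rid, f), v in last.items()]

edits = [
    (123, "field1", "bla bla bla"),
    (123, "field1", "ooooo"),
    (123, "field1", "uuuuuuu"),
    (123, "field2", 456),
]
print(compact(edits))  # -> [(123, 'field1', 'uuuuuuu'), (123, 'field2', 456)]
```

This only works because plain assignments to different fields are independent; once edits can depend on each other (as the answers below discuss), you need to track those dependencies.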

The simplest approach would be to perform the editing steps and then compute a diff between the original and the final result. Diffing has a vast literature.
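In Python, for example, the standard-library `difflib` can recover an edit script between the original and final values, discarding how you got there:

```python
import difflib

original = "bla bla bla"
final = "uuuuuuu"

# SequenceMatcher produces opcodes (an edit script) that turn
# `original` into `final`, regardless of the intermediate edits.
sm = difflib.SequenceMatcher(None, original, final)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(tag, repr(original[i1:i2]), "->", repr(final[j1:j2]))
```

The same idea applies to sequences of records instead of characters; only the comparison key changes.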

Other than that, you'll probably be looking for a graph-theoretical approach. The example you presented looks a lot like dead code elimination, which is a textbook problem in compiler theory, solved with SSA form. Something like a topological sort may also be helpful for taking dependencies into account. Once you've determined a (possibly discontiguous) sub-sequence of operations that aren't inter-dependent with one another, you can apply simple "optimizations" such as discarding all but the last assignment, or collapsing multiple accumulative/aggregation operations into a single one.
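A rough sketch of the two "optimizations" mentioned above, using illustrative op names (`set` kills everything before it on the same key, like a dead store; consecutive `add`s fold into one accumulated delta):

```python
# Dependency-aware compaction sketch. Op names are illustrative:
# ("set", key, value)  - absolute assignment, makes earlier ops on key dead
# ("add", key, delta)  - accumulative update, foldable with its neighbours

def compact(ops):
    out = {}  # key -> ("set", value) or ("add", accumulated_delta)
    for op, key, val in ops:
        if op == "set":
            out[key] = ("set", val)                  # earlier ops on key are dead
        elif op == "add":
            prev = out.get(key)
            if prev is None:
                out[key] = ("add", val)
            elif prev[0] == "set":
                out[key] = ("set", prev[1] + val)    # fold the add into the set
            else:
                out[key] = ("add", prev[1] + val)    # accumulate the deltas
    return [(op, key, val) for key, (op, val) in out.items()]

ops = [
    ("set", "field2", 400),
    ("add", "field2", 50),
    ("add", "field2", 6),
]
print(compact(ops))  # -> [('set', 'field2', 456)]
```

With a real dependency graph you'd first topologically sort the operations, then run a pass like this over each independent chain.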


If you are using JSON, you can take a look at JSON diff.
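A minimal recursive JSON diff looks something like this (an illustrative sketch, not a full JSON Patch / RFC 6902 implementation):

```python
# Minimal recursive diff between two JSON-like dicts: reports keys that
# were added, removed, or changed, with a /-separated path.

def json_diff(old, new, path=""):
    diffs = []
    for key in old.keys() | new.keys():
        p = f"{path}/{key}"
        if key not in new:
            diffs.append(("remove", p, old[key]))
        elif key not in old:
            diffs.append(("add", p, new[key]))
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            diffs.extend(json_diff(old[key], new[key], p))  # recurse into objects
        elif old[key] != new[key]:
            diffs.append(("change", p, new[key]))
    return diffs

old = {"id": 123, "field1": "bla bla bla", "field2": 0}
new = {"id": 123, "field1": "uuuuuuu", "field2": 456}
print(sorted(json_diff(old, new)))
# -> [('change', '/field1', 'uuuuuuu'), ('change', '/field2', 456)]
```

Real JSON diff libraries additionally handle arrays (which need sequence diffing, not key matching) and emit a standard patch format.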


Wouldn't the simplest thing be to always diff against version 0? Then you throw away the intermediate edits entirely...

The usual (Myers) diff algorithm runs in O(ND) time, where N is the combined length of the two texts and D is the number of edits between them.

That way every saved version of the data would be equal to (or bigger than) an entire snapshot, which is a lot of duplication. For example, my data is a road graph with 100-200K edges, and consecutive versions may differ by just a dozen edges.

Why did I want to optimize those few edits? Because I want to have fewer "key frames" (versions that store the entire dataset).
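The key-frame scheme described above can be sketched like this (all names and the snapshot interval are illustrative assumptions, not from the original post):

```python
# Key-frame storage sketch: keep a full snapshot every K versions and
# only per-field deltas in between; restoring replays deltas from the
# nearest earlier key frame.

KEYFRAME_EVERY = 100  # snapshot interval (tunable)

class VersionStore:
    def __init__(self):
        self.keyframes = {}  # version -> full dict snapshot
        self.deltas = {}     # version -> list of (field, value) edits

    def save(self, version, snapshot, delta):
        if version % KEYFRAME_EVERY == 0:
            self.keyframes[version] = dict(snapshot)  # store a key frame
        else:
            self.deltas[version] = list(delta)        # store only the delta

    def restore(self, version):
        base = version - (version % KEYFRAME_EVERY)   # nearest key frame at/before
        state = dict(self.keyframes[base])
        for v in range(base + 1, version + 1):        # replay deltas forward
            for field, value in self.deltas.get(v, []):
                state[field] = value
        return state

store = VersionStore()
store.save(0, {"a": 1, "b": 2}, [])
store.save(1, None, [("a", 10)])
store.save(2, None, [("b", 20)])
print(store.restore(2))  # -> {'a': 10, 'b': 20}
```

Compacting the deltas between key frames (as discussed earlier in the thread) lets you stretch `KEYFRAME_EVERY` without making restores slower.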
