Db-like "transactions" in Rust

High Level:

  1. I am looking for a way to do db-like "transactions" on Rust structures, i.e.:
     start transaction
     ... do some work ... (no IO / network, but it does include mutations)
     either commit or rollback
  2. I know about STMs, but there's no concurrency involved here, and performance is important, so I am looking for something lighter weight.

=====

Low level details:

  1. We have a VM state that looks something like:
pub struct State {
    code: Vec<Instr>,
    data: Vec<Data>,
    alt: Vec<Data>,
    variables: HashMap<String, Data>,
    function_code: Vec<Instr>,
    function_table: Vec<usize>,
}

Now, in the VM's execution loop, the Add instruction looks something like:

Instr::Add => {
    let rhs = data.pop().unwrap();
    let lhs = data.pop().unwrap();
    match (lhs, rhs) {
        (Data::I32(lhs), Data::I32(rhs)) => data.push(Data::I32(lhs + rhs)),
        (Data::F32(lhs), Data::F32(rhs)) => data.push(Data::F32(lhs + rhs)),
        _ => panic!("+ fail"),
    }
}

Now, what happens during error handling? Lots of bad things:

If we get an F32 and a String, we have already popped both arguments by the time we get to the panic!

In a sense, on an error, I want to be able to "rollback" the two pop()'s, and say: here is the data stack when we tried to execute +, and it failed.

So I want each instruction to be "transactional" in that it either: (1) succeeds or (2) does not modify the "struct State", and reports an error.

The two ways to do this seem to be:
(1) manually read the data without modifying State, do the computation, and only modify the struct once success is guaranteed -- this could get messy on more complicated instructions
(2) get some type of lightweight "transactions" on Rust structs, so we can "rollback" in case of a failure.

Suggestions?

There's no true support for this in Rust.

The stupid solution is to clone() your struct, do the operations that you want on the copy, and then mem::replace() your changed struct with your other one when you're done.
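A minimal sketch of that, assuming State derives Clone and given some hypothetical fallible step function that executes one instruction (names here are illustrative, not from your code):

use std::mem;

// Hypothetical fallible interpreter step; body elided.
fn step(_state: &mut State, _instr: &Instr) -> Result<(), String> {
    Ok(())
}

// Clone-then-swap: mutate a scratch copy, and only install it on success.
fn step_transactional(state: &mut State, instr: &Instr) -> Result<(), String> {
    let mut scratch = state.clone();         // assumes #[derive(Clone)] on State
    step(&mut scratch, instr)?;              // on Err, `state` is left untouched
    let _old = mem::replace(state, scratch); // commit by swapping the copy in
    Ok(())
}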

The smart solution is to use Rc<T> and persistent data structures for the contents of your struct so that you don't have to actually copy everything. This is what your databases do under the hood. It's not that different from the stupid solution: you clone() it, do your operations on it, and then replace it back into place, it's just that cloning it doesn't take as long.
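For illustration, here is a hand-rolled persistent stack behind Rc: push and pop return a new stack that shares its tail with the old one, so keeping the old version around as a rollback point costs almost nothing. (Crates like im ship ready-made persistent vectors and maps along these lines.)

use std::rc::Rc;

// Persistent cons-list stack: operations return a *new* stack and never
// mutate the old one, which stays valid as a rollback point.
#[derive(Clone)]
enum Stack<T> {
    Nil,
    Node(Rc<(T, Stack<T>)>),
}

impl<T: Clone> Stack<T> {
    fn push(&self, value: T) -> Stack<T> {
        // O(1): only an Rc pointer is cloned, the tail is shared.
        Stack::Node(Rc::new((value, self.clone())))
    }

    fn pop(&self) -> Option<(T, Stack<T>)> {
        match self {
            Stack::Nil => None,
            Stack::Node(n) => Some((n.0.clone(), n.1.clone())),
        }
    }
}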


clone() is too expensive.

Persistent data structures generally have an O(log n) overhead, right?

I guess maybe the way is either:

  - have really helpful error messages on failure, or
  - split the work into two phases:
      1. read data from State (but do not modify it)
      2. after computing, update the entire State at once (without anything that can panic!) -- roughly as sketched below
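A rough sketch of that two-phase shape for Add, assuming the surrounding execution loop returns a Result and that Data derives Debug:

// Phase 1: inspect the operands without mutating; Phase 2: mutate only once
// nothing can fail anymore.
Instr::Add => {
    if data.len() < 2 {
        return Err(format!("`+` needs two operands; stack: {:?}", data));
    }
    let result = match (&data[data.len() - 2], &data[data.len() - 1]) {
        (Data::I32(lhs), Data::I32(rhs)) => Data::I32(lhs + rhs),
        (Data::F32(lhs), Data::F32(rhs)) => Data::F32(lhs + rhs),
        _ => return Err(format!("`+` type mismatch; stack left as-is: {:?}", data)),
    };
    // Phase 2: infallible mutation.
    let new_len = data.len() - 2;
    data.truncate(new_len);
    data.push(result);
}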

There are many ways to approach this, but all of them (at least ones I can think of) involve writing custom code - there’s no builtin facility for this in Rust/stdlib.

For example, as you say, you can ensure that you don’t perform mutation until you’re certain that the rest of the processing cannot fail, or alternatively, perform the mutation after you’ve already “committed” the processing (here you need to be careful that doing the mutation itself won’t fail).

You can maintain an undo stack that records the actions to take if the transaction has to be rolled back. In your example, you'd record the fact that those two elements were popped off the tail of the data vector; if the transaction aborts, you unwind that stack, executing the corresponding undo action (e.g. the various actions can be modeled as an enum, with a variant per flavor of undo).

But at all times, you'll need to be careful to ensure that either your forward (transaction processing) or your backward (rollback) path can succeed (i.e. no panics or fallible conditions).
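A rough sketch of that undo-log idea against the State above (the enum and its variants are mine, covering only a couple of flavors of mutation):

// One variant per flavor of mutation an instruction can perform.
enum Undo {
    RepushData(Data),                      // a value was popped off `data`: push it back
    UnpushData,                            // a value was pushed onto `data`: pop it off
    RestoreVariable(String, Option<Data>), // a variable changed: restore its old value (None = absent)
}

impl State {
    // On failure, replay the recorded actions in reverse order.
    fn rollback(&mut self, undo_log: Vec<Undo>) {
        for action in undo_log.into_iter().rev() {
            match action {
                Undo::RepushData(value) => self.data.push(value),
                Undo::UnpushData => {
                    self.data.pop();
                }
                Undo::RestoreVariable(name, Some(old)) => {
                    self.variables.insert(name, old);
                }
                Undo::RestoreVariable(name, None) => {
                    self.variables.remove(&name);
                }
            }
        }
    }
}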


Yes, but perhaps we can do this per struct rather than per use site, via a 'weak transaction' custom derive trait.

pub trait WeakTransaction {
    fn start_transaction(&mut self);
    fn commit(&mut self);
    fn reset(&mut self);
}

Then, if all the fields of a struct implement WeakTransaction, the struct itself can auto-derive WeakTransaction.

This would require some careful usage but eliminates the need to write custom rollback code (except once per struct [EDIT: per underlying collection datatype]).
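For concreteness, here is roughly what such a derive might expand to for State; it is only a sketch, and it assumes every field's type already implements WeakTransaction, which is exactly the sticking point raised below:

// Hypothetical expansion of #[derive(WeakTransaction)] for State:
// every call is forwarded to every field.
impl WeakTransaction for State {
    fn start_transaction(&mut self) {
        self.code.start_transaction();
        self.data.start_transaction();
        self.alt.start_transaction();
        self.variables.start_transaction();
        self.function_code.start_transaction();
        self.function_table.start_transaction();
    }

    fn commit(&mut self) {
        self.code.commit();
        self.data.commit();
        // ... and so on for the remaining fields ...
    }

    fn reset(&mut self) {
        self.code.reset();
        self.data.reset();
        // ... and so on for the remaining fields ...
    }
}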

How could you implement WeakTransaction for a Vec, or even a simple type, like i32? You need somewhere to store both the original and the new value (or the original value plus a set of instructions) between start_transaction and commit or reset. You can't add data to a type by implementing a trait, so I don't think this idea will work except for custom structs.

I might take a stab at writing a wrapper struct Transactional<T> with a method that takes a closure, and commits the result if the closure returns Ok but rolls it back if the closure returns Err -- that would be less error-prone than three separate functions. But the only way I can think of to write this is by cloning the T, so you probably wouldn't want to put a Vec in it. Persistent data structures sound like the way to go here.
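A rough sketch of that closure-based wrapper, assuming T: Clone (so it does pay for one clone per transaction):

// Commit on Ok, roll back (by restoring a saved copy) on Err.
struct Transactional<T>(T);

impl<T: Clone> Transactional<T> {
    fn transaction<R, E>(
        &mut self,
        f: impl FnOnce(&mut T) -> Result<R, E>,
    ) -> Result<R, E> {
        let backup = self.0.clone(); // the cost being discussed
        match f(&mut self.0) {
            Ok(value) => Ok(value),  // commit: keep the mutated value
            Err(e) => {
                self.0 = backup;     // rollback: restore the saved copy
                Err(e)
            }
        }
    }
}

Usage would look something like vm.transaction(|state| run_one_instr(state, &instr)), where run_one_instr is whatever fallible step function you already have.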

  1. You are right, we can't do impl WeakTransaction for Vec<T>; though I think we can do a struct TransactionalVec<T> (see the sketch after this list).

  2. clone would kill performance.

  3. I think this still offers an advantage: we only have to implement "rollback" for a few basic collections (which then gets bottled up into a library), instead of having to implement "rollback" for each function (of which there will be lots).

  4. In the Forth VM example, it turns out the only collections / data structures we need are:

     - pushing/popping from a stack (not full Vec ops)
     - updating a hashmap (which we can store as a diff in another map)

     Both sets of ops, I think, can be implemented efficiently.

  5. I do not know how to solve this problem for collections (or even Vec) in general.
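To make points 1 and 4 concrete, here is a rough sketch of a stack-only TransactionalVec (names are mine). Committed elements are never physically removed during a transaction, so rollback just resets a length and drops the staged pushes; there is no explicit start_transaction, since commit/rollback implicitly begin the next one:

// Stack-only transactional Vec: push/pop plus commit/rollback.
pub struct TransactionalVec<T> {
    committed: Vec<T>,    // contents as of the last commit
    committed_len: usize, // how much of `committed` is still live in this tx
    staged: Vec<T>,       // values pushed during the current transaction
}

impl<T: Clone> TransactionalVec<T> {
    pub fn new() -> Self {
        TransactionalVec { committed: Vec::new(), committed_len: 0, staged: Vec::new() }
    }

    pub fn push(&mut self, value: T) {
        self.staged.push(value);
    }

    pub fn pop(&mut self) -> Option<T> {
        if let Some(v) = self.staged.pop() {
            return Some(v); // pushed in this tx, nothing to undo
        }
        if self.committed_len == 0 {
            return None;
        }
        self.committed_len -= 1;
        Some(self.committed[self.committed_len].clone()) // leave the original for rollback
    }

    pub fn rollback(&mut self) {
        self.staged.clear();
        self.committed_len = self.committed.len();
    }

    pub fn commit(&mut self) {
        self.committed.truncate(self.committed_len); // drop the popped items for real
        self.committed.append(&mut self.staged);     // apply the staged pushes
        self.committed_len = self.committed.len();
    }
}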

Both of these can be done with persistent data structures. Pushing and popping from a linked list is O(1) (not even amortized!) in time and space, and I don't think you care too much about the constant factors there given that you are already inflating them with transactionality. Updating k items in a tree of size n is O(k log n) in time, which is probably slower than using a HashMap (O(k), amortized), true. But compared to keeping a difference map and doing two lookups for every item? Eh, I'd profile it.


@trentj : You are absolutely right. What we have been discussing can be easily done with Persistent Datastructures.

For some stupid reason, I kept assuming "Persistent Datastructures" meant Clojure-style persistent vectors, as in "Understanding Clojure's Persistent Vectors, pt. 1" (representing vectors via trees with 32-way branching).

But you're absolutely right, all of this can be done via Persistent Datastructures.

Yes, and transactional RDBMSs have exactly that kind of overhead. That's why MySQL and PostgreSQL use Persistent B-Trees for their indexes; it allows them to create the new index in a parallel structure and then update it in a single atomic write.


That's not necessarily true. A copy-on-write array-list (vector) has O(1) time overhead per operation (the copy happens just once) and O(n) space overhead. A transactional hashmap with an internal map-of-updates has O(1) overhead, etc.

One way to think of a transactional data structure is to think of it as something that holds onto a changelist, which can be committed or thrown away.

Rust's Vec is a very generic data structure. If you're willing to reduce the allowed operations (e.g. no deletes in the middle), you can significantly reduce the overhead of keeping a changelist.

@mohrezaei : Note: I claimed generally, not always.

The reason persistent data structures generally have an O(log n) overhead is that if we allow arbitrary updates, there are n choices of what to update.

Then, after the op, we need both the old and the new version of the structure to exist. This generally requires rebuilding part of some "tree"; with a constant fan-out of k, that tree has height log_k n, which is also the number of new nodes we have to create.
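(To put a number on it: with a fan-out of k = 32 and n = 1,000,000 elements, that is roughly log_32(1,000,000) ≈ 4 new nodes per update.)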

Thus, persistent data structures generally have O(log n) overhead.

That only really applies to tree-like structures. If you start with a structure that's not a tree (e.g. a hashmap), there is no reason to think you'd end up with a tree. Trees are useful, to be sure. Hashmaps and flat arrays (e.g. vectors) are not tree-like, and they are incredibly useful. To wit, your own State struct has no trees.

I've built both a transactional map and a transactional list (not in rust), neither of which stored their changes in a tree.

@mohrezaei : I would love to see what you have in mind [in sufficiently detailed pseudocode that we can run big-O analysis on memory usage / time] for a "persistent hashmap" that has an overhead of only O(1) per operation.

@zeroexcuses, here is a basic rundown for the normal operations of insert/get/remove:

The map internally has a main_map, which is used for committed stuff. The map has to be aware of an ongoing transaction, so it can have an internal flag, or check a thread local, or whatever. If the flag is false, it'll just work on main_map. If the flag is true, it'll work on the changelist instead. The changelist is two things: a new_map and a removed_keys set. In tx mode, insert puts things into new_map (querying main_map for the old value, if the key isn't already in new_map, to report the replaced value) and removes entries from removed_keys if they're there. remove puts an entry into removed_keys and removes it from new_map. get queries new_map, then main_map, then filters based on removed_keys. All of those operations remain O(1).
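In Rust terms, a rough sketch of that design could look like the following (names and bounds are mine; for brevity it is always "in transaction", so there is no flag, and insert does not report the replaced value):

use std::collections::{HashMap, HashSet};
use std::hash::Hash;

// Changelist-based transactional map: main_map holds committed data,
// new_map and removed_keys hold the current transaction's changes.
pub struct TxMap<K, V> {
    main_map: HashMap<K, V>,
    new_map: HashMap<K, V>,
    removed_keys: HashSet<K>,
}

impl<K: Eq + Hash + Clone, V> TxMap<K, V> {
    pub fn new() -> Self {
        TxMap { main_map: HashMap::new(), new_map: HashMap::new(), removed_keys: HashSet::new() }
    }

    pub fn insert(&mut self, key: K, value: V) {
        self.removed_keys.remove(&key);
        self.new_map.insert(key, value);
    }

    pub fn remove(&mut self, key: &K) {
        self.new_map.remove(key);
        self.removed_keys.insert(key.clone());
    }

    pub fn get(&self, key: &K) -> Option<&V> {
        if self.removed_keys.contains(key) {
            return None;
        }
        self.new_map.get(key).or_else(|| self.main_map.get(key))
    }

    pub fn rollback(&mut self) {
        self.new_map.clear();
        self.removed_keys.clear();
    }

    pub fn commit(&mut self) {
        for k in self.removed_keys.drain() {
            self.main_map.remove(&k);
        }
        for (k, v) in self.new_map.drain() {
            self.main_map.insert(k, v);
        }
    }
}

Every per-key operation here is O(1) amortized, matching the description above.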

For fancier stuff, like key iteration, you'll compose things together (main_map iterator + new_map iterator - removed_keys iterator).

I've done this in Java, which makes all this very idiomatic via GC. In Rust, things might be a bit harder based on lifetimes and so on, but it's probably workable.

Is that sensible?

@mohrezaei :

  1. First, let's see if I understand your impl correctly:
pub struct PersistentHashMap<K, V> {
    pub old_hashmap: HashMap<K, V>, // no need to modify this before commit
    pub updated_values: HashMap<K, V>,
    pub deleted_keys: HashSet<K>,
}

Rollback = no action.
Commit = update the old_hashmap w/ updated_values and deleted_keys.

  2. I agree that the above is "transactional", but I do not believe it falls under "Persistent Datastructures".

Suppose we start with a HashMap in state hm0 and do a sequence of N inserts/deletes/updates; we go through states:

hm0, hm1, hm2, ..., hmN

The above, however, only keeps "hm0" and "hmN", and does not give us access to the intermediate states.

Correct. I think somewhere in this thread things got sidetracked, and instead of answering "looking for a way to [do] db-like "transactions"", it turned into "persistent data structures", which, as this hashmap example shows, is overkill (as is clone). I was answering your original question.


I think we both agree with the following statements:

  1. persistent data structures often have an O(log n) cost
  2. persistent data structures can be used to implement transactions
  3. we can also implement transactions WITHOUT persistent data structures, at lower cost (as you have shown above)