In-memory BTreeMap like crate that supports ACID transactions

Hi, I am looking for a crate that provides ordered-map API (eg: BTreeMap like) + in-memory storage + ACID transactions. Could you please suggest a crate that best matches the below requirements? -

a. Should have API similar to BTreeMap (eg: crossbeam_skiplist::SkipMap, sled::Db) - ie, an ordered map.
b. Should provide cross-thread ACID transaction support
c. Store data in memory - should not require filesystem access

I have looked into below crates -

  1. sled - has ACID transactions + Ordered Map interface. But requires file backing (please correct me if i am wrong).
  2. rocksdb - similar to sled - requires file backing.
  3. sqlite / rusqlite - Provides in-memory + transactions + ordered map. Need to adapt the SQL interface to expose a key-value store / Map interface. (Currently leaning towards this - as it is workable w.r.t requirements above, but will be a little involved task to translate from ordered-map API to SQL)

Could you please suggest a rust Map library that matches above requirements?

1 Like

If it is in-memory, why not simply use std::collections::BTreeMap?

( So I don't think I understand what you want....! )

But that said, maybe you could use

which is a component of "RustDb" - my database software. Or not...

[ In general RustDb does allow an "in-memory" database, typically I use this when running tests. RustDb is in some ways similar to Sqllite. A bit. ]

Thanks for the rustdb suggestion, shall take a look.

If it is in-memory, why not simply use std::collections::BTreeMap?

BTreeMap does not provide transaction support (requirement b. captured above).

1 Like

You can lock a mutex at the begin of the transaction and unlock it at the end. It would be slower than something that optimistically attempts to run multiple transactions at the same time and restart in case of a conflict, but unless you are doing a lot of concurrent accesses, it may not matter much.

2 Likes

concread is an interesting one in the pure memory category (not part of a persistent database). The monthly downloads are pretty high so it is something I would try if I were looking for what you describe. However, it only allows only one writer at a time (although this is also true of Sqlite.)

So you want to be able ro roll-back a set of updates?

Thinking about that, I suspect you can do it with a sort-of wrapper around a BtreeMap, where new insertions go into a temporary BtreeMap, then on commit they go into the "permanent" BTreeMap. You also have a temporary set of deletions. If you want to iterate I guess that is a bit trickier, but I guess you can merge the temporary and permanent BtreeMaps on the fly... I may be wrong, maybe this doesn't work!!!!

It's probably easier to immediately make changes to the in-memory store, but also maintain a rollback log that can be used to revert all of the changes. As long as the transaction holds exclusive access to anything that's been touched, there should be no conflicts doing the rollback, and no other transaction will ever see an intermediate state.

(I've implemented something like this before, as part of my master's thesis)

2 Likes

Yes, that probably is easier. Maybe an advantage of my suggestion is that concurrent readers can continue to access the "current" map until commit.

I suppose though that the most natural way to do this might a BTree with clone-on-write for the nodes. That way old shared nodes can persist until Arcs are dropped. That corresponds roughly to what my database software does (with "pages" corresponding to BTree nodes).

Thanks for all the suggestions above.

Considering the options, -

  1. Use BTreeMap with a global lock - as mentioned, this make all transactions atomic, and would allow only 1 transaction running at a time in the system. Chances of an ongoing transaction failing in this method is very less (fails only when there are system issues - like no memory for allocation). This approach will also not require any additional maps to optimistically proceed with multiple transactions and the complexities involved with it. With this method, it may be impossible to simulate failed transactions - which i see as negative as we want to have the system test for failures as well. We also need to support concurrent active transactions without blocking.

  2. concread - had a quick look at this, and found the hashmap impl available, while i am looking for an ordered map. It supports multiple readers with "snapshot isolation" using MVCC, but allows only single writer at a time.

  3. Wrapper around BTreeMap with additional temp BTreeMaps for current transaction - Should work in first glance, but may be a bit complex and would need some involved coordination in the wrapper when multiple transactions needs commit.

  4. BTreeMap with rollback log - workable i believe, but relatively complicated too.

  5. Another option probably is to use MVCC for each key (maybe timestamp or sequence number based) in BTreeMap. Timestamp helps with readers to read from a specific snapshot, while transactional writers continue to write new versions to the BTreeMap with newer timestamp/sequence numbers - and (pessimistic) lock the keys being written to, captured in another Map. Proceed with commit or rollback based on the state of the locked keys across transactions.

  6. Use sqlite or rustdb and translate Map operations to SQL - need to write translation layer

  7. A mix-n-match of couple of above options - for eg, 3. + 5. or 4. + 5.!

I am also wondering if there are Sled experts in here who may be able to suggest how easy/difficult is it to bend Sled to not require filesystem access!

concread also has an ordered map.

If you're seriously thinking of rolling your own transactions you may want to look at the lower level tools for this in the lever crate.

1 Like

This is great! My bad, i missed BptreeMap in the first look! :slight_smile: Looks like this may be what is required for my use case! Thanks @jumpnbrownweasel!

1 Like

You're welcome!

The tricky part with this is it doesn't come with deadlock detecting and breaking. Correct database engines can detect deadlock and reject or retry one or both transactions.

1 Like

While I was writing an example of how to do this, I remembered that I published a software-transactional memory crate (mvcc_cell) a year or so ago. Using something like RwLock<BTreeMap<K, MvccCell<V>>> should get you pretty far— Transactions that don't touch the same keys can proceed in parallel while holding a read lock, but adding or removing keys will require taking a write lock.

mvcc_cell uses a first-committer-wins strategy so it should always be possible to make progress, which takes care of the deadlock issue that @riking brought up. The tradeoff here, however, is that a long-running transaction might never end up succeeding if a stream of faster transactions keep conflicting with it and committing first.

You could in theory use something like MvccCell<im::OrdMap<K, MvccCell<V>>> instead but you'll have to write some kind of fair scheduling system to ensure that transactions that want to add or remove keys will eventually succeed, because they'll conflict with everything.


Related previous URLO discussions:


Edit, comparing this to concread, which I was unaware of at the time.

  • concread probably has better inbuilt collections and has obviously been more thoroughly battle tested
  • mvcc_cell provides an atomic transaction mechanism that works when arbitrarily many types/objects are involved, where concread's are limited in scope to a single object/collection
  • The two crates' write strategies are quite different:
    • concread effectively uses a mutex for writes, which means that there's no wasted work— If you're allowed write access, you are guaranteed that write will succeed (at the expense of potentially deadlocking if you don't manage your lock order properly across multiple instances).
    • mvcc_cell, on the other hand, always lets speculative writes occur but may reject the commit if another transaction committed first, even if it started later. If you want to avoid doing useless work, you need to manually check Transaction::can_commit()
1 Like

Thanks @2e71828 ! MvccCell looks very interesting and useful! Will look into this crate as well. Thankyou!

Can you do this via

pub struct TBTreeMap {
  stored: BTreeMap<K, V>,
  tmp: BTreeMap<K, Option<V>> // None denotes deleted
}

impl TBTreeMap {

  pub fn commit(&mut self) {
    for (k, v) in self.tmp {
      match v {
        None => self.store.remove(k);
        Some(v) => self.store.insert(k, v);
      }
    }
  }

  pub fn get(&self, k: K)  -> Option<&V>  {
    match self.tmp.get(k) {
    None => self.stored.get(k);
    Soem(v) => v
    }
  }
}

The basic idea is: to commit, write tmp into store; for reading ops, check to see if it's in tmp first. Then for write ops, write to tmp instead of store.

EDIT: Sorry, I missed the 'cross-thread' requirement. Not sure how to easily handle that as I think you'd have to also track, for each transaction in progress, what cells it read (since if transaction B wrote to key X, but transaction A read from key X before writing, ... that would result in a conflict).

I see you already considered this in above. Do you actually need multiple writers ? Lightning Memory-Mapped Database - Wikipedia allows multiple readers, single writer, with a clever append only btree trick. The core idea is straight forward; most of the complexity comes from managing filesystem pages and "garbage collecting pages" in a durable manner.

1 Like

I wrote a blog explaining why I think having a single writer is the best way:

The crux: "A large update will prevent other updates from running. However, due to the incremental nature of information, it should be possible to divide updates into shorter chunks. In such cases, queries may need to take into account incomplete data. For example..."

1 Like

The D in ACID stands for durability. Persistent media is necessary to necessary to achieve that. So, assuming the memory is volatile, you can't have ACID and pure in-memory at the same time.

Some databases are designed to keep everything in memory while they're online but essentially replicate everything to disk so they can be restarted without data loss.