Current solutions for key-value stores

In my limited understanding, LMDB supports 1-writer + infinite readers, without locking.

It does so by using append-only B+-trees.

Imagine an immutable ordered map with 4 KiB nodes, stored on disk. That is the core of LMDB.

When you write to LMDB, it finds the leaf page, creates a new copy of it, then creates new copies of all of the leaf's ancestors up to the root, producing a new tree root that shares most of its nodes with the old tree.

Thus, it never overwrites any existing page, but the cost of a write is creating O(log n) new 4 KiB pages.

This is also what allows 1 writer + infinite readers without locking: readers keep traversing the old root while the writer builds a new one.
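To make the path-copying idea concrete, here is a minimal sketch of the same technique on a persistent binary search tree (not LMDB's actual layout; the `Node` type and the use of `Rc` are stand-ins for 4 KiB pages and on-disk page numbers): a write copies only the root-to-leaf path, and the new root shares every untouched subtree with the old one, so readers holding the old root are never disturbed.

```rust
use std::rc::Rc;

// Hypothetical node type: a persistent (immutable) binary search tree.
// LMDB uses 4 KiB B+-tree pages instead, but the path-copying idea is the same.
#[derive(Clone)]
struct Node {
    key: u64,
    value: u64,
    left: Option<Rc<Node>>,
    right: Option<Rc<Node>>,
}

// Insert/update by copying only the nodes on the root-to-leaf path.
// Every untouched subtree is shared with the old tree.
fn insert(node: Option<&Rc<Node>>, key: u64, value: u64) -> Rc<Node> {
    match node {
        None => Rc::new(Node { key, value, left: None, right: None }),
        Some(n) => {
            let mut copy = (**n).clone(); // new copy of this node only
            if key < n.key {
                copy.left = Some(insert(n.left.as_ref(), key, value));
            } else if key > n.key {
                copy.right = Some(insert(n.right.as_ref(), key, value));
            } else {
                copy.value = value;
            }
            Rc::new(copy)
        }
    }
}

fn main() {
    let v1 = insert(None, 10, 1);
    let v1 = insert(Some(&v1), 20, 2);
    // A "writer" creates version 2; a "reader" still holding v1 sees the old tree.
    let v2 = insert(Some(&v1), 20, 99);
    assert_eq!(v1.right.as_ref().unwrap().value, 2);  // old version unchanged
    assert_eq!(v2.right.as_ref().unwrap().value, 99); // new version sees the update
}
```

In LMDB the same idea applies to fixed-size pages in a memory-mapped file, with page numbers in place of `Rc` pointers, so "copying the path" means writing depth-of-tree new pages per update.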

The tricky part is that, to save space, there is a garbage collector, but the ref counts / free-page list are also stored in the tree itself, which gets self-referential.

EDIT: The argument for 1 writer + infinite readers is that in many situations you read far more than you write. If you are write-dominant, you probably shouldn't use LMDB.


It might be worthwhile to look at jammdb; imho it is surprisingly readable (compared to the LMDB C source, which I have not yet managed to read).

A colleague told me something similar after his research: LMDB is fast for reading, but not as efficient for writing. But perhaps it's still faster than many other solutions? I've not checked benchmarks.

Edit: Some benchmarks comparing LMDB to SQLite from the design review linked below:

Benchmarking shows SQLite to perform relatively poorly on KV workloads when compared to LMDB. In sequential reads on published benchmarks LMDB has 47x throughput (I measured 80x for string keys). In random reads LMDB has 9x throughput. Sequential writes, 8x; random writes 5.7x.

Maybe I should take a look at jammdb also. I heard that the LMDB code is hard to read (but "brilliant"). Not sure if that's true though.

Regarding jammdb, it looks like you can't work on a database from multiple processes, as the database file gets exclusively locked when opened.

I also found this document: "Design Review: Key-Value Storage". It's a bit old (almost 4 years), but it might contain interesting information on the topic of key-value stores!

My favorite so far has been heed, a higher-level crate that uses either LMDB or MDBX (a descendant of LMDB, with very big changes).

LMDB, by design, will lose write-only benchmarks. Here is a simple argument:

  1. If you don't care about 1 writer + infinite readers, you can do destructive updates. So when you update a page, you do one write: write the page out to disk and update the in-memory pointer to its new position on disk.

  2. LMDB does not do destructive updates. Therefore, to update one page, it has to write depth-of-tree new pages (there is a rough write-amplification sketch after this list).
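To put rough numbers on that, here is a toy back-of-the-envelope calculation (the branching factor and entry count below are assumptions, not measurements) of the write amplification a copy-on-write update pays relative to an in-place overwrite:

```rust
// Back-of-the-envelope write amplification, with made-up but plausible numbers.
fn main() {
    let page_size_bytes: u64 = 4096;
    let branching_factor: u64 = 256;      // keys per interior page (assumed)
    let entries: u64 = 1_000_000_000;     // total entries in the tree (assumed)

    // Tree depth ~ log_branching_factor(entries), rounded up.
    let depth = (entries as f64).log(branching_factor as f64).ceil() as u64;

    let destructive_write_bytes = page_size_bytes;   // overwrite the one leaf page
    let cow_write_bytes = depth * page_size_bytes;   // copy leaf + every ancestor

    println!("tree depth ~ {depth}");
    println!("in-place (destructive) update: {destructive_write_bytes} bytes");
    println!("copy-on-write update: {cow_write_bytes} bytes ({depth}x amplification)");
}
```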

You can have 1 writer + infinite readers. The above looks like a limitation due to the way the OS / filesystem handles mmap calls.

@valpackett: Does MDBX use different data structures, or the same data structures but a different implementation?

OK, I think this is what I was hinting at when I said:

"My general belief is that a server process is necessary for "good" (well-behaved) concurrency"

I suppose strictly speaking it doesn't have to be a process; you could imagine some kind of operating-system facility acting as the supervising shared cache. But without that, it's hard to have shared/concurrent updates without resorting to somewhat inefficient approaches (as with LMDB). I still think that, overall, if you don't want to compromise on performance, you need a single process. If performance isn't a concern, the LMDB approach could work OK.

Well, now I've looked at Wikipedia, and it states:

"Each thread reading from a database gains ownership of an element in a shared memory array, which it may update to indicate when it is within a transaction. Writers scan the array to determine the oldest database version the transaction must preserve, without requiring direct synchronization with active readers."

This makes sense. I use a BTreeMap for similar purposes in my DB implementation (well, actually a HashMap and two BTreeMaps). I wonder how the shared memory array can be implemented across multiple processes. I suppose it uses a second memory-mapped file?
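As far as I know, LMDB keeps that reader table in a separate lock file (lock.mdb) that every process maps, so yes, effectively a second memory-mapped file. Here is a minimal in-process sketch of the idea using an array of atomics (the slot layout and names are made up for illustration): each reader publishes the transaction id of the snapshot it is reading, and the writer scans the slots for the minimum to learn which old pages must still be preserved.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

const NO_READER: u64 = u64::MAX; // slot value meaning "no active reader here"

fn main() {
    // One slot per potential reader. In LMDB this array lives in a shared
    // memory-mapped lock file so separate processes can all see it.
    let slots: Arc<Vec<AtomicU64>> =
        Arc::new((0..8).map(|_| AtomicU64::new(NO_READER)).collect());

    // A reader claims slot 0 and publishes the txn id (snapshot) it is reading.
    let reader_slots = Arc::clone(&slots);
    let reader = thread::spawn(move || {
        reader_slots[0].store(41, Ordering::SeqCst); // "I'm reading version 41"
        // ... read from the version-41 tree without taking any lock ...
        // For the demo we leave the slot claimed so the writer scan below
        // has something to see; a real reader would reset it to NO_READER
        // when its transaction ends.
    });
    reader.join().unwrap();

    // The writer scans the slots for the oldest txn still in use; pages freed
    // by versions newer than that minimum cannot be reused yet.
    let oldest_in_use = slots
        .iter()
        .map(|s| s.load(Ordering::SeqCst))
        .min()
        .unwrap_or(NO_READER);
    println!("oldest snapshot still referenced: {oldest_in_use}");
}
```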

It's still a B+ tree at its core, but a lot of the surrounding management code has been rewritten. The README explains pretty much everything.

Perhaps this is due to my lack of a DB-implementation background, but I found the README to be a list of claimed improvements without an explanation of how those improvements are achieved.
