Current solutions for key-value stores

sled doesn't support multiple processes :frowning:

Currently I'm using sqlite. It somewhat works, and has proven reliability, but it is a performance bottleneck for me, so I'm curious to see if you find something better.

I think sanakirja fits, but imo the API and docs are pretty incomprehensible. I got a simple example to work with concurrent access but even in the simple example I had no idea what half the code I wrote was actually doing. The next thing I was going to try was rusqlite with the bundled feature
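
For reference, this is roughly what the rusqlite route looks like as a plain key-value table (a sketch only; the file name, table name and schema are made up for illustration, and the "bundled" feature just means SQLite is compiled from source so no system library is needed):

```rust
// Cargo.toml (assumed): rusqlite with features = ["bundled"]
use rusqlite::{params, Connection, OptionalExtension};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("kv.sqlite3")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)",
        params![],
    )?;

    // Upsert a key.
    conn.execute(
        "INSERT INTO kv (key, value) VALUES (?1, ?2)
         ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        params!["answer", &b"42"[..]],
    )?;

    // Read it back.
    let value: Option<Vec<u8>> = conn
        .query_row(
            "SELECT value FROM kv WHERE key = ?1",
            params!["answer"],
            |row| row.get(0),
        )
        .optional()?;
    println!("{:?}", value);
    Ok(())
}
```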

I had considered TiKV too, but it comes with a server and network protocols, so it seemed less lightweight to me.

Also note that LMDB, for example, is relatively small (compressed source code ~170kiB vs ~3.9MiB for TiKV).

After some web research, looking into the LMDB C API, and experimenting a bit with LMDB in C, I think I'll go for LMDB. Not sure about which Rust binding to use though, see further below.

LMDB maps the database into memory, and (if I understand right) a corrupted database file on the hard drive can cause crashes. That means it is not possible to open a database ("environment" in LMDB's terminology) without unsafe, as you have to ensure, for example, that the database file has not been tampered with.

Nonetheless, the approach doesn't seem to be unreasonable, and it isn't a blocker for me, as long as I know about it.

Consequently, the lmdb-zero crate doesn't offer an (entirely) safe interface. Its documentation for lmdb_zero::EnvBuilder::open states:

Unsafety
It is the caller's responsibility to ensure that the underlying file will be used properly. Since LMDB is built on memory-mapping, these responsibilities run quite deep and vary based on flags.

Generally, lmdb-zero's documentation seems to be quite verbose and several design decisions appear well-reasoned, see its documentation on docs.rs (don't forget to click on "Expand description").
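
For concreteness, opening an environment (the unsafe part) and doing a single write with lmdb-zero looks roughly like this (a sketch adapted from the crate's documentation; the path, file mode and key/value are placeholders, and the directory has to exist already):

```rust
extern crate lmdb_zero as lmdb;

fn main() {
    // `open` is unsafe: the caller promises that nothing interferes with the
    // memory-mapped file in ways LMDB cannot detect (see the quote above).
    let env = unsafe {
        lmdb::EnvBuilder::new()
            .unwrap()
            .open("/tmp/kv-env", lmdb::open::Flags::empty(), 0o600)
            .unwrap()
    };

    // The default (unnamed) database of the environment.
    let db = lmdb::Database::open(&env, None, &lmdb::DatabaseOptions::defaults()).unwrap();

    // A single write transaction.
    let txn = lmdb::WriteTransaction::new(&env).unwrap();
    {
        let mut access = txn.access();
        access.put(&db, "hello", "world", lmdb::put::Flags::empty()).unwrap();
    }
    txn.commit().unwrap();
}
```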

lmdb-zero's repository has the last commit on April 22nd, 2018.


Well, it's always a shame when projects go without commits for too long, but considering they're bindings, that isn't as much of a problem. Unless you're looking to have the cutting-edge version of LMDB, lack of activity is fine if there's documentation.

I thought the *-sys crates would provide bindings to the native liblmdb on my system. In that case, I wouldn't worry too much about old crates.

So I wondered if the *-sys crates really use the same LMDB as my operating system installed. They do not. Instead, they are shipped with old copies of LMDB's code:

The package manager of my operating system (FreeBSD) ships LMDB version 0.9.29.

I'm not even sure if LMDB changed its binary format. Does anyone have information on that?

Also, I would like to know: is it common practice to include a copy of the upstream source in *-sys crates?

(Edit: Removed my comments on getting errors when updating the database. Looks like my database was in a wrong state.)


After finding the changelog of LMDB, I'm getting a bit nervous, seeing which fixes are missing in the Rust crates (an excerpt):

  • Fix MDB_DUPSORT alignment bug (ITS#8819)
  • Fix delete behavior with DUPSORT DB (ITS#8622)
  • Fix mdb_cursor_get/mdb_cursor_del behavior (ITS#8722)
  • ITS#8756 Fix loose pages in dirty list
  • ITS#9007 Fix loose pages in WRITEMAP
  • ITS#9278 fix robust mutex cleanup for FreeBSD

I haven't looked deeper into these, but the thought of using a database in production where such bugs aren't fixed doesn't make me sleep well…

I am not sure what this implies. Is it possible to share memory between different processes? My guess would be no, but I don't know for sure. I certainly don't have answers (and really I have more questions than answers), but it seems like you need some kind of inter-process communication. I don't know what LMDB is doing "under the hood". Do you?

[Oh, and my main thought was "why not use a single process"?]

It basically means the files are opened by two different programs running on the same machine. This involves using some sort of locking (not sure which mechanism LMDB uses, I think it's file locking or SYSV IPC).

To have different tools operating on the same dataset. Also to allow backups using a separate software/process.


The "Caveats" section in LMDB's documentation reveals some internals.

I would think it's easier to have the tools run under a single process (if they are to be running at the same time, which I presume is what is wanted).

I think I remember some recent discussion that this is problematic (not supported in a uniform way by different operating systems). My gut feeling is it would not work, or not work well. My gut feeling may of course be completely wrong!

I was also thinking of command-line tools, for example. If there is only a single process, then the command-line tools would need to communicate with the running process. That's possible, but I'm not sure it always makes things easier. (It depends on the use case, actually, I'd say.)

When network transparency is required, then having a (separate) server process is a necessity anyway.

I think you are right that there are certain issues. To cite LMDB's documentation for some examples of where locking can be difficult:

On BSD systems or others configured with MDB_USE_POSIX_SEM, startup can fail due to semaphores owned by another userid.
[…]
Do not have open an LMDB database twice in the same process at the same time. Not even from a plain open() call - close()ing it breaks flock() advisory locking.
[…]
Do not use LMDB databases on remote filesystems, even between processes on the same host. This breaks flock() on some OSes, possibly memory map sync, and certainly sync between programs on different hosts.
[…]
Opening a database can fail if another process is opening or closing it at exactly the same time.

(Edit: Made second part a separate post, see below.)

As of now, I'm tempted to use the original LMDB C implementation (from Howard Chu, Symas Corp.) and write my own Rust bindings. I don't want to use crates that include copies of the original source which then aren't updated for years. Hence doing it myself seems most reasonable.


My general belief is that a server process is necessary for "good" (well-behaved) concurrency, but I don't know what you mean by "network transparency" or why it results in that conclusion.

With "process" I meant an OS process, not a task or thread.

With "network transparency" I meant being able to move the server/storage part of a client-server application to a different host.

So what I meant is: when I want to be able to create a client/server application which runs over a network (i.e. where it's irrelevant where the server and the client run, because they use TCP/IP to communicate with each other, for example), then I need an architecture where my database is a separate process (possibly on a different machine).

When I just have a bunch of programs which run on the same host/machine, then I won't need something like TCP/IP, and I could either work with having multiple OS processes accessing the same database on file storage (assuming that writes are synchronized somehow, which is supported by LMDB), or I can have a single OS process which runs several threads. (Of course I could also install a database server locally and communicate with it through TCP/IP, even if it's on the same host/machine, but that makes handling a bit more complex sometimes.)
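
As a sketch of that "multiple OS processes, no server" variant with LMDB (again using lmdb-zero, assuming the environment from the earlier sketch already exists; the path is a placeholder): a second process simply opens the same environment and reads in its own transaction, while LMDB's lock file coordinates it with the single writer:

```rust
extern crate lmdb_zero as lmdb;

fn main() {
    // A separate OS process opening the same on-disk environment. LMDB uses
    // its lock file to coordinate with the writer; readers don't block it.
    let env = unsafe {
        lmdb::EnvBuilder::new()
            .unwrap()
            .open("/tmp/kv-env", lmdb::open::Flags::empty(), 0o600)
            .unwrap()
    };
    let db = lmdb::Database::open(&env, None, &lmdb::DatabaseOptions::defaults()).unwrap();

    // A read-only snapshot of the database as of the start of the transaction.
    let txn = lmdb::ReadTransaction::new(&env).unwrap();
    let access = txn.access();
    let value: &str = access.get(&db, "hello").unwrap();
    println!("hello -> {}", value);
}
```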

This is where I cannot really understand what LMDB is doing. I can imagine that the memory for the memory-mapped file is accessible to multiple processes, but how can the synchronisation be done? That is to say, telling one process "hold on a moment, the database is being updated", in other words, inter-process locking. Well, maybe there is a mechanism, but I don't know what.

There is SysV IPC, for example. Another way could be to use flock on a lock file on the file system.
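
To illustrate the lock-file variant (just the general mechanism, not necessarily what LMDB does internally): a sketch using the fs2 crate, which wraps flock on Unix and LockFileEx on Windows:

```rust
use std::fs::OpenOptions;

use fs2::FileExt; // advisory whole-file locks

fn main() -> std::io::Result<()> {
    let lock_file = OpenOptions::new()
        .create(true)
        .write(true)
        .open("db.lock")?;

    // Blocks until no other process holds the exclusive lock.
    lock_file.lock_exclusive()?;

    // ... read or modify the shared database file here ...

    lock_file.unlock()?;
    Ok(())
}
```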

Ok, this seems to be specific to Linux; I have not used Unix for over 30 years, and I know even less about Linux.

Not sure how this works under Windows, but I'd assume there are similar mechanisms…

"IPC" (Interprocess Communication) is the keyword in general. Or maybe semaphores, or file-locking.

In my limited understanding, LMDB supports 1-writer + infinite readers, without locking.

It does so by using append-only B+-trees.

Imagine you have an immutable ordered map with nodes of size 4k and stored on disk. That is the core of LMDB.

When you do a write to LMDB, it finds the leaf page, creates a new copy of it, then creates a new copy of all the ancestors of the leaf node, all the way up to the root, and creates a new tree (sharing many of the old nodes).

Thus, it never overwrites any existing page, but the cost of a write is creating log(n) new 4k pages.

This also allows for 1 writer + infinite readers, without locking.

The tricky part is: to save space, there is a garbage collector, but the ref counts / free-page list are also stored in the tree itself, and this gets self-referential / tricky.

EDIT: The argument for 1 writer + infinite reader is: in many situations, you read far more than you write. If you are write dominant, probably don't use LMDB.
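
To make the path-copying idea concrete, here is a toy sketch of my own (an immutable binary tree instead of a paged B+-tree, nothing LMDB-specific): an insert copies only the nodes on the root-to-leaf path and shares the rest, so an old root remains valid for readers:

```rust
use std::sync::Arc;

// An immutable node; subtrees are shared between tree versions via Arc.
#[derive(Debug)]
struct Node {
    key: u64,
    value: String,
    left: Option<Arc<Node>>,
    right: Option<Arc<Node>>,
}

// Inserting returns a *new* root. Only the nodes on the path from the root to
// the insertion point are copied (log(n) of them); everything else is shared.
fn insert(node: Option<&Arc<Node>>, key: u64, value: String) -> Arc<Node> {
    match node {
        None => Arc::new(Node { key, value, left: None, right: None }),
        Some(n) if key < n.key => Arc::new(Node {
            key: n.key,
            value: n.value.clone(),
            left: Some(insert(n.left.as_ref(), key, value)),
            right: n.right.clone(), // shared, not copied
        }),
        Some(n) if key > n.key => Arc::new(Node {
            key: n.key,
            value: n.value.clone(),
            left: n.left.clone(), // shared, not copied
            right: Some(insert(n.right.as_ref(), key, value)),
        }),
        // Same key: replace the value, share both subtrees.
        Some(n) => Arc::new(Node { key, value, left: n.left.clone(), right: n.right.clone() }),
    }
}

fn main() {
    let v1 = insert(None, 10, "ten".into());
    let v1 = insert(Some(&v1), 5, "five".into());
    // A "reader" can keep using v1 while a "writer" creates v2.
    let v2 = insert(Some(&v1), 7, "seven".into());
    println!("old root key: {}", v1.key);
    println!("new root key: {}", v2.key);
}
```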


It might be worthwhile to look at jammdb; imho it is surprisingly readable (compared to the LMDB C source, which I have not yet managed to read).

A colleague told me something similar after his research: LMDB is fast for reading, but not as efficient for writing. But perhaps it's still faster than many other solutions? I've not checked benchmarks.

Edit: Some benchmarks comparing LMDB to SQLite from the design review linked below:

Benchmarking shows SQLite to perform relatively poorly on KV workloads when compared to LMDB. In sequential reads on published benchmarks LMDB has 47x throughput (I measured 80x for string keys). In random reads LMDB has 9x throughput. Sequential writes, 8x; random writes 5.7x.

Maybe I should take a look at jammdb also. I heard that the LMDB code is hard to read (but "brilliant"). Not sure if that's true though.

Regarding jammdb, it looks like you can't work on a database with multiple processes, as the database file gets exclusively locked when opening.

I also found this document: "Design Review: Key-Value Storage", which is a bit old (almost 4 years old) but might contain interesting information regarding the topic of key-value stores!