Avoiding unsafe when using a C library: LMDB

I have been writing a crate (mmtkvdb, memory mapped typed key-value database) to provide a key-value store using the Symas LMDB library (technical docs here).

See also my previous thread: Current solutions for key value stores

LMDB uses memory-mapping, and my crate allows retrieving references to such data on disk (without copying), where alignment is not an issue. Of course, these operations must be unsafe (to avoid being unsound) when certain types are involved (e.g. bools or enums), because a Rust reference must not point to invalid data or uninitialized memory, and data that persists in a file could have been modified by another program.

So if I keep the feature of directly referencing data stored on disk, I will not get rid of unsafe. I don't see this as a no-go, but I documented that restriction:

Safety

Because of how memory-mapped I/O is being used and also because of certain assumptions of the underlying LMDB API, opening environments and databases requires unsafe Rust (i.e. the programmer must ensure that certain preconditions are met that cannot be enforced by the compiler to avoid undefined behavior). If you aim to program in safe Rust only, this Rust library is not suitable for you.

However, I would like to avoid unsafe in certain other cases. Let me get to an example:

LMDB offers registering a comparison function, which can be used to collate keys and values:

int mdb_set_compare(MDB_txn *txn, MDB_dbi dbi, MDB_cmp_func *cmp)

Set a custom key comparison function for a database.

The comparison function is called whenever it is necessary to compare a key specified by the application with a key currently stored in the database. If no comparison function is specified, and no special key flags were specified with mdb_dbi_open(), the keys are compared lexically, with shorter keys collating before longer keys.

Warning

This function must be called before any data access functions are used, otherwise data corruption may occur. The same comparison function must be used by every program accessing the database, every time the database is used.

Now I wonder what really happens if the comparison function changes in a wrong way (which can always happen without the program noticing, e.g. because the database files could be replaced on the file system). In that case "data corruption" may occur. But would that lead to undefined behavior? I feel like the LMDB API reference isn't specific enough to judge what "data corruption" means in this context. If it just meant that key-value pairs could be lost, for example, that wouldn't be unsafe.

This problem is similar to changing the comparison function of the type stored in a HashSet:

It is a logic error for an item to be modified in such a way that the item’s hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the set. This is normally only possible through Cell, RefCell, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified (it could include panics, incorrect results, aborts, memory leaks, or non-termination) but will not be undefined behavior.

As you can see, the Rust standard library specifically gives a guarantee(!) that the "not specified" behavior (see also unspecified behavior in Wikipedia) will not be "undefined behavior". But API specifications of C libraries might not always be that clear about what happens if you call functions in a wrong way (e.g. set a wrong comparison function in LMDB).

Some questions:

  1. In the concrete case of LMDB: Would you assume that calling mdb_set_compare with a comparison function that does not match the underlying database will cause undefined behavior (UB)?
  2. Did you stumble upon other cases of APIs written in different languages where it's difficult to judge whether violating certain preconditions causes only certain errors or rather leads to undefined behavior (which would then require unsafe if you cannot ensure the precondition is always true)? How did you deal with these cases?

I don't know anything about LMDB, but my basic assumption would be that the two arguments to the comparison function are dynamically sized/void pointers. Unless stated and/or proven otherwise, I would generally assume that the compare skips bounds checking against the size field, possibly doing an uninitialized read...

If the comparison function assumes a certain size, that's right. But this could be avoided by using comparison functions which respect the size of the datum (which is passed to the comparison function, see mv_size field of struct MDB_val).

But even if such a bounds check is done, it's still unclear to me whether "data corruption" (that can still happen, even if the comparison function performs a boundary check using the mv_size field) is unspecified or undefined behavior.


An example could be a change in the collation rules for a field, e.g. changing collation rules to sort "10" after "2" instead of the other way around. Even if both comparison functions avoid reading beyond (char *)mv_data + mv_size, this can still mess up the internal state of the database when the ordering of "10" and "2" changes between accesses.

I would assume the unspecified vs undefined behavior difference with HashSet is because there's no use of unsafe in the implementation that doesn't establish all safety requirements of Rust, which probably just means they don't call the key function in unsafe code. They can't say what will happen, but if nothing undefined is happening in unsafe code, no undefined behavior will happen.

You don't have that guarantee here, without doing the leg work of proving that none of the C code makes unsafe assumptions about the key function (e.g. not just logically incorrect, but able to cause undefined behavior). That sounds impossible enough that it's probably easier to just port it to Rust :pensive:

Not quite your question, but note that you in theory could support a safe API if you had a way of ensuring a lack of unsafe access, but here that would have to be some IPC mechanism to negotiate RwLock-style access. Sounds messy if you're avoiding bringing the performance to a crawl though.

Undefined behavior means that anything can happen. I'm not sure if it makes sense to claim that "anything" can happen except something that can make "anything" happen (UB). Basically it means "anything but at least one thing" can happen. Formally that excludes UB, but does this really help?

To give an (awkward / fantasy) example: Let's assume that providing inconsistent behavior of Hash::hash and Eq::eq will make HashSet behave in the following strange way: Writing and reading from a HashSet will cause other HashMaps to behave inconsistently (e.g. because data of all HashSets and HashMaps is internally stored in a singleton, which is likely not true, but this is just an example). This certainly would not be "undefined behavior", but in practice would likely be unsound because other code might rely on HashMaps working correctly if Hash and Eq are implemented "logically correct" for that use.

My point is: Allowing "anything but UB" to happen makes it impossible to soundly use unsafe anywhere else in your code because the assumptions in your unsafe code might not be met.

Perhaps it would be wise to extend the documentation of std to include a positive list of things that can happen, because "anything but UB" is very vague (but perhaps I'm missing something here).

Yeah, I guess I would have to check LMDB's source. And even if I don't find anything, I can't guarantee that future versions won't change. But I see here a fundamental problem: Libraries written in other languages will often not provide detailed explanations of what exactly happens when certain preconditions are not met. I think if you cannot ensure the preconditions are met, you'll always infect any use of the API with unsafe.

With Rust libraries, we have the advantage that only unsafe functions may cause UB. But like I said above: I believe that allowing "anything but UB" to happen will make further use of unsafe unsound because you don't really know what needs to hold in order for the other unsafe code to be sound.

To summarize:

  • If you never use unsafe in Rust, then you're safe.
  • If you provide a safe API in Rust that claims "anything but UB" can happen (as is done in std with HashSet / HashMap), you are safe, but you cannot use unsafe soundly elsewhere in your code.
  • Documentation of non-Rust libraries might be less specific than documentation of Rust libraries in regard to what happens when preconditions aren't met (because there is no unsafe modifier in C, for example). This might infect Rust code with unsafe unnecessarily.

I didn't understand what you meant here.

I think this is close to, if not literally the definition of undefined though: if you have no UB, then all other safety pre- and post-conditions hold. For example, the HashSet might hard-lock your process, it might remove or duplicate elements, or any other such logical errors could happen: but there's no chance of a double free, aliased mut references, or other memory safety issues.

Yes, this means that your safety establishing code for unsafe can't depend on logical guarantees from other modules, but in practice this isn't really an issue: you generally either aren't sharing any state and there's no interaction at all, or the interaction is factored out into a common module that establishes all the safety guarantees.

This only works with Rust-only code, of course.

Basically, you stated that at least one reason for unsafety is you can't ensure that only one process is writing a page/row, and that it could modify one out from under another, but IPC lets you ensure this, assuming only this library is used to access the file. Whether that assumption can be made is a lot tougher to call! Maybe a fig leaf unsafe init/open that has your users pinky promise they won't let that happen?

Personally, I'd assume it very likely could, unless I'd pinned the LMDB to a specific version after a thorough analysis. I tend to treat every C function as unsafe, with its documented usage and behavior as its safety contract. For instance, if a function doesn't specify what happens when a null pointer is passed in, I'd assume it violates the safety contract and might cause UB, and refrain from passing a null pointer in my wrapper. Doing anything else would be opening the door for new UB possibly occurring every time the C library is updated, since stability only applies to the documented parts of its interface.

But would you agree that formally, if "any error but UB" can happen, I cannot use unsafe code soundly anywhere in my program?

Let me try to pin this down to a particular example. Consider the following code:

use rand::random;

use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

struct Wobbly;

impl PartialEq for Wobbly {
    fn eq(&self, _: &Wobbly) -> bool {
        random()
    }
}

impl Eq for Wobbly {}

impl Hash for Wobbly {
    fn hash<H: Hasher>(&self, state: &mut H) {
        random::<bool>().hash(state)
    }
}

fn main() {
    let a = [1, 2, 3, 4];
    let mut m = HashMap::<String, usize>::new();
    m.insert("index".to_string(), 3);
    m.insert("something_else".to_string(), 999);
    let mut s = HashSet::<Wobbly>::new();
    for _ in 0..12 {
        s.insert(Wobbly); // anything but UB can happen here!!
    }
    unsafe {
        // is this sound, and how can we prove it?
        println!("{}", a.get_unchecked(*m.get("index").unwrap()));
    }
}

(Playground)

Output:

4

I would argue that s.insert(Wobbly) can do anything as long as it's not UB. The exact wording in the docs is:

The behavior resulting from such a logic error is not specified (it could include panics, incorrect results, aborts, memory leaks, or non-termination) but will not be undefined behavior.

So it could, for example, make instances of HashMap return wrong values, because internally HashSet and HashMap use a shared state that is private but can be corrupted. I'm pretty sure this isn't the case in practice, but it's just a thought experiment for now. Note that while this hypothetical behavior is pretty odd, it would not be undefined behavior. If only safe Rust is involved, no memory error can occur.

Nonetheless, relying on HashMap returning correct values in the following unsafe code would be unsound.

Hence I would conclude that using s.insert(Wobbly) anywhere in my code would formally make it impossible to use unsafe in a sound way anywhere else in the code.

Is my reasoning wrong? I would say this is a rather theoretical problem, but it could be an issue when you want to formally verify correctness of a program. I would claim that to fix this, the standard library needs to provide a finite list of behaviors that may happen instead of saying "anything but UB" can happen.


Yeah, that's exactly my point :sweat_smile:. That's why, for example, mmtkvdb::EnvBuilder::open_rw has to be unsafe :frowning:. And honestly, it's pretty impossible to guarantee that a file is not corrupted (unless I would do a full scan and compare a hash value, for example).

I still feel like it's better to mark these functions as unsafe even if in practice, users might not be able to ensure using them soundly. At least it's the fault of the user then :innocent:.


Even if I don't like the implications in my case, I would agree.

That's not allowed. Unsafe code must anticipate that safe functions, traits, etc. can be implemented incorrectly. If your unsafe code relies on the correct behavior of HashSet for memory safety, then it's unsound, because it can be broken by safe code.

Not being able to rely on std behaving as documented would make it pretty much impossible to soundly use unsafe then. I couldn't make any assumptions about Vec not randomly losing elements, for example.

Well, you can depend on safe types to behave as documented. But they can have an unspecified but memory-safe state in certain scenarios. Really, there are only two ways this can happen: if you use it after one of its operations panics, or if you ignore preconditions in safe code, such as using an inconsistent Ord. At that point, the only consistent thing you can do is to drop the object.

But is that sufficient to ensure that further unsafe code (which uses different types/objects/functions) can be sound? There could be thread-local caches, etc. involved.

I'm not sure if I'm doing a good job with explaining what I mean…

Basically I try to say that unspecified behavior, if it doesn't specify what exactly is unspecified and what might happen, is formally safe but in practice formally unsound as soon as any unsafe code exists.

In other words: "unspecified behavior" is a very unspecified term, and it will make reasoning about code similarly difficult as UB does in theory practically impossible if unsafe code is involved. (Edit: yeah, I know "in theory practically" is kinda contradicting, but I don't know how to express it in a better way.) This is why I think it needs to be specified what's "unspecified" in "unspecified behavior". Not sure if I made that explanation better now. :sweat_smile:

As I read it, you're asking that these types at least formally specify that the unspecified behavior is local to that instance, that breaking the logical preconditions of one HashMap doesn't mean you can no longer assume the logical post-conditions of another?

That seems pretty reasonable, and what people already assume in practice.


It's probably necessary for UnwindSafe to make any sense at all: the fundamental assumption is that only mutable references or interior mutability can let inconsistent state leak past a catch_unwind() boundary, so passing everything through a Mutex is sufficient to avoid any issues. This assumption holds in the presence of most regular static variables, since they require a Mutex for synchronization. However, it does not seem to hold for thread-local variables, which contain no poisoning logic.

If "behavior local to an instance" is (or can be) well-defined, then yes, it might be sufficient. I originally thought more on a finite list of things that can happen, like:

  • while using methods on a HashSet instance that had affected elements stored:
    • deadlocks / non-termination
    • panics
    • aborts
    • memory leaks (or more general: resource leaks)
    • retrieval of wrong values

I don't think it would be necessary to state that, for example, randomized base addresses could be exposed, because I can't think of an implementation where this would happen due to inconsistent ordering/equality. And I generally believe that an open list of possibilities that might happen (instead of a finite one) makes reasoning about a program very hard (which, in turn, is necessary to soundly use unsafe).

I'm thinking on maybe writing an issue to fix the documentation of HashSet / HashMap in that regard. The same issue affects some other data structures in std::collections:

I would like to point out that in other places of the standard library, more specific remarks about unspecified behavior are made, for example in Vec::drain:

Leaking

If the returned iterator [Drain] goes out of scope without being dropped (due to mem::forget, for example), the vector may have lost and leaked elements arbitrarily, including elements outside the range.

This is a good example of how unspecified behavior can be "specified": while it leaves a certain degree of freedom to the implementor, it doesn't make reasoning excessively hard.


Another thing I found in the Rust reference, which irritates me:

Logic errors

[…]

For example, implementing both Hash and Eq requires that values considered equal have equal hashes. Another example are data structures like BinaryHeap, BTreeMap, BTreeSet, HashMap and HashSet which describe constraints on the modification of their keys while they are in the data structure. Violating such constraints is not considered unsafe, yet the program is considered erroneous and its behavior unpredictable.

Here, it is explicitly stated that the "program behavior" becomes "unpredictable" (which is pretty much what UB is), though it's not considered "unsafe". I could interpret this like I did before: "Anything can happen except UB", which is not very helpful in practice when you want to reason about a program. And it would make it impossible to soundly use unsafe anywhere.


I created an updated playground to show more drastically what I mean: Updated Playground.

Well, there are multiple, distinct problems here.

  • First, you may be tempted to trust core, std, alloc, etc., but I'm not sure if that is formally allowed. It is most definitely disallowed to rely on documented/logically consistent behavior of any 3rd-party code for soundness of unsafe code. Code that comes straight from the Rust distribution may be a reasonable exception, but it's not automatic.
  • You need to precisely define what you want to rely on. "Vec randomly losing elements" is way too broad. In what sense would it lose elements?
    • Would it be equivalent to pop() or remove()? In that case, you would still get consistent pointer, length, and capacity information from it, so you couldn't blame it if you relied on its logical invariants for safety.
    • If instead it lost an element in the sense that it would become uninitialized/garbage, then that's not your fault, since Vec uses unsafe internally.

I do think you're being a bit pessimistic: undefined behavior is a much larger set of possible behaviors than unspecified but memory safe behaviors. You can safely assume that if you have a reference to a value, nobody else has a mutable reference to it, or that if an item is not visible outside a module, that only items in that module have access to it. Most safety guarantees only need this much.

You are right though that you could, for example, have some handle containing a key handed out that proves via the type system that that key was inserted into a private HashMap, which you use in an unsafe block. Without a "this instance only" restriction on violating preconditions there's a safety hole here, in theory (of course, you would have to go out of your way to break this).

I started a discussion on IRLO to consider adding such restrictions.

2 Likes