Is There a Good, Thread-Safe Append-Only Map Crate?

I'm looking for a thread-safe, append-only HashMap that I can add elements to with an immutable reference to the map. I've been using append_only_vec, but I could have a lot of entries and need to look them up by key, so I'm not sure about the performance implications of an append-only Vec<(key, value)> compared to a HashMap<key, value>.
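For what it's worth, the performance question mostly comes down to lookup cost: a Vec<(key, value)> needs a linear scan per lookup, while a HashMap hashes the key and jumps straight to a bucket. A toy sketch of the two access patterns (the u32 keys are just for illustration):

```rust
use std::collections::HashMap;

// Linear scan over an append-only Vec<(key, value)>: O(n) per lookup.
fn scan_lookup<'a>(pairs: &'a [(u32, &'a str)], key: u32) -> Option<&'a str> {
    pairs.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
}

fn main() {
    let pairs: Vec<(u32, &str)> = vec![(1, "a"), (2, "b"), (3, "c")];

    // HashMap<key, value>: hashes the key, O(1) expected per lookup.
    let map: HashMap<u32, &str> = pairs.iter().copied().collect();

    assert_eq!(scan_lookup(&pairs, 3), map.get(&3).copied());
    assert_eq!(scan_lookup(&pairs, 9), None);
}
```

With a small number of entries the scan can even be faster, but with "a lot of entries" the O(n) per lookup adds up.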

It looks like OnceMap might work, I'll try that.

2 Likes

append_only_vec guarantees that elements will not be moved (their address will not change). If you're relying on that feature, then OnceMap will not work for you, since it doesn't do this -- it uses the std HashMap, which does move elements. Edit: Sorry, that is incorrect, since OnceMap values must implement the StableDeref trait.
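To make the StableDeref point concrete: a value like Box<T> or String can itself be moved around by the table, while the heap data it points to stays at a fixed address, which is what lets OnceMap hand out stable references. A minimal standalone sketch of that property:

```rust
fn main() {
    let b: Box<u32> = Box::new(42);
    let addr = &*b as *const u32; // address of the heap allocation

    // Move the Box itself (much like a HashMap rehash moves its entries)
    let moved = vec![b];

    // The pointed-to data did not move
    assert_eq!(addr, &*moved[0] as *const u32);
    assert_eq!(*moved[0], 42);
}
```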

In addition, I don't have much confidence in a crate like OnceMap with no docs at all; I had to look at the source to see whether the elements are moved or not. Edit: There is some documentation in the README, so perhaps I'm wrong about this too.

It may need some tweaks to be performant, but it's not too hard to build one of these yourself:

fn main() {
    const N: usize = 1024;
    let map = AppendOnlyMap::<_, _>::with_capacity(4);

    // Insert concurrently on a bunch of different threads
    let vals: Vec<&str> = std::thread::scope(|s| {
        let mut handles = vec![];
        let map = &map;
        for x in 0..N {
            handles.push(s.spawn(move || map.try_insert(37 * x, format!("{x:04}")).unwrap().as_str()));
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Check contents
    for x in 0..N {
        assert_eq!(vals[x], map.get(&(37 * x)).unwrap());
    }
}

// ----------------------

use std::sync::OnceLock;
use std::hash::{Hash,BuildHasher};

#[derive(Debug)]
pub struct AppendOnlyMap<K,V,H=std::collections::hash_map::RandomState> {
    hasher: H,
    table: Page<K,V>
}

#[derive(Debug)]
struct Page<K,V> {
    next: OnceLock<Box<Self>>,
    items: Box<[OnceLock<(K,V)>]>,
}

impl<K:Eq+Hash, V, H:BuildHasher> AppendOnlyMap<K,V,H> {
    fn hash_key(&self, k:&K)->u64 {
        self.hasher.hash_one(k)
    }
    
    pub fn get(&self, k:&K)->Option<&V> {
        self.table.get(k, self.hash_key(k))
    }
    
    pub fn try_insert(&self, k:K, v:V)->Result<&V, (K,V)> {
        let hash = self.hash_key(&k);
        self.table.try_insert(k,v,hash)
    }
    
    pub fn with_capacity(cap:usize)->Self where H:Default {
        AppendOnlyMap {
            hasher: H::default(),
            table: Page::new(cap)
        }
    }
}

impl<K:Eq, V> Page<K,V> {
    fn new(len:usize)->Self {
        let mut items = Vec::with_capacity(len);
        items.resize_with(len, OnceLock::new);
        Page {
            next: OnceLock::new(),
            items: items.into_boxed_slice()
        }
    }
    
    fn idx(&self, hash:u64)->usize {
        (hash % (self.items.len() as u64)) as usize
    }
    
    fn get(&self, k:&K, hash:u64)->Option<&V> {
        let candidate = self.items[self.idx(hash)].get()?;
        if candidate.0 == *k { Some(&candidate.1) }
        else { self.next.get()?.get(k, hash) }
    }
    
    fn try_insert(&self, k:K, v:V, hash:u64)->Result<&V, (K,V)> {
        let i = self.idx(hash);
        match self.items[i].set((k,v)) {
            // Slot was empty, insert was successful
            Ok(()) => Ok(&self.items[i].get().unwrap().1),
            Err((k,v)) => {
                if self.items[i].get().unwrap().0 == k {
                    // Slot was full with a matching key; cannot insert
                    Err((k,v))
                } else {
                    // Slot was full with a colliding key; try next page
                    let len = self.items.len() * 2;
                    self.next.get_or_init(|| Box::new(Self::new(len))).try_insert(k,v,hash)
                }
            }
        }
    }
}
3 Likes

im has you covered: https://crates.io/crates/im

Of course, you need to use the (default) Arc version of it.

im doesn't allow you to insert an element into an existing collection given an immutable reference to it. What it allows is creating a new collection containing the old elements plus the new one.
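To illustrate the distinction (sketched with a full clone of std's HashMap, whereas im does this cheaply via structural sharing): a persistent-style update leaves the original untouched and hands back a new map:

```rust
use std::collections::HashMap;

// Persistent-style update: return a modified copy, leave the input alone.
// (The clone() is just to demonstrate the semantics; im avoids the full
// copy by sharing structure between the old and new maps.)
fn inserted(map: &HashMap<&'static str, i32>, k: &'static str, v: i32) -> HashMap<&'static str, i32> {
    let mut new_map = map.clone();
    new_map.insert(k, v);
    new_map
}

fn main() {
    let old = HashMap::from([("a", 1)]);
    let new = inserted(&old, "b", 2);
    assert_eq!(old.len(), 1); // original unchanged
    assert!(new.contains_key("a") && new.contains_key("b"));
}
```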

3 Likes

Of course not, that doesn't make any sense. But it doesn't change the original map when adding an element to it, which is what you normally want.

It does – that's exactly what is meant by a "concurrent data structure". Suitably-synchronized interior mutability so that it remains correct in the face of concurrency and threading, allowing all operations to be performed given only a shared reference.

2 Likes

I’d be surprised if they avoided this issue ^^

In fact, now, a few minutes later, here’s the issue adapted:

use std::{
    panic::{catch_unwind, AssertUnwindSafe},
    sync::Mutex,
};

#[derive(PartialEq, Eq, Debug)]
struct H(u32);

impl std::hash::Hash for H {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        if PANIC_ON.lock().unwrap().as_ref() == Some(&self.0) {
            panic!();
        }
        0_u32.hash(state);
    }
}

static PANIC_ON: Mutex<Option<u32>> = Mutex::new(None);

fn main() {
    let mut map = once_map::sync::OnceMap::new();
    for i in 1..=28 {
        map.insert(H(i), |k| {
            if *k == H(28) {
                String::from("Hello World!")
            } else {
                String::new()
            }
        });
    }
    for i in 1..=27 {
        map.remove(&H(i));
    }

    let hello_world = map.get(&H(28)).unwrap();

    println!("exists: {hello_world}");

    let _ = catch_unwind(AssertUnwindSafe(|| {
        *PANIC_ON.lock().unwrap() = Some(28);
        map.insert(H(1), |_| String::new());
    }));

    println!("gone: {hello_world}");
}

Edit: Reported this, for good measure :slight_smile:


@zicklag FYI, you should not be able to run into this issue at all if you never .remove anything from your map, anyways :wink:


Edit2: (20 hours later) The crate seems very well maintained, the issue is fixed already.

3 Likes

No, that's what persistent (or "functional" or "immutable", but not in the Rust sense) data structures as in im are for.
A type that is immutable but can be changed with changes visible from the outside - as in adding elements to a container - is an abomination and a lie.

So RefCell<Vec<_>> and Mutex<Vec<_>> are an abomination and a lie?

You should probably familiarize yourself with basic terminology before shouting nonsense at others. You are confusing concurrent data structures (which are not required to be immutable) with persistent data structures.

1 Like

This exact pattern, "interior mutability", is extremely common, and not just with inexperienced rustaceans or low-profile Rust projects, but with the major, reputable projects in the Rust ecosystem.

To give you a sense of my use-case, I have a World that is shared across multiple threads. In order to be shared, you must have only an immutable reference to the object, otherwise, it wouldn't be shared, because mutable references must only exist in one place. The world contains a hash-map that must be lazily initialized, whenever a new item in it is needed. These items will never be removed, so once you get a reference to it, you know that that reference will stay valid as long as the World exists.

The only way to avoid a shared reference across threads would be to use channels, which also have to use interior mutability themselves in order to update a channel queue that is shared across threads. So all I'm doing is moving the synchronization into my World instead of using a channel to do it externally.
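A minimal sketch of that pattern (the World type and its field are hypothetical names, and a single OnceLock stands in for the lazily-filled map): initialization happens through &self, and the returned reference stays valid as long as the World does:

```rust
use std::sync::OnceLock;

// Hypothetical World: a field that is lazily initialized through a
// shared reference and never removed afterwards.
struct World {
    config: OnceLock<String>,
}

impl World {
    fn config(&self) -> &str {
        // First caller initializes; everyone else gets the same value.
        self.config.get_or_init(|| "default".to_string())
    }
}

fn main() {
    let world = World { config: OnceLock::new() };
    std::thread::scope(|s| {
        for _ in 0..4 {
            // All threads share &world; no &mut, no channel.
            s.spawn(|| assert_eq!(world.config(), "default"));
        }
    });
}
```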

Ultimately, such a type is exactly what OP has asked for, though. Interestingly enough, once you abstract around it an append-only data type can be used in a quite purely functional manner once again, too.[1]

Also, we should pay attention to the fact that in-depth discussion of interior mutability in Rust will be taking this thread off-topic relatively quickly.


  1. Of course Rust isn’t a purely functional language, so true shared mutability - be it via atomics, append-only / initialize-only data such as OnceCell, or general interior-mutability constructs such as Mutex or RefCell - is allowed everywhere. (Even in truly purely functional languages, such constructs are typically allowed as well, but only in marked contexts, not in “ordinary functions”.) I don’t follow how this qualifies as “an abomination and a lie”; even though I’m a fan of explicit effect systems myself, I’d claim that the lack of one only means that side effects are allowed everywhere, not that anyone was lying. ↩︎

4 Likes

Ooh, that's cool! I think I might just use that. I've actually wondered for a while what a simple hash-map implementation could look like, and that's not that complicated.

I don't think I need the most performant thing in the world, so that will probably work fine, and I can always swap it out with something else later if I need to.

Here's the catch: it isn't immutable, and it doesn't claim to be. And in case you wonder "but it takes a &self reference": &self is not an "immutable reference" but a "shared reference".
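Right - the classic demonstration that &T means "shared", not "immutable" (a standalone sketch, nothing specific to the crates discussed above):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Mutation through a shared reference: no &mut anywhere.
fn bump(counter: &AtomicU32) {
    counter.fetch_add(1, Ordering::Relaxed);
}

fn main() {
    let c = AtomicU32::new(0);
    std::thread::scope(|s| {
        for _ in 0..4 {
            // Every thread holds only &c, yet the value changes.
            s.spawn(|| bump(&c));
        }
    });
    assert_eq!(c.load(Ordering::Relaxed), 4);
}
```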

4 Likes

Very nice!! Compared to other solutions being discussed, this is the only one that is both non-blocking (other than when two threads attempt to insert the same key) and does not box every map element. It is the boxing that is the biggest drawback of the other solutions, IMO.

Its drawback of course is that it is not in the form of an optimized and tested crate. But we have to start somewhere. I have a need for a specialized map that is similar to this, so I appreciate that it is a clean and simple starting point. I will probably try using it.

There is one exception: it should be possible to use an arena allocator along with OnceMap to avoid a malloc for every map element.

Right; there are a lot of small optimizations that I left out because this was meant to be a prototype and pedagogical tool instead of a full solution.

For example, when you get a collision in one page, there are only two possible slots on the next page that might be used. If you seed an LCG with the hash instead, every slot on the next page is available; I suspect this will increase the average load factor, but I haven't done the analysis to verify that:

 impl<K:Eq, V> Page<K,V> {    
     fn idx(&self, hash:u64)->usize {
-        (hash % (self.items.len() as u64)) as usize
+        (hash.reverse_bits() % (self.items.len() as u64)) as usize
     }

+    fn bump_hash(hash:u64)->u64 {
+        // MMIX LCG constants, perhaps not ideal
+        const A: u64 = 6364136223846793005;
+        const C: u64 = 1442695040888963407;
+        hash.wrapping_mul(A).wrapping_add(C)
+    }
---
     fn get(&self, k:&K, hash:u64)->Option<&V> {
         let candidate = self.items[self.idx(hash)].get()?;
         if candidate.0 == *k { Some(&candidate.1) }
-        else { self.next.get()?.get(k, hash) }
+        else { self.next.get()?.get(k, Self::bump_hash(hash)) }
     }

--- fn try_insert(...
                     // Slot was full with a colliding key; try next page
                     let len = self.items.len() * 2;
-                    self.next.get_or_init(|| Box::new(Self::new(len))).try_insert(k,v,hash)
+                    self.next.get_or_init(|| Box::new(Self::new(len))).try_insert(k,v,Self::bump_hash(hash))
1 Like