Safely acquiring a pointer to one of the elements of a boxed slice

I have a rather large boxed slice (~4M elements). Each element contains a string. I would like to build a hash map to allow me to index that slice by that string. To avoid duplicating ~4M strings and using up additional memory (and on the hardware I'm targeting, I care), I'd like the keys in my hashmap to point directly to those strings rather than being copies. I've already built an unsafe wrapper type around a *const str that passes through Hash and Eq, and when building the hash map I simply iterate through the boxed slice, cast a reference to the string to that type, and insert it into my hash map as the key, with a *const pointer to the element itself as the value. The boxed slice and the hash map containing raw pointers to it are both private members of a struct that the user then interacts with.

Is there something magical I need to do involving pinning in order for this to be sound, or is the fact that the boxed slice is a private member of the struct containing it sufficient to guarantee that its contents will not move unless I move them?

After reading this question 2-3 times I can confidently say it would benefit from some concrete code for illustration, especially to precisely explain the types involved.

2 Likes

I think I understand what you are trying to do. You are building a self-referential type.

Neither of the points you listed (privacy or pinning) has anything to do with whether this is sound. That has to do with the fact that String heap-allocates its backing buffer, therefore its address is stable. So as long as you don't modify the String in a way that causes it to reallocate, you should be safe.

To be clear, this means that you are allowed to:

  • move it around
  • read from it, iterate over it, slice it immutably or mutably
  • mutate it in a way that does not affect its capacity, eg. clear() or pop()
  • publicly expose immutable references to it
  • publicly expose mutable subslices to it

but you are not allowed to:

  • push() to it
  • drop it
  • drop it and replace it with a freshly allocated String
  • publicly expose a mutable reference of type &mut String

You mentioned pinning, but pinning wouldn't help here at all. Pinning isn't a prescription, it's a description. Types can be Unpin, meaning that they are safe to move around without self-referential pointers (if any) becoming invalid. This is a safety requirement that you can rely on in generic code by requiring the Unpin bound, but you can't force a type to have this property.

Concrete types either are or aren't Unpin, which is a structural property determined by their implementation. If String didn't heap-allocate its buffer (or otherwise ensured a stable adress for it), then you couldn't magically force it to work in a self-referential setting anyway.

I understand the question a bit differently, I focus on this sentence:

Remember that mutating the keys of a hashmap is a logic-error. For this reason, I assume the Strings are effectively-immutable (while in the map).

In this case, what is wrong with the following solution?

struct Element {
   element: Arc<str>,
}

type MyHashMap = HashMap<Arc<str>, Element>

This makes the key "point to" the same memory region that contains the str data.

If you want to mutate the string while not inserted in the map, you can use Arc<String> instead of Arc<str> and call Arc::get_mut.

3 Likes

Another suggestion, but may not be applicable to your use-case: If the HashMap lives short enough and the elements are immutable and owned by someone else (or you leak them) , you could go with HashMap<&str, &Element>. You can insert into that map with map.insert(&element.key, &element)

Indeed, that isn't applicable, because:

Most probably. Similar to steffahn, I did not understand much after "I've already built..." of the original post.

I have had success in the past refactoring the data flow to allow different solutions than originally anticipated.

You could use a RawTable<usize> where the usize are indexes into your boxed slice. The hashing methods for inserting/finding etc etc all take a custom closure to hash, and you can have that deref to the implementation of Hash for the string at the given index.

Edit: here's an example of how you would do this Rust Playground

struct Map {
    items: Box<[String]>,
    indices: RawTable<usize>,
    hasher: RandomState,
}

impl Map {
    fn new(items: Box<[String]>) -> Self {
        let mut indices = RawTable::with_capacity(items.len());
        let hasher = RandomState::new();

        for (idx, item) in items.iter().enumerate() {
            let hash = hasher.hash_one(item);
            // Note: this assumes all items are different
            indices.insert(hash, idx, |&idx2| hasher.hash_one(&items[idx2]));
        }

        Map {
            items,
            indices,
            hasher,
        }
    }

    fn index_of(&self, s: &str) -> Option<usize> {
        let hash = self.hasher.hash_one(&s);
        self.indices
            .get(hash, |&idx2| s == self.items[idx2])
            .copied()
    }
}

My understanding is that OP has approximately the following data structure, and s/he is asking whether it's sound:

struct RawStr(*const str);

impl Eq for RawStr {}
impl Hash for RawStr { ... }

struct Map {
    items: Box<[String]>,
    indices: HashMap<RawStr, usize>,
}

impl Map {
    fn new(items: Box<[String]>) -> Self {
        let indices = items
            .iter()
            .enumerate()
            .map(|(i, s)| {
                let key = RawStr::from(s.as_ptr());
                let value = i;
                (key, value)
            })
            .collect();
        Map { items, indices }
    }
}

If I'm right, your Arc<str>/Rc<str> suggestion would be an equivalent but safe alternative.

H2CO3 is correct. That code seems to work in testing, but I'm worried whether there are gremlins lurking in it that will bite me in the long term.

You should probably just use @samuelpilz's solution based on reference counting. As a rule of thumb, you shouldn't write unsafe of which the soundness you need to ask about.

Very different Ansatz but if you only need to find the index of a string in the slice how about sorting the slice of strings once and doing binary search instead of using a hashmap?

If you can't sort the slice because it's structure is somehow important you could still sort an array of indices with argsort and use that to do the binary search.

pub fn argsort<T: Ord>(data: &[T]) -> Vec<usize> {
    let mut indices = (0..data.len()).collect::<Vec<_>>();
    indices.sort_by_key(|&i| &data[i]);
    indices
}

Code taken from StackOverflow answer.