I have a rather large boxed slice (~4M elements). Each element contains a string. I would like to build a hash map to allow me to index that slice by that string. To avoid duplicating ~4M strings and using up additional memory (and on the hardware I'm targeting, I care), I'd like the keys in my hashmap to point directly to those strings rather than being copies. I've already built an unsafe wrapper type around a *const str that passes through Hash and Eq, and when building the hash map I simply iterate through the boxed slice, cast a reference to the string to that type, and insert it into my hash map as the key, with a *const pointer to the element itself as the value. The boxed slice and the hash map containing raw pointers to it are both private members of a struct that the user then interacts with.
Is there something magical I need to do involving pinning in order for this to be sound, or is the fact that the boxed slice is a private member of the struct containing it sufficient to guarantee that its contents will not move unless I move them?
After reading this question 2-3 times I can confidently say it would benefit from some concrete code for illustration, especially to precisely explain the types involved.
I think I understand what you are trying to do. You are building a self-referential type.
Neither of the points you listed (privacy or pinning) has anything to do with whether this is sound. That has to do with the fact that String heap-allocates its backing buffer, therefore its address is stable. So as long as you don't modify the String in a way that causes it to reallocate, you should be safe.
To be clear, this means that you are allowed to:
move it around
read from it, iterate over it, slice it immutably or mutably
mutate it in a way that does not affect its capacity, eg. clear() or pop()
publicly expose immutable references to it
publicly expose mutable subslices to it
but you are not allowed to:
push() to it
drop it
drop it and replace it with a freshly allocated String
publicly expose a mutable reference of type &mut String
You mentioned pinning, but pinning wouldn't help here at all. Pinning isn't a prescription, it's a description. Types can be Unpin, meaning that they are safe to move around without self-referential pointers (if any) becoming invalid. This is a safety requirement that you can rely on in generic code by requiring the Unpin bound, but you can't force a type to have this property.
Concrete types either are or aren't Unpin, which is a structural property determined by their implementation. If String didn't heap-allocate its buffer (or otherwise ensured a stable adress for it), then you couldn't magically force it to work in a self-referential setting anyway.
Another suggestion, but may not be applicable to your use-case: If the HashMap lives short enough and the elements are immutable and owned by someone else (or you leak them) , you could go with HashMap<&str, &Element>. You can insert into that map with map.insert(&element.key, &element)
You could use a RawTable<usize> where the usize are indexes into your boxed slice. The hashing methods for inserting/finding etc etc all take a custom closure to hash, and you can have that deref to the implementation of Hash for the string at the given index.
Edit: here's an example of how you would do this Rust Playground
struct Map {
items: Box<[String]>,
indices: RawTable<usize>,
hasher: RandomState,
}
impl Map {
fn new(items: Box<[String]>) -> Self {
let mut indices = RawTable::with_capacity(items.len());
let hasher = RandomState::new();
for (idx, item) in items.iter().enumerate() {
let hash = hasher.hash_one(item);
// Note: this assumes all items are different
indices.insert(hash, idx, |&idx2| hasher.hash_one(&items[idx2]));
}
Map {
items,
indices,
hasher,
}
}
fn index_of(&self, s: &str) -> Option<usize> {
let hash = self.hasher.hash_one(&s);
self.indices
.get(hash, |&idx2| s == self.items[idx2])
.copied()
}
}
You should probably just use @samuelpilz's solution based on reference counting. As a rule of thumb, you shouldn't write unsafe of which the soundness you need to ask about.
Very different Ansatz but if you only need to find the index of a string in the slice how about sorting the slice of strings once and doing binary search instead of using a hashmap?
If you can't sort the slice because it's structure is somehow important you could still sort an array of indices with argsort and use that to do the binary search.
pub fn argsort<T: Ord>(data: &[T]) -> Vec<usize> {
let mut indices = (0..data.len()).collect::<Vec<_>>();
indices.sort_by_key(|&i| &data[i]);
indices
}