HashMap with slice keys?

I'm trying to implement a struct with a map whose keys are slices of a larger string or char array owned by the struct. Something like this:

struct A<'a> {
  chars: Vec<u8>,
  map: HashMap<&'a str, u32>,  
}

impl<'a> A<'a> {
    fn insert(&'a mut self, begin: usize, end: usize, value: u32) {
        self.map.insert(std::str::from_utf8(&self.chars[begin..end]).unwrap(), value);
    }
}

But this requires a lifetime annotation on self, and then I run into several issues due to this, such as Named self lifetimes and reborrowing. Is there a way to implement this idea which avoids these issues? I could use String keys, but my understanding is that this duplicates data and I'm trying to avoid this.

search for "self-referential struct" in this forum

1 Like

type like this is practically unusable and not desirable, is there a reason you don't just store the range?

struct A {
  chars: Vec<u8>,
  map: HashMap<Range<usize>, u32>,  
}

impl A {
    fn insert(&mut self, begin: usize, end: usize, value: u32) {
        &self.chars[begin..end]; // just to get the same panicking behavior

        self.map.insert(begin..end, value);
    }
}
3 Likes

One way is to just not store the Vec and HashMap in the same struct. You won't be able to move the Vec (you can't move borrowed things).

Another way is to use ranges instead of references (and deal with invalidation if you edit the Vec).

2 Likes

If you want to use temporary scope-limited references, then you have to have two structs, one crated after another:

struct AOwner {
  chars: Vec<u8>,
}

struct AViewIntoOwner<'from_owner> {
  map: HashMap<&'from_owner str, u32>,  
}

and the second one will be temporary, and only allowed to exist in the same function scope as the variable that holds AOwner, and it won't be possible to expand that.

BTW, &'a mut self referring to 'a on the type is a terrible anti-pattern that is almost never useful. It means self must be borrowed for the lifetime of 'a exactly (since &mut has no flexibility about that), and lifetime on A<'a> means it lives at least as long or longer than A, so combination of these means &'a mut self borrows for the entire lifetime of A, and because mut is exclusive, it can be called only once ever, and makes the whole object useless.

5 Likes

If you do need to do this in one struct, and can't sacrifice performance to copying strings, there's some patterns to allow this.

The one thing you can't do in safe code, even with the existing crates, is have insert be the one place you do from_utf8. But you can make an unsafe UTF-8-checked wrapper around Bytes.

struct Utf8Bytes(Bytes);

struct A {
  chars: Bytes,
  map: HashMap<Utf8Bytes, u32>,  
}
2 Likes

HashMap<Range<usize>, ...> won't work if users of this struct need to look up strings in the HashMap without knowing if or where they appear in the Vec. Like this sort of thing:

let a = A::from(some_data);
if let Some(x) = a.lookup("foo") {
    println!("foo has a score of {x}");
}
2 Likes

The annoying but optimal solution is to use the raw_entry API of hashbrown, which allows you to provide a closure which hashes/compares keys when you do the lookup. In that way you can store Range<usize> as the map key data but do lookup based on the subslice. Here's an old example where I did a version of this.

1 Like

Expanding this into a working example:

// Note: OK to derive comparisons without breaking the `Borrow` requirements
//       b/c UTF-8 and raw bytes are order-compatible
#[derive(Clone,Ord,PartialOrd,Eq,PartialEq)]
struct Utf8Bytes(Bytes);

impl Utf8Bytes {
    pub fn new(chars: Bytes)->Result<Self, Utf8Error> {
        let _ = from_utf8(&chars)?;
        Ok(Utf8Bytes(chars))
    }
}

impl Borrow<str> for Utf8Bytes {
    fn borrow(&self)->&str {
        // Safety: UTF-8 was checked on construction, and the data is immutable
        unsafe { from_utf8_unchecked(&self.0) }
    }
}

impl Hash for Utf8Bytes {
    fn hash<H:Hasher>(&self, hasher: &mut H) {
        let chars:&str = self.borrow();
        chars.hash(hasher)
    }
}

struct A {
  chars: Bytes,
  map: HashMap<Utf8Bytes, u32>,  
}

impl A {
    pub fn new(chars: impl Into<Bytes>)->Self {
        A { chars: chars.into(), map: HashMap::new() }
    }
    
    pub fn insert(&mut self, begin: usize, end: usize, value: u32) {
        self.map.insert(Utf8Bytes::new(self.chars.slice(begin..end)).unwrap(), value);
    }
}

#[test]
fn test_str_lookup() {
    let mut a = A::new("Hello, world!");
    a.insert(0, 5, 1);
    a.insert(7, 12, 2);

    assert_eq!(a.map.get("Hello"), Some(&1));
    assert_eq!(a.map.get("world"), Some(&2));
    
}
2 Likes

This idea of storing ranges would be possible using hashbrowns new safe HashTable API.
This idea could also work with strings by using String instead of Vec<K> and &str instead of &u8

use hashbrown::hash_table::Entry;
use hashbrown::{HashMap, HashTable};
use std::hash::{BuildHasher, Hash, RandomState};
use std::ops::Range;

type TableData<V> = (Range<usize>, V);
pub struct SliceMap<K, V, S> {
    key_data: Vec<K>,
    hash_table: HashTable<TableData<V>>,
    hasher: S,
}

pub struct SliceMapVacantEntry<'a, V> {
    range: Range<usize>,
    data: hashbrown::hash_table::VacantEntry<'a, TableData<V>>,
}

impl<'a, V> SliceMapVacantEntry<'a, V> {
    fn insert(mut self, v: V) -> &'a mut V {
        &mut self.data.insert((self.range, v)).into_mut().1
    }
}

pub enum SliceMapEntry<'a, V> {
    Occupied(&'a mut V),
    Vacant(SliceMapVacantEntry<'a, V>),
}

fn eq<'a, K: Eq, V>(key: &'a [K], key_data: &'a [K]) -> impl Fn(&TableData<V>) -> bool + 'a {
    move |(range, _)| &key_data[range.clone()] == key
}

fn hash_fn<'a, K: Hash, V, S: BuildHasher>(
    key_data: &'a [K],
    hasher: &'a S,
) -> impl Fn(&TableData<V>) -> u64 + 'a {
    move |(range, _)| hasher.hash_one(&key_data[range.clone()])
}

impl<K: Hash + Eq, V, S: BuildHasher> SliceMap<K, V, S> {
    pub fn with_hasher(s: S, data: Vec<K>) -> Self {
        SliceMap {
            key_data: data,
            hash_table: HashTable::new(),
            hasher: s,
        }
    }
    pub fn get(&self, k: &[K]) -> Option<&V> {
        let hash = self.hasher.hash_one(k);
        Some(&self.hash_table.find(hash, eq(k, &self.key_data))?.1)
    }

    pub fn entry<'a>(&'a mut self, range: Range<usize>) -> SliceMapEntry<'a, V> {
        let k = &self.key_data[range.clone()];
        let hash = self.hasher.hash_one(k);
        match self.hash_table.entry(
            hash,
            eq(k, &self.key_data),
            hash_fn(&self.key_data, &self.hasher),
        ) {
            Entry::Occupied(occ) => SliceMapEntry::Occupied(&mut occ.into_mut().1),
            Entry::Vacant(vac) => SliceMapEntry::Vacant(SliceMapVacantEntry { range, data: vac }),
        }
    }

    pub fn iter(&self) -> impl Iterator<Item = (&[K], &V)> {
        self.hash_table
            .iter()
            .map(|(range, v)| (&self.key_data[range.clone()], v))
    }
}

#[test]
fn test() {
    let data = "hello world".to_string();
    let mut map = SliceMap::with_hasher(RandomState::default(), data.into_bytes());
    match map.entry(0..5) {
        SliceMapEntry::Occupied(_) => unreachable!(),
        SliceMapEntry::Vacant(v) => v.insert(0),
    };
    match map.entry(6..11) {
        SliceMapEntry::Occupied(_) => unreachable!(),
        SliceMapEntry::Vacant(v) => v.insert(1),
    };
    match map.entry(0..5) {
        SliceMapEntry::Occupied(o) => {
            assert_eq!(*o, 0);
            *o = 10;
        }
        SliceMapEntry::Vacant(_) => unreachable!(),
    };
    assert_eq!(*map.get("hello".as_bytes()).unwrap(), 10);
    assert_eq!(*map.get("world".as_bytes()).unwrap(), 1);
    let map2: HashMap<String, u32> = map
        .iter()
        .map(|(k, v)| (String::from_utf8_lossy(k).into_owned(), *v))
        .collect();
    assert_eq!(
        map2,
        HashMap::from_iter([("hello".to_string(), 10), ("world".to_string(), 1)])
    );
}
1 Like