Data structure for indexing values on a given field

Let say I have a simple struct which contains only owned and/or copiable values:

#[derive(Debug)]
struct Person {
    name: String,
    id: u32,
   // Other fields...
}

I want to index many of such elements using one of its (hashable) field, for instance with the name field. For that purpose, I'm currently using a HashMap<String, Person>, which works fine:

let mut index = HashMap::new();
let person = Person { name: "Alice".into(), id: 42 };
index.insert(person.name.clone(), person);

However, this design incurs an unnecessary allocation since the person name is a duplicate of the index key. I cannot use a HashMap<&str, String> to avoid the allocation since it would not work with borrowing rules. I think I could make it work using Rc and RecCell but it's still suboptimal.

I tried another design which is this one:

use std::{
    collections::{HashMap, hash_map::DefaultHasher},
    hash::{Hash, Hasher}, 
    mem::replace,
};

struct Index {
    values: Vec<Person>,
    index: HashMap<u64, usize>,
}

fn hash<T: Hash>(value: T) -> u64 { 
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
 }

impl Index {
    fn insert(&mut self, value: Person) -> Option<Person> {
        let h = hash(&value.name);

        if let Some(idx) = self.index.get(&h) {
            let previous = replace(&mut self.values[*idx], value);
            return Some(previous)
        }

        let idx = self.values.len();
        self.values.push(value);
        self.index.insert(h, idx);

        None
    }

    fn get(&self, key: &str) -> Option<&Person> {
        let h = hash(key);

        if let Some(idx) = self.index.get(&h) {
            return Some(&self.values[*idx])
        }

        None
    }
}

impl Index {
    fn new() -> Self {
        Index { values: vec![], index: HashMap::new() }
    }
}

This data structure internally stores a hash map mapping name's hashes to their index in a list storing the person instances.

I identified several issues with this design:

  • It requires two hashing, one for the name and another for the name's hash,
  • Removing items is not supported since it would invalidate the indices,
  • I do not fully understand the guarantees of hashing so I do not know if this design has flaws.

Is this design valid and can it be improved? Do robust implementation which I could reuse exist?

You haven't specified criteria which you want to use to judge your design.

If you worry about cost of allocations then something like compact_str may be a great solution.

Or if your strings are, usually, long, then Arc wouldn't be so expensive.

Or maybe bumpalo maybe a way to go if you want to use that as temporary data structure.

There are no such thing as “optimal” or “suboptimal”.

Only “optimal for a particular purpose” or “suboptimal for a particular purpose”.

You always have to remember that in today's world 1 (one) mispedicted memory access is more-or-less equal to 1000 (thousand) bytes processed.

This difference dwarfs big O considerations till we are starting to talk about big (really big) numbers.

It seems to me that you are basically reinventing a hash table, poorly and for no good reason, given that there is already a hash table in std.

There aren't many "guarantees" when it comes to hashing. In particular, your design doesn't seem to handle collisions in any way, meaning that you can get false positives (reporting an item to be in the collection when it isn't, because it hashes to the same number as some other item).

With regards to your original solution:

To be crystal clear, indexing does in general require duplication of the index. Databases work like that, for one. If you absolutely want to avoid that, you could just not store the name in the structure, and instead always pass around the name and the rest of the Person when you need both. But that's likely more cumbersome than warranted.

If you know your strings are going to be short most of the time, you may want to perform short-string optimization instead, making most of the strings cloneable without an additional allocation.

2 Likes

You could use a HashSet with custom Hash+PartialEq+Eq+Borrow implementation…

use std::borrow::Borrow;
use std::hash::Hash;
use std::hash::Hasher;

use std::collections::HashSet;

#[derive(Debug)]
struct Person {
    name: String,
    id: u32,
    // Other fields...
}

struct PersonInIndex(Person);

impl Hash for PersonInIndex {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.name.hash(state)
    }
}

impl PartialEq for PersonInIndex {
    fn eq(&self, other: &Self) -> bool {
        self.0.name == other.0.name
    }
}

impl Eq for PersonInIndex {}

impl Borrow<str> for PersonInIndex {
    fn borrow(&self) -> &str {
        &self.0.name
    }
}

#[derive(Default)]
struct Index(HashSet<PersonInIndex>);

impl Index {
    fn new() -> Self {
        Self::default()
    }

    fn insert(&mut self, value: Person) -> Option<Person> {
        self.0.replace(PersonInIndex(value)).map(|p| p.0)
    }

    fn get(&self, key: &str) -> Option<&Person> {
        self.0.get(key).map(|p| &p.0)
    }
}

fn main() {
    let mut index = Index::new();
    dbg!(index.insert(Person {
        name: "foo".to_owned(),
        id: 42,
    }));
    dbg!(index.insert(Person {
        name: "bar".to_owned(),
        id: 1337,
    }));
    dbg!(index.insert(Person {
        name: "foo".to_owned(),
        id: 100,
    }));
    dbg!(index.get("foo"));
    dbg!(index.get("bar"));
    dbg!(index.get("baz"));
}

(Playground)

Output:

   Compiling playground v0.0.1 (/playground)
warning: field `id` is never read
  --> src/main.rs:10:5
   |
8  | struct Person {
   |        ------ field in this struct
9  |     name: String,
10 |     id: u32,
   |     ^^
   |
   = note: `#[warn(dead_code)]` on by default
   = note: `Person` has a derived impl for the trait `Debug`, but this is intentionally ignored during dead code analysis

warning: `playground` (bin "playground") generated 1 warning
    Finished dev [unoptimized + debuginfo] target(s) in 1.33s
     Running `target/debug/playground`
[src/main.rs:55] index.insert(Person { name: "foo".to_owned(), id: 42 }) = None
[src/main.rs:59] index.insert(Person { name: "bar".to_owned(), id: 1337 }) = None
[src/main.rs:63] index.insert(Person { name: "foo".to_owned(), id: 100 }) = Some(
    Person {
        name: "foo",
        id: 42,
    },
)
[src/main.rs:67] index.get("foo") = Some(
    Person {
        name: "foo",
        id: 100,
    },
)
[src/main.rs:68] index.get("bar") = Some(
    Person {
        name: "bar",
        id: 1337,
    },
)
[src/main.rs:69] index.get("baz") = None

2 Likes

hashmap is what I would go with also.

If optimizing the memory consumed, how about removing name from the body of your struct, use it in the key position of your hashmap?

The docs would capture “keyed using the name field”.

This all said, I’ve found the redundancy quite helpful when actually using the data (data structure + algorithm = source of best design).

Thanks for the answer. The database point of view was particularly helpful for me to better understand the issue with my design. It look like that index duplication is likely unavoidable as you said, and index keys should be kept in sync with the values they index.