String as u32 // for fast comparison

EDIT: The only comparison I need is equality comparison.

  1. I'm writing a pattern matching program that involves lots and lots of string comparisons. We are guaranteed there are at most u32 unique values.

  2. I would like some way of representing Strings as u32's, so comparisons can be fast.

  3. The 'obvious' solution seems to be a Map<String. u32> stored behind some mutex/lock, and every time we create a String, we check vs the map and either get u32 or insert new value and get u32.

  4. This seems like a problem that others must have run into in the past. Is there an idiomatic solution to this?

In general, this sounds like string interning, if that helps you search for solutions.

You might like IndexSet for this, where the numeric value of a String is just its index. Use insert_full to "convert" any string to an index in the set.

3 Likes

If you know all possible strings at compile time, then phf is the best.

I've also used string-interner for dynamic sets. I don't know if that crate is the best/fastest, but it does work.

5 Likes

@Caine: Welcome to the community! In this particular question, we're not re-encoding the string from utf8 to utf32. We are mapping the string to a single u32 by building a map of String -> u32 elsewhere.

If you can tolerate collisions*, you can use 32bit variant of FNV-1a hashing. Rust code would be:

fn hash(value: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c9dc5;
    for byte in value {
        hash ^= *byte as u32;
        hash = hash.wrapping_mul(0x1000193); 
    }
    hash
}

* - if most of your matches are misses, you can compare hashes first and then, if they are matching, compare the values.

64 bit version of FNV would be more robust (much, much less likely to get a collision) and matching would be just as fast if you are on a 64bit architecture.

It's hard to tell what it exactly you are trying to do, but if you are trying to find if a string is present in a set, you can also try using bloom filters to speed things up considerably.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.