Question: does a `match` with bytes (or strings) build an automata? Should it?

Say I have

match input {
   "and" | "av2" | "lan" => 1,
   "svr" | "mip" | "mon" => 2,
   ...
}

then IIUC it would be (edit: in general) a lot faster for the compiler to build a state machine to walk through the input text than to compare the text to every possible literal. Does the compiler do this? And if it doesn't, should it?

EDIT or is there some even better way to optimize it?

You can see for yourself how the code is compiled:

For your example what it ends up doing is:

  1. check whether the length is 3
  2. compare the first 3-byte sequence by doing a check: (first two bytes ^ b"an") | (next byte) ^ b'd') == 0
  3. compare the second 3-byte sequence the same way, etc

It doesn't seem like it's doing a perfect job, I think it could be optimized better, but not necessarily by building a "state machine". For instance, after checking the length is 3, it could load the 3 bytes into a 4 byte register and then just search for a 4 byte value in a simpler way rather than re-loading it in 2+1 bytes each time.

The best way to optimize this code really depends on what the strings are.

It also depends somewhat on the input distribution. Is the first case more likely than the last cases? Etc.

2 Likes

It doesn't.

AFAIK rustc spits out very naive inefficient code in this case. LLVM might clean it up a bit in trivial cases, but it doesn't work for long lists. Sometimes it even ends up making calls to bcmp for every string.

If you want any perf guarantee, you will have to DIY:

1 Like

Is there a better default choice than 'just compare to each string in a list'?

EDIT Some more detail

@tczajka thanks for the Godbolt, but I fear that my example is too simple for it. To be realistic you'd need a lot more entries. I've definitely seen this kind of thing in the wild - for example looking at file extensions.

@kornel yep phf is great, thanks for it! But you have to know about things like hashing to find it. I'm thinking about people maybe coming to Rust for the first time and evaluating it, perhaps with a focus on perf.

EDIT 2 oh and thanks for writing lib.rs - it's soooooo fast!

1 Like

The phf crate is one option.

If you have quite a few words here, you can use very efficient finite state transducers (that don't require hashing anything) with the fst crate's Map type, which is essentially a map from anything that implements AsRef<[u8]>, like strings, to u64s.

You can see an introduction to how it works internally in "Index 1,600,000,000 Keys with Automata and Rust".

To generate the map and store it in a file:

use fst::map::Map;
use std::fs::File;
use std::io::Write;

fn main() {
    let mut opcode_values = vec![
        ("and", 1),
        ("av2", 1),
        ("lan", 1),
        ("svr", 2),
        ("mip", 2),
        ("mon", 2),
        // ⋮
    ];
    
    opcode_values.sort_unstable();
    opcode_values.dedup();
    
    File::create("opcode_values.bin")
        .unwrap()
        .write_all(
            Map::from_iter(opcode_values)
                .as_fst()
                .as_bytes()
        )
        .unwrap();
}

Then, you can load the map's underlying bytes at compile time, though it will take a small amount of time to do some minor checks that it's the right byte format at runtime since the new method isn't const fn.

use fst::map::Map;

fn get_opcode_values() -> Map<&[u8]> {
    Map::new(include_bytes!("opcode_values.bin")).unwrap()
}

fn whatever() {
    let opcode_values = get_opcode_values();
    
    // ⋮
    
    assert_eq!(map.get("mip"), 2);
}
1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.