Best practice for storing large data structures known at compile time

I'm very new to compiled languages and am trying to figure out the best way to store a "big" static data structure (a HashMap with fewer than 10,000 keys) in a binary. I'm sure there's an established way to do this, and I'd love a pointer to somewhere I can read more about it. I just don't think I'm using the right search terms to find the info I need.

The application is a CLI tool that will need to store a dictionary for matching against input strings - put text in from a file or stdin, get text out based on lookups from the dictionary. The dictionary itself currently exists as a large JSON file where each key is one character.

What I've done thus far, which works but is probably wildly inefficient (and definitely slows down RLS), is to convert the JSON text into a hashmap! macro invocation (the macro comes from the maplit crate) that lives in a .rs file. That file exports a function wrapping the macro, which I call once in the app:

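    use std::collections::HashMap;

    use maplit::hashmap;

    // Builds the whole dictionary at runtime, every time it is called.
    pub fn default_entries() -> HashMap<char, Entry> {
        hashmap! {
            'a' => Entry { ... },
            'b' => Entry { ... },
            'c' => Entry { ... },
            ...
        }
    }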

What I'd like to be able to do is take the JSON and generate some kind of pre-compiled, immutable representation of the data structure that can be bundled with the binary.


I think the phf crate does what you want; see the README in the rust-phf/rust-phf repository on GitHub.

It generates compile-time hash maps. Specifically, I think you could use the phf_codegen crate and feed it data read from your JSON file with serde_json in build.rs.


This looks awesome, thank you! Can I import the structs I've defined in lib.rs (like Entry) into my build script so that serde knows about them? The file structure is:

    src/
    ┣ lib.rs
    ┗ main.rs
    build.rs
    dict.json
    Cargo.toml

I don't think so, unfortunately.

The build.rs script runs before any of your other code is compiled, so lib.rs and other library files can't be used until build.rs has finished running.

For the code generation, this won't be a problem: the phf_codegen docs have an example that constructs types defined in lib.rs without importing them into build.rs, as you've probably seen, because the generated code only refers to them by name. I can see it being problematic for gathering the data, though.

The two alternatives I can think of are to either put your types into a sub-crate so that both build.rs and lib.rs can depend on it (it'd be in both [build-dependencies] and [dependencies] in Cargo.toml), or if it's minimal enough just copy the code.
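For the sub-crate route, the manifest of the main crate might look roughly like this. The crate name `dict-types` and its location are made up for illustration; the point is only that the same path dependency appears in both tables.

```toml
# Cargo.toml of the main crate.
# `dict-types` is a hypothetical sub-crate holding Entry and friends.

[dependencies]
dict-types = { path = "dict-types" }

[build-dependencies]
dict-types = { path = "dict-types" }
```

The sub-crate would live in a `dict-types/` directory next to `src/`, with its own Cargo.toml and `src/lib.rs`.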

In either case you'll still need a bit of trickery, since to feed the data to phf_codegen you have to write it out as Rust code that constructs the structures. That should be possible with some custom Display implementations on the versions of the types that build.rs uses, or just by calling format!() with each entry's data.
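The Display approach could look something like the sketch below. The fields of Entry aren't shown in this thread, so `reading` and `weight` are invented placeholders, and `render_entries` is a hypothetical helper. It also assumes the lib.rs version of Entry stores a `&'static str` where the build.rs version stores a `String`. Each rendered string is the Rust expression you would pass to phf_codegen as that key's value.

```rust
use std::fmt;

// Build-script copy of Entry. The field names are placeholders, not the
// real ones from the thread.
struct Entry {
    reading: String,
    weight: u32,
}

// Display here emits Rust source code, not human-readable text.
impl fmt::Display for Entry {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // {:?} on a String produces a quoted, escaped string literal,
        // which is valid Rust source for a &'static str field.
        write!(
            f,
            "Entry {{ reading: {:?}, weight: {} }}",
            self.reading, self.weight
        )
    }
}

// Pairs each key with the source text of the expression that builds its value.
fn render_entries(entries: &[(char, Entry)]) -> Vec<(char, String)> {
    entries
        .iter()
        .map(|(key, entry)| (*key, entry.to_string()))
        .collect()
}

fn main() {
    let entries = vec![
        ('a', Entry { reading: "ei".to_string(), weight: 3 }),
        ('b', Entry { reading: "bee".to_string(), weight: 1 }),
    ];

    // In a real build.rs these strings would go to the code generator
    // instead of stdout.
    for (key, source) in render_entries(&entries) {
        println!("{:?} => {}", key, source);
    }
}
```

Using format!() directly at the call site works just as well; the Display impl only keeps the formatting next to the type.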


I've got it more or less working using format!() currently. Entry isn't that complicated, so I think I can make do with just duplicating its declaration for now.

Thanks for your help!

No problem!

One more idea I just thought of, which might or might not be useful: you could use include!() or #[path = "..."] mod xxx; to avoid duplicating source. If you put the structures in src/data_structs.rs, you could write mod data_structs { include!("src/data_structs.rs"); } in build.rs as a hack that essentially copies the code without keeping two copies in source control. Writing #[path = "src/data_structs.rs"] mod data_structs; in build.rs should work in much the same way.
