Efficiently embedding large static JSON data

Suppose I have a large JSON file with a fairly simple schema that I could easily write down by hand in #[derive(Deserialize)] form.

The file is static and I want to embed it in my app. What's the best way to do this? I would like to:

  1. Avoid having to parse it at runtime.
  2. Avoid large compile times.
  3. Have it in a proper typed form, i.e. hand-written structs; not something like serde_json::Value.

I don't mind if I need to manually convert the JSON file to some other format once.

I guess the obvious answer is a JSON to Rust source code converter, but I'd worry that will massively bloat the compile times, and maybe binary size.

Has anyone solved this problem?

This is the specific file if it makes any difference.

Personally I would use something like:

use std::sync::LazyLock;
static DATA: LazyLock<Data> = LazyLock::new(|| serde_json::from_str(include_str!("path")).unwrap());

Maybe one day we'll get nice comptime evaluation and I'd replace that then. Until then I'm happy to pay the price of runtime parsing.
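For reference, here is a dependency-free sketch of the same LazyLock pattern. The embedded string RAW, the Entry struct and the line-based parse function are hypothetical stand-ins for include_str!("path"), the real Data type and serde_json::from_str, so that the snippet compiles with std alone (Rust 1.80 or later).

```rust
use std::sync::LazyLock;

// Hypothetical stand-in for the embedded file; a real project would use
// `include_str!("path")` here.
const RAW: &str = "add 3\naddi 3\nlui 2\n";

#[derive(Debug, PartialEq)]
pub struct Entry {
    pub mnemonic: String,
    pub operand_count: usize,
}

// Trivial parser standing in for `serde_json::from_str`.
fn parse(raw: &str) -> Vec<Entry> {
    raw.lines()
        .filter_map(|line| {
            let (mnemonic, count) = line.split_once(' ')?;
            Some(Entry {
                mnemonic: mnemonic.to_string(),
                operand_count: count.parse().ok()?,
            })
        })
        .collect()
}

// Parsed once, on first access, then shared for the rest of the program.
pub static DATA: LazyLock<Vec<Entry>> = LazyLock::new(|| parse(RAW));
```

The parsing cost is paid on the first access to DATA (for example DATA.len()); later accesses only check that initialization has already happened.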

It's not too complex to generate the struct(s) and a const representing your JSON with a build script, but it's a lot of boilerplate and I couldn't find a crate that already does it. You can almost use the Debug output as Rust source, except that arrays aren't represented correctly: each one needs to be prefixed with vec! (to make a Vec) or & (to make a slice). If this file never changes, just deserialize it once, print it out with Debug, search-and-replace the arrays, and paste the result into your code. You can generate the structs with serde derives using https://transform.tools/json-to-rust-serde (don't give it your entire file, just enough to infer the types).
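A small sketch of the Debug trick, using hypothetical Instruction and Field structs rather than the real schema. String-like fields are declared as &'static str here because Debug prints a string as a plain literal. The to_rust_source helper is an illustration only: its text replacement is naive and would also rewrite a [ that appears inside a string value.

```rust
#[derive(Debug)]
struct Field {
    field: &'static str,
    size: usize,
}

#[derive(Debug)]
struct Instruction {
    mnemonic: &'static str,
    fields: Vec<Field>,
}

// Render a value with `Debug`, then prefix every `[` with `&` so that each
// array in the output becomes a slice expression. Naive: this also touches
// any `[` inside string values.
fn to_rust_source(ins: &Instruction) -> String {
    format!("{:?}", ins).replace('[', "&[")
}
```

The result can only be pasted into a const or static if the pasted-into struct declares the field as &'static [Field] instead of Vec<Field>.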

Yeah but won't that generate an enormous file that takes ages to compile?

I checked that JSON and it's like 3 MB with some 1200 elements. It's probably going to be fine.

That is what I would do. As a first attempt anyway. Surely that does not take so long to compile.

But why the need for static data and lazy lock? Isn't that an unstable feature anyway? Just deserialise it into a data variable where you need it.

It's been stabilized in 1.80 actually :)

I assumed OP wanted that configuration data globally. (I also assumed it was a set of compile time configuration)

You could consider changing to a different format for your file instead of json. In increasing order of potential complexity and efficiency: messagepack, bincode, rkyv.

I just wrote this Python script to generate a module that has all the data hard-coded as Rust structs and string/number literals:

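import json
import sys
import textwrap
from urllib.request import urlopen

outpath = sys.argv[1]
insns = json.load(urlopen('https://raw.githubusercontent.com/ThinkOpenly/RISC-V_ISA/main/src/ISA.json'))['instructions']

def raw(s):
    # Render a Python string as a Rust raw string literal.
    # Assumes the value never contains the sequence `"#`.
    return 'r#"' + s + '"#'

def render_instruction(ins):
    # Every string-valued key becomes a `&'static str` field of the same name.
    lines = [f'        {key}: {raw(value)},' for key, value in ins.items() if isinstance(value, str)]

    # Built without backslashes inside f-string expressions, which are a
    # syntax error before Python 3.12.
    extensions = '        extensions: &[' + ', '.join(raw(ext) for ext in ins['extensions']) + '],'
    operands = '        operands: &[\n' + ''.join(
        f'            Operand {{ name: {raw(op["name"])}, type_: {raw(op["type"])} }},\n' for op in ins['operands']
    ) + '        ],'
    fields = '        fields: &[\n' + ''.join(
        f'            Field {{ field: {raw(f["field"])}, size: {f["size"]} }},\n' for f in ins['fields']
    ) + '        ],'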

    lines.extend([
        extensions,
        operands,
        fields,
    ])

    return '    Instruction {\n' + '\n'.join(lines) + '\n    },\n'

with open(outpath, 'wt') as outstream:
    outstream.write(textwrap.dedent('''
    #[derive(Clone, Copy, Debug)]
    pub struct Instruction {
        pub mnemonic: &'static str,
        pub name: &'static str,
        pub operands: &'static [Operand],
        pub syntax: &'static str,
        pub format: &'static str,
        pub fields: &'static [Field],
        pub extensions: &'static [&'static str],
        pub function: &'static str,
        pub description: &'static str,
    }

    #[derive(Clone, Copy, Debug)]
    pub struct Operand {
        pub name: &'static str,
        pub type_: &'static str,
    }

    #[derive(Clone, Copy, Debug)]
    pub struct Field {
        pub field: &'static str,
        pub size: usize,
    }

    '''))

    outstream.write('static INSTRUCTIONS: &[Instruction] = &[\n')

    for ins in insns:
        outstream.write(render_instruction(ins))

    outstream.write('];\n')
    outstream.write(textwrap.dedent('''
    fn main() {
        dbg!(&INSTRUCTIONS[0..3]);
    }
    '''))

It generates a Rust source file at the path specified by the only command-line argument. On my machine, the Rust file compiled in less than half a second with optimizations fully enabled:

$ time rustc -C opt-level=3 ~/Downloads/instructions.rs -o ~/Downloads/instructions
> 0.38s user 0.06s system 110% cpu 0.398 total
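To give an idea of the shape of the generated module, here is a trimmed sketch with a single entry. The field values are illustrative placeholders, not copied from ISA.json, the Instruction struct is shortened to four fields, and the find helper is a hypothetical addition that the script does not generate.

```rust
#[derive(Clone, Copy, Debug)]
pub struct Operand {
    pub name: &'static str,
    pub type_: &'static str,
}

#[derive(Clone, Copy, Debug)]
pub struct Field {
    pub field: &'static str,
    pub size: usize,
}

// Shortened version of the generated struct.
#[derive(Clone, Copy, Debug)]
pub struct Instruction {
    pub mnemonic: &'static str,
    pub operands: &'static [Operand],
    pub fields: &'static [Field],
    pub extensions: &'static [&'static str],
}

// Placeholder data in the same literal style the script emits.
pub static INSTRUCTIONS: &[Instruction] = &[
    Instruction {
        mnemonic: r#"add"#,
        operands: &[
            Operand { name: r#"rd"#, type_: r#"register"# },
        ],
        fields: &[
            Field { field: r#"opcode"#, size: 7 },
        ],
        extensions: &[r#"I"#],
    },
];

// Hypothetical lookup helper: a linear scan over the static table.
pub fn find(mnemonic: &str) -> Option<&'static Instruction> {
    INSTRUCTIONS.iter().find(|ins| ins.mnemonic == mnemonic)
}
```

Everything here is plain static data made of literals and references, so no parsing or allocation happens at runtime.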

The whole 3 MB source file is 80 kB gzipped: here it is.

Damn it. On my MacBook M1 I have been running rustup update and having it tell me 1.80 was installed, but I was actually running rustc and cargo from 1.79.

Turns out I had a homebrew installation of Rust here as well. I have no idea why, no recollection of using homebrew to install Rust.

Thanks for the heads up on that.

"less than half a second"

That's still a fair bit IMO, however... I can't complain about the simplicity, and you having done all the work for me! Thanks very much!

How so? I'm quite frankly surprised by that judgement. It's not rare for a medium-sized Rust project to take several dozens of seconds to build. Half a second is nothing compared to that.

Also do note that this is the time for building the entire executable, some of which is the linking part — and that doesn't take a whole lot for a single static, so most of it is unrelated, and likely constant, overhead.

"How so? I'm quite frankly surprised by that judgement. It's not rare for a medium-sized Rust project to take several dozens of seconds to build. Half a second is nothing compared to that."

I agree, but I don't think that's desirable. I would like my project to compile relatively quickly and while 0.4s isn't much on its own, if I have that attitude everywhere it will quickly add up.

"Also do note that this is the time for building the entire executable, some of which is the linking part"

Ah right, not so bad then.

But you don't have that everywhere. Or are all of your statics 3 MB in size?

The rkyv approach will probably be faster to compile, and equally efficient, I think.

Not equally, since values in a const can be inlined, while rkyv cannot.

And if this is the only thing that needs rkyv, the struct is definitely faster to compile than all of rkyv.
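One way to see what the const form allows: the compiler can read the data during compilation. The sketch below uses a hypothetical FIELDS table with illustrative values and a hypothetical total_bits function. It sums the field sizes in a const fn and checks the result with a compile-time assertion, which is not possible with bytes that are only interpreted at runtime.

```rust
pub struct Field {
    pub field: &'static str,
    pub size: usize,
}

// Illustrative values, not taken from the real data set.
pub const FIELDS: &[Field] = &[
    Field { field: "opcode", size: 7 },
    Field { field: "rd", size: 5 },
    Field { field: "funct3", size: 3 },
    Field { field: "rs1", size: 5 },
    Field { field: "rs2", size: 5 },
    Field { field: "funct7", size: 7 },
];

pub const fn total_bits(fields: &[Field]) -> usize {
    let mut sum = 0;
    let mut i = 0;
    // `for` loops are not allowed in `const fn`, so iterate by index.
    while i < fields.len() {
        sum += fields[i].size;
        i += 1;
    }
    sum
}

// Evaluated by the compiler; the binary only contains the resulting number.
pub const TOTAL_BITS: usize = total_bits(FIELDS);

// Compilation fails if the table does not describe a 32-bit encoding.
const _: () = assert!(TOTAL_BITS == 32);
```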
