Why do include_str and include_bytes have such different effects on code size?

I have a crate that has quite a lot (~80 MB) of JSON data, which is included in the crate.

Previously, the crate used include_str! to include the various data (it is then parsed at run time). The overall crate was quite large; on my machine the release build rlib was 150 MB and the rmeta file was 69 MB. While investigating this, I noticed that the rmeta file, which nominally contains "the metadata" for the crate, also seems to include the entire contents of everything passed to include_str!!

On a hunch, I changed the code to use include_bytes! instead. This reduced the rlib to 88 MB, and the rmeta file shrank to just 7 MB: over 100 MB in savings overall.
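
For readers who want to try the same switch: include_str! yields a `&'static str`, while include_bytes! yields a `&'static [u8; N]`, so code that needs text has to validate the bytes as UTF-8 at run time. A minimal sketch of that conversion follows. The static name, the function name, and the `data.json` file name are hypothetical, and a byte-string literal stands in for the macro so the snippet compiles without the file.

```rust
use std::str;

// Stand-in for `include_bytes!("data.json")` (hypothetical file name).
// A literal is used here only so the sketch compiles on its own.
static DATA_BYTES: &[u8] = br#"{"numbers": [1, 2, 3]}"#;

/// Returns the embedded data as text, validating it as UTF-8 at run time.
/// With include_str! this check would instead happen at compile time.
pub fn data_as_str() -> &'static str {
    str::from_utf8(DATA_BYTES).expect("embedded data should be valid UTF-8")
}
```

The returned `&'static str` can then be handed to whatever JSON parser the crate already uses.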

I cannot find any documentation that suggests that include_str! and include_bytes! are different in any interesting way, and yet this single change cut the size of the crate binary in half.

Can anyone point to anything (docs, forum post, compiler internals) that explains the difference?


I don't know what the reason for the difference you observe is, but if you have 80 MB of JSON, you should really compress it — or use a more efficient serialization format (serde-transcode can help prepare that), or both — at which point you won't be using include_str! anyway.


I tried to reproduce this and found that both resulted in an rmeta about the size of the included file, and an rlib about twice that size.

// pub static NUMBERS: &[u8] = include_bytes!("numbers.txt");
pub static NUMBERS: &str = include_str!("numbers.txt");

If I assign it with let instead of static or const, then both the rlib and rmeta drop down to normal sizes. My guess is that there's some missed optimization with the &str version, and that probably has more to do with the JSON parser than the include macros.
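
To make the two variants concrete, here is a rough sketch of the difference being described. A short literal stands in for include_str!("numbers.txt") so that it compiles without the file, and the function name is hypothetical. It only shows the shape of the code; it does not by itself demonstrate the size difference.

```rust
// Variant 1: the data is a public item of the crate.
// In the experiment above this would be `include_str!("numbers.txt")`.
pub static NUMBERS: &str = "1 2 3 4 5";

// Variant 2: the data is bound with `let` inside a function body,
// so it is not an item of its own.
pub fn sum_numbers_local() -> u32 {
    // Again a stand-in for `include_str!("numbers.txt")`.
    let numbers: &str = "1 2 3 4 5";
    numbers
        .split_whitespace()
        .filter_map(|n| n.parse::<u32>().ok())
        .sum()
}
```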

I also agree you should re-encode this into some other format.


Thanks, that appears to be it. Either a missed optimization, or over-optimization. If I mark the function that uses include_str! with #[inline(never)], I get the same code size as I see with include_bytes!.
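
Roughly, the change looks like the sketch below. The function names are hypothetical and a literal stands in for the include_str! call so the snippet compiles without the data file. The effect on size is what I measured in my crate, not something this snippet demonstrates.

```rust
/// The only function that touches the embedded data.
/// `#[inline(never)]` asks the compiler not to inline this body into
/// callers, so the data is referenced from this one function.
#[inline(never)]
pub fn raw_json() -> &'static str {
    // Stand-in for `include_str!("data.json")` (hypothetical file name).
    r#"{"numbers": [1, 2, 3]}"#
}

/// Example caller: everything else goes through `raw_json()`.
pub fn count_commas() -> usize {
    raw_json().matches(',').count()
}
```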


Perhaps this Reddit thread could be related. See in particular this comment by Saefroch. Quote:

We turn the entire blob of all data for all consts in the crate into an inline assembly string literal. And since this is an inline assembly string literal, it needs to be escaped into ASCII.


That would explain the opposite, that bytes was larger, but that doesn't seem to be what the OP is seeing...

Honestly, I'm pretty grossed out by the expansion of include_*, and would have expected them to be compiler builtins (or more so, I guess), but they seem to work fine in practice.


Unfortunately, it's never that simple. Since we go through LLVM, the constant/static data needs to be in a format LLVM's pipeline can work with, and currently the best way to do so is as a data string literal.

It's also significantly simpler to treat included data the same as other source data in the compiler's middle end and backend, rather than make it some sort of symbolic reference. Especially so for const, since that data can get further manipulated at compile time.

