Json-rust 0.9.0 released, cuts parsing time by half

...by reducing the amount of heap allocations to bare minimum.

Notes here: Release 0.9.0 · maciejhirsz/json-rust · GitHub

Github has most of the information on what changed, but I thought I'd share some implementation details here.

The enums in Rust have a peculiar behavior in that they not only take the size of the largest variant inside the enum + 1 byte for the tag, but they also round up the size to the nearest pointer size value, adding padding. From what I've looked up, this makes it easier to copy / store the data around. If someone has a good explanation of this, I'm quite curious.

Storing something like a Vec or String inside the enum will make the enum's size equal to 4 * pointer size: 1 for the pointer to the heap, 1 for length, 1 for capacity and 1 for the tag + padding. On 64 bit architecture, that makes the entire union take 32 bytes of space.

Turns out you can do quite a lot with 32 bytes of space, especially if your main concern is storing a lot of small strings, which JSON does quite often. Since the enum size is rounded, but the tag takes only one byte, that means you can add a variant that takes exactly 31 bytes without increasing the size of the whole union. Which is exactly what I did - the new Short type stores a buffer of 30 bytes and an extra byte for length, thus short strings can be represented with no heap allocation, and the whole union is neither bigger nor slower because of it.

I've used a similar mechanism for Object keys, those are always heap allocated by the virtue of the Object itself being heap allocated. However, if the key is short, it can be stored directly on the node which would otherwise contain only a pointer to another location in the heap (+ length + capacity). There are some interesting internals happening there to remove the cost of figuring out whether the key is on the node or somewhere else on the heap, but I won't be going into the details of that here, the source code for that is pretty well commented.

It's been quite fun figuring this stuff out.

13 Likes

It's not a behavior unique to enums. Primitive data types have alignment requirements - 32 bit integers must be 4-byte aligned on x86 for example (i.e. their address modulo 4 must be 0). Compilers will add padding to data types to ensure that alignment requirements will be met. For example,

struct Foo {
    a: u8,
    b: u32,
}

is not represented as |a|__b_|, but rather as |a|pad|__b_| in memory.

3 Likes