Blog post: Making Slow Rust Code Fast

I wrote a blog post on performance benchmarking and tuning in Rust using Criterion.rs and flamegraphs, based on my recent experience optimizing the mongodb crate. Check it out if you're interested in learning some techniques for speeding up a Rust codebase!

18 Likes

Serde isn't really designed to be fast, but rather to handle all the bad things about JSON and to provide helpful descriptions of errors in human-written JSON. For BSON you don't need any of this. If you can get rid of Serde, you can probably serialize and deserialize several times faster, the implementation of the parser will be a lot simpler, compilation times will be noticeably shorter, and the compiled program will be significantly smaller.

From BSON you should be able to do zero-copy deserialization of strings and blobs, but you're not taking advantage of this: the DeserializeOwned bound makes it impossible.
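For illustration, a minimal sketch of the difference (type names invented here; whether a given BSON deserializer supports borrowed data is a separate question):

```rust
use serde::Deserialize;

// A lifetime-parameterized bound lets fields borrow from the input buffer:
#[derive(Deserialize)]
struct Borrowed<'a> {
    name: &'a str,  // zero-copy: points into the input bytes
    blob: &'a [u8], // zero-copy: points into the input bytes
}

// A DeserializeOwned bound rules this out; every field must own its data:
#[derive(Deserialize)]
struct Owned {
    name: String,  // copied out of the input
    blob: Vec<u8>, // copied out of the input
}
```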

Serde really is designed to be fast: it allows fully statically dispatched operations without runtime reflection, so formats and types are decoupled in code but transparent to the optimizer. It's serde_json that deals with all the weirdness of the JSON format. The core serde crate and its data model are fully agnostic about the exact format.

But it's true that serde tends to bloat compile times and the resulting binaries. That's a notable downside of static dispatch, and is generally considered a cost paid for runtime performance.

You can see its benchmarks here, with comparisons across various implementations including the C++ RapidJSON library. The simd-json mentioned there is another JSON parser implementation that also plugs into serde. You can see it has broadly similar performance to RapidJSON: faster on some benchmarks, slower on others. But RapidJSON requires a hand-written state machine, which rules it out as a casual choice.

8 Likes

That's not what I consider designing to be fast. That's what I consider baseline for any program written in Rust.

The entire data model is biased towards serde_json.

It's not because of static dispatch. It's because of bloat required by the data model.

1 Like

Maybe you have too high a standard for general-purpose serialization libraries, because serde is the only thing I know of, across all the widely used languages, that is fully statically dispatched, type- and format-agnostic, and ready for casual use. By casual use I mean you can use it without much effort, like writing state machines by hand. The usual approaches in this domain have been to use reflection, to expose raw maps and arrays, or to expose streaming parser internals like the RapidJSON I mentioned above. Serde achieves the pros of both by utilizing Rust's macro and trait system, at the cost of increased compile time. Please reply if you know of another implementation with similar or better properties, as I'm very interested in this topic.

The model doesn't fit the JSON data model 1:1. Usually getting a &str out of JSON is impractical, as the string may contain escape sequences. And you can see here how inefficiently serde_json handles &[u8]. But treating a byte sequence as a sequence of numbers is the only way to handle it consistently in JSON.
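To make the escape-sequence point concrete, a small self-contained example (serde_json can only hand out a borrowed &str when no unescaping is needed):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Doc<'a> {
    text: &'a str, // borrowed straight from the input
}

fn main() {
    // Borrowing works while the string needs no unescaping...
    let ok: Doc = serde_json::from_str(r#"{"text":"plain"}"#).unwrap();
    assert_eq!(ok.text, "plain");

    // ...but fails once an escape appears, because the unescaped string
    // can no longer point into the original buffer.
    let err = serde_json::from_str::<Doc>(r#"{"text":"line\nbreak"}"#);
    assert!(err.is_err());
}
```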

Can you elaborate on that? Generally, static dispatch generates O(N * M) code from the combination of N types and M formats, while dynamic dispatch generates O(N + M).
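A sketch of what I mean (the dynamic side is the approach the erased-serde crate takes, since serde's own traits are not object-safe):

```rust
use serde::{Serialize, Serializer};

// Static dispatch: the compiler emits a separate monomorphized copy of
// this function for every (type, format) pair it is used with, so N
// types serialized to M formats yield O(N * M) compiled code.
fn encode<T: Serialize, S: Serializer>(value: &T, ser: S) -> Result<S::Ok, S::Error> {
    value.serialize(ser)
}

// A dynamic-dispatch design would instead compile one body per type plus
// one per format and connect them through vtables at runtime: O(N + M)
// code, at the cost of indirect calls on the hot path.
```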

9 Likes

Alas, it is not! A general pattern in all of programming is to build up complex types from a limited set of primitives using type-level operations. This is not specific to JSON or serde_json; almost all non-format-agnostic serialization libraries end up eventually defining their own Value type.

I think you're confusing the internal data model of Serde with the Value enum of serde_json, which is not what we're talking about.

Just because you haven't seen something better than Serde doesn't mean that Serde doesn't have room for improvement.

Which is a good idea that can be implemented with much less overhead than Serde. Instead of making a strawman argument defending an idea I wasn't criticizing, you could look at my first post to see which parts of Serde I'm actually criticizing. The high compilation time is not because of this idea; it's purely because Serde contains a lot of the bloat characterized in my first post.

Now I've made an implementation with similar properties, just so I have something to point at.


When serializing, non-Serde is twice as fast.

When deserializing, non-Serde is 5.4 times as fast.

You can try it yourself:

```
git clone https://gitlab.com/Veverak/serde_benchmark.git
cd serde_benchmark
cargo bench
```

Someone could start bike-shedding about how non-Serde doesn't behave exactly like Serde; none of those differences would have any significant impact on performance if accommodated. Note that the serializer and deserializer implementations for the example struct, although hand-written, are identical to what a proc-macro designed for non-Serde would generate.

1 Like

Oh, even better now. I compared serialization to deserialization (as opposed to comparing Serde to non-Serde) and found it odd that serialization was slower. Then I put in a benchmark using Vec::with_capacity. BSON for Serde doesn't have an option for this, so I only did it for non-Serde.
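The effect being measured boils down to this (a distilled illustration, not the benchmark's actual code):

```rust
// Growing from an empty Vec reallocates repeatedly as the output grows.
fn encode_i64s(values: &[i64]) -> Vec<u8> {
    let mut buf = Vec::new();
    for v in values {
        buf.extend_from_slice(&v.to_le_bytes());
    }
    buf
}

// Reserving the (known) output size up front avoids every reallocation.
fn encode_i64s_prealloc(values: &[i64]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(values.len() * 8);
    for v in values {
        buf.extend_from_slice(&v.to_le_bytes());
    }
    buf
}
```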


Non-Serde is now serializing not twice as fast as Serde, but 6.4 times as fast.

I'm not confusing them. They are related. The serde data model absolutely does rely on a small number of core primitives and a small number of type-level operators (e.g. sequences and maps). This is not specific to JSON.

How so?

Please be more specific. In what situation and how does this occur?

Are there not other formats that require escaping?

That's exactly what I am trying to do, but ad hominem insults don't help with that.

1 Like

So I wrote a new serde impl from scratch! Here's the code.

It's pretty much as incomplete as your benchmark example. It only contains a serializer, because that's simpler. I have nearly zero prior knowledge of the BSON format, so I mostly copy-pasted your code. Performance seems on par; I wouldn't trust differences at this scale, as my MacBook Pro is running a dozen Chromium instances concurrently.

As an aside, I found that the serialization and deserialization benchmark groups are swapped in your repository. All three compare the performance of to_vec() despite being named deserialize.

8 Likes

Have a look at the newer commit in my repository. It fixes the swapped group labels and adds the with_capacity benchmark, which is way faster (see my previous post presenting the performance difference). You could do the same and have one benchmark with and one without with_capacity for your serializer implementation.

Criterion should ensure that the results are statistically significant even if you're running other tasks on the same computer.
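For reference, a minimal Criterion harness of the kind used here (the serializer is a stand-in, not the benchmark's real code):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Throwaway stand-in for the serializer under test.
fn to_vec(values: &[i64]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn bench_serialize(c: &mut Criterion) {
    let sample: Vec<i64> = (0..1_000).collect();
    // Criterion runs enough iterations to produce statistically
    // meaningful timings and flags outliers caused by system noise.
    c.bench_function("serialize", |b| b.iter(|| to_vec(black_box(&sample))));
}

criterion_group!(benches, bench_serialize);
criterion_main!(benches);
```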

1 Like

From an API perspective, it's really important that the MongoDB driver supports serde for maximum interoperability with the greater Rust ecosystem, even if there is a performance penalty for doing so compared to supporting only a custom serialization framework. And as other users have pointed out, serde is used successfully with many other serialization formats besides JSON, including BSON.

This is absolutely true, and actually my current project is to add support for this to the driver, so hopefully that should be released in the near future!


Regarding the performance of serde in general, the cases where the data model doesn't support a custom type for a given serialization format can incur overhead. For example, BSON and TOML both have a datetime type, but there is no equivalent type in the serde data model. As a result, they have to be represented as something like the equivalent of { <special magic key>: "<datetime as string>" }, which the BSON/TOML serializer can interpret and serialize accordingly. I think once specialization is stabilized, this drawback will mostly go away though, since serializers will be able to provide custom implementations for specific types. For types that are supported by the data model, there seems to be very little if any overhead, however.
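A hedged sketch of that "magic key" trick (marker name and layout invented here, not bson's actual internals):

```rust
use serde::ser::{Serialize, SerializeStruct, Serializer};

struct DateTime {
    millis: i64, // milliseconds since the Unix epoch
}

impl Serialize for DateTime {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        // Emit a one-field struct whose name no ordinary data would use.
        // A format-aware serializer recognizes the marker and writes its
        // native datetime type; any other serializer just sees a struct.
        let mut s = serializer.serialize_struct("$__datetime", 1)?;
        s.serialize_field("$__datetime", &self.millis)?;
        s.end()
    }
}
```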

Also, thank you all for highlighting some bottlenecks in the bson::to_vec implementation! After looking at a flamegraph of bson::to_vec, I noticed that a lot of time was being spent serializing the keys of a BSON array. In bson, this is implemented using u64's Display implementation, which involves a heap allocation and other expensive work. Once I updated that to use the loop from @Hyeonu's sample, I saw similar serialization times for all three. I've filed RUST-1062 on our issue tracker to ensure this improvement gets upstreamed into bson itself. Thanks again for helping discover this bottleneck!
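For the curious, the shape of such an allocation-free key loop looks roughly like this (an illustrative sketch, not bson's exact code):

```rust
// Write a BSON array index ("0", "1", ...) directly into the output
// buffer as a null-terminated cstring, with no intermediate String.
fn write_array_key(buf: &mut Vec<u8>, mut index: u64) {
    let start = buf.len();
    loop {
        buf.push(b'0' + (index % 10) as u8);
        index /= 10;
        if index == 0 {
            break;
        }
    }
    buf[start..].reverse(); // digits were pushed least-significant first
    buf.push(0); // BSON element keys are null-terminated
}
```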

After the array key optimization, serializing a struct with a String, an int64, and an array of int64:

Edit: running it with the exact example from @Frederik's benchmark shows that bson's implementation is still a bit slower, but it's a lot closer now:

7 Likes

It's very understandable that you want interoperability with Serde. I'd like to point out that you don't have to choose one or the other. It may be possible to make the API take a data-model-agnostic trait, allowing the consumer of the API to make their own choice of whether to use Serde or something else. Different choices could even be made for different parts of a program, allowing performance-critical features to be implemented one way while other features stay compatible with Serde. I'd rather see this, for the diversity and evolution of the greater Rust ecosystem, than see the ecosystem tied to the stale data model of Serde.
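As a sketch of what I mean (all names invented for illustration), the driver could accept any type implementing a minimal, framework-neutral trait, with serde as just one adapter:

```rust
// A hypothetical framework-neutral entry point for deserialization.
trait FromRawBson: Sized {
    type Error;
    fn from_raw_bson(bytes: &[u8]) -> Result<Self, Self::Error>;
}

// One possible adapter routes through serde...
struct ViaSerde<T>(pub T);

impl<T: serde::de::DeserializeOwned> FromRawBson for ViaSerde<T> {
    type Error = bson::de::Error;
    fn from_raw_bson(bytes: &[u8]) -> Result<Self, Self::Error> {
        bson::from_slice(bytes).map(ViaSerde)
    }
}

// ...while performance-critical types could implement the trait directly,
// parsing the raw bytes however they see fit.
```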

It's nice to see you took a hint from the different implementations in the benchmark. I suspect there may be various inaccuracies in the benchmark, as all three of us got very different results. It may be better if you share your forks of the benchmark. And don't forget the deserialization benchmark; it's in deserialization that you'll find a lot more questionable things about Serde than in serialization.

Yeah, I think the driver API will end up just exposing the raw BSON bytes and letting users do what they want with it. We assume that they'll largely use serde to deserialize it further, but there's nothing stopping them from writing custom deserialization logic if they wish to.

The benchmark I was using can be found on this branch of my fork of bson: GitHub - patrickfreed/bson-rust at serde-perf. For serialization, I think the results that @Hyeonu and I had were pretty similar after I applied the array index/key optimization.

For deserialization, the difference between bson::from_slice and your implementation is definitely more distinct, though I think that has more to do with the implementation in bson than with any limitation of serde, given that the msgpack serde library (rmp_serde) seems to be able to achieve similar performance. We've had a ticket open for trying to further optimize the deserializer to match rmp_serde's performance, but it hasn't been prioritized yet because, in practice, the extra ~200ns spent in deserialization is dwarfed by the time spent doing other things like network I/O (in the case of the driver), which is on the order of milliseconds. That being said, there is definitely room for improvement here, so we hope to look into it eventually.

1 Like

This. I have the feeling that the bson crate was originally somewhat of a quick hack, only needed so that the MongoDB driver had something to work with. I have submitted pull requests myself in the past, improving obviously sub-optimal aspects of the library, mostly removing spurious allocations.

Exactly. I have yet to see a real-world, non-microbenchmark case where JSON/BSON/whatever (de)serialization is a bottleneck. I have to work with gigabyte-sized JSON files pretty regularly (thanks to the bioinformatics industry for not yet having grokked the concept of relational databases), but the initial couple of seconds spent on parsing them don't matter even there, because the rest of the processing of the extracted data takes way longer.

3 Likes

Well, it was a bug, but

Due to using sscanf, GTA Online was parsing a 10MiB JSON string at startup in O(n²) time. Fixing the parser to be properly O(n) cut load times by 70%, so it was clearly a bottleneck there.

That said, I agree that nearly any deserialization that is not asymptotically horrible will typically complete in less time than it took to do the I/O to load the data being deserialized. And if it doesn't... what you serialize is typically the problem before how you serialize (after the low-hanging optimizations). If you need to go even further beyond that, taking a different approach (e.g. memory-mapping) will almost certainly serve you better than deserialization.
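For instance, a memory-mapping sketch using the memmap2 crate (error handling kept minimal):

```rust
use std::fs::File;

fn load(path: &str) -> std::io::Result<()> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated or modified while mapped.
    let map = unsafe { memmap2::Mmap::map(&file)? };
    // The OS pages bytes in lazily; no up-front read or copy happens,
    // and the data can be consumed in place as a byte slice.
    let _bytes: &[u8] = &map;
    Ok(())
}
```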

4 Likes

It is true that serde deserializes keys into a C-style enum. But this is really not as bad as you make it out to be; a C-style enum is just an integer representing which field it is. This mapping still has to happen, because the field could be coming from a string (when serialized into a {string: value} map) or from an integer (in a non-self-describing format). It's literally encapsulation 101 to split field-name recognition from parsing the field itself (a sketch of what the derive generates appears after the quoted exchange below). (And hey, you can use Cow to deserialize key names if you don't want a custom enum! You'll just have to live with errors in the wrong place, because you didn't parse (don't validate) in the correct spot.) Also,

the pathological case of a field name containing an escape sequence where there shouldn't be one

So I guess you don't care about being conformant to any general purpose textual format? They all allow escapes in keys, because they want to allow arbitrary keys, which requires escapes.
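For reference, this is roughly the shape serde's derive produces for key recognition (hand-simplified here, not the literal generated code):

```rust
use serde::de::{Deserialize, Deserializer, Error, Visitor};
use std::fmt;

// Keys are mapped to a C-style enum first; the matching arm then parses
// the field's value.
enum Field {
    Name,
    Age,
}

impl<'de> Deserialize<'de> for Field {
    fn deserialize<D: Deserializer<'de>>(d: D) -> Result<Self, D::Error> {
        struct FieldVisitor;

        impl<'de> Visitor<'de> for FieldVisitor {
            type Value = Field;

            fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
                f.write_str("`name` or `age`")
            }

            fn visit_str<E: Error>(self, v: &str) -> Result<Field, E> {
                match v {
                    "name" => Ok(Field::Name),
                    "age" => Ok(Field::Age),
                    _ => Err(E::unknown_field(v, &["name", "age"])),
                }
            }
        }

        d.deserialize_identifier(FieldVisitor)
    }
}
```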

If you want to complain about serde being tailored for JSON, complain about miniserde, which actually is. And miniserde is pretty much better at being a JSON serialization framework than serde is, because it's specialized to only work for JSON.

Serde is complicated because it has to be in order to support general-purpose, format-agnostic serialization. It's a hard problem. That's not to say it's perfect (nothing is), but it's likely that any simple, obvious "problem" is actually required for an important use case you didn't think about. (The exception is not being able to handle special types not in the serde data model, but I don't think that's realistically possible while remaining format agnostic.)

Also, I just want to point out that the serde data model really isn't anything special for a general-purpose serialization target; if anything it's specialized for Rust, and definitely not for JSON. JSON can't represent most of the things that the serde data model can (enum variants, value-value maps), and the data has to be massaged from one format to the other.

JSON values can be an array, boolean, null, string, number, or object (string-value map). Serde values can be bool, i8, i16, i32, i64, i128, u8, u16, u32, u64, u128, f32, f64, char, string, bytes, option, unit, unit struct, unit variant, newtype struct, newtype variant, seq, tuple, tuple struct, tuple variant, map (value-value), struct (key-value), or struct variant. I fail to see how that's optimized specifically for JSON. (Because it isn't.)

That said, #[serde] attributes do often describe what they do in terms of the corresponding JSON rather than the effect on the serde data model representation. This is primarily due to JSON being a familiar target to talk about and care about. But that isn't because they only serve JSON; they work perfectly for any self-describing format, and many work for any format.

8 Likes

The raw BSON serializer/deserializer used in bson::to_vec and bson::from_slice were actually written recently (by yours truly), so I think it's more that we haven't had the chance to micro-optimize them yet than that they're hacky.

More generally speaking, bson was originally a community maintained library, but it's since been transferred to MongoDB the company and these days is being actively maintained by my colleagues and me. So while there is some technical debt in it that needs addressing, it is a fully supported library intended to be production-ready. If you have any ideas for improvements though, we'd love to hear about them on our GitHub issue tracker or Jira project!

3 Likes