This crate targets simplicity and performance. No dependencies are used, just the standard Rust library.
We have two main structures available:
Message - Owned parsed values
MessageRef - Message parsed by reference and bound to the lifetime of the reader's source
Snippet example:

```rust
use msgpacker::prelude::*;
use std::io::{Cursor, Seek};

let buffer = vec![0u8; 4096];
let mut cursor = Cursor::new(buffer);

let key = Message::string("some-key");
let value = Message::integer_signed(-15);
let entry = MapEntry::new(key, value);
let message = Message::map(vec![entry]);

// Write the message to the cursor
message.pack(&mut cursor).expect("Message pack failed");
cursor.rewind().expect("Reset the cursor to the beginning");

// Read the message from the cursor
let restored = Message::unpack(&mut cursor).expect("Message unpack failed");

// Alternatively, we can use the index implementation
let value = restored["some-key"]
    .as_integer()
    .expect("The value was an integer")
    .as_i64()
    .expect("The value was a negative number");

assert_eq!(value, -15);
```
Your code is trusting the number of array elements specified by the input and allocating whatever capacity it requests, which opens the door for parser bombs.
Yes. It's not really the protocol's responsibility to check for that, right? As long as the reader is able to provide enough data, it will just create the messages.
As per the protocol spec, we can take up to 4,294,967,295 items per array.
You're trusting that all the data will be provided. It's possible to create a five-byte message that doesn't actually contain the data but fools the parser into allocating space for it. These five bytes can even be repeated to allocate any number of nested arrays without providing the data for any of them.
You're still not benchmarking any competing crates, so the benchmark doesn't really say anything. Look at the benchmarking code related to the post I previously linked to for reference.
Posting the violin chart of the benchmark makes the results a lot easier to read. Or even better, publish the full report in HTML format, perhaps through CI.
That is true. I have two naive spots in the implementation: first, this pre-alloc; second, the unrestricted recursion for arrays and maps.
The second is even cheaper to attack. For the pre-alloc I'd need several requests, because each of them would allocate around 4G; that by itself is not enough to kill a modern server, but with a couple of others in parallel we might be talking about some serious damage. For the second, I can just explode the call stack with a bunch of bytes. I'll release a new version with fixes for both.
Also, do you have a recommendation for other performant MessagePack implementations for Rust? I'll create a more comprehensive benchmark.
Thanks for the feedback, appreciate it!
Update:
rmpv is by far the most downloaded one. I'll write the benchmarks comparing against it, but if there is any other relevant crate I'm missing, please let me know and I'll include it!
Each array can allocate up to u32::MAX (4_294_967_295) * std::mem::size_of::<Message>() bytes. With a brief look at the definition of Message, I estimate its size is 40 bytes. The total is then 160 GiB, not 4 GiB.
As I pointed out in my last message, there's no need to make several requests in parallel to get several big allocations. One request can repeat the same five-byte sequence several times, each time allocating a nested array.
As you're mentioning servers as a use case, I'm wondering what's the motivation for a design that works with an enum type representing data of arbitrary form. Many (if not most) use cases, including the typical server, work with data in a specific form, which in Rust means using the type system to define the form of the data. Other serialization/deserialization libraries for Rust can work directly with statically typed data without going through a soup of enums.
There's rmp-serde, which has five times as many downloads as the one you claim to be the most downloaded. I recommend also including libraries for other formats in the benchmark, such as serde_json and serde_cbor. When using rmp-serde, keep in mind that you have to choose whether to include field names in the serialized data, in order to make a fair comparison, as discussed in the other thread.
Does it make sense to compare different protocols? MessagePack is known to be faster than JSON, while the latter provides better readability. It's the headline of their website, and there are plenty of benchmarks out in the wild comparing the two.
Unless, of course, there is an implementation error that makes MessagePack slower.
They don't aim at the same problem, right? rmpv will create a Value, while rmp_serde just moves bytes around, leaving all the protocol functionality and implementation (e.g. Boolean, Integer, Float, Map, etc.) to the user. The goal of this crate is to provide a struct that encapsulates the behavior of the MessagePack protocol.
We fall into the same problem of comparing different protocols.
As with everything in life, that depends. The decision to use enum-based structs might be optimal for implementations that require abstraction without additional overhead, with the trade-off that the enum is more expensive memory-wise than concrete struct variants.
Update:
I ran a preliminary benchmark between rmpv and msgpacker, and msgpacker is faster. I'll also include rmps.
Yes. It's not unheard of that libraries for supposedly faster formats are slower than libraries for JSON, because popular JSON libraries are highly optimized. Comparing libraries for different protocols helps find opportunities for optimization.
On the contrary, rmp_serde will create Value or better. I suggest you add it twice to the benchmark, both using Value and using serde_derive.
It depends on the use case, and as I mentioned before, the most likely use case is one where a soup of enums is a lot worse in every way. I don't see what kind of abstraction would make enum values optimal.
It's more like on par. More interesting would be if you can go twice as fast. You could for example try the bump arena trick I suggested in the other thread.