Best way to decode network packets?

Hi,

I'd like to toy with Rust for reading and writing network packets. I was thinking I should use something like serde or rustc-serialize (both of which I've never used) to decode/encode the packets, but after browsing some Rust code a bit (an NTP client and 2 DNS servers) I realize people seem to prefer hand-decoding and hand-encoding everything.
Why is that? What's the best way for me to start?

Let's start by taking a look at rustc-serialize's documentation. As you can see, it contains three modules: one for encoding to and decoding from base64, another for hex, and another for json. Serde (in the most reductive sense) does much of the same.

You can think of both crates as providing you with some ways to take data from your Rust programs, such as a struct or arrays of bytes, and turn that data into "friendly" (i.e. human-readable) formats. These kinds of formats are often referred to as text-based formats. They are useful in cases where we want humans to be able to read the data or we need to use it in a situation where we want strings rather than blobs of arbitrary memory contents.

That leads us to binary formats. A binary format is essentially just a block of memory: a bunch of bytes with values, organized in some agreed order. I should point out that the big distinction here is really in the representation of the data, because no matter what you do, when you send data over the network it's transmitted as a bunch of bytes. With a binary format, you take the literal contents of your data (that is, the actual bytes underneath it all) and you lay those bytes out in one big array. Now you can send that array of bytes out over the network, and the recipient, provided it understands the protocol you're using, should be able to reconstruct the data you sent just by reading the appropriate number of bytes into the correct variables (or struct fields).

Let's look at an example. Suppose we had the following struct.

struct Message {
  ip: [u8; 4],
  port: u16,
  body: Vec<u8>
}

Suppose we created an instance of it like this:

// Note, I haven't run this code but it should illustrate the point
let msg = Message {
  ip: [127, 0, 0, 1],
  port: 8080,
  body: vec![104, 101, 108, 108, 111] // The string "hello" encoded to ASCII bytes
};

The JSON representation of this struct would be relatively human-readable (save for the body).

{
  "ip": [127, 0, 0, 1],
  "port": 8080,
  "body": [104, 101, 108, 108, 111]
}

But look at all the wasted bytes! Surely we don't need to include the {} characters, or the property names "ip", "port", "body", the colons, the spaces, the commas, the newlines... That's a lot of wasted space (when you're transmitting millions of such packets every minute). Instead, we could design a protocol (like DNS, TCP, etc...) that says: "I will send you a sequence of bytes. They are as follows:

  1. The first four bytes will be IP address octets
  2. The following two bytes will be a port number
  3. Every byte after that is part of the body"

Now I can lay out a byte array like so:

|127|0|0|1|80|80|104|101|108|108|111|

and just send that! Much simpler and more efficient.
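To make that concrete, here is a minimal sketch of a hand-written encoder for the three rules above, plus a rough size comparison against a JSON string. The function names `encode` and `to_json` are made up for illustration, and the JSON is built by hand with `format!` only to count bytes. The diagram above writes the port as 80|80 for simplicity; this sketch assumes the port goes out as two big-endian bytes, so 8080 shows up as 31 and 144.

```rust
/// Same shape as the `Message` struct from the post.
struct Message {
    ip: [u8; 4],
    port: u16,
    body: Vec<u8>,
}

/// Lay the fields out in the order the three rules describe.
/// Assumption: the port is sent big-endian (most significant byte first).
fn encode(msg: &Message) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + 2 + msg.body.len());
    out.extend_from_slice(&msg.ip); // rule 1: four IP octets
    out.extend_from_slice(&msg.port.to_be_bytes()); // rule 2: two port bytes
    out.extend_from_slice(&msg.body); // rule 3: everything else is the body
    out
}

/// Hand-built JSON text, used here only to compare sizes.
fn to_json(msg: &Message) -> String {
    format!(
        "{{\"ip\":{:?},\"port\":{},\"body\":{:?}}}",
        msg.ip, msg.port, msg.body
    )
}

fn main() {
    let msg = Message {
        ip: [127, 0, 0, 1],
        port: 8080,
        body: b"hello".to_vec(),
    };
    let wire = encode(&msg);
    let json = to_json(&msg);
    println!("binary ({} bytes): {:?}", wire.len(), wire);
    println!("json   ({} bytes): {}", json.len(), json);
}
```

The binary form is 11 bytes for this message (4 + 2 + 5), while the JSON text is several times longer even with the whitespace trimmed.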

There are other performance-related reasons for using binary formats, but the gist is that it's mostly about efficiency. Serde and rustc-serialize are concerned with encoding/serializing data into text-based, human-friendly formats, and less with laying data out in byte arrays. That's the kind of thing you can only do if you understand the layout of those byte arrays, and so it's often left to you to handle.

Please understand that I'm fudging over some details (like how a port number would actually be represented as two bytes) for the sake of simplicity.
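To fill in the port detail: 8080 is 0x1F90, so as two bytes it is 0x1F (31) followed by 0x90 (144) when sent most significant byte first, which is the usual "network byte order". Below is a small sketch of the receiving side under that assumption. The `decode` function is a made-up name, and the layout is the toy protocol from this post, not any real one.

```rust
/// Same shape as the `Message` struct from the post.
#[derive(Debug, PartialEq)]
struct Message {
    ip: [u8; 4],
    port: u16,
    body: Vec<u8>,
}

/// Read the fields back in the order they were laid out.
/// Returns `None` if the input is too short to hold the IP and port.
fn decode(bytes: &[u8]) -> Option<Message> {
    if bytes.len() < 6 {
        return None;
    }
    let ip = [bytes[0], bytes[1], bytes[2], bytes[3]];
    // Assumption: the port was sent big-endian.
    let port = u16::from_be_bytes([bytes[4], bytes[5]]);
    let body = bytes[6..].to_vec();
    Some(Message { ip, port, body })
}

fn main() {
    // 8080 split into its two big-endian bytes.
    assert_eq!(8080u16.to_be_bytes(), [31, 144]);

    let wire = [127, 0, 0, 1, 31, 144, 104, 101, 108, 108, 111];
    let msg = decode(&wire).expect("at least six bytes");
    println!("{:?}", msg);
}
```

Note that because the body is "every byte after that", this layout only works when something else (for example the transport) tells the receiver where the message ends.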

If you check out the RFCs specifying protocols like DNS, you'll get a much more detailed look at how binary formats are used and defined.

I do hope this helps.


OK, so if I get you correctly, the code-generation (de)serializers aren't efficient enough to be used for binary data. Right, a bit sad (seeing how Rust is generally perf-oriented) but I can understand that.
Thanks for taking the time to answer me!

I really enjoyed writing a MySQL network protocol parser in nom. It has good documentation and a good developer experience, it's based on flexible composition of macros, and it's quite fun to use!
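For a feel of what "composition" means here, below is a rough sketch of the general parser-combinator idea using plain functions. This is not nom's API: `ParseResult`, `take_bytes`, `u16_be` and `header` are all made-up names, and the input layout is the toy IP-plus-port header from earlier in the thread. Each small parser returns the remaining input along with the value it parsed, and bigger parsers are built by chaining the small ones.

```rust
/// A parser returns the unconsumed input plus the parsed value,
/// or `None` if there was not enough input.
type ParseResult<'a, T> = Option<(&'a [u8], T)>;

/// Take exactly `n` bytes off the front of the input.
fn take_bytes<'a>(input: &'a [u8], n: usize) -> ParseResult<'a, &'a [u8]> {
    if input.len() < n {
        return None;
    }
    let (head, rest) = input.split_at(n);
    Some((rest, head))
}

/// Parse a big-endian u16, built on top of `take_bytes`.
fn u16_be<'a>(input: &'a [u8]) -> ParseResult<'a, u16> {
    let (rest, b) = take_bytes(input, 2)?;
    Some((rest, u16::from_be_bytes([b[0], b[1]])))
}

/// Parse four IP octets followed by a port, by chaining the parsers above.
fn header<'a>(input: &'a [u8]) -> ParseResult<'a, ([u8; 4], u16)> {
    let (rest, ip) = take_bytes(input, 4)?;
    let (rest, port) = u16_be(rest)?;
    Some((rest, ([ip[0], ip[1], ip[2], ip[3]], port)))
}

fn main() {
    let wire = [127u8, 0, 0, 1, 31, 144, 104, 105];
    match header(&wire) {
        Some((rest, (ip, port))) => {
            println!("ip={:?} port={} remaining={:?}", ip, port, rest)
        }
        None => println!("input too short"),
    }
}
```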

This applies mostly to parsing some existing protocol / binary (or string) format. If you're just de-/serializing your own data structures, then I'd suggest thinking about whether some of the existing RPC/data exchange protocols suit your use case, like CapnProto or Protobuf.

But if the goal is just to learn a bunch of Rust, then hand-coding everything may indeed be a reasonable option to consider. I think that may also be one of the reasons you'll see a bunch of hand-rolled de-/serialization in Rust land, though this is just a guess. Another potential reason would be that there are issues with the available de-/serialization libraries, or that there were at the time those projects were written.

Cheers


Just to add something to this:

There are de-/serializers that take efficiency (in terms of encoded message length and de-/ser performance) very seriously and achieve very competitive results, like CapnProto as well as others. Different libraries/implementations have different goals and optimize for differing use cases, so have a look around and pick whatever is the best fit for your use case, if any of them come close enough.

@zsck @bestouff I don't think that is an accurate characterization of rustc-serialize or Serde. There are plenty of Serde serializers for binary formats geared toward performance. Some examples are Bincode, CBOR, MessagePack, and BSON. The compact byte representation you suggested is totally possible to implement using Serde in a way that is as performant as anything you would write by hand.

The difference between Serde/rustc-serialize vs something like nom is not what format you are serializing into or how fast you are serializing. The difference is what you are serializing. An ad-hoc serializer that you implement from scratch or a deserializer you implement using nom is able to serialize structs defined by your crate only. A serializer written against Serde is able to serialize structs defined by the user.

If you want your crate to serialize data structures defined outside of your crate by the user, you need to use Serde. And it will be as fast as (probably faster than) anything you would write any other way.
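As a rough illustration of that split, here is a sketch using two hypothetical traits. These are not Serde's traits or signatures: `Sink` stands in for the format side (the role a Serde serializer plays) and `Encode` for the data side (the role of `Serialize`, which in Serde is usually derived rather than written by hand). The point is only that the format code in `to_bytes` never mentions `Message`, so it works for any type a user defines.

```rust
/// Format side (hypothetical): knows how to write primitives, nothing else.
trait Sink {
    fn put_u16(&mut self, v: u16);
    fn put_bytes(&mut self, v: &[u8]);
}

/// Data side (hypothetical): a type describes itself in terms of primitives.
trait Encode {
    fn encode<S: Sink>(&self, sink: &mut S);
}

/// One possible format: a compact big-endian byte layout.
struct CompactBinary(Vec<u8>);

impl Sink for CompactBinary {
    fn put_u16(&mut self, v: u16) {
        self.0.extend_from_slice(&v.to_be_bytes());
    }
    fn put_bytes(&mut self, v: &[u8]) {
        self.0.extend_from_slice(v);
    }
}

/// Generic entry point: works for any `T: Encode`, including types
/// defined in someone else's crate.
fn to_bytes<T: Encode>(value: &T) -> Vec<u8> {
    let mut sink = CompactBinary(Vec::new());
    value.encode(&mut sink);
    sink.0
}

// ---- what a user of the format crate would write ----

struct Message {
    ip: [u8; 4],
    port: u16,
    body: Vec<u8>,
}

impl Encode for Message {
    fn encode<S: Sink>(&self, sink: &mut S) {
        sink.put_bytes(&self.ip);
        sink.put_u16(self.port);
        sink.put_bytes(&self.body);
    }
}

fn main() {
    let msg = Message {
        ip: [127, 0, 0, 1],
        port: 8080,
        body: b"hello".to_vec(),
    };
    println!("{:?}", to_bytes(&msg));
}
```

Swapping in a different `Sink` implementation (say, one that writes text) would change the output format without touching `Message` at all.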

