Zero-cost abstraction frontier: No-copy, low-allocation ordered decoding


#1

Hello All,

I’m working on implementing Simple Binary Encoding for Rust for a
high-performance embedded use case, and have bumped up against some
interesting challenges to the hope of zero-cost abstractions.

In short, I’d like to be able to decode raw bytes (u8s) into

  1. meaningfully typed value references (f64, C-style enums, etc) without copying
  2. with an abstraction that makes incorrectly-ordered component decoding impossible
  3. with minimal per-field-decoded allocations or runtime checks
  4. for messages which may include arbitrarily-sized collections of substructures

Here’s a playground showing an example of my solution thus far.

I believe the first goal can be accomplished largely through mem::transmute,
pointer magic, and being careful with any compound structs that
map on top of a given slab of bytes.

The second goal is about making the decoder API minimize opportunities
for the end users to misinterpret regions of bytes. A straightforward example of
this is using an interpret_as_my_enum method on bytes 3..5 when the data for that
enum within the context of the encoded message is actually over at bytes 7..9.

My present solution to 2 uses session types, which amount to a long chain of structs that wrap the underlying bytes and
expose a single method that spits out a decoded value and the next step of the
type-level state machine of interpretation. This pattern makes tight
APIs that leverage Rust’s data ownership system to do correct
byte-offset management, but unfortunately runs afoul of constraint 3, minimizing
allocations. If each step in a session-typed decoding chain requires a new
wrapper struct to enforce the decoding state machine’s currently-exposed
capabilities, that implies a costly allocation per field decoded. At high speeds,
that’s a bit worrying.

I would love to learn if there are patterns by which those allocations could be avoided,
either through a different ordering-enforcement scheme, or perhaps by providing
additional hints to the compiler. Perhaps @alexcrichton or SilverWingedSeraph would know?

Constraint 4, handling messages with subsections exhibiting
variable-at-runtime-multiplicity seems like a simple matter of coding at first
glance, but has taken on a sinister aspect as I’ve wrestled with it. The most
pressing consequence is that it highlights the difficulty of
working with unsized byte array slices. Namely, without elevating the size
of the byte source from which data is being read into the
type system, we are forced to do runtime offset + length <= size style checks
and furthermore pay the allocation cost of Result wrapping each
field-interpetation attempt in case the bounds check fails.
Constraint 3, minimize checks and allocations, rides again.

My early inclination to was to generate APIs which rely on sized byte arrays,
but there are several ergonomic problems. First, since we’re dealing with
arbitrary message schema containing repeating groups, the sizing of the
byte arrays would have to be exposed to the end user for input for each distinct
fixed-length block within the message.

Second, even if we could specify a sized
array that represents the whole of a message, generating sized read-only views on
that data in order to decompose processing into well-typed helper functions
seems tough in Rust right now. At very least, I don’t know how to do it without
doing either data copies (into sized array buffers) or leaning on truly horrific
code generation of custom sized wrapper types that skip bounds checks.

In other words, it would be nice to start from a [u8; 1024] and pass down &[u8; 256] and &[u8; 768] references to decoder functions without allocating fresh arrays.

Is this a planned area for possible language or compiler improvements?

I’m hopeful that others who have forayed into the low-copy space
( dtolnay , @dwrensha , et al?) could comment on how to
aim for compile-time-assuredness of zero-copy byte array slicing and dicing?

Any and all advice would be welcome,

Thanks,

Zack


#2

The description seems to match what the bincode crate achieves. Is this not the solution you’re looking for?


#3

Thanks for the pointer. Bincode seems like a useful reference, but not a solution for several reasons.

First, I’m aiming to implement a specification (SBE) that has some properties that bincode does not:

  • SBE encourages streaming field-by-field decoding rather than all-in-one-go whole-message decoding.
  • SBE is a cross-language encoding format established in the industry, bincode seems firmly Rust-focused.

Second, while bincode does aim to be “zero fluff,” I don’t think that translates fully to “zero copy.” At very least, the use of the byteorder crate, which does copies of numeric primitives, strongly suggests as much.

Third, bincode does not attempt to solve the issue of byte range checking at runtime. Length and size checks abound.


#4

To be fair, that macro is used explicitly where data endianness doesn’t match the CPU’s. The happy path is a transmute, which is as zero-copy as you’ll get :slight_smile:.


#5

Good point about the byteorder copy being sad-path-endianness-mismatch only, @vitalyd . I haven’t found where that switch is managed yet in the crate, but I’ll keep looking.

You raise another point worth following up on, namely the copy-cost of transmute, which does not seem to be zero, per the current docs. I suppose it does not matter much for small primitives, but a full copy of larger structs during a transmute seems like it might be more expensive than fabricating a reference.


#6

Hmm, I could’ve sworn I thought that was the case but looking at it again now, I don’t see anything. The macro is “copying” into a stack local but I wonder if that gets optimized out - would need to check the generated asm to be sure. The to_le/to_be trailer is against the builtin types, and those are cfg! based on target endianess.

This is probably one of those “should be optimized out” scenarios, similar to any other copy/move operations in Rust (both of those technically mempcy data around, but that’s unlikely to be the case in most code after optimization).