Zero-cost abstraction frontier: No-copy, low-allocation ordered decoding

ZackPierce · June 23, 2017, 11:30pm

Hello All,

I'm working on implementing Simple Binary Encoding for Rust for a
high-performance embedded use case, and have bumped up against some
interesting challenges to the hope of zero-cost abstractions.

In short, I'd like to be able to decode raw bytes (u8s) into

meaningfully typed value references (f64, C-style enums, etc) without copying
with an abstraction that makes incorrectly-ordered component decoding impossible
with minimal per-field-decoded allocations or runtime checks
for messages which may include arbitrarily-sized collections of substructures

Here's a playground showing an example of my solution thus far.

I believe the first goal can be accomplished largely through mem::transmute,
pointer magic, and being careful with any compound structs that
map on top of a given slab of bytes.

The second goal is about making the decoder API minimize opportunities
for the end users to misinterpret regions of bytes. A straightforward example of
this is using an interpret_as_my_enum method on bytes 3..5 when the data for that
enum within the context of the encoded message is actually over at bytes 7..9.

My present solution to 2 uses session types, which amount to a long chain of structs that wrap the underlying bytes and
expose a single method that spits out a decoded value and the next step of the
type-level state machine of interpretation. This pattern makes tight
APIs that leverage Rust's data ownership system to do correct
byte-offset management, but unfortunately runs afoul of constraint 3, minimizing
allocations. If each step in a session-typed decoding chain requires a new
wrapper struct to enforce the decoding state machine's currently-exposed
capabilities, that implies a costly allocation per field decoded. At high speeds,
that's a bit worrying.

I would love to learn if there are patterns by which those allocations could be avoided,
either through a different ordering-enforcement scheme, or perhaps by providing
additional hints to the compiler. Perhaps @alexcrichton or SilverWingedSeraph would know?

Constraint 4, handling messages with subsections exhibiting
variable-at-runtime-multiplicity seems like a simple matter of coding at first
glance, but has taken on a sinister aspect as I've wrestled with it. The most
pressing consequence is that it highlights the difficulty of
working with unsized byte array slices. Namely, without elevating the size
of the byte source from which data is being read into the
type system, we are forced to do runtime offset + length <= size style checks
and furthermore pay the allocation cost of Result wrapping each
field-interpetation attempt in case the bounds check fails.
Constraint 3, minimize checks and allocations, rides again.

My early inclination to was to generate APIs which rely on sized byte arrays,
but there are several ergonomic problems. First, since we're dealing with
arbitrary message schema containing repeating groups, the sizing of the
byte arrays would have to be exposed to the end user for input for each distinct
fixed-length block within the message.

Second, even if we could specify a sized
array that represents the whole of a message, generating sized read-only views on
that data in order to decompose processing into well-typed helper functions
seems tough in Rust right now. At very least, I don't know how to do it without
doing either data copies (into sized array buffers) or leaning on truly horrific
code generation of custom sized wrapper types that skip bounds checks.

In other words, it would be nice to start from a [u8; 1024] and pass down &[u8; 256] and &[u8; 768] references to decoder functions without allocating fresh arrays.

Is this a planned area for possible language or compiler improvements?

I'm hopeful that others who have forayed into the low-copy space
( dtolnay , @dwrensha , et al?) could comment on how to
aim for compile-time-assuredness of zero-copy byte array slicing and dicing?

Any and all advice would be welcome,

Thanks,

Zack

alteous · June 23, 2017, 11:56pm

The description seems to match what the bincode crate achieves. Is this not the solution you're looking for?

ZackPierce · June 26, 2017, 4:54pm

Thanks for the pointer. Bincode seems like a useful reference, but not a solution for several reasons.

First, I'm aiming to implement a specification (SBE) that has some properties that bincode does not:

SBE encourages streaming field-by-field decoding rather than all-in-one-go whole-message decoding.
SBE is a cross-language encoding format established in the industry, bincode seems firmly Rust-focused.

Second, while bincode does aim to be "zero fluff," I don't think that translates fully to "zero copy." At very least, the use of the byteorder crate, which does copies of numeric primitives, strongly suggests as much.

Third, bincode does not attempt to solve the issue of byte range checking at runtime. Length and size checks abound.

vitalyd · June 26, 2017, 5:32pm

To be fair, that macro is used explicitly where data endianness doesn't match the CPU's. The happy path is a transmute, which is as zero-copy as you'll get .

ZackPierce · June 26, 2017, 5:48pm

Good point about the byteorder copy being sad-path-endianness-mismatch only, @vitalyd . I haven't found where that switch is managed yet in the crate, but I'll keep looking.

You raise another point worth following up on, namely the copy-cost of transmute, which does not seem to be zero, per the current docs. I suppose it does not matter much for small primitives, but a full copy of larger structs during a transmute seems like it might be more expensive than fabricating a reference.

vitalyd · June 26, 2017, 6:36pm

Hmm, I could've sworn I thought that was the case but looking at it again now, I don't see anything. The macro is "copying" into a stack local but I wonder if that gets optimized out - would need to check the generated asm to be sure. The to_le/to_be trailer is against the builtin types, and those are cfg! based on target endianess.

This is probably one of those "should be optimized out" scenarios, similar to any other copy/move operations in Rust (both of those technically mempcy data around, but that's unlikely to be the case in most code after optimization).

dm3 · September 5, 2018, 9:26am

Sorry about the necromantic rituals...

@ZackPierce, I see you've contributed a Rust SBE parser generator to the SBE project. That's awesome! Did you go further with Rust for the rest of your use case? Any significant bumps that you've met along the road?

ZackPierce · September 5, 2018, 1:54pm

Thanks, @dm3 . For the unfamiliar, the effort referenced was explained in a blog post last year.

The further Rustification of the use case is a work-in-stuttering-progress. The largest remaining technical roadbumps on the top of my head are dealing with alignment and type-level zero-cost endianness.

Since a year ago, the Rust compiler has raised its awareness of potential data misalignment for borrows off of packed data structures, and the codec implementation could be more clever about avoiding those warnings and potential errors. Right now many of the present implementation's user-facing types are repr(packed) and the backing storage format is an arbitrary byte slice, and while it may be tempting to await more options from the compiler, I'm pretty sure that attentive analysis from the codec-generator could use the available repr(C) and repr(align) options to greater affect.

As for capturing endianness at compile time and managing differences at runtime with low/zero costs, the constraints of my problem domain meant I was able to simply punt on it.

Finally, I'd say that the most significant bump was choosing to work in the context of a pre-existing project, written in Java where the main method of code generation is ad-hoc string print-templating and the definition DSL is in XML. To be sure, I'm very appreciative of the SBE team their open source efforts. That said, being able to more readily break with a pre-existing spec and work more completely in Rust (and, say, make wider use of macros) would be a boon for more rapidly targeting some of the advanced type-safe zero-cost features I'm hoping for.

dm3 · May 3, 2019, 2:02pm

I'm now back on the tasks related to SBE. Picked up the Rust codec and it works wonderfully! Thanks for opensourcing it!

@ZackPierce - are you still interested in making improvements to the implementation you've contributed or did you go a different way?

ZackPierce · May 3, 2019, 3:38pm

Glad to hear it's working as a starting point for you.

At present I lack the time-capacity to contribute to that implementation, but will ping you or raise a hand here if that changes.

dm3 · May 6, 2019, 7:30am

Great! The reason I'm asking is that I might be able to contribute and possibly make design changes to the codec. Was wondering if you'd like me to run the ideas by you.

hakejam · December 10, 2019, 2:12am

Thanks @ZachPierce . This looks really good. Do you have any suggestions on how to handle the endianness issues?

Topic		Replies	Views
[ NEWBIE ] unconstrained lifetime parameter	17	928	March 23, 2022
Is this the most efficient way to perform a highly-iterative bitwise de/encoding task? help	7	1730	January 12, 2023
Not quite getting it: ergonomic handling of slices/arrays with known fixed sizes	8	822	March 26, 2023
Design problem: minimizing allocation/copying while parsing for a Tokio Decoder help	7	1325	January 12, 2023
C ffi - avoid final string copy help	5	1317	January 12, 2023

Zero-cost abstraction frontier: No-copy, low-allocation ordered decoding

Related topics