Avoid allocations on bytes::Bytes wrapper/derivative type with &str attributes

Howdy,

I am writing some network code that uses the bytes crate. I have some types that I de-serialize from the underlying Bytes frames. Some of the fields are strings but I would like to use &str rather than String to avoid allocations, as this code will be processing a lot of messages at a high speed. Clearly this is a lifetime issue as the &str is referencing a Bytes instance. My first thought was that this means that the Bytes instance needs to "hang around" alongside the &str. But when trying to store the underlying Bytes frame on the struct so it can continue to be owned, the compiler didn't like that either. Also I am curious if I am missing a better way to approach this.

Here is a toy problem that I am using as a test case, trying to work through different approaches as I dance with the compiler. Any thoughts on how to approach this problem in a way that avoids allocations on each and every message? I'd like to take advantage of the Bytes object being zero-copy and already being in memory.

Many thanks!
Dave

use bytes::Bytes;

#[derive(Debug)]
struct Test<'a> {
    a: &'a str,
    b: &'a str,
    // source: Bytes,
}

impl<'a> Test<'a> {
    fn new(b: &'a mut Bytes) -> Self {
        let a = b.split_to(2);
        let a_str = into_str(&a);
        let b_str = into_str(&b);
        Test {
            a: a_str,
            b: b_str,
            // source: b,
        }
    }
}

fn into_str<'a>(b: &'a Bytes) -> &'a str {
    std::str::from_utf8(b).unwrap()
}

fn main() {
    let mut b = Bytes::from(&b"hello world"[..]);
    let t = Test::new(&mut b);
    println!("{t:?}");
}

Yes, that's because it would make the struct self-referential: the struct would store references into a buffer that the same struct also owns.

I don't think you are allocating anything extra here - the str points to the bytes in the Bytes struct.

A side note: use from_utf8_unchecked if you are immediately unwrapping the result of from_utf8.

No! That is unsound unless the outer function is itself unsafe. The current implementation panics on invalid UTF-8, while from_utf8_unchecked() would cause undefined behavior.

If you know you may get invalid UTF-8, why unwrap?

In that case the error should probably still be handled. But I think that the argument being made is that unwrapping will panic while the unchecked method could possibly continue going in a bad state.
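
To make the distinction concrete, here is a minimal sketch of handling the error instead of unwrapping or skipping the check. It uses only std; the helper names `field_as_str` and `field_len` are hypothetical and not part of the original code.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical helper: borrow `bytes` as `&str`, reporting invalid UTF-8
/// to the caller instead of panicking.
fn field_as_str(bytes: &[u8]) -> Result<&str, Utf8Error> {
    str::from_utf8(bytes)
}

/// Hypothetical caller that propagates the error with `?`.
fn field_len(bytes: &[u8]) -> Result<usize, Utf8Error> {
    let s = field_as_str(bytes)?;
    // Count Unicode scalar values, not bytes.
    Ok(s.chars().count())
}
```

With this shape, invalid input surfaces as an `Err` that the caller can log or reject, and no unsafe code is involved.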

Right, I am not allocating when I use &str; however, the Bytes instance does not live long enough. In my non-toy problem I have an async method that generates a Bytes frame, which then gets handed to an associated function on a Message enum:

async fn read_frame(frame: Bytes) -> Message 

The Message enum wraps a struct whose fields are generated by deserializing the bytes based on the message type (which is read from the frame's header). I'd like to generate that Message(MessageStruct), including the wrapped struct, without making allocations. It's just not clear where to keep the original Bytes instance in order to keep it "alive". If I used an owned String instead, it wouldn't matter what happens to the source Bytes, but I would like to keep things as close to "zero-copy" as possible. I thought about making all the attributes "getters", that is, method calls that index into a split of the Bytes (which is delimited by null bytes), but I'd rather just assign once at instantiation.

Regarding the point about the struct being self-referential, I must say that is counter-intuitive to me. In my struct example above, a and b would be pointing to source, but they don't point in any way to their owner/parent? a, b and source are all just attributes at the same level (a and b being "views" into source), and none of them reference their owner. Or at least they don't explicitly point to their owner; perhaps the issue is implicit and reflects a fundamental concept of ownership that I am missing?

Thanks again for the help!

The only reason I unwrap here is because it is just an example toy problem. I use ? in the real code.

Yeah, pointing to the same level is the problem (in most cases). Pointing to the parent can be solved with Rc/Weak or Arc/Weak.
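
For contrast, here is a minimal sketch of the arrangement the borrow checker does accept: the buffer is owned outside the struct, and the struct only borrows from it. It uses a plain byte slice so it needs only std; since Bytes dereferences to [u8], passing `&frame` for a `frame: Bytes` should work the same way. The `parse` name and the clamped 2-byte split are assumptions made for illustration.

```rust
use std::str::{self, Utf8Error};

#[derive(Debug)]
struct Test<'a> {
    a: &'a str,
    b: &'a str,
}

impl<'a> Test<'a> {
    /// Borrows both fields from `buf`; the caller must keep `buf` alive
    /// for as long as the returned `Test` is used.
    fn parse(buf: &'a [u8]) -> Result<Self, Utf8Error> {
        // Mirrors the 2-byte split in the toy problem; the split point is
        // clamped so short input does not panic.
        let (head, tail) = buf.split_at(buf.len().min(2));
        Ok(Test {
            a: str::from_utf8(head)?,
            b: str::from_utf8(tail)?,
        })
    }
}
```

This avoids any copy, but it only helps when some caller can hold the buffer for the whole lifetime of the view. It does not help when a function takes the buffer by value and has to return the parsed message.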

I'd say this is the way to go.

Okay, I assume that this would come with a performance hit unless I cache the values somehow? Maybe not, though; I could be overthinking things since I come from an interpreted-language background (Python). It seems like it would, since each getter is going to have to create a new split iterator and then use nth(i), followed by some kind of deserialization to the target type. It also gets a little tricky for nested structures that may "come from" the same Bytes slice.

I was hoping I was just missing something basic but perhaps that's not the case.

You don't necessarily need to do any "deserialization" to access the data. You can have your data structures do any necessary validity checks at instantiation, then access the data inside for only the cost of a read. For instance, here's your example rewritten to use an owned BytesString with free lookup:

use bytes::Bytes;
use std::{
    ops::Deref,
    str::{self, Utf8Error},
};

#[derive(Clone, Debug)]
pub struct BytesString {
    inner: Bytes,
}

impl BytesString {
    #[inline]
    pub fn new(bytes: Bytes) -> Result<BytesString, Utf8Error> {
        str::from_utf8(&bytes)?;
        Ok(BytesString { inner: bytes })
    }

    // Note: defined as an associated function to prevent conflicts with methods
    // on str
    #[inline]
    pub fn into_inner(s: BytesString) -> Bytes {
        s.inner
    }
}

impl Deref for BytesString {
    type Target = str;

    #[inline]
    fn deref(&self) -> &Self::Target {
        // SAFETY: `self.inner` is always a valid UTF-8 string.
        unsafe { str::from_utf8_unchecked(&self.inner) }
    }
}

// Add AsRef, Borrow, etc. as needed

#[derive(Debug)]
struct Test {
    a: BytesString,
    b: BytesString,
}

impl Test {
    fn new(bytes: &Bytes) -> Self {
        let mut a = bytes.clone();
        let b = a.split_off(2);
        Test {
            a: BytesString::new(a).unwrap(),
            b: BytesString::new(b).unwrap(),
        }
    }
}

Notice how from_utf8() is called in BytesString::new() to verify that the slice is valid UTF-8, so that Deref::deref() can soundly call from_utf8_unchecked(). Also, the compiler can already inline function calls within a crate on its own; the #[inline] attribute is a hint that additionally makes the function available for inlining across crates.
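
The same validate-once idea can be combined with the getter approach raised earlier in the thread. Below is a rough sketch, under stated assumptions, that owns one buffer and records the byte ranges of two NUL-delimited fields at construction, so each getter is a cheap slice. To stay std-only it uses Vec<u8>/String as a stand-in for the Bytes frame; the `Frame` type, its two-field layout, and the `Option` return are all hypothetical.

```rust
use std::ops::Range;

/// Hypothetical frame type: owns the buffer and remembers where each
/// field lives, so no field holds a reference into a sibling field.
#[derive(Debug)]
pub struct Frame {
    text: String,              // validated once in `parse`
    fields: [Range<usize>; 2], // byte ranges of the two fields
}

impl Frame {
    /// Validates UTF-8 once and records the ranges of two NUL-delimited
    /// fields. Returns `None` for invalid UTF-8 or a missing delimiter.
    pub fn parse(raw: Vec<u8>) -> Option<Frame> {
        // `String::from_utf8` reuses the vector's allocation.
        let text = String::from_utf8(raw).ok()?;
        let nul = text.find('\0')?;
        let fields = [0..nul, nul + 1..text.len()];
        Some(Frame { text, fields })
    }

    /// First field: everything before the delimiter.
    pub fn a(&self) -> &str {
        &self.text[self.fields[0].clone()]
    }

    /// Second field: everything after the delimiter.
    pub fn b(&self) -> &str {
        &self.text[self.fields[1].clone()]
    }
}
```

The getters do no splitting or re-validation; the offsets are computed once. With a real Bytes frame, the struct would presumably hold the Bytes plus the ranges, and the one-time UTF-8 check would go through str::from_utf8 as in BytesString::new().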

Boom! That looks great. Thank you for taking the time to clearly illustrate this!

Best,
Dave

The string crate provides an existing version of this type, for what it's worth.

Yes, the string crate would solve this nicely.
Additionally, depending on what you're actually doing, tendril might be interesting.

Thanks for the suggestions on the other crates, will take a look.
