Zero-copy Deserializing with ArcRef and Serde

I have a struct that holds metadata about a block of data, and the data itself (usually 1MB in size). I currently use serde & bincode to serialize, and everything works; however, I'm copying 1MB of data each time. I want to use serde's borrowing capabilities to prevent this copy. The problem is that someone must own the underlying data. My thought is to put the data into an Arc, then have an inner struct that represents the block. This is what I've come up with:

use owning_ref::{OwningRef, ArcRef};

#[derive(Deserialize, Serialize)]
struct InnerBlock<'a> {
    id: u64,
    #[serde(borrow)]
    data: &'a[u8]
}

struct Block<'a> {
    inner: OwningRef<Arc<Vec<u8>>, InnerBlock<'a>>,
    raw_data: Arc<Vec<u8>>
}

Then when I want to read this block from a file (for example) I'd have something like the following:

fn read_block_from_file<'a>() -> Block<'a> {
    let mut f = OpenOptions::new().read(true).open("/dev/zero").unwrap();
    let mut buff = vec![0; 1024];

    f.read_exact(&mut buff);

    let raw_data = Arc::new(buff);
    let or = ArcRef::new(raw_data.clone());
    let inner = or.map(|b| {
        let i : InnerBlock = bincode::deserialize(b.as_slice()).unwrap();
        &i
    });

    Block { inner, raw_data }
}

However, I get the following error from the compiler:

error[E0495]: cannot infer an appropriate lifetime for autoref due to conflicting requirements
  --> src/main.rs:28:53
   |
28 |         let i : InnerBlock = bincode::deserialize(b.as_slice()).unwrap();
   |                                                     ^^^^^^^^
   |
note: first, the lifetime cannot outlive the anonymous lifetime #2 defined on the body at 27:24...
  --> src/main.rs:27:24
   |
27 |       let inner = or.map(|b| {
   |  ________________________^
28 | |         let i : InnerBlock = bincode::deserialize(b.as_slice()).unwrap();
29 | |         &i
30 | |     });
   | |_____^
note: ...so that reference does not outlive borrowed content
  --> src/main.rs:28:51
   |
28 |         let i : InnerBlock = bincode::deserialize(b.as_slice()).unwrap();
   |                                                   ^
note: but, the lifetime must be valid for the lifetime `'a` as defined on the function body at 19:25...
  --> src/main.rs:19:25
   |
19 | fn read_block_from_file<'a>() -> Block<'a> {
   |                         ^^
note: ...so that the expression is assignable
  --> src/main.rs:32:5
   |
32 |     Block { inner, raw_data }
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^
   = note: expected `Block<'a>`
              found `Block<'_>`

Not sure if there is a way to do this w/Serde. Any help would be greatly appreciated!

Self-referential structs are rarely worth the trouble in Rust. I would suggest trying to refactor so that the block allocation is an entirely separate struct from the data referring to it, and using the typical Rust borrowing patterns. Those patterns typically take a bit of practice before they click. Since you're using Arc I assume you're multithreaded, so scoped threads may help: either crossbeam or, at a higher level, Rayon. With those you can probably even avoid the Arc around the buffer.

You could post an example of why you think you need self referential structs.

If you really want to do self-referential, you may need a different library. I could never work out how to do fancy things with owning_ref, but it should be doable with rental, which is no longer maintained (very complicated internals), or ouroboros, which is currently maintained but not well known, so it hasn't been as battle-tested.

Appreciate the reply, but I don't see how it helps to have the two in separate structs. Changing to something like this:

#[derive(Deserialize, Serialize)]
struct InnerBlock<'a> {
    id: u64,
    #[serde(borrow)]
    data: &'a[u8]
}

struct BlockData {
    raw_data: Arc<Vec<u8>>
}

struct Block<'a> {
    inner: InnerBlock<'a>,
    data: BlockData
}

fn read_block_from_file<'a>() -> Block<'a> {
    let mut f = OpenOptions::new().read(true).open("/dev/zero").unwrap();
    let mut buff = vec![0; 1024];

    f.read_exact(&mut buff);

    let data = BlockData { raw_data: Arc::new(buff) };
    let inner :InnerBlock = bincode::deserialize(data.raw_data.clone().as_slice()).unwrap();

    Block { inner, data }
}

I'm still faced with basically the same issue:

error[E0515]: cannot return value referencing temporary value
  --> src/main.rs:32:5
   |
30 |     let inner :InnerBlock = bincode::deserialize(data.raw_data.clone().as_slice()).unwrap();
   |                                                  --------------------- temporary value created here
31 | 
32 |     Block { inner, data }
   |     ^^^^^^^^^^^^^^^^^^^^^ returns a value referencing data owned by the current function

I need to convince the borrow checker that this data and the struct that references it will have the same lifetimes. That's really what I'm struggling with. I thought putting them in the same struct would achieve that, but it clearly doesn't. Thanks!

Lifetimes are used to link one value to the life of another. You almost never want the same lifetime; you want the compiler to confirm that the borrower has a shorter life than the thing borrowed from. The issue probably isn't with the types but with how they're used. For example, why isn't this suitable?

use std::{fs::OpenOptions, sync::Arc, io::Read};
use serde::{Serialize, Deserialize};
use anyhow::Result;

struct BlockData {
    raw_data: Vec<u8>
}

#[derive(Deserialize, Serialize)]
struct InnerBlock<'a> {
    id: u64,
    #[serde(borrow)]
    data: &'a[u8]
}

impl BlockData {
    fn from_file() -> Result<Self> {
        let mut f = OpenOptions::new().read(true).open("/dev/zero")?;
        let mut buff = vec![0; 1024];
        f.read_exact(&mut buff)?;
        Ok(BlockData { raw_data: buff })
    }

    fn interpret(&self) -> Result<InnerBlock<'_>> {
        Ok(bincode::deserialize(&self.raw_data)?)
    }
}

fn main() -> Result<()> {
    let block_data = BlockData::from_file()?;
    let inner = block_data.interpret()?;

    let inner2 = block_data.interpret()?;

    println!("inner: {}, inner2: {}", inner.id, inner2.id);

    Ok(())
}

@Cocalus thanks again for the reply! I'd have to think about it more, but the problem is that I want access to both the metadata and the serialized data, and that's tough/awkward with 2 structs.

For example, I want methods to get things like the block's id via block.id(). I can either deserialize InnerBlock each time in this method, or add the method to InnerBlock. The first option is obviously slow, and the second option suffers from the problem that I'll want to serialize this block at some point. However, to do that I'd need to copy this data again, or keep a copy of Block around somewhere else, which basically gets me back to my original problem. I'd really like to keep it all packaged up in a single struct.

How about

use anyhow::Result;
use serde::{Deserialize, Serialize};
use std::{fs::OpenOptions, io::Read};

#[derive(Deserialize, Serialize, Clone)]
struct InnerBlock<'a> {
    id: u64,
    #[serde(borrow)]
    data: &'a [u8],
}

/// Could use Arc/Rc to make clones cheaper but this is 4 usizes + 1 u64 per clone.
#[derive(Clone)]
struct Block<'a> {
    raw_data: &'a [u8],
    inner: InnerBlock<'a>,
}

impl<'a> Block<'a> {
    fn id(&self) -> u64 {
        self.inner.id
    }

    fn from_bytes(bytes: &'a [u8]) -> Result<Self> {
        Ok(Block {
            raw_data: bytes,
            inner: bincode::deserialize(&bytes)?,
        })
    }
}

fn main() -> Result<()> {
    let mut f = OpenOptions::new().read(true).open("/dev/zero")?;
    let mut buff = vec![0u8; 1024];

    f.read_exact(&mut buff)?;

    let block = Block::from_bytes(&buff)?;

    println!("{} {}", block.id(), block.raw_data[10]);

    Ok(())
}

I don't see how this gets you back to the original problem – to me, it seems like that is the exact solution to the original problem. I.e. just don't use Block except when you need to mutate or move the raw data (note that it's easy for InnerBlock to hold multiple references – it can have one reference that borrows the whole chunk, as well as the deserialized stuff, so you don't need to own Block to access the raw data in a shared fashion). Mutating the raw data is something you can't do while extant references exist anyway.

I think you need OwningHandle to achieve something like what you originally tried; OwningRef is for when the "owning" thing is specifically a & reference.

@Cocalus thanks again for the reply! The problem is with lifetimes again... if I move your read-from-file code out to another function, so that buff goes out of scope, I run into the issue:

fn read_from_file() -> Block<'static> {
    let mut f = OpenOptions::new().read(true).open("/dev/zero").unwrap();
    let mut buff = vec![0u8; 1024];

    f.read_exact(&mut buff).unwrap();

    Block::from_bytes(&buff)
}

error[E0515]: cannot return value referencing local variable `buff`
  --> src/main.rs:44:5
   |
44 |     Block::from_bytes(&buff)
   |     ^^^^^^^^^^^^^^^^^^-----^
   |     |                 |
   |     |                 `buff` is borrowed here
   |     returns a value referencing data owned by the current function

Even if I change from_bytes to take ownership of Vec<u8>, I still get hung up on the borrow by deserialize. The only way I've been able to make it work is via a terrible hack with Option and Arc<RwLock>. It's a lot of mechanics, but it seems to work.

@trentj I think this shows my original problem again... I have to keep 2 things around: the thing that holds the data, and the struct that references the data. I really want these both in the same struct (or the appearance thereof), but haven't found a clean way to make the lifetimes work. A modified version of the above code is a clear example of this:

   fn from_bytes(bytes: Vec<u8>) -> Self {
        Block {
            raw_data: bytes,
            inner: bincode::deserialize(&bytes).unwrap(),
        }
    }

It should be OK that I'm doing the borrow &bytes as bytes will be owned by the same struct, and therefore the reference will still be alive. However, I don't know a way to tell the compiler that the borrow is OK. I want to say something like &self.raw_data, but self hasn't been created yet... this is where I get the terrible Option hack from.

This is the problem. Relax that constraint, and all your problems vanish. So what is so important about keeping the value and the reference in the same struct?

The problem with self-referential structs is they can't be safely moved or mutated in the general case. (Not with only static analysis, anyway.) So you limit yourself to non-general cases, but you make some concessions in the API to allow moving (this is what you get from a crate like owning_ref). But if you instead design your API such that the self-borrowing is unnecessary, you'll tend to run into less friction because the simpler borrowing semantics are more amenable to static analysis.

@trentj, thanks for the response! I guess I ultimately don't know how to relax the constraint or redesign... take the read_from_file example above... how would I make that work?

fn read_from_file() -> Block<'static> {
    let mut f = OpenOptions::new().read(true).open("/dev/zero").unwrap();
    let mut buff = vec![0u8; 1024];

    f.read_exact(&mut buff).unwrap();

    Block::from_bytes(buff)
}

With from_bytes currently being:

    fn from_bytes(bytes: Vec<u8>) -> Self {
        Block {
            raw_data: bytes,
            inner: bincode::deserialize(&bytes).unwrap(),
        }
    }

The references would only be certain to remain alive if the borrowing rules were enforced. (If they weren't enforced, raw_data could be modified -- e.g. replaced or reallocated -- and this could invalidate your references.) But lifetimes and borrowing aren't properties of data structures, they are an analysis on (executing) code blocks. That's why there's no clean way to tell the compiler that this is okay. From its way of tracking lifetimes and borrows, it is not okay.

There are ways to work around it, but they're generally unsafe and/or hacky.

I suspect the problem is better resolved by looking at how this API is called, to figure out the best trade-offs. While a self-referential struct in your case is probably doable (since the data pointed at by the vector won't move, although other parts of the vector, like its capacity/size, will move), it's very likely not worth the API complication. The complication is making sure only non-moving parts are referred to, so that the third-party crate can correctly use unsafe to lie to the compiler by casting normal lifetimes into 'static, or into raw pointers.

The most basic way is to load the file into an owned struct (like Vec), then borrow that and make the borrowed view (this is my second example), and pass that borrowed view around as needed with the allocation rooted in the call stack. This is pretty basic Rust borrowing, so I suspect it isn't working for you due to some other issue. This looks like an XY problem, which is very common when moving to Rust from another language, since borrowing often takes a bit of practice before it clicks.

@Cocalus, thanks again! I might be confusing the issue via an XY problem... my apologies.

It's really just that the owned thing (the Vec) has to travel around with the view, as you call it, into that owned thing.

Say for example after I read it out of a file I want to store the block in a cache, but also pass a clone up to some caller. This becomes harder (maybe not impossible?) to manage.

I think the easiest solution is to heed words of advice from somewhere (cannot find the source) that basically say, "Don't kill yourself on making everything zero copy, as memcpy is pretty fast!" In that spirit, I'm going with an "owned" version of inner, plus an offset into a Vec<u8> (really I'll use ByteBuf to speed up serde) that has the whole thing serialized. This way I have a "no copy" version of serialization, and can also use serde's methods but simply incur additional overhead.

Thanks again everyone for the help!

OK, a cache plus borrowing makes some sense as a reason to want self-borrowing, since cache invalidation doesn't play well with borrowing.

Here's how to do the self-referential version using ouroboros (I've been meaning to play with it for a while anyway). Rental could do it as well.

use anyhow::Result;
use ouroboros::self_referencing;
use serde::{Deserialize, Serialize};
use std::sync::Arc;

#[derive(Deserialize, Serialize)]
struct InnerBlock<'a> {
    id: u64,
    #[serde(borrow)]
    data: &'a [u8],
}

impl<'a> InnerBlock<'a> {
    fn from_bytes(bytes: &'a [u8]) -> Self {
        bincode::deserialize(bytes).unwrap()
    }
}

#[self_referencing]
struct Block {
    raw_data: Vec<u8>,
    #[borrows(raw_data)]
    inner: InnerBlock<'this>,
}

fn read_block(raw_data: Vec<u8>) -> Block {
    BlockBuilder {
        raw_data,
        inner_builder: |raw_data: &[u8]| InnerBlock::from_bytes(raw_data),
    }
    .build()
}

fn main() -> Result<()> {
    let b1 = Arc::new(read_block(vec![0; 1024]));
    let b2 = Arc::clone(&b1);

    let thread = std::thread::spawn(move || {
        println!("b2 {:?}", b2.with_inner(|inner| inner.id));
    });
    println!("b1 {:?}", b1.with_inner(|inner| inner.id));

    thread.join().expect("Thread panicked");
    Ok(())
}

Yes, read_from_file is a thing you just cannot do with a Block that borrows its data (there's no owner to borrow from). Instead, you have to put the owning buffer (let's just call it a Vec<u8>, for simplicity) into long-term storage, and then borrow the Block from there. In a simple case it might be as easy as changing

let block = Block::read_from_file("abc")?;

to

let owner = std::fs::read("abc")?;
let block = Block::from_bytes(&owner)?;

But sure enough, there are more complicated cases where this gets... more complicated. So let's talk about some of those.

Moving the Vec is fundamentally at odds with borrowing from it. To first order, if you can move it, you can mutate it and invalidate the borrow. Yes, the fact that Vec's buffer has a stable address sometimes means this isn't always true, but that's something the compiler doesn't (and can't) understand. I really like this way of putting it:

You can't have both compiler-checked safe borrowing and be able to move the Vec (with the borrows persisting during the move). Even owning_ref won't allow that (moving the OwningRef will invalidate extant borrows that aren't the owning one). Which is why this probably won't work either:

If you move a thing into a cache, its lifetime is no longer that of the thing, but now of the cache. Unless both the thing and the cache borrow from the same owner, you won't be able to do that while still holding a reference to it. To the compiler, mem::drop, mem::forget and (rhetorical) cache.store all have the same signature: fn(T) -> ().

It might be what you're looking for is not borrowing, but shared ownership. Unfortunately Serde's "rc" feature does a built in copy when you deserialize to Arc (hmm, that might be fixable...?) but you could at least avoid copying again by using bincode::deserialize_from. Or maybe you could write a custom deserializer. I haven't dedicated too much thought to it.
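
A minimal sketch of the shared-ownership idea for the cache scenario mentioned earlier, using only std. The `Cache` type, its methods, and the owned `Block` layout are hypothetical names for illustration, not anything from serde or bincode. The point is only that the cache and the caller each hold an `Arc` handle to the same block, so no borrow has to survive a move:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical owned block: no lifetime parameter, so it can be stored anywhere.
struct Block {
    id: u64,
    data: Vec<u8>,
}

/// Hypothetical cache that shares blocks instead of lending references to them.
struct Cache {
    blocks: HashMap<u64, Arc<Block>>,
}

impl Cache {
    fn new() -> Cache {
        Cache { blocks: HashMap::new() }
    }

    /// Stores the block and hands the caller a second handle to the same allocation.
    fn insert(&mut self, block: Block) -> Arc<Block> {
        let shared = Arc::new(block);
        self.blocks.insert(shared.id, Arc::clone(&shared));
        shared
    }

    /// Cloning the Arc bumps a reference count; the payload bytes are not copied.
    fn get(&self, id: u64) -> Option<Arc<Block>> {
        self.blocks.get(&id).cloned()
    }

    /// Eviction drops only the cache's handle; outstanding handles stay valid.
    fn evict(&mut self, id: u64) {
        self.blocks.remove(&id);
    }
}
```

The trade-off is that the block is read-only while shared, and the bytes are freed only when the last handle is dropped.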

There are at least two other possibilities. One uses Cow and has some runtime cost, the other amounts to basically writing two structs... or maybe one generic struct, parameterized with either &'a [u8] or Vec<u8>. But I really need to get some sleep so maybe that's an exploration for another day.
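
A rough sketch of what those two possibilities might look like, std only and without the serde derives. The names `Block<B>`, `to_owned_block`, and `CowBlock` are made up for illustration. The generic version is one definition that works as a borrowed view (`B = &[u8]`) or as an owned value (`B = Vec<u8>`); the Cow version is a single concrete type that decides at run time whether it borrows or owns:

```rust
use std::borrow::Cow;

/// One definition, generic over how the payload bytes are held.
#[derive(Debug, Clone, PartialEq)]
struct Block<B: AsRef<[u8]>> {
    id: u64,
    data: B,
}

impl<B: AsRef<[u8]>> Block<B> {
    fn id(&self) -> u64 {
        self.id
    }

    fn data(&self) -> &[u8] {
        self.data.as_ref()
    }
}

impl<'a> Block<&'a [u8]> {
    /// Detaches the view from its source buffer at the cost of one copy.
    fn to_owned_block(&self) -> Block<Vec<u8>> {
        Block { id: self.id, data: self.data.to_vec() }
    }
}

/// Cow variant: borrowed while the source buffer is around, owned once detached.
struct CowBlock<'a> {
    id: u64,
    data: Cow<'a, [u8]>,
}

impl<'a> CowBlock<'a> {
    /// Copies only if the payload is still borrowed; an owned payload is moved.
    fn into_owned(self) -> CowBlock<'static> {
        CowBlock { id: self.id, data: Cow::Owned(self.data.into_owned()) }
    }
}
```

Either way the copy still happens at the point where the block has to outlive its buffer (for example when it goes into a cache); the borrowed form only helps for the short-lived uses.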

One other approach would be storing non-reference analogs (e.g. offset + length, or some sort of weak pointer) and a lot of impl Deref or similar to create references on the fly (which may lessen, but would not eliminate, the run time hit of deserializing on the fly). I think this is the approach of ouroboros (but I didn't look into it in depth).
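
A minimal sketch of the offset + length idea, under the assumption of a made-up layout (an 8-byte little-endian id followed by the payload) rather than bincode's actual encoding. `Block::parse`, `data`, and `serialized` are hypothetical names. The struct stores a `Range<usize>` instead of a `&[u8]`, so it has no lifetime parameter, and the slice is rebuilt on each access:

```rust
use std::ops::Range;
use std::sync::Arc;

/// Owns the serialized bytes and remembers where the payload lives,
/// instead of holding a reference into them.
#[derive(Clone)]
struct Block {
    raw: Arc<Vec<u8>>,
    id: u64,
    payload: Range<usize>,
}

impl Block {
    /// Assumed layout for illustration: 8-byte little-endian id, then the payload.
    fn parse(raw: Vec<u8>) -> Option<Block> {
        let mut header = [0u8; 8];
        header.copy_from_slice(raw.get(..8)?);
        let payload = 8..raw.len();
        Some(Block {
            raw: Arc::new(raw),
            id: u64::from_le_bytes(header),
            payload,
        })
    }

    fn id(&self) -> u64 {
        self.id
    }

    /// Re-creates the payload slice on demand; this is a bounds check, not a copy.
    fn data(&self) -> &[u8] {
        &self.raw[self.payload.clone()]
    }

    /// The already-serialized form, available without re-encoding.
    fn serialized(&self) -> &[u8] {
        &self.raw
    }
}
```

Only the small metadata is copied out at parse time; the 1MB payload stays in the buffer, and cloning the block clones the `Arc`, not the bytes.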

Another approach would be to leak the memory, but keep a pointer to it and construct a static reference from that. Then create inner, using more static refs. This is basically pretending that your allocated and leaked memory is static memory.
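
For the leak-and-never-reclaim variant, a small sketch using `Box::leak` from std. `StaticBlock` and `leak_block` are hypothetical names, and the header layout (8-byte little-endian id, then payload) is an assumption for illustration. Because the memory is never freed here, handing out copies of the `&'static` slice is sound:

```rust
/// Illustrative only: the buffer is leaked on purpose and never reclaimed,
/// which is the only reason deriving Copy is acceptable here.
#[derive(Clone, Copy, Debug)]
struct StaticBlock {
    id: u64,
    data: &'static [u8],
}

/// Assumed layout: 8-byte little-endian id followed by the payload.
fn leak_block(raw: Vec<u8>) -> Option<StaticBlock> {
    // Validate before leaking so a malformed buffer is dropped normally.
    if raw.len() < 8 {
        return None;
    }
    // Box::leak gives up ownership; the allocation lives until process exit.
    let raw: &'static [u8] = Box::leak(raw.into_boxed_slice());
    let (header, data) = raw.split_at(8);
    let mut id = [0u8; 8];
    id.copy_from_slice(header);
    Some(StaticBlock { id: u64::from_le_bytes(id), data })
}
```

This only makes sense when the number of blocks is bounded for the life of the process; every call leaks its buffer permanently.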

However, if you ever want to reclaim that memory, you're going to have to write a custom impl Drop, be extremely careful to get things correct, not be Copy, no trivial Clone, probably other things I'm not thinking of, and even then this will lead to undefined behavior and corruption unless your data structures are private and you're very careful in your own crate. Because &'static can be copied freely, and if any code anywhere keeps a copy pointing to your memory around after the drop, it will be dangling. Much safer to leave it leaked if possible.

I've also only really thought through the read-only case; if you want to serialize as well presumably you want to modify things first. The mutable case would at a minimum have other requirements like throwing out all &'static mut before reconstructing the original to avoid aliasing violations... similar to the Drop considerations, but perhaps even more touchy. Well, definitely more touchy, after a minute of thought -- you would have to store raw_data as a pointer and not a reference, and either construct inner by chopping up a &'static mut [u8] (i.e. not with serde, probably) or by recursively and unsafely converting &'static to &'static mut or something, being careful to never create two mutable references to the same piece of memory at the same time.

So, as you can tell by now, it's possible but very messy and risky. These things being very hard to get right is a big part of why Rust exists, and even Rust makes mistakes in its internals sometimes. You'll have to give up on a lot of Rust guarantees and get it right yourself. Or perhaps a crate will help, but you'll want to find a well-tested and trusted one, since you'll be relying on them getting it right. (ouroboros has fewer than 700 downloads from crates.io, for example.)

Who, this guy? :smile:
I think you're right -- avoid premature optimization -- don't go down this rabbit hole until you've measured things and determined the copies/clones are actually a problem. Especially if you can't compromise on having a single data structure. It's a deep and painful rabbit hole to implement.

But at least it's fun to discuss and hopefully illuminating!

Oh, here's another possibility I couldn't think of in the fog of late night sleepiness: double-boxing!

#[derive(Serialize, Deserialize)]
struct Block {
    id: u64,
    data: Arc<Vec<u8>>,
}

I haven't checked to make sure that derive(Deserialize) does the right thing here, but I'm about 65% sure it does not make unnecessary copies (when you read from a file as with bincode::deserialize_from). If I'm wrong, or if you don't want to enable the "rc" feature in serde, it would be pretty simple to write the custom serde logic. It owns its data and is cheap to clone.

The tradeoff is that it's a pointer-to-a-pointer, so while this is faster than a plain Arc<[u8]> when you do a lot of clones, it's slower if you do a lot of accesses. I thought there was a crate that provides an Arc-like pointer with separate allocation for the refcounts, which could be converted from Vec without copying, but I can't seem to find it now. That might be a good solution.
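
To make the copy behaviour concrete, here is a small std-only sketch (the helper names `wrap_without_copy` and `wrap_with_copy` are made up). Wrapping a Vec with `Arc::new` moves only the Vec's pointer/length/capacity into the new allocation, so the byte buffer stays where it was. Converting to `Arc<[u8]>` has to place the bytes next to the reference counts, which means a fresh allocation and a copy:

```rust
use std::sync::Arc;

/// Returns the buffer address before wrapping, plus the double-boxed result.
fn wrap_without_copy(data: Vec<u8>) -> (usize, Arc<Vec<u8>>) {
    let before = data.as_ptr() as usize;
    // Only the Vec header moves into the Arc; the heap buffer is untouched.
    (before, Arc::new(data))
}

/// Returns the buffer address before converting, plus the flat result.
fn wrap_with_copy(data: Vec<u8>) -> (usize, Arc<[u8]>) {
    let before = data.as_ptr() as usize;
    // From<Vec<u8>> for Arc<[u8]> allocates a new block and copies the bytes.
    (before, Arc::from(data))
}
```

Comparing the returned address with the address of the bytes inside each Arc shows the difference: it is unchanged for `Arc<Vec<u8>>` and different for `Arc<[u8]>`.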

Thanks for all the suggestions. I ended up going w/a custom serialize that works best for our use case. It's slightly more cumbersome than simply deriving Serialize and Deserialize on some structs, and then being able to put them together and have it all "just work". However, it saves us a lot of copying.

For deserializing, it's simply a move of the data into an Arc<Vec<u8>>, which requires zero copying or real work. I also have a pointer (yup, a real *mut pointer) to the data in the same struct. So I'd have something like:

#[repr(packed)]
struct BlockMetadata {
  id: u64,
  version: u64
}

struct Block {
  metadata: *mut BlockMetadata,
  raw_data: Arc<Vec<u8>>
}

When I'm reading the data in from a file/network, I simply move it into the Arc, then construct the pointer to point into that Arc:

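fn deserialize(data: Vec<u8>) -> Self {
  // Move the bytes into the Arc first; the Vec's heap buffer does not move.
  let mut ret = Block {
    metadata: std::ptr::null_mut(),
    raw_data: Arc::new(data)
  };

  // Point the metadata at the start of the buffer now owned by the Arc.
  ret.metadata = ret.raw_data.as_ref().as_ptr() as *mut BlockMetadata;
  ret
}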

Then I can access all the metadata fields via the pointer. I never need to worry about null references, because I always set it via deserialize. When I'm creating a new one, I have to do a memcpy, but that is not the hot-path for this code: deserializing and serializing is. To serialize, I can simply return a reference into raw_data that starts after the size_of::<BlockMetadata>() bytes.
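
For comparison, a sketch of how the same layout could be read without a raw pointer. This is not the code described above; it assumes a little-endian 16-byte header (id, then version), and `HEADER_LEN`, `read_u64`, and `payload` are hypothetical names. One practical difference: a struct holding a `*mut` field is not `Send` or `Sync` unless you add unsafe impls, whereas this version can be sent across threads as-is:

```rust
use std::sync::Arc;

/// Assumed header size: id (u64) followed by version (u64).
const HEADER_LEN: usize = 16;

#[derive(Clone)]
struct Block {
    raw_data: Arc<Vec<u8>>,
}

impl Block {
    /// Moves the buffer into the Arc; rejects buffers too short for the header.
    fn deserialize(data: Vec<u8>) -> Option<Block> {
        if data.len() < HEADER_LEN {
            return None;
        }
        Some(Block { raw_data: Arc::new(data) })
    }

    /// Decodes 8 header bytes on each call instead of dereferencing a pointer.
    /// Offsets come only from this impl, and the length was checked above.
    fn read_u64(&self, offset: usize) -> u64 {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&self.raw_data[offset..offset + 8]);
        u64::from_le_bytes(buf)
    }

    fn id(&self) -> u64 {
        self.read_u64(0)
    }

    fn version(&self) -> u64 {
        self.read_u64(8)
    }

    /// Everything after the header, borrowed straight from the shared buffer.
    fn payload(&self) -> &[u8] {
        &self.raw_data[HEADER_LEN..]
    }
}
```

Each accessor costs a bounds check and an 8-byte copy, which is likely negligible next to the 1MB payload, but that is worth measuring rather than assuming.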