Saving variables interwoven by reference in one structure

I'm working with piz crate and I want to put mmap, archive and tree to one structure. These three variables are connected together with references. This is one example how I tried to achieve that, but it unfortunately doesn't work:

use piz::read::*;
use piz::read::ZipArchive;
use memmap::Mmap;
use std::fs::File;

pub struct ZipCursor<'a>{
    pub path_in_archive:String,
    pub tree: Box<DirectoryContents<'a>>,
    mmap: Box<memmap::Mmap>,
    archive: Box<ZipArchive<'a>>,
    file: Box<File>
}

impl<'a> ZipCursor<'a>{
    pub fn open_zip(file: Box<File>) -> Option<Self> {
        let mmap = Box::new(unsafe {Mmap::map(&file).unwrap()});
        let archive = Box::new(ZipArchive::new(&mmap).unwrap());
        let tree = Box::new(as_tree(archive.entries()).unwrap());
        return Some(ZipCursor{
            path_in_archive: String::from(""),
            mmap:mmap,
            archive: archive,
            tree: tree,
            file: file
        })
    }
}

As far as I can see, although the archive is saved on heap, it's local when I'm creating the reference. Is there any solution to this problem?
Thanks in advance.

As far as I can tell, there's no good way of creating self-referential structures (where one field is a reference to another).

There are some workarounds, like:

I'm definitely interested in a good (or even standard?) way of achieving these. There were a couple of cases where my code would benefit from it too. E.g. having a String and a couple of substrings &str in the same structure. The easy workaround was to store ranges/indices.

There isn't one today. It comes up quite frequently, so you can find many other discussions in this forum or on internals (search for self referential), but here are some simple examples of things that must be handled:

struct StringParts {
    data: String,
    // Note: lifetimes are a static analysis of code and are erased during
    // compilation; they are not present during runtime and they are not
    // a property that data structures dynamically "carry around" with them.
    // But for the sake of illustration...
    prefix: &'self.data str,
    suffix: &'self.data str,
}

// Assume everything else works like it does in today's Rust

fn dangle_via_reallocation(sp: &mut StringParts) {
    let size_that_must_reallocate = 3 * (sp.data.capacity() + 1);
    sp.data.reserve(size_that_must_reallocate);
    // `prefix` and `suffix` are now invalid
}

fn references_implement_copy(sp: &StringParts) -> StringParts {
    StringParts {
        data: String::new(),
        prefix: sp.prefix,
        suffix: sp.suffix,
    }
}

fn dangle_after_reference_copy(sp: StringParts) {
    let other = references_implement_copy(&sp);
    drop(sp);
    // Oops, other.prefix and other.suffix now dangle
}

// Note: no allocation in this example, all the data lives on the stack.
// Also no need to implement `Copy` -- moves are not the same thing.
struct ArrayParts {
    data: [u8; 16],
    prefix: &'self.data [u8],
    suffix: &'self.data [u8],
}

fn take_by_value(ap: ArrayParts) {
    // When you call this function, the `ArrayParts` you pass in may get
    // copied to a new place on the stack (moved).  In which case,
    // `ap.prefix` and `ap.suffix` are dangling before you do anything.
}

In the Rust of today, you have to use unsafe and work very hard to avoid such dangles, not leak references (which implement Copy), uphold the "no aliases during mutation" contract, and so on. That's what the crates you mentioned aim to do.

So if it makes it into the language at some point, it will be a rather large change. If they're implemented using lifetimes, they won't really be the same thing that lifetimes are today (e.g. they'll have to penetrate the function declaration API, or there'll be a new syntax, etc.). And the references will probably be a new thing too (e.g. somehow heap-or-stack-aware, with the latter being implemented as a relative offset), though maybe Pin could be enough.

As far as I know there's still an interest in the teams, but I imagine it's quite a long way off. If you feel you really need it (versus e.g. using indices, ranges, interior mutability, Rc, etc), I suggest looking for a crate you feel highly confident in.

2 Likes

Thanks, I appreciate the explanation. I know that my example can't work in general, all that's needed is the String to grow beyond the capacity to make the string slices invalid although I'd expect that since there's a shared reference to that string, nobody would ever mutate it.

I think this is a big problem for me -- I don't have enough knowledge/experience to evaluate crates that try to make seemingly unsound concepts sound. My only option would be going by trust, e.g. if it's used by a large project/projects and actively used and tested I'd probably use it without understanding it. But if it's someone's hobby project, no matter how clever or perfect is the solution, I won't be able to recognise that and make the call.

I'm really interested in how to approach self-referential structures -- there are many zero-copy crates (usually parsers) and if I want to keep the owned buffer as well as parsed objects, self-referential struct seems like an obvious choice.

Honestly, I would just recommend that you avoid self-referential structs entirely. It's very very hard to correctly write one, and I find pretty much all of the self-referential struct crates I have looked at sketchy.

What approach would you recommend in cases when a parser takes a reference to owned data and I want to keep the parser around (because it's expensive to re-parse)? And hide this as an implementation detail in a struct so that I can swap parser impls without changing my public interface.

Generally I recommend splitting the data into two structs, such that one can borrow the other.

There is a common Rustic pattern you can use for the immutable case: Have the owned data somewhere and have a separate struct for the immutable view into it. I.e. a referential struct which is not self-referential.

struct BufferParts<'a> {
    the_whole_string_but_as_a_str: &'a str,
    prefix: &'a str,
    suffix: &'a str,
}

Assuming "I don't want to mutate the buffer at this point anyway" is accurate, the main challenge or ergonomic hit is in keeping the separately owned String around somewhere for the lifetime of all the BufferParts. One approach is to create them at the shallowest point in the stack that you need them. Another could be to carry around a String and Range like you mentioned, and dynamically create the referential structs as needed. Playground example.

That's a challenge with the entire crates.io ecosystem, really. Sometimes I recognize the maintainer or know of the crate because it's so widely used, but for the rest it has to be small enough and dependency light enough to review (rare), or you have to just make a judgement call on trust. There are some web-of-trust cargo tools available but I don't think they have much traction yet.

But for something as tricky as self-referential structs in particular, I'd agree that any hobby type crate is almost definitely unsound. I haven't had to make a call on one, so I don't know if there are any I'd consider non-hobby or otherwise trustworthy, either. I've seen a few crate names which tend to pop up in discussions, but when I skimmed them, I see a very low download count and immediately conclude it's not battle tested enough for something so tricky. When the urge has come up for a self-referential struct in my own projects, I've just found another approach instead.

1 Like

Well, I was googling and going through rust docs for some time, and maybe I have found a solution. The point is to create "memory leak" and immediately after that to use Box::into_pin(). This function is not stable yet, but it should ensure that data inside the box aren't moved after the memory leak.

se piz::read::*;
use piz::read::ZipArchive;
use memmap::Mmap;
use std::fs::File;
use std::mem::transmute;
use std::pin::Pin;

pub struct ZipCursor<'a>{
    pub path_in_archive:String,
    pub tree: DirectoryContents<'a>,
    file: Pin<Box<File>>,
    mmap: Pin<Box<Mmap>>,
    archive: Pin<Box<ZipArchive<'a>>>
}

impl<'a> ZipCursor<'a>{
    pub fn open_zip(file_boxed: Box<File>) -> Option<Self> {
        let (file_pin,mmap_pin,archive_pin,tree) = unsafe{
            let file_ptr: &'static File  = transmute(&*file_boxed);
            let file_pin = Box::into_pin(file_boxed);
            let mmap_boxed = Box::new(Mmap::map(file_ptr).unwrap());
            let mmap_ptr: &'static Mmap = transmute(&*mmap_boxed);
            let mmap_pin = Box::into_pin(mmap_boxed);
            let archive_boxed = Box::new(ZipArchive::new(mmap_ptr).unwrap());
            let archive_ptr = transmute(&*archive_boxed);
            let archive_pin = Box::into_pin(archive_boxed);
            let tree = as_tree(archive_ptr.entries()).unwrap();
            (file_pin,mmap_pin,archive_pin,tree)
        };
        Some(Self{
            path_in_archive: String::from(""),
            tree: tree,
            file: file_pin,
            mmap: mmap_pin,
            archive: archive_pin
        })
    }

    // do some stuff with tree

}

I'm not absolutely sure whether it's safe, but in this case (and in the case of string slices) it might be. If it's I think that it's still necessary to add custom implementation of Drop trait to ensure that tree, archive, mmap and file is dropped in the right order.

This kind of workaround is definitely possible. However, it's quite inconvenient in this case because I want to add the majority of error handling into the ZipCursor::open_zip() and to use everything together with the buffer and opened file (and to drop everything safely at a time).

Do you think that I can use this approach without risk (at least after the Box::into_pin() is made stable)?
Any hint about implementing the Drop trait is welcome.

All of those types (File, Mmap and ZipArchive) implement Unpin, so pinning them does nothing (and for the same reason, you can pin them with Pin::new(file_boxed).)

Pinning does not help in your case.

As for soundness, a Box asserts exclusive ownership of the items inside it, so having a pointer to the insides of the box, then moving the Box (by returning it from open_zip) invalidates the pointer, meaning that any further use of that pointer is insta-UB. This kind of self-referential struct would necessarily involve storing the Box as a raw pointer to not make the uniqueness assertion.

This is quite unfortunate. Sorry, I'm beginner, so I thought that Unpin trait is implemented only for really simple structures and primitive types. What's the point to implement Unpin for more complicated structures? Concerns about speed of application?
What about this scenario:

struct ZipCursor<'a>{
    file: File,
    mmap: Option<memmap::Mmap>,
    archive: Option<ZipArchive<'a>>,
    tree: Option<DirectoryContents<'a>>,
}
impl<'a> ZipCursor<'a>{
    pub fn open_zip(file: File) -> Option<Pin<Box<Self>>>{ 
        // now this instance of ZipCursor cant move on heap (since the ZipCursor doesnt implement Unpin)
        let retval = Pin::new(Box::new(Self{
            file: file,
            mmap: None,
            archive: None,
            tree: None
        }));
        unsafe{
            let ptr; // get raw pointer to the instance on heap, sadly I don't know how
            ptr.mmap = Some(Mmap::map(&ptr.file).unwrapp());
            ptr.archive = match &ptr.mmap{
                Some(m) => Some(ZipArchive::new(m).unwrap()),
                None => return None
            };
            ptr.tree = match &ptr.archive{
                Some(a) => Some(as_tree(a).unwrap()),
                None => return None
            };
        }
        return Some(retval);
        // the instance of structure will still be on its position
    }
}

The File, Mmap and ZipArchive types are not self-referential, so it does not make sense to pin them. If a Pin should go anywhere, it should be around your struct, but since you are pointing into memory owned by the struct, but not actually directly into the struct (i.e. moving the ZipCursor does not move the mmapped region), you do not gain anything from Pin whatsoever.

Actually, it does implement Unpin because all of its fields do. To opt-out, you need to add a field of type PhantomPinned.


If you want something that is at least close to a correct solution, it would look something like this:

// Order of fields is important to ensure that references are dropped before owner.
struct ZipCursor {
    file: File,
    archive: ZipArchive<'static>,
    tree: DirectoryContents<'static>,
    mmap: memmap::Mmap,
}
impl ZipCursor {
    pub fn open_zip(file: File) -> Self { 
        let mmap = Mmap::map(&file).unwrap();
        
        let (archive, tree) = create_archive(&mmap);
        let (archive, tree) = unsafe { transmute_lifetimes(archive, tree) };
        
        Self {
            file,
            archive,
            tree,
            mmap,
        }
    }
}

unsafe fn transmute_lifetimes(zip: ZipArchive<'a>, tree: DirectoryContents<'a>) -> (ZipArchive<'static>, DirectoryContents<'static>) {
    let new_zip = std::mem::transmute::<ZipArchive<'a>, ZipArchive<'static>>(zip);
    let new_tree = std::mem::transmute::<DirectoryContents<'a>, DirectoryContents<'static>>(tree);
    (new_zip, new_tree)
}

fn create_archive<'a>(data: &'a [u8]) -> (ZipArchive<'a>, DirectoryContents<'a>) {
    todo!()
}

Of course, you must be careful to never give out a static reference to the memmapped region. That would allow a use-after-free.

2 Likes

I'm not sure I understand the details of this. If you use the same ZipCursor cursor as you used in the "fake static" lifetime solution, how would one go about splitting data into two structs?

You would need one struct to hold the mmap, and another to hold the archive and tree fields. The latter would borrow from the former.

I was thinking whether instead of storing both owned and ref in a struct I could capture them in a closure. Naively something like

let owned = String::new();
let parsed_ref = expensive_parser(&owned);

// doesn't work
let f = move || {
    println!("{}, {}", owned, parsed_ref);
};

f();

But it has the same problem as moving owned and ref into a struct.

I think it should work with generators when owned is moved into the generator which then executes the parser and then in a loop keeps yielding the ref, tuple, BufferParts or whatever is needed.

That made me also think about how IntoIterator works, since that also takes ownership but then I would still have to store the parsed reference somewhere.

I guess faking 'static lifetime is the least worst option if I want to have a struct that owns both owned and ref without the caller having to store anything.

To understand why closures have the same problem, this may be illuminating. (In short, state captured by a closure is stored in a compiler-generated struct, so you end up with the same issue.)

2 Likes