Add cow to an optional foreign pointer [Soundness]

Hi,

In arrow2 we have a struct that stores a pointer whose region was allocated by either of two different ways:

  • via a foreign allocator (e.g. coming from C++)
  • via Vec

This pointer is usually wrapped in an Arc for the region to be sharable and allow for multi-threaded executions.

We are trying to add cow semantics to this struct, like Arc::get_mut, to allow users to not have to allocate if they need to perform small changes to the underlying data (and they can).

One important consideration is that because these pointers are used in all kinds of analytics, including random access, we need to have a fast Deref implementation.

I have been unable to find an existing idiom for such a struct, and before rolling our own, I would like to evaluate our options here.

To summarize, we are looking for something that (everything private) semantically equivalent to

struct A {}

enum Data<T: bytesmuk::Pod> {
    Native(Arc<Vec<T>>),
    Foreign{
         // this points to a memory region owned by `A`; 
         // FFI contract states that this region is read-only and fulfills
         // memory alignments for T. This ptr is not read after dealloc is dropped
         ptr: NonNull<T>,  
         len: usize,
         dealloc: Arc<A>
    },
}

that

  1. implements a Deref for &[T] as cheap as Vec's implementation
  2. we can implement the analogous of Arc::get_mut that returns None iff:
    • there are other strong refs, like Arc's implementation
    • the pointer is foreign allocated (we need to copy if we want to mutate it to fulfill FFI contract)

before requirement 2 was made, the approach we have been using so far is:

pub enum Deallocation {
    /// Native deallocation, using Rust deallocator with Arrow-specific memory alignment
    Native(usize), // capacity
    // Foreign interface, via a callback
    Foreign(Arc<A>),
}

pub struct Bytes<T: Pod> {
    /// The raw pointer to be beginning of the region
    ptr: NonNull<T>,

    /// The number of bytes visible to this region. This is always smaller than its capacity (when avaliable).
    len: usize,

    /// how to deallocate this region
    deallocation: Deallocation,
}

and store Arc<Bytes<T>> for shared access.

The pointer above will be either Rust-allocated or foreign-allocated depending on deallocation. This allows for fast Deref (via a single indirection, the Arc in Arc<Bytes<T>>).

The problem with the above is that we can't fulfill requirement 2 and write struct B(Arc<Bytes<T>>)::get_mut -> &mut Vec<T> because if the consumer reallocates the Vec, our pointer gets invalidated.

The current proposal (see link above) is to pass the foreign pointer to Vec::from_raw_parts and never allow it to be dropped when it is a foreign pointer. This allows us continue to have a single indirection (via impl Deref for Vec<T>), and at the same time allow for a sound implementation of make_mut.

However, this explicitly violates at least one of the assumptions of Vec::from_raw_parts, namely that the pointer was allocated by Rust own allocator. I do not think that this is an issue if we do not dealloc/realloc it, but you hard to be sure here.

Do you really need an &mut Vec<T> or is &mut [T] enough?

1 Like

Vec<T> is necessary for append-like operations. E.g. a user wants to add a new element to the column without allocating a new column.

This is a key aspect: you can't have a &mut Vec<T> if the allocation is not Rust-based, since things like replace, or .push()-triggered reallocations will let users Rust-deallocate the backing buffer.

Or, in other words, make_mut() should use a Rust Vec the moment get_mut() would yield None, and this includes the foreign alloc-case.

3 Likes

Would an API like this be acceptable:

fn with_mut<F, U>(&mut self, func: F) -> U
where
    F: FnOnce(&mut Vec<T>) -> U,
1 Like

This is catched because we never share the Vec<T> if its foreign allocated. The return type is Option<&mut Vec<T> which return None if not allocated by Vec<T>.

So a foreign allocated memory slab will only be put into Vec<T> and then get_ptr will be called.
On drop it will be leaked.

Ah okay, there were mentions of get_mut() and make_mut() a bit intermingled, hence why I wanted to have that point clarified :+1:
I'm now gonna assume that it's get_mut() and with None for ffi-alloc ones.

In that case,

Then I recommend that you use one of the many "derefs to Vec<T> but is technically not a Vec<T>" wrappers available out there, such as this one.

Mainly, ::safer-ffi's repr_c::Vec<T> has, as its name indicates, a well-defined C layout (ptr, len, cap triplet).

In order to mutate it, it features a with_rust_mut() scoped API similar to @alice's suggestion.


Addendum

FWIW, I see that there is a legitimate thing to want, here, which would be for the requirements for Vec::from_raw_parts to be loosened a bit, to only require that the given ptr be Rust-allocator-originated as a safety invariant of using Vec's &mut-based APIs other than its DerefMut to &mut [T]. In that case, a ManuallyDrop<Vec<T>> implementation would fit your use case quite nicely.

  • But they currently aren't that loose, so you can't do it right now; it's technically violating the safety invariants, library-wise, of the stdlib, which could trigger UB in any future update of the stdlib (unless the rules were loosened, since that would tie the hands of the stdlib in that regard).
1 Like

In case when we do DerefMut it always is a Vec<T>.

So the memory slab from a foreign interface is only constructed, leaked and never shared.

how about a union + enum?

#[repr(C)]
struct ForeignData<T> {
   ptr: *mut T,
   len: usize,
   dealloc: Arc<Dealloc>,
}

union BytesData<T> {
  vec: Vec<T>,
  foreign: ForeignData<T>,
}
enum BytesKind {
  Vec,
  Foreign
}
#[repr(C)]
struct Bytes<T> {
  data: BytesData<T>,
  kind: BytesKind
}

As long as you verify that it is indeed a Vec, you can then freely hand out a &mut reference to it.

Unfortunately Vec does not have a fixed layout, so the kind needs to be checked in order to dereference it.

1 Like

After reading the requirements and thinking about the layout again, iff you need a true &mut Vec<T> out of it, then the kind needs to be checked in order to dereference it. There is no way around it. That also means that there is no disadvantage in using a normal enum as you initially proposed.

If a branch-free and no-op deref is desired, you will have to write or get a Vec that has a #[repr(C)] layout in order to make sure both ptr and len fields line up.

1 Like

Note that T: bytesmuk::Pod and the FFI contract guarantees proper alignment. The main issue is whether it is fine to move a foreign allocated pointer to inside Vec<T> (where the pointer is aligned and etc. for T)

1 Like

I don't think that would be UB-free outside a MaybeUninit to make sure it never drops and never handing out a mutable reference.

Constructing a fake Vec is indeed a good idea.

struct Bytes<T> {
   hack: MaybeUninit<Vec<T>>,
   kind: DataKind,
}

That way the deref is still fast, you can hand out a &mut Vec<T> if it is of the correct kind and construct the fake Vec via from_raw_parts.

2 Likes

This should work:

use std::mem::ManuallyDrop;
use std::sync::Arc;
use std::ops::Deref;
use std::ptr::NonNull;

struct FancyVec<T> {
    ptr: NonNull<T>,
    len: usize,
    extra: Kind<T>,
}

enum Kind<T> {
    VecArc(*const Vec<T>),
    Foreign(unsafe fn(*mut T)),
}

impl<T> Deref for FancyVec<T> {
    type Target = [T];
    fn deref(&self) -> &[T] {
        unsafe {
            std::slice::from_raw_parts(self.ptr.as_ptr(), self.len)
        }
    }
}

impl<T> FancyVec<T> {
    fn with_mut<F, U>(&mut self, func: F) -> U
    where
        F: FnOnce(Option<&mut Vec<T>>) -> U,
    {
        match &mut self.extra {
            Kind::VecArc(ptr) => unsafe {
                let mut arc = ManuallyDrop::new(Arc::from_raw(*ptr));
                match Arc::get_mut(&mut arc) {
                    Some(vec) => {
                        let drop_bomb = DropBomb {
                            vec,
                            ptr: &mut self.ptr,
                            len: &mut self.len,
                        };
                        
                        func(Some(drop_bomb.vec))
                    }
                    None => func(None),
                }
            }
            Kind::Foreign(dealloc_fn) => unsafe {
                func(None)
            }
        }
    }
}

impl<T> Drop for FancyVec<T> {
    fn drop(&mut self) {
        match self.extra {
            Kind::VecArc(ptr) => unsafe {
                drop(Arc::from_raw(ptr));
            }
            Kind::Foreign(dealloc_fn) => unsafe {
                dealloc_fn(self.ptr.as_ptr());
            }
        }
    }
}

struct DropBomb<'a, T> {
    vec: &'a mut Vec<T>,
    ptr: &'a mut NonNull<T>,
    len: &'a mut usize,
}
impl<'a, T> Drop for DropBomb<'a, T> {
    fn drop(&mut self) {
        *self.len = self.vec.len();
        *self.ptr = unsafe { NonNull::new_unchecked(self.vec.as_mut_ptr()) };
    }
}

playground

3 Likes

Since writing the drop guards for unwind-safe code can be tricky / easy to get wrong, and more importantly, defined outside of the function where this happens[1] (whenever I see this I feel like being in C where the closures had to be defined outside of the function body where they would be used), I recommend using a helper crate such as ::scopeguard or

  impl<T> FancyVec<T> {
      fn with_mut<F, U>(&mut self, func: F) -> U
      where
          F: FnOnce(Option<&mut Vec<T>>) -> U,
      {
          match &mut self.extra {
              Kind::VecArc(ptr) => unsafe {
                  let mut arc = ManuallyDrop::new(Arc::from_raw(*ptr));
                  match Arc::get_mut(&mut arc) {
                      Some(vec) => {
-                         struct DropBomb<'a, T> {
-                             vec: &'a mut Vec<T>,
-                             ptr: &'a mut NonNull<T>,
-                             len: &'a mut usize,
-                         }
-
-                         impl<'a, T> Drop for DropBomb<'a, T> {
-                             fn drop(&mut self) {
-                                 *self.len = self.vec.len();
-                                 *self.ptr = unsafe { NonNull::new_unchecked(self.vec.as_mut_ptr()) };
-                             }
-                         }
-
-                         let drop_bomb = DropBomb {
-                             vec,
-                             ptr: &mut self.ptr,
-                             len: &mut self.len,
-                         };
-                         
-                         func(Some(drop_bomb.vec))
+                         ::unwind_safe::with_state(vec)
+                             .try_eval(|vec| func(Some(vec)))
+                             .finally(|vec| {
+                                 self.len = vec.len();
+                                 // No need for unsafe: https://rust.godbolt.org/z/Y63fTrdh3
+                                 self.ptr = ptr::NonNull::new(vec.as_mut_ptr()).unwrap();
+                             })
                      }
                      None => func(None),
                  }
              }
              Kind::Foreign(dealloc_fn) => unsafe {
                  func(None)
              }
          }
      }
  }

I also wonder about getting rid of the self-referential unsafety using ::ouroboros :thinking:


  1. Note that this can be palliated my moving the DropBomb definition inside the body of with_mut, for better locality

      use std::mem::ManuallyDrop;
      use std::sync::Arc;
      use std::ops::Deref;
      use std::ptr::NonNull;
      
      struct FancyVec<T> {
          ptr: NonNull<T>,
          len: usize,
          extra: Kind<T>,
      }
      
      enum Kind<T> {
          VecArc(*const Vec<T>),
          Foreign(unsafe fn(*mut T)),
      }
      
      impl<T> Deref for FancyVec<T> {
          type Target = [T];
          fn deref(&self) -> &[T] {
              unsafe {
                  std::slice::from_raw_parts(self.ptr.as_ptr(), self.len)
              }
          }
      }
      
      impl<T> FancyVec<T> {
          fn with_mut<F, U>(&mut self, func: F) -> U
          where
              F: FnOnce(Option<&mut Vec<T>>) -> U,
          {
              match &mut self.extra {
                  Kind::VecArc(ptr) => unsafe {
                      let mut arc = ManuallyDrop::new(Arc::from_raw(*ptr));
                      match Arc::get_mut(&mut arc) {
                          Some(vec) => {
    +                         struct DropBomb<'a, T> {
    +                             vec: &'a mut Vec<T>,
    +                             ptr: &'a mut NonNull<T>,
    +                             len: &'a mut usize,
    +                         }
    +
    +                         impl<'a, T> Drop for DropBomb<'a, T> {
    +                             fn drop(&mut self) {
    +                                 *self.len = self.vec.len();
    +                                 *self.ptr = unsafe { NonNull::new_unchecked(self.vec.as_mut_ptr()) };
    +                             }
    +                         }
    
                              let drop_bomb = DropBomb {
                                  vec,
                                  ptr: &mut self.ptr,
                                  len: &mut self.len,
                              };
                              
                              func(Some(drop_bomb.vec))
                          }
                          None => func(None),
                      }
                  }
                  Kind::Foreign(dealloc_fn) => unsafe {
                      func(None)
                  }
              }
          }
      }
      
      impl<T> Drop for FancyVec<T> {
          fn drop(&mut self) {
              match self.extra {
                  Kind::VecArc(ptr) => unsafe {
                      drop(Arc::from_raw(ptr));
                  }
                  Kind::Foreign(dealloc_fn) => unsafe {
                      dealloc_fn(self.ptr.as_ptr());
                  }
              }
          }
      }
      
    - struct DropBomb<'a, T> {
    -     vec: &'a mut Vec<T>,
    -     ptr: &'a mut NonNull<T>,
    -     len: &'a mut usize,
    - }
    - impl<'a, T> Drop for DropBomb<'a, T> {
    -     fn drop(&mut self) {
    -         *self.len = self.vec.len();
    -         *self.ptr = unsafe { NonNull::new_unchecked(self.vec.as_mut_ptr()) };
    -     }
    - }
    
    ↩ī¸Ž

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.