FFI: Sound way to turn `std::vector<uint8_t>` into `Vec<u8>`

So, I've been working on Rust bindings to a C++ library using the cxx crate, and I part of what I need to do involves getting a Vec<u8> where the C++ API provides a std::vector<uint8_t>. Originally, (with no FFI or C++ experience), I did stuff like this (in several places):

rust::Vec<uint8_t> AampToBinary(const RsParameterIO &pio)
{
  const auto oead_pio = PioFromFfi(pio);
  const auto data = oead_pio.ToBinary();
  rust::Vec<uint8_t> vec;
  std::move(data.begin(), data.end(), std::back_inserter(vec));
  return vec;
}

But I realized recently that this was slowing me down (by, like, 80%), as it involved basically duplicating all of the data every time. So I tried returning the C++ vector and turning it into a Rust vec like so:

pub(crate) fn cvec_to_vec(mut cvec: cxx::UniquePtr<cxx::CxxVector<u8>>) -> Vec<u8> {
    if cvec.is_null() {
        panic!("Should never ever happen");
    }
    let vec =
        unsafe { Vec::from_raw_parts(cvec.pin_mut().as_mut_slice().as_mut_ptr(), cvec.len(), cvec.len()) };
    std::mem::forget(cvec); // Attempt to prevent calling the C++ destructor
    vec
}

This is still somewhat new territory for me, but it seemed sound. What is really weird is that it works flawlessly on Linux (across ~200 tests), but on Windows it almost always crashes with STATUS_HEAP_CORRUPTION or STATUS_ACCESS_VIOLATION, or if it doesn't crash the tests fail from junk data. Obviously, then, this method is not sound on Windows, but I don't understand the platform difference here, and I cannot seem to find any alternative that does work.

Also, for clarity, the crash only happens when the resulting vec is dropped, e.g.:

    #[test]
    fn aamp_binary_roundtrip() {
        for file in glob::glob("include/oead/test/aamp/files/**/*.b*")
            .unwrap()
            .take(300)
            .filter_map(|f| f.ok())
        {
            let data = std::fs::read(&file).unwrap();
            let pio = ParameterIO::from_binary(&data).unwrap();
            let data2 = pio.to_binary();
            let pio2 = ParameterIO::from_binary(&data2).unwrap();
            assert_eq!(pio, pio2);
        } // Crash always occurs here
    }

Any suggestions?

The safety documentation of Vec::from_raw_parts says:

ptr needs to have been previously allocated via String/Vec<T> (at least, it’s highly likely to be incorrect if it wasn’t).

And in general, it's not safe to mix allocators across different libraries. It may work fine in Linux if they all lower down to your same libc malloc/free, but Rust also allows custom global allocators which then wouldn't match what your C++ library used. Windows is even less flexible about this.

3 Likes

That makes good sense, but in that case I'm at a loss as to whether there is a performant alternative to duplicating all the data again on the Rust side. This is a library crate, and I really don't want to force users to accept something other than a Vec<u8> in the end.

If you need precisely a Vec<u8>, no.

Eventually it'll probably be possible to have it be a Vec<u8, CppArrayNewDelete>, with a custom allocator so that it's deallocated properly. But since that's a different type, it might not be much better than just passing CxxVector<u8>s.

Can you find a way to use slices instead, one the Rust side? What specifically needs a Vec<u8>?

1 Like

The usual solution would be to let the user pass in their own buffer as a mutable slice, then have the C++ code fill it with data directly:

fn fill_with_data(slice: &mut [u8]) {
    unsafe {
        cxx_fill_with_data(slice.as_mut_ptr(), slice.len());
    }
}

Alternatively, if the length isn't known ahead of time, the user could pass in a &mut Vec<u8>:

fn fill_with_data(vec: &mut Vec<u8>) {
    vec.clear();
    unsafe {
        let len: usize = cxx_calculate_len();
        vec.reserve(len);
        cxx_fill_with_data(vec.as_mut_ptr(), len);
        vec.set_len(len);
    }
}
2 Likes

That's kind of the thing: there's not one single use case, though the most common cases would be either writing to disk or storing the resulting data somewhere else (e.g. a map representing some kind of archive). I suppose hypothetically it might be possible to handle all the major cases without a Vec<u8>, but it might be somewhat convoluted (esp. since I've myself already written a bunch of code with the old, slower version that would likely need to be updated).

One thought: if I have to, might there perchance be a faster way to copy the data than what I was using:

  std::move(data.begin(), data.end(), std::back_inserter(vec));

This makes some sense, but presumably if the library I'm wrapping already provides a std::vector then doing this would still, I suppose, require copying the whole thing from one to the other.

Can you get by with a newtype that hides the C++ reality?

struct PioData(cxx::UniquePtr<cxx::CxxVector<u8>>);
// impl Deref with Target=[u8], and maybe DerefMut too...

The user can copy that to a Vec<u8> if needed, otherwise just use it as a slice of data.

4 Likes

I would bet your C++ library specializes that to reserve+memcpy, so it's probably fine.

Of course, if you just give out a &[u8] on the rust side, the .to_owned() to turn that into a Vec<u8> would also be very fast.

1 Like

I might just have to do that. It seems like the cleanest option to save on performance. Can't be that bad as an API update, too, since I think the main use really is going to be writing to disk, or sometimes to an archive.

(Interestingly, this same library has Python bindings, too, and those used basically the same solution with a new Bytes type.)

So, yeah, I think I'll just mark this as the solution. Thanks!

The only problem with giving out a slice is that I have nowhere to store the CxxVector past that point, so I s'pose the newtype solution would be more useable. But I thank you as well for your aid.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.