[Solved] Mmap Vec of repr(C)


#1
  1. I have:
```
#[repr(C)]
pub struct Entry { // entry of sparse matrix
  lhs: u32,
  rhs: u32,
  weight: f32
}
```

2. I have
```
x: Vec<Entry>;
x.len() == 1_000_000
```

3. Is there a _defined behaviour_ way to (1) mmap x out to file and (2) when reading, mmap x from file?

#2

A Vec always owns a heap allocation obtained from the global allocator and frees that memory itself, so there is no sound way to make a Vec point at mmapped memory.

You might create a custom vec-like wrapper and expose memory as a slice, BUT you have to give strong guarantee that no other process will modify the file while it’s mmapped, because that would be UB in Rust.
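A minimal sketch of such a wrapper, assuming the mapped region is available as a plain `&[u8]` (a stand-in here for whatever an mmap crate would hand back). The names `EntryView` and `as_bytes` are hypothetical, and the guarantee that nobody else modifies the file still has to come from outside the program:

```rust
use std::mem::{align_of, size_of};

#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Entry {
    pub lhs: u32,
    pub rhs: u32,
    pub weight: f32,
}

/// View initialised entries as raw bytes, e.g. to write them to a file.
pub fn as_bytes(entries: &[Entry]) -> &[u8] {
    // SAFETY: Entry is repr(C) with three 4-byte fields and no padding, so
    // every byte is initialised; u8 has alignment 1.
    unsafe {
        std::slice::from_raw_parts(
            entries.as_ptr() as *const u8,
            std::mem::size_of_val(entries),
        )
    }
}

/// Read-only, vec-like view over borrowed bytes (hypothetical helper).
pub struct EntryView<'a> {
    bytes: &'a [u8],
}

impl<'a> EntryView<'a> {
    /// Returns None if the buffer is misaligned for Entry or its length is
    /// not a whole number of entries.
    pub fn new(bytes: &'a [u8]) -> Option<Self> {
        let aligned = bytes.as_ptr() as usize % align_of::<Entry>() == 0;
        let whole = bytes.len() % size_of::<Entry>() == 0;
        if aligned && whole {
            Some(EntryView { bytes })
        } else {
            None
        }
    }

    pub fn as_slice(&self) -> &'a [Entry] {
        let n = self.bytes.len() / size_of::<Entry>();
        // SAFETY: `new` checked alignment and length, and every bit pattern
        // is a valid u32 / f32, so any bytes form valid entries.
        unsafe { std::slice::from_raw_parts(self.bytes.as_ptr() as *const Entry, n) }
    }
}
```

The view never owns or frees the memory, which is the part a Vec cannot give up.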


#3

This is what I have so far (not mmapping anymore).

```
use super::*;


#[repr(C)]
pub struct SparseMatEntry {
    lhs: u32,
    rhs: u32,
    val: f32,
}

pub struct SparseMat {
    data: Vec<SparseMatEntry>,

}


impl SparseMat {
    pub fn write_bin(&self, fname: &str) {
        let f = File::create(fname).unwrap();
        let mut bw = BufWriter::new(f);

        let p = self.data.as_ptr();

        let content = unsafe {
            std::slice::from_raw_parts(
            p as *const u8,
            self.data.len() * std::mem::size_of::<SparseMatEntry>())
        };

        bw.write_all(content);
    }

    pub fn read_bin(fname: &str) -> SparseMat {
        let f = File::open(fname).unwrap();
        let mut br = BufReader::new(f);

        let mut content = Vec::<u8>::new();
        br.read_to_end(&mut content).unwrap();

        let n = content.len() / std::mem::size_of::<SparseMatEntry>();

        let data = unsafe {
            Vec::from_raw_parts(
                content.as_mut_ptr() as *mut SparseMatEntry,
                n,
                n)
        };

        // how do we deal with 'content' being dropped ?


        SparseMat { data }
    }
}
```

How do we deal with ‘content’ being dropped ? (causing the underlying storage of the Vec we want to return to be freed)


#4

Vec::from_raw_parts is super unsafe, and will crash your program.

  • To ensure the original vec is not freed, use Vec::into_raw_parts, BUT

  • u8 -> SparseMatEntry conversion is invalid, and in general can’t work. That’s because u8 has alignment 1, and your type has alignment 4.

  • You can’t use your own value for capacity. It has to be the actual capacity read from the original vec. And again, that capacity may not be a multiple of your new vec’s element size, so it won’t be safe.

Instead, make Vec<SparseMatEntry> (not u8). Resize it (either by initializing with zero value, or a dangerous combination of reserve + set_len), take a mutable slice from it, and then cast this slice and fill it.
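The zero-initialising variant of that suggestion can be sketched roughly as follows. `read_entries` is a hypothetical helper name, and the caller is assumed to already know how many entries to expect; the `Default` derive is added only so the vec can be filled with zeros first:

```rust
use std::io::{self, Read};
use std::mem::size_of;

#[repr(C)]
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct SparseMatEntry {
    pub lhs: u32,
    pub rhs: u32,
    pub val: f32,
}

/// Read `num_elems` entries from `reader` straight into a Vec<SparseMatEntry>.
pub fn read_entries<R: Read>(mut reader: R, num_elems: usize) -> io::Result<Vec<SparseMatEntry>> {
    // Zero-initialise first, so the vec holds valid elements at all times and
    // its length and capacity are managed by Vec itself.
    let mut data = vec![SparseMatEntry::default(); num_elems];
    let byte_len = num_elems * size_of::<SparseMatEntry>();

    // SAFETY: `data` owns exactly `num_elems` initialised elements, which is
    // `byte_len` bytes; casting *to* u8 only lowers the alignment requirement;
    // the struct has no padding and every bit pattern is valid for its fields.
    let bytes = unsafe { std::slice::from_raw_parts_mut(data.as_mut_ptr() as *mut u8, byte_len) };

    // On a short or failed read the error is returned and `data` is dropped.
    reader.read_exact(bytes)?;
    Ok(data)
}
```

The cast goes from the stricter alignment (4) to the looser one (1), which is the direction that is always allowed.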


#6
  1. I can’t find Vec::into_raw_parts

  2. Is “write_bin” fine?

  3. I have rewritten “read_bin” as follows:

```
    pub fn read_bin(fname: &str) -> SparseMat {
        let metadata = std::fs::metadata(fname).unwrap();
        let len_in_bytes = metadata.len() as usize;
        let entry_size = std::mem::size_of::<SparseMatEntry>();
        let num_elems = len_in_bytes / entry_size;

        let f = File::open(fname).unwrap();
        let mut br = BufReader::new(f);

        let mut data = Vec::<SparseMatEntry>::with_capacity(num_elems);

        let ptr = data.as_mut_ptr();

        let content = unsafe {
            std::slice::from_raw_parts_mut(
                ptr as *mut u8,
                num_elems * std::mem::size_of::<SparseMatEntry>())
        };

        br.read_exact(content);

        std::mem::forget(data);
        let data = unsafe { Vec::from_raw_parts(ptr, num_elems, num_elems) };


        SparseMat { data }

    }
```

I’m not happy with the forget/from_raw_parts, but I see no other way to tell the Vec “hey, you don’t have 0 elems, you actually have num_elems elems”


#7

I can’t either, but into_boxed_slice should work just fine for that case; it gives a box that can then be Box::leak()ed into a pointer right after having acquired the len.

set_len should work for that


#9

Oops, sorry. It looks like I imagined that method! It should have existed 🙂 In that case you’d need to read the pointer, len, and capacity, and then mem::forget the vec.

You must not forget about the difference between capacity and length! These are two separate sizes of the vec, which can’t be expressed in a form of a slice.

into_boxed_slice is likely not what you want. If you haven’t exactly pre-allocated the vec (so that capacity == len), then it will re-allocate and copy, defeating the whole purpose of the entire unsafe pointer juggling.
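For reference, reading out the pointer, len, and capacity by hand could look roughly like this. `into_raw_parts_manual` is a made-up name, and `ManuallyDrop` is used here in place of `mem::forget`, following the pattern in the `Vec::from_raw_parts` documentation:

```rust
use std::mem::ManuallyDrop;

/// Take a Vec apart into (pointer, length, capacity) without freeing its buffer.
pub fn into_raw_parts_manual<T>(v: Vec<T>) -> (*mut T, usize, usize) {
    // Wrapping first means the destructor can never run, even if something
    // between here and the return were to panic.
    let mut v = ManuallyDrop::new(v);
    (v.as_mut_ptr(), v.len(), v.capacity())
}
```

The three values must later be passed back to `Vec::from_raw_parts` unchanged and with the same element type; in particular, the capacity has to be the one read here, not one computed by the caller.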


#10

Right, but I reserve the capacity via with_capacity, so in this particular case, it is guaranteed that len == capacity at the end of the operations. <- Is this assumption false?


#11

Even reserve_exact doesn’t guarantee it:

Note that the allocator may give the collection more space than it requests. Therefore capacity can not be relied upon to be precisely minimal.
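A small sketch of why this matters less when `set_len` is used: `set_len` only needs the new length to be at most the capacity, with every element below it initialised, so it does not care whether the allocator handed out extra space. The function name `squares` is purely illustrative:

```rust
/// Build a vec of the first `n` squares by writing into spare capacity.
pub fn squares(n: usize) -> Vec<u32> {
    let mut v: Vec<u32> = Vec::with_capacity(n);

    // The spare capacity may be longer than `n`, so only the first `n`
    // uninitialised slots are written.
    for (i, slot) in v.spare_capacity_mut().iter_mut().take(n).enumerate() {
        slot.write((i * i) as u32);
    }

    // SAFETY: n <= capacity, and the first n elements were initialised above.
    unsafe { v.set_len(n) };
    v
}
```

Code that passes its own capacity to `Vec::from_raw_parts`, on the other hand, does depend on the exact value.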


#12

@zeroexcuses, have you considered/looked into using bincode to de/serialize from/to a file? At this point in the thread, it looks like you’re already reading the entire file into memory. Or is the file used/created by other (non-Rust) programs?


#13

@kornel Thanks! Does this look right ?

```
use super::*;

#[repr(C)]
#[derive(Debug)]
pub struct SparseMatEntry {
    pub lhs: u32,
    pub rhs: u32,
    pub val: f32,
}

#[derive(Debug)]
pub struct SparseMat {
    data: Vec<SparseMatEntry>,
}

impl SparseMat {

    pub fn new_from(data: Vec<SparseMatEntry>) -> SparseMat {
        SparseMat { data }
    }

    pub fn write_bin(&self, fname: &str) {
        let f = File::create(fname).unwrap();
        let mut bw = BufWriter::new(f);

        let p = self.data.as_ptr();

        let content = unsafe {
            std::slice::from_raw_parts(
            p as *const u8,
            self.data.len() * std::mem::size_of::<SparseMatEntry>())
        };

        bw.write_all(content);
    }

    pub fn read_bin(fname: &str) -> SparseMat {
        let metadata = std::fs::metadata(fname).unwrap();
        let len_in_bytes = metadata.len() as usize;
        let entry_size = std::mem::size_of::<SparseMatEntry>();
        let num_elems = len_in_bytes / entry_size;

        let f = File::open(fname).unwrap();
        let mut br = BufReader::new(f);

        let mut data = Vec::<SparseMatEntry>::with_capacity(num_elems);

        let ptr = data.as_mut_ptr();

        let content = unsafe {
            std::slice::from_raw_parts_mut(
                ptr as *mut u8,
                num_elems * std::mem::size_of::<SparseMatEntry>())
        };

        br.read_exact(content);

        unsafe { data.set_len(num_elems); }

        SparseMat { data }
    }
}
```

#14

@vitalyd :

  1. You’re right, I am reading the entire file into memory.

  2. I have used serde (frequently), bincode (once or twice), capn proto (recently).

  3. This was never stated in my original question – but after loading the data and doing some processing, I’ll be uploading the data to GPU/CUDA. Due to that, I do need byte level control over how the data is laid out.


#15

Yeah, to me those functions look great! Except one thing, you might want to check for file corruption like so:

```
let metadata = std::fs::metadata(fname).unwrap();
let len_in_bytes = metadata.len() as usize;
let entry_size = std::mem::size_of::<SparseMatEntry>();
let num_elems = len_in_bytes / entry_size;
assert!(num_elems * entry_size == len_in_bytes);
```

Integer division truncates, so without the assert a file whose length is not a whole multiple of the entry size would be accepted and its trailing bytes silently ignored. Also, you might want to think about the endianness of the system, as that can affect the portability of the file.
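Both points can be sketched without any unsafe code. The helper names `entry_count`, `encode_le`, and `decode_le` are hypothetical, and little-endian is an arbitrary choice of on-disk byte order for illustration:

```rust
use std::mem::size_of;

#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SparseMatEntry {
    pub lhs: u32,
    pub rhs: u32,
    pub val: f32,
}

/// Number of whole entries in a file of `len_in_bytes`, or an error if the
/// length is not an exact multiple of the entry size.
pub fn entry_count(len_in_bytes: u64) -> Result<usize, String> {
    let entry_size = size_of::<SparseMatEntry>() as u64;
    if len_in_bytes % entry_size != 0 {
        return Err(format!(
            "file length {} is not a multiple of entry size {}",
            len_in_bytes, entry_size
        ));
    }
    let count = len_in_bytes / entry_size;
    if count > usize::MAX as u64 {
        return Err("file too large for this platform".to_string());
    }
    Ok(count as usize)
}

/// Encode one entry with a fixed (little-endian) byte order.
pub fn encode_le(e: &SparseMatEntry) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&e.lhs.to_le_bytes());
    out[4..8].copy_from_slice(&e.rhs.to_le_bytes());
    out[8..12].copy_from_slice(&e.val.to_le_bytes());
    out
}

/// Decode one entry written by `encode_le`, regardless of host endianness.
pub fn decode_le(b: &[u8; 12]) -> SparseMatEntry {
    SparseMatEntry {
        lhs: u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
        rhs: u32::from_le_bytes([b[4], b[5], b[6], b[7]]),
        val: f32::from_le_bytes([b[8], b[9], b[10], b[11]]),
    }
}
```

Per-entry encoding gives up the single bulk copy, so it only pays off if the file really has to move between machines.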


#16

Not too worried about this. The file stays on the same machine. One stage of the Rust program dumps the sparse mat, another stage reads it.

The main reason we dump it at all (rather than keep it in memory) is that it comes from preprocessing wikipedia to a ‘co-occurrence’ matrix (which takes a long time [tens of minutes]) and thus makes sense to cache instead of recompute every time.

As an aside, I’m able to read a ~3 GB sparse matrix into memory in about 2 seconds. This sounds about right with respect to modern SSDs right?


#17

LGTM. Don’t forget to handle I/O errors, because set_len after failed read_exact could give you uninitialized garbage.
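The write path has a similar gap: `bw.write_all(content);` discards its `Result`, and a `BufWriter` that is dropped without an explicit flush ignores any error from that final flush. A minimal sketch of a writer that reports both, with `write_entries` as a hypothetical free-function version of `write_bin` that takes any `Write`:

```rust
use std::io::{self, BufWriter, Write};
use std::mem::size_of;

#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SparseMatEntry {
    pub lhs: u32,
    pub rhs: u32,
    pub val: f32,
}

/// Write the raw bytes of `data` to `writer`, propagating every I/O error.
pub fn write_entries<W: Write>(writer: W, data: &[SparseMatEntry]) -> io::Result<()> {
    let mut bw = BufWriter::new(writer);

    // SAFETY: `data` is a valid slice of initialised entries; the struct is
    // repr(C) with no padding, so all of its bytes are initialised.
    let bytes = unsafe {
        std::slice::from_raw_parts(
            data.as_ptr() as *const u8,
            data.len() * size_of::<SparseMatEntry>(),
        )
    };

    bw.write_all(bytes)?;
    // Flush explicitly so a failure surfaces here instead of being lost on drop.
    bw.flush()
}
```

A caller would pass `File::create(fname)?` as the writer and decide for itself whether to unwrap or report the error.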


#18

@OptimisticPeach , @vitalyd , @kornel : Great. I think this is “solved”. Thanks everyone for your time / reading over my code / pointing out bugs / explaining the various pitfalls!


#19

I just got bincode working again, and I have to say: though not nearly as educational about the inner workings of Vec, @vitalyd 's suggestion of bincode is a far cleaner/simpler solution.