Memory allocation and writing to incomplete array field generated by bindgen

I am trying to write to an __IncompleteArrayField generated by bindgen for interaction with a C library.

Following the advice in another post, there is something I am not quite doing right as the snippet...

slice.copy_from_slice(&items);

... fails in runtime with the error:

thread 'test' panicked at library/core/src/panicking.rs:218:5:
unsafe precondition(s) violated: ptr::copy_nonoverlapping requires that both pointer arguments are 
aligned and non-null and the specified memory ranges do not overlap

The full example is available in the Rust playground, but here is the part relevant to the issue:

let item_layout = Layout::array::<Item>(2)?;
let buffer = alloc(item_layout) as *const [Item; 0];
let mut arr = __IncompleteArrayField(PhantomData, *buffer);

let items = [Item::new("apple"), Item::new("banana")];
let slice = arr.as_mut_slice(items.len());
slice.copy_from_slice(&items); // <-- panics

What am I doing wrong?

let buffer = alloc(item_layout) as *const [Item; 0];
let mut arr = __IncompleteArrayField(PhantomData, *buffer);

The second line above doesn't do what you want.[1] I guess you want something like

        let buffer = alloc(item_layout);
        let arr = &mut *(buffer as *mut __IncompleteArrayField<Item>);

(Miri still detects UB elsewhere at that point, but that's another topic.)


  1. It "moves" 0 bytes from *buffer to the stack. ↩︎

1 Like

Thanks!

I think I understand why my code was failing, but I don't understand about the Undefined Behaviour that you say Miri is reporting. Or rather, while trying to parse Miri's output, I even fail to see that it reports UB...

Maybe this is not so strange, given that I learned about the tool through this post.

What exactly is the problem with your solution, provided I clean up the allocated memory when done?

It's this issue basically. You could try running Miri with Tree Borrows instead of Stacked Borrows and see if that is happier. Or you could avoid &[mut] __IncompleteArrayField<_> and instead stick with raw pointers when the full size isn't represented by the type.

1 Like

rust handles dynamically sized types quite differently than C/C++, but when dealing with flexible array members, it's better to stick with raw pointers all the way to avoid surprising UB.

there exists crates like fambox to help reduce the amount of unsafe code required.


side note:

EDIT:

I just learned bindgen actually can generate similar DST definitions. see Builder::flexarray_dst().

END of EDIT

this is NOT applicable to bindgen generated code, but when I manually create fii bindings for struct with FAM, besides the ffi compatible "header" type, I usually make some "companion" types for rust-only use, this usually include a "conservatively"-sized type, a memory layout compatible DST, etc.

for example, it is common for windows APIs to use FAM for something like file paths, suppose we have this C definition:

typedef struct _FooInformation {
    // many fields
    DWORD dwPathLength;
    WCHAR wPath[1];
} FooInformation, *LPFooInformation;

then I would create these rust types:

#[repr(C)]
struct FooInformation<P: ?Sized> {
    // other fields
    path_len: u32,
    path: P,
}

type FooInformationHeader = FooInformation<[u16; 0]>;
type FooInformationShort = FooInformation<[u16; 260]>;  //<-- 260 is `MAX_PATH` in `windows.h`
type FooInformationLong<const N: usize = 1024> = FooInformation<[u16; N]>;
type FooInformationDST = FooInformation<[u16]>;

these type are safe to work in rust, yet all of them is memory layout compatible so you can pass a pointer directly to the ffi API.

2 Likes

In the updated example, the bindgen generated __IncompleteArrayFiels<T> you changed the references (&mut self) to pointers (*mut self). Can bindgen be made to generate code like this?

Checking the bindgen builder documentation I cannot find any suitable setting...

I was hoping to use the bindgen generated code as-is and just need a way to somehow pass a dynamically sized list of items to the FFI functions.

as I said before, when dealing with flex array, it's better to stick with raw pointers. (the generated methods are all unsafe anyway, but they have UB!).

I didn't know that bindgen actually can generate DST types (similar to what I did manually) for structs with flex array members. see Builder::flexarray_dst().

although the conversion methods will only be generated when targeting nightly rust, the data type definition alone is a vast improvement over the ugly IncompleteArrayField.

note, under strict provenance and stacked borrow model, the IncompleteArrayField is pretty much useless: it's UB to access the trailing payload data through a ZST reference; you can only access the payload using a pointer with correct provenance, e.g. from the allocator, or (unsafely) originated from the other side of the ffi boundary.

although the generated as_slice() and as_mut_slice() has UB in them under the stacked-borrow model, there's not much bindgen can do about it, unfortunately, see:

1 Like

Thanks,

I found the Builder::flexarray_dst() method and also the discussion on FAM in the bindgen book but as I am struggling with a basic understanding of the entire topic, I did not yet manage to come up with a meaningful reply or even question.

Also, when trying to follow the configuration recommended for using FAM with the nightly target and adding the raw line, bindgen still fails to generate the associated helper methods.

Turns out I missed a bindgen argument. With the following configuration in build.xml, the methods get generated as expected:

let bindings = bindgen::Builder::default()
    ...
    .raw_line("#![feature(ptr_metadata,layout_for_ptr)]")
    .flexarray_dst(true)
    .rust_target(RustTarget::nightly())
    ...
    .generate()
    ;

Still need to figure out how to make it all work properly...

1 Like

let me try to give you a short summary.

the definition for a struct with FAM consists of the fixed-sized "header"/"prefix" part of the packet, and the "payload" part as an array with either an unspecified size, or zero size.

if you don't use the flexarray_dst option, then the type definition generated by bindgen represents ONLY the header, which means when you use this type alone, e.g. an owned (unboxed) value, or a (safe) reference, it is always UB to access the payload data array through it! hence the name __IncompleteArrayField: this header type is INCOMPLETE, to represent the full packet! [1]

to be able use the full data, you need to combine this header with the actual array of data elements. you can use an array of statically known size, or you can use an array with a runtime known size (a.k.a. the slice DST), both are ok to use across the ffi boundary.

and this is exactly what the fambox crate does for you: glue the header (generated by bindgen) and the (dynamically sized) payload data together and handles the tricky unsafe part for you.

when the flexarray_dst option is enabled, the generated type definition can still be used as "header"-only, just like before, but it can also be used to represent the combined "header" + "payload" data. so you don't need to use fambox in this case, but I still recommend to use fambox if you are not comfortable to do write the unsafe code yourself. [2]

also note, some (legacy) C code would use an array of size 1 to emulate FAM, in order to compile on very old C compilers, which lack proper support for FAM. this is fine for C, but bindgen will parse these types as regular structs. fambox can still be used in this case though, just be aware when dealing with very old ffi libraries. [3]


long text and code wall alert

I think some code might help understand the concepts better, so here it is:

suppose we have a C struct like this:

struct Foo {
	/* other fields */
	int name_len; /* NOT including the NUL terminator */
	char name[];  /* must have the NUL terminator */
};

the bindgen generated header type looks like this:

#[repr(C)]
pub struct Foo {
	pub name_len: ::std::os::raw::c_int,
	pub name: __IncompleteArrayField<::std::os::raw::c_char>,
}

if you want to construct a Foo in rust, you cannot just use Foo alone (unless you use raw pointer for EVERYTHING), you need some wrapper type to "bundle" the header and payload.

let's start simple, with a fixed size array, this is the "consevatively-sized" type I mentioned, let's call it FooBuf:

const MAX_NAME_LEN: usize = 30;

#[repr(C)]
pub struct FooBuf {
	header: Foo,
	payload: [c_char; MAX_NAME_LEN + 1],
}

impl FooBuf {
	/// will panic if name exceeded MAX_NAME_LEN
	pub fn new(/* other header fields, */ name: &[c_char]) -> Self {
		assert!(name.len() <= MAX_NAME_LEN);
		let mut foo = Self {
			header: Foo {
				name_len: name.len().try_into().unwrap(),
				name: __IncompleteArrayField::new(),
			},
			payload: [0; MAX_NAME_LEN + 1],
		};
		foo.payload[..name.len()].copy_from_slice(name);
		foo
	}
}

here's how it is used:

/// hypothetical ffi APIs
unsafe extern "C" {
	/// insert the `foo` into some registry, takes ownership and returns an opaque handle
	fn register_foo(foo: *mut Foo) -> u32;
	/// remove the `foo` from the registry, give back ownership to the caller
	fn unregister_foo(handle: u32) -> *mut Foo;
}

fn main() {
	// create an `FooBuf`, which is ffi compatible for the `Foo` object 
	let my_foo = FooBuf::new(b"/foo/bar/baz");

	// we are going to register the object into the registry of the
	// ffi library, move it into the heap so we can transfer ownership
	let my_foo = Box::new(my_foo);

	// `FooBuf` is `Sized`, so this is just a regular "thin" pointer
	let handle = unsafe { register_foo(Box::into_raw(my_foo) as *mut Foo) };

	// ...

	// unregister it, get back the box allocated on rust allocator
	let my_foo = unsafe { Box::from_raw(unregister_foo(handle)) };

	// do something with `my_foo`, or just let it drop automatically
}

actually, this is enough for a large number of common use cases, where memory usage is not a constraint, so it is an acceptable trade-off of this conservative approach: some wasted memory for safety and simplicity.

in other situations, e.g., it's undesirable to use "maximum" memory for all the data, or it simply doesn't have a defined "maximum" size for the data, a DST must be used (again, unless you want to use raw pointers for everything).

the type definition itself is very similar to the fixed sized array, simple:

#[repr(C)]
pub struct FooDst {
	header: Foo,
	payload: [c_uchar],
}

however, since it's a DST, the implementation needs careful use of unsafe code. the most tricky part is to get the correct pointer metadata out of thin air (or, out of of "thin pointer"), here is why it woks: the metadata for a slice type is just the len() of the slice, this is a special case for the general ptr::metadata API, and is available in stable, see std::ptr::slice_from_raw_parts() (and the counter part std::slice::from_raw_parts()) for details.

impl FooDst {
	/// use this to convert a "thin" header pointer into a "fat" pointer
	///
	/// this is most tricky part, it may seems strange at first, but it is
	/// actually sound, and it is the only way to get the correct pointer
	/// metadata on stable.
	///
	///
	/// SAFETY:
	/// 
	/// the `name_len` field in the header must be correctly initialized
	///
	/// the pointer must have the correct provenance:
	///
	/// - when called with rust pointers, it cannot be derived from a reference
	///   of the header type `Foo`.
	///
	/// - when called with pointers received from ffi APIs, since ffi doesn't
	///   have the concept of provenance, it can only be assumed (unsafely)
	unsafe fn from_header(header: *mut Foo) -> *mut Self {
		let name_len = (*header).name_len as usize;
		// `+ 1` to include the NUL terminator
		let this = std::ptr::slice_from_raw_parts_mut(header, name_len + 1);
		this as *mut Self
	}
}

with this in hand, we can implement a constructor for boxed value easily:

impl FooDst {
	/// name_len does NOT include NUL terminator
	pub fn layout(name_len: usize) -> (Layout, usize) {
		let header = Layout::new::<Foo>();
		let payload = Layout::array::<c_uchar>(name_len + 1).unwrap();
		let (layout, payload_offset) = header.extend(payload).unwrap();
		(layout.pad_to_align(), payload_offset)
	}

	pub fn new_boxed(/* header fields omitted, */ name: &[u8]) -> Box<Self> {
		let (layout, payload_offset) = Self::layout(name.len());
		unsafe {
			let this = alloc_zeroed(layout);
			// init header:
			std::ptr::write(this as *mut Foo, Foo::new(name.len() as c_int));
			// init payload:
			std::ptr::copy_nonoverlapping(
				name.as_ptr(),
				this.add(payload_offset) as *mut c_uchar,
				name.len(),
			);
			// wrap it in a box
			Box::from_raw(Self::from_header(this as *mut Foo))
		}
	}
}

depending on the situation, you may want to MaybeUninit to init the payload data inplace instead of copying from a slice for efficiency, but I'll not go into the details.

the usage of creating an object in rust is similar to the fixed sized array case, so I'll not repeat, here's how you use it with a pointer received from ffi:

unsafe extern "C" {
	fn get_foo() -> *mut Foo;
}

fn main() {
	// get a thin pointer from ffi function
	let foo = unsafe { get_foo() };
	// turn it into a fat pointer using the information of the header
	let foo = unsafe { FooDst::from_header(foo) };

	// example use of an non-owning reference of the DST:
	let foo_dst_ref = unsafe { &*foo };
	let header: &Foo = &foo_dst_ref.header;
	let name_buf: &[c_uchar] = &foo_dst_ref.payload;

	// note, the `payload` slice includes the NUL terminator
	assert_eq!(header.name_len + 1, name_buf.len() as c_int);
}

as you can see, to get the unsafe bits right is tricky, luckily, these are all handled by fambox for you, you just need to implement the FamHeader trait for the header type Foo, no need to implement your own FooDst:

unsafe impl FamHeader for Foo {
	type Element = c_uchar;

	fn fam_len(&self) -> usize {
		// count the NUL terminator
		self.name_len as usize + 1
	}
}

compare the FooDst::new_boxed() function above, and the new_fam_box() funciton below:

/// CStr ensures a NUL terminator
fn new_fam_box(name: &CStr) -> FamBoxOwned<Foo> {
	let header = Foo::new(name.to_bytes().len() as i32);
	FamBox::from_fn(header, |i| name.to_bytes_with_nul()[i])
}

and usage is very similar:

fn gg() {
	// create a boxed DST value
	let my_foo = new_fam_box(c"/foo/bar/baz");

	// transfer ownership to ffi library
	let handle = unsafe { register_foo(my_foo.leak().as_mut()) };

	// use borrowed ffi pointer
	let mut foo = unsafe { FamBoxMut::from_raw(NonNull::new(get_foo()).unwrap()) };
	let (header, payload) = foo.as_parts_mut();
	assert_eq!(header.name_len + 1, payload.len() as i32);

	// get back ownership from ffi library
	let my_foo = unsafe { FamBoxOwned::from_raw(NonNull::new(unregister_foo(handle)).unwrap()) };
	// use it or drop it
}

  1. personally, I think it was a mistake for bindgen to generate those methods on the __IncompleteArrayField placeholder type ↩︎

  2. for what I can see, even for nightly targets, the generated code is still troublesome to use, far behind the usability of fambox ↩︎

  3. this is not bindgen doing something wrong, this kind of ambigouity can only be resolved by humans. granted, tools could use some heuristics, but it's impossible to do it complete right, and it's rare occurence anyway. ↩︎

3 Likes

Thanks for putting it all together - it all makes way more sense now.

Regarding the following line:

... it would seem to indicate that we depend on the order of the FooBuf fields and the fact that Foo comes as the first field in FooBuf. How else can register_foo know where to find the information?

Is this correct? If so, would the compiler catch the error of putting the payload field before the header field in FooBuf?

correct. that's why we need #[repr(C)].

no, it cannot.

it is UB if the memory layout of FooBuf is incompatible with the struct Foo (with FAM payload data) in C. even in C, there are requirement to use FAM correctly: the memory must be allocated and the initialized in a specific way, so that the length information in the header cannot lead to invalid memory access.

/*
 although C doesn't have the distinction between `safe` vs `unsafe`,
 there are still "safety" conditions, just not stated explicitly. for example,
 - the allocation size is (at least) `header_size + payload_size`
 - payload_size is (at least) `header.name_len + 1`, to include NUL terminator
*/
struct Foo *new_foo(char const *name) {
  int name_len = strlen(name);
  struct Foo *foo = malloc(sizeof(Foo) + name_len + 1);
  if (foo == NULL) return NULL;
  foo->name_len = name_len;
  strncpy(foo->name, name, name_len + 1);
  return foo;
}

no, not in the case where the header type (the Foo with __IncompleteArrayField in the example) and the wrapper type combining header + payload (the FooBuf in the example) are separately defined. and it cannot. it is up to the user to make sure the declaration for ffi types and functions are correct.

although for the DST case, it is a language requirement that DST can only be the last field in a struct, the "header ==> payload" order is automatically guaranteed, this is the same for C too, FAM can only be the last field in a struct.

this is a good reason to use the flexarray_dst option, which generates a generic type definition that is suitable for all the cases, by varying the generic type parameter.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.