How to deal best with size_t and bindgen

I have the following scenario:

Cargo.toml:

# …
[build-dependencies]
cc = "1.0.73"
bindgen = "0.59.2"

build.rs:

fn main() {
    bindgen::Builder::default()
        .header("src/c_code.c")
        .parse_callbacks(Box::new(bindgen::CargoCallbacks))
        .generate()
        .unwrap()
        .write_to_file(
            std::path::PathBuf::from(std::env::var("OUT_DIR").unwrap())
                .join("c_code.rs"),
        )
        .unwrap();
    println!("cargo:rerun-if-changed=src/c_code.c");
    cc::Build::new().file("src/c_code.c").compile("c_code.a");
}

src/c_code.c:

#include<stddef.h>

void return_size_t_through_ptr(size_t *ptr) {
    *ptr = 17;
}

src/main.rs:

#![feature(c_size_t)]

use std::ffi::c_size_t;
use std::mem::MaybeUninit;

mod bindings {
    #![allow(warnings)]
    include!(concat!(env!("OUT_DIR"), "/c_code.rs"));
}

fn main() {
    let mut x = MaybeUninit::<c_size_t>::uninit();
    unsafe {
        bindings::return_size_t_through_ptr(x.as_mut_ptr());
    }
    let x = unsafe { x.assume_init() };
    assert_eq!(x, 17);
}

And I get the following error:

error[E0308]: mismatched types
  --> src/main.rs:14:45
   |
14 |         bindings::return_size_t_through_ptr(x.as_mut_ptr());
   |                                             ^^^^^^^^^^^^^^ expected `u64`, found `usize`
   |
   = note: expected raw pointer `*mut u64`
              found raw pointer `*mut usize`

For more information about this error, try `rustc --explain E0308`.

I see several possibilities to solve this:

Solution 1: Use size_t_is_usize option

     bindgen::Builder::default()
         .header("src/c_code.c")
         .parse_callbacks(Box::new(bindgen::CargoCallbacks))
+        .size_t_is_usize(true)
         .generate()
         .unwrap()

This works now, but will it keep working in the future? (edit: and on all platforms?)

Solution 2: Cast to *mut _

 fn main() {
     let mut x = MaybeUninit::<c_size_t>::uninit();
     unsafe {
-        bindings::return_size_t_through_ptr(x.as_mut_ptr());
+        bindings::return_size_t_through_ptr(x.as_mut_ptr() as *mut _);
     }
     let x = unsafe { x.assume_init() };

Doesn't feel safe if the API changes.

Solution 3: Cast to *mut bindings::size_t

 fn main() {
     let mut x = MaybeUninit::<c_size_t>::uninit();
     unsafe {
-        bindings::return_size_t_through_ptr(x.as_mut_ptr());
+        bindings::return_size_t_through_ptr(x.as_mut_ptr() as *mut bindings::size_t);
     }
     let x = unsafe { x.assume_init() };

But why do we have c_size_t at all? Maybe it's better not to use it at all while it's still unstable?

Solution 4: Don't use std::ffi::c_size_t but take it from bindgen output

-#![feature(c_size_t)]
-
-use std::ffi::c_size_t;
 use std::mem::MaybeUninit;
 
 mod bindings {
     #![allow(warnings)]
     include!(concat!(env!("OUT_DIR"), "/c_code.rs"));
 }
 
+#[allow(non_camel_case_types)]
+type c_size_t = bindings::size_t;
+
 fn main() {
     let mut x = MaybeUninit::<c_size_t>::uninit();
     unsafe {
         bindings::return_size_t_through_ptr(x.as_mut_ptr());
     }
     let x = unsafe { x.assume_init() };

It all feels a bit awkward. What would you do?

After some more thinking, I came up with this, which I think is the cleanest solution:

#![feature(c_size_t)]

use std::ffi::c_size_t;
use std::mem::{size_of, MaybeUninit};

mod bindings {
    #![allow(warnings)]
    include!(concat!(env!("OUT_DIR"), "/c_code.rs"));
}

fn main() {
    let x: c_size_t = unsafe {
        let mut x = MaybeUninit::<bindings::size_t>::uninit();
        bindings::return_size_t_through_ptr(x.as_mut_ptr());
        assert_eq!(
            size_of::<bindings::size_t>(),
            size_of::<c_size_t>()
        );
        x.assume_init() as c_size_t
    };
    assert_eq!(x, 17);
}

This assumes I want to re-use x later where a std::ffi::c_size_t is expected. But it's a bit verbose.

I'd be in favor of Solution 4 if you're worried about usize vs. size_t. Since sizeof(size_t) is a property of the ABI, and bindgen knows the correct ABI from the compiler, bindings::size_t will be the correct type no matter what. (Normally, I'd just use Solution 1, since I don't expect anyone to run my programs on exotic platforms.)

Last time I saw semi-official advice on this, it was that we should treat C size_t as usize on the Rust side, and there were even plans for deprecating binding size_t as anything else. Of course, people then started discussing an exotic platform on IRLO (I don't remember what it was) where they did something weird with the address space and as a result, sizeof(usize) != sizeof(*const T) != sizeof(size_t). However, I don't think there has been actual support/implementation of that, and the idea that size_t is always usize is quite compelling, so I would advise you do the same.

The most recent chatter I've seen is the strict provenance experiment, where the idea is that we preserve size_t == usize and instead rewrite the C bindings to take pointers instead of pointer-sized ints in all cases of stuffing.

The issue is that some platforms have pointers that are bigger than the address space. One of these platforms apparently is the CHERI architecture, which I have never heard of before. Here, the address space is 64 bits but the pointers carry extra information to make them non-forgeable. Therefore the pointers are 128 bits in size, but the maximum object size is 2^64 bytes. As Rust guarantees that usize can store a pointer value, usize must be 128 bits wide. Yet size_t will be 64 bits.

To a language with memory safety, like (safe) Rust, concepts like these might be unimportant, but we can't know if such platforms will be relevant in the future. Thus, I think, people are right that size_of::<usize>() isn't necessarily size_of::<c_size_t>().

However, I feel like there are bigger problems with these platforms: usize (which is returned by Vec::len, for example) would be bloated by 100% throughout a lot of Rust code, because usize is guaranteed to store a pointer value (of 128 bits) but often only needs to store sizes (as the name usize suggests).

I found a comment from @kornel, which somewhat reflects how I feel about that issue:

Rust already made a mistake of assuming uintptr_t == size_t , but maybe it should try to back out from it instead of solidifying it further?

(in Issue #1400 on libc crate)

But then there is the argument of backward compatibility (which requires that usize is big enough to store pointers). I just hope this won't cause really big trouble for Rust in the future. As of right now, agreed, I think it's safe to assume that size_t is the same as usize, but I would like my code to work in the future too.

The thread on IRLO which you referenced is maybe this one.

Disregarding any considerations on core language design, I would come to the following conclusions:

  • bindgen could translate size_t to u32/u64/u128, or usize on platforms where sizeof(size_t) == sizeof(ptrdiff_t). I feel like translating it to usize would cause fewer problems. When dealing with Rust and FFI, there's always the problem of writing code that compiles well on your own platform but fails to compile on other platforms. That's because the mapping from the C types to Rust's primitive integer types is done via type aliases. A mismatch won't always cause compile-time errors, as I also figured out in this thread.
  • The clean way seems to either fix bindgen to return the same type as used in std::ffi::c_size_t (which is yet unstable!) or to simply use size_t as emitted by bindgen, i.e. work with bindings::size_t where bindings is the module containing the created bindings. When translating the integer to the Rust world, usize is used, so we could convert the returned integer with .try_into().unwrap() (or assert!(size_of::<usize>() >= size_of::<bindings::size_t>()); value as usize).

This is my real-life code as of now:

fn cursor_get_current_value_count<K, V, C>(
    &mut self,
    cursor: &Cursor<K, V, C>,
) -> Result<usize, io::Error>
where
    K: ?Sized + Storable,
    V: ?Sized + Storable,
    C: Constraint,
{
    cursor.backend.assert_txn_backend(self);
    unsafe {
        // TODO: use c_size_t when stabilized
        let mut count = MaybeUninit::<lmdb::size_t>::uninit();
        check_err_code(lmdb::mdb_cursor_count(
            cursor.backend.inner,
            count.as_mut_ptr(),
        ))?;
        Ok(count.assume_init().try_into().unwrap())
    }
}

I hope the try_into().unwrap() will be zero-cost.
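For reference, here is a minimal sketch of that conversion in isolation. The alias `size_t` is a hypothetical stand-in for the bindgen-generated type, assumed to be u64 as in the error message from the original post, and `count_to_usize` is a name made up for illustration. On targets where usize is also 64 bits wide, this particular conversion cannot fail, so the compiler has what it needs to drop the check; that is an expectation, not something measured here.

```rust
use std::convert::TryInto;

// Hypothetical stand-in for the bindgen-generated `size_t`
// (assumed to be `u64`, as in the error message of the original post).
#[allow(non_camel_case_types)]
type size_t = u64;

// Converts a C-side count into `usize`, panicking if it does not fit.
// A `u64` always fits into a 64-bit `usize`; the panic can only
// happen on targets where `usize` is narrower than `size_t`.
fn count_to_usize(raw: size_t) -> usize {
    raw.try_into().unwrap()
}

fn main() {
    let n = count_to_usize(17);
    println!("count = {}", n);
}
```

Whether the check really disappears depends on the target and optimization level, so inspecting the generated assembly would be the way to confirm it.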

Actually, in my opinion, the separation of uintptr_t and size_t is the mistake, platforms where they are different are weird, and Rust should not try to support such platforms at such a huge price as to break a fundamental (and very convenient) assumption of most of the FFI-facing Rust code people have ever written.

Well, at that point I'd much rather not assume anything about the conversion succeeding. That last line could (and IMO should) be replaced by Ok(count.assume_init().try_into()?).

You mean they should have been the same in C? Maybe warming up the whole issue is then the mistake. But I'm undecided. I guess it depends on whether you want to support fat pointers in hardware (which are bigger than the address space) or not. If you want to support them, then it might have been a mistake to not introduce a upointer in addition to usize in Rust and make usize equal to size_t. If you don't want to support them, then it's all usize and the whole c_size_t stuff is unnecessary. I don't know how relevant these platforms or concepts of fat pointers in hardware are.

Not quite, I would need:

Ok(count
    .assume_init()
    .try_into()
    .map_err(|err| io::Error::new(io::ErrorKind::Other, err))?)

And not sure if I like that.

Alternatively:

assert!(size_of::<usize>() >= size_of::<lmdb::size_t>());
Ok(count.assume_init() as usize)

But maybe I worry too much about this. It's still annoying to get compiler errors regarding size_t all the time, though. I hope this issue will be fixed one way or the other (in Rust, in C, or in bindgen, wherever). I personally don't need to support these exotic platforms either. I only see the risk of fat pointers being more widely used in hardware in the future. (Not sure how big that risk is.)

Yeah, I meant that if you are returning Result anyway, you shouldn't surprise-panic because that would be a misleading API. In reality you should probably have your own Error type anyway and not just return a raw io::Error for exactly this reason (then you can impl From<TryFromIntError> for MyError).
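A rough sketch of what such an error type could look like, using only the standard library. The names `MyError`, its variants, and `count_to_usize` are hypothetical, and `size_t` is a stand-in alias assumed to be u64 (what bindgen emitted in the original post); none of this is taken from the real code base discussed here.

```rust
use std::convert::TryFrom;
use std::fmt;
use std::io;
use std::num::TryFromIntError;

// Hypothetical error type wrapping both failure sources.
#[derive(Debug)]
enum MyError {
    Io(io::Error),
    CountOutOfRange(TryFromIntError),
}

impl fmt::Display for MyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MyError::Io(e) => write!(f, "I/O error: {}", e),
            MyError::CountOutOfRange(e) => {
                write!(f, "count does not fit into usize: {}", e)
            }
        }
    }
}

impl std::error::Error for MyError {}

// These two impls are what let `?` convert the underlying errors.
impl From<io::Error> for MyError {
    fn from(e: io::Error) -> Self {
        MyError::Io(e)
    }
}

impl From<TryFromIntError> for MyError {
    fn from(e: TryFromIntError) -> Self {
        MyError::CountOutOfRange(e)
    }
}

// Stand-in for the bindgen-generated `size_t` (assumed `u64`).
#[allow(non_camel_case_types)]
type size_t = u64;

// Returns an error instead of panicking when the value does not fit.
fn count_to_usize(raw: size_t) -> Result<usize, MyError> {
    Ok(usize::try_from(raw)?)
}

fn main() {
    match count_to_usize(17) {
        Ok(n) => println!("count = {}", n),
        Err(e) => eprintln!("error: {}", e),
    }
}
```

With the `From` impls in place, the call site shrinks to a plain `?` and no `map_err` closure is needed.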

That has the same problem, it suggests graceful error handling on the surface, but it sometimes panics. There's even a Clippy lint for it.

I don't care much about C; I think that C's integer types are mostly unsalvageable at this point, and there's not much point in arguing about them. My comment concerns Rust, and my firm opinion is that we should keep bad surprises from entering the core language as much as possible, before it's too late. If this means not supporting a niche platform at all, then so be it. They'll keep using C or whatever language they have been using so far.

I just noticed I would actually need an assert that panics at compile time (and not run-time) to achieve what I intended to do. What's the idiomatic way to do it?

Yeah, maybe orienting on C too much is the greater "solidification" of past mistakes. So messing up Rust's simple system of basic integer types "just because of C" might make things worse instead of better.

You could use static_assertions::assert_eq_size!() for that purpose.
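Without an extra dependency, a similar compile-time check can be sketched with a plain const assertion, assuming Rust 1.57 or later (where panicking in const contexts is stable). The `bindings` module below is a hypothetical stand-in for the generated bindings, with `size_t` assumed to be u64 as in the original error message, so the sketch as written only compiles on 64-bit targets; `to_usize` is likewise a made-up helper.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the module produced by bindgen.
mod bindings {
    #![allow(non_camel_case_types)]
    // Assumed to be `u64`, as in the error message of the original post.
    pub type size_t = u64;
}

// Evaluated at compile time: if the sizes differ, the build fails
// instead of the program panicking at run time.
const _: () = assert!(size_of::<usize>() == size_of::<bindings::size_t>());

// With the size check above in place, this cast cannot truncate.
fn to_usize(n: bindings::size_t) -> usize {
    n as usize
}

fn main() {
    println!("{}", to_usize(17));
}
```

The check compares sizes only; it says nothing about signedness or alignment, which is the same limitation a size-equality macro would have.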

Thanks. Even if I don't use it here, it might help to make code fail to compile rather than cause issues at run-time when it is ported in the future.

I feel like it's reasonable to panic when an integer representing a length doesn't fit into usize. Similar to how being out of memory would panic. Hmmmmm, though on the other hand, methods like std::io::Read::take use u64 for lengths instead of usize (because you don't need to be able to keep everything in memory). So a length can exceed usize.

I'll think about it again.

I'll also look at other occurrences of .try_into().unwrap() in my code. Maybe I use it too carelessly. I got used to writing .try_into().unwrap() a lot due to the issues explained in this thread:

Interfacing C code with bindgen: #define and types

This is the statement that is currently being challenged by CHERI and Stacked Borrows, actually: while ptr as usize will always be available, it is currently envisioned that it could be a lossy cast (like a uintptr_t -> size_t cast in C would be), so that the usize as *mut _ cast, on the other hand, would simply not exist on such exotic platforms (and may even yield an unusable ptr on all platforms? I haven't followed all the implications of the SB trouble). This indeed aligns very well with the Stacked Borrows formal model, which has independently been struggling with the semantics of usize as ptr for a while. That shouldn't come as much of a surprise, however:

  • in the formal model, pointers have extra compile-time metadata attached to them (the pointer provenance), and thus a usize has strictly less information than its pointer counterpart, so a cast from the former to the latter is problematic in that we need to forge some arbitrary magic provenance, which, in a way, could lead to preventing many kinds of pointer optimizations.

  • in CHERI, pointers have extra runtime metadata attached to them, hence the problems already described.

All in all, the usize as ptr operation (and more generally, any kind of integer as ptr) is on its way to being more and more proscribed.

And thus, usize is becoming more and more size_t and less and less uintptr_t (it may end up being ptraddr_t, but I'd very much hope that size_t and ptraddr_t remain equal-sized on the platforms Rust supports, since otherwise things would be way too crazy).

See also: Rust's Unsafe Pointer Types Need An Overhaul - Faultlore. While the tone of the article is suboptimal, the points it makes ought to help provide more context to this area.

I'll look into that post.

Following that trend, maybe .size_t_is_usize(true) is the best option then (Solution 1 in my original post).


P.S.: I just tried to use .size_t_is_usize(true) and suddenly I could get rid of many try_into()s. Code feels much cleaner now (and I assume/hope that .size_t_is_usize(true) will cause a compile-time error on exotic platforms where this is dangerous to assume as of current Rust).

I don't think it was a mistake to separate these two. They are really different things. Most platforms have a difference between the pointer size and actual address space size (e.g. even current modern 64-bit platforms tend to use only 40-something bits of the address space). Embedding of extra metadata in pointers is useful, and has been done for a long time (often in these unused bits of the address space). The "weird" platforms just make this explicit, but I like the explicitness, and I like that it also helps pointer provenance problems.

Rust uses usize for indexing, so it needs another integer type that is actually large enough for pointers with their metadata. Otherwise it will inaccurately use a too-large integer for indexing (ideally usize should have been 31-bit integer due to LLVM, but I realize that's too quirky).

I would phrase it this way then: The original mistake was to assume that an integer type that's commonly used for indexing could also be used to store pointers. That assumption is not platform independent, and if Rust wants to be platform independent (and not have an unnecessarily big integer on some platforms), this error needs to be corrected. I see breaking usize as *const _ as the lesser problem compared to what would happen if .len() methods in std changed their return type from usize to realusize.

Thus, I would appreciate it if Rust settled on

  • using usize for indexing (like done in std already),
  • disallowing usize as *const _ (via an Edition / deprecation),
  • making usize and size_t be the same (as long as this doesn't cause any other issues on some platforms/ABIs?).

This would go in the same direction @Yandros indicated:

But wait, what is ptraddr_t? Searching for it on the web brings me to CHERI. I assume it means an integer big enough to hold an address (but not necessarily a fat pointer). Hmmmm… now we're talking about three different "size" integers :crazy_face:.

I will just hope that Rust follows the path of making usize the same as size_t (and the same as ptraddr_t on all supported platforms).

Following that hope, I think the best approach is to:

  1. Use size_t_is_usize(true)

  2. If you are paranoid, add something to the code that makes sure size_t really is the same size as usize in practice (and aborts compilation otherwise), e.g.:

#![feature(c_size_t)]
const _: fn() = || {
    let _ = core::mem::transmute::<usize, core::ffi::c_size_t>;
};

(Playground) (Code inspired/taken from static_assertions::assert_eq_size)

How do I do this on stable? I tried this:

use std::mem::MaybeUninit;

mod bindings {
    #![allow(warnings)]
    include!(concat!(env!("OUT_DIR"), "/c_code.rs"));
}
const _: fn() = || {
    let _ = core::mem::transmute::<usize, bindings::size_t>;
};

fn main() {
    let x: usize = unsafe {
        let mut x = MaybeUninit::<usize>::uninit();
        bindings::return_size_t_through_ptr(x.as_mut_ptr());
        x.assume_init()
    };
    assert_eq!(x, 17);
}

But if I use .size_t_is_usize(true), then size_t apparently is removed from the bindings generated by bindgen. :exploding_head:

I could use __size_t then, but not sure if that's wise/correct?

Maybe the extra check is unnecessary anyway? Maybe I overcomplicate things.
