Converting String to *mut i8: Understanding the problem

I am working on a Rust / C binding and have to convert between C strings of the type char* to String and back. Basically the situation on Stackoverflow or in the forum.

My goal here is to understand what is going on, not just to make it work.

First: Why does Rust use u8 instead of i8, that would match the common C idiom of using char*?
I presume because u8 matches the underlying type, a text in utf-8. Signedness does not make much sense in the context of a text character - and I agree.

So we can go from String to Vec<u8>, borrow mutably and get a pointer to it, which always leaves us with *const u8 or *mut u8. However, C commonly uses char instead of unsigned char, which requires me to do something like:

use std::ffi::CString;

fn main() {
    let s = String::from("Hallo Welt!");
    let cs = CString::new(s).unwrap();
    let cv: Vec<u8> = cs.into_bytes_with_nul();
    let mut tmp: Vec<i8> = cv.into_iter().map(|c| c as i8).collect::<_>(); // line 7
    let _cptr: *mut i8 = tmp.as_mut_ptr();
}

To summarize:

  • String to CString to add the NUL termination
  • CString to Vec<u8> so we can iterate over bytes
  • Vec<u8> to Vec<i8> so we can get a char later; also need Vec to get a mutable reference to its contents

(I know there is CString::into_raw() -> *mut c_char, which does that all in one, but I'd have to reclaim the memory later, which does not work for me at the moment.)
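
For reference, here is a minimal sketch of what reclaiming the memory after CString::into_raw() could look like. The names c_len_stand_in and call_with_owned are hypothetical. The stand-in is written in Rust only to keep the example self-contained, and the sketch assumes the C side neither frees the buffer nor changes the length of the string.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

/// Hypothetical stand-in for a C routine taking `char *`.
/// It only reads the string and returns the number of bytes before the NUL.
unsafe extern "C" fn c_len_stand_in(s: *mut c_char) -> usize {
    let mut n = 0;
    // SAFETY: the caller promises that `s` points to a NUL-terminated buffer.
    unsafe {
        while *s.add(n) != 0 {
            n += 1;
        }
    }
    n
}

fn call_with_owned(s: &str) -> usize {
    // Ownership of the allocation moves into the raw pointer here.
    let raw: *mut c_char = CString::new(s).expect("interior NUL byte").into_raw();
    // SAFETY: `raw` is a valid NUL-terminated string until it is reclaimed below.
    let n = unsafe { c_len_stand_in(raw) };
    // SAFETY: `raw` came from `CString::into_raw` and the string length is unchanged,
    // so rebuilding the CString and dropping it frees the allocation correctly.
    drop(unsafe { CString::from_raw(raw) });
    n
}
```

The pointer has to go back through CString::from_raw() exactly once; freeing it from the C side instead is not supported.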

I understand that passing a *const T instead of *mut T is dangerous:
Rust may correctly presume that the contents are unchanged, when in fact the C routine changed them (which it might do anyway, disregarding the const qualifier). Also, string literals might land in a read-only memory segment, so changing const to mut might segfault.

Second, however, why is a type cast from *mut u8 to *mut i8 dangerous in any way?

In the Forum post u/ExpHP writes it is undefined behavior. AFAICT "undefined behavior" is a matter of "the Rust compiler team defining it to be so" - and I am fine with that.

Why is line 7 from above better than:

use std::ffi::CString;

fn main() {
    let s = String::from("Hallo Welt!");
    let cs = CString::new(s).unwrap();
    let mut cv: Vec<u8> = cs.into_bytes_with_nul();
    let _cptr: *mut i8 = cv.as_mut_ptr() as *mut i8;      // typecast here!
}

What could possibly go wrong?

Thanks in advance!

The *mut u8 → *mut i8 cast itself seems OK; perhaps ExpHP had something different in mind? Rust does not do any TBAA (type-based alias analysis) like C. What would be unsafe is obtaining a *mut i8 directly from the CString, as the C code may insert some nuls, which would break CString's assumption that there are no interior nuls. But when going through a vector, you don't have to guarantee the lack of nuls, so it seems OK.

By the way, you most likely want to cast to *mut c_char instead of *mut i8, just as CString::as_ptr does (char is i8 on some platforms, but u8 on others).
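
A minimal sketch of the Vec route with a cast to *mut c_char, for the case where the C side really does write into the buffer. The names c_upper_stand_in and upper_via_c are hypothetical. The stand-in is written in Rust only so the example is self-contained, and it is assumed to stay within the buffer and leave the terminator in place.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

/// Hypothetical stand-in for a C routine that modifies a `char *` in place.
/// It uppercases ASCII letters and leaves every other byte untouched.
unsafe extern "C" fn c_upper_stand_in(s: *mut c_char) {
    let mut p = s;
    // SAFETY: the caller promises that `s` points to a writable NUL-terminated buffer.
    unsafe {
        while *p != 0 {
            *p = (*p as u8).to_ascii_uppercase() as c_char;
            p = p.add(1);
        }
    }
}

fn upper_via_c(s: &str) -> String {
    // CString::new checks for interior NULs; the Vec then owns the bytes plus the terminator.
    let mut buf: Vec<u8> = CString::new(s).expect("interior NUL byte").into_bytes_with_nul();
    // SAFETY: `buf` is NUL-terminated and stays alive and unmoved during the call.
    unsafe { c_upper_stand_in(buf.as_mut_ptr().cast::<c_char>()) };
    // The stand-in does not move the terminator, so the last byte is still the NUL.
    buf.pop();
    String::from_utf8(buf).expect("buffer is still valid UTF-8")
}
```

Since the Vec<u8> makes no promise about interior nuls, nothing breaks on the Rust side even if the C code writes a 0 into the middle of the buffer; only the conversion back to a String would need to handle that.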

Out of curiosity, why do you need to pass *mut i8 to your C code? It's a little surprising to see that you're creating a string and expecting C code to modify it.

I've somehow missed this question. My guess – Rust is using u8 as it's kind of simpler than i8, and it's easier (mentally) to do bitwise math on unsigned integers. But regarding C, as I said, char is sometimes signed, sometimes unsigned. There's no "sometimes-signed-8-bits" type in Rust, so you can't copy C here. Instead, a type alias c_char is provided for compat.

By my reading of the linked forum post, the problem was a buffer overrun in C due to the lack of a \0 terminator. The only reason the u8 to i8 error was relevant is that it prompted the forum post, which allowed the underlying problem to be spotted.

@2e71828 Oh, thanks. I missed the fact that OP included a link. Yeah, the UB was because it was about casting &str → *char, not because of the i8/u8 difference. Let me un-mention ExpHP then :slight_smile:

Ok, thanks!

Thanks, will do so.

I am accessing a proprietary C API which takes char* on basic principles, which bindgen translates to *mut c_char.
I do not expect C to modify it, but I do not want to fix the C API for const-correctness, because I want to make a point of using Rust at my company when I am done - and the effect of having to "fix the C API" to create Rust bindings (a) will make them aware of what I am doing and (b) might feed the skeptics. :wink:

I'd just cast away the const and leave a // SAFETY: ... above explaining why you are doing it and why it is necessary. A lot of C code I've seen doesn't use const pointers so it's okay if your Rust is a little sloppier here. With unsafe and FFI, at some point you've got to assume that the C code is correct even if the type signature can't prove it.

CString (deliberately) doesn't have an as_mut_ptr() method, so if C is mutating things you can either a) ignore it and risk messing up CString's internal bookkeeping, or b) drop CString and use raw pointers and allocator methods directly (e.g. std::alloc::alloc()).

Rust does not assume anything about whether the target of a const pointer changes or not. What matters is where you got the raw pointer from — not its const/mut marker.

If you cast a &T to *const T, then you cannot modify things through that pointer. It is still perfectly sound to cast it to a *mut T as long as you don't modify the data.

Similarly, if you cast a &mut T to a *mut T, then you may modify it, and even if you cast it to a *const T, it is still valid to modify the target.

It is also perfectly sound to cast between i8 and u8, as they have compatible representations.
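
A small sketch of the two points above; the function names bump_through_const and first_as_i8 are made up for illustration. The first function writes through a pointer that was temporarily typed *const but originally came from a &mut. The second reinterprets a u8 as an i8 through a pointer cast.

```rust
/// Increments the value through a pointer that passed through `*const u32`.
fn bump_through_const(value: &mut u32) {
    let p_mut: *mut u32 = value;
    let p_const: *const u32 = p_mut as *const u32;
    // SAFETY: the pointer was derived from a `&mut u32`, so writing through it is
    // allowed, regardless of the `*const` marker it carried in between.
    unsafe {
        *(p_const as *mut u32) += 1;
    }
}

/// Reads the first byte of the slice as an `i8`, or returns None for an empty slice.
fn first_as_i8(bytes: &mut [u8]) -> Option<i8> {
    if bytes.is_empty() {
        return None;
    }
    let p: *mut i8 = bytes.as_mut_ptr().cast::<i8>();
    // SAFETY: the slice is non-empty, and u8 and i8 have the same size and alignment.
    Some(unsafe { *p })
}
```

Had the pointer in bump_through_const been created from a shared reference (&u32) instead, the write would be undefined behavior even though the types would look identical.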

To recap: assuming

extern "C" {
    /// Note: the C code got const-correctness wrong,
    /// the arg is actually a `*const c_char`
    fn some_c_function (_: *mut c_char, ... ) -> ...;
}

then, to call it with some Rust str that has no inner nulls, you'd do:

let input: &'_ str = "example";
let c_str = CString::new(input).expect("Got inner null byte!");
unsafe {
    // Safety: see the const-correctness comment above
    some_c_function(c_str.as_ptr() as *mut c_char)
}
  • this performs a cast from *const c_char to *mut c_char, which is only dangerous "lint-wise": a human may think it is thus safe to mutate the pointee, when it isn't! (If the C code uses the pointer to mutate the contents of the string, that would be Undefined Behavior.)

which is indeed a bit more cumbersome than just doing

some_c_function(c_str.as_ptr())

but this is the fault of the original C code not getting const-correctness right, and would be a problem from within C too.
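
A self-contained variant of the recap, as a sketch only: some_c_function_stand_in is a hypothetical Rust function playing the role of the C routine (assumed to read the string and never write to it), and call_read_only is a made-up wrapper name.

```rust
use std::ffi::CString;
use std::os::raw::{c_char, c_int};

/// Hypothetical stand-in for a C function declared as `int f(char *s)`.
/// It only reads the string and counts the bytes before the NUL.
unsafe extern "C" fn some_c_function_stand_in(s: *mut c_char) -> c_int {
    let mut n: c_int = 0;
    // SAFETY: the caller promises that `s` points to a NUL-terminated buffer.
    unsafe {
        while *s.add(n as usize) != 0 {
            n += 1;
        }
    }
    n
}

fn call_read_only(input: &str) -> c_int {
    let c_str = CString::new(input).expect("Got inner null byte!");
    // SAFETY: the callee takes `*mut c_char` but never writes through it,
    // and `c_str` stays alive until the call has returned.
    unsafe { some_c_function_stand_in(c_str.as_ptr() as *mut c_char) }
    // `c_str` is dropped here, so no manual reclaiming is needed.
}
```

Because the CString is only borrowed for the duration of the call, this avoids the into_raw() / from_raw() bookkeeping entirely.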

I've never used bindgen, and have always written my FFI declarations by hand. As far as I know, C's ABI doesn't actually make a distinction between const and non-const pointers; that's just convention at the language level.

As long as C doesn't attempt to write through the pointer, I wouldn't think twice about writing this C declaration:

int some_c_function(char* s);

in Rust as this:

extern "C" {
    fn some_c_function(_:*const c_char)->c_int;
}

Is there some problem with doing it this way?

The compiler cannot make an assumption on a *mut T, that is passed to a C function, but it could make the assumption on a *const T, that the source was not modified in the call to C:

let mut a: u32 = 5;
println!("before: {}", a);                  // a is read here
let a_ptr: *const u32 = &a;
unsafe {
    some_c_function(a_ptr as *mut c_int);
}
println!("after: {}", a);                  // does a need to be re-read?

Suppose a is not a u32, but some value buried in a nested data structure. It would make sense for the compiler to recognise that some_c_function() could not (as promised) have modified a, and to re-use the value from a previous read in the second println!().

That's a good point, thanks!

You might want to read @alice's reply again.

In your particular example, the raw pointer is created from an immutable reference, so the compiler can assume that the integer is not modified through that raw pointer. As I said previously, it is irrelevant whether the pointer is marked const or mut.

Oh - I did not know you could do that - changing the type qualifier in the function declaration!

I tried const-correctness in C, but IMO it does not really scale well. I chose bindgen - because you have to start somewhere - and I thought it might save conversion time during updates and for porting constant declarations.
For now I'll continue there, but I'll keep your trick in mind, thanks!

Oh, I did not catch that, I'm sorry. I need to be more careful. Thanks for clearing that up!

This is a common misconception and, strictly speaking, not correct. The C standard does not specify whether char is unsigned char or signed char; instead it defines char as a separate type (unlike int, which is indeed short for signed int), but implementations are permitted to use either of the other two types for it. The fact that it appears to be an i8 depends entirely on your system/libc implementation. Thus you should not specify the type as i8 but instead use libc::c_char or std::os::raw::c_char.

In particular (6.2.5—15, ISO/IEC 9899:TC3):

The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.

As characters are bitcodes and not used for arithmetic, what practical difference is there between specifying u8, i8, or c_char in the Rust-side declaration, other than ease of conversion? (Assuming you're only compiling for architectures with 8-bit chars)

I was just expanding on this earlier comment with a reference to the standard to provide more insight into its assertion that char is sometimes signed and sometimes unsigned. If you declare the function yourself, the choice has no semantic difference as all types have the same ABI. However, if you only use some interface declared to take a *const c_char then this will break on platforms where the libc implementation has a different choice. Using c_char isn't any more complicated since you do not want to do arithmetic but avoids this breakage.
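
To illustrate the portability point, a short sketch with made-up helper names (c_char_is_signed, to_c_chars): converting with `as c_char` compiles unchanged whether the platform's char is signed or unsigned, whereas hard-coding i8 would stop type-checking against a *const c_char interface on platforms where c_char is u8.

```rust
use std::os::raw::c_char;

/// Reports whether `c_char` is a signed type on the current target.
fn c_char_is_signed() -> bool {
    // MIN is 0 for u8 and -128 for i8.
    c_char::MIN != 0
}

/// Converts bytes to `c_char` without assuming a particular signedness.
fn to_c_chars(bytes: &[u8]) -> Vec<c_char> {
    // `as` keeps the bit pattern; only the interpretation of the top bit differs.
    bytes.iter().map(|&b| b as c_char).collect()
}
```

The result of c_char_is_signed() depends on the target, which is exactly why the alias exists.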

Edit: Please disregard what I originally wrote, clearly my mind is elsewhere.

Yes, just to clarify where I was quoted, it is about a buffer overrun due to a lack of a NUL byte, as others have suggested.

If you want you can give your function declaration a completely different signature (e.g. fn (usize, usize, *const u8) -> usize instead of fn(*const u8) -> u8) and the compiler has no way of knowing. You'll just get crashes and UB at runtime because Rust is passing around arguments completely differently to what the C code expects.

That's one of the (many) reasons why extern functions are unsafe to call.
