Converting String to *mut i8: Understanding the problem

merkosh · July 14, 2020, 9:00am

I am working on a Rust / C binding and have to convert between C strings of the type char* to String and back. Basically the situation on Stackoverflow or in the forum.

My intention here is to understand the problem, not just to make it work.

First: Why does Rust use u8 instead of i8, that would match the common C idiom of using char*?
I presume because u8 matches the underlying type, a text in utf-8. Signedness does not make much sense in the context of a text character - and I agree.

So we can go from String to Vec<u8>, borrow mutably and get a pointer to it, which always leaves us with *const u8 or *mut u8. However, C commonly uses char instead of unsigned char, which requires me to do something like:

use std::ffi::CString;

fn main() {
    let s = String::from("Hallo Welt!");
    let cs = CString::new(s).unwrap();
    let cv: Vec<u8> = cs.into_bytes_with_nul();
    let mut tmp: Vec<i8> = cv.into_iter().map(|c| c as i8).collect::<_>(); // line 7
    let _cptr: *mut i8 = tmp.as_mut_ptr();
}

To summarize:

String to CString to add the NUL termination
CString to Vec<u8> so we can iterate over bytes
Vec<u8> to Vec<i8> so we can get a char later; also need Vec to get a mutable reference to its contents

(I know there is CString::into_raw() -> *mut c_char, which does that all in one, but I'd have to reclaim the memory later, which does not work for me at the moment.)
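For reference, the into_raw() route mentioned above could look roughly like the sketch below. The helper name round_trip is made up for illustration, and the call into C is only indicated by a comment; the point is that the pointer has to come back through CString::from_raw so Rust can free it.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

/// Hypothetical helper: gives up ownership as a raw pointer, then reclaims it.
fn round_trip(s: &str) -> String {
    let cs = CString::new(s).expect("no interior NUL bytes");
    // Ownership moves into the raw pointer; Rust will not free it automatically.
    let raw: *mut c_char = cs.into_raw();
    // ... the pointer would be handed to the C API here ...
    // SAFETY: `raw` came from `CString::into_raw` and the string length
    // has not been changed in the meantime.
    let back = unsafe { CString::from_raw(raw) };
    back.into_string().expect("valid UTF-8")
}
```

If the pointer is never passed back to CString::from_raw, the allocation leaks, which is the "reclaim the memory later" cost.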

I understand that passing a *const T instead of *mut T is dangerous:
Rust may correctly presume that the contents are unchanged when in fact the C routine changed them (which it might do anyway, disregarding the const qualifier). Also, string literals might land in a read-only memory segment, so changing const to mut might segfault.

Second, however, why is a type cast from *mut u8 to *mut i8 dangerous in any way?

In the Forum post u/ExpHP writes it is undefined behavior. AFAICT "undefined behavior" is a matter of "the Rust compiler team defining it to be so" - and I am fine with that.

Why is line 7 from above better than:

use std::ffi::CString;

fn main() {
    let s = String::from("Hallo Welt!");
    let cs = CString::new(s).unwrap();
    let mut cv: Vec<u8> = cs.into_bytes_with_nul();
    let _cptr: *mut i8 = cv.as_mut_ptr() as *mut i8;      // typecast here!
}

What could possibly go wrong?

Thanks in advance!

krdln · July 14, 2020, 10:00am

The *mut u8 → *mut i8 cast itself seems OK; perhaps ExpHP had something different in mind? Rust does not do any TBAA (type-based alias analysis) like C does. What would be unsafe is obtaining a *mut i8 directly from the CString, as the C code may insert some nuls, which would break CString's assumption that there are no interior nuls. But when going through a vector, you don't have to guarantee the lack of nuls, so it seems OK.

By the way, you most likely want to cast to *mut c_char instead of *mut i8, just as CString::as_ptr does (see CString in std::ffi). char is i8 on some platforms, but u8 on others.

Out of curiosity, why do you need to pass *mut i8 to your C code? It's a little surprising to see that you're creating a string and expecting C code to modify it.

I've somehow missed this question. My guess – Rust is using u8 as it's kind of simpler than i8, and it's easier (mentally) to do bitwise math on unsigned integers. But regarding C, as I said, char is sometimes signed, sometimes unsigned. There's no "sometimes-signed-8-bits" type in Rust, so you can't copy C here. Instead, a type alias c_char is provided for compat.
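A minimal sketch of the vector route with a c_char cast, assuming the buffer only needs to live for the duration of the C call. The helper names to_c_buffer and len_seen_by_c are invented for this example, and CStr::from_ptr stands in for whatever the C side would do with the pointer.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical helper: a NUL-terminated byte buffer that C may modify.
fn to_c_buffer(s: &str) -> Vec<u8> {
    CString::new(s)
        .expect("no interior NUL bytes")
        .into_bytes_with_nul()
}

/// Hypothetical helper: reads the buffer back through a `*mut c_char`.
fn len_seen_by_c(buf: &mut Vec<u8>) -> usize {
    // Pointer cast only; no element-by-element copy into a Vec<i8> is needed.
    let ptr: *mut c_char = buf.as_mut_ptr() as *mut c_char;
    // SAFETY: `buf` is alive, NUL-terminated, and exclusively borrowed here.
    unsafe { CStr::from_ptr(ptr) }.to_bytes().len()
}
```

Because the pointer type is c_char rather than i8, the same code compiles on targets where char is unsigned.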

2e71828 · July 14, 2020, 10:09am

By my reading of the linked forum post, the problem was a buffer overrun in C due to the lack of a \0 terminator. The only reason the u8 to i8 error was relevant is that it prompted the forum post, which allowed the underlying problem to be spotted.

krdln · July 14, 2020, 10:12am

@2e71828 Oh, thanks. I've missed the fact that OP included a link. Yeah, the UB was about casting &str → *char, not about i8 vs. u8. Let me un-mention ExpHP then.

merkosh · July 14, 2020, 10:16am

Ok, thanks!

Thanks, will do so.

I am accessing a proprietary C API which takes char* on basic principles, which bindgen translates to *mut c_char.
I do not expect C to modify it, but I do not want to fix the C API for const-correctness, because I want to make a point of using Rust at my company when I am done - and the effect of having to "fix the C API" to create Rust bindings (a) will make them aware of what I am doing and (b) might feed the skeptics.

Michael-F-Bryan · July 14, 2020, 10:50am

I'd just cast away the const and leave a // SAFETY: ... above explaining why you are doing it and why it is necessary. A lot of C code I've seen doesn't use const pointers so it's okay if your Rust is a little sloppier here. With unsafe and FFI, at some point you've got to assume that the C code is correct even if the type signature can't prove it.

CString (deliberately) doesn't have an as_mut_ptr() method, so if C is mutating things you can either a) ignore it and risk messing up CString's internal bookkeeping, or b) drop CString and use raw pointers and allocator methods directly (e.g. std::alloc::alloc()).

alice · July 14, 2020, 11:12am

Rust does not assume anything about whether the target of a const pointer changes or not. What matters is where you got the raw pointer from — not its const/mut marker.

If you cast a &T to *const T, then you cannot modify things through that pointer. It is still perfectly sound to cast it to a *mut T as long as you don't modify the data.

Similarly, if you cast a &mut T to a *mut T, then you may modify it, and even if you cast it to a *const T, it is still valid to modify the target.

It is also perfectly sound to cast between i8 and u8, as they have compatible representations.
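A small sketch of both points, with made-up function names: the first function writes through a pointer that took a detour via *const but originated from &mut, and the second reads a u8 through an i8 pointer.

```rust
/// Writes through a pointer derived from `&mut`, even though it passes
/// through a `*const` on the way. The origin is what matters.
fn write_via_const_detour(x: &mut u8) {
    let p_mut: *mut u8 = x;
    let p_const: *const u8 = p_mut as *const u8;
    let p_again: *mut u8 = p_const as *mut u8;
    // SAFETY: the pointer originates from a `&mut u8`, so writing is allowed.
    unsafe { *p_again = 0xFF };
}

/// Reads a `u8` through an `i8` pointer; both types have the same size
/// and alignment, and every bit pattern is valid for both.
fn read_as_i8(x: &u8) -> i8 {
    let p = x as *const u8 as *const i8;
    // SAFETY: identical layout, and we only read.
    unsafe { *p }
}
```

The reverse of the first case (writing through a pointer that originated from &T) would not be sound, regardless of how the pointer is typed.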

Yandros · July 14, 2020, 1:04pm

To recap: assuming

extern "C" {
    /// Note: the C code got const-correctness wrong,
    /// the arg is actually a `*const c_char`
    fn some_c_function (_: *mut c_char, ... ) -> ...;
}

then, to call it with some Rust str that has no inner nulls, you'd do:

let input: &'_ str = "example";
let c_str = CString::new(input).expect("Got inner null byte!");
unsafe {
    // Safety: see the const-correctness comment above
    some_c_function(c_str.as_ptr() as *mut c_char)
}

This performs a cast from *const c_char to *mut c_char, which is only dangerous "lint-wise": a human may think it is thus safe to mutate the pointee, when it isn't! (If the C code uses the pointer to mutate the contents of the string, that would be Undefined Behavior.)

This is indeed a bit more cumbersome than just doing

some_c_function(c_str.as_ptr())

but this is the fault of the original C code not getting const-correctness right, and would be a problem from within C too.
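For anyone who wants to try the pattern without the real library, here is a self-contained sketch. The body of some_c_function is a stand-in written in Rust (an assumption made purely for illustration; the real function lives in C), and call_with_str is a made-up wrapper name.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::{c_char, c_int};

/// Hypothetical stand-in for the C routine. Like the real API it takes
/// `*mut c_char`, but it only reads the string and returns its length.
unsafe extern "C" fn some_c_function(s: *mut c_char) -> c_int {
    // SAFETY: the caller guarantees `s` points to a NUL-terminated string.
    let len = unsafe { CStr::from_ptr(s) }.to_bytes().len();
    len as c_int
}

/// Hypothetical safe wrapper around the call.
fn call_with_str(input: &str) -> c_int {
    let c_str = CString::new(input).expect("Got inner null byte!");
    // SAFETY: the callee never writes through the pointer, and `c_str`
    // stays alive until the call returns.
    unsafe { some_c_function(c_str.as_ptr() as *mut c_char) }
}
```

Keeping the cast and the SAFETY comment inside one small wrapper means the rest of the code base never has to see the *mut pointer.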

2e71828 · July 14, 2020, 1:31pm

I've never used bindgen, and have always written my FFI declarations by hand. As far as I know, C's ABI doesn't actually make a distinction between const and non-const pointers; that's just convention at the language level.

As long as C doesn't attempt to write through the pointer, I wouldn't think twice about writing this C declaration:

int some_c_function(char* s);

in Rust as this:

extern "C" {
    fn some_c_function(_:*const c_char)->c_int;
}

Is there some problem with doing it this way?

merkosh · July 14, 2020, 1:58pm

The compiler cannot make any assumption about a *mut T that is passed to a C function, but for a *const T it could assume that the pointee was not modified in the call to C:

let mut a: u32 = 5;
println!("before: {}", a);                  // a is read here
let a_ptr: *const u32 = &a;
unsafe {
    some_c_function(a_ptr as *mut c_int);
}
println!("after: {}", a);                  // does a need to be re-read?

Suppose a is not a u32 but some value buried in a nested data structure. It would make sense for the compiler to recognise that some_c_function() could not (as promised) have modified a, and to reuse the value from the previous read in the second println!().

That's a good point, thanks!

leudz · July 14, 2020, 2:08pm

You might want to read @alice's reply again.

alice · July 14, 2020, 2:11pm

In your particular example, the raw pointer is created from an immutable reference, so the compiler can assume that the integer is not modified through that raw pointer. As I said previously, it is irrelevant whether the pointer is marked const or mut.
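If the callee really does write, a sketch of the sound variant is to derive the raw pointer from a mutable reference instead. The function bump below is a hypothetical stand-in for the C side, written in Rust so the example is self-contained.

```rust
/// Hypothetical stand-in for a C function that modifies its argument.
unsafe extern "C" fn bump(p: *mut u32) {
    // SAFETY: the caller guarantees `p` is valid for reads and writes.
    unsafe { *p += 1 };
}

/// Creates the pointer from `&mut a`, so the write is permitted and the
/// compiler has to re-read `a` after the call.
fn bump_correctly() -> u32 {
    let mut a: u32 = 5;
    let a_ptr: *mut u32 = &mut a;
    // SAFETY: `a_ptr` comes from a live `&mut u32` and is not aliased.
    unsafe { bump(a_ptr) };
    a
}
```

With `let a_ptr: *const u32 = &a;` followed by a cast to *mut, the same write would be undefined behavior, because the pointer would then originate from an immutable reference.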

merkosh · July 14, 2020, 2:23pm

Oh - I did not know you could do that - changing the type qualifier in the function declaration!

I tried const-correctness in C, but IMO it does not really scale well. I chose bindgen - because you have to start somewhere - and I thought it might save conversion time during updates and for porting constant declarations.
For now I'll continue there, but I'll keep your trick in mind, thanks!

merkosh · July 14, 2020, 2:33pm

Oh, I did not catch that, I'm sorry. I need to be more careful. Thanks for clearing that up!

HeroicKatora · July 14, 2020, 2:53pm

This is a common misconception and, strictly speaking, not correct. The C standard does not specify whether char behaves like unsigned char or signed char; it defines char as a separate type (unlike int, which is indeed short for signed int), and implementations are permitted to give it the representation of either. The fact that it appears to be an i8 depends entirely on your system/libc implementation. Thus you should not specify the type as i8 but instead use libc::c_char or std::os::raw::c_char.

In particular (6.2.5—15, ISO/IEC 9899:TC3):

The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.
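On the Rust side this shows up as c_char being an alias for either i8 or u8, depending on the target. A rough sketch of how to check and how to convert without hard-coding i8 (both function names are invented for this example):

```rust
use std::os::raw::c_char;

/// Reports whether `c_char` is signed on the compilation target.
fn c_char_is_signed() -> bool {
    // `MIN` is 0 for `u8` and -128 for `i8`.
    c_char::MIN != 0
}

/// Converts bytes to the target's `c_char`, whichever type that is.
fn to_c_chars(bytes: &[u8]) -> Vec<c_char> {
    bytes.iter().map(|&b| b as c_char).collect()
}
```

Code written against c_char compiles unchanged on both kinds of target, whereas a hard-coded *mut i8 fails to type-check wherever bindgen or libc produce *mut u8.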

2e71828 · July 14, 2020, 3:00pm

As characters are bit codes and not used for arithmetic, what practical difference is there between specifying u8, i8, or c_char in the Rust-side declaration, other than ease of conversion? (Assuming you're only compiling for architectures with 8-bit chars.)

HeroicKatora · July 14, 2020, 3:06pm

I was just expanding on this earlier comment with a reference to the standard to provide more insight into its assertion that char is sometimes signed and sometimes unsigned. If you declare the function yourself, the choice has no semantic difference as all types have the same ABI. However, if you only use some interface declared to take a *const c_char then this will break on platforms where the libc implementation has a different choice. Using c_char isn't any more complicated since you do not want to do arithmetic but avoids this breakage.

ExpHP · July 14, 2020, 5:27pm

Edit: Please disregard what I originally wrote, clearly my mind is elsewhere.

Yes, just to clarify where I was quoted, it is about a buffer overrun due to a lack of a NUL byte, as others have suggested.

Michael-F-Bryan · July 15, 2020, 5:14am

If you want you can give your function declaration a completely different signature (e.g. fn (usize, usize, *const u8) -> usize instead of fn(*const u8) -> u8) and the compiler has no way of knowing. You'll just get crashes and UB at runtime because Rust is passing around arguments completely differently to what the C code expects.

That's one of the (many) reasons why extern functions are unsafe to call.

