So... I always figured that the reason CString::{into_raw,from_raw} exist (and are the only way to work with *mut c_char) is that foreign code might replace one of the interior characters with a NUL. I figured they did some evil trickery behind the scenes, like storing the original length of the allocation at a fixed negative offset from the pointer.
I dunno. I didn't really think much about it... until today, when I really began wondering why CString::from_vec_unchecked is unsafe, and why the from_raw documentation mentions that "the length" is recomputed. What length, I wondered? Surely not the length of the allocated byte slice?
pub unsafe fn from_raw(ptr: *mut c_char) -> CString {
    let len = sys::strlen(ptr) + 1; // Including the NUL byte
    let slice = slice::from_raw_parts_mut(ptr, len as usize);
    CString { inner: Box::from_raw(slice as *mut [c_char] as *mut [u8]) }
}
...yep. It recomputes the length of the allocated byte slice.
Which suggests to me that the following code is actually UB:
use ::std::ffi::CString;

fn main() {
    // Allocate a CString of 14 bytes (including the NUL)
    let ptr = CString::new(b"Hello, world!".to_vec()).unwrap().into_raw();

    // Give it to some C function which destructively
    // reads the string, inserting a NUL after "Hello,"
    unsafe { *ptr.offset(6) = 0; }

    // Recover the CString so rust can deallocate it
    let string = unsafe { CString::from_raw(ptr) };
    assert_eq!(string.as_bytes(), b"Hello,");

    // !!!!! To my understanding, this invokes UB! !!!!!
    // The allocator will be falsely told that the size is 7.
    drop(string);
}
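To make the mismatch concrete without actually triggering it, here is a minimal sketch that only uses safe code. It copies the bytes into a plain Vec<u8>, writes the same NUL, and compares the real allocation length with what a strlen-based recomputation would report. The helpers strlen_like and lengths_after_truncation are names I made up for illustration; they are not part of std.

```rust
use std::ffi::CString;

/// Hypothetical safe stand-in for `strlen`: bytes before the first NUL.
fn strlen_like(bytes: &[u8]) -> usize {
    bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len())
}

/// Returns (allocated length, length a strlen-based `from_raw` would
/// recompute) after a NUL has been written at index `cut`.
fn lengths_after_truncation(text: &str, cut: usize) -> (usize, usize) {
    let original = CString::new(text).unwrap();
    // Copy the bytes, including the trailing NUL, into a plain Vec so the
    // experiment never touches CString::from_raw.
    let mut buf = original.as_bytes_with_nul().to_vec();
    let allocated = buf.len();
    // Simulate the C function writing an interior NUL.
    buf[cut] = 0;
    let recomputed = strlen_like(&buf) + 1; // + 1 for the NUL itself
    (allocated, recomputed)
}

fn main() {
    let (allocated, recomputed) = lengths_after_truncation("Hello, world!", 6);
    // prints "allocated = 14, recomputed = 7"
    println!("allocated = {}, recomputed = {}", allocated, recomputed);
}
```

The two numbers disagree, which is exactly the size the allocator would be lied about in the into_raw/from_raw round trip above.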
It's right there in the docs: (edit: which you did mention)
Additionally, the length of the string will be recalculated from the pointer.
Part of the CString's contract is that there are no \0 bytes except the terminator. Perhaps this hazard should just be added to this method's # Safety documentation?
I figured the internal representation of CString was maybe something like
struct CString {
    data: Box<CStringInner>,
}

struct CStringInner {
    // Precomputed index of the first NUL byte, which is what
    // the end of the CString will appear to be for functions like as_bytes()
    effective_len: usize,
    // The allocated buffer
    data: [u8],
}
and that "the length" might be referring to something like effective_len there.
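For illustration, here is a rough runnable sketch of that idea. It is not how std implements CString; TrackedCBuf and all of its methods are hypothetical names, and I used a Box<[u8]> plus a separate field instead of an unsized struct to keep it short. The point is only that the allocation length lives in the Box, so interior NULs written by foreign code cannot change what the allocator is told.

```rust
use std::os::raw::c_char;

/// Hypothetical buffer that remembers its full allocation.
struct TrackedCBuf {
    /// Index of the first NUL byte; the "visible" end of the string.
    effective_len: usize,
    /// The allocated buffer; its length is never recomputed.
    data: Box<[u8]>,
}

fn first_nul(bytes: &[u8]) -> usize {
    bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len())
}

impl TrackedCBuf {
    /// Copies `bytes` and appends a terminating NUL.
    fn new(bytes: &[u8]) -> Self {
        let mut v = bytes.to_vec();
        v.push(0);
        let data = v.into_boxed_slice();
        let effective_len = first_nul(&data);
        TrackedCBuf { effective_len, data }
    }

    /// Pointer to lend to C; ownership stays on the Rust side.
    fn as_mut_ptr(&mut self) -> *mut c_char {
        self.data.as_mut_ptr() as *mut c_char
    }

    /// Re-scan for the first NUL after foreign code may have written one.
    fn refresh(&mut self) {
        self.effective_len = first_nul(&self.data);
    }

    fn as_bytes(&self) -> &[u8] {
        &self.data[..self.effective_len]
    }

    fn allocated_len(&self) -> usize {
        self.data.len()
    }
}

fn main() {
    let mut buf = TrackedCBuf::new(b"Hello, world!");
    let ptr = buf.as_mut_ptr();
    // Stand-in for a C function that writes a NUL after "Hello,"
    unsafe { *ptr.add(6) = 0; }
    buf.refresh();
    assert_eq!(buf.as_bytes(), b"Hello,");
    assert_eq!(buf.allocated_len(), 14);
    // Dropping is fine: the Box still knows the allocation is 14 bytes.
}
```

With a layout like this, "the length" being recalculated would only ever mean effective_len, never the size handed back to the allocator.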
Part of the CString's contract is that there are no \0 bytes except the terminator. Perhaps this hazard should just be added to this method’s # Safety documentation?
This "hazard" makes CString and into_raw practically useless!
It is quite difficult for me to come up with legitimate examples of C functions that:
Take a mutable char *
Do not have any code paths which may need to write a NUL to an interior byte
except, I guess, for something like void convert_ascii_to_uppercase(char *).
I'm saying that virtually every C function I can think of that takes non-const char * probably writes internal NUL bytes, and yet people will use CString::into_raw anyways because it's the only monomorphic function in all of std that returns a *mut c_char.[^1] That's seriously not good.
does it correctly but doesn't need *mut (could have used <&CStr>::as_ptr)
does it correctly but doesn't need *mut (however into_raw was indeed necessary as ownership was temporarily given to C)
seems correct (too many usages for me to check them all), but I'm not sure if it needs *mut. It seems to me that all commonmark API functions take const char *.
One that does need *mut c_char.... but as I predicted, it writes an interior NUL: