Hello. I have a project to learn Rust in which I am migrating a C program I've done some hacking on. I have done some easier projects in the past, and diving into the deep end is proving far more helpful this time around.
Here is what I came up with:
use std::env::args;
use std::ffi::{c_int, c_uchar, CString};

extern "C" {
    pub fn main_legacy(argc: c_int, argv: *const *const c_uchar) -> c_int;
}

fn main() {
    let args = args();
    let arg_bytes: Vec<_> = args
        .map(|arg| Vec::from(CString::new(arg).unwrap().as_bytes_with_nul()))
        .collect();
    let arg_ptrs: Vec<_> = arg_bytes.iter().map(|arg| arg.as_ptr()).collect();
    let argc = i32::try_from(arg_ptrs.len()).unwrap();
    unsafe {
        main_legacy(argc, arg_ptrs.as_ptr());
    }
}
I'm pretty much interested in all kinds of feedback that those here are kind enough to want to give. This is the first time I have written code (despite many many years programming) where I worked primarily from API documentation.
Also! One thing I was trying to make sense of was why doing into_iter() above caused the pointer to point at random garbage.
Also also! I'm assuming that I should have used the question mark operator plus a main return type of Result instead of unwrapping?
Also also also!! Am I making two copies of heap memory in the above? First when creating a new CString then when creating a new Vec?
AFAIR, calling the main of an arbitrary C executable is a large footgun because you do not initialize the runtime properly. But I am no expert in this regard.
Here is what I would write:
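// Needs `use std::env::args_os;` in addition to the imports above.
let cstring_args: Vec<CString> = args_os()
    .map(|os_str| CString::new(os_str.into_encoded_bytes()))
    .collect::<Result<_, _>>()
    .expect("nul in args?");
let argc = i32::try_from(cstring_args.len()).unwrap();
// The pointers borrow from `cstring_args`, which must outlive the call into C.
let argv: Vec<*const c_uchar> = cstring_args
    .iter()
    .map(|cstr| cstr.as_ptr().cast())
    .collect();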
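On the question about `?` and a `Result` return type: the snippet above can be moved into a function that propagates errors instead of panicking. The following is only a sketch under a few assumptions: `run_legacy` and `main_legacy_stub` are hypothetical names, the stub stands in for the real C function so that the example builds without the C object file, and the trailing null pointer is added because C programs may rely on `argv[argc]` being null.

```rust
use std::convert::TryFrom; // already in the prelude since edition 2021
use std::error::Error;
use std::ffi::{c_int, c_uchar, CStr, CString, OsString};

/// Hypothetical stand-in for the real `main_legacy`, so the sketch builds
/// without the C object file. It returns the summed length of all arguments.
unsafe extern "C" fn main_legacy_stub(argc: c_int, argv: *const *const c_uchar) -> c_int {
    let mut total = 0;
    for i in 0..argc as usize {
        // SAFETY: the caller passes `argc` valid, nul-terminated strings.
        let arg = unsafe { CStr::from_ptr((*argv.add(i)).cast()) };
        total += arg.to_bytes().len() as c_int;
    }
    total
}

/// Converts the arguments and calls the (stubbed) legacy entry point.
pub fn run_legacy(args: impl IntoIterator<Item = OsString>) -> Result<c_int, Box<dyn Error>> {
    // `?` returns early with the `NulError` if an argument contains a nul byte.
    let owned: Vec<CString> = args
        .into_iter()
        .map(|arg| CString::new(arg.into_encoded_bytes()))
        .collect::<Result<_, _>>()?;
    // `?` again, this time for a `TryFromIntError`; `Box<dyn Error>` accepts both.
    let argc = c_int::try_from(owned.len())?;
    let mut argv: Vec<*const c_uchar> = owned.iter().map(|s| s.as_ptr().cast()).collect();
    // C guarantees that `argv[argc]` is a null pointer, so mirror that here.
    argv.push(std::ptr::null());
    // SAFETY: `owned` outlives the call, so every pointer in `argv` is valid.
    Ok(unsafe { main_legacy_stub(argc, argv.as_ptr()) })
}
```

With `fn main() -> Result<(), Box<dyn Error>>`, the call site would be `run_legacy(std::env::args_os())?;`. Keeping the `CString` values alive, rather than copying `as_bytes_with_nul()` into a fresh `Vec` as in the original code, also avoids the second heap copy asked about above.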
Alternatively you can leak the memory and save one allocation:
let argv: Vec<*const c_uchar> = args_os()
    .map(|os_str| CString::new(os_str.into_encoded_bytes()).expect("nul in args?"))
    .map(|cstring| cstring.into_raw().cast_const().cast())
    .collect();
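If the leak ever matters (for example when the conversion runs repeatedly in tests), pointers obtained from `CString::into_raw` can be handed back to `CString::from_raw`, which frees them on drop. A minimal sketch, assuming hypothetical helper names `leak_args` and `reclaim_args`:

```rust
use std::ffi::{c_char, CStr, CString, OsString};

/// Leaks every argument as a raw, nul-terminated C string.
pub fn leak_args(args: impl IntoIterator<Item = OsString>) -> Vec<*mut c_char> {
    args.into_iter()
        .map(|arg| {
            CString::new(arg.into_encoded_bytes())
                .expect("nul in args?")
                .into_raw()
        })
        .collect()
}

/// Takes ownership of the leaked strings again so they are freed on drop.
///
/// # Safety
/// Every pointer must come from `CString::into_raw`, must not have been
/// reclaimed before, and the string's length must not have been changed.
pub unsafe fn reclaim_args(ptrs: Vec<*mut c_char>) -> Vec<CString> {
    ptrs.into_iter()
        // SAFETY: guaranteed by the caller, see the function's safety section.
        .map(|ptr| unsafe { CString::from_raw(ptr) })
        .collect()
}
```

For a program that calls the C `main` once and then exits, leaking is usually harmless, since the operating system reclaims the memory anyway.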
Also! One thing I was trying to make sense of was why doing into_iter() above caused the pointer to point at random garbage.
Because into_iter() consumes the vector: each element is moved into the closure and dropped as soon as the closure returns, i.e. you are taking a pointer and freeing the memory it points to directly afterwards. With iter() the closure only borrows the elements, which stay alive inside arg_bytes. Sadly, Rust has no zero-cost safe wrapper around the C main signature, because &CStr is a fat pointer, i.e. it carries the length of the string in bytes as extra information. Otherwise you could define a safe wrapper like the one below, and the borrow checker would have caught your use-after-free with into_iter():
fn main_wrapper(args: &[&CStr]) -> c_int {
    unsafe {
        // SAFETY: Not safe because CStr is a fat ptr :(
        return main_legacy(args.len().try_into().unwrap(), args.as_ptr());
    }
}
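To make the into_iter() versus iter() difference concrete, here is a small sketch with two hypothetical helpers. Both compile, because creating a dangling raw pointer is allowed in safe Rust; only dereferencing it is undefined behaviour, which is why the compiler stayed silent in the original code.

```rust
/// Mirrors the `into_iter()` variant: every returned pointer is dangling.
pub fn ptrs_via_into_iter(arg_bytes: Vec<Vec<u8>>) -> Vec<*const u8> {
    arg_bytes
        .into_iter()
        // `arg` is an owned `Vec<u8>` here. It is dropped, and its buffer
        // freed, when the closure returns, right after `as_ptr()` was taken.
        .map(|arg| arg.as_ptr())
        .collect()
}

/// Mirrors the `iter()` variant: the pointers stay valid for as long as the
/// caller keeps `arg_bytes` alive and does not modify it.
pub fn ptrs_via_iter(arg_bytes: &[Vec<u8>]) -> Vec<*const u8> {
    arg_bytes
        .iter()
        // `arg` is a `&Vec<u8>` here, so nothing is dropped.
        .map(|arg| arg.as_ptr())
        .collect()
}
```

The pointers returned by `ptrs_via_into_iter` must never be dereferenced; the function exists only to show where the buffers are freed.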
Thank you so much for your response! I may break up my questions into multiple posts as I have time to post, if that's okay.
I was worried about the into functions as I read you'd have to remember to call from again to make sure they get cleaned up, and honestly I was afraid of extra clutter/things to remember. Passing ownership to CString immediately didn't dawn on me!
I'm curious how into functions work. I read the ownership chapters of the book, but am still not sure how a structure such as OsString is giving up its own ownership! Where can I read more about that?
Also is there a guide somewhere regarding the std naming conventions like when "into_" is used versus "as_"? And now I'm noticing there's a "cast".
Also also, I know what a fat pointer is (a pointer with additional sizing information..?), but how does that influence things such that safe code can't be written?
Thank you again, this is more helpful than you could believe!!!
I'm curious how into functions work. I read the ownership chapters of the book, but am still not sure how a structure such as OsString is giving up its own ownership! Where can I read more about that?
Methods that take self by value instead of by reference consume the object. They are typically called into_* if they return a new value. There is a style guide which also covers naming conventions: Naming conventions and Conversions [Rust issue #7087]. There is also a Clippy lint. These are extremely useful, especially for beginners!
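As an illustration of the convention, here is a sketch with a hypothetical `Wrapper` type (not from the standard library). The receiver type is what decides whether the caller keeps the value: `&self` borrows, `self` moves.

```rust
/// Hypothetical type used only to illustrate the naming convention.
pub struct Wrapper {
    inner: Vec<u8>,
}

impl Wrapper {
    /// `as_`: cheap borrowed view, the wrapper stays usable.
    pub fn as_slice(&self) -> &[u8] {
        &self.inner
    }

    /// `to_`: produces a new owned value (here by cloning), the wrapper stays usable.
    pub fn to_vec(&self) -> Vec<u8> {
        self.inner.clone()
    }

    /// `into_`: takes `self` by value, so the wrapper is consumed by the call.
    pub fn into_inner(self) -> Vec<u8> {
        self.inner
    }
}

pub fn naming_demo() -> (usize, Vec<u8>, Vec<u8>) {
    let w = Wrapper { inner: b"argv".to_vec() };
    let len = w.as_slice().len();
    let copy = w.to_vec();
    let inner = w.into_inner();
    // w.as_slice(); // would not compile: `w` was moved by `into_inner`
    (len, copy, inner)
}
```

`OsString::into_encoded_bytes` follows the same pattern: it takes `self`, so the `OsString` is moved into the call and its buffer is reused for the returned `Vec<u8>`.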
Also also, I know what a fat pointer is (a pointer with additional sizing information..?), but how does that influence things such that safe code can't be written?
If &CStr were a thin pointer, it would have the same binary representation as *const c_uchar, which means that a slice &[&CStr] would be equivalent to a C array of these pointers. But since &CStr is basically (*const (), usize), this is not possible. I was wrong that there is no zero-cost wrapper. There is just none in the standard library:
use std::ffi::{c_char, CString};

/// Thin owner of a nul-terminated string
#[repr(transparent)]
struct ThinCString(*mut c_char); // c_char rather than i8: it is u8 on some targets

impl From<CString> for ThinCString {
    fn from(cstr: CString) -> Self {
        Self(cstr.into_raw())
    }
}

impl Drop for ThinCString {
    fn drop(&mut self) {
        drop(unsafe {
            // SAFETY: pointer was created from CString::into_raw
            CString::from_raw(self.0)
        })
    }
}

impl ThinCString {
    pub fn as_raw_slice(s: &[Self]) -> &[*const c_char] {
        let len = s.len();
        let ptr: *const *const c_char = s.as_ptr().cast();
        unsafe {
            // SAFETY: ThinCString is a transparent wrapper
            core::slice::from_raw_parts(ptr, len)
        }
    }
}
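The fat-versus-thin difference can be checked directly with `size_of`. This is a sketch that reflects current toolchains only: the standard library documentation describes the layout of `&CStr` as an implementation detail that may change. `Thin` is a hypothetical stand-in for the wrapper above.

```rust
use std::ffi::{c_char, CStr};
use std::mem::size_of;

/// Hypothetical transparent wrapper around a raw C string pointer.
#[repr(transparent)]
pub struct Thin(pub *mut c_char);

/// Returns the sizes in bytes of a raw pointer, the wrapper, and `&CStr`.
pub fn pointer_sizes() -> (usize, usize, usize) {
    (
        size_of::<*const c_char>(), // what C expects for each argv entry
        size_of::<Thin>(),          // same size, thanks to repr(transparent)
        size_of::<&CStr>(),         // currently pointer plus length
    )
}
```

On current releases the third value is twice the first, which is why a `&[&CStr]` cannot be passed where C expects an array of `char` pointers, while a slice of the transparent wrapper can.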
Another useful tool besides Clippy is Miri. It is an interpreter that detects many kinds of undefined behaviour and helps a lot when writing unsafe code.
I jump around in the rust book when something takes my interest. Same with the standard library. The reference too, except I quickly find myself lost there with terminology I don't yet get.