FFI - Creating a &[u8] Slice from a const char*

I'm trying to expose my Rust code for external use as a C library.

The official documentation on CStr states:

Note: This operation is intended to be a 0-cost cast but it is currently implemented with an up-front calculation of the length of the string. This is not guaranteed to always be the case.

So this might result in a big performance loss on large inputs.
(I'm working with big log files of > 100 MB of text.)

I extended the example at:


to build this prototype:
#![allow(unused)]

use std::ffi::CStr;

#[repr(C)]
pub struct Foo<'foolife> {
    stext: &'foolife str,
}

#[no_mangle]
#[allow(improper_ctypes_definitions)]
pub extern "C" fn foo_new(pstext: *const [u8], itextlen: u32) -> Box<Foo<'static>> {
    let pstxt = Box::new(pstext);
    let cstr = unsafe { CStr::from_bytes_with_nul_unchecked(*pstxt.as_ref()) };

    match cstr.to_str() {
        Ok(s) => {
            // Here `s` is regular `&str` and we can work with it
            Box::new(Foo { stext: s })
        }
        Err(_) => {
            // handle the error
            Box::new(Foo { stext: &"" })
        }
    }
}

#[no_mangle]
#[allow(improper_ctypes_definitions)]
pub extern "C" fn foo_delete(_: Option<Box<Foo>>) {}

fn main() {}

In most C libraries the convention is:

foo_t* foo_new(const char* pstext, uint32_t itextlen);

So even if pstext points to 100 MB of text, only the first itextlen bytes are taken into account and no length calculation is needed.

In Rust this would correspond to an &[u8] slice.

But now my problem is that I can't convert a *const [u8] into a &[u8].
The compilation fails with:

error[E0308]: mismatched types
  --> src/main.rs:14:61
   |
14 |     let cstr = unsafe { CStr::from_bytes_with_nul_unchecked(*pstxt.as_ref()) };
   |                                                             ^^^^^^^^^^^^^^^ expected `&[u8]`, found *-ptr
   |
   = note: expected reference `&[u8]`
            found raw pointer `*const [u8]`

I could not find any information on this use case, so I appreciate any advice.

Only a cursory glance, but:

  1. you need extern "C" fn foo_new(pstext: *const u8, itextlen: u32) (without [])
  2. you first need to convert (*const u8, u32) into a &[u8] via https://doc.rust-lang.org/std/slice/fn.from_raw_parts.html, then you convert that slice into a CStr.

Thank you very much!
The example for std::slice::from_raw_parts() at:


was just what I was trying to do, and it results in the correct code:

use std::slice;

#[no_mangle]
#[allow(improper_ctypes_definitions)]
pub extern "C" fn foo_new(pstext: *const u8, itextlen: u32) -> Box<Foo<'static>> {
    // Build the slice directly from the raw pointer and the explicit length - no length scan needed
    let slice = unsafe { slice::from_raw_parts(pstext, itextlen as usize) };
    let cstr = unsafe { CStr::from_bytes_with_nul_unchecked(slice) };

    match cstr.to_str() {
        Ok(s) => {
            // Here `s` is regular `&str` and we can work with it
            Box::new(Foo { stext: s })
        }
        Err(_) => {
            // handle the error
            Box::new(Foo { stext: "" })
        }
    }
}

You may be interested in using:

You only need to use CStr if the char* is meant to be interpreted as a string. The comment about not being a straightforward cast is because CStr is a dynamically sized type and we use strlen() to calculate the length up front.

Normally when I'd use const char *text, int length in C it's meant to be a bunch of bytes. In that case the Rust equivalent would be written as text: *const u8, length: c_int and converted to a &[u8] via std::slice::from_raw_parts().
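
For instance, something along these lines (the function name and what it does are just made-up examples):

use std::os::raw::c_int;
use std::slice;

// Hypothetical FFI entry point: `text` points to `length` bytes that are
// "just a bunch of bytes" - not necessarily NUL-terminated or valid UTF-8.
#[no_mangle]
pub extern "C" fn count_newlines(text: *const u8, length: c_int) -> c_int {
    // Safety: the caller must guarantee that `text` is valid for `length` bytes.
    let bytes = unsafe { slice::from_raw_parts(text, length as usize) };
    bytes.iter().filter(|&&b| b == b'\n').count() as c_int
}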

What is mostly referred to as "calculate the length"
strlen() - man page
means in practice running through the memory looking for the NUL byte (\0).
rawmemchr() - example
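
In Rust terms, that scan amounts to something like this (a simplified sketch of what strlen() does, not the actual libc implementation):

unsafe fn c_strlen(s: *const u8) -> usize {
    // Walk through memory byte by byte until the NUL terminator is found;
    // the cost is linear in the length of the string.
    let mut len = 0;
    while *s.add(len) != 0 {
        len += 1;
    }
    len
}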

So, it seems there are different understandings of C language best practices, because the gnutls library defined a dedicated type to store the length together with the string: the gnutls_datum_t structure.
gnutls_datum_t structure definition

typedef struct {
	unsigned char *data;
	unsigned int size;
} gnutls_datum_t;
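
A Rust-side counterpart of such a (pointer, length) pair would be a #[repr(C)] struct along these lines (just a sketch, not taken from any existing gnutls binding):

use std::os::raw::c_uint;
use std::slice;

// Mirror of gnutls_datum_t: a raw pointer plus an explicit length.
#[repr(C)]
pub struct Datum {
    data: *mut u8,
    size: c_uint,
}

impl Datum {
    // View the data as a byte slice; no length calculation is needed.
    // Safety: `data` must be valid for `size` bytes for as long as the slice is used.
    pub unsafe fn as_bytes(&self) -> &[u8] {
        slice::from_raw_parts(self.data, self.size as usize)
    }
}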

The same practice can be found in the Boyer-Moore search algorithm implementation:
Boyer Moore Search Algorithm

uint8_t* boyer_moore (uint8_t *string, size_t stringlen, uint8_t *pat, size_t patlen);

As I explained, this makes a lot of sense, since reading through the whole string just to find its end is expensive on large strings.
In my use case I have to deal with strings of > 100 MB, where you will notice any performance issue right away.

I would love to provide performance test data comparing std::slice::from_raw_parts() and CStr::from_ptr(), but I have not progressed that far yet.

Note that a Rust CStr contains the terminating Nul byte (and disallows Nul bytes in any other position). It wasn't clear to me from your phrasing if this was understood or not.

First, if this is all you're doing with the CStr, there's no real point in constructing it. You can just use from_utf8 on the &[u8] slice. This is all that CStr::to_str() is doing under the hood, and it removes the Nul byte requirements mentioned above.

Second, while your two let statements forgo the validity checks to avoid a linear scan of the data, from_utf8() and thus to_str() also do a linear scan, for UTF-8 correctness.

If you need the &str and can't be sure the data is valid, you should keep the checks. But perhaps you would find the bstr crate useful for working with byte slices in a more String-like fashion instead.
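
Applied to your foo_new above, that would look roughly like this (a sketch reusing your Foo struct, skipping CStr and keeping the UTF-8 check):

use std::slice;
use std::str;

#[no_mangle]
#[allow(improper_ctypes_definitions)]
pub extern "C" fn foo_new(pstext: *const u8, itextlen: u32) -> Box<Foo<'static>> {
    let slice = unsafe { slice::from_raw_parts(pstext, itextlen as usize) };

    // Go straight from the byte slice to &str, without a CStr in between.
    // The UTF-8 validity check (a linear scan) is kept because the data
    // comes from C and cannot be trusted to be valid UTF-8.
    match str::from_utf8(slice) {
        Ok(s) => Box::new(Foo { stext: s }),
        Err(_) => Box::new(Foo { stext: "" }),
    }
}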

I looked at your reference:
string concat() example
and got curious with regard to our subject. How is the let fst = fst.to_str(); implemented?
How do you build the char_p::Ref?

The short story is that byte strings are useful when it is inconvenient or incorrect to require valid UTF-8.

I have to laugh, because ever since I started with Rust I have been struggling with the strict requirement: "be UTF-8 or EXIT"
:laughing:
I feel that some Rust developers have come down the same road as me.

But then I'm concerned: when a ByteSlice is fed into the Rust ecosystem, wouldn't it need to do the validation check (a linear scan) anyway?

Here's a link to the source code:

It's probably easier for you to read through the source code than asking people for a second-hand explanation of what they think is going on.

String handling in C is a big mess, so there aren't any real best practices you can follow.

Most libraries will implement their own string type (see gnutls_datum_t, Qt's QString, GTK's GString, etc.) once they reach a certain size. Some libraries will just pass around a pointer and int as two separate function parameters, and others will use null-terminated strings.

:face_with_raised_eyebrow:

Sure, a String (or str) is designed to only contain UTF-8 bytes, but there is no requirement to crash the program when you try to convert non-UTF-8 bytes to a String.

That's a decision the developer explicitly made when they used unwrap() instead of handling the error gracefully (e.g. by reporting to the user that the data they entered was invalid).
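
For example, the difference is simply whether the Result of the conversion is unwrapped or handled (the function name here is made up):

// Hypothetical example: handle invalid UTF-8 instead of panicking on it.
fn report_text(bytes: Vec<u8>) {
    // Using `.unwrap()` here would panic on invalid UTF-8; matching on the
    // Result reports the problem to the user instead.
    match String::from_utf8(bytes) {
        Ok(text) => println!("got text: {}", text),
        Err(e) => eprintln!("input was not valid UTF-8: {}", e),
    }
}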

@quinedot
@Michael-F-Bryan
The BStr structure fits perfectly into the FFI use case:
BStr representation

A &BStr has the same representation as a &str. That is, a &BStr is a fat pointer which consists of a pointer to some bytes and a length.

If I could construct it from a *const u8 with its length as usize, like the std::slice, it would even be my first choice.

It looks like BStr implements From<&[u8]>, so you can just use into() to convert from a &[u8] to a &BStr.

use bstr::BStr;
use std::os::raw::c_int;

unsafe fn some_function(data: *const u8, length: c_int) {
    let slice = std::slice::from_raw_parts(data, length as usize);
    // Free conversion: &[u8] -> &BStr via the From<&[u8]> impl
    let b_str: &BStr = slice.into();
}

You might want to look at my former discussion to get an idea of what it involves not to just EXIT, as is common practice:
Proper UTF-8 Error Handling
The example use case also shows why it is a real requirement in the real world.

I feel like that was a bit of a straw man argument.

If you are writing a web spider then you know you need to deal with malformed data. Instead of blindly calling unwrap(), you'll use something which can handle malformed UTF-8 like String::from_utf8_lossy().

You might want to look at my analysis of String::from_utf8_lossy()
and why it is not desirable:
String::from_utf8_lossy() - analysis

Therefore, using String::from_utf8_lossy() is counter-productive, because it deletes the original byte and always replaces it with the same value, U+FFFD.

When I use it, I will lose the data that I want to work with and inspect in Rust.
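
A small demonstration of that data loss (the byte values are just an arbitrary example):

fn main() {
    // 0xE9 is "é" in Latin-1, but on its own it is not valid UTF-8.
    let bytes = vec![b'c', b'a', b'f', 0xE9];

    let lossy = String::from_utf8_lossy(&bytes);
    // The original byte 0xE9 is gone - it was replaced by U+FFFD,
    // so the information needed to inspect or repair it is lost.
    assert_eq!(lossy, "caf\u{FFFD}");
    println!("{}", lossy);
}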

If you want to work with non-UTF-8 data, don't use a UTF-8 string. We have types like Vec<u8> or the bstr crate discussed above.


Yeah, actually that was my initial idea when writing the text-sanitizer:
text-sanitizer on playground
as its idea was to inspect the byte values of the text in order to convert it into ASCII code.

But at some point in your project you will always reach the point where you need to convert it to str or String, to be able to print it as text or to interact with the Rust ecosystem.
Then the application crashes or exits if you haven't sanitized it correctly yet.
As this was the starting point of the former discussion ...

Now the new BStr seems worth studying, to see in which way it can shorten the "long way round" of UTF-8 error handling.

A use case where the length of the string is actually calculated can be found in the Oniguruma regex engine:
Oniguruma API definition

int onig_new(regex_t** reg, const UChar* pattern, const UChar* pattern_end,
            OnigOptionType option, OnigEncoding enc, OnigSyntaxType* syntax,
            OnigErrorInfo* err_info)

  Create a regex object.

  normal return: ONIG_NORMAL

  arguments
  1 reg:         return regex object's address.
  2 pattern:     regex pattern string.
  3 pattern_end: terminate address of pattern. (pattern + pattern length)
  4 option:      compile time options.

where obviously pattern_length = pattern_end - pattern;

That would be the fast way to actually calculate the length of the string.
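
On the Rust side of such an API, the length would be recovered from the two pointers by simple pointer arithmetic, for example (just a sketch):

use std::slice;

// Build a byte slice from a (start, end) pointer pair, as in the
// Oniguruma-style API above: pattern_length = pattern_end - pattern.
// Safety: both pointers must point into the same allocation and
// `end` must not be before `start`.
unsafe fn bytes_from_range<'a>(start: *const u8, end: *const u8) -> &'a [u8] {
    let len = end.offset_from(start) as usize;
    slice::from_raw_parts(start, len)
}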