FFI - Creating a &[u8] Slice from a const char*

I'm trying to expose my Rust code for external use as a C library.

The official documentation on CStr states:

Note: This operation is intended to be a 0-cost cast but it is currently implemented with an up-front calculation of the length of the string. This is not guaranteed to always be the case.

So this might result in a big performance loss on large inputs.
(I'm working with big log files of > 100 MB of text.)

I extended the example at:


to build this prototype:
#![allow(unused)]

use std::ffi::CStr;

#[repr(C)]
pub struct Foo<'foolife> {
    stext: &'foolife str,
}

#[no_mangle]
#[allow(improper_ctypes_definitions)]
pub extern "C" fn foo_new(pstext: *const [u8], itextlen: u32) -> Box<Foo<'static>> {
    let pstxt = Box::new(pstext);
    let cstr = unsafe { CStr::from_bytes_with_nul_unchecked(*pstxt.as_ref()) };

    match cstr.to_str() {
        Ok(s) => {
            // Here `s` is regular `&str` and we can work with it
            Box::new(Foo { stext: s })
        }
        Err(_) => {
            // handle the error
            Box::new(Foo { stext: &"" })
        }
    }
}

#[no_mangle]
#[allow(improper_ctypes_definitions)]
pub extern "C" fn foo_delete(_: Option<Box<Foo>>) {}

fn main() {}

In most C libraries the convention is:

foo_t* foo_new(const char* pstext, uint32_t itextlen);

So even if pstext points to 100 MB of text, only the first itextlen bytes are taken into account and no length calculation is needed.

In Rust this would correspond to an &[u8] slice.

But now my problem is that I can't convert a *const [u8] into a &[u8].
The compilation fails with:

error[E0308]: mismatched types
  --> src/main.rs:14:61
   |
14 |     let cstr = unsafe { CStr::from_bytes_with_nul_unchecked(*pstxt.as_ref()) };
   |                                                             ^^^^^^^^^^^^^^^ expected `&[u8]`, found *-ptr
   |
   = note: expected reference `&[u8]`
            found raw pointer `*const [u8]`

I could not find any information on this use case, so I appreciate any advice.

Only a cursory glance, but:

  1. you need extern "C" fn foo_new(pstext: *const u8, itextlen: u32) (without [])
  2. you first need to convert (*const u8, u32) into a &[u8] via https://doc.rust-lang.org/std/slice/fn.from_raw_parts.html, then you convert that slice into a CStr.

Thank you very much!
The example for std::slice::from_raw_parts() at:


was just what I was trying to do, and it results in the correct code:

use std::slice;

#[no_mangle]
#[allow(improper_ctypes_definitions)]
pub extern "C" fn foo_new(pstext: *const u8, itextlen: u32) -> Box<Foo<'static>> {
    // Build the slice directly from the raw pointer and the explicit length - no length scan needed
    let slice = unsafe { slice::from_raw_parts(pstext, itextlen as usize) };
    let cstr = unsafe { CStr::from_bytes_with_nul_unchecked(slice) };

    match cstr.to_str() {
        Ok(s) => {
            // Here `s` is regular `&str` and we can work with it
            Box::new(Foo { stext: s })
        }
        Err(_) => {
            // handle the error
            Box::new(Foo { stext: "" })
        }
    }
}

You may be interested in using:

You only need to use CStr if the char* is meant to be interpreted as a string. The comment about not being a straightforward cast is because CStr is a dynamically sized type and we use strlen() to calculate the length up front.

Normally when I'd use const char *text, int length in C it's meant to be a bunch of bytes. In that case the Rust equivalent would be written as text: *const u8, length: c_int and converted to a &[u8] via std::slice::from_raw_parts().
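
For instance, something along these lines (the function name and what it does are just made-up examples):

use std::os::raw::c_int;
use std::slice;

// Hypothetical FFI entry point: `text` points to `length` bytes that are
// "just a bunch of bytes" - not necessarily NUL-terminated or valid UTF-8.
#[no_mangle]
pub extern "C" fn count_newlines(text: *const u8, length: c_int) -> c_int {
    // Safety: the caller must guarantee that `text` is valid for `length` bytes.
    let bytes = unsafe { slice::from_raw_parts(text, length as usize) };
    bytes.iter().filter(|&&b| b == b'\n').count() as c_int
}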

What is mostly referred to as "calculate the length"
strlen() - man page
means in practice running through the memory looking for the NUL byte (\0).
rawmemchr() - example
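
In Rust terms, that scan amounts to something like this (a simplified sketch of what strlen() does, not the actual libc implementation):

unsafe fn c_strlen(s: *const u8) -> usize {
    // Walk through memory byte by byte until the NUL terminator is found;
    // the cost is linear in the length of the string.
    let mut len = 0;
    while *s.add(len) != 0 {
        len += 1;
    }
    len
}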

So, it seems there are different understandings of C language best practices, because the gnutls library defined a dedicated type to store the length together with the string: the gnutls_datum_t structure.
gnutls_datum_t structure definition

typedef struct {
	unsigned char *data;
	unsigned int size;
} gnutls_datum_t;
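
A Rust-side counterpart of such a (pointer, length) pair would be a #[repr(C)] struct along these lines (just a sketch, not taken from any existing gnutls binding):

use std::os::raw::c_uint;
use std::slice;

// Mirror of gnutls_datum_t: a raw pointer plus an explicit length.
#[repr(C)]
pub struct Datum {
    data: *mut u8,
    size: c_uint,
}

impl Datum {
    // View the data as a byte slice; no length calculation is needed.
    // Safety: `data` must be valid for `size` bytes for as long as the slice is used.
    pub unsafe fn as_bytes(&self) -> &[u8] {
        slice::from_raw_parts(self.data, self.size as usize)
    }
}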

The same practice can be found in the Boyer-Moore search algorithm implementation:
Boyer Moore Search Algorithm

uint8_t* boyer_moore (uint8_t *string, size_t stringlen, uint8_t *pat, size_t patlen);

As I explained, this makes a lot of sense, since reading through the whole string just to find its end is expensive on large strings.
In my use case I have to deal with strings of > 100 MB, where you will notice any performance issue right away.

I would love to provide performance test data comparing std::slice::from_raw_parts() and CStr::from_ptr(), but I have not progressed that far yet.

Note that a Rust CStr contains the terminating Nul byte (and disallows Nul bytes in any other position). It wasn't clear to me from your phrasing if this was understood or not.

First, if this is all you're doing with the CStr, there's no real point in constructing it. You can just use from_utf8 on the &[u8] slice. This is all that CStr::to_str() is doing under the hood, and it removes the Nul byte requirements mentioned above.

Second, while your two let statements forgo the validity checks to avoid a linear scan of the data, from_utf8() and thus to_str() also do a linear scan, for UTF-8 correctness.

If you need the &str and can't be sure the data is valid, you should keep the checks. But perhaps you would find the bstr crate useful for working with byte slices in a more String-like fashion instead.
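
Applied to your foo_new above, that would look roughly like this (a sketch reusing your Foo struct, skipping CStr and keeping the UTF-8 check):

use std::slice;
use std::str;

#[no_mangle]
#[allow(improper_ctypes_definitions)]
pub extern "C" fn foo_new(pstext: *const u8, itextlen: u32) -> Box<Foo<'static>> {
    let slice = unsafe { slice::from_raw_parts(pstext, itextlen as usize) };

    // Go straight from the byte slice to &str, without a CStr in between.
    // The UTF-8 validity check (a linear scan) is kept because the data
    // comes from C and cannot be trusted to be valid UTF-8.
    match str::from_utf8(slice) {
        Ok(s) => Box::new(Foo { stext: s }),
        Err(_) => Box::new(Foo { stext: "" }),
    }
}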

I looked at your reference:
string concat() example
and got curious with regard to our subject. How is the let fst = fst.to_str(); implemented?
How do you build the char_p::Ref?

The short story is that byte strings are useful when it is inconvenient or incorrect to require valid UTF-8.

I have to laugh, because ever since I started with Rust I have been struggling with the strict requirement: "be UTF-8 or EXIT"
:laughing:
I feel that some Rust developers have come down the same road as me.

But then I'm concerned: when a ByteSlice is fed into the Rust ecosystem, wouldn't it need to do the validation check (a linear scan) anyway?

Here's a link to the source code:

It's probably easier for you to read through the source code than asking people for a second-hand explanation of what they think is going on.

String handling in C is a big mess, so there aren't any real best practices you can follow.

Most libraries will implement their own string type (see gnutls_datum_t, Qt's QString, GTK's GString, etc.) once they reach a certain size. Some libraries will just pass around a pointer and int as two separate function parameters, and others will use null-terminated strings.

:face_with_raised_eyebrow:

Sure, a String (or str) is designed to only contain UTF-8 bytes, but there is no requirement to crash the program when you try to convert non-UTF-8 bytes to a String.

That's a decision the developer explicitly made when they used unwrap() instead of handling the error gracefully (e.g. by reporting to the user that the data they entered was invalid).
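
For example, the difference is simply whether the Result of the conversion is unwrapped or handled (the function name here is made up):

// Hypothetical example: handle invalid UTF-8 instead of panicking on it.
fn report_text(bytes: Vec<u8>) {
    // Using `.unwrap()` here would panic on invalid UTF-8; matching on the
    // Result reports the problem to the user instead.
    match String::from_utf8(bytes) {
        Ok(text) => println!("got text: {}", text),
        Err(e) => eprintln!("input was not valid UTF-8: {}", e),
    }
}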

@quinedot
@Michael-F-Bryan
The BStr structure fits perfectly into the FFI use case:
BStr representation

A &BStr has the same representation as a &str. That is, a &BStr is a fat pointer which consists of a pointer to some bytes and a length.

If I could construct it from a *const u8 with its length as usize, like the std::slice, it would even be my first choice.

It looks like BStr implements From<&[u8]>, so you can just use into() to convert from a &[u8] to a &BStr.

use bstr::BStr;
use std::os::raw::c_int;

unsafe fn some_function(data: *const u8, length: c_int) {
    let slice = std::slice::from_raw_parts(data, length as usize);
    // Free conversion: &[u8] -> &BStr via the From<&[u8]> impl
    let b_str: &BStr = slice.into();
}

You might want to look at my former discussion to get an idea of what it involves not to just EXIT, as is common practice:
Proper UTF-8 Error Handling
The example use case also shows why it is a real requirement in the real world.

I feel like that was a bit of a straw man argument.

If you are writing a web spider then you know you need to deal with malformed data. Instead of blindly calling unwrap(), you'll use something which can handle malformed UTF-8 like String::from_utf8_lossy().

You might want to look at my analysis of String::from_utf8_lossy()
and why it is not desirable:
String::from_utf8_lossy() - analysis

Therefore, using String::from_utf8_lossy() is counter-productive, because it deletes the original byte and always replaces it with the same value, U+FFFD.

When I use it, I will lose the data that I want to work with and inspect in Rust.
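
A small demonstration of that data loss (the byte values are just an arbitrary example):

fn main() {
    // 0xE9 is "é" in Latin-1, but on its own it is not valid UTF-8.
    let bytes = vec![b'c', b'a', b'f', 0xE9];

    let lossy = String::from_utf8_lossy(&bytes);
    // The original byte 0xE9 is gone - it was replaced by U+FFFD,
    // so the information needed to inspect or repair it is lost.
    assert_eq!(lossy, "caf\u{FFFD}");
    println!("{}", lossy);
}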

If you want to work with non-UTF-8 data, don't use a UTF-8 string. We have types like Vec<u8> or the bstr crate discussed above.


Yeah, actually that was my initial idea when writing the text-sanitizer:
text-sanitizer on playground
as its idea was to inspect the byte values of the text in order to convert it into ASCII code.

But at some point in your project you will always reach the point where you need to convert it to str or String, to be able to print it as text or to interact with the Rust ecosystem.
Then the application crashes or exits if you haven't sanitized it correctly yet.
As this was the starting point of the former discussion ...

Now the new BStr seems worth studying, to see in which way it can shorten the "long way round" of UTF-8 error handling.

A use case where the length of the string is actually calculated can be found in the Oniguruma regex engine:
Oniguruma API definition

int onig_new(regex_t** reg, const UChar* pattern, const UChar* pattern_end,
            OnigOptionType option, OnigEncoding enc, OnigSyntaxType* syntax,
            OnigErrorInfo* err_info)

  Create a regex object.

  normal return: ONIG_NORMAL

  arguments
  1 reg:         return regex object's address.
  2 pattern:     regex pattern string.
  3 pattern_end: terminate address of pattern. (pattern + pattern length)
  4 option:      compile time options.

where obviously pattern_length = pattern_end - pattern;

That would be the fast way to actually calculate the length of the string.
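
On the Rust side of such an API, the length would be recovered from the two pointers by simple pointer arithmetic, for example (just a sketch):

use std::slice;

// Build a byte slice from a (start, end) pointer pair, as in the
// Oniguruma-style API above: pattern_length = pattern_end - pattern.
// Safety: both pointers must point into the same allocation and
// `end` must not be before `start`.
unsafe fn bytes_from_range<'a>(start: *const u8, end: *const u8) -> &'a [u8] {
    let len = end.offset_from(start) as usize;
    slice::from_raw_parts(start, len)
}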