Rust strings are not so friendly with C/C++

You're right, I confused the links: his is about memory errors caused by strings, while the one I opened next to it was about NUL.

My position is that C strings were a mistake that we made because the alternatives we were staring at were all worse - in particular, "Pascal strings" combine the sins of not being sliceable with the sins of choosing a length type that's too short plus alignment issues. Also the compilers weren't up to the task of bundling pairs yet. Can't find the last time I wrote up this rant so that's all you're getting for now.


Importantly, &str can't be null-terminated if you want subslicing anywhere other than the end to be cheap. I'd much rather that subslicing work well than interop be easier -- compare str::split_whitespace to the C alternatives, for example.
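To make the subslicing point concrete, here is a minimal sketch (the helper names `offset_in` and `word_offsets` are made up for illustration). Every word yielded by str::split_whitespace is a pointer-plus-length view into the original buffer, which only works because no terminator has to follow each piece:

```rust
/// Byte offset of `part` inside `whole`.
/// Assumes `part` really is a subslice of `whole` (true for split_whitespace output).
fn offset_in(whole: &str, part: &str) -> usize {
    part.as_ptr() as usize - whole.as_ptr() as usize
}

/// Splits on whitespace and returns (offset, word) pairs.
/// No bytes are copied and no terminators are written into the input.
fn word_offsets(text: &str) -> Vec<(usize, &str)> {
    text.split_whitespace()
        .map(|word| (offset_in(text, word), word))
        .collect()
}
```

With a NUL-terminated representation, each of those words would need either a copy or a terminator written into the middle of the input, which is what strtok does.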


A Rust String is a native UTF-8 string, so you can use international characters such as 'á' or 'д'. To pass it to C, you can convert it to a vector of wchar_t:

let mut chv: Vec<_> = s.chars().map(|c| c as libc::wchar_t).collect();
chv.push(0);
unsafe {
    //demo_u8(s.as_ptr(), s.len());    // pass to C as utf8 byte array (char *, but no \0 at end)
    demo_char(chv.as_ptr());           // as wchar_t, see: wcscpy()
}

(thanks H2CO3 ... see below)


That is almost certainly catastrophically wrong. Rust char is always 4 bytes wide, while wchar_t is not specified to be 4 bytes wide, and in fact it is only 2 bytes wide on some platforms, notably Windows. Therefore casting the *const char to a pointer of type *const wchar_t will result in gibberish (and potentially memory unsafety) if it is ever dereferenced by the callee.
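A minimal sketch of the safer direction, converting element by element instead of reinterpreting the pointer. The function names are hypothetical, and which of the two buffers matches wchar_t depends on the target platform, so treat the choice as an assumption to check per target:

```rust
/// Size of Rust's `char` in bytes (always 4).
const CHAR_BYTES: usize = std::mem::size_of::<char>();

/// UTF-16 code units plus a terminating 0, for targets where wchar_t is 16 bits.
/// Note: an interior '\0' in `s` is not rejected here and would cut the C string short.
fn to_wide_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().chain(std::iter::once(0)).collect()
}

/// One 32-bit unit per char plus a terminating 0, for targets where wchar_t is 32 bits.
fn to_wide_utf32(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).chain(std::iter::once(0)).collect()
}
```

Either vector can then be handed to C through as_ptr(), as long as it outlives the call.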


Actually, by a strict reading of the standard, MSVC++ is nonconforming, as wchar_t is specified to be (paraphrased) "large enough to hold any character in the supported character set". Specifically, this means that "the size of a wide string literal is the total number of escape sequences [...] and other characters, plus one for the terminating L'\0'".

However, MSVC compiles L"\U000E0005" as an array of three wchar_t, as the non-BMP character requires a surrogate pair in UTF-16. Here's a StackOverflow QA about this.
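The element counts can be reproduced from Rust as a rough illustration (helper names are made up; this only mirrors the UTF-16 arithmetic, it does not invoke any C++ compiler):

```rust
/// Array lengths a wide literal would need, terminator included,
/// with 16-bit and 32-bit elements respectively.
fn wide_array_lens(s: &str) -> (usize, usize) {
    let utf16_len = s.encode_utf16().count() + 1;
    let utf32_len = s.chars().count() + 1;
    (utf16_len, utf32_len)
}

/// The UTF-16 code units for a single character (one unit, or a surrogate pair).
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}
```

For U+E0005 the first function gives three elements with 16-bit units and two with 32-bit units, which is the mismatch described above.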

Obviously, saying a 2 byte wchar_t is broken and refusing to support it is not practical, when Windows APIs require it (well, modulo the fact that everything is actually LPWCHAR style macros in the actual API). So programs that want to be portable have to accommodate this nonconformity (or, really, just go back to the everything is ASCII days) when written in C or C++.

All of this is really just to say that "C strings" are fundamentally broken for portable software, even ignoring the null termination issues. New software should use UTF-8 everywhere that it can internally, and translate to whatever system format when necessary to talk to the OS. Other reasonable approaches include a slice of bytes that's only conventionally UTF-8, using the type system to encapsulate the OS concepts, or even never dealing with strings internally, only tokens, and only converting to strings when showing them to the user.
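As a small sketch of "UTF-8 inside, translate at the boundary" using only the standard library (the two wrapper functions are hypothetical names, not an established API):

```rust
use std::ffi::{CStr, CString, NulError};

/// Outbound: internal UTF-8 to an owned, NUL-terminated buffer.
/// Fails if `s` contains an interior NUL, since C could not represent it.
fn to_c_string(s: &str) -> Result<CString, NulError> {
    CString::new(s)
}

/// Inbound: bytes from C back to UTF-8, replacing invalid sequences with U+FFFD.
fn from_c_str(c: &CStr) -> String {
    c.to_string_lossy().into_owned()
}
```

The conversion is the one place where the interior-NUL and invalid-encoding cases have to be decided, rather than leaking into the rest of the program.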

IMHO, the third (or so) biggest mistake of the software industry is convincing ourselves that text, or time, or many other things are simple or even solved. (First is pervasive nullability, second is pervasive sentinel-ended arrays.) Any time you have to deal with the real world (and that includes talking to the OS or over the network at all), things get horribly messy, and we have to acknowledge and deal with the inherent complexity, not paper over it and pretend it doesn't exist.


Note that according to cppreference.com (which I think is quite reliable?) this seems to have been "fixed" in C++23 by making the conflicting cases for wide string literals illegal. Since otherwise C++ seemingly doesn't care generally what the encoding is (exceptions being locale dependent functions where the locale defines it and the deprecated codecvt_utf* conversions), I think this makes VC++ conformant again here?

If a character lacks representation in the associated character encoding,

  • if the string literal is an ordinary string literal or wide string literal, it is conditionally-supported and an implementation-defined code unit sequence is encoded;
  • [...]

Each numeric escape sequence corresponds to a single element. If the value specified by the escape sequence fits within the unsigned version of the element type, the element has the specified value (possibly after conversion to the element type); otherwise (the specified value is out of range), the string literal is ill-formed. (since C++23)

String literal - cppreference.com

By the way, I suggest two important improvements to your crate, which I had finally committed to fully prototyping around two weeks ago (Playground):

  • Use const magic to perform the nul-byte scanning at const time rather than macro expansion time. This allows usage of consts as input to the macro, or usage of other macros that expand to strings, such as env!, include_str!, or concat!.

  • Do not emit a call to the not-yet-const CStr::from_bytes_with_nul_unchecked. Instead, expand to a wrapper that performs the call on Deref. This way we still have a fully const-constructible value (that does not assume the layout of &'_ CStr), that can still be used in any place that expects a &'_ CStr (and the Deref operation will be a no-op anyway, no matter what the underlying layout / implementation of &'_ CStr is).

  • Error message when the macro is from an external crate:

    error[E0308]: mismatched types
     --> src/main.rs:5:51
      |
    5 |         const CS: &'static ::example::ConstCStr = ::example::const_cstr!(S);
      |                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^ expected `13_usize`, found `4_usize`
      |
      = note: expected struct `zstr::__::c_str_len<13_usize>`
                 found struct `zstr::__::c_str_len<4_usize>`
      = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
  • EDIT: I've just noticed I incorrectly handle the case where no string or an empty string is given (it will cause a const_err by out-of-bounds indexing); that's just a matter of guarding the [len - 1] indexing with a check against that case, and defining the input byte-string as b"\0" in that case.
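The const-time scan from the first suggestion could look roughly like this. It is a sketch, not the Playground code: `is_valid_c_bytes` and `GREETING` are hypothetical names, and it assumes a toolchain where loops and assert! are allowed in const evaluation.

```rust
/// True if `bytes` ends with a NUL and contains no NUL before that.
/// The empty-slice guard avoids the out-of-bounds `[len - 1]` case.
const fn is_valid_c_bytes(bytes: &[u8]) -> bool {
    if bytes.is_empty() {
        return false;
    }
    let last = bytes.len() - 1;
    if bytes[last] != 0 {
        return false;
    }
    // `for` loops are not available in const fn, so scan with `while`.
    let mut i = 0;
    while i < last {
        if bytes[i] == 0 {
            return false;
        }
        i += 1;
    }
    true
}

const GREETING: &[u8] = b"hello\0";

// Evaluated during compilation: an invalid GREETING becomes a build error.
const _: () = assert!(is_valid_c_bytes(GREETING));
```

Because the check runs on a const value rather than on macro tokens, the input can come from another const or from macros such as concat!.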


Thanks, those are valuable suggestions. My only comment is that I'd rather wait for CStr::from_bytes_with_nul_unchecked to become const, since the transmute is explicitly documented not to be guaranteed to work. Which is understandable, because I reckon there are efforts to change the representation of CStr from [u8] to a single pointer, calculating the length on-demand.

Indeed that function should become const in any case, since, as I mentioned, it will be a no-op in all cases:

  • it currently can be implemented as a transmute / cast;

  • Once &CStr becomes a slim pointer, it will be able to just ditch the len component of &[u8], in a const fashion.

I personally believe that you could use the wrapper around [u8] that Derefs as suggested, and once the unchecked CStr constructor becomes const, you can just replace the definition of the wrapper with a #[deprecated(…)] type alias 🙂 (that would be breaking w.r.t. impl Deref<Target = CStr> 😩). Well, just change the implementation back then with a semver bump: I suspect it will take longer than expected for CStr::from_bytes_with_nul_unchecked to become const, since it hasn't already.
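For reference, a minimal sketch of the wrapper idea. The struct layout, `new`, `HELLO`, and `c_len` are assumptions made for illustration and are not the crate's actual definitions. Depending on the toolchain version, the unchecked constructor may already be usable in const contexts, in which case the wrapper is unnecessary.

```rust
use std::ffi::CStr;
use std::ops::Deref;

/// Hypothetical const-constructible stand-in for `&'static CStr`.
struct ConstCStr {
    bytes: &'static [u8],
}

impl ConstCStr {
    /// Validates the bytes; in a const context a failed assert is a compile error.
    const fn new(bytes: &'static [u8]) -> Self {
        assert!(!bytes.is_empty(), "missing NUL terminator");
        let last = bytes.len() - 1;
        assert!(bytes[last] == 0, "missing NUL terminator");
        let mut i = 0;
        while i < last {
            assert!(bytes[i] != 0, "interior NUL byte");
            i += 1;
        }
        ConstCStr { bytes }
    }
}

impl Deref for ConstCStr {
    type Target = CStr;

    fn deref(&self) -> &CStr {
        // SAFETY: `new` checked for exactly one NUL, at the end.
        unsafe { CStr::from_bytes_with_nul_unchecked(self.bytes) }
    }
}

const HELLO: ConstCStr = ConstCStr::new(b"hello\0");

/// Any function taking `&CStr` accepts `&ConstCStr` through deref coercion.
fn c_len(s: &CStr) -> usize {
    s.to_bytes().len()
}
```

The wrapper never assumes anything about the layout of &CStr; it only stores the validated bytes and defers the conversion to the point of use.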