More efficient conversion from utf8 bytes to a string?

I've made a function to reliably convert a UTF-8 grapheme cluster stored in a 4-byte variable into a String.

I'd like to know if there is an alternative way to make this more efficient by skipping some steps, ideally making it zero-copy by avoiding the heap, perhaps with some kind of stack-allocated string from an external crate.

Thank you

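fn egc_to_string(egc: u32) -> Option<String> {
    // Reinterpret the integer as its 4 bytes in native (in-memory) order.
    let bytes = egc.to_ne_bytes();
    // Keep only the bytes before the first NUL; `split` always yields at
    // least one (possibly empty) slice, so the `unwrap` cannot panic.
    let no_nuls = bytes.split(|b| *b == 0).next().unwrap();
    // Validate as UTF-8, then copy into a heap-allocated String.
    std::str::from_utf8(no_nuls).ok().map(|s| s.to_string())
}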

I don't see why you go through a CStr instead of using String::from_utf8?

To avoid the heap, you might want to try the smartstring crate.

Or you can write your own type for this particular kind of short string, one that keeps the data as a u32 or [u8; 4] but implements Deref<Target=str>.
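A minimal sketch of what such a custom type could look like, using only the standard library. The name ShortEgc and its methods are hypothetical, and it assumes the same convention as the original function: the content ends at the first NUL byte. The bytes are validated once at construction, so the value can be used anywhere a &str is expected without allocating.

```rust
use std::ops::Deref;

/// Hypothetical wrapper: up to 4 bytes of UTF-8, padded with trailing NULs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ShortEgc {
    bytes: [u8; 4],
    len: u8, // number of meaningful bytes, 0..=4
}

impl ShortEgc {
    /// Validates the bytes once, so that later derefs always succeed.
    pub fn from_bytes(bytes: [u8; 4]) -> Option<Self> {
        // Same rule as the original function: stop at the first NUL.
        let len = bytes.iter().position(|&b| b == 0).unwrap_or(4);
        std::str::from_utf8(&bytes[..len]).ok()?;
        Some(Self { bytes, len: len as u8 })
    }

    /// Convenience constructor for the u32 representation (native byte order).
    pub fn from_u32(egc: u32) -> Option<Self> {
        Self::from_bytes(egc.to_ne_bytes())
    }
}

impl Deref for ShortEgc {
    type Target = str;

    fn deref(&self) -> &str {
        // Re-checking keeps this sketch free of `unsafe`; the bytes were
        // already validated in `from_bytes`, so this cannot fail.
        std::str::from_utf8(&self.bytes[..self.len as usize])
            .expect("validated in from_bytes")
    }
}
```

With this, ShortEgc::from_u32(egc) replaces egc_to_string(egc), and the result derefs to &str, so str methods such as len() or chars() work on it directly. Storing the length is a design choice of this sketch; it could also be recomputed on each deref.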

Oops, I was so fixated on interoperating with FFI that I didn't realize I could easily avoid CStr altogether...

Thanks for the smartstring reference. And I also like the idea of a custom type that derefs to str.

Just a small remark regarding terminology: What you likely meant is a Unicode Code Point or Unicode Scalar Value (as opposed to a Grapheme Cluster or EGC). But maybe I misunderstood something?

Grapheme clusters might not fit into a 32-bit integer. (And code points that aren't a Unicode Scalar Value can't be converted into a Rust str.) See also Rust's primitive char type:

Primitive Type char

A character type.

The char type represents a single character. More specifically, since ‘character’ isn’t a well-defined concept in Unicode, char is a ‘Unicode scalar value’.

For the difference between a code point and a grapheme cluster, see Unicode Standard Annex #29.


See also the documentation of str::chars, which clarifies:

It’s important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a ‘character’ is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust’s standard library, check crates.io instead.
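To illustrate why a grapheme cluster might not fit into 32 bits, here is a small sketch using only the standard library. Since std does not segment grapheme clusters, the example strings are simply assumed to be one grapheme cluster each; the helper names are hypothetical.

```rust
/// Returns (UTF-8 byte length, number of Unicode scalar values) for `s`.
fn utf8_len_and_scalars(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// True if the UTF-8 encoding of `s` fits into a 4-byte variable.
fn fits_in_four_bytes(s: &str) -> bool {
    s.len() <= 4
}

/// "e" followed by U+0301 COMBINING ACUTE ACCENT (assumed to be one cluster).
const E_WITH_COMBINING_ACUTE: &str = "e\u{301}";

/// MAN, ZWJ, WOMAN, ZWJ, GIRL: a family emoji sequence (assumed to be one cluster).
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
```

Here E_WITH_COMBINING_ACUTE is two scalar values in 3 bytes and still fits, while FAMILY is five scalar values in 18 bytes and does not.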

I can't help it. I want you to review the code for me. lol~

Yeah! Thanks, I had to learn the difference a while ago, since it was really confusing for me at first.

I really do mean a grapheme cluster in this case. I'm working on bindings for a C library that stores the grapheme cluster inline only if it's up to 4 bytes, and otherwise stores a pointer to where it's kept. See libnotcurses_sys::NcCell.

Note that the latter version of that can be found as Utf8Char in the encode_unicode crate.

(No need for the extra alignment that's forced by storing it as a u32 -- the array of bytes is plenty.)
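A rough way to see the alignment point, using only std::mem. The two structs are hypothetical and exist only to show how the field type affects the layout of a containing type; exact numbers for the u32 variant depend on the target.

```rust
use std::mem::{align_of, size_of};

/// Returns (size, alignment) in bytes for any sized type.
fn layout<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}

/// Hypothetical cell that stores the bytes as an integer.
#[allow(dead_code)]
struct TaggedU32 {
    tag: u8,
    egc: u32,
}

/// Hypothetical cell that stores the same bytes as an array.
#[allow(dead_code)]
struct TaggedBytes {
    tag: u8,
    egc: [u8; 4],
}
```

layout::<[u8; 4]>() reports 4 bytes with alignment 1, so TaggedBytes needs no padding. On common targets u32 is aligned to 4 bytes, which typically pads TaggedU32 out to 8 bytes.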
