String unicode awareness

Hi,

This is a test fragment:

    let en1 = "The Good Soldier SvejkThe Good Soldier Svejk";
    let sv1 = "Osudy dobrého vojáka Švejka za světové války";
    let svu1 = sv1.chars().collect::<Vec<char>>();

    assert_eq!(en1.len(), 44);
    assert_eq!(sv1.len(), 50);

    let (head, _) = sv1.split_at(10);
    assert_eq!(head, "Osudy dobr");
    // let (head, _) = sv1.split_at(11);
    // assert_eq!(head, "Osudy dobre");
    let (head, _) = sv1.split_at(12);
    assert_eq!(head, "Osudy dobré");

    let (head, _) = svu1.split_at(10);
    assert_eq!(head.into_iter().collect::<String>(), "Osudy dobr");
    let (head, _) = svu1.split_at(11);
    assert_eq!(head.into_iter().collect::<String>(), "Osudy dobré");

If you uncomment those two lines, the test panics with a very clear message:

    thread 'test_main::test_truncate_str' panicked at 'byte index 11 is not a char boundary; it is inside 'é' (bytes 10..12) of `Osudy dobrého vojáka Švejka za světové války`', /rustc/5531927e8af9b99ad923af4c827c91038bca51ee/library/core/src/str/mod.rs:584:13

The string is indexed by byte and its length is in bytes. The purpose, one is supposed to think, is to allow efficiency when you really need it. There are a lot of use cases where you can dispense with splitting, and you can appreciate the efficiency in an embedded environment, for instance.
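
For instance, the char count and the byte length differ for sv1 (a quick check of my own, not part of the original fragment):

    // Same string as in the fragment above.
    let sv1 = "Osudy dobrého vojáka Švejka za světové války";
    assert_eq!(sv1.len(), 50);           // length in bytes
    assert_eq!(sv1.chars().count(), 44); // number of chars (Unicode scalar values)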

I'm really curious how the string is capable of checking Unicode boundaries without the efficiency going to... wherever it goes. If the efficiency is already sacrificed, why bother with bytewise indexing and len()?

UPD:

If you want to split at 11 real bad, courtesy of kaj:

    let (head, _) = sv1.as_bytes().split_at(11);
    assert_eq!(String::from_utf8_lossy(head), "Osudy dobr�");

You don't need to parse the whole string to identify whether a particular byte index is a character boundary or not. All you need to do is examine the byte. This is one of the advantages of UTF-8 over some other encodings.

Here's the implementation of std::str::is_char_boundary:

    pub fn is_char_boundary(&self, index: usize) -> bool {
        // 0 is always ok.
        // Test for 0 explicitly so that it can optimize out the check
        // easily and skip reading string data for that case.
        // Note that optimizing `self.get(..index)` relies on this.
        if index == 0 {
            return true;
        }

        match self.as_bytes().get(index) {
            // For `None` we have two options:
            //
            // - index == self.len()
            //   Empty strings are valid, so return true
            // - index > self.len()
            //   In this case return false
            //
            // The check is placed exactly here, because it improves generated
            // code on higher opt-levels. See PR #84751 for more details.
            None => index == self.len(),

            // This is bit magic equivalent to: b < 128 || b >= 192
            Some(&b) => (b as i8) >= -0x40,
        }
    }

Looks reasonably efficient to me (and well-documented to boot).


You mean, since we have a safe language and have to check boundaries at all times, this overhead looks affordable?

I'm no longer sure I understand the question. What do you mean "this overhead"? What part of the cost of this code is overhead, in your estimation?


It is affordable (and about as efficient as possible in any language) if you want to split into two substrings with the guarantee that both substrings are valid UTF-8. If you don't care about that guarantee, you can skip the test by calling .as_bytes() before splitting.


It should be noted that Rust's str (or String, the owned/growable variant) doesn't care about Unicode grapheme cluster boundaries; it only knows about the UTF-8 encoding and the exclusion of UTF-16 surrogates. Thus you could say that str (and also char) cares more about the UTF-8/UTF-16 encodings than about Unicode in particular.

Also see documentation on str::chars:

It’s important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a ‘character’ is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust’s standard library, check crates.io instead.

Handling UTF-8 (and excluding surrogate characters) is much easier than supporting Unicode completely.

Having written that, I just realized that some methods of the str type indeed support Unicode in a deeper way than just the UTF-8 encoding. For example str::trim, which removes whitespace.
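
To illustrate the difference, here's a sketch using the unicode-segmentation crate from crates.io (not the standard library): a combining accent produces one grapheme cluster but two chars.

    use unicode_segmentation::UnicodeSegmentation;

    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let s = "e\u{301}";
    assert_eq!(s.len(), 3);                   // bytes
    assert_eq!(s.chars().count(), 2);         // Unicode scalar values
    assert_eq!(s.graphemes(true).count(), 1); // grapheme clusters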


Rust does runtime checks on several occasions (not just when working with strings). There are usually unsafe interfaces if you want to avoid the overhead, like str::get_unchecked for example. (Edit: And of course you can decide to just work on [u8] instead of using str, which won't require unsafe code.)
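
For example (my own sketch, reusing the sv1 string from the first post; the safety comment states the caller's obligation):

    let sv1 = "Osudy dobrého vojáka Švejka za světové války";
    // Safety: the caller must guarantee that 12 lies on a char boundary;
    // otherwise this is undefined behavior.
    let head = unsafe { sv1.get_unchecked(..12) };
    assert_eq!(head, "Osudy dobré");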

So Rust can provide the same performance as languages which do not do these runtime checks. But at a price. (Besides, the overhead isn't really much.)

But after having experimented with unsafe myself, I'd really recommend being cautious with it :sweat_smile:.


The part that may not be zero-cost about this abstraction is the panic. I would prefer a method try_split_at(&self, mid: usize) -> Option<(&str, &str)> that returns the split if mid is a char boundary and None if it is not.
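
A minimal sketch of such a helper, written as a free function on top of str::is_char_boundary (the name try_split_at is just the proposal above, not an existing std method):

    fn try_split_at(s: &str, mid: usize) -> Option<(&str, &str)> {
        if s.is_char_boundary(mid) {
            Some(s.split_at(mid))
        } else {
            None
        }
    }

    let sv1 = "Osudy dobrého vojáka Švejka za světové války";
    assert_eq!(try_split_at(sv1, 11), None);
    assert_eq!(
        try_split_at(sv1, 12),
        Some(("Osudy dobré", "ho vojáka Švejka za světové války"))
    );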

There doesn't seem to be such a method in the std library though. Might be worth taking it up on https://internals.rust-lang.org/?

Thanks! There is UnicodeSegmentation, too :slight_smile: . I'm trying to get comfortable with the ideas behind Rust's string management.

Doing the safety checks isn't "zero-cost" in either case, but "little cost", I'd say.

You might use str::is_char_boundary to check whether you can split a string at a certain position. It might be two extra calls, but I'm not sure it's worth adding a non-panicking split function to std (if it doesn't already exist; I didn't check).

It may be worth noting that most of the time, when you're doing string processing, you probably don't have an arbitrary byte index to split at - it likely comes from another text processing function which already guarantees it is a valid character boundary. This may be why there's no try_split_at in std.

In that context, checking the boundary "again" would indeed be overhead, which you could avoid by using unsafe and get_unchecked (which is what split_at does internally). But then, even that might be unnecessary. If the compiler can inline enough to mash the checks together, it won't even matter. Even if there are multiple checks, the second failure branch will never be taken, so branch prediction will probably reduce it to 1-2 cycles: something to consider if you're in a very tight loop, but not a likely bottleneck for typical text processing tasks.
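
For instance (my own illustration of that point, again reusing sv1): str::find returns a byte index that is guaranteed to be a char boundary, so splitting at it cannot panic.

    let sv1 = "Osudy dobrého vojáka Švejka za světové války";
    if let Some(idx) = sv1.find("vojáka") {
        let (head, tail) = sv1.split_at(idx);
        assert_eq!(head, "Osudy dobrého ");
        assert_eq!(tail, "vojáka Švejka za světové války");
    }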


Oh, then it can indeed be "zero-cost" in some cases.

You can do exactly that using str::get(..index) and str::get(index..).
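
For example (reusing sv1 again; get returns None instead of panicking when the index isn't a char boundary):

    let sv1 = "Osudy dobrého vojáka Švejka za světové války";
    assert_eq!(sv1.get(..11), None);                // 11 is not a char boundary
    assert_eq!(sv1.get(..12), Some("Osudy dobré")); // 12 is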


Thanks! I hadn't realized I could use slice indexes as arguments for get. TIL. :slight_smile:

