How to get a substring of a String

Something like Rust Playground should do the trick for indexing Unicode code points, but this still would not handle strings with combining characters. I wonder if there is a more obvious or easier way to do this.

1 Like

You want graphemes, but I believe that was de-stabilised because it might/might not be moving to an external crate.

Frankly, "how do I get the first X characters" is almost never a valid question in the first place: there's pretty much no reason to ever do it.

1 Like

Thanks for all the answers. Very helpful!

I will look at the links.
I understand that the problem is harder than it seems, particularly if one wants Unicode support and efficiency at the same time :smile:

Be careful: this code can panic.
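As a general illustration (an assumption about the kind of code being warned about, since it is not quoted here): slicing a str with a byte index that lands inside a multi-byte character panics, whereas str::get returns None instead. The helper name byte_prefix below is hypothetical.

```rust
/// Hypothetical helper: the first `n` bytes of `s` as a `&str`, or `None`
/// if `n` is past the end or falls inside a multi-byte character.
fn byte_prefix(s: &str, n: usize) -> Option<&str> {
    // Unlike `&s[..n]`, `get` never panics; it reports failure as `None`.
    s.get(..n)
}

fn main() {
    let s = "abcd\u{e8}fghij"; // U+00E8 ('è') occupies bytes 4 and 5
    assert_eq!(byte_prefix(s, 4), Some("abcd"));
    assert_eq!(byte_prefix(s, 5), None); // inside 'è'; `&s[..5]` would panic
    assert_eq!(byte_prefix(s, 6), Some("abcd\u{e8}"));
    assert_eq!(byte_prefix(s, 50), None); // past the end of the string
}
```

Note that this works on byte offsets, not character counts, so it only avoids the panic; it does not answer "give me the first N characters".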

3 Likes

It's only useful in this special case. See the rest of the comments.

This code implements both substring-ing and string-slicing, and should never panic:

use std::ops::{Bound, RangeBounds};

trait StringUtils {
    fn substring(&self, start: usize, len: usize) -> &str;
    fn slice(&self, range: impl RangeBounds<usize>) -> &str;
}

impl StringUtils for str {
    fn substring(&self, start: usize, len: usize) -> &str {
        let mut char_pos = 0;
        let mut byte_start = 0;
        let mut it = self.chars();
        loop {
            if char_pos == start { break; }
            if let Some(c) = it.next() {
                char_pos += 1;
                byte_start += c.len_utf8();
            }
            else { break; }
        }
        char_pos = 0;
        let mut byte_end = byte_start;
        loop {
            if char_pos == len { break; }
            if let Some(c) = it.next() {
                char_pos += 1;
                byte_end += c.len_utf8();
            }
            else { break; }
        }
        &self[byte_start..byte_end]
    }
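    fn slice(&self, range: impl RangeBounds<usize>) -> &str {
        // Bounds are char indices, not byte offsets.
        let start = match range.start_bound() {
            Bound::Included(bound) => *bound,
            Bound::Excluded(bound) => bound.saturating_add(1),
            Bound::Unbounded => 0,
        };
        let len = match range.end_bound() {
            Bound::Included(bound) => bound.saturating_add(1),
            Bound::Excluded(bound) => *bound,
            // The byte length is an upper bound on the char count.
            Bound::Unbounded => self.len(),
        }
        // saturating_sub avoids an underflow panic when end < start.
        .saturating_sub(start);
        self.substring(start, len)
    }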
}

fn main() {
    let s = "abcdèfghij";
    // All three statements should print:
    // "abcdè, abcdèfghij, dèfgh, dèfghij."
    println!("{}, {}, {}, {}.",
        s.substring(0, 5),
        s.substring(0, 50),
        s.substring(3, 5),
        s.substring(3, 50));
    println!("{}, {}, {}, {}.",
        s.slice(..5),
        s.slice(..50),
        s.slice(3..8),
        s.slice(3..));
    println!("{}, {}, {}, {}.",
        s.slice(..=4),
        s.slice(..=49),
        s.slice(3..=7),
        s.slice(3..));
}
13 Likes

Maybe this?

let s = "Golden Eagle";
let sub: String = s.chars().take(6).collect();

I don't know how to avoid allocation, but AFAIK in languages like Java or Go, substring always requires an allocation.

4 Likes

I'd say this is almost correct.
Except as a user you'll likely want to use the UnicodeSegmentation::graphemes() function from the unicode-segmentation crate rather than the built-in .chars() method.

See the link for an example of how to use the UnicodeSegmentation::graphemes() function.

The difference is that the graphemes fn accounts for “characters” that are made up of several Unicode scalar values (characters as in the elements from which a word is formed at the human level; as noted before, Unicode doesn't properly define what a character is), and the .chars() method does not.

1 Like

If you want to know how to avoid allocation, read the routines I posted before your message. Using them, you can write:

let sub = "Golden Eagle".substring(0, 6);

or:

let sub = "Golden Eagle".slice(0..6);

avoiding any allocation.

2 Likes

Can you elaborate a bit?

Here is what is documented in Rust:

The char type represents a single character. More specifically, since 'character' isn't a well-defined concept in Unicode, char is a 'Unicode scalar value', which is similar to, but not the same as, a 'Unicode code point'.

So unicode-segmentation is able to handle all Unicode code points?

Yeah, but that's too much code just for a substring...

How about this?

    let s = "Golden Eagle";
    let mut end: usize = 0;
    // Sum the UTF-8 byte lengths of the first 6 chars to find the slice end.
    s.chars().take(6).for_each(|c| end += c.len_utf8());
    println!("{}", &s[..end]);
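A variant of the same idea, as a minimal sketch: char_indices() already yields the byte offset of each char, so the end of the prefix can be looked up directly instead of being summed. The function name take_chars is made up for this example.

```rust
/// Hypothetical helper: the first `n` chars of `s` as a borrowed slice
/// (no allocation).
fn take_chars(s: &str, n: usize) -> &str {
    // `char_indices` yields (byte_offset, char). The byte offset of the
    // char at index `n` is exactly where the prefix ends. If the string
    // has `n` chars or fewer, the whole string is returned.
    match s.char_indices().nth(n) {
        Some((byte_end, _)) => &s[..byte_end],
        None => s,
    }
}

fn main() {
    assert_eq!(take_chars("Golden Eagle", 6), "Golden");
    assert_eq!(take_chars("abcd\u{e8}fghij", 5), "abcd\u{e8}");
    assert_eq!(take_chars("abc", 10), "abc");
    assert_eq!(take_chars("abc", 0), "");
    println!("{}", take_chars("Golden Eagle", 6));
}
```

Like the snippet above, this counts chars (Unicode scalar values), not grapheme clusters.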
2 Likes

unicode-segmentation doesn't deal with code points, it deals with grapheme clusters, which are one or more scalars that combine together into a single thing, which might or might not appear as a single symbol.

"á" (an a followed by the combining acute accent U+0301) and "á" (the single code point U+00E1) are a single grapheme cluster each, but the first has two chars in it, whilst the second has one.

3 Likes
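To make the two-chars-versus-one point concrete, here is a small std-only sketch. The escapes spell out the two forms explicitly, because they look identical on screen; the helper name scalar_count is made up for this example.

```rust
/// Hypothetical helper: number of Unicode scalar values (`char`s) in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let decomposed = "a\u{301}"; // 'a' + U+0301 COMBINING ACUTE ACCENT
    let precomposed = "\u{e1}"; // U+00E1 LATIN SMALL LETTER A WITH ACUTE

    // Both render as one accented letter, but the char counts differ.
    assert_eq!(scalar_count(decomposed), 2);
    assert_eq!(scalar_count(precomposed), 1);

    // The UTF-8 byte lengths differ as well (1 + 2 bytes versus 2 bytes).
    assert_eq!(decomposed.len(), 3);
    assert_eq!(precomposed.len(), 2);

    // String comparison works on the encoded bytes, so they are not equal.
    assert_ne!(decomposed, precomposed);
}
```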

One issue is that char is fixed width, whereas Unicode graphemes are not. They can be 1 char (e.g. all ASCII letters), or compound, which makes them multichar (e.g. ë or ö when written as a base letter plus a combining mark).

Aside from that, char boundaries don't always coincide with grapheme boundaries, so taking 'n' chars from the iterator provided by .chars() might not produce the output you expect when iterating over text in languages other than English, e.g. Danish or Chinese. You might, for example, end up with an a when you intended to take an á from a Spanish text, because an accented letter can be a compound grapheme, i.e. technically it consists of 2 chars: the base letter and the accent.

TL;DR: Unicode is very messily defined, and way more complex than programmers in general assume.
However, I've generally found that with the UnicodeSegmentation::graphemes() method such issues tend to become nonissues, as it handles all the nastiness for me. In contrast, .chars() behaves in rather surprising ways due to being char-based.
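A std-only sketch of the surprise described above, assuming the input text uses the decomposed form of the accented letter. The helper first_chars is hypothetical; it is just the .chars().take(n) approach from earlier in the thread.

```rust
/// Hypothetical helper: the first `n` chars of `s`, collected into a String.
fn first_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // "más" with the á written as 'a' + U+0301 (combining acute accent).
    let decomposed = "ma\u{301}s";
    // Taking 2 chars cuts between the base letter and its accent.
    assert_eq!(first_chars(decomposed, 2), "ma");
    // Only with 3 chars does the accent come along.
    assert_eq!(first_chars(decomposed, 3), "ma\u{301}");

    // The same word with the precomposed á (U+00E1) behaves as expected.
    let precomposed = "m\u{e1}s";
    assert_eq!(first_chars(precomposed, 2), "m\u{e1}");
}
```

Splitting on grapheme clusters instead (for example with the unicode-segmentation crate mentioned above) is what keeps the base letter and its accent together.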

1 Like

graphemes() should ideally be a method in the std library, like chars(). Since the issue can be so confusing for the average programmer, the existence of both these methods on the same struct in the std lib will at least make him or her pause and think instead of blindly going with chars(). In other words, it'll be a usability improvement.

1 Like

It was, and then it was pushed out into unicode-segmentation. It's unlikely to ever go back.

Why so? Especially when the chars() documentation still urges the user to use graphemes instead of chars:

It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.

Because it's not something the standard library has to include, not everyone is writing text manipulation code (beyond simple interpolation), it's one less thing the core devs have to maintain forever, the tables required can be quite large, and it allows the version of Unicode supported to be updated independently of the compiler.

Rust is not trying, and has never tried, to be "batteries included".

3 Likes

Oh, I know and support the std library's approach to being minimal, but chars() being there and graphemes() not is kinda broken. In my view one is much more likely to need grapheme clusters than Unicode scalar values (as the discussion above indicates). So if we had to pick one method it should have been graphemes() rather than chars(). Why was chars() preferred then?

The standard library has to be able to convert UTF-8 into UTF-16 in order to provide file system access under Windows, so it can't just treat str as a bag of bytes. There exists one, and only one, right way to convert between Unicode encodings, and it is unlikely to ever change.

Being able to encode/decode Unicode is absolutely required for Rust to be able to talk to the OS (at least on Windows). Splitting on grapheme cluster boundaries isn't required for anything else Rust provides.
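As a small illustration of that conversion using the public std API (this is only a sketch of the encoding step, not a claim about how std talks to Windows internally): str::encode_utf16 and String::from_utf16 round-trip a string without needing any grapheme tables. The helper name to_utf16 is made up for this example.

```rust
/// Hypothetical helper: encode a `&str` as UTF-16 code units.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    // 'é' (U+00E9) fits in one UTF-16 unit; U+1F980 (a crab emoji) lies
    // outside the Basic Multilingual Plane and needs a surrogate pair.
    let s = "h\u{e9}llo \u{1F980}";
    let units = to_utf16(s);

    assert_eq!(s.chars().count(), 7); // 7 scalar values
    assert_eq!(s.len(), 11); // 11 bytes of UTF-8
    assert_eq!(units.len(), 8); // 6 single units + 1 surrogate pair
    assert_eq!(&units[6..], &[0xD83E, 0xDD80]);

    // Decoding gives back the original string.
    let back = String::from_utf16(&units).expect("valid UTF-16");
    assert_eq!(back, s);
}
```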

2 Likes