How to get a substring of a String

Hi,
what is the best way to get a substring of a String?
I couldn't find a substr method or similar.

Let's assume I have a String like "Golden Eagle" and I want to get the first 6 characters, that is "Golden".

How can I do that?

Markus

5 Likes

Strings can be sliced using the index operator:

let slice = &"Golden Eagle"[..6];
println!("{}", slice);

The syntax is generally v[M..N], where M <= N. This returns a slice from M up to, but not including, N. There is also some more sugary syntax, like [..N] (everything up to N), [N..] (everything from N onwards) and [..] (everything).

It's the same for String, as well as vector/array types.
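A short sketch of those range forms, for illustration only. `range_forms` is a made-up helper name, and every index is a byte offset, which is safe here only because the sample text is plain ASCII:

```rust
// Applies the four range forms to `s`; all indices are byte offsets.
// Assumes `s` is ASCII and at least 7 bytes long.
fn range_forms(s: &str) -> (&str, &str, &str, &str) {
    (&s[0..6], &s[..6], &s[7..], &s[..])
}

fn main() {
    let (a, b, c, d) = range_forms("Golden Eagle");
    println!("{} | {} | {} | {}", a, b, c, d);

    // The same syntax works on an owned String and on a Vec or array.
    let owned = String::from("Golden Eagle");
    println!("{}", &owned[..6]);
    let v = vec![1, 2, 3, 4, 5];
    println!("{:?}", &v[1..3]);
}
```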

6 Likes

It's important to note that this is a slice of bytes; it will not necessarily return the first six characters.

let slice = &"Können"[..6];
println!("{}", slice);

prints Könne.
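To see where the missing character goes, it can help to print how many bytes each char occupies. A minimal sketch, with `utf8_widths` as a hypothetical helper name:

```rust
// Byte length of each char in `s`, in order.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    let s = "K\u{f6}nnen"; // "Können", with 'ö' written as U+00F6
    // 'ö' takes two bytes in UTF-8, so the string is 7 bytes but 6 chars.
    println!("{} bytes, {} chars", s.len(), s.chars().count());
    println!("{:?}", utf8_widths(s));
    // Byte offset at which each char starts.
    for (offset, c) in s.char_indices() {
        println!("{} -> {}", offset, c);
    }
}
```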

16 Likes

Good point. I thought that something felt fishy when I remembered that you cannot index a string and get a character.

1 Like

Yes, and that's exactly why :smile:

The issue with your question, @mjais, is that 'character' isn't a well-defined thing in the unicode universe. Check out Strings

5 Likes

Something like Rust Playground should do the trick for indexing Unicode code points, but this still would not handle strings with combining characters. I wonder if there is a more obvious or easier way to do this.
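The linked playground is not reproduced here, so the following is only a guess at the idea: use `char_indices()` to find the byte offset of the n-th char and slice there. `first_chars` is a hypothetical name, and it counts Rust `char`s, not graphemes:

```rust
// Returns the prefix of `s` holding at most `n` chars (Unicode scalar
// values). The result borrows from `s`, so nothing is allocated.
fn first_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // Byte offset where the char at index `n` starts: cut there.
        Some((byte_idx, _)) => &s[..byte_idx],
        // The string has `n` chars or fewer: return all of it.
        None => s,
    }
}

fn main() {
    println!("{}", first_chars("K\u{f6}nnen", 6)); // "Können"
    println!("{}", first_chars("Golden Eagle", 6));
}
```

Because the cut is always made at an offset reported by `char_indices()`, the slice cannot land inside a multi-byte char.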

1 Like

You want graphemes, but I believe that was de-stabilised because it might/might not be moving to an external crate.

Frankly, "how do I get the first X characters" is almost never a valid question in the first place: there's pretty much no reason to ever do it.

1 Like

Thanks for all the answers. Very helpful!

I will look at the links.
I understand that the problem is harder than it seems, particularly if one wants Unicode support and efficiency at the same time :smile:

be careful - this code can cause panicking.
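Assuming the warning refers to byte-range slicing such as `&s[..6]`, one way to avoid the panic is `str::get`, which returns an `Option` instead. A small sketch, with `checked_prefix` as a hypothetical wrapper name:

```rust
// Like `&s[..n]`, but returns None instead of panicking when byte `n`
// is past the end or falls inside a multi-byte character.
fn checked_prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    let s = "K\u{f6}nnen"; // "Können"
    println!("{:?}", checked_prefix(s, 6));
    // Byte 2 is in the middle of 'ö'; `&s[..2]` would panic here.
    println!("{:?}", checked_prefix(s, 2));
    println!("{:?}", checked_prefix(s, 99));
}
```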

3 Likes

It's only useful in this special case. See the rest of the comments.

This code implements both substring-ing and string-slicing, and should never panic:

use std::ops::{Bound, RangeBounds};

trait StringUtils {
    fn substring(&self, start: usize, len: usize) -> &str;
    fn slice(&self, range: impl RangeBounds<usize>) -> &str;
}

impl StringUtils for str {
    fn substring(&self, start: usize, len: usize) -> &str {
        let mut char_pos = 0;
        let mut byte_start = 0;
        let mut it = self.chars();
        loop {
            if char_pos == start { break; }
            if let Some(c) = it.next() {
                char_pos += 1;
                byte_start += c.len_utf8();
            }
            else { break; }
        }
        char_pos = 0;
        let mut byte_end = byte_start;
        loop {
            if char_pos == len { break; }
            if let Some(c) = it.next() {
                char_pos += 1;
                byte_end += c.len_utf8();
            }
            else { break; }
        }
        &self[byte_start..byte_end]
    }
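    fn slice(&self, range: impl RangeBounds<usize>) -> &str {
        // Bounds are char indices; an excluded start begins one char later.
        let start = match range.start_bound() {
            Bound::Included(bound) => *bound,
            Bound::Excluded(bound) => bound.saturating_add(1),
            Bound::Unbounded => 0,
        };
        // self.len() is the byte length, which is never less than the char count.
        let end = match range.end_bound() {
            Bound::Included(bound) => bound.saturating_add(1),
            Bound::Excluded(bound) => *bound,
            Bound::Unbounded => self.len(),
        };
        // saturating_sub keeps a reversed range (end < start) from underflowing.
        self.substring(start, end.saturating_sub(start))
    }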
}

fn main() {
    let s = "abcdèfghij";
    // All three statements should print:
    // "abcdè, abcdèfghij, dèfgh, dèfghij."
    println!("{}, {}, {}, {}.",
        s.substring(0, 5),
        s.substring(0, 50),
        s.substring(3, 5),
        s.substring(3, 50));
    println!("{}, {}, {}, {}.",
        s.slice(..5),
        s.slice(..50),
        s.slice(3..8),
        s.slice(3..));
    println!("{}, {}, {}, {}.",
        s.slice(..=4),
        s.slice(..=49),
        s.slice(3..=7),
        s.slice(3..));
}
13 Likes

Maybe this?

let s = "Golden Eagle".chars();
let sub: String = s.take(6).collect();

I don't know how to avoid the allocation, but AFAIK in languages like Java or Go, substring always requires an allocation.

4 Likes

I'd say this is almost correct.
Except as a user you'll likely want to use the UnicodeSegmentation::graphemes() function from the unicode-segmentation crate rather than the built-in .chars() method.

See the link for an example of how to use the UnicodeSegmentation::graphemes() function.

The difference is that the graphemes fn accounts for non-ASCII Unicode “characters” (as in the elements from which a word is formed at the human level; as noted before, Unicode doesn't properly define what a character is), and the .chars() method does not.

1 Like

If you want to know how to avoid allocation, read the routines I posted before your message. Using them, you can write:

let sub = "Golden Eagle".substring(0, 6);

or:

let sub = "Golden Eagle".slice(0..6);

avoiding any allocation.

2 Likes

can you elaborate a bit?

Here is what is documented in Rust:

The char type represents a single character. More specifically, since 'character' isn't a well-defined concept in Unicode, char is a 'Unicode scalar value', which is similar to, but not the same as, a 'Unicode code point'.

So unicode-segmentation is able to handle all Unicode code points?
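On the "scalar value" versus "code point" wording in the quoted documentation: surrogates are code points that are not scalar values, so they cannot be stored in a `char`. A small sketch using `char::from_u32`, with `is_scalar_value` as a made-up helper name:

```rust
// True if `code_point` can be represented as a Rust `char`, i.e. it is
// a Unicode scalar value.
fn is_scalar_value(code_point: u32) -> bool {
    char::from_u32(code_point).is_some()
}

fn main() {
    println!("{}", is_scalar_value(0x00E8));   // 'è'
    println!("{}", is_scalar_value(0xD800));   // a surrogate code point
    println!("{}", is_scalar_value(0x10FFFF)); // highest code point
    println!("{}", is_scalar_value(0x110000)); // outside the Unicode range
}
```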

yeah, but that's too much code just for substring...

how about this?

    let s = "Golden Eagle";
    // Sum the UTF-8 byte lengths of the first 6 chars to get the end offset.
    let end: usize = s.chars().take(6).map(char::len_utf8).sum();
    println!("{}", &s[..end]);
2 Likes

unicode-segmentation doesn't deal with code points; it deals with grapheme clusters, which are one or more scalars that combine together into a single thing, which might or might not appear as a single symbol.

"á" (a followed by the combining acute accent U+0301) and "á" (the precomposed U+00E1) are a single grapheme cluster each, but the first has two chars in it, whilst the second has one.

3 Likes

One issue is that char is fixed width, whereas Unicode graphemes are not. They can be 1 char (e.g. all ASCII letters), or compound, which makes them multi-char (e.g. ë or ö when written with a combining mark).

Aside from that, a char boundary isn't always a grapheme boundary, so taking n chars from the iterator provided by .chars() might not give the output you expect when iterating over text in languages other than English, e.g. Danish or Chinese. You might, for example, end up with an a when you intended to take an á from a Spanish text, because accented letters can be compound graphemes, i.e. a single grapheme that technically consists of 2 chars: the base letter and the accent.
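The Spanish case can be reproduced with the standard library alone. This is only a sketch: `take_chars` is a hypothetical helper, and the two spellings of "árbol" are written with escapes so the difference is visible in the source:

```rust
// Collects the first `n` chars of `s` into a new String (this allocates).
fn take_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    let decomposed = "a\u{301}rbol"; // "árbol" as 'a' + combining U+0301
    let precomposed = "\u{e1}rbol";  // "árbol" with the single char U+00E1

    // Same text on screen, different char counts.
    println!("{} vs {}", decomposed.chars().count(), precomposed.chars().count());

    // Taking one char keeps the accent only in the precomposed form.
    println!("{}", take_chars(decomposed, 1));
    println!("{}", take_chars(precomposed, 1));
}
```

A grapheme-aware iterator, such as the one from the unicode-segmentation crate mentioned in this thread, would treat both spellings of the first letter as one unit.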

TL;DR: unicode is very messily defined, and way more complex than programmers in general assume.
However, I've generally found that with the UnicodeSegmentation::graphemes() method such issues tend to become nonissues, as it handles all the nastiness for me. In contrast, .chars() behaves in rather surprising ways due to being char-based.

1 Like

graphemes() should ideally be a method in std library like chars(). Since the issue can be so confusing for the average programmer, the existence of both these methods on the same struct in the std lib will at least make him or her pause and think instead of blindly going with chars(). In other words it'll be a usability improvement.

1 Like

It was, and then it was pushed out into unicode-segmentation. It's unlikely to ever go back.