How to get a substring of a String

mjais · May 14, 2015, 7:48pm

Hi,
what is the best way to get a substring of a String?
I couldn't find a substr method or similar.

Let's assume I have a String like "Golden Eagle" and I want to get the first 6 characters, that is "Golden".

How can I do that?

Markus

ogeon · May 14, 2015, 8:34pm

Strings can be sliced using the index operator:

let slice = &"Golden Eagle"[..6];
println!("{}", slice);

The syntax is generally v[M..N], where M <= N. This returns a slice from M up to, but not including, N. There is also some syntactic sugar: [..N] (everything up to N), [N..] (everything from N onwards) and [..] (everything).

It's the same for String, as well as vector/array types.
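As a rough illustration of the range forms above, here is a minimal sketch applying them to a String, a vector and an array. The helper name `range_forms` is made up for this example, and the indices are only safe here because the text is pure ASCII.

```rust
// Hypothetical helper: applies the four range forms to an ASCII-only string.
fn range_forms(s: &str) -> (&str, &str, &str, &str) {
    // Indices are byte offsets; for pure ASCII they coincide with character positions.
    (&s[0..6], &s[..6], &s[7..], &s[..])
}

fn main() {
    // A String derefs to str, so the same slicing syntax applies.
    let owned = String::from("Golden Eagle");
    let (a, b, c, d) = range_forms(&owned);
    println!("{} | {} | {} | {}", a, b, c, d);

    // The same syntax works on vectors and arrays.
    let v = vec![10, 20, 30, 40, 50];
    let arr = [1, 2, 3, 4];
    println!("{:?} {:?}", &v[1..3], &arr[2..]);
}
```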

steveklabnik · May 14, 2015, 9:52pm

It's important to note that this slices by byte offsets, so it will not necessarily return the first six characters.

let slice = &"Können"[..6];
println!("{}", slice);

prints Könne.

ogeon · May 14, 2015, 9:54pm

Good point. I thought that something felt fishy when I remembered that you cannot index a string and get a character.

steveklabnik · May 14, 2015, 10:07pm

Yes, and that's exactly why.

The issue with your question, @mjais, is that 'character' isn't a well-defined thing in the Unicode universe. Check out Strings.

rusty · May 15, 2015, 4:15am

Something like Rust Playground should do the trick for indexing Unicode code points, but it still would not handle strings with combining characters. I wonder if there is a more obvious or easier way to do this.
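One possible way to do this, sketched here with the standard `char_indices` iterator, is to look up the byte offset of the n-th char and slice there. The helper name `first_chars` is an assumption made for this example, and like any char-based approach it does not account for combining characters.

```rust
// Hypothetical helper: returns the prefix of `s` holding its first `n` chars
// (Unicode scalar values), or all of `s` if it has fewer than `n` chars.
fn first_chars(s: &str, n: usize) -> &str {
    // `char_indices` yields (byte_offset, char); the byte offset of the
    // char at position `n` is where the prefix ends.
    match s.char_indices().nth(n) {
        Some((byte_end, _)) => &s[..byte_end],
        None => s,
    }
}

fn main() {
    println!("{}", first_chars("Können", 6));
    println!("{}", first_chars("Golden Eagle", 6));
}
```

Because the slice always ends at an offset reported by `char_indices`, it cannot land in the middle of a multi-byte char, and no allocation is needed.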

DanielKeep · May 15, 2015, 5:54am

You want graphemes, but I believe that was de-stabilised because it might/might not be moving to an external crate.

Frankly, "how do I get the first X characters" is almost never a valid question in the first place: there's pretty much no reason to ever do it.

mjais · May 15, 2015, 8:24am

Thanks for all the answers. Very helpful!

I will look at the links.
I understand that the problem is harder than it seems, particularly if one wants Unicode support and efficiency at the same time.

OZ_ · March 27, 2016, 9:37am

Be careful: this code can panic.
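To show where the panic comes from, here is a small sketch using the standard `str::get` and `str::is_char_boundary` methods. The helper name `checked_prefix` is made up for this example; it returns None where the index operator would panic.

```rust
// Hypothetical helper: byte-offset prefix that returns None instead of panicking.
fn checked_prefix(s: &str, byte_end: usize) -> Option<&str> {
    // `str::get` returns None if the range is out of bounds or
    // does not fall on a char boundary.
    s.get(..byte_end)
}

fn main() {
    let s = "Können";
    // 'ö' occupies bytes 1..3, so offset 2 lies inside it; &s[..2] would panic.
    println!("{}", s.is_char_boundary(2));
    println!("{:?}", checked_prefix(s, 2));
    println!("{:?}", checked_prefix(s, 3));
    println!("{:?}", checked_prefix(s, 100));
}
```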

ogeon · March 27, 2016, 11:21am

It's only useful in this special case. See the rest of the comments.

carlomilanesi · February 2, 2019, 2:52pm

This code implements both substring-ing and string-slicing, and should never panic:

use std::ops::{Bound, RangeBounds};

trait StringUtils {
    fn substring(&self, start: usize, len: usize) -> &str;
    fn slice(&self, range: impl RangeBounds<usize>) -> &str;
}

impl StringUtils for str {
    fn substring(&self, start: usize, len: usize) -> &str {
        let mut char_pos = 0;
        let mut byte_start = 0;
        let mut it = self.chars();
        loop {
            if char_pos == start { break; }
            if let Some(c) = it.next() {
                char_pos += 1;
                byte_start += c.len_utf8();
            }
            else { break; }
        }
        char_pos = 0;
        let mut byte_end = byte_start;
        loop {
            if char_pos == len { break; }
            if let Some(c) = it.next() {
                char_pos += 1;
                byte_end += c.len_utf8();
            }
            else { break; }
        }
        &self[byte_start..byte_end]
    }
    fn slice(&self, range: impl RangeBounds<usize>) -> &str {
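        let start = match range.start_bound() {
            Bound::Included(bound) => *bound,
            // An excluded start bound means the slice begins one char later.
            Bound::Excluded(bound) => bound.saturating_add(1),
            Bound::Unbounded => 0,
        };
        // Saturating arithmetic keeps reversed or out-of-range bounds from
        // overflowing; they yield an empty or truncated result instead.
        let len = match range.end_bound() {
            Bound::Included(bound) => bound.saturating_add(1),
            Bound::Excluded(bound) => *bound,
            Bound::Unbounded => self.len(),
        }.saturating_sub(start);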
        self.substring(start, len)
    }
}

fn main() {
    let s = "abcdèfghij";
    // All three statements should print:
    // "abcdè, abcdèfghij, dèfgh, dèfghij."
    println!("{}, {}, {}, {}.",
        s.substring(0, 5),
        s.substring(0, 50),
        s.substring(3, 5),
        s.substring(3, 50));
    println!("{}, {}, {}, {}.",
        s.slice(..5),
        s.slice(..50),
        s.slice(3..8),
        s.slice(3..));
    println!("{}, {}, {}, {}.",
        s.slice(..=4),
        s.slice(..=49),
        s.slice(3..=7),
        s.slice(3..));
}

rockmen1 · February 2, 2019, 4:54pm

Maybe this?

let s = "Golden Eagle".chars();
let sub : String = s.into_iter().take(6).collect();

I don't know how to avoid the allocation, but AFAIK in languages like Java or Go, substring always requires an allocation.

jjpe · February 2, 2019, 5:12pm

I'd say this is almost correct.
Except as a user you'll likely want to use the UnicodeSegmentation::graphemes() function from the unicode-segmentation crate rather than the built-in .chars() method.

See the link for an example of how to use the UnicodeSegmentation::graphemes() function.

The difference is that the graphemes fn accounts for non-ASCII Unicode "characters" (as in the elements from which a word is formed at the human level; as noted before, Unicode doesn't properly define what a character is), and the .chars() method does not.

carlomilanesi · February 2, 2019, 5:26pm

If you want to know how to avoid allocation, read the routines I posted before your message. Using them, you can write:

let sub = "Golden Eagle".substring(0, 6);

or:

let sub = "Golden Eagle".slice(0..6);

avoiding any allocation.

rockmen1 · February 3, 2019, 2:01am

can you elaborate a bit?

Here is what is documented in Rust:

The char type represents a single character. More specifically, since 'character' isn't a well-defined concept in Unicode, char is a 'Unicode scalar value', which is similar to, but not the same as, a 'Unicode code point'.

So unicode-segmentation is able to handle all Unicode code points?

rockmen1 · February 3, 2019, 2:15am

yeah, but that's too much code just for substring...

how about this?

    let s = "Golden Eagle";
    let mut end : usize = 0;
    s.chars().into_iter().take(6).for_each(|x| end += x.len_utf8());
    println!("{}", &s[..end]);

DanielKeep · February 3, 2019, 3:28am

unicode-segmentation doesn't deal with code points, it deals with grapheme clusters, which are one or more scalars that combine together into a single thing, which might or might not appear as a single symbol.

"á" (a followed by the combining acute accent U+0301) and "á" (the precomposed U+00E1) are a single grapheme cluster each, but the first has two chars in it, whilst the second has one.
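To make the difference concrete, here is a small sketch that builds both forms with explicit escapes, assuming the first is the decomposed sequence (a plus U+0301) and the second the precomposed U+00E1. The helper name `char_and_byte_counts` is made up for this example.

```rust
// Hypothetical helper: returns (number of chars, number of UTF-8 bytes).
fn char_and_byte_counts(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    let decomposed = "a\u{301}"; // 'a' + U+0301 COMBINING ACUTE ACCENT
    let precomposed = "\u{e1}";  // U+00E1 LATIN SMALL LETTER A WITH ACUTE
    println!("{:?}", char_and_byte_counts(decomposed));
    println!("{:?}", char_and_byte_counts(precomposed));
    // The two render alike but are different sequences of chars.
    println!("{}", decomposed == precomposed);
}
```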

jjpe · February 3, 2019, 3:35am

One issue is that char is fixed width, whereas Unicode graphemes are not. They can be 1 char (e.g. all ASCII letters), or compound, which makes them multi-char (e.g. ë or ö when written with a combining diaeresis).

Aside from that, a grapheme can span several chars, so taking n chars from the iterator provided by .chars() might not give the output you expect when iterating over non-English text, e.g. Danish or Chinese. You might, for example, end up with an a when you intended to take an á from a Spanish text, as accented letters can be compound graphemes, i.e. technically two chars: the base letter and the accent.

TL;DR: Unicode is very messily defined, and way more complex than programmers in general assume.
However, I've generally found that with the UnicodeSegmentation::graphemes() method such issues tend to become nonissues, as it handles all the nastiness for me. In contrast, .chars() behaves in rather surprising ways due to being char-based.
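The surprising behaviour of .chars() can be reproduced with the standard library alone. This is a minimal sketch, assuming the Spanish word is stored with a decomposed accent; the helper name `take_chars` is made up for this example.

```rust
// Hypothetical helper: collects the first `n` chars of `s` into a new String
// (this allocates).
fn take_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // "más" written with a decomposed accent: m, a, U+0301, s
    let word = "ma\u{301}s";
    let cut = take_chars(word, 2);
    // The combining accent is the third char, so taking two chars drops it.
    println!("{}", cut);
    println!("{}", cut == "ma");
}
```

A grapheme-aware iterator, such as the one in the unicode-segmentation crate mentioned above, would keep the base letter and its accent together.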

gurry · February 3, 2019, 3:39am

graphemes() should ideally be a method in std library like chars(). Since the issue can be so confusing for the average programmer, the existence of both these methods on the same struct in the std lib will at least make him or her pause and think instead of blindly going with chars(). In other words it'll be a usability improvement.

DanielKeep · February 3, 2019, 3:45am

It was, and then it was pushed out into unicode-segmentation. It's unlikely to ever go back.
