Hi,
what is the best way to get a substring of a String?
I couldn't find a substr method or similar.
Let's assume I have a String like "Golden Eagle" and I want to get the first 6 characters, that is "Golden".
How can I do that?
Markus
Strings can be sliced using the index operator:
let slice = &"Golden Eagle"[..6];
println!("{}", slice);
The syntax is generally v[M..N], where M <= N. This returns a slice from M up to, but not including, N. There is also some more sugary syntax, like [..N] (everything up to N), [N..] (everything from N onwards) and [..] (everything).
It's the same for String, as well as vector/array types.
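As a rough illustration of the range forms described above, here is a small sketch. The helper name `range_forms` is made up for this example, and the indices are byte offsets, which happens to be safe here only because the text is plain ASCII.

```rust
// Hypothetical helper: applies the four range forms to a string slice.
// All indices are byte offsets.
fn range_forms(s: &str) -> (&str, &str, &str, &str) {
    (&s[2..6], &s[..6], &s[7..], &s[..])
}

fn main() {
    let s = "Golden Eagle";
    let (mid, head, tail, all) = range_forms(s);
    println!("{} | {} | {} | {}", mid, head, tail, all);

    // The same syntax works on an owned String...
    let owned = String::from(s);
    assert_eq!(&owned[..6], "Golden");

    // ...and on vectors and arrays, where indices count elements.
    let v = vec![10, 20, 30, 40, 50];
    assert_eq!(v[1..3], [20, 30]);
    let a = [1, 2, 3];
    assert_eq!(a[..], [1, 2, 3]);
}
```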
It's important to note that this slices by bytes, not by characters, so it will not necessarily return the first six characters.
let slice = &"Können"[..6];
println!("{}", slice);
prints Könne.
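To make the byte-versus-character point concrete, here is a small sketch using only the standard library. The helper name `byte_prefix` is invented for this example; it relies on `str::get`, which returns `None` instead of panicking when a byte index falls inside a multi-byte character. The string is written with a `\u{f6}` escape so that the ö is unambiguously the single code point U+00F6.

```rust
// Hypothetical helper: the first `n` bytes of `s`, or None if the cut
// would be out of range or would land inside a multi-byte character.
fn byte_prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    let s = "K\u{f6}nnen"; // "Können" with a precomposed ö

    // Six chars, but seven bytes: ö takes two bytes in UTF-8.
    assert_eq!(s.chars().count(), 6);
    assert_eq!(s.len(), 7);

    // Byte index 2 is in the middle of ö, so it is not a char boundary.
    assert!(!s.is_char_boundary(2));
    assert_eq!(byte_prefix(s, 2), None); // &s[..2] would panic instead

    // Six bytes cover only five characters.
    assert_eq!(byte_prefix(s, 6), Some("K\u{f6}nne"));
}
```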
Good point. I thought that something felt fishy when I remembered that you cannot index a string and get a character.
Yes, and that's exactly why
The issue with your question, @mjais, is that 'character' isn't a well-defined thing in the Unicode universe. Check out Strings
Something like Rust Playground should do the trick for indexing Unicode code points, but this still would not handle strings with combining characters. I wonder if there is a more obvious or easier way to do this.
You want graphemes, but I believe that was de-stabilised because it might/might not be moving to an external crate.
Frankly, "how do I get the first X characters" is almost never a valid question in the first place: there's pretty much no reason to ever do it.
Thanks for all the answers. Very helpful!
I will look at the links.
I understand that the problem is harder than it seems, particularly if one wants Unicode support and efficiency at the same time.
Be careful: this code can panic.
It's only useful in this special case. See the rest of the comments.
This code implements both substring-ing and string-slicing, and should never panic:
use std::ops::{Bound, RangeBounds};

trait StringUtils {
    fn substring(&self, start: usize, len: usize) -> &str;
    fn slice(&self, range: impl RangeBounds<usize>) -> &str;
}

impl StringUtils for str {
    // Returns up to `len` chars, starting at char index `start`.
    fn substring(&self, start: usize, len: usize) -> &str {
        let mut char_pos = 0;
        let mut byte_start = 0;
        let mut it = self.chars();
        // Skip `start` chars, tracking the byte offset reached.
        loop {
            if char_pos == start { break; }
            if let Some(c) = it.next() {
                char_pos += 1;
                byte_start += c.len_utf8();
            }
            else { break; }
        }
        // Take up to `len` more chars, extending the byte offset.
        char_pos = 0;
        let mut byte_end = byte_start;
        loop {
            if char_pos == len { break; }
            if let Some(c) = it.next() {
                char_pos += 1;
                byte_end += c.len_utf8();
            }
            else { break; }
        }
        &self[byte_start..byte_end]
    }

    // Like `substring`, but takes a range of char indices.
    fn slice(&self, range: impl RangeBounds<usize>) -> &str {
        let start = match range.start_bound() {
            Bound::Included(bound) => *bound,
            Bound::Excluded(bound) => bound.saturating_add(1),
            Bound::Unbounded => 0,
        };
        // The byte length is an upper bound on the number of chars.
        let end = match range.end_bound() {
            Bound::Included(bound) => bound.saturating_add(1),
            Bound::Excluded(bound) => *bound,
            Bound::Unbounded => self.len(),
        };
        // Saturating arithmetic avoids an overflow panic when the
        // range starts past its end, e.g. `5..3` or `50..` on a short string.
        self.substring(start, end.saturating_sub(start))
    }
}

fn main() {
    let s = "abcdèfghij";
    // All three statements should print:
    // "abcdè, abcdèfghij, dèfgh, dèfghij."
    println!("{}, {}, {}, {}.",
        s.substring(0, 5),
        s.substring(0, 50),
        s.substring(3, 5),
        s.substring(3, 50));
    println!("{}, {}, {}, {}.",
        s.slice(..5),
        s.slice(..50),
        s.slice(3..8),
        s.slice(3..));
    println!("{}, {}, {}, {}.",
        s.slice(..=4),
        s.slice(..=49),
        s.slice(3..=7),
        s.slice(3..));
}
Maybe this?
let s = "Golden Eagle".chars();
let sub : String = s.into_iter().take(6).collect();
I don't know how to avoid the allocation, but AFAIK in languages like Java or Go, taking a substring always requires an allocation.
I'd say this is almost correct.
Except as a user you'll likely want to use the UnicodeSegmentation::graphemes() function from the unicode-segmentation crate rather than the built-in .chars() method.
See the link for an example on how to use the UnicodeSegmentation::graphemes() function.
The difference is that the graphemes fn accounts for Unicode "characters" that are made up of more than one code point (characters as in the elements from which a word is formed at the human level; as noted before, Unicode doesn't properly define what a character is), and the .chars() method does not.
If you want to know how to avoid allocation, read the routines I posted before your message. Using them, you can write:
let sub = "Golden Eagle".substring(0, 6);
or:
let sub = "Golden Eagle".slice(0..6);
avoiding any allocation.
can you elaborate a bit?
Here is what is documented in Rust:
The char type represents a single character. More specifically, since 'character' isn't a well-defined concept in Unicode, char is a 'Unicode scalar value', which is similar to, but not the same as, a 'Unicode code point'.
So unicode-segmentation is able to handle all Unicode code points?
Yeah, but that's too much code just for a substring...
How about this?
let s = "Golden Eagle";
let mut end : usize = 0;
s.chars().into_iter().take(6).for_each(|x| end += x.len_utf8());
println!("{}", &s[..end]);
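A more compact variant of the same idea, offered as a sketch rather than a definitive answer: `char_indices()` yields the byte offset of each char, so the offset of the char at index n is exactly where the first n chars end. The function name `first_n_chars` is made up for this example. Like the snippet above, it counts chars (Unicode scalar values), not grapheme clusters, and it does not allocate.

```rust
// Hypothetical helper: the first `n` chars of `s` as a borrowed slice.
// If `s` has fewer than `n` chars, the whole string is returned.
fn first_n_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `byte_idx` is where char number `n` starts, so it is always
        // a valid char boundary and the slice cannot panic.
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

fn main() {
    println!("{}", first_n_chars("Golden Eagle", 6));
    println!("{}", first_n_chars("K\u{f6}nnen", 2));
}
```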
unicode-segmentation doesn't deal with code points, it deals with grapheme clusters, which are one or more scalars that combine together into a single thing, which might or might not appear as a single symbol.
"á" written as "a" followed by U+0301 (combining acute accent) and "á" written as the single code point U+00E1 are a single grapheme cluster each, but the first has two chars in it, whilst the second has one.
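The difference can be checked with the standard library alone. This is only a sketch: the helper name `char_count` is invented here, and both strings are written with `\u{...}` escapes so that it is unambiguous which form is which.

```rust
// Hypothetical helper: number of Unicode scalar values (chars) in `s`.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let decomposed = "a\u{301}"; // 'a' + combining acute accent
    let precomposed = "\u{e1}";  // single code point for a-acute

    // Both render as one symbol, but they differ in chars and bytes.
    assert_eq!(char_count(decomposed), 2);
    assert_eq!(char_count(precomposed), 1);
    assert_eq!(decomposed.len(), 3);
    assert_eq!(precomposed.len(), 2);

    // Taking one char from the decomposed form silently drops the accent.
    let first: String = decomposed.chars().take(1).collect();
    assert_eq!(first, "a");
}
```

Counting or cutting by grapheme cluster instead needs segmentation rules that the standard library does not provide, which is where the unicode-segmentation crate comes in.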
One issue is that char is fixed width, whereas Unicode graphemes are not. They can be 1 char (e.g. all ASCII letters), or compound, which makes them multi-char (e.g. ë or ö written as a base letter plus a combining diaeresis).
Aside from that, a grapheme can span several chars, so taking n chars from the iterator provided by .chars() might not give the output you expect on non-English text. You might for example end up with an a when you intended to take an á from a Spanish text, because an accented letter can be stored as a compound grapheme, i.e. technically two code points: the base letter and the accent.
TL;DR: Unicode is very messily defined, and way more complex than programmers in general assume.
However, I've generally found that with the UnicodeSegmentation::graphemes() method such issues tend to become nonissues, as it handles all the nastiness for me. In contrast, .chars() behaves in rather surprising ways due to being char-based.
graphemes() should ideally be a method in the std library like chars(). Since the issue can be so confusing for the average programmer, the existence of both these methods on the same struct in the std lib will at least make him or her pause and think instead of blindly going with chars(). In other words, it'll be a usability improvement.
It was, and then it was pushed out into unicode-segmentation. It's unlikely to ever go back.