String library respecting Unicode

Is there a library on crates.io that performs common string operations in terms of Unicode characters instead of bytes? For example, I'm looking for functions like trim, find, split, match (a regex), and so on that don't just treat a string as a sequence of bytes. I know I could call the chars() method on a String and write those functions myself, but it seems like someone has probably already done that.

There's regex, but it's a fairly heavy crate.

The built-in str type and the standard library String type are Unicode strings, stored as UTF-8. All of the standard string methods like find and trim work on arbitrary Unicode text. For example, trim removes leading and trailing characters that have the Unicode White_Space property, not just ASCII whitespace bytes. And find can match arbitrary Unicode characters or substrings; note that the position it returns is a byte offset into the string, not a character count.
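Here is a small std-only sketch of that behavior. The sample string and the char_index helper are made up for illustration; only trim, find, and chars come from the standard library.

```rust
/// Hypothetical helper: converts a byte offset (as returned by `find`)
/// into a count of chars preceding that offset.
fn char_index(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().count()
}

fn main() {
    // U+3000 (ideographic space) and U+00A0 (no-break space) both have the
    // White_Space property, so `trim` strips them even though neither is ASCII.
    // \u{e9} is 'é' and \u{f6} is 'ö'.
    let padded = "\u{3000}h\u{e9}llo w\u{f6}rld\u{a0}";
    let trimmed = padded.trim();
    println!("{:?}", trimmed);

    // `find` takes a char (or substring) pattern and returns a byte offset.
    // Because 'é' occupies two bytes in UTF-8, the byte offset of 'ö' is one
    // larger than its char index.
    if let Some(pos) = trimmed.find('\u{f6}') {
        println!("byte offset {}, char index {}", pos, char_index(trimmed, pos));
    }
}
```

The byte offset is what you want for slicing (for example &trimmed[pos..]); a conversion like char_index is only needed if you have to report a position in characters.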

Another crate you might find useful is unicode-segmentation, if you need to break strings on grapheme cluster or word boundaries as specified by UAX #29.
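To see why grapheme boundaries can matter, here is a std-only sketch showing that chars() yields Unicode scalar values, which are not always what a reader perceives as one character. The scalar_count helper name is hypothetical. Splitting into grapheme clusters is the part the unicode-segmentation crate covers; it is not shown here because it needs the external dependency.

```rust
/// Hypothetical helper: counts Unicode scalar values (what `chars()` yields).
/// This is not the same as counting user-perceived characters.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // 'e' followed by U+0301 (combining acute accent) renders as a single "é",
    // but it is two scalar values and three UTF-8 bytes.
    let decomposed = "e\u{301}";
    println!("decomposed: {} bytes, {} chars", decomposed.len(), scalar_count(decomposed));

    // The precomposed form U+00E9 looks identical on screen, but it is one
    // scalar value and two UTF-8 bytes.
    let precomposed = "\u{e9}";
    println!("precomposed: {} bytes, {} chars", precomposed.len(), scalar_count(precomposed));

    // The two strings are not equal byte-for-byte; std does no normalization.
    println!("equal: {}", decomposed == precomposed);
}
```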

How would you implement this JavaScript code in Rust?

const s = "January|February|March";
const months = s.split('|'); // ["January", "February", "March"]

Like this: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=0f5ecdc8eac91c1885ec223fca5eb5cd

fn main() {
    let s = "January|February|March";
    // Collect the borrowed pieces into a vector of string slices.
    let months: Vec<&str> = s.split('|').collect();
    println!("{:?}", months);
}

Basically, split returns an iterator over string slices (&str) that borrow from the original string, so nothing is copied until you ask for it.

Edit: if you want months to be a Vec<String> instead, use this:

let months: Vec<String> = s.split('|').map(str::to_owned).collect();
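Since the original question was about Unicode, here is a rough sketch showing that the separator passed to split does not have to be ASCII, and that the iterator can be consumed without collecting. The split_on helper and the sample strings are made up for illustration.

```rust
/// Hypothetical helper: splits `s` on a single separator char and returns
/// slices that borrow from `s`.
fn split_on(s: &str, sep: char) -> Vec<&str> {
    s.split(sep).collect()
}

fn main() {
    // Iterate lazily; no Vec is allocated here.
    for month in "January|February|March".split('|') {
        println!("{}", month);
    }

    // U+2192 (rightwards arrow) is a multi-byte character in UTF-8, and the
    // pieces contain non-ASCII letters; split still cuts on char boundaries.
    let seasons = split_on("\u{e9}t\u{e9}\u{2192}hiver\u{2192}printemps", '\u{2192}');
    println!("{:?}", seasons);
}
```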