Unpacking a &str into individual chars

I'm writing a solution to Advent of Code 2016 day 7, and I need to create a function that takes a 4-letter sequence and compares its characters with one another.
Is there a concise way to bind the first n characters of a string to individual variables? Something like let (a, b, c, d) = &s[..4] or let [a, b, c, d, _] = &s[..].

I remember reading something about pattern matching slices of the source string a few months back, but I can't find it.
I also apologize if this has already been asked under a different phrasing that I couldn't think of.

With itertools

use itertools::Itertools;

fn foo(s: &str) {
    let (a, b, c, d) = s.chars().next_tuple().unwrap(); // next_tuple() is None (so this panics) if there are fewer than 4 chars
    dbg!(a, b, c, d);
}

fn main() {
    foo("Hello World!");
}
[src/main.rs:5] a = 'H'
[src/main.rs:5] b = 'e'
[src/main.rs:5] c = 'l'
[src/main.rs:5] d = 'l'
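If panicking on short input is undesirable, the Option that next_tuple() returns can be matched instead. A small variant sketch of the same idea:

use itertools::Itertools;

fn foo(s: &str) {
    // The tuple arity is inferred from the pattern; None on a too-short string.
    if let Some((a, b, c, d)) = s.chars().next_tuple() {
        dbg!(a, b, c, d);
    }
}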

Or without itertools:

fn foo(s: &str) {
    // Pull four chars off the same iterator; each unwrap() panics on a short string.
    let mut chars = s.chars();
    let (a, b, c, d) = (
        chars.next().unwrap(),
        chars.next().unwrap(),
        chars.next().unwrap(),
        chars.next().unwrap(),
    );
    dbg!(a, b, c, d);
}

fn main() {
    foo("Hello World!");
}

There is no pattern matching of chars within a str, as a char is basically a u32 with invariants, while a str is UTF-8 (an encoding in which each char is encoded as a variable number of bytes). Range-indexing a str is possible, but problematic in practice, as it panics when a bound falls on a non-char boundary. You can also pattern match against bytes with .as_bytes(), which is sufficient for some online drill-type problems, but not necessarily a good practice (unless you're in an ASCII-constrained environment, perhaps).
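To illustrate, a minimal sketch of that byte-level matching (note the variables bind to bytes, not chars, so this is only sensible for known-ASCII input):

fn foo(s: &str) {
    // Slice patterns work on &[u8]; `..` ignores any remaining bytes.
    if let [a, b, c, d, ..] = s.as_bytes() {
        dbg!(a, b, c, d);
    }
}

fn main() {
    foo("Hello World!");
}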


Now I'm wondering: how common are actual production environments that are ASCII-constrained?

Also, I remember the Rust documentation mentioning libraries for isolating grapheme clusters. Can a function that assumes ASCII-only input be safely converted into one that works on grapheme clusters? I ask so I can gauge how urgently I should learn to properly handle full Unicode.

(sorry for the very specific additional questions)

Anecdotal, but increasingly rare, I feel. Most environments I personally work with are

  • More OsString-like at the system level -- supersets of ASCII, mainly
    • Sometimes programs/libraries assume ASCII in such environments, or conflate "byte string" with ASCII, but they're incomplete/buggy and generally must be fixed once they become more widely used
  • Unicode of some flavor
  • Variable encoding

It does come up from time to time in e.g. long-lived protocols, but it's nothing I would assume in the modern age.

unicode-segmentation is perhaps the library in question.

There's no direct conversion, as each ASCII character is, well, a fixed size (a byte -- technically a 7-bit value, but almost always stored as a byte), whereas a grapheme cluster may consist of multiple code points.
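To make the difference concrete, here's a small sketch using the unicode-segmentation crate (the example string is my own choice):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // 'e' followed by U+0301 (combining acute accent): two code points, one grapheme.
    let s = "e\u{301}";
    assert_eq!(s.chars().count(), 2);
    assert_eq!(s.graphemes(true).count(), 1); // `true` requests extended grapheme clusters
}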

Encodings, the presentation width of strings, the definition of a character, et cetera are tricky problems at the universal level. The best approach is still largely a use-case-specific concern, but generally speaking, I recommend developing a habit of considering encodings like UTF-8 (which is increasingly common) over assuming fixed-width or no-invariant encodings like ASCII or byte strings.

In particular, Rust Strings/strs are UTF-8, so at least be aware that you can't always logically split a str at an arbitrary byte offset (you may be in the middle of a code point's encoding). If you're splitting Strings at the presentation level, aspire to graphemes. At the parsing/searching/matching level, chars or even bytes are often sufficient.
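As a small illustration of the boundary issue (string chosen here for the example): is_char_boundary tells you whether a byte offset is a valid split point, and char_indices enumerates the valid ones.

fn main() {
    let s = "héllo"; // 'é' is two bytes in UTF-8, occupying byte indices 1 and 2
    assert!(s.is_char_boundary(1));
    assert!(!s.is_char_boundary(2)); // splitting here, e.g. &s[..2], would panic
    // char_indices() yields only valid byte offsets:
    for (i, c) in s.char_indices() {
        println!("{i}: {c}");
    }
}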


Many of the Advent of Code puzzles work more naturally on sequences of characters than on Rust strings. In some of them I wrote

let s: Vec<char> = s.chars().collect();

Then you can, for example, use std::slice::Windows, as in the sketch below.
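For instance, a sketch of the kind of overlapping 4-char check that puzzle calls for (function name and details are my own):

fn has_abba(s: &str) -> bool {
    let chars: Vec<char> = s.chars().collect();
    // windows(4) yields every overlapping run of four chars.
    chars.windows(4).any(|w| {
        if let [a, b, c, d] = w {
            a == d && b == c && a != b
        } else {
            false
        }
    })
}

fn main() {
    assert!(has_abba("xabba"));
    assert!(!has_abba("aaaa"));
}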
