I can't `match` into two different iterators

One approach is to define an enum that implements Iterator:

enum Iter<'c>{
    Chars(Chars<'c>),
    Once(std::iter::Once<char>)
}
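
For completeness, here's a rough sketch of what the Iterator impl for that enum might look like. The enum matches the one above; the caller `escape_dots` is a hypothetical name used only for illustration.

```rust
use std::iter::Once;
use std::str::Chars;

// Same shape as the enum above: one variant per concrete iterator type.
enum Iter<'c> {
    Chars(Chars<'c>),
    Once(Once<char>),
}

impl<'c> Iterator for Iter<'c> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Forward to whichever iterator this value wraps.
        match self {
            Iter::Chars(it) => it.next(),
            Iter::Once(it) => it.next(),
        }
    }
}

// Hypothetical caller: both match arms now have the single type `Iter`.
fn escape_dots(input: &str) -> String {
    input
        .chars()
        .flat_map(|c| match c {
            '.' => Iter::Chars("[.]".chars()),
            c => Iter::Once(std::iter::once(c)),
        })
        .collect()
}
```

Dispatch here is a plain match rather than a vtable call, at the cost of writing the forwarding impl by hand.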

Ah! What a shame. There's a customized error for the case where you're returning the iterator directly (instead of using it as part of a chain), which explains what you might want to do:

help: you could change the return type to be a boxed trait object
  |
1 | fn inner(c: char) -> Box<dyn Iterator<Item = char>> {
  |                      ~~~~~~~                      +
help: if you change the return type to expect trait objects, box the returned expressions
  |
3 ~         '.' => Box::new("[.]".chars()),
4 ~         c => Box::new(std::iter::once(c)),

This (using Box<dyn Trait> to erase the type of both arms and end up with the same type) is not the only solution. As @vague points out, you can also create a new type that implements the trait you need (in this case Iterator), so that there's a single type that dispatches to the right variant. The benefit of that approach is that you no longer need to go through a vtable, which might be more efficient (this depends on a lot of different things; if you don't know about this already, you probably shouldn't be too concerned about it). It also lets you use traits that are not "object safe", i.e. traits that can't be used as dyn Trait, for example because they have generic methods, associated constants, or methods that return Self.

Long story short, I'd write:

fn myfunc(input: String) -> String {
    input.chars().flat_map(|c| {
        match c {
            '.' => Box::new("[.]".chars()) as Box<dyn Iterator<Item = char>>,
            c => Box::new(std::iter::once(c)),
        }
    }).collect()
}

Skipping flat_map altogether is also an option for this snippet:

pub fn myfunc(input: String) -> String {
    let mut s = String::with_capacity(input.len() * 2);
    for c in input.chars() {
        match c {
            '.' => s.push_str("[.]"),
            _ => s.push(c),
        }
    }
    s
}

Thanks. I understand that there are tons of other ways. I got to the point where I can't find a way to make those two iterators "the same" for the flat_map (or for the match, to be precise).

Given the discussion above, I'm starting to believe that there is no clear, natural way to do so, and that it's not my ignorance but simply "you can't do this".

Is that so? Because when I hit the error, I assumed I was just missing something obvious and well-known.

The obvious and well-known rule is that every match arm has to evaluate to the same type. So you just can't do that, AFAIK.

Sometimes you can force the type of two branches to be the same by adding some "redundant" adaptors. E.g., if you take(n) in one branch, you can take(usize::MAX) in the other. It's possible here too, but YMMV on whether it's worth the readability hit:

fn myfunc(input: String) -> String {
    input.char_indices().flat_map(|(i, c)| {
        match c {
            '.' => "[.]".chars().take(3),
            _ => input[i..].chars().take(1),
        }
    }).collect()
}

This could be improved by using the currently-unstable ceil_char_boundary to extract the slice input[i..j] that comprises exactly the char at i; then there wouldn't be a need for the chars().take() dance at all.
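
As a minimal sketch of the take(n) / take(usize::MAX) case mentioned above: the helper `limited` and its parameters are made up for illustration, and in real code `take(limit.unwrap_or(usize::MAX))` would be shorter.

```rust
// Hypothetical helper: keep only the first `n` items when a limit is given.
fn limited(items: &[u32], limit: Option<usize>) -> Vec<u32> {
    let iter = match limit {
        Some(n) => items.iter().take(n),
        // Redundant adaptor: it never truncates in practice, but it gives this
        // arm the same `Take<slice::Iter<u32>>` type as the arm above.
        None => items.iter().take(usize::MAX),
    };
    iter.copied().collect()
}
```

Calling `limited(&[1, 2, 3], Some(2))` keeps the first two items, while `limited(&[1, 2, 3], None)` keeps all of them.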


(But I thought there was some kind of "converting iterator" that could be stuck onto both branches to converge them to the same type... There is none, got it.)

Thank you.

Unfortunately not in the standard library right now, but there's the either crate that you can use for this purpose, among others.


Wow, elegant! Thank you for mentioning either. I'm aware of that crate and its implementations, but I just haven't tried it like this :laughing:

use either::Either;
pub fn myfunc(input: String) -> String {
    input
        .chars()
        .flat_map(|c| match c {
            '.' => Either::Left("[.]".chars()),
            c => Either::Right(std::iter::once(c)),
        })
        .collect()
}

To point out the elephant in the room, your function is equivalent to input.replace('.', "[.]").
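
Spelled out as a function, that would be something like the following (the wrapper name `escape_dots_replace` is just a placeholder; `str::replace` does all the work):

```rust
// Hypothetical wrapper around `str::replace` with a `char` pattern.
fn escape_dots_replace(input: &str) -> String {
    input.replace('.', "[.]")
}
```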


I looked at how it's done, and huh, it's not for the faint-hearted...

    #[cfg(not(no_global_oom_handling))]
    #[rustc_allow_incoherent_impl]
    #[must_use = "this returns the replaced string as a new allocation, \
                  without modifying the original"]
    #[stable(feature = "rust1", since = "1.0.0")]
    #[inline]
    pub fn replace<'a, P: Pattern<'a>>(&'a self, from: P, to: &str) -> String {
        let mut result = String::new();
        let mut last_end = 0;
        for (start, part) in self.match_indices(from) {
            result.push_str(unsafe { self.get_unchecked(last_end..start) });
            result.push_str(to);
            last_end = start + part.len();
        }
        result.push_str(unsafe { self.get_unchecked(last_end..self.len()) });
        result
    }

Well, it's not exactly complicated though, is it? It just repeatedly finds the next match of the pattern and copies over the preceding part followed by the replacement. It's not even particularly optimized (apart from the unsafe range getters).
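
To make that concrete, here's a simplified, safe sketch of the same loop. It only accepts &str patterns and uses checked slicing instead of get_unchecked; `replace_sketch` is a made-up name, not the actual std implementation.

```rust
// Simplified sketch of the loop above, restricted to `&str` patterns.
fn replace_sketch(haystack: &str, from: &str, to: &str) -> String {
    let mut result = String::new();
    let mut last_end = 0;
    for (start, part) in haystack.match_indices(from) {
        // Copy the unmatched text that precedes this match...
        result.push_str(&haystack[last_end..start]);
        // ...then the replacement, and skip past the matched text.
        result.push_str(to);
        last_end = start + part.len();
    }
    // Copy whatever is left after the final match.
    result.push_str(&haystack[last_end..]);
    result
}
```

The indices returned by match_indices always fall on char boundaries, which is why the checked slices here don't panic.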

Another trick that can sometimes work is to give out slices in both branches.

That doesn't work with .chars(), since that yields char values rather than references, but UTF-8 has the nice property that an ASCII byte only ever appears as that character, never as part of a multi-byte sequence.

So here's another way that works:

pub fn myfunc(input: &str) -> String {
    let bytes: Vec<u8> = input
        .as_bytes()
        .into_iter()
        .flat_map(|c| match c {
            b'.' => b"[.]",
            c => std::slice::from_ref(c),
        })
        .copied()
        .collect();
    String::from_utf8(bytes).unwrap()
}

The core trick here is std::slice::from_ref, which turns a &T into a &[T], allowing that arm to type-unify with the &'static [u8; 3] in the other arm.
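
In isolation, the from_ref trick looks roughly like this. The function name `expand` is hypothetical, and the explicit return type is there so both arms are coerced to a plain `&[u8]`.

```rust
// Both arms end up as `&[u8]`: the byte-string literal is unsized from
// `&[u8; 3]`, and `from_ref` views a single `&u8` as a one-element slice.
fn expand(byte: &u8) -> &[u8] {
    match byte {
        b'.' => b"[.]",
        other => std::slice::from_ref(other),
    }
}
```

No allocation or copying happens in either arm; from_ref just reinterprets the reference.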

As others have done here, I won't say that this is a better solution in this particular case, but hopefully it gives you a useful technique to think about in other things later.


EDIT: I looked at this again and realized there's a better way. It can map to a &str instead of needing flat_map, since you can collect a String from &strs:

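pub fn myfunc(input: &str) -> String {
    input
        .as_bytes()
        .iter()
        .map(|c| match c {
            b'.' => "[.]",
            // Caveat: a lone byte of a multi-byte sequence is not valid UTF-8,
            // so this unwrap panics on non-ASCII input; ASCII-only input is fine.
            c => std::str::from_utf8(std::slice::from_ref(c)).unwrap(),
        })
        .collect()
}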

That'll be better for cache locality too, since it doesn't have that extra pass at the end for UTF-8 rechecking.

And, more generally, that leads into better text-handling patterns, since char is really not what you usually want.

For example, to support emoji

assert_eq!(myfunc("as🇨🇦df"), "as🍁df");
assert_eq!(myfunc(".a s.d f."), "[.]a s[.]d f[.]");

you can do this:

use unicode_segmentation::UnicodeSegmentation; // 1.10.0
pub fn myfunc(input: &str) -> String {
    input
        .graphemes(true)
        .map(|c| match c {
            "🇨🇦" => "🍁",
            "." => "[.]",
            c => c,
        })
        .collect()
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=b71f6e76e782b57ae81a6474cf3034d8

And if you're wondering why chars don't work for that, it's because the flags aren't one codepoint:

error: character literal may only contain one codepoint
  --> src/main.rs:17:13
   |
17 |     let x = '🇨🇦';
   |             ^^^^

(Nor are a variety of other emoji, or even certain accented letters.)
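
A quick way to see this for yourself, using only std (the helper name `scalar_count` is made up):

```rust
// Number of `char`s (Unicode scalar values) in a string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

With this, `scalar_count("🇨🇦")` is 2 (two regional-indicator code points), and `scalar_count("e\u{301}")` is also 2 (an "e" followed by a combining acute accent), whereas the precomposed `"\u{e9}"` counts as 1.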


Thank you!

This is the wisdom I come here for!

I continue to think about this problem...

Can I iterate over a &str as a sequence of &str slices, each holding a single UTF-8 character?

Something like .iter_as_ref_str() with Item = &str, where each item always has len() == 1.

Like this?

fn iter_as_ref_str(s: &str) -> impl Iterator<Item = &str> {
    s.char_indices().map(|(idx, c)| &s[idx..idx + c.len_utf8()])
}
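
A usage sketch for that helper, repeated here so the snippet is self-contained. The `move` on the closure is my addition so it doesn't depend on edition-specific closure capture rules, and `char_slice_lens` is a hypothetical name.

```rust
fn iter_as_ref_str(s: &str) -> impl Iterator<Item = &str> {
    // Each item is the sub-slice covering exactly one `char`.
    s.char_indices().map(move |(idx, c)| &s[idx..idx + c.len_utf8()])
}

// Hypothetical helper: byte length of each one-char slice.
fn char_slice_lens(s: &str) -> Vec<usize> {
    iter_as_ref_str(s).map(str::len).collect()
}
```

Note that the slices hold one char each, but their len() is measured in bytes, so it varies from 1 to 4.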

A person on r/rust showed me str.split_inclusive(|_| true), and it really worked!

pub fn myfunc(input: String) -> String {
    input
        .split_inclusive(|_| true)
        .map(|c| match c {
            "." => "[.]",
            c => c,
        })
        .collect()
}

It actually looks like the simplest of all (except for replace, of course).


You can't do that because UTF-8 is a variable-width encoding. Code points >= 128 will always be encoded as multiple bytes.

Oh, len for str returns the number of bytes, not the number of characters. What I meant was "one character per slice".

fn main() {
    println!("{:?}", "🥒🍅".split_inclusive(|_| true).map(|c| c.len()).collect::<Vec<usize>>());
}

=> [4, 4]

It may not matter here, but note that a char is a Unicode scalar value, which is not necessarily what a human would consider a character. When you want the latter, you probably want grapheme clusters.
