Unicode-lowercasing as an iterator adapter

#1

Is there already a crate that provides an iterator adapter that consumes an iterator yielding char and itself yields char with default (non-Turkish) lower-casing applied (taking into account Greek final sigma and such)?

AFAICT, the unic family of crates provides e.g. normalization as iterator adapters, but I can’t find a lower-casing iterator adapter among them.

#2

You could use the default char::to_lowercase and build a string out of it, e.g.

fn main() {
    let s = "Δ-Straẞe-İ";
    println!("{}", s.chars().flat_map(|c| c.to_lowercase()).collect::<String>());
}
#3

No, you can’t. It doesn’t handle the final Greek sigma correctly. See: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ce0c978e42c3e05025db6278c26e2fad
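
For readers who can’t open the playground link, here is a minimal sketch of the difference (not necessarily the same code as in the gist). The helper names `lower_per_char` and `lower_whole_str` are made up for illustration. The first applies `char::to_lowercase` to each character in isolation, so it has no context to decide whether a sigma is final. The second calls `str::to_lowercase`, which sees the whole string and applies the final-sigma rule.

```rust
/// Lowercases each `char` independently; a capital sigma always becomes
/// the non-final form U+03C3, because no context is available.
fn lower_per_char(s: &str) -> String {
    s.chars().flat_map(|c| c.to_lowercase()).collect()
}

/// Lowercases the string as a whole; the standard library maps a
/// word-final capital sigma to the final form U+03C2.
fn lower_whole_str(s: &str) -> String {
    s.to_lowercase()
}
```

For an input such as "ΟΔΟΣ", the two functions differ only in the last character: the per-char version ends in σ, the whole-string version ends in ς.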

#4

Then that’s something we need to add. Feel free to open an issue for it. (Progress is very slow, sadly.)

#5

Sadly, it’s not even possible to implement this without the possibility of heap allocation. The definition is that C is a final sigma if C is sigma and “C is preceded by a sequence consisting of a cased letter and then zero or more case-ignorable characters, and C is not followed by a sequence consisting of zero or more case-ignorable characters and then a cased letter.” The before condition can be handled with a finite-space state machine, but the number of case-ignorable characters after C is not bounded, so the iterator adapter has to buffer an unbounded number of characters.
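
To make the buffering argument concrete, here is a rough sketch of such an adapter. It is not how any existing crate does it. The names `Lower`, `is_cased`, `is_case_ignorable` and `followed_by_cased` are invented for this example. The two property checks are approximations, since the standard library does not expose the Unicode Cased and Case_Ignorable properties: `is_cased` misses titlecase letters, and `is_case_ignorable` covers only a small hand-picked subset. A real implementation would need proper property tables.

```rust
use std::collections::VecDeque;
use std::iter::Fuse;

const CAPITAL_SIGMA: char = '\u{03A3}';
const SMALL_SIGMA: char = '\u{03C3}';
const FINAL_SIGMA: char = '\u{03C2}';

// ASSUMPTION: approximates the Unicode `Cased` property (misses titlecase letters).
fn is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase()
}

// ASSUMPTION: a tiny hand-picked subset of `Case_Ignorable`
// (some punctuation, soft hyphen, combining diacritical marks).
fn is_case_ignorable(c: char) -> bool {
    matches!(c, '\'' | '.' | ':' | '\u{00AD}' | '\u{00B7}' | '\u{2019}'
        | '\u{0300}'..='\u{036F}')
}

/// Hypothetical adapter: lowercases a `char` stream with the final-sigma rule.
struct Lower<I: Iterator<Item = char>> {
    iter: Fuse<I>,
    lookahead: VecDeque<char>, // raw input read ahead; unbounded in the worst case
    out: VecDeque<char>,       // lowercased output not yet yielded
    after_cased: bool,         // "cased letter, then zero or more case-ignorables" seen
}

impl<I: Iterator<Item = char>> Lower<I> {
    fn new(iter: I) -> Self {
        Lower {
            iter: iter.fuse(),
            lookahead: VecDeque::new(),
            out: VecDeque::new(),
            after_cased: false,
        }
    }

    /// Scans forward over case-ignorable characters, buffering everything it
    /// reads, and reports whether a cased letter follows them.
    fn followed_by_cased(&mut self) -> bool {
        let mut i = 0;
        loop {
            if i == self.lookahead.len() {
                match self.iter.next() {
                    Some(n) => self.lookahead.push_back(n),
                    None => return false,
                }
            }
            let n = self.lookahead[i];
            if is_cased(n) {
                return true;
            }
            if !is_case_ignorable(n) {
                return false;
            }
            i += 1;
        }
    }
}

impl<I: Iterator<Item = char>> Iterator for Lower<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        loop {
            if let Some(c) = self.out.pop_front() {
                return Some(c);
            }
            // Consume buffered lookahead before pulling fresh input.
            let c = match self.lookahead.pop_front() {
                Some(c) => c,
                None => self.iter.next()?,
            };
            if c == CAPITAL_SIGMA {
                let is_final = self.after_cased && !self.followed_by_cased();
                self.out.push_back(if is_final { FINAL_SIGMA } else { SMALL_SIGMA });
            } else {
                // One input char may lowercase to several output chars.
                self.out.extend(c.to_lowercase());
            }
            // Finite-state part: track the "before" condition.
            if is_cased(c) {
                self.after_cased = true;
            } else if !is_case_ignorable(c) {
                self.after_cased = false;
            }
        }
    }
}
```

Usage would look like `Lower::new(s.chars()).collect::<String>()`. The `after_cased` flag is the finite-space state machine for the "before" condition. The `lookahead` queue is the part that can grow without bound: a sigma followed by an arbitrarily long run of case-ignorable characters cannot be emitted until that run ends.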

#6

I needed something now, so I published https://crates.io/crates/iterlower. I’d be happy to add a retirement notice to the crate when unic implements analogous functionality.
