Unicode-lowercasing as an iterator adapter

Is there already a crate that provides an iterator adapter that consumes an iterator yielding char and itself yields char with default (non-Turkish) lower-casing applied (taking into account Greek final sigma and such)?

AFAICT, the unic family of crates provides e.g. normalization as iterator adapters, but I can't find a lower-casing iterator adapter among them.

You could use the default char::to_lowercase and build a string out of it, e.g.

fn main() {
    let s = "Δ-Straẞe-İ";
    println!("{}", s.chars().flat_map(|c| c.to_lowercase()).collect::<String>());
}

No, you can't. It doesn't handle the final Greek sigma correctly. See: Rust Playground
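To make the difference concrete, here is a small check. The test string is my own pick, not necessarily what the linked playground uses, and `lower_per_char` is just a helper name for this example. It compares the per-char approach with `str::to_lowercase`, which does see the whole string:

```rust
/// Lowercases each char in isolation, like the flat_map snippet above.
/// (Hypothetical helper name, only for this example.)
fn lower_per_char(s: &str) -> String {
    s.chars().flat_map(char::to_lowercase).collect()
}

fn main() {
    let s = "ΣΑΣ";

    // Per-char mapping has no context, so the trailing sigma
    // comes out as the medial form σ (U+03C3).
    assert_eq!(lower_per_char(s), "σασ");

    // str::to_lowercase looks at the neighbours and emits
    // the final form ς (U+03C2) at the end of the word.
    assert_eq!(s.to_lowercase(), "σας");

    println!("{} vs. {}", lower_per_char(s), s.to_lowercase());
}
```

So the context-sensitive logic exists in std, but only for `str`, not as something you can put on top of an arbitrary iterator of `char`.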


Then that's something we need to add. Feel free to open an issue for it. (Progress is very slow, sadly.)

Sadly, it's not even possible to implement this without the possibility of heap allocation. The definition is that C is a final sigma if C is sigma and "C is preceded by a sequence consisting of a cased letter and then zero or more case-ignorable characters, and C is not followed by a sequence consisting of zero or more case-ignorable characters and then a cased letter." The before condition can be handled with a finite-space state machine, but the number of case-ignorable characters after C is not bounded, so the iterator adapter has to buffer an unbounded number of characters.
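For illustration, a rough sketch of what such an adapter could look like. This is not the code of any published crate: the type name `Lowercase`, the helper `lower`, and especially `is_cased` / `is_case_ignorable` are my own stand-ins. std does not expose the Unicode `Cased` and `Case_Ignorable` properties, so the two predicates below are only crude approximations that cover a handful of characters. The point is the shape: one `bool` for the "before" condition, and a growable queue for the lookahead.

```rust
use std::collections::VecDeque;
use std::iter::Fuse;

// ASSUMPTION: crude approximation of the Unicode `Cased` property.
fn is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase()
}

// ASSUMPTION: tiny subset of `Case_Ignorable` (a few punctuation marks,
// soft hyphen, and the combining diacritics block), not the real property.
fn is_case_ignorable(c: char) -> bool {
    matches!(c, '\'' | '.' | ':' | '\u{00AD}' | '\u{2019}' | '\u{0300}'..='\u{036F}')
}

/// Hypothetical adapter: lowercases a stream of chars with the final sigma rule.
struct Lowercase<I: Iterator<Item = char>> {
    inner: Fuse<I>,
    lookahead: VecDeque<char>, // chars read ahead while scanning past a sigma
    pending: VecDeque<char>,   // lowercased output not yet yielded
    after_cased: bool,         // the "before" condition, constant-size state
}

impl<I: Iterator<Item = char>> Lowercase<I> {
    fn new(inner: I) -> Self {
        Lowercase {
            inner: inner.fuse(),
            lookahead: VecDeque::new(),
            pending: VecDeque::new(),
            after_cased: false,
        }
    }

    /// True if the upcoming input is zero or more case-ignorable chars and
    /// then a cased letter. May buffer arbitrarily many chars in `lookahead`.
    fn followed_by_cased(&mut self) -> bool {
        let mut i = 0;
        loop {
            if i == self.lookahead.len() {
                match self.inner.next() {
                    Some(c) => self.lookahead.push_back(c),
                    None => return false,
                }
            }
            let c = self.lookahead[i];
            if !is_case_ignorable(c) {
                return is_cased(c);
            }
            i += 1;
        }
    }
}

impl<I: Iterator<Item = char>> Iterator for Lowercase<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        loop {
            if let Some(c) = self.pending.pop_front() {
                return Some(c);
            }
            // Drain buffered lookahead before pulling fresh input.
            let c = match self.lookahead.pop_front() {
                Some(c) => c,
                None => self.inner.next()?,
            };
            if c == 'Σ' {
                let is_final = self.after_cased && !self.followed_by_cased();
                self.pending.push_back(if is_final { 'ς' } else { 'σ' });
            } else {
                self.pending.extend(c.to_lowercase());
            }
            // Case-ignorable chars leave the "before" state untouched.
            if !is_case_ignorable(c) {
                self.after_cased = is_cased(c);
            }
        }
    }
}

fn lower(s: &str) -> String {
    Lowercase::new(s.chars()).collect()
}

fn main() {
    // On these samples the sketch agrees with str::to_lowercase.
    for s in ["ΣΑΣ", "ΑΣ.Α", "ΑΣ...", "Σ"] {
        assert_eq!(lower(s), s.to_lowercase());
        println!("{} -> {}", s, lower(s));
    }
}
```

With an input like a sigma followed by a million apostrophes, `lookahead` grows to a million entries before the first `σ` or `ς` can be yielded, which is exactly the unbounded buffering described above.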

I needed something now, so I published https://crates.io/crates/iterlower. I'd be happy to add a retirement notice to the crate when unic implements analogous functionality.

