Replacing German umlauts


#1

Hello,
I have to replace German umlauts in a file name with ae, ue, oe, etc., but I can't think of an elegant solution. In other languages, for example D, I'm used to a translate method that accepts a hash map and replaces the keys with their respective values. However, I haven't yet discovered anything similar in Rust.
Here is my current solution, but I think it could be improved:

use std::collections::HashMap;

trait Transcode {
    fn encode(&self) -> String;
}

impl Transcode for str {
    fn encode(&self) -> String {
        let mut trans = HashMap::new();
        trans.insert('ä', 'a');
        trans.insert('ü', 'u');
        trans.insert('ö', 'o');
        trans.insert('Ä', 'A');
        trans.insert('Ü', 'U');
        trans.insert('Ö', 'O');
        
        let mut cs = Vec::new();
        for c in self.chars() {
            if let Some(t) = trans.get(&c) {
                cs.push(*t);
                cs.push('e');
            } else {
                cs.push(c);
            }
        }

        let s: String = cs.into_iter().collect();

        return s;
    }
}

#2

A few comments:

  • no need to push into a Vec; you can push directly into a new String (push and push_str).
  • instead of hardcoding the e, I’d put strings into the replacement map, which makes it easier to generalize later (ß -> ss comes to mind).
  • depending on the frequency of calls, and the performance requirement, you want to either build the HashMap outside the function (see lazy_static crate), or even use a static hash (see phf family of crates).
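Putting those suggestions together might look something like the sketch below. It uses std's `OnceLock` to build the map once (the post names the lazy_static crate; `OnceLock` achieves the same effect in std since Rust 1.70), and the function and map names are mine:

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Build the replacement map exactly once; later calls reuse it.
fn umlaut_map() -> &'static HashMap<char, &'static str> {
    static MAP: OnceLock<HashMap<char, &'static str>> = OnceLock::new();
    MAP.get_or_init(|| {
        HashMap::from([
            ('ä', "ae"), ('ö', "oe"), ('ü', "ue"),
            ('Ä', "Ae"), ('Ö', "Oe"), ('Ü', "Ue"),
            ('ß', "ss"), // string values make this case trivial
        ])
    })
}

fn encode(s: &str) -> String {
    // Push directly into a String instead of going through a Vec.
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match umlaut_map().get(&c) {
            Some(rep) => out.push_str(rep),
            None => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(encode("Über Straße"), "Ueber Strasse");
}
```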

#3

Well, first of all, using chars for this is just wrong: that fails to take decomposed characters into account. You’ll need to use something like the unicode-segmentation crate to split the string into grapheme clusters. That, or normalise the string before processing.
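A minimal demonstration of the decomposition pitfall, spelling out both forms with Unicode escapes:

```rust
fn main() {
    // "ä" can be a single precomposed code point (U+00E4) or the pair
    // 'a' + combining diaeresis (U+0308); both render identically.
    let precomposed = "\u{00E4}";
    let decomposed = "a\u{0308}";

    // chars() yields one char in the first case but two in the second,
    // so a char-keyed lookup only catches the precomposed form.
    assert_eq!(precomposed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);
    assert!(!decomposed.contains('ä')); // the replacement map would miss it
}
```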

Secondly, if you’ve got a relatively limited list of replacements like this, using match instead of dynamic lookup would probably be faster.
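A sketch of that match-based lookup (the names are mine, and it assumes the input has already been NFC-normalized as discussed above):

```rust
// For a small fixed set, a match avoids hashing entirely and lets
// the compiler optimize the dispatch.
fn replace_umlaut(c: char) -> Option<&'static str> {
    match c {
        'ä' => Some("ae"),
        'ö' => Some("oe"),
        'ü' => Some("ue"),
        'Ä' => Some("Ae"),
        'Ö' => Some("Oe"),
        'Ü' => Some("Ue"),
        'ß' => Some("ss"),
        _ => None,
    }
}

fn encode(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match replace_umlaut(c) {
            Some(rep) => out.push_str(rep),
            None => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(encode("Größe"), "Groesse");
}
```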

birkenfeld has already covered the rest. :slight_smile:


#4

Damn, I always fail to think of Unicode normalization. Good point!


#5

No one expects Unicode normalisation! Cardinal Fang, fetch… the NFC algorithm!


#6

There’s the unidecode crate, which handles this for many languages.


#7

Thanks for the suggestions! I’ll give rust-unidecode a try. :slight_smile: