How to do Python's str.translate() in Rust?

I need to be able to replace ligatures in text with their spelled out characters. I'm doing it like this:

fn main() {
    let t = "please ﬁle the ﬂight plans";
    let s = t.replace("ﬀ", "ff").replace("ﬁ", "fi").replace("ﬂ", "fl")
             .replace("ﬃ", "ffi").replace("ﬄ", "ffl").replace("ﬅ", "ſt")
             .replace("ﬆ", "st");
    println!("{}\n{}", t, s);
}

This works fine, but I can't help wondering if there's a faster way?

In Python you can create a translation table (essentially a map from char to String), but I don't know if there's any equivalent in Rust.

I can't test it right now, but I'd do it roughly this way:

let mut table: HashMap<char, String> = HashMap::new();
table.insert(...);
...
let t = "please ﬁle the ﬂight plans";
let s = t.chars().map(|c| match table.get(&c) {
  Some(rep) => rep.clone(),
  None => c.to_string(),
}).collect::<String>();

Are you sure about this one? The ﬅ ligature looks like it is "f" and "t", not "long s" and "t". Also, at least by the German rules for the long s, a ligature containing it as the first letter does not even make sense: it is only allowed as the last letter in a word, or in compound words when it sits at the word boundary, and at word boundaries ligatures are not allowed.

@NobbZ: I like the general strategy that you are proposing. Some nitpicks:

  • I'm not sure if you are allowed to collect an Iterator<String> into a String. I think something like flat_map might be needed here, but maybe I'm wrong.
  • This solution allocates one String per input char, whether it's a ligature or not. There has to be a way of doing this which is less allocation-heavy.
  • For even more speed, one might want to avoid the overhead of UTF-8 decoding and directly provide the raw UTF-8 for the ligatures, but that might be over-optimizing.
  • I think that as currently defined, this solution does not give the output String any hint about the capacity that should be allocated, which will result in reallocation and memory traffic. The length of the input string should be a decent guess of the output capacity.

Here is a working demo which features (almost) all of the above changes:

// NOTE: Here I also assume that a ligature table is available at
//       compile time and &'static strings can thus be used.
let mut table: HashMap<char, &str> = HashMap::new();
table.insert('ﬁ', "fi");
table.insert('ﬂ', "fl");
/* ... */

let t = "please file the flight plans";
let mut s = String::with_capacity(t.len());
for c in t.chars() {
    match table.get(&c) {
        Some(rep) => s.push_str(rep),
        None => s.push(c),
    }
}
    
assert_eq!(s, "please file the flight plans");

I initially tried to stick with Iterator<char>s, but that ultimately ended up bringing more trouble than a regular for loop, because you need a different iterator type depending on whether the char is in the table or not. Sometimes, imperative code wins the readability war :slight_smile:
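
For the record, one way to keep the iterator style is to map every char into a Cow<str>, which String knows how to collect. A rough, untested sketch (the function name is mine, and it assumes the same char-keyed ligature table as above):

use std::borrow::Cow;
use std::collections::HashMap;

// Both match arms produce a Cow<str>, so the types line up and the
// iterator collects directly into a String.
fn normalize_iter(s: &str, table: &HashMap<char, &'static str>) -> String {
    s.chars()
        .map(|c| match table.get(&c) {
            Some(rep) => Cow::Borrowed(*rep),
            None => Cow::Owned(c.to_string()),
        })
        .collect()
}

It still allocates a String for every non-ligature char, though, so the plain for loop is probably both faster and clearer.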

I did not include the change of directly parsing the ligatures in the input UTF-8, because it requires more work and might not be needed.

I'm not sure about the last one. But regarding speed, using a hashmap seems a lot slower, at least the way I've done it:

extern crate fnv;

use fnv::FnvHashMap;
use std::collections::HashMap;
use std::time::{Duration, Instant};

fn main() {
    let n = 1_000_000;
    let now = Instant::now();
    test(normalize1, n);
    let elapsed = now.elapsed();
    println!("String::replace {}", secs(&elapsed));
    let now = Instant::now();
    test(normalize2, n);
    let elapsed = now.elapsed();
    println!("HashMap         {}", secs(&elapsed));
    let now = Instant::now();
    test(normalize3, n);
    let elapsed = now.elapsed();
    println!("FnvHashMap      {}", secs(&elapsed));
}

fn test(normalize: fn(&str)->String, n: u32) -> bool {
    for _ in 0..n {
        let t = "please file the flight plans";
        let e = "please file the flight plans";
        let s = normalize(t);
        if e != s {
            panic!("test_replace failed: {} != {}", e, s);
        }
    }
    return true;
}

fn normalize1(s: &str) -> String {
    s.replace("ff", "ff").replace("fi", "fi").replace("fl", "fl")
     .replace("ffi", "ffi").replace("ffl", "ffl").replace("ſt", "ſt")
     .replace("st", "st")
}

fn normalize2(s: &str) -> String {
    let mut table: HashMap<char, String> = HashMap::new();
    table.insert('ﬀ', "ff".to_string());
    table.insert('ﬁ', "fi".to_string());
    table.insert('ﬂ', "fl".to_string());
    table.insert('ﬃ', "ffi".to_string());
    table.insert('ﬄ', "ffl".to_string());
    table.insert('ﬅ', "ſt".to_string());
    table.insert('ﬆ', "st".to_string());
    s.chars().map(|c| match table.get(&c) {
        Some(rep) => rep.to_string(),
        None => c.to_string(),
    }).collect::<String>()
}

fn normalize3(s: &str) -> String {
    let mut table: FnvHashMap<char, String> = FnvHashMap::default();
    table.insert('ﬀ', "ff".to_string());
    table.insert('ﬁ', "fi".to_string());
    table.insert('ﬂ', "fl".to_string());
    table.insert('ﬃ', "ffi".to_string());
    table.insert('ﬄ', "ffl".to_string());
    table.insert('ﬅ', "ſt".to_string());
    table.insert('ﬆ', "st".to_string());
    s.chars().map(|c| match table.get(&c) {
        Some(rep) => rep.to_string(),
        None => c.to_string(),
    }).collect::<String>()
}

fn secs(duration: &Duration) -> String {
    format!("{:.03} sec", duration.as_secs() as f64 +
                          duration.subsec_nanos() as f64 * 1e-9)
}

For me the timings are:

String::replace 0.531 sec
HashMap         2.384 sec
FnvHashMap      2.170 sec

±0.1 sec on multiple runs

@mark: The difference likely emerges from the inefficiencies which I pointed out above. This version is comparable with the replace-based version in my quick tests:

fn normalize2(s: &str) -> String {
    let mut table: HashMap<char, &str> = HashMap::new();
    table.insert('ﬀ', "ff");
    table.insert('ﬁ', "fi");
    table.insert('ﬂ', "fl");
    table.insert('ﬃ', "ffi");
    table.insert('ﬄ', "ffl");
    table.insert('ﬅ', "ſt");
    table.insert('ﬆ', "st");

    let mut o = String::with_capacity(s.len());
    for c in s.chars() {
        match table.get(&c) {
            Some(rep) => o.push_str(rep),
            None => o.push(c),
        }
    }
    o
}

Unlike the replace-based solution, this code should scale more nicely as more ligature patterns are added: a HashMap lookup is O(1) per character no matter how many patterns there are, whereas each replace call scans the whole string, so the chained replaces cost O(patterns × length).

In addition, you can easily get more speed on short input strings by caching the ligature HashMap at a larger scope. Building a new HashMap on every input string is unnecessarily unfair to this solution.

Finally, for even more speed, as I said, I would just ditch the character-based parsing (which implicitly requires UTF-8 decoding under the hood) and directly do pattern matching on the raw UTF-8 bytes. But I'm not sure if we already have a crate for that, and if not it may require more custom development than you are willing to commit to.
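
To make that idea a bit more concrete, here is a rough, untested sketch of such a byte-level pass (not an existing crate; the function name is mine). It relies on the fact that all seven ligatures sit in U+FB00..=U+FB06 and therefore always encode as 0xEF 0xAC 0x80..=0x86 in UTF-8:

fn normalize_bytes(s: &str) -> String {
    // Replacements indexed by the last UTF-8 byte: U+FB00 ("ﬀ") .. U+FB06 ("ﬆ").
    const REPLACEMENTS: [&str; 7] = ["ff", "fi", "fl", "ffi", "ffl", "ſt", "st"];
    let bytes = s.as_bytes();
    let mut out = String::with_capacity(bytes.len());
    let mut start = 0; // start of the pending "copy as-is" region
    let mut i = 0;
    while i < bytes.len() {
        // 0xEF can only be the lead byte of a 3-byte sequence in valid UTF-8.
        if bytes[i] == 0xEF
            && i + 2 < bytes.len()
            && bytes[i + 1] == 0xAC
            && (0x80u8..=0x86).contains(&bytes[i + 2])
        {
            out.push_str(&s[start..i]); // bulk-copy everything before the ligature
            out.push_str(REPLACEMENTS[(bytes[i + 2] - 0x80) as usize]);
            i += 3;
            start = i;
        } else {
            i += 1;
        }
    }
    out.push_str(&s[start..]);
    out
}

Since ligatures are rare, most of the work then becomes bulk copies of untouched regions rather than per-char pushes.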


EDIT: I can confirm that by memoizing the ligatures map, the HashMap-based version finally outperforms the replace-based one unambiguously, by a factor of 2. On larger input strings, or with a larger ligature table, the impact would likely be larger, because the repeated parsing of the input string would cause more memory traffic and start to have a more adverse impact on CPU performance.

Here is a modified version of your benchmark. The FnvHashMap version could not be tested here because the playground does not provide the fnv crate.


@HadrienG: Your solution is super fast! Here are my timings:

String::replace      0.487 sec
HashMap              2.355 sec
FnvHashMap           2.118 sec
HashMap (static)     1.906 sec
FnvHashMap (static)  1.818 sec
HadrienG             0.111 sec

I used lazy static rather than a separate function + FnvHashMap:

fn normalize6(s: &str) -> String {
    lazy_static! {
        static ref TABLE: FnvHashMap<char, &'static str> = {
            let mut table: FnvHashMap<char, &'static str> =
                FnvHashMap::default();
            table.insert('ﬀ', "ff");
            table.insert('ﬁ', "fi");
            table.insert('ﬂ', "fl");
            table.insert('ﬃ', "ffi");
            table.insert('ﬄ', "ffl");
            table.insert('ﬅ', "ſt");
            table.insert('ﬆ', "st");
            table
        };
    }
    let mut o = String::with_capacity(s.len());
    for c in s.chars() {
        match TABLE.get(&c) {
            Some(rep) => o.push_str(rep),
            None => o.push(c),
        }
    }
    o
}

Thanks!


Maybe you have enough speed for now, but just in case you want more, did you know that there are ways to build an optimal HashMap for a given dataset at compile time, using so-called "Perfect Hash Functions"? If you end up needing more speed later on, that could be another good topic to look up, alongside working at the raw UTF-8 level.


If the text you're doing the replacement in is large, then you might benefit from a regex. e.g. (not tested)

let re = Regex::new(r"ﬀ|ﬁ|ﬂ|ﬃ|ﬄ|ﬅ|ﬆ").unwrap();
let replaced = re.replace_all(text, |caps| {
    table[&caps[0]].clone()
});

Depending on how that performs, it should be possible to avoid the clone() by implementing your own Replacer.
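
Something like this untested sketch, which assumes a regex version whose Replacer trait exposes replace_append (the type and function names here are made up for illustration):

use regex::{Captures, Regex, Replacer};
use std::collections::HashMap;

// Writes the spelled-out form straight into the output buffer,
// so no String is allocated per match.
struct LigatureReplacer<'a>(&'a HashMap<&'static str, &'static str>);

impl<'a> Replacer for LigatureReplacer<'a> {
    fn replace_append(&mut self, caps: &Captures, dst: &mut String) {
        dst.push_str(self.0[&caps[0]]);
    }
}

fn normalize_regex(re: &Regex, table: &HashMap<&'static str, &'static str>, s: &str) -> String {
    re.replace_all(s, LigatureReplacer(table)).into_owned()
}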

On Rust nightly, if you use the unstable regex feature, then the above routine probably won't even use a DFA, and will instead use a special AVX2 algorithm for finding your chars. :slight_smile:

You can also use the regex::bytes API to avoid creating a String in the first place.


Here is another small performance optimization that may or may not be useful, but is very easy to implement: you can attempt to provide a better guess of the output string length. This should take you from an average of one output string reallocation to an average of zero reallocations.

The difference should be much smaller than in my previous proposal, where I went from ~log2(input.len()) output reallocations to a single one. But it might still be noticeable when your input strings are either very small (in which case your benchmark is sensitive to the overhead of every malloc()) or very large (in which case your benchmark is sensitive to the memory copy that occurs at reallocation time).

As a crude and memory-hungry approximation to evaluate how useful this optimization is, you can add a factor of 2 on the output string's capacity. If it proves worthwhile, then you can refine by studying the relationship between the input and output string lengths on representative inputs, and using that to figure out a more accurate factor on the output capacity.


EDIT: In the same vein, because lazy_static is not a zero-cost abstraction (though it gets very close on x86), you might get a small speedup from caching a reference to the actual FnvHashMap before the loop instead of going through the lazy_static on every lookup:

let ligatures: &FnvHashMap<_, _> = &*TABLE;

But again, I don't expect this to lead to big performance changes. If more of these are to come, I expect them to come from improving the efficiency of the scan through the input string and/or the lookups through the ligatures table. However, that's more work :slight_smile:

@BurntSushi: I tried with a regex, but it wasn't as fast as @HadrienG's. However, I only use stable, so nightly might be better. Here are the timings:

String::replace      0.493 sec
HashMap              2.333 sec
FnvHashMap           2.074 sec
HashMap (static)     1.844 sec
FnvHashMap (static)  1.712 sec
HadrienG             0.105 sec
Regex                0.376 sec

And here's the regex version I used:

fn normalize7(s: &str) -> String {
    lazy_static! {
        static ref TABLE: FnvHashMap<&'static str, &'static str> = {
            let mut table: FnvHashMap<&'static str, &'static str> =
                FnvHashMap::default();
            table.insert("ff", "ff");
            table.insert("fi", "fi");
            table.insert("fl", "fl");
            table.insert("ffi", "ffi");
            table.insert("ffl", "ffl");
            table.insert("ſt", "ſt");
            table.insert("st", "st");
            table
        };
        static ref LIG_RX: Regex = Regex::new("ﬀ|ﬁ|ﬂ|ﬃ|ﬄ|ﬅ|ﬆ").unwrap();
    }
    LIG_RX.replace_all(s, |caps: &Captures|
                       TABLE.get(&caps[0]).unwrap().to_string()).to_string()
}

Here's a link to all the tests in the playground: ligature normalization speed tests.

Thanks for all the help, I've now put @HadrienG's code in my project:-)

Indeed! Your test string is tiny. That's why I qualified my suggestion with "if your string is large." :slight_smile:

I would certainly agree that now that every iteration of the benchmark loop is at 100ns (which is an average of 4 ns per char, or 8~16 CPU cycles), we are close to entering the ugly realm of micro-optimizations. More interesting "high-level" optimizations could be enabled by switching to larger inputs, which IIRC are anyway closer to the OP's target use cases.

EDIT: In particular, I suspect that larger inputs would reveal that this code is mostly a memory copy with occasional edge cases. I do not expect my version to be especially fast at the "bulk memory copy" and "find the needle in the haystack" parts, unless rustc or LLVM are very good at optimizing this kind of loop.

For my use case I'll be working on words or lines, so the strings will always be tiny.

I've also realised that I can use this technique to replace the regexes I've been using to 'normalize' whitespace and hyphens, so thank you v. much for your help!


There's an existing implementation of perfect hash maps: https://crates.io/crates/phf
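
Untested sketch of what that could look like (the static and function names are mine, and it assumes a phf version whose phf_map! macro accepts char keys):

use phf::phf_map;

// Ligature table baked into the binary at compile time; no runtime setup.
static LIGATURES: phf::Map<char, &'static str> = phf_map! {
    'ﬀ' => "ff",
    'ﬁ' => "fi",
    'ﬂ' => "fl",
    'ﬃ' => "ffi",
    'ﬄ' => "ffl",
    'ﬅ' => "ſt",
    'ﬆ' => "st",
};

fn normalize_phf(s: &str) -> String {
    let mut o = String::with_capacity(s.len());
    for c in s.chars() {
        match LIGATURES.get(&c) {
            Some(rep) => o.push_str(rep),
            None => o.push(c),
        }
    }
    o
}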


@mark: Out of personal curiosity, can you benchmark this version against normalize6? You will probably need to increase the number of benchmark iterations by 10x to see any difference, as 0.1 s is the measurement noise level on most machines.

fn normalize8(s: &str) -> String {
    lazy_static! {
        static ref TABLE: FnvHashMap<char, &'static str> = {
            let mut table: FnvHashMap<char, &'static str> =
                FnvHashMap::default();
            table.insert('ﬀ', "ff");
            table.insert('ﬁ', "fi");
            table.insert('ﬂ', "fl");
            table.insert('ﬃ', "ffi");
            table.insert('ﬄ', "ffl");
            table.insert('ﬅ', "ſt");
            table.insert('ﬆ', "st");
            table
        };
    }
    let table: &FnvHashMap<_, _> = &*TABLE;
    let mut o = String::with_capacity(2*s.len());
    let mut first_idx = 0usize;
    for (idx, c) in s.char_indices() {
        if let Some(rep) = table.get(&c) {
            o.push_str(&s[first_idx..idx]);
            o.push_str(rep);
            first_idx = idx + c.len_utf8();
        }
    }
    o.push_str(&s[first_idx..]);
    o
}

What this does differently from normalize6:

  • Use a bigger output allocation to avoid any reallocation (can make a difference on small strings)
  • Memoize the reference to the TABLE lazy_static (avoids constantly checking if the data has been initialized + memory barriers on some CPU arches like ARM and Power).
  • Use bulk copy operations instead of copying char-by-char (may be beneficial because ligatures are rare; since the compiler does not know that, it is unlikely to have done this optimization on its own).

(PS: This was written without access to a compiler for testing, so it may need some minor corrections)

@HadrienG: I copied & pasted your code & it compiled as-is. I changed the secs() function to make the numbers line up and I increased n to 10_000_000 as requested. Here're the results (running on an i5 laptop; prev results were from an i7 desktop):

String::replace        7.017 sec
HashMap               33.206 sec
FnvHashMap            29.558 sec
HashMap (static)      25.824 sec
FnvHashMap (static)   24.453 sec
HadrienG               1.500 sec
Regex                  5.349 sec
HadrienG/2             1.729 sec

And a second run:

String::replace        7.334 sec
HashMap               33.369 sec
FnvHashMap            29.964 sec
HashMap (static)      25.699 sec
FnvHashMap (static)   24.502 sec
HadrienG               1.499 sec
Regex                  5.251 sec
HadrienG/2             1.640 sec

Your new version is the "/2" in the above. I'm using Rust 1.24.1 on 64-bit Linux.

Avoiding unnecessary allocations is always a good idea and I would also have tried allocating twice the input size as a first guess. However, it occurred to me that the ligatures take up multiple bytes in the UTF-8 encoding of the input string. I checked "ﬀ", which is U+FB00 and is encoded with three bytes in UTF-8: 0xEF 0xAC 0x80.

The same applies to the next few f-ligatures I checked. If that holds for all the ligatures, using text.len() will ensure zero reallocations and only a minimum of wasted space. That's kinda neat :slight_smile:
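
For what it's worth, a quick check of those sizes (the longest spelled-out forms, "ffi" and "ffl", are also three bytes):

fn main() {
    // Every f-ligature is a single char that takes 3 bytes in UTF-8,
    // while its spelled-out replacement is at most 3 bytes long,
    // so the output can never grow past the input length.
    assert_eq!('ﬀ'.len_utf8(), 3);
    assert_eq!('ﬆ'.len_utf8(), 3);
    assert_eq!("ffi".len(), 3);
    assert_eq!("ſt".len(), 3);
}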


I got rid of all the functions except @HadrienG's two and weirdly this affected the timings.

HadrienG               1.447 sec
HadrienG/2             1.668 sec

HadrienG               1.465 sec
HadrienG/2             1.676 sec

HadrienG               1.475 sec
HadrienG/2             1.687 sec

The first one has two changes from before:

    let table: &FnvHashMap<_, _> = &*TABLE;

and, of course, uses table rather than TABLE. This seems to make it consistently ~1/10 sec faster than before.

For the second one the only change I made was to use s.len() to set the capacity.

So the first one, esp. using table, seems to be unbeatable.

So are you saying that ﬅ should be replaced with "st", the same as ﬆ?

No, I'm saying that the ﬅ glyph looks like "f" and "t".