Optimize the code to reduce malloc

#1

I’m not familiar with Rust. So I started to write a simple spelling corrector based on http://norvig.com/spell-correct.html.
The code can be found here.
It seems it’s just a little faster than Python when you try to correct a long word like “dictionare” to “dictionary”. (cargo run --release)
I use “perf” to profiling this and found lots of “malloc”. I guess the code has too many unnecessary “malloc”.
Hope anyone can help me optimize this. Thanks.

#2

To be crash safe for unicode probably best to just use chars/bytes or else;
.filter_map(|i| if word.is_char_boundary(i) { Some((&word[..i], &word[i..])) } else { None })
More complex delete, transpose, replace needed too. Change for alphabet if it was not just ASCI.

edit_twice then known would possibly contain duplicates.

replaces / inserts would be better iterating over the alphabet rather than error prone numbers.

You don’t need to collect ‘deletes, transposes, replaces, inserts’ into intermediate Vecs.

#3

Optimising further would be removing more of the vecs and sticking with iterators.

To the extreme is removing the concat() calls so not having Strings generated and just keeping three slices. (Using extra empty where needed.)
This blog then gives hit how to use.

(edit: three slices only works with edit_once need more elaborate for twice.)
Could sometimes be reusing a single String instead to avoid allocations.

Also could try multithreaded but that is more likely done for checking a document rather than single words.

#4

Thanks a lot. But the HashMap tricks seem too complex.
I tried to remove duplicated strings with HashSet, but it will cost more time. Maybe just sort and remove from Vec?

#5

Extra lazy approach to avoid unicode is to first check string with is_ascii.

A note about timing, it is including the print and there is possibility python/other uses something with differing accuracy.

For what your timing I would guess removing the duplicates (eg sorting early) would not give any gain. The correct function does sort-alternative after lookup.

Tricks are how you gain the performance, and learn at same time. To me it looks like mostly copying the example in the blog.

trait Key {
    fn key(&self) -> (&str, &str, &str, &str);
}

impl Key for (&str, &str, &str, &str) {
    fn key(&self) -> (&str, &str, &str, &str) {
        (self.0, self.1, self.2, self.3)
    }
}

(Correction from my previous post, would be four slices. Could use enum but don’t see benefit.)

edit_once would need changing to return impl Iterator, again complex for someone new to language but gives the experience.

Couple extra things that stick out, (I don’t think are massive performance wise, vs vecs and concat);

.filter(|(_, right)| right.len() > 1) could be replaced with take since you know the item.

(Breaking last suggested item in process:) Iterating over splits once for the four could possibly give a little boost, due to cached use of left,right.

#6

Okay, this was a fun exercise.

So, starting from your code:

running 5 tests
test benchmarks::correct_dictionere ... bench: 100,213,490 ns/iter (+/- 8,694,330)
test benchmarks::correct_helle      ... bench:      70,696 ns/iter (+/- 4,302)
test benchmarks::correct_nica       ... bench:      58,901 ns/iter (+/- 2,991)
test benchmarks::correct_pythn      ... bench:      69,910 ns/iter (+/- 33,639)
test benchmarks::correct_world      ... bench:         169 ns/iter (+/- 29)

Here is my suggestion:

running 5 tests
test benchmarks::correct_dictionere ... bench:  46,960,510 ns/iter (+/- 4,447,458)
test benchmarks::correct_helle      ... bench:      30,638 ns/iter (+/- 7,767)
test benchmarks::correct_nica       ... bench:      24,738 ns/iter (+/- 4,850)
test benchmarks::correct_pythn      ... bench:      30,700 ns/iter (+/- 3,958)
test benchmarks::correct_world      ... bench:         103 ns/iter (+/- 8)

Strategy used

  1. I have chosen to use byte slices instead of string slices, since your code was indexing at all indices (which cannot be done for Unicode strings). The other option would have been to iterate over grapheme clusters and would have been another whole exercise on its own. Hence my choice to skip the problem altogether to start with. But I think that could be an interesting extension.

  2. Be lazy!

    By that I mean, of course, to use lazy iteration as much as possible, which actually means that the programmer be less lazy :smile:

Featured example:

Your function

fn edit_once(word: &str) -> Vec<String> {
    let splits = (0..=word.len())
        .map(|i| (&word[.. i], &word[i ..]))
        .collect::<Vec<_>>();

    let deletes = splits
        .iter()
        .filter(|(_, right)| right.len() > 0)
        .map(|(left, right)| [left, &right[1..]].concat())
        .collect::<Vec<_>>();

    let transposes = splits
        .iter()
        .filter(|(_, right)| right.len() > 1)
        .map(|(left, right)| [left, &right[1..2], &right[0..1], &right[2..]].concat())
        .collect::<Vec<_>>();

    let replaces = (0..26)
        .flat_map(|i| {
            splits
                .iter()
                .filter(|(_, right)| right.len() > 0)
                .map(|(left, right)| [left, &ALPHABET[i..i + 1], &right[1..]].concat())
                .collect::<Vec<_>>()
        })
        .collect::<Vec<_>>();

    let inserts = (0..26)
        .flat_map(|i| {
            splits
                .iter()
                .map(|(left, right)| [left, &ALPHABET[i..i + 1], right].concat())
                .collect::<Vec<_>>()
        })
        .collect::<Vec<_>>();

    let mut candidates = vec![];
    for words in [deletes, transposes, replaces, inserts].iter() {
        candidates.extend(words.iter().cloned());
    }
    candidates
}

becomes

fn edit_once (
    word: &'_ b_str,
) -> impl Iterator<Item = BString> + '_
{
    static ALPHABET: &b_str = b"abcdefghijklmnopqrstuvwxyz";

    /// helper
    fn iter_rc_slice<'slice, 'elem, T : Copy + 'elem> (
        rc_slice: &'slice Rc<[T]>,
    ) -> impl Iterator<Item = T> + 'elem // look, no 'slice !!
    {
        (0 .. rc_slice.len())
            .map({
                let rc_slice = Rc::clone(&rc_slice);
                move |i| rc_slice[i]
            })
    }

    // /// This is actually slightly slower!
    // let mk_splits = move || {
    //     (0 ..= word.len())
    //         .map(move |i| word.split_at(i))
    // };

    let splits = Rc::<[(&b_str, &b_str)]>::from(
        Vec::into_boxed_slice(
            (0 ..= word.len())
                .map(move |i| word.split_at(i))
                .collect()
        )
    );
    let mk_splits = || iter_rc_slice(&splits);

    let deletes =
        mk_splits()
            .filter_map(|(left, right)| Some(
                [left, right.get(1 ..)?].concat()
            ))
    ;

    let transposes =
        mk_splits()
            .filter_map(|(left, right)| Some({
                let end = right.get(2 ..)?;
                [left, &right[1 .. 2], &right[0 .. 1], end].concat()
            }))
    ;

    let replaces =
        mk_splits()
            .filter_map(|(left, right)| Some({
                let right = right.get(1 ..)?;
                ALPHABET
                    .chunks(1)
                    .map(move |letter| {
                        [left, letter, right].concat()
                    })
            }))
            .flatten()
    ;

    let inserts =
        mk_splits()
            .flat_map(|(left, right)| {
                ALPHABET
                    .chunks(1)
                    .map(move |letter|
                        [left, letter, right].concat()
                    )
            })
    ;

    chained![deletes, transposes, replaces, inserts]
}

As you can see, instead of returning Vec<_>, I always try to return impl Iterator<Item = _>.
So the only place where I allocate/.collect() in this function is for the splits “sequence”.

I had two choices:

  1. Recomputing the “sequence” each time (commented out solution).

    • Problem: Even if it works and has the advantage of being simple, my benchmarks showed that this solution was around 5% to 10% slower.
  2. Storing / cache-ing the sequence (.collect()).

    • Problem: We cannot splits.iter() over it for each following iterator since that would mean all those iterators would be borrowing a local value and it would thus be impossible to return them.

    • Solution: Iterate by value / take ownership of the sequence.

    • Problem: It requires cloning the sequence.

    • Solution: Use reference counting.

    • Problem: The iteration by value is then not trivial to write

    • Solution: http://smallcultfollowing.com/babysteps/blog/2018/09/02/rust-pattern-iterating-an-over-a-rc-vec-t/

(Remark: I have used Rc<[_]> instead of Rc<Vec<_>> to avoid the extra indirection. I haven’t benchmarked this choice, though).

Allocations?

Regarding the .correct() method, there are now only allocations in 3 places:

  • at each .concat() call obviously. The HashMap key-lookup trick seems like an interesting extension to prevent that;

  • at splits, as justified above;

  • at .candidates(), because impl Iterator cannot be written for mixed return types, and thus requires writing down an enum with all the explicit types by hand, which is tedious;

  • at .edit_twice(), because of a lifetime issue of Iterator traits and .map that prevents us from describing the lifetime invariants. By writing down our own mapped streaming iterator we could have prevented the inner non-lazy iterator.

2 Likes