Why is my Rust code 2 times slower than my JS code?

TL;DR

  • The issue is in the many calls to str::contains, which recomputes information about the needle on every call (and there seems to be no way in std to precompute and reuse it).
  • aho_corasick is amazing.

All of the changes I made can be found in the commit history here:

https://github.com/exphp-share/open-learning-opt

Okay, so I did the following:

Record performance data

I also disabled multithreading, because (a) the JS version isn't multithreaded, and (b) it's easier to benchmark with fewer threads.

export RUSTFLAGS=-g
export RAYON_NUM_THREADS=1
cargo build --release
perf record target/release/open_learning
perf report
  68.26%  open_learning  open_learning       [.] core::str::pattern::StrSearcher::new
  19.84%  open_learning  open_learning       [.] core::str::pattern::TwoWaySearcher::next
   3.62%  open_learning  libc-2.29.so        [.] __memcmp_sse4_1
   2.99%  open_learning  open_learning       [.] open_learning::filter
   2.32%  open_learning  open_learning       [.] <core::str::pattern::StrSearcher as core::str::pattern::Searcher>::next_match
   0.50%  open_learning  [unknown]           [k] 0xffffffffaba00a27
   0.48%  open_learning  libc-2.29.so        [.] malloc
   0.39%  open_learning  libc-2.29.so        [.] cfree@GLIBC_2.2.5
   0.39%  open_learning  libc-2.29.so        [.] _int_free

This caught me by surprise. I expected the problem to have to do with all of the unnecessary allocations in your code (for instance, you have things like y.to_string().contains(...)), but instead it is all occurring in preparation work performed by str::contains.
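To illustrate (this is a stand-in, not the actual code from the repo): even with the y.to_string() allocation gone, every call to contains still pays for StrSearcher::new rebuilding its tables for the needle.

```rust
// Hypothetical stand-in for the hot loop; `count_containing` is a made-up
// name. Even with no allocations, each `contains` call re-runs the needle
// preprocessing (`StrSearcher::new`) seen at the top of the perf report.
fn count_containing(haystacks: &[String], needle: &str) -> usize {
    haystacks.iter().filter(|h| h.contains(needle)).count()
}

fn main() {
    let haystacks = vec!["abea".to_string(), "roflolacopter".to_string()];
    println!("{}", count_containing(&haystacks, "lol")); // prints 1
}
```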

Identify the most costly calls

I found the functions in main.rs that call str::contains and added #[inline(never)] to them, so each call site keeps its own stack frame in profiles and backtraces. Because you suspected filter was the problem, and it has two calls, I isolated them into two separate functions:

https://github.com/exphp-share/open-learning-opt/commit/a866002494c0d3bca0c69b3a468ec22400682db5
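The linked commit has the real diff; schematically, the trick looks like this (the function names below are made up):

```rust
// Schematic sketch only; the real names are in the linked commit.
// #[inline(never)] prevents the compiler from merging these into their
// caller, so perf and gdb can tell the two `contains` call sites apart.
#[inline(never)]
fn contains_site_a(haystack: &str, needle: &str) -> bool {
    haystack.contains(needle)
}

#[inline(never)]
fn contains_site_b(haystack: &str, needle: &str) -> bool {
    haystack.contains(needle)
}

fn main() {
    println!("{}", contains_site_a("abea", "be"));  // prints true
    println!("{}", contains_site_b("abea", "lol")); // prints false
}
```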

Then I use gdb to find out which function it is:

$ RAYON_NUM_THREADS=1 gdb target/release/open_learning

(wait until it's halfway done learning and hit Ctrl-C)
(gdb) thread apply all bt

          --- (snip) ---
#4  core::str::pattern::Pattern::is_contained_in (self=..., haystack=...) at /rustc/2fe7b3383c1e0a8b68f8a809be3ac21006998929/src/libcore/str/pattern.rs:39
#5  <&alloc::string::String as core::str::pattern::Pattern>::is_contained_in (self=0x7ffff0038aa0, haystack=...) at /rustc/2fe7b3383c1e0a8b68f8a809be3ac21006998929/src/liballoc/string.rs:1817
#6  core::str::<impl str>::contains (self=..., pat=0x7ffff0038aa0) at /rustc/2fe7b3383c1e0a8b68f8a809be3ac21006998929/src/libcore/str/mod.rs:2834
#7  open_learning::filter_gdat (dat=..., vpn=...) at src/main.rs:155
#8  0x00005555555629b4 in open_learning::filter (dat=..., vpn=...) at src/main.rs:146
#9  open_learning::main::{{closure}} (p=0x5555555d0990) at src/main.rs:84
          --- (snip) ---

Right, so it is in the creation of gdat.

Add unit tests

I don't want to break your code by optimizing it incorrectly! Therefore, we need test cases. Let's see what kind of outputs we're dealing with:

diff --git a/src/main.rs b/src/main.rs
index ada88d9..65bc1b3 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -81,8 +81,11 @@ fn main() {
     let u: Vec<String> = vpn.par_iter().flat_map(|p| {
         let hu = input(p.to_string(),&normal_string);
         let dat = filter(&hu,&vpn);
+        if !dat.is_empty() {
+            println!("{:?}: {:?}", p, dat);
+        }

which prints lines like:

"comtabo mhf": ["comtabo mh"]
"online spas heizer": ["online spas heize"]
"datanetwork nl": ["datanetwork n"]
"experience project law enforcement ltd.": ["experience project law enforcement ltd"]
"light twenty seconds": ["light twenty second"]
"e70b wide ku band": ["e70b wide ku ban"]

Uhmmmm.... so... this always produces a string that is one character shorter than p? Are you sure this is doing what you want?

...since that hardly exercises the behavior of filter_gdat, I'm going to write my own test based on what the function appears to be trying to do. I tried to exercise all the edge cases I could think of, but may have missed a couple.

#[test]
fn gdat() {
    let strings = |strs: &[&str]| strs.iter().map(|s| s.to_string()).collect::<Vec<_>>();
    assert_eq!(
        filter_gdat(
            &strings(&[
                "a", // appears in three inputs
                "aaaabbaa", // never appears
                "lol", // appears once at the end and once in the middle
                "wards", // only appears in one input, but twice
                "copter", // only appears once at the end
                "be", // appears twice in the middle
                "n't", // one of the occurences is the complete string
                "b", // overlaps with "be"
            ]),
            &strings(&["abea", "roflolacopter", "don't wardsy wards be happy lol"]),
        ),
        strings(&["a", "lol", "be", "n't", "b"]),
    );
}

With that, I can verify that the following implementation is correct:

#[inline(never)]
fn filter_gdat(dat: &[String], vpn: &[String]) -> Vec<String> {
  dat.iter()
    .filter(|q| vpn.iter().filter(|d| d.contains(&q[..])).count() > 1)
    .map(|s| s.to_string())
    .collect()
}

Optimize the code

First try

Unfortunately, there's no fast way in std to search for a single needle in multiple haystacks, even with nightly features. So we look to the aho_corasick crate, which provides fast searching for multiple needles in multiple haystacks.

#[inline(never)]
fn filter_gdat(dat: &[String], vpn: &[String]) -> Vec<String> {
  let ac = AhoCorasick::new(dat);
  let matched_indices = {
    vpn.iter().flat_map(|haystack| ac.find_overlapping_iter(haystack).map(|m| m.pattern()))
  };

  get_counts(matched_indices).into_iter()
    .filter(|&(_, count)| count > 1)
    .map(|(index, _)| dat[index].to_string())
    .collect()
}

fn get_counts<T: Ord, I: IntoIterator<Item=T>>(iter: I) -> BTreeMap<T, usize> {
    let mut map = BTreeMap::new();
    for value in iter {
        *map.entry(value).or_insert(0) += 1;
    }
    map
}
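A quick standalone check of the get_counts helper (repeated here so the snippet compiles on its own); it's just a frequency map over whatever the iterator yields:

```rust
use std::collections::BTreeMap;

// Same helper as above, shown standalone: count occurrences of each value.
fn get_counts<T: Ord, I: IntoIterator<Item = T>>(iter: I) -> BTreeMap<T, usize> {
    let mut map = BTreeMap::new();
    for value in iter {
        *map.entry(value).or_insert(0) += 1;
    }
    map
}

fn main() {
    // e.g. pattern index 2 matched three times, indices 0 and 1 once each:
    let counts = get_counts(vec![2usize, 0, 2, 1, 2]);
    println!("{:?}", counts); // prints {0: 1, 1: 1, 2: 3}
}
```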

This reduces the total runtime from 4.5 seconds down to 0.7s! But... whoops! It fails the unit test because it counts wards, which appears twice in a single string. Good thing I wrote that test!

More correct

Let's see how much of that boost we can keep after fixing the implementation to use an individual AhoCorasick for each needle:

#[inline(never)]
fn filter_gdat(dat: &[String], vpn: &[String]) -> Vec<String> {
  dat.iter()
    .filter(|q| {
      let ac = AhoCorasick::new(&[q][..]);
      vpn.iter().filter(|d| ac.is_match(d)).count() > 1
    })
    .map(|s| s.to_string())
    .collect()
}

This passes the test and is simpler, but takes 3.5s; presumably building a fresh automaton for every needle eats most of the savings. Hm. That's not nearly as nice.

Correct AND fast

It looks like AhoCorasick really does excel when given all the needles at once, so let's see if we can write an impl that is correct AND mega-fast. I'll take the fast implementation from before and simply add a call to Itertools::unique.

use itertools::Itertools;

#[inline(never)]
fn filter_gdat(dat: &[String], vpn: &[String]) -> Vec<String> {
  let ac = AhoCorasick::new(dat);
  let matched_indices = {
    vpn.iter().flat_map(|haystack| {
        ac.find_overlapping_iter(haystack).map(|m| m.pattern())
          .unique() // only count the first match of each needle in the haystack
    })
  };

  get_counts(matched_indices).into_iter()
    .filter(|&(_, count)| count > 1)
    .map(|(index, _)| dat[index].to_string())
    .collect()
}

Now it passes the test, AND runs in 0.7s.
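In case Itertools::unique is unfamiliar, here is a std-only sketch of what it does to one haystack's stream of pattern indices (the index values below are made up):

```rust
use std::collections::HashSet;

// std-only sketch of what `.unique()` does here: keep only the first
// occurrence of each pattern index within a single haystack's matches,
// so a needle is counted once per haystack rather than once per match.
fn dedup_first<I: IntoIterator<Item = usize>>(iter: I) -> Vec<usize> {
    let mut seen = HashSet::new();
    iter.into_iter().filter(|&i| seen.insert(i)).collect()
}

fn main() {
    // e.g. pattern 3 ("wards") matched twice in one haystack, pattern 5 once:
    println!("{:?}", dedup_first(vec![3, 3, 5])); // prints [3, 5]
}
```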

Again, the final code is here:

https://github.com/exphp-share/open-learning-opt


To answer the original question: it seems plausible to me that the JS engine's String.prototype.includes precomputes needle information much like Rust does, but caches it with the string object, so repeated calls don't pay the setup cost again.
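I haven't verified what any JS engine actually does, but the shape of such a cache would be "build once, probe many times." A minimal sketch of that shape in Rust (names and the naive search are purely illustrative):

```rust
// Sketch of the "precompute once, reuse across calls" pattern. The cached
// "table" here is trivially the needle's bytes; a real engine would cache
// the searcher state itself. Naive windowed search, for illustration only.
struct PreparedNeedle<'a> {
    bytes: &'a [u8],
}

impl<'a> PreparedNeedle<'a> {
    fn new(needle: &'a str) -> Self {
        // Any expensive per-needle preprocessing would happen here, once.
        PreparedNeedle { bytes: needle.as_bytes() }
    }

    fn is_contained_in(&self, haystack: &str) -> bool {
        let h = haystack.as_bytes();
        // The empty needle is contained in everything.
        self.bytes.is_empty()
            || h.windows(self.bytes.len()).any(|w| w == self.bytes)
    }
}

fn main() {
    let needle = PreparedNeedle::new("lol"); // preprocessing happens once...
    // ...then many cheap probes reuse it:
    println!("{}", needle.is_contained_in("roflolacopter")); // prints true
    println!("{}", needle.is_contained_in("abea"));          // prints false
}
```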
