Why is my Rust code 2 times slower than my JS code?

I don't understand why my code is slower in Rust than in JS. I have also added multithreading to my Rust code and not to my JS, but Rust is still slower.

extern crate rayon;
extern crate console;
#[macro_use]
extern crate lazy_static;

use rayon::prelude::*;
use std::sync::Mutex;
use console::style;
use std::fs;

use std::time::Instant;
use std::fs::File;
use std::io::Write;

use std::env;



mod terminal_render;

#[warn(unused_must_use)]
fn main() {
    println!();
    let now1 = Instant::now();
    let mut vpn: Vec<String> = Vec::new();
    terminal_render::set_loading(1,33.3,None);
    let vpn_string = fs::read_to_string("VPN.txt").unwrap();
    let mut normal: Vec<String> = Vec::new();
    terminal_render::set_loading(1,66.6,None);
    let normal_string = fs::read_to_string("NORMAL.txt").unwrap();

    for x in vpn_string.split('\n') {
        if x.len() >= 3 {
            vpn.push(clear(x.trim().to_string()));
        }
    }
    for x in normal_string.split('\n') {
        if x.len() >= 3 {
            normal.push(clear(x.trim().to_string()));
        }
    }
    terminal_render::set_end(1,now1.elapsed().as_millis());
    let now = Instant::now();
    terminal_render::set_loading(0,0.0,None);
    let bigger_vec: Mutex<Vec<String>> = Mutex::new(Vec::new());
    let finished: Mutex<terminal_render::Type> = Mutex::new(terminal_render::Type::new(0));
    let ln = vpn.len() as f32;

    let lastp: Mutex<terminal_render::Type> = Mutex::new(terminal_render::Type::new(0));
    vpn.par_iter().for_each(|p| {
        let hu = input(p.to_string(),&normal_string);
        let dat = filter(&hu,&vpn);
        if let Ok(mut e) = bigger_vec.lock() {
            e.extend(dat);
        }

        if let Ok(mut e) = finished.lock() {
            e.tp += 1;
            let y = (e.tp  as f32)/ln * 100.0;
            if let Ok(mut i) = lastp.lock() {
                if i.tp < y as u32 {
                    terminal_render::set_loading(0,y,Some(temps_restant(&now,y)));
                    i.tp = y as u32;
                }
            }
        }
    });
    terminal_render::set_end(0,now.elapsed().as_millis());

    let now = Instant::now();
    terminal_render::set_loading(2,0.0,None);
    let o = format!("{:?}",filter(&bigger_vec.lock().unwrap(),&vpn));
    terminal_render::set_loading(2,50.0,None);
    let mut file = File::create("MODEL.txt").unwrap();
    file.write_all(o.as_bytes()).unwrap();
    terminal_render::set_end(2,now.elapsed().as_millis());
}

// temps_restant ("time remaining"): estimates the seconds left from the elapsed time and percent done.
fn temps_restant(i: &Instant, p: f32) -> u32 {
    let ela = (i.elapsed().as_millis()/100) as f32 / 10.0;
    let time_pp = ela/p;
    (time_pp*(100.0-p)) as u32
}

fn input(b: String, normal: &str) -> Vec<String> {
    let mut dat: Vec<String> = Vec::with_capacity(b.len());
    for x in 0..(b.len() - 3) {
        for y in 3..b.len() - x {
            let str = &b[x..(x + y)];
            if normal.contains(str) {
                continue;
            }
            dat.push(str.to_string());
        }
    }
    dat
}

fn clear(e: String) -> String {
    let mut i = String::new();
    let p = "abcdefghijklmnopqrstuvwxyz0123456789. ";
    for n in e.to_lowercase().split("") {
        if p.contains(n) {
            i.push_str(n);
        }
    }
    i
}

fn filter(dat: &[String], vpn: &[String]) -> Vec<String> {
    let mut gdat = Vec::with_capacity(dat.len());
    for q in dat {
        let mut count: u32 = 0;
        for d in vpn {
            if d.contains(q) {
                count += 1;
            }
        }
        if count > 1 {
            gdat.push(q.to_string());
        }
    }

    let mut lua: Vec<String> = Vec::new();
    'loop3: for x in &gdat {
        for y in &gdat {
            if y.to_string().contains(x) {
                continue 'loop3;
            }
        }
        if !lua.contains(&x) {
            lua.push(x.to_string());
        }
    }
    lua
}

I think it is caused by the nested for loops in the filter function.

And if you want the JS version:


function input(b, OKIp) {
  var dat = [];
  for (var x = 0; x < b.length; x++) {
    if (x > b.length - 2) {
      continue;
    }
    var v = false;
    for (var y = 1; y < b.length - x; y++) {
      var str = b.substring(x, y);
      if (y < 4) {
        continue;
      }
      for (var g of OKIp) {
        if (g.includes(str)) {
          v = true;
          break;
        }
      }

      if (v) {
        continue;
      }
      dat.push(str);
    }
  }
  return dat;
}

function filter(dat,VPNS) {
  var gdat = [];
  for (var q of dat) {
    var count = 0;
    for (var d of VPNS) {
      if (d.includes(q)) {
        count++;
      }
    }
    if (count < 2) {
      continue;
    }
    gdat.push([q,count]);
  }
  var lua = [];
  for (var t of gdat) {
    for (var y of gdat) {
      if (y[0].includes(t[0])) {
        t[0] = y[0];
      }
    }
    if (!lua.includes(t[0])) {
      lua.push(t[0]);
    }
  }
  return lua;
}

function normalize(input) {
  return input.toLowerCase().trim();
}

function train(VPN,NORMAL) {
    var VPN1 = [];
    var NORMAL1 = [];
    for (var t of VPN) {
      var f = normalize(t);
      if (f.length > 2) {
        VPN1.push(f);
      }
    }
    for (var t of NORMAL) {
      var f = normalize(t);
      if (f.length > 2) {
        NORMAL1.push(f);
      }
    }
    var date = [];
    var t = VPN1.length;
    var t2 = 0.0;
    for (var t1 of VPN1) {
     for (var g of filter(input(t1,NORMAL1),VPN1)) {
        date.push(g);

      }
      t2++;
      console.log(t2/t*100.0);
    }
    return filter(date,VPN1)
}

function stringToList(str) {
  var qa = [];
  for (var ju of str.split("\n")) {
    var t = ju.toLowerCase().replace("\r","").trim();
    if (t.length>3) {
      qa.push(t);
    }
  }
  return qa;
}
var fs = require("fs");
var N = fs.readFileSync('NORMAL.txt', 'utf8');
var V = fs.readFileSync('VPN.txt', 'utf8');


console.time('Training on dataset');
var a = stringToList(V);
var b = stringToList(N);
console.log(train(a,b));
console.timeEnd('Training on dataset');
console.log("dataset size:",a.length,b.length)

I'd be grateful if somebody could help me solve this.

Please show how you are compiling and running both code snippets. In particular, for Rust code, are you using cargo build --release?

To compile and run the Rust program I use:
cargo run --release
And to run the JS I use:
node data.js
I get 10 s for the JS and 18.3 s for the Rust.

Without going into detail (on mobile, sorry): you are doing a LOT of locking (e.g. on bigger_vec, and on the terminal) inside your tight loops.
The terminal progress indicator especially looks suspicious to me, because I don't see anything similar in the JS version.
You are probably spending ~90% of the time waiting on output sync and/or locking on bigger_vec.
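
For illustration, a sketch of one way to drop the Mutex entirely: let rayon collect each task's results and concatenate them at the end (this mirrors the par_iter().flat_map() approach that appears in a diff later in this thread; learn is a hypothetical wrapper around the hot loop, and input and filter are the functions from the original post):

use rayon::prelude::*;

// Instead of pushing into a Mutex<Vec<String>> from every task,
// let rayon gather the per-task vectors and flatten them once.
fn learn(vpn: &[String], normal_string: &str) -> Vec<String> {
    vpn.par_iter()
        .flat_map(|p| {
            let hu = input(p.to_string(), normal_string);
            filter(&hu, vpn)
        })
        .collect()
}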

Outputting to the terminal is terribly expensive compared to a few string manipulations, because it does formatting, line breaking, copying everything to terminal buffers, font rendering, etc. Compared to CPU caches :racing_car:, screens move like glaciers :ice_cube:

Depending on how big VPN.txt is, it may also be expensive to read that entire file into memory at once (let vpn_string = fs::read_to_string("VPN.txt")).
The split('\n') to get lines is also suboptimal.
See read lines - Rust By Example for how to do it with more buffering and lazy loading.
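
For illustration, a minimal buffered-reading sketch along the lines of that chapter (the helper name and path handling are just for the example):

use std::fs::File;
use std::io::{BufRead, BufReader};

// Reads a file line by line through a buffer,
// instead of pulling the whole file into one giant String first.
fn read_lines(path: &str) -> std::io::Result<Vec<String>> {
    let reader = BufReader::new(File::open(path)?);
    reader.lines().collect()
}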

In general, JS does a LOT of magic around file buffering, lazy iteration and string copying. You have to know how to reproduce this in Rust ("rust by example" is a great resource!), or quickly pay the performance price.

I have several guesses, but it is impossible to help with any kind of profiling without complete, runnable code.

Try cargo-flamegraph.

I managed to optimize the terminal output and a few small things in the loops, so now I get 8.3 s in Rust and 10 s in JS, but the Rust version uses multithreading.

Here are the JS code and the Rust code. I only count the performance of the learning, not the writing or reading.

My PC is very slow, so if you want to reproduce these computation times, increase the size of VPN.txt with random text.

(AMD Phenom(tm) II P960 Quad-Core Processor × 4 cores at 2.31GHz)

TL;DR

  • The issue is in the many calls to str::contains, which computes information about the needle on every call (and there seems to be no way in std to precompute it).
  • aho_corasick is amazing.

All of the changes I made can be found in the commit history here:

https://github.com/exphp-share/open-learning-opt

Okay, so I did the following:

Record performance data

I also disable multithreading because (a) the JS one isn't multithreaded, and (b) it's easier to benchmark with fewer threads.

export RUSTFLAGS=-g
export RAYON_NUM_THREADS=1
cargo build --release
perf record target/release/open_learning
perf report
  68.26%  open_learning  open_learning       [.] core::str::pattern::StrSearcher::new
  19.84%  open_learning  open_learning       [.] core::str::pattern::TwoWaySearcher::next
   3.62%  open_learning  libc-2.29.so        [.] __memcmp_sse4_1
   2.99%  open_learning  open_learning       [.] open_learning::filter
   2.32%  open_learning  open_learning       [.] <core::str::pattern::StrSearcher as core::str::pattern::Searcher>::next_match
   0.50%  open_learning  [unknown]           [k] 0xffffffffaba00a27
   0.48%  open_learning  libc-2.29.so        [.] malloc
   0.39%  open_learning  libc-2.29.so        [.] cfree@GLIBC_2.2.5
   0.39%  open_learning  libc-2.29.so        [.] _int_free

This caught me by surprise. I expected the problem to have to do with all of the unnecessary allocations in your code (for instance, you have things like y.to_string().contains(...)), but instead it is all occurring in preparation work performed by str::contains.

Identify the most costly calls

We find functions in main.rs that call str::contains and add #[inline(never)] to them. Because you suspect filter is the problem, and it has two calls, I isolate them into two separate functions:

https://github.com/exphp-share/open-learning-opt/commit/a866002494c0d3bca0c69b3a468ec22400682db5
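
In sketch form, the split looks roughly like this (the exact code is in the commit; filter_gdat is the name that shows up in the backtrace below, and filter_lua is a stand-in name for the second half):

#[inline(never)]
fn filter_gdat(dat: &[String], vpn: &[String]) -> Vec<String> {
    // First contains-heavy pass: keep substrings found in more than one vpn entry.
    let mut gdat = Vec::with_capacity(dat.len());
    for q in dat {
        if vpn.iter().filter(|d| d.contains(&q[..])).count() > 1 {
            gdat.push(q.to_string());
        }
    }
    gdat
}

#[inline(never)]
fn filter_lua(gdat: &[String]) -> Vec<String> {
    // Second contains-heavy pass: drop substrings contained in a longer one.
    let mut lua: Vec<String> = Vec::new();
    'outer: for x in gdat {
        for y in gdat {
            if y.contains(&x[..]) {
                continue 'outer;
            }
        }
        if !lua.contains(x) {
            lua.push(x.to_string());
        }
    }
    lua
}

fn filter(dat: &[String], vpn: &[String]) -> Vec<String> {
    filter_lua(&filter_gdat(dat, vpn))
}

#[inline(never)] keeps each pass as its own stack frame, so perf and gdb can attribute the time to the right one.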

Then I use gdb to find out which function it is:

$ RAYON_NUM_THREADS=1 gdb target/release/open_learning

(wait until it's halfway done learning and hit Ctrl-C)
(gdb) thread apply all bt

          --- (snip) ---
#4  core::str::pattern::Pattern::is_contained_in (self=..., haystack=...) at /rustc/2fe7b3383c1e0a8b68f8a809be3ac21006998929/src/libcore/str/pattern.rs:39
#5  <&alloc::string::String as core::str::pattern::Pattern>::is_contained_in (self=0x7ffff0038aa0, haystack=...) at /rustc/2fe7b3383c1e0a8b68f8a809be3ac21006998929/src/liballoc/string.rs:1817
#6  core::str::<impl str>::contains (self=..., pat=0x7ffff0038aa0) at /rustc/2fe7b3383c1e0a8b68f8a809be3ac21006998929/src/libcore/str/mod.rs:2834
#7  open_learning::filter_gdat (dat=..., vpn=...) at src/main.rs:155
#8  0x00005555555629b4 in open_learning::filter (dat=..., vpn=...) at src/main.rs:146
#9  open_learning::main::{{closure}} (p=0x5555555d0990) at src/main.rs:84
          --- (snip) ---

Right, so it is in the creation of gdat.

Add unit tests

I don't want to break your code by optimizing it incorrectly! Therefore, we need test cases. Let's see what kind of outputs we're dealing with:

diff --git a/src/main.rs b/src/main.rs
index ada88d9..65bc1b3 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -81,8 +81,11 @@ fn main() {
     let u: Vec<String> = vpn.par_iter().flat_map(|p| {
         let hu = input(p.to_string(),&normal_string);
         let dat = filter(&hu,&vpn);
+        if !dat.is_empty() {
+            println!("{:?}: {:?}", p, dat);
+        }

With that in place, the run prints lines like:
"comtabo mhf": ["comtabo mh"]
"online spas heizer": ["online spas heize"]
"datanetwork nl": ["datanetwork n"]
"experience project law enforcement ltd.": ["experience project law enforcement ltd"]
"light twenty seconds": ["light twenty second"]
"e70b wide ku band": ["e70b wide ku ban"]

Uhmmmm.... so... this always produces a string that is one character shorter than p? Are you sure this is doing what you want?

...since that hardly exercises the behavior of filter_gdat, I'm going to write my own test based on what the function appears to be trying to do. I tried to exercise all the edge cases I could think of, but may have missed a couple.

#[test]
fn gdat() {
    let strings = |strs: &[&str]| strs.iter().map(|s| s.to_string()).collect::<Vec<_>>();
    assert_eq!(
        filter_gdat(
            &strings(&[
                "a", // appears in three inputs
                "aaaabbaa", // never appears
                "lol", // appears once at the end and once in the middle
                "wards", // only appears in one input, but twice
                "copter", // only appears once at the end
                "be", // appears twice in the middle
                "n't", // one of the occurences is the complete string
                "b", // overlaps with "be"
            ]),
            &strings(&["abea", "roflolacopter", "don't wardsy wards be happy lol"]),
        ),
        strings(&["a", "lol", "be", "n't", "b"]),
    );
}

With that, I can verify that the following implementation is correct:

#[inline(never)]
fn filter_gdat(dat: &[String], vpn: &[String]) -> Vec<String> {
  dat.iter()
    .filter(|q| vpn.iter().filter(|d| d.contains(&q[..])).count() > 1)
    .map(|s| s.to_string())
    .collect()
}

Optimize the code

First try

Unfortunately, there's no fast way in std to search for a single needle in multiple haystacks, even with nightly features. So we look to aho_corasick, which provides fast searching for multiple needles in multiple haystacks.

use aho_corasick::AhoCorasick;
use std::collections::BTreeMap;

#[inline(never)]
fn filter_gdat(dat: &[String], vpn: &[String]) -> Vec<String> {
  let ac = AhoCorasick::new(dat);
  let matched_indices = {
    vpn.iter().flat_map(|haystack| ac.find_overlapping_iter(haystack).map(|m| m.pattern()))
  };

  get_counts(matched_indices).into_iter()
    .filter(|&(_, count)| count > 1)
    .map(|(index, _)| dat[index].to_string())
    .collect()
}

fn get_counts<T: Ord, I: IntoIterator<Item=T>>(iter: I) -> BTreeMap<T, usize> {
    let mut map = BTreeMap::new();
    for value in iter {
        *map.entry(value).or_insert(0) += 1;
    }
    map
}

This reduces the total runtime from 4.5 seconds down to 0.7s! But... whoops! It fails the unit test because it counts wards, which appears twice in a single string. Good thing I wrote that test!

More correct

Let's see how much of that boost we can keep after fixing the implementation to use an individual AhoCorasick for each str:

#[inline(never)]
fn filter_gdat(dat: &[String], vpn: &[String]) -> Vec<String> {
  dat.iter()
    .filter(|q| {
      let ac = AhoCorasick::new(&[q][..]);
      vpn.iter().filter(|d| ac.is_match(d)).count() > 1
    })
    .map(|s| s.to_string())
    .collect()
}

This passes the test and is more correct, and simpler, but takes 3.5s. Hm. That's not nearly as nice.

Correct AND fast

It looks like AhoCorasick really does excel when given multiple needles at once, so let's see if we can write an impl that is correct AND mega-fast. I'll take the fast implementation from before, and simply add a call to Itertools::unique.

use itertools::Itertools;

#[inline(never)]
fn filter_gdat(dat: &[String], vpn: &[String]) -> Vec<String> {
  let ac = AhoCorasick::new(dat);
  let matched_indices = {
    vpn.iter().flat_map(|haystack| {
        ac.find_overlapping_iter(haystack).map(|m| m.pattern())
          .unique() // only count the first match of each needle in the haystack
    })
  };

  get_counts(matched_indices).into_iter()
    .filter(|&(_, count)| count > 1)
    .map(|(index, _)| dat[index].to_string())
    .collect()
}

Now it passes the test, AND runs in 0.7s.

Again, the final code is here:

https://github.com/exphp-share/open-learning-opt


To answer the original question, it seems possible to me that JS String.includes precomputes needle information similar to what Rust does, but caches it on each string object.

Nice analysis. You might get a small speed-up by tweaking how the automaton is constructed. Since it looks like you have a small number of patterns, building an Aho-Corasick DFA shouldn't take too much time and should be faster than using the default NFA. You can defer such decisions to the library by using AhoCorasick::new_auto_configured.
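
For illustration, against the aho-corasick 0.7 API (the version current at the time of this thread; build_matcher is a hypothetical helper):

use aho_corasick::{AhoCorasick, AhoCorasickBuilder};

fn build_matcher(patterns: &[String]) -> AhoCorasick {
    // dfa(true) trades more build time and memory for faster searches.
    AhoCorasickBuilder::new()
        .dfa(true)
        .build(patterns)
}

// Or let the crate pick a configuration based on the patterns themselves:
// let ac = AhoCorasick::new_auto_configured(patterns);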

The part where I printed p and dat might be a bit misleading. I think both inputs to filter actually contain hundreds of elements. It's just that the final output of filter is small (and, honestly, quite surprising).

Thanks a lot for this implementation.

The goal of this program is, given a Vector of strings:

apple
pineapple

for each element, generate all possible substrings with a minimum size of 3:

apple
appl
app
pple
ppl
ple

Then search the vec for substrings that occur at least 2 times and are not contained in the NORMAL file.
The filtered substrings are then compared with each other, and if one substring contains another, the smaller one is removed.
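
For illustration, a minimal sketch of the substring-generation step (substrings_min3 is a hypothetical helper; it assumes ASCII input, which the clear function guarantees). For "apple" it yields app, appl, apple, ppl, pple, ple:

fn substrings_min3(s: &str) -> Vec<&str> {
    let n = s.len(); // byte length == character length for ASCII
    (0..n)
        .flat_map(move |start| (start + 3..=n).map(move |end| &s[start..end]))
        .collect()
}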

I don't really understand how this bit of code works:

  let ac = AhoCorasick::new(dat);
  let matched_indices = {
    vpn.iter().flat_map(|haystack| {
        ac.find_overlapping_iter(haystack).map(|m| m.pattern())
          .unique() // only count the first match of each needle in the haystack
    })
  };

  get_counts(matched_indices).into_iter()
    .filter(|&(_, count)| count > 1)
    .map(|(index, _)| dat[index].to_string())
    .collect()
  • First, precompute a bunch of information about the patterns in dat. (AhoCorasick::new)
  • For every string haystack in vpn, get the indices into dat of all patterns that are contained in haystack. (find_overlapping_iter, m.pattern)
    • find_overlapping_iter will return multiple matches for the same pattern if it is contained multiple times. (e.g. searching for a in aloha). Only keep one of these. (.unique())
  • Gather all of these indices for all items in vpn into one big iterator. (flat_map)
  • Count how many times each index appears. (get_counts)
  • Only keep the indices that have a count greater than 1. (filter)
  • Get the corresponding items from dat. (map)
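
A tiny worked example (made-up data) may help make that concrete; it passes against both the simple implementation and the Aho-Corasick ones:

#[test]
fn filter_gdat_small_example() {
    let dat: Vec<String> = ["app", "ppl"].iter().map(|s| s.to_string()).collect();
    let vpn: Vec<String> = ["apple", "applet", "ripple"].iter().map(|s| s.to_string()).collect();
    // "app" is found in 2 of the haystacks and "ppl" in all 3,
    // so both counts are > 1 and both patterns survive.
    assert_eq!(filter_gdat(&dat, &vpn), dat);
}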

Hundreds, or even low thousands, are in the realm of "small." Building a DFA in that case should still be quite fast.

Thanks a lot. I come from Java, so I never use functional-programming-style features or Rust-specific idioms. I will learn more about how to use them in Rust. I thought iterators were slow.
Thanks again, I will add your changes to the crate. If you have any ideas for improving performance, please contact me.

Oh wow @ExpHP, that is a great example of optimisation!

I was trying something similar, but never got as far (or as good) as you did! I got as far as a hunch about the contains calls, and the inefficient repeated searching of normal, but Aho-Corasick never even occurred to me.
And so nicely documented in git commits! :heart:
I'm going to study this :smile:

Oh no! Quite the opposite! Rust's compiler is extremely good at optimising iterators, because they very precisely explain their internal data-dependencies.
They are usually as good as perfectly implemented, guru-level unrolled for loops (and far more readable to us humans), and usually (way) better than naive for loops.
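
As a toy illustration (hypothetical functions), the two versions below typically compile to equivalent machine code, and the compiler can often auto-vectorize both:

fn sum_of_squares_iter(v: &[u32]) -> u32 {
    v.iter().map(|x| x * x).sum()
}

fn sum_of_squares_loop(v: &[u32]) -> u32 {
    let mut total = 0;
    for x in v {
        total += x * x;
    }
    total
}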

I will say that they take some getting used to, and it is very important to have a working knowledge of all the different combinators that exist, so that you can pick the perfect one.
If you try to "abuse" one combinator to re-implement another, because you forgot the other one exists, it usually hurts your optimisation opportunities (I speak from experience :blush:).

For me personally, I also never used them (Java ecosystem), but the Blandy-Orendorff book "Programming Rust" made them 'click' for me. In Rust, I wouldn't recommend learning them from the reference docs (you get "lost in the woods" quite quickly, because of all the "secret" intermediate types).

Iterators are a very powerful, expressive tool in all languages that have them, so learning them in Rust will also improve you as a programmer in general! (My Groovy JVM code is also a lot better now!)

Very good conclusion. I will read that today. Thanks for your help. So functional programming is worth it in terms of performance?

I've added your performance tricks.

In Rust, the functional-style techniques are usually "zero cost" abstractions. "Zero cost" here means that if you manually tried to do the same thing imperatively, you wouldn't be making any fewer allocations or operations than what the FP style compiles down to. That's why you'll usually see recommendations to go with it... it's nicer to reason about and is free :slight_smile:

(in other languages FP style often comes with extra allocations, a garbage collector, etc... Rust is different)
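
A small sketch of what "zero cost" means in practice (made-up example): the adaptor chain below builds no intermediate collections; only the final collect() allocates the result.

fn long_words(words: &[&str]) -> Vec<String> {
    words
        .iter()
        .filter(|w| w.len() > 3)   // lazy: nothing is allocated here
        .map(|w| w.to_uppercase()) // a String is built only for kept items
        .collect()                 // the single collection allocation
}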

Big-picture, code that is easier to reason about is often easier to optimize... so I think there's a bit of wiggle room to say using FP style might indirectly make your code faster on large projects... but that's reaching a bit hehe
