WASM faster than native x86-64 build. WT..?!

ZiCog · August 8, 2019, 8:08pm

What deep magic is this? Or have I made a terrible mistake?

My first ever Rust program, after a few days studying Rust, finds anagrams in the 650,000 words of the Debian dictionary file /usr/share/dict/british-english-insane.

I have a build script that builds a regular release executable and also builds as WASM and binds for node.js and the browser.

You can see it run in the browser here: http://otaniemi.conveqs.fi:9000/www/
Code is here: GitHub - ZiCog/insane-british-anagram-rust: Rust program to find anagrams in the Debian british-english-insane dictionary file.

Now, I would not normally mention any of this except for one thing...

How come the WASM build under node.js can be 40% faster than the native x86-64 build?

$ ./target/release/insane-british-anagram > anagrams.txt
503ms
473ms
$ cd nodejs/
$ node index.js > anagrams.txt
597ms
288ms

How come in the web page in Chrome it runs in 403ms. Faster than native executable.

The execution times measured there do not include the time to read the dictionary file or print the output. I have made release builds all round.

Am I going nuts?

kornel · August 9, 2019, 12:04am

That's odd. It might be possible if you've hit a rare case where a JIT can optimize better than ahead-of-time compiler. Perhaps it eliminates bounds checks better?

You could profile the native code to see what's expensive.

Try using iterators instead of for i in 0..x loops. Indexing with [i] may be slow. Use ArrayVec instead of fixed-size arrays with separate size.

Try using a faster hashmap:

On String::from_utf8_lossy(&slice).to_string(); don't call .to_string(), so most of the time it'll give you a slice without allocation.

You could use let mut output = String::with_capacity(<estimate size here>) to make output faster.

nologik · August 9, 2019, 1:48am

I am become web assembly, destroyer of benchmarks

DHorse2 · August 9, 2019, 2:45am

Why arrayvec? It's stack features?

Update: This is helpful:

github.com

nrc/r4cppp/blob/master/arrays.md

# Arrays and Vectors

Rust arrays are pretty different from C arrays. For starters they come in
statically and dynamically sized flavours. These are more commonly known as
fixed length arrays and slices. As we'll see, the former is kind of a bad name
since both kinds of array have fixed (as opposed to growable) length. For a
growable 'array', Rust provides the `Vec` collection.


## Fixed length arrays

The length of a fixed length array is known statically and features in its
type. E.g., `[i32; 4]` is the type of an array of `i32`s with length four.

Array literal and array access syntax is the same as C:

```rust
let a: [i32; 4] = [1, 2, 3, 4];     // As usual, the type annotation is optional.
println!("The second element is {}", a[1]);
```

This file has been truncated. show original

ZiCog · August 9, 2019, 7:05am

kornel,

Thanks for the tips.

Changing to HashBrown has had a huge effect:

$ ./target/release/insane-british-anagram > anagrams.txt
448ms
421ms
$ cd nodejs/
$ node index.js > anagrams.txt
519ms
215ms

The native build speeds up by 10%
The WASM build speeds up by 25%

The WASM build is twice as fast now on the second execution. WTF?!!

Meanwhile in my Chrome browser I see:

WASM execution time: 325 milliseconds. Which 20 or 30% faster than the original HashMap version.

Removing that to_string() made no noticeable difference.

So far I have failed to get arrayvec working, it complains about some missing trait or other...

Repo is updated. If anyone wants to see this miracle I have tagged the versions tested.

Note: Tests run on an Intel Core i7-2600K at 3.4GHz. Things look a bit different on my server, whatever that is.

17cupsofcoffee · August 9, 2019, 8:43am

FYI, on Rust 1.36 and above, std::collections::HashMap is hashbrown - they've integrated it into the standard library. The only major difference between the crate and the version in std is that the crate uses a different hasher (ahash) by default, so that's probably why you're seeing a speedup.

With regards to the WASM build being faster on the second run, I'd guess that this is due to the JIT having 'warmed up' after running your code once already.

vmedea · August 9, 2019, 8:58am

that's very intersting, does the JIT cache anything between runs ?

if it can 'learn' between runs (e.g. by using profiling, branch likelyhood, information) and optimize for the specific code+data combination (after all, the data is fixed), instead of just the code, then the benchmark could be realllly fast at some point

17cupsofcoffee · August 9, 2019, 9:09am

This article on how the Node/Chrome JavaScript engine optimizes code is pretty interesting. I'm not sure how much of it carries over to the WASM engine, but I'd assume there's crossover, at least in the general concepts!

jonh · August 9, 2019, 10:19am

Reason is simple. System allocator isn't good; switch to Jemalloc and see the same speed up.

ZiCog · August 9, 2019, 11:59am

johnh,

Somehow I very much doubt switching allocator will help much. I have tried hard not to do much allocation there. You will notice that when I find words in the dictionary and build lists of anagrams I don't use Strings but only indexes back to the binary dictionary array that get used to access slices in the dictionary.

If I could use fixed size vectors for my various tables I would.

Anyway, as an experiment how does one get Rust to use Jemalloc ?

broken-pen · August 9, 2019, 12:02pm

The readme file of jemallocator has an example of how to use it.

Yandros · August 9, 2019, 3:23pm

Important: when doing benchmarks with HashMaps, really do try to average many runs: the hashing is randomized at each run, so some runs may have more (hash) collisions than others. I don't know if it applies in your case, but I know of cases where it had most definitely been the case.

newpavlov · August 9, 2019, 3:48pm

Another possible reason in addition to JIT is using 64-bit pointers/lengths instead of 32-bit on WASM, i.e. some structures take more space on your native target compared to WASM, e.g. String will take 24 bytes on x86-64 and only 12 on wasm32. So if your algorithm is memory-bound it can result in a noticeable impact on benchmark results. Have you tried to compile your code for i686 target and benchmark resulting binary?

Note that wasm32-unknown-unknown uses a constant seed, so only native runs will be randomized.

ZiCog · August 9, 2019, 3:54pm

Yes, I'm aware of the problems with bench marking such things.

Problem is this code came about because of a little coding challenge Liberation through Computer Literacy - Page 41 - Raspberry Pi Forums

Currently this Rust solution is in 3rd place (Twice as fast as my C++ solution in that chart).

As such it only gets to run once. A deterministic and fast solution would be better than a randomly amortized fast solution.

I have tried creating the vectors I use with_capacity() a few times along the way. It has never made a noticeable difference to the run time.

ZiCog · August 9, 2019, 5:58pm

kornel,

Thanks for the pull request.

Wow! That speeds up the x86-64 build by about 25%

$ ./target/release/insane-british-anagram > anagrams.txt
328ms
326ms

At the cost of slowing down the WASM by 50%

$ node index.js > anagrams.txt
490ms 
324ms

But there in lies a clue. Something in that change set is very friendly to x86 and very unfriendly to WASM.

Now I only have to figure out what you did exactly...

ZiCog · August 9, 2019, 7:07pm

Sadly kornel, your changes make things a bit slower on the Raspberry Pi:

With std HashMap on Raspberry Pi 3 (v0.1.0):

$ ./target/release/insane-british-anagram > anagrams.txt
729ms
717ms

With Goggle's HashBrown on Pi 3 (v0.1.1):

$ ./target/release/insane-british-anagram > anagrams.txt
606ms
583ms

With ArrayVec on Pi 3 (v0.1.2):

$ ./target/release/insane-british-anagram > anagrams.txt
664ms
639ms

This benchmarking business is the road to madness!

najamelan · August 9, 2019, 8:37pm

Best thing I found so far for profiling is cargo-flamegraph. It could give you some insights.

ZiCog · August 9, 2019, 9:03pm

Thanks for the suggestion.

I have been studiously avoiding putting any proper benchmark timing or profiling in place.

Thing is I now have 8 different platforms to think about. Things look very different on my PC than they do on the Pi, as the figures I quoted above show. They are different again on my cloud server and on a Pi running a 64 bit OS. For each of those the relative performance of native vs wasm changes.

As we see above, changing a little feature of the program's source can have good effects in one place and bad in the other.

Exercising all of those properly could be a very long and tedious task....

I guess I should just focus on the curious phenomena that a wasm build under node.js is faster than native code on this particular PC.

kornel · August 9, 2019, 9:16pm

The algorithm is an exercise of hashmaps, pointer chasing, and bounds checks.

So the factors may be:

How well-optimized is the hashmap implementation
How the working set size fits cache size and branch prediciton
How well bounds checks can be eliminated

I expect 1 and 2 favor x86-64, and 3 to favor WASM.

I've made tables pre-allocated to their final size, which reduces copying overhead. But it may also make them exceed cache size from the start, not just after a few resizings. So try tuning capacity.
Try using get_unchecked for indexing.
If the algorithm allows, try using u32 for indices rather than u64. It may be cheaper on 32-bit targets, and help fit in the cache.

ZiCog · August 9, 2019, 11:22pm

Whoah, what?

get_unchecked() sounds like a unsafe array access with no bounds checking!

How is that allowed ?

The cache friendliness or otherwise of this algorithm boggles my mind. Anagrams are by nature all over the map.

I'm open to suggestions for a totally different anagram finding algorithm here.

Topic		Replies	Views
Cheating to learn Rust by converting JavaScript to Rust, now I want to improve the code help	48	692	May 17, 2025
How to improve rust performance to 1.4 seconds? help	32	1831	September 14, 2023
Performance issue - High complexity code help	17	1643	July 1, 2020
Using Rust+WASM inside a React Frontend App help	13	2545	October 13, 2020
Success story: new Rustacean beating C perf in first week	13	5942	May 9, 2019

WASM faster than native x86-64 build. WT..?!

Related topics