WASM faster than native x86-64 build. WT..?!

What deep magic is this? Or have I made a terrible mistake?

My first ever Rust program, after a few days studying Rust, finds anagrams in the 650,000 words of the Debian dictionary file /usr/share/dict/british-english-insane.

I have a build script that builds a regular release executable and also builds as WASM and binds for node.js and the browser.

You can see it run in the browser here: http://otaniemi.conveqs.fi:9000/www/
Code is here: https://github.com/ZiCog/insane-british-anagram-rust

Now, I would not normally mention any of this except for one thing...

How come the WASM build under node.js can be 40% faster than the native x86-64 build?

$ ./target/release/insane-british-anagram > anagrams.txt
503ms
473ms
$ cd nodejs/
$ node index.js > anagrams.txt
597ms
288ms

How come in the web page in Chrome it runs in 403ms. Faster than native executable.

The execution times measured there do not include the time to read the dictionary file or print the output. I have made release builds all round.

Am I going nuts?

4 Likes

That's odd. It might be possible if you've hit a rare case where a JIT can optimize better than ahead-of-time compiler. Perhaps it eliminates bounds checks better?

You could profile the native code to see what's expensive.


Try using iterators instead of for i in 0..x loops. Indexing with [i] may be slow. Use ArrayVec instead of fixed-size arrays with separate size.

Try using a faster hashmap:

On String::from_utf8_lossy(&slice).to_string(); don't call .to_string(), so most of the time it'll give you a slice without allocation.

You could use let mut output = String::with_capacity(<estimate size here>) to make output faster.

1 Like

I am become web assembly, destroyer of benchmarks

12 Likes

Why arrayvec? It's stack features?

Update: This is helpful:

kornel,

Thanks for the tips.

Changing to HashBrown has had a huge effect:

$ ./target/release/insane-british-anagram > anagrams.txt
448ms
421ms
$ cd nodejs/
$ node index.js > anagrams.txt
519ms
215ms

The native build speeds up by 10%
The WASM build speeds up by 25%

The WASM build is twice as fast now on the second execution. WTF?!!

Meanwhile in my Chrome browser I see:

WASM execution time: 325 milliseconds. Which 20 or 30% faster than the original HashMap version.

Removing that to_string() made no noticeable difference.

So far I have failed to get arrayvec working, it complains about some missing trait or other...

Repo is updated. If anyone wants to see this miracle I have tagged the versions tested.

Note: Tests run on an Intel Core i7-2600K at 3.4GHz. Things look a bit different on my server, whatever that is.

FYI, on Rust 1.36 and above, std::collections::HashMap is hashbrown - they've integrated it into the standard library. The only major difference between the crate and the version in std is that the crate uses a different hasher (ahash) by default, so that's probably why you're seeing a speedup.

With regards to the WASM build being faster on the second run, I'd guess that this is due to the JIT having 'warmed up' after running your code once already.

that's very intersting, does the JIT cache anything between runs ?

if it can 'learn' between runs (e.g. by using profiling, branch likelyhood, information) and optimize for the specific code+data combination (after all, the data is fixed), instead of just the code, then the benchmark could be realllly fast at some point

This article on how the Node/Chrome JavaScript engine optimizes code is pretty interesting. I'm not sure how much of it carries over to the WASM engine, but I'd assume there's crossover, at least in the general concepts!

1 Like

Reason is simple. System allocator isn't good; switch to Jemalloc and see the same speed up.

johnh,

Somehow I very much doubt switching allocator will help much. I have tried hard not to do much allocation there. You will notice that when I find words in the dictionary and build lists of anagrams I don't use Strings but only indexes back to the binary dictionary array that get used to access slices in the dictionary.

If I could use fixed size vectors for my various tables I would.

Anyway, as an experiment how does one get Rust to use Jemalloc ?

The readme file of jemallocator has an example of how to use it.

Important: when doing benchmarks with HashMaps, really do try to average many runs: the hashing is randomized at each run, so some runs may have more (hash) collisions than others. I don't know if it applies in your case, but I know of cases where it had most definitely been the case.

Another possible reason in addition to JIT is using 64-bit pointers/lengths instead of 32-bit on WASM, i.e. some structures take more space on your native target compared to WASM, e.g. String will take 24 bytes on x86-64 and only 12 on wasm32. So if your algorithm is memory-bound it can result in a noticeable impact on benchmark results. Have you tried to compile your code for i686 target and benchmark resulting binary?

Note that wasm32-unknown-unknown uses a constant seed, so only native runs will be randomized.

1 Like

Yes, I'm aware of the problems with bench marking such things.

Problem is this code came about because of a little coding challenge https://www.raspberrypi.org/forums/viewtopic.php?f=31&t=240287&start=1000#p1515612

Currently this Rust solution is in 3rd place (Twice as fast as my C++ solution in that chart).

As such it only gets to run once. A deterministic and fast solution would be better than a randomly amortized fast solution.

I have tried creating the vectors I use with_capacity() a few times along the way. It has never made a noticeable difference to the run time.

kornel,

Thanks for the pull request.

Wow! That speeds up the x86-64 build by about 25%

$ ./target/release/insane-british-anagram > anagrams.txt
328ms
326ms

At the cost of slowing down the WASM by 50%

$ node index.js > anagrams.txt
490ms 
324ms

But there in lies a clue. Something in that change set is very friendly to x86 and very unfriendly to WASM.

Now I only have to figure out what you did exactly...

Sadly kornel, your changes make things a bit slower on the Raspberry Pi:

With std HashMap on Raspberry Pi 3 (v0.1.0):

$ ./target/release/insane-british-anagram > anagrams.txt
729ms
717ms

With Goggle's HashBrown on Pi 3 (v0.1.1):

$ ./target/release/insane-british-anagram > anagrams.txt
606ms
583ms

With ArrayVec on Pi 3 (v0.1.2):

$ ./target/release/insane-british-anagram > anagrams.txt
664ms
639ms

This benchmarking business is the road to madness!

2 Likes

Best thing I found so far for profiling is cargo-flamegraph. It could give you some insights.

Thanks for the suggestion.

I have been studiously avoiding putting any proper benchmark timing or profiling in place.

Thing is I now have 8 different platforms to think about. Things look very different on my PC than they do on the Pi, as the figures I quoted above show. They are different again on my cloud server and on a Pi running a 64 bit OS. For each of those the relative performance of native vs wasm changes.

As we see above, changing a little feature of the program's source can have good effects in one place and bad in the other.

Exercising all of those properly could be a very long and tedious task....

I guess I should just focus on the curious phenomena that a wasm build under node.js is faster than native code on this particular PC.

The algorithm is an exercise of hashmaps, pointer chasing, and bounds checks.

So the factors may be:

  1. How well-optimized is the hashmap implementation
  2. How the working set size fits cache size and branch prediciton
  3. How well bounds checks can be eliminated

I expect 1 and 2 favor x86-64, and 3 to favor WASM.

  • I've made tables pre-allocated to their final size, which reduces copying overhead. But it may also make them exceed cache size from the start, not just after a few resizings. So try tuning capacity.
  • Try using get_unchecked for indexing.
  • If the algorithm allows, try using u32 for indices rather than u64. It may be cheaper on 32-bit targets, and help fit in the cache.

Whoah, what?

get_unchecked() sounds like a unsafe array access with no bounds checking!

How is that allowed ?

The cache friendliness or otherwise of this algorithm boggles my mind. Anagrams are by nature all over the map.

I'm open to suggestions for a totally different anagram finding algorithm here.