What deep magic is this? Or have I made a terrible mistake?
My first ever Rust program, after a few days studying Rust, finds anagrams in the 650,000 words of the Debian dictionary file /usr/share/dict/british-english-insane.
I have a build script that builds a regular release executable and also builds as WASM and binds for node.js and the browser.
That's odd. It might be possible if you've hit a rare case where a JIT can optimize better than ahead-of-time compiler. Perhaps it eliminates bounds checks better?
You could profile the native code to see what's expensive.
Try using iterators instead of for i in 0..x loops. Indexing with [i] may be slow. Use ArrayVec instead of fixed-size arrays with separate size.
Try using a faster hashmap:
On String::from_utf8_lossy(&slice).to_string(); don't call .to_string(), so most of the time it'll give you a slice without allocation.
You could use let mut output = String::with_capacity(<estimate size here>) to make output faster.
FYI, on Rust 1.36 and above, std::collections::HashMapishashbrown - they've integrated it into the standard library. The only major difference between the crate and the version in std is that the crate uses a different hasher (ahash) by default, so that's probably why you're seeing a speedup.
With regards to the WASM build being faster on the second run, I'd guess that this is due to the JIT having 'warmed up' after running your code once already.
that's very intersting, does the JIT cache anything between runs ?
if it can 'learn' between runs (e.g. by using profiling, branch likelyhood, information) and optimize for the specific code+data combination (after all, the data is fixed), instead of just the code, then the benchmark could be realllly fast at some point
This article on how the Node/Chrome JavaScript engine optimizes code is pretty interesting. I'm not sure how much of it carries over to the WASM engine, but I'd assume there's crossover, at least in the general concepts!
Somehow I very much doubt switching allocator will help much. I have tried hard not to do much allocation there. You will notice that when I find words in the dictionary and build lists of anagrams I don't use Strings but only indexes back to the binary dictionary array that get used to access slices in the dictionary.
If I could use fixed size vectors for my various tables I would.
Anyway, as an experiment how does one get Rust to use Jemalloc ?
Important: when doing benchmarks with HashMaps, really do try to average many runs: the hashing is randomized at each run, so some runs may have more (hash) collisions than others. I don't know if it applies in your case, but I know of cases where it had most definitely been the case.
Another possible reason in addition to JIT is using 64-bit pointers/lengths instead of 32-bit on WASM, i.e. some structures take more space on your native target compared to WASM, e.g. String will take 24 bytes on x86-64 and only 12 on wasm32. So if your algorithm is memory-bound it can result in a noticeable impact on benchmark results. Have you tried to compile your code for i686 target and benchmark resulting binary?
Note that wasm32-unknown-unknown uses a constant seed, so only native runs will be randomized.
I have been studiously avoiding putting any proper benchmark timing or profiling in place.
Thing is I now have 8 different platforms to think about. Things look very different on my PC than they do on the Pi, as the figures I quoted above show. They are different again on my cloud server and on a Pi running a 64 bit OS. For each of those the relative performance of native vs wasm changes.
As we see above, changing a little feature of the program's source can have good effects in one place and bad in the other.
Exercising all of those properly could be a very long and tedious task....
I guess I should just focus on the curious phenomena that a wasm build under node.js is faster than native code on this particular PC.
The algorithm is an exercise of hashmaps, pointer chasing, and bounds checks.
So the factors may be:
How well-optimized is the hashmap implementation
How the working set size fits cache size and branch prediciton
How well bounds checks can be eliminated
I expect 1 and 2 favor x86-64, and 3 to favor WASM.
I've made tables pre-allocated to their final size, which reduces copying overhead. But it may also make them exceed cache size from the start, not just after a few resizings. So try tuning capacity.
Try using get_unchecked for indexing.
If the algorithm allows, try using u32 for indices rather than u64. It may be cheaper on 32-bit targets, and help fit in the cache.