WASM faster than native x86-64 build. WT..?!

Have you tried replacing your HashMap with a BTreeMap?

cliff,

No but I did just now.

Using BTreeMap adds 200ms to the execution time. From about 300ms to about 500ms on my PC.

cliff,

But I read "optimal choice for a sorted map" in the BTreeMap docs.

So perhaps if I sorted the letters of each word to create a "hash" string common to all anagrams of the same word, that would save having to maintain my index.

I tried that in a much earlier solution in JavaScript. The prime number hash worked out quicker there.

Maybe not so in Rust...
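For what it's worth, the sorted-letters idea looks something like this in Rust (a minimal hypothetical sketch, not the code from insane-british-anagram-rust; `anagram_key` is a made-up name):

```rust
use std::collections::HashMap;

// Build a key common to all anagrams of a word by sorting its bytes.
// Assumes ASCII input; real dictionary words may need more care.
fn anagram_key(word: &str) -> Vec<u8> {
    let mut bytes = word.as_bytes().to_vec();
    bytes.sort_unstable();
    bytes
}

fn main() {
    let words = ["listen", "silent", "enlist", "google"];
    let mut sets: HashMap<Vec<u8>, Vec<&str>> = HashMap::new();
    for w in &words {
        sets.entry(anagram_key(w)).or_default().push(w);
    }
    // "listen", "silent" and "enlist" share a key; "google" stands alone.
    for group in sets.values() {
        if group.len() > 1 {
            println!("{:?}", group);
        }
    }
}
```

Whether this beats the prime-number hash in Rust would come down to measurement, as the JavaScript experience suggests.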

get_unchecked() is indeed wildly unsafe, which is why you need to wrap its use in an unsafe block. Yet it is sometimes necessary and useful: sometimes you, as the programmer, can be sure that the index will never be out of bounds, yet rustc can't prove that and will insert bounds checks to keep things safe. That is the situation where get_unchecked is the right solution.

Oh trust me, L0uisc, I have been doing "unsafe" things in software for decades.

Which is why I'm glad to find that with Rust I can wrap those things up in an "unsafe" block!

Seriously though, I don't want to use "unsafe" in my first ever Rust program.

Oh, yeah... if you use iterators instead of indexing with integers, you enable the compiler to elide bounds checks without unsafe.
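The difference can be illustrated like this (a hypothetical example, not code from the actual program):

```rust
fn sum_indexed(v: &[u32]) -> u32 {
    let mut total = 0;
    // Every v[i] is a candidate for a bounds check,
    // which the optimizer may or may not manage to remove.
    for i in 0..v.len() {
        total += v[i];
    }
    total
}

fn sum_iter(v: &[u32]) -> u32 {
    // The iterator already knows the slice bounds,
    // so no per-element check is needed.
    v.iter().sum()
}

fn main() {
    let v = vec![1, 2, 3, 4];
    // Both compute the same result; the iterator version
    // gives the compiler an easier job.
    println!("{} {}", sum_indexed(&v), sum_iter(&v));
}
```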


Hooray!

Thanks kornel, pure genius. That insane-british-anagram-rust is now the fastest single core solution to the anagram finding challenge! See comparison chart: https://www.raspberrypi.org/forums/viewtopic.php?f=31&t=240287&start=1000#p1515612

Significantly faster than my original C++ version of this algorithm, and a bit faster than the fastest C solution.

The only thing faster is a multi-core solution using OpenMP. Looks like I will be investigating threads in Rust next....

To be honest, that iterator thing gives me headache:

for (i, slice) in anagram_sets[anagram_sets_idx as usize].word_slices.iter().enumerate() {
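The enumerate() part is less magic than it looks: it just pairs each element with its running index. A stripped-down illustration with a made-up slice:

```rust
fn main() {
    // iter() yields references to the elements;
    // enumerate() wraps each one in an (index, item) tuple.
    let word_slices = ["cat", "act", "tac"];
    for (i, slice) in word_slices.iter().enumerate() {
        println!("{}: {}", i, slice);
    }
}
```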


What deep magic is this?

So I took up jonh's advice to try the jemallocator in my anagram finder. With very disturbing results:

$ ./target/release/insane-british-anagram > anagrams.txt
324ms
79ms

Do what?! On the second run of the algorithm it is 4 times faster. That is totally unexpected.

On the other hand, on a Raspberry Pi 3 there is little improvement if any:

$ ./target/release/insane-british-anagram > anagrams.txt
664ms
571ms

What on Earth is going on there?


Try measuring the time with Bencher (cargo bench). Single runs may be noisy due to caches, system load, CPU throttling.


kornel,

Thanks for the heads up on cargo bench. It works a treat. That is going to be very useful in the future.

I have a couple of issues with it in this case:

Firstly, it lies to me. Having iterated my core algorithm however many times, it declares that it is far faster than I could ever imagine (75ms).

 Running target/release/deps/insane_british_anagram-d47f8595fdf17f6a

running 1 test
test tests::bench_anagrams ... bench:  75,641,200 ns/iter (+/- 852,900)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out  

However the reality is that on the first run it takes 300ms. As shown when the program times itself:

$ time target/release/insane-british-anagram > anagrams.txt  
Run    1: 306ms
Digest 1: addeda36a4631afc983f3bff13742e36
Run    2: 75ms
Digest 2: addeda36a4631afc983f3bff13742e36

Of course this program only runs the algorithm once, so 300ms is more like the correct answer.

Sadly cargo bench crashes on my Pi 3 with a seg fault.

Here is my test harness:

    #[cfg(test)]
    mod tests {
        use super::*;
        use test::Bencher;

        #[bench]
        fn bench_anagrams(b: &mut Bencher) {
            let dictionary = std::fs::read("/usr/share/dict/british-english-insane").unwrap();
            b.iter( || {
                anagrams(&dictionary)
            });
        }
    }

Thanks.

Bench runs it multiple times and gives a typical running time. So it is accurate. That 300ms first run could include some one-time cost.

As for the segfault, check it in the debugger. cargo bench prints the executable to run. If you can verify it doesn't crash because of your code, report it as a bug to rust-lang project.


Yes, I can believe it's accurate when averaged over many iterations. But the fact is that the first iteration takes 4 times longer. The program only runs that function once, so that is more like what one experiences. Especially for others who are simply comparing many such solutions in different languages using just "time".

It's not running my code in the benchmark that segfaults, it's compiling it in the first place:

error: Could not compile `insane-british-anagram`.

Caused by:
process didn't exit successfully: `rustc --edition=2018 --crate-name insane_british_anagram src/lib.rs --color always --emit=dep-info,link -C opt-level=3 --test --cfg
'feature="default"' -C metadata=3d0d4073a4a4b64c -C extra-filename=-3d0d4073a4a4b64c --out-dir /home/pi/insane-british-anagram-rust/target/release/deps -L dependency=/home/pi/insane-british-anagram-rust/target/release/deps --extern arrayvec=/home/pi/insane-british-anagram-rust/target/release/deps/libarrayvec-2d1307b03da46e90.rlib --extern 
hashbrown=/home/pi/insane-british-anagram-rust/target/release/deps/libhashbrown-871bc5f6085f1297.rlib --extern jemallocator=/home/pi/insane-british-anagram-rust/target/release/deps/libjemallocator-5d0f36d38e6f33a9.rlib --extern md5=/home/pi/insane-british-anagram-rust/target/release/deps/libmd5-9b999e1d6b11f43b.rlib --extern time=/home/pi/
insane-british-anagram-rust/target/release/deps/libtime-9fcdc487b110451c.rlib --extern typename=/home/pi/insane-british-anagram-rust/target/release/deps/libtypename-512109d1e73dcfde.rlib --extern wasm_bindgen=/home/pi/insane-british-anagram-rust/target/release/deps/libwasm_bindgen-decc85c0d9d8f814.rlib --extern web_sys=/home/
pi/insane-british-anagram-rust/target/release/deps/libweb_sys-205ef0a11a385c85.rlib -L native=/home/pi/insane-british-anagram-rust/target/release/build/
jemalloc-sys-f7bca3cc8be438b0/out/build/lib` (signal: 11, SIGSEGV: invalid memory reference)   
warning: build failed, waiting for other jobs to finish...

That looks like a compiler bug to me. Is this on an unusual target? I suggest:

  • update Rust to the latest nightly and verify that the issue still occurs
  • search the Rust issue tracker for similar issues (SIGSEGV); also check closed issues, as Rust development is fast paced
  • if nobody has reported it, try to figure out whether it happens on all versions of rustc or only started recently
  • if it started recently, try to find the exact nightly that introduced the bug (note you can install nightlies by date with rustup)
  • if it still happens with the latest nightly and nobody has reported it, file an issue with as much information as you can (target, OS, ...), preferably with a self-contained crate that reproduces it and the compiler version that first introduced the bug
  • be patient, it will get fixed!

najamelan,

I don't know about unusual; I was running that on a Raspberry Pi 3. There are over 20 million of them around, and I have noticed recently a lot of Pi users getting into Rust.

Thinking it may be running out of RAM I added 1G of swap space. No joy.

This was the latest nightly.

I'll have a poke around and see what I can find out. Problem is, every time I change nightly it takes an hour to rebuild everything!

Perhaps I should have a go at cross-compiling from my PC...

You always need to verify whether those 'one-time costs' are run on each iteration of the benchmark as well, not just the algorithm. This is an important thing to consider while benchmarking.
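One way to see what a looping benchmark leaves out is to time the one-time cost and the core algorithm separately (a hypothetical sketch; `setup` and `algorithm` are stand-ins for the real dictionary read and anagram search):

```rust
use std::time::Instant;

// Stands in for reading the dictionary file (the one-time cost).
fn setup() -> Vec<u32> {
    (0..1_000_000).collect()
}

// Stands in for the anagram search itself.
fn algorithm(data: &[u32]) -> u64 {
    data.iter().map(|&x| x as u64).sum()
}

fn main() {
    // Time each part separately, so you know which part a benchmark
    // that only loops `algorithm` is leaving out of its numbers.
    let t0 = Instant::now();
    let data = setup();
    let setup_time = t0.elapsed();

    let t1 = Instant::now();
    let result = algorithm(&data);
    let algo_time = t1.elapsed();

    println!("setup: {:?}, algorithm: {:?}, result: {}", setup_time, algo_time, result);
}
```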

Don't forget to clear the processor cache between iterations or you'll get misleading results.


@OvermindDL1,

In this case the only apparent one-time cost is reading the dictionary file. I'm guessing there is not much I can do to speed that up, so I want to leave it out of the benchmark. Similarly for writing the output.

@alanhkarp,

Do you really mean the processor cache? If so, I have no idea how to do that. I didn't think it was even possible.

Bottom line is that with jemallocator the benchmark returns 75-odd milliseconds; with the regular allocator it returns 300-odd milliseconds.

Meanwhile, the program runs the algorithm twice and times it, showing that the first run is indeed about 300ms for both allocators, with jemallocator being marginally quicker.

With jemallocator:

$ ./target/release/insane-british-anagram > anagrams.txt
Run    1: 306ms
Digest 1: addeda36a4631afc983f3bff13742e36
Run    2: 78ms
Digest 2: addeda36a4631afc983f3bff13742e36

With standard allocator:

$ ./target/release/insane-british-anagram > anagrams.txt
Run    1: 322ms
Digest 1: addeda36a4631afc983f3bff13742e36
Run    2: 315ms
Digest 2: addeda36a4631afc983f3bff13742e36

Then again timing the WASM version in my Chrome browser shows 253ms. Substantially faster than both of them.

I don't know how this is possible or how to investigate it.


You can clear the cache by creating a big array and writing to it, being careful that you don't let the compiler optimize it away. My Fortran (yes, Fortran) code to do that is:

c Flush cache
      subroutine flush
      parameter ( icf = 2**16 )
      dimension icache(icf)
      do i = 1, icf - 1
         icache(i) = icache(i+1)
      enddo
      end

The parameter statement tells you how long ago I wrote this. Modern caches range from tens of KB per core up to tens of MB.
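A rough Rust analogue of that Fortran flush routine might look like this (a hypothetical sketch; `std::hint::black_box` keeps the compiler from optimizing the writes away, and the 64 MiB size is a guess that should be larger than your CPU's last-level cache):

```rust
use std::hint::black_box;

// Walk a buffer larger than the last-level cache so that earlier
// data gets evicted, mirroring the Fortran icache loop.
fn flush_cache() -> u8 {
    let mut buf = vec![0u8; 64 * 1024 * 1024];
    for i in 0..buf.len() - 1 {
        buf[i] = buf[i + 1].wrapping_add(1);
    }
    let probe = buf[0];
    // black_box stops the optimizer from deleting the "useless" writes.
    black_box(&buf);
    probe
}

fn main() {
    flush_cache();
    println!("cache flushed (probably)");
}
```

Call it between timed runs of the algorithm if you want each run to start from a cold cache.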


I think jemalloc doesn't release the requested memory back to the OS after each iteration. That would explain the huge perf increase after the first iteration. You might want to use https://github.com/sharkdp/hyperfine for the benchmarking instead, as it measures the execution time of an arbitrary executable. jemalloc would not be able to cache memory pages between different executions of the program.
