1 GB/s memcpy on wasm32 reasonable?

I'm trying to benchmark copying from one Vec&lt;T&gt; to another Vec&lt;T&gt;. It comes out to around 1 GB/s when running Rust/wasm32 in Chrome.

For those doing performance-intensive things in Rust/wasm32/Chrome, does this sound in the right ballpark?

What numbers do you get if you run the equivalent operation in pure JavaScript? I'm guessing they should be near identical.

It is not obvious to me what this proves. I am curious how far I am from optimal Rust/wasm32; what does benchmarking JS prove?

JavaScript and WebAssembly will use the same underlying JIT, so comparing WebAssembly against JavaScript will let you know what is possible on your machine in that browser.

It gives you a baseline measurement you can make comparisons against, using a procedure and hardware you know/trust.

There is such a large diversity in hardware that 1 GB/sec could be phenomenal on your machine and for your use case, while it might be abysmal for someone else.

You also didn't say what type the T is. If it's a Copy type, the compiler uses a specialisation that memcpy()s the bytes across directly, whereas if it has a destructor the copy has to go element by element so that panics are observed correctly.

You might have given us numbers for copying a Vec<String> while I might give you numbers for copying a Vec<u8>, then be surprised when there's an order of magnitude difference between us.

Otherwise, if you are just asking whether 1GB/sec is good enough, that totally depends on what you are trying to do and your expected throughput.


Have you enabled the bulk-memory target feature? It adds a native memory.copy wasm instruction rather than having to emulate the copy with i64.load + i64.store pairs, which is almost certainly much slower on large data sizes than the vectorized memcpy used natively and with the bulk-memory feature.


What are you running on?

The Mac M1 I'm using here is said to have a memory bandwidth of 200 GB/s, so reading from one place and writing to another could theoretically reach 100 GB/s. That makes 1 GB/s sound terrible.

How does your memcpy benchmark run as native code on the same machine?

sysbench memory --memory-block-size=1G --memory-total-size=20G --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1048576KiB
  total size: 20480MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 20 (    6.71 per second)

20480.00 MiB transferred (6872.06 MiB/sec)


General statistics:
    total time:                          2.9779s
    total number of events:              20

Latency (ms):
         min:                                  145.90
         avg:                                  148.88
         max:                                  154.59
         95th percentile:                      150.29
         sum:                                 2977.56

Threads fairness:
    events (avg/stddev):           20.0000/0.00
    execution time (avg/stddev):   2.9776/0.00

If I am reading this correctly, even on native I am only getting roughly 6.9 GB/s. 1 GB/s on wasm32 in Chrome doesn't sound that bad in comparison.

I'm using a ~5-year-old 4U server from eBay.

Your 100 GB/s claim on the M1 -- are you able to hit it in practice?

Not yet. Is it merely

RUSTFLAGS='-C target-feature=+bulk-memory' 

or do I need more magic?

No idea, never tried to measure it. My gut feeling is that it's unlikely. Marketing hype and all that.

I think that is enough, but you may need either -Zbuild-std or LTO to recompile the actual function that does the copying with bulk-memory enabled. It is also possible that this function is generic and thus already codegened in your crate rather than in the standard library, in which case you wouldn't need either option.
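For what it's worth, the -Zbuild-std variant would look roughly like this (a sketch, not tested against your crate; -Z build-std requires a nightly toolchain, and wasm32-unknown-unknown is assumed as the target):

```shell
# Enable bulk-memory for your own crate's codegen, and rebuild the
# standard library with the same flag so its copy routines get it too.
RUSTFLAGS='-C target-feature=+bulk-memory' \
  cargo +nightly build --release \
  --target wasm32-unknown-unknown \
  -Z build-std=std,panic_abort
```

Without -Z build-std, RUSTFLAGS only affects code generated for your crate; the precompiled std shipped with the toolchain keeps its original feature set.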

Can you share the source code of your benchmark?