Rust (WebAssembly) is slower than JavaScript

We are trying to speed up JavaScript code using Rust (WebAssembly). The code is as follows, but it is noticeably slower than the equivalent JavaScript code. I had hoped that Rust (WebAssembly) would be faster, but am I missing something (e.g. compiler options)?

#[wasm_bindgen]
pub unsafe fn subtract(volume: JsValue, lathe: JsValue) {
    let a: Volume = serde_wasm_bindgen::from_value(volume).unwrap();
    let b: Lathe = serde_wasm_bindgen::from_value(lathe).unwrap();

    // Transformation that maps points from b's local space into a's local space.
    let matrix_b_to_a = a.matrix.inverted().multiply(&b.matrix);

    // b's bounding box, expanded by the threshold.
    let x0 = b.bbox.min.x - b.threshold;
    let x1 = b.bbox.max.x + b.threshold;
    let y0 = b.bbox.min.y - b.threshold;
    let y1 = b.bbox.max.y + b.threshold;
    let z0 = b.bbox.min.z - b.threshold;
    let z1 = b.bbox.max.z + b.threshold;

    // The eight corners of the expanded box, transformed into a's space.
    let p0 = matrix_b_to_a.multiply_point(&Point3::new(x0, y0, z0));
    let p1 = matrix_b_to_a.multiply_point(&Point3::new(x1, y0, z0));
    let p2 = matrix_b_to_a.multiply_point(&Point3::new(x0, y1, z0));
    let p3 = matrix_b_to_a.multiply_point(&Point3::new(x1, y1, z0));
    let p4 = matrix_b_to_a.multiply_point(&Point3::new(x0, y0, z1));
    let p5 = matrix_b_to_a.multiply_point(&Point3::new(x1, y0, z1));
    let p6 = matrix_b_to_a.multiply_point(&Point3::new(x0, y1, z1));
    let p7 = matrix_b_to_a.multiply_point(&Point3::new(x1, y1, z1));

    // Axis-aligned bounds of the transformed corners.
    let min_x = min(&vec![p0.x, p1.x, p2.x, p3.x, p4.x, p5.x, p6.x, p7.x]);
    let max_x = max(&vec![p0.x, p1.x, p2.x, p3.x, p4.x, p5.x, p6.x, p7.x]);
    let min_y = min(&vec![p0.y, p1.y, p2.y, p3.y, p4.y, p5.y, p6.y, p7.y]);
    let max_y = max(&vec![p0.y, p1.y, p2.y, p3.y, p4.y, p5.y, p6.y, p7.y]);
    let min_z = min(&vec![p0.z, p1.z, p2.z, p3.z, p4.z, p5.z, p6.z, p7.z]);
    let max_z = max(&vec![p0.z, p1.z, p2.z, p3.z, p4.z, p5.z, p6.z, p7.z]);
    let (min_x_index, min_y_index, min_z_index) = a.get_index(&Point3::new(min_x, min_y, min_z));
    let (max_x_index, max_y_index, max_z_index) = a.get_index(&Point3::new(max_x, max_y, max_z));

    let min_x_index = clamp(min_x_index, 0, (a.resolution.x - 1) as i32) as usize;
    let max_x_index = clamp(max_x_index, 0, (a.resolution.x - 1) as i32) as usize;
    let min_y_index = clamp(min_y_index, 0, (a.resolution.y - 1) as i32) as usize;
    let max_y_index = clamp(max_y_index, 0, (a.resolution.y - 1) as i32) as usize;
    let min_z_index = clamp(min_z_index, 0, (a.resolution.z - 1) as i32) as usize;
    let max_z_index = clamp(max_z_index, 0, (a.resolution.z - 1) as i32) as usize;

    // Borrow the voxel value buffers of both volumes: a mutably, b read-only.
    let manager = BufferManager::lock();
    let arc_a = manager.get(a.values_id).unwrap();
    let arc_b = manager.get(b.values_id).unwrap();
    let mut mutex_a = arc_a.lock().unwrap();
    let mutex_b = arc_b.lock().unwrap();
    let values_a = mutex_a.slice_mut::<u8>();
    let values_b = mutex_b.slice::<u8>();

    let matrix_a_to_b = b.matrix.inverted().multiply(&a.matrix);

    let resolution_u = b.resolution.x;
    let resolution_v = b.resolution.y;

    // For every voxel of a inside the clamped index range, sample b's 2D value
    // grid (bilinear interpolation in (u, v)) and keep the smaller value.
    for z_index in min_z_index .. max_z_index + 1 {
        for y_index in min_y_index .. max_y_index + 1 {
            for x_index in min_x_index .. max_x_index + 1 {
                let index = x_index + y_index * a.resolution.x + z_index * a.resolution.x * a.resolution.y;

                let value_a = values_a[index];
                if value_a == 0 {
                    continue;
                }

                let x = a.origin.x + (x_index as f64) * a.cell_size;
                let y = a.origin.y + (y_index as f64) * a.cell_size;
                let z = a.origin.z + (z_index as f64) * a.cell_size;
                let p = matrix_a_to_b.multiply_point(&Point3::new(x, y, z));

                let squared_u = p.x * p.x + p.y * p.y;
                if b.max_u * b.max_u < squared_u {
                    continue;
                }

                let u = squared_u.sqrt() - b.origin.x;
                let v = p.z - b.origin.y;
                let u_index = (u / b.cell_size) as usize;
                let v_index = (v / b.cell_size) as usize;
                if v < 0.0 || resolution_u <= u_index || resolution_v <= v_index {
                    continue;
                }

                // Bilinearly interpolate b's values at (u, v); columns past the
                // right edge are treated as 0.
                let v_index0 = (v_index + 0) * resolution_u;
                let v_index1 = (v_index + 1) * resolution_u;

                let c00 = values_b[u_index + v_index0] as f64;
                let c01 = values_b[u_index + v_index1] as f64;
                let c10: f64;
                let c11: f64;
                if u_index + 1 < resolution_u {
                    c10 = values_b[u_index + 1 + v_index0] as f64;
                    c11 = values_b[u_index + 1 + v_index1] as f64;
                } else {
                    c10 = 0.0;
                    c11 = 0.0;
                }

                let u_ratio = u / b.cell_size - u_index as f64;
                let v_ratio = v / b.cell_size - v_index as f64;
                let c0 = c00 + (c10 - c00) * u_ratio;
                let c1 = c01 + (c11 - c01) * u_ratio;
                let c = c0 + (c1 - c0) * v_ratio;
                let value_b = (255.0 - c) as u8;

                if value_b < value_a {
                    values_a[index] = value_b;
                }
            }
        }
    }
}
  • Make sure you're compiling with optimizations; if you are using wasm-pack then this is the default; otherwise run cargo build --release and use its output in target/release/, not target/debug/. (A small Cargo.toml sketch of the relevant release-profile settings follows this list.)

  • Try profiling the code (you can do this from the dev tools Performance tab in either Chrome or Firefox) to see what is actually slow.

  • The parts I'd suspect of being troublesome are the parts where the code you've shown interacts with other things:

    • serde_wasm_bindgen
    • whatever BufferManager is

    You may also be providing too small a problem per call for Rust to provide a noticeable benefit. In general, the most inefficient part is likely to be the interaction between Rust and JS (due to the overhead of copying/serializing data, and also because there can be no cross-language optimization there), so when performance matters you want to minimize the number of calls across that boundary.
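
Regarding the first point, here is a sketch of the Cargo.toml release-profile settings worth double-checking. The values below are common suggestions for squeezing more out of a release build, not a guaranteed fix for this case:

    [profile.release]
    opt-level = 3       # full optimizations (already the default for release builds)
    lto = true          # enable link-time optimization across crates
    codegen-units = 1   # slower compile, but gives the optimizer more room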

3 Likes

Thanks for the reply.

  • I compiled with the following settings (this is part of package.json). I believe it is being optimized correctly.
    "scripts": {
        "build:wasm": "cargo build --target wasm32-unknown-unknown --release",
        "postbuild:wasm": "wasm-bindgen target/wasm32-unknown-unknown/release/wasm.wasm --out-dir src/wasm",
        "build:js": "webpack --mode development",
        "build": "run-s build:wasm build:js",
        "test": "jest"
    },
  • I tried the browser profiler but could not find any details. If I comment out the triple loop in the above code, the process finishes almost instantly, so I believe this part is the bottleneck.

  • There is certainly some overhead, such as from serde_wasm_bindgen, but even after subtracting that, Rust is still slower.

What order of magnitude are the numbers you're looking at?

To measure the inherent overhead, compare against a wasm function with the same input and output shape that does no actual work. The JS-wasm boundary has high inherent overhead.
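
For example, a do-nothing counterpart to your subtract could look like the sketch below. It assumes the same Volume/Lathe types and serde_wasm_bindgen deserialization as your real function; subtract_noop is a made-up name. Time it with the same calling pattern from JS and compare against the real function.

use wasm_bindgen::prelude::*;

// Hypothetical no-op with the same boundary shape as `subtract`: it crosses the
// JS->wasm boundary and deserializes the arguments, but does no real work, so it
// isolates the call and serialization overhead.
#[wasm_bindgen]
pub fn subtract_noop(volume: JsValue, lathe: JsValue) {
    let _volume: Volume = serde_wasm_bindgen::from_value(volume).unwrap();
    let _lathe: Lathe = serde_wasm_bindgen::from_value(lathe).unwrap();
    // intentionally empty
}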

1 Like

I am calling the above function 29,328 times from JavaScript and measuring the total time, so the overhead estimates may indeed be inaccurate.

I will implement the higher-level process in Rust (WebAssembly) and make one call from JavaScript.

Does the above strategy have any chance of winning? I still find it hard to believe that Rust is slower than JavaScript.

Yeah, that loop is the mistake.

What you need to do in your application is collect up all the work you want the Rust code to do (or perhaps "as much work as we can complete in 50 ms"), and make a single call into the WASM module to complete all of that work.
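
A sketch of what that could look like here, assuming the volume/lathe pairs can be serialized as arrays and the body of your original subtract is factored into a helper that works on already-deserialized values. subtract_batch and subtract_one are made-up names, and how the pairs are produced is up to your application:

use wasm_bindgen::prelude::*;

// Hypothetical batched entry point: one JS->wasm call that performs all of the
// subtractions, instead of one call (and one deserialization) per pair.
#[wasm_bindgen]
pub fn subtract_batch(volumes: JsValue, lathes: JsValue) {
    let volumes: Vec<Volume> = serde_wasm_bindgen::from_value(volumes).unwrap();
    let lathes: Vec<Lathe> = serde_wasm_bindgen::from_value(lathes).unwrap();
    for (volume, lathe) in volumes.iter().zip(lathes.iter()) {
        // `subtract_one` would contain the body of the original `subtract`,
        // minus the JsValue deserialization.
        subtract_one(volume, lathe);
    }
}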

2 Likes

I implemented the higher-level processing in Rust (WebAssembly) and modified it so that it is called once from JavaScript. Now the call overhead between JavaScript and Rust should be negligibly small, but again, Rust (WebAssembly) is slower than JavaScript. :smiling_face_with_tear:
The measurement results are as follows.

[JavaScript] 3.0827000000476836 s
[Wasm]       3.824100000143051 s

It might be that this simply is not a good use case for a small amount of Rust. JavaScript JIT VMs are able to optimize code that is just doing arithmetic into machine code very similar to what Rust's compiler would produce.

Rust might provide more benefit when you have code that is working with more complex data structures and algorithms, where it's relevant that Rust can more reliably avoid heap-allocation or avoid dynamic dispatch.

2 Likes

Yes. I have absolutely no idea about the details, but as far as I have read, modern JS runtimes will "Just-In-Time" (JIT) compile the code into fairly optimal native code for execution.

So as long as a function always takes integer, or at least "number", parameters and crunches on them, it will end up running as native code that handles ints/floats as efficiently as any ahead-of-time compiled language.

But JS is dynamically typed, so if your program ever calls that function with a string instead, then all the fast optimised code gets thrown away and you are back to some kind of simpler, less optimal interpretation.

That is the idea behind, for example, compiling C source to so-called asm.js.

All in all it's not clear why Rust compiled to wasm should be faster.

I think one overlooked aspect in this discussion is the predictability (or conversely, the variability) of the performance, rather than pure performance itself.

JavaScript is a garbage-collected language, and it always will be. Rust on WASM doesn't have a GC, and thus there should be a distinct lack of GC-related stop-the-world freezes.
From that I infer that it would be most advantageous when dealing with large numbers of objects/values of varying sizes (i.e. where reusability of a memory chunk of a given size is limited) that are constantly created and destroyed.

Measuring this accurately would be really difficult, though. Meltdown/Spectre and the subsequent neutering of timer resolution in browser-executed JS made sure of that; IIRC the resolution is now pegged to the grand granularity of 1 ms. That is pretty pointless for any sort of fine-grained performance measurement one might actually care about, especially considering that lots of things operate down in the single-digit microsecond range, or even in the nanosecond range.

I can't even reliably measure the performance of an AST-walking interpreter I use (not exactly a shining beacon of performance, architecturally speaking) in a web browser anymore because of that.

1 Like

You can get back quite high-precision timers with the right headers: High precision timing - Web APIs | MDN
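
For reference, the setup in question is cross-origin isolation, which the server opts into with this pair of response headers (after which performance.now() gets a finer resolution again):

    Cross-Origin-Opener-Policy: same-origin
    Cross-Origin-Embedder-Policy: require-corp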

Or you could profile in Node/Deno, if the code is sufficiently generic.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.