CUDA vs WGSL program execution time

I wrote a trivial function that performs element-wise addition of two vectors in Rust, CUDA, and WGSL, and compared the run times of the three programs. Here is a link to the results and code. It turns out that the execution time of the WGSL solution is much higher than that of the CUDA one, so I have a few questions.

  • Why is CUDA so much faster than WGSL?
  • Is my code optimal? Can it be optimized?
  • Is it possible to download / upload / work with buffers faster?
  • Is it worth saving the compiled code to a file between runs?

I'm not familiar with WGSL, but from my very limited Vulkan experience, your WGSL kernel looks very suspicious to me:

In this line, if the workgroup annotation means the same thing as in a Vulkan compute shader, you are wasting most of your GPU's compute power:

And in the following snippets, why are you using loops in your compute kernel? Or is the WGSL dispatch model different from that of Vulkan compute shaders?

Finally, this line indicates you are using a single buffer for both input and output, which is different from your CUDA kernel. I'm not familiar with the cache characteristics, but GPU performance is very sensitive to cache efficiency because of the heterogeneous nature of the memory architecture.

Bottom line: if you want to compare the dispatch overhead of CUDA and wgpu, at least make sure the GPU kernels are doing the same (or at least comparable) work.
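For what it's worth, one way to keep such a comparison honest is to time each phase (upload, dispatch, readback) separately and to check both GPU results against the same CPU reference. Below is a minimal sketch in Rust; the helpers `timed` and `add_cpu` are hypothetical names of my own, not anything from the original code. Since GPU work is submitted asynchronously, the closure you time has to block until the GPU has actually finished, otherwise you only measure the submission.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed
/// wall-clock time. Hypothetical helper, not part of any GPU API.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

/// CPU reference for the element-wise addition. Both GPU versions
/// can be compared against this to confirm they do the same work.
pub fn add_cpu(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "input vectors must have equal length");
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}
```

You would wrap each phase in its own `timed(|| ...)` call and compare like with like, for example kernel time against kernel time rather than one program's kernel time against the other's total run time.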

Instead of looping, this code should operate on a single element of the array, with the array index calculated from the global_invocation_id. See the example compute shader declaration from the specification:

 @compute @workgroup_size(64)
 fn cs_main(
   @builtin(local_invocation_id) local_id: vec3<u32>,
   @builtin(local_invocation_index) local_index: u32,
   @builtin(global_invocation_id) global_id: vec3<u32>,
 ) {}
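
To make that concrete, here is a rough sketch of what a one-element-per-invocation kernel and its dispatch size could look like, with the shader embedded as a string in Rust the way wgpu programs often carry it. The binding layout and the names `a`, `b`, `result`, `SHADER`, `WORKGROUP_SIZE` and `workgroup_count` are my assumptions, not the original poster's code.

```rust
/// Invocations per workgroup; must match `@workgroup_size` in `SHADER`.
pub const WORKGROUP_SIZE: u32 = 64;

/// WGSL sketch (assumed layout): two read-only inputs and a separate
/// output buffer, one array element per invocation, no loop.
pub const SHADER: &str = r#"
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> result: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let i = global_id.x;
    // The last workgroup may extend past the end of the array.
    if (i < arrayLength(&result)) {
        result[i] = a[i] + b[i];
    }
}
"#;

/// Number of workgroups needed to cover `len` elements, rounded up.
/// Written without `len + size - 1` so it cannot overflow.
pub fn workgroup_count(len: u32, workgroup_size: u32) -> u32 {
    assert!(workgroup_size > 0, "workgroup size must be non-zero");
    len / workgroup_size + u32::from(len % workgroup_size != 0)
}
```

For 1,000,000 elements, `workgroup_count(1_000_000, WORKGROUP_SIZE)` gives 15625 workgroups to dispatch in x, and the bounds check covers sizes that are not a multiple of 64. Keep in mind that WebGPU's default limit is 65535 workgroups per dimension, so much larger arrays would need a 2D dispatch or several elements per invocation.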

Thanks for the information. It helped me a lot!