The performance of the Rust 3D graphics stacks for complex scenes is rather poor compared to the better C++ libraries. This is a big headache for my Sharpview metaverse client. I need performance not much worse than what Unreal Engine 3 delivered in 2005.
The general idea in modern rendering is that one thread does nothing but draw, draw, draw, while other threads update the scene. Even per-frame update work is done in another thread, running in parallel with rendering. That's how Unreal Engine does it, and I do the same in Sharpview. But because the lower levels of the Rust stack don't expose as much parallelism as Vulkan allows, updating severely impacts rendering performance.
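Schematically, and this is just a minimal sketch rather than Sharpview's actual code (the names `Scene`, `SceneUpdate`, `compute_updates`, and `draw` are placeholders), the split looks something like this: an update thread computes per-frame changes and hands them to the render loop over a channel, and the render loop applies whatever has arrived between draws.

```rust
use std::sync::mpsc;
use std::thread;

struct Scene;        // meshes, materials, transforms, ...
struct SceneUpdate;  // a batch of changes computed this frame

impl Scene {
    fn apply(&mut self, _update: SceneUpdate) { /* patch scene / GPU-side state */ }
    fn draw(&self) { /* record and submit draw calls */ }
}

fn compute_updates() -> SceneUpdate {
    // Physics, animation, content streaming; paced per frame in a real engine.
    SceneUpdate
}

fn main() {
    let (tx, rx) = mpsc::channel::<SceneUpdate>();

    // Update thread: runs in parallel with rendering.
    let _updater = thread::spawn(move || loop {
        if tx.send(compute_updates()).is_err() {
            break; // render side went away
        }
    });

    // Render thread: does nothing but draw, draw, draw.
    let mut scene = Scene;
    loop {
        for update in rx.try_iter() {
            scene.apply(update); // apply whatever arrived, without blocking
        }
        scene.draw();
    }
}
```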
One of the design problems in the Rust graphics stack is where to draw the memory safety boundary. At what level is a safe interface offered?
Vulkan, and the Rust crate ash, offer an unsafe interface. Raw pointers and potential race conditions all over the place. The contents of the GPU can be updated concurrently with rendering. This is Vulkan's big performance edge over OpenGL - you can put multiple threads on multiple CPUs to work.
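To give a flavor of that raw layer, here is a rough sketch using ash (error handling and memory-type selection are glossed over; this is illustrative, not production code). Every call is `unsafe`, `map_memory` hands back a raw pointer, and nothing in the types stops one thread from writing through that pointer while the GPU is reading the buffer.

```rust
use ash::vk;

/// Create a buffer, back it with memory, and map it.
/// `memory_type_index` is assumed to pick a HOST_VISIBLE memory type.
unsafe fn make_mapped_buffer(
    device: &ash::Device,
    size: vk::DeviceSize,
    memory_type_index: u32,
) -> (vk::Buffer, vk::DeviceMemory, *mut std::ffi::c_void) {
    // Create the buffer object itself.
    let buffer = device
        .create_buffer(
            &vk::BufferCreateInfo {
                size,
                usage: vk::BufferUsageFlags::STORAGE_BUFFER | vk::BufferUsageFlags::TRANSFER_DST,
                sharing_mode: vk::SharingMode::EXCLUSIVE,
                ..Default::default()
            },
            None,
        )
        .unwrap();

    // Back it with a block of device memory and bind it.
    let requirements = device.get_buffer_memory_requirements(buffer);
    let memory = device
        .allocate_memory(
            &vk::MemoryAllocateInfo {
                allocation_size: requirements.size,
                memory_type_index,
                ..Default::default()
            },
            None,
        )
        .unwrap();
    device.bind_buffer_memory(buffer, memory, 0).unwrap();

    // A raw pointer into GPU-visible memory. Keeping CPU writes through this
    // pointer from overlapping with GPU reads is entirely the caller's problem.
    let ptr = device
        .map_memory(memory, 0, size, vk::MemoryMapFlags::empty())
        .unwrap();
    (buffer, memory, ptr)
}
```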
WGPU and Vulkano put a safe Rust interface on top of Vulkan. It's basically the same interface as Vulkan, but with more checking. This leads to some boundary problems.
GPU memory allocation with Vulkan is a lot like CPU memory allocation in an operating system. Vulkan gives you big blocks on request, and something like malloc is needed on top of that to allow allocating arbitrary sized pieces of GPU memory. So far, so good.
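As a toy illustration of that malloc-like layer (purely a sketch; real allocators also have to deal with memory types, alignment rules, freeing, and fragmentation), think of it as handing out offsets within a few big blocks obtained from vkAllocateMemory:

```rust
/// One big block obtained from Vulkan. In real code this would hold a
/// vk::DeviceMemory handle; here it's just bookkeeping.
struct GpuBlock {
    capacity: u64,
    next_free: u64,
}

/// A piece of a block, identified by block index, offset, and size.
struct GpuSlice {
    block_index: usize,
    offset: u64,
    size: u64,
}

struct GpuAllocator {
    blocks: Vec<GpuBlock>,
    block_size: u64, // size requested from Vulkan for each new block
}

impl GpuAllocator {
    /// Bump-allocate a slice. Assumes align > 0 and size <= block_size.
    fn alloc(&mut self, size: u64, align: u64) -> GpuSlice {
        for (i, block) in self.blocks.iter_mut().enumerate() {
            let offset = (block.next_free + align - 1) / align * align;
            if offset + size <= block.capacity {
                block.next_free = offset + size;
                return GpuSlice { block_index: i, offset, size };
            }
        }
        // No room: ask Vulkan for another big block (vkAllocateMemory).
        self.blocks.push(GpuBlock { capacity: self.block_size, next_free: size });
        GpuSlice { block_index: self.blocks.len() - 1, offset: 0, size }
    }
}
```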
Trouble comes when bindless mode is used. In bindless mode, there's a big vector of descriptors that contains raw pointers to buffers in GPU memory. Normally, each texture is in its own buffer, and textures are identified by a subscript into the vector of descriptors. Performance is better in bindless mode because time is not wasted creating and destroying bind groups on every frame. Profiling shows much time going into binding. Unreal Engine has been mostly bindless for a decade now. Not having bindless support is a boat-anchor on rendering performance.
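Schematically, the descriptor table is just a long-lived array of slots, and a texture is identified by its index into that array; slots are written once when a texture is created, instead of rebuilding bind groups every frame. A toy sketch, not any particular crate's API:

```rust
/// What draw calls and shaders see: just an index into the descriptor table.
#[derive(Clone, Copy)]
struct TextureIndex(u32);

enum Slot {
    Free,
    InUse, // in real code, holds the descriptor pointing at a GPU image/buffer
}

struct DescriptorTable {
    slots: Vec<Slot>,
    free_list: Vec<u32>,
}

impl DescriptorTable {
    /// Claim a slot; the actual descriptor write (vkUpdateDescriptorSets)
    /// would happen alongside this.
    fn allocate(&mut self) -> TextureIndex {
        let index = match self.free_list.pop() {
            Some(i) => {
                self.slots[i as usize] = Slot::InUse;
                i
            }
            None => {
                self.slots.push(Slot::InUse);
                (self.slots.len() - 1) as u32
            }
        };
        TextureIndex(index)
    }

    /// Return a slot to the free list once the GPU can no longer reference it.
    fn release(&mut self, handle: TextureIndex) {
        self.slots[handle.0 as usize] = Slot::Free;
        self.free_list.push(handle.0);
    }
}
```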
Buffers are owned by either the GPU or the CPU. You get to switch the mapping from the CPU, being careful not to do this while the GPU is looking at a buffer. Each entry in the descriptor table is also owned by either the GPU or the CPU. The GPU is busily looking at active entries in the descriptor table, while the CPU is busily allocating buffers and updating inactive entries in the descriptor table. No buffer can be dropped while the GPU is using it, even if the CPU is done with it. So drop has to be deferred to the end of the current rendering cycle.
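That deferred drop amounts to a retirement queue. Here's a sketch with placeholder types; a real implementation would key the collection off fences or per-frame semaphores rather than a bare frame counter:

```rust
/// Stands in for a buffer plus its memory and descriptor slot.
struct GpuBuffer;

struct RetirementQueue {
    pending: Vec<(u64, GpuBuffer)>, // (frame when the CPU was done with it, resource)
}

impl RetirementQueue {
    /// The CPU is done with the buffer, but the GPU may still be using it:
    /// queue it instead of freeing it now.
    fn retire(&mut self, frame: u64, buffer: GpuBuffer) {
        self.pending.push((frame, buffer));
    }

    /// Call once the GPU is known to have completed `completed_frame`
    /// (e.g. after waiting on that frame's fence).
    fn collect(&mut self, completed_frame: u64) {
        // Entries retired on or before the completed frame are dropped here,
        // after the GPU is done looking at them.
        self.pending.retain(|(frame, _)| *frame > completed_frame);
    }
}
```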
This area is both safety critical and a potential source of locking bottlenecks.
WGPU does not support bindless mode. There's some support for it in Vulkano, but it's not clear if anybody uses it yet. This limits performance.
Replicating the Vulkan interface, but with safety checks added, may put the safety boundary in the wrong place. In bindless mode, buffer allocation and descriptor table slot allocation have to be tightly coordinated. Since that involves raw pointer manipulation, it seems to belong inside the safety boundary. That is, it ought to be part of WGPU and Vulkano, not built on top of them. If the lower levels have to re-check that data, the checking duplicates work and creates locking bottlenecks.
(I'm not sure about this, but apparently for web targets, WGPU has yet another level of checking on this which involves much re-copying. This may make bindless infeasible for web, at least for now.)
So it looks like GPU buffer allocation and descriptor table updating should move down a level, to allow bindless to work properly while keeping a safe interface. This might be simpler and have fewer locking conflicts.
Maybe the right interface is one where you request a texture buffer and descriptor slot with one call, and get back an opaque handle to the buffer that's just an index into the descriptor table. Dropping that handle drops both the buffer and the descriptor slot, but that drop is deferred until the end of the current rendering cycle, when the GPU is done and not looking at the descriptor table.
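Roughly, such an interface might look like the sketch below. All the names are hypothetical and the actual Vulkan allocation and descriptor writes are elided; the point is that one call hands back a handle that is just a descriptor index, and dropping it queues both the buffer and the slot for release once the GPU finishes the current frame.

```rust
use std::sync::{Arc, Mutex};

struct TextureDesc {
    width: u32,
    height: u32, // plus format, mip count, etc.
}

struct AllocatorState {
    current_frame: u64,
    retired: Vec<(u64, u32)>, // (frame retired, descriptor index) awaiting release
}

/// The safe allocator that would live inside the graphics crate.
struct Allocator {
    inner: Arc<Mutex<AllocatorState>>,
}

/// Opaque handle: draw calls and shaders see only the descriptor index.
struct TextureHandle {
    index: u32,
    owner: Arc<Mutex<AllocatorState>>,
}

impl Allocator {
    /// Allocate the GPU buffer and its descriptor slot together, in one call.
    fn create_texture(&self, _desc: &TextureDesc) -> TextureHandle {
        let index = 0; // would come from the combined buffer/slot allocator
        TextureHandle { index, owner: Arc::clone(&self.inner) }
    }
}

impl Drop for TextureHandle {
    fn drop(&mut self) {
        // Defer the real release: record the slot for cleanup once the GPU
        // has finished the frame that may still reference it.
        let mut state = self.owner.lock().unwrap();
        let frame = state.current_frame;
        state.retired.push((frame, self.index));
    }
}
```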
The larger problem, as I've discussed on Reddit, is that My First Renderer has now been implemented at least four times in Rust. These projects all share some of the same design problems. Because each is a My First Renderer, the developers don't know the design pitfalls that lead to slow performance. So they get something that works great on simple demos but is too slow for complex, changing scenes. This area needs guidance from someone who's been inside a good renderer, such as an experienced Unreal Engine internals expert.
Which I am not. Anybody with the right experience around? The Vulkano and WGPU projects need your help.
Comments?