3D rendering: GPU buffer allocation vs. safety boundary

The performance of the Rust 3D graphics stacks on complex scenes is rather poor compared to the better C++ libraries. This is a big headache for my Sharpview metaverse client. I need performance not much worse than Unreal Engine 3 had in 2005.

The general idea in modern rendering is that one thread is doing nothing but draw, draw, draw, and other threads are updating the scene. Even per-frame update work is done in another thread, running in parallel with rendering. That's how Unreal Engine does it. I do that in Sharpview. But, because the lower levels don't have as much parallelism as Vulkan allows, updating severely impacts rendering performance.

One of the design problems in the Rust graphics stack is where to draw the memory safety boundary. At what level is a safe interface offered?

Vulkan, and the Rust crate ash, offer an unsafe interface. Raw pointers and potential race conditions all over the place. The contents of the GPU can be updated concurrently with rendering. This is Vulkan's big performance edge over OpenGL - you can put multiple threads on multiple CPUs to work.

WGPU and Vulkano put a safe Rust interface on top of Vulkan. It's basically the same interface as Vulkan, but with more checking. This leads to some boundary problems.

GPU memory allocation with Vulkan is a lot like CPU memory allocation in an operating system. Vulkan gives you big blocks on request, and something like malloc is needed on top of that to allow allocating arbitrary sized pieces of GPU memory. So far, so good.
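To make that concrete, here's a minimal sketch of the kind of suballocator that has to sit on top of Vulkan's big blocks. It's purely illustrative, with made-up names: a first-fit free list tracking byte offsets within one large block, with coalescing on free. A real GPU allocator would also handle alignment and multiple memory types.

```rust
// Hypothetical first-fit suballocator over one large Vulkan-style
// memory block. Offsets stand in for GPU addresses.
struct SubAllocator {
    block_size: u64,
    // Free regions as (offset, size), kept sorted by offset.
    free_list: Vec<(u64, u64)>,
}

impl SubAllocator {
    fn new(block_size: u64) -> Self {
        Self { block_size, free_list: vec![(0, block_size)] }
    }

    /// First-fit allocation; returns an offset into the big block.
    fn alloc(&mut self, size: u64) -> Option<u64> {
        for i in 0..self.free_list.len() {
            let (off, avail) = self.free_list[i];
            if avail >= size {
                if avail == size {
                    self.free_list.remove(i);
                } else {
                    // Shrink the free region from the front.
                    self.free_list[i] = (off + size, avail - size);
                }
                return Some(off);
            }
        }
        None
    }

    /// Return a region to the free list, coalescing with neighbors.
    fn free(&mut self, offset: u64, size: u64) {
        let pos = self.free_list.partition_point(|&(o, _)| o < offset);
        self.free_list.insert(pos, (offset, size));
        // Coalesce with the following region, then the preceding one.
        if pos + 1 < self.free_list.len() {
            let (next_off, next_size) = self.free_list[pos + 1];
            if offset + size == next_off {
                self.free_list[pos].1 += next_size;
                self.free_list.remove(pos + 1);
            }
        }
        if pos > 0 {
            let (prev_off, prev_size) = self.free_list[pos - 1];
            if prev_off + prev_size == offset {
                self.free_list[pos - 1].1 += self.free_list[pos].1;
                self.free_list.remove(pos);
            }
        }
    }
}
```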

Trouble comes when bindless mode is used. In bindless mode, there's a big vector of descriptors that contains raw pointers to buffers in GPU memory. Normally, each texture is in its own buffer, and textures are identified by a subscript into the vector of descriptors. Performance is better in bindless mode because time is not wasted creating and destroying bind groups on every frame. Profiling shows much time going into binding. Unreal Engine has been mostly bindless for a decade now. Not having bindless support is a boat-anchor on rendering performance.

Buffers are owned by either the GPU or the CPU. You get to switch the mapping from the CPU, being careful not to do this while the GPU is looking at a buffer. Each entry in the descriptor table is also owned by either the GPU or the CPU. The GPU is busily looking at active entries in the descriptor table, while the CPU is busily allocating buffers and updating inactive entries in the descriptor table. No buffer can be dropped while the GPU is using it, even if the CPU is done with it. So drop has to be deferred to the end of the current rendering cycle.
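That deferred-drop rule can be modeled as a retirement queue keyed by frame number. This is an illustrative sketch, not code from any of the libraries discussed: the CPU retires a resource during some frame, and the resource is only really destroyed once the GPU reports it has finished that frame.

```rust
// Hypothetical deferred-destruction queue. Resources the CPU is done
// with are parked here until the GPU has finished the frame in which
// they were retired.
struct DeferredDropQueue<T> {
    // (frame the resource was retired in, resource)
    pending: Vec<(u64, T)>,
}

impl<T> DeferredDropQueue<T> {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// CPU is done with the resource; park it until the GPU catches up.
    fn retire(&mut self, current_frame: u64, resource: T) {
        self.pending.push((current_frame, resource));
    }

    /// Called once per frame with the last frame the GPU has completed.
    /// Everything retired at or before that frame is now safe to drop.
    fn collect(&mut self, gpu_completed_frame: u64) -> Vec<T> {
        let mut still_pending = Vec::new();
        let mut freed = Vec::new();
        for (frame, res) in std::mem::take(&mut self.pending) {
            if frame <= gpu_completed_frame {
                freed.push(res);
            } else {
                still_pending.push((frame, res));
            }
        }
        self.pending = still_pending;
        freed
    }
}
```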

This area is both safety critical and a potential source of locking bottlenecks.

WGPU does not support bindless mode. There's some support for it in Vulkano, but it's not clear if anybody uses it yet. This limits performance.

Replicating the Vulkan interface with safety may be the wrong place to put the safety boundary. In bindless mode, buffer allocation and descriptor table slot allocation have to be tightly coordinated. Since that involves raw pointer manipulation, it seems to belong inside the safety boundary. That is, it ought to be part of WGPU and Vulkano, not on top of them. If the lower levels have to recheck that data, it duplicates work and causes locking bottlenecks.

(I'm not sure about this, but apparently for web targets, WGPU has yet another level of checking on this which involves much re-copying. This may make bindless infeasible for web, at least for now.)

So it looks like GPU buffer allocation and descriptor table updating should move down a level, to allow bindless to work properly while keeping a safe interface. This might be simpler and have fewer locking conflicts.

Maybe an interface where you request a texture buffer and descriptor slot with one call, and you get back an opaque handle to the buffer that's just an index into the descriptor table. Dropping that handle drops both the buffer and the descriptor slot, but that drop is deferred until the end of the current rendering cycle, when the GPU is done and not looking at the descriptor table.
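Here's a rough sketch of what that interface might look like, with all names made up. One call allocates both the buffer and the descriptor slot; the handle is just the slot index, and dropping it queues the slot for reclamation at end of frame rather than freeing anything immediately. A channel carries the dropped indices back to the owner of the descriptor table.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical opaque handle: just an index into the bindless
// descriptor table. Drop defers the actual release.
struct TextureHandle {
    slot: u32,
    retire_tx: Sender<u32>,
}

impl Drop for TextureHandle {
    fn drop(&mut self) {
        // Ignore send errors: if the renderer is gone, so is the table.
        let _ = self.retire_tx.send(self.slot);
    }
}

struct DescriptorTable {
    free_slots: Vec<u32>,
    retire_tx: Sender<u32>,
    retire_rx: Receiver<u32>,
}

impl DescriptorTable {
    fn new(capacity: u32) -> Self {
        let (retire_tx, retire_rx) = channel();
        Self {
            free_slots: (0..capacity).rev().collect(),
            retire_tx,
            retire_rx,
        }
    }

    /// Allocate a buffer and a descriptor slot in one call.
    /// (Real code would also allocate GPU memory and write the
    /// descriptor here.)
    fn create_texture(&mut self) -> Option<TextureHandle> {
        let slot = self.free_slots.pop()?;
        Some(TextureHandle { slot, retire_tx: self.retire_tx.clone() })
    }

    /// Call at end of frame, once the GPU is done with the table.
    /// Returns the number of slots reclaimed.
    fn end_of_frame(&mut self) -> usize {
        let mut reclaimed = 0;
        while let Ok(slot) = self.retire_rx.try_recv() {
            self.free_slots.push(slot);
            reclaimed += 1;
        }
        reclaimed
    }
}
```

Because the handle owns the slot, Rust's ownership rules prevent use-after-free at the API level without per-call address validation.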

The larger problem, as I've discussed on Reddit, is that My First Renderer has now been implemented at least four times in Rust. These projects all share some of the same design problems. Because it's a My First Renderer, the developers don't know the design pitfalls which lead to slow performance. So they get something that works great on the simple demos, but is too slow for complex, changing scenes. This area needs guidance from someone who's been inside a good renderer, such as an experienced Unreal Engine internals expert.

Which I am not. Anybody with the right experience around? The Vulkano and WGPU projects need your help.

Comments?

5 Likes

This is close to a classic Rust problem - the ownership structure is not well designed. That usually manifests itself as:

  • A single compile error from the borrow checker that requires major rework to fix.
  • unsafe code used to work around ownership problems.
  • Poor performance because everything is Arc<Mutex<T>>, and there are lock conflicts and delays.

We all know those three. The last two show up in the graphics stack.

1 Like

I really would appreciate comments on this. I find myself having to dig into the lower depths of the Rust graphics stack, and I don't know the fine details of that sort of thing. If there's anybody in the Rust world familiar with, for example, how Unreal Engine deals with bindless, please get in touch. Thanks.

The only experience I have with wgpu has been extremely good. The kind of rendering I do doesn't have to recreate bindings every frame. That seems wasteful to me. Rhetorical question: what's the purpose of rebinding all the time if you're just going to reuse exactly the same data?

I did some scouting and found that you are using rend3 in your render-bench project [1]. And from what I've seen looking through rend3, it seems to do a lot of wasteful work like this. That might be important for very dynamic scenes that truly need to recreate everything on every frame. (I'm not sure what such a game would look like, abstract art?) But your render-bench project doesn't need to do that. It replaces buildings every few seconds, not every frame.

On the other hand, maybe I'm the weird one. I'd accept that reusing bindings is just a quirk that my projects have, and that no one is supposed to do that. Maybe the rend3 architecture is the right thing to do.

Take this with a grain of salt. I don't know anything about rend3 other than what you describe as rebinding expense and the fact that running your render-bench is 100% CPU bottlenecked on this call. Which if you follow through the call stack sure looks like it's rebuilding the whole render graph...


  1. Which I found in Wgpu22 23% slower than wgpu 0.20 on my render-bench. · Issue #6434 · gfx-rs/wgpu · GitHub ↩︎

1 Like

rebuilding the whole render graph...

Render-bench is a benchmark. It's a repeatable test to bring out the scaling problems hit by big-world games. It deliberately does not draw those identical buildings in the optimal way.

The real application is Sharpview, my metaverse client. Here's some video from that. Sharpview is constantly loading new textures and meshes into memory from a remote server as the player moves around a world that's the size of Los Angeles. The world is far too big to fit in the GPU all at once. Modern AAA titles can have over 100GB of content.

Wgpu is fine for small scenes. But there's a low ceiling. I've hit that ceiling.

1 Like

Starting work on a solution. See rust-vulkan-bindless. The README there summarizes the approach. The code is just stubs at this point.

1 Like

I did start a reply originally, then realized I don't really have any relevant insight that wouldn't just sound like "you're asking for too much", which both isn't what I really mean and is completely unhelpful.

If I have to give it a try: I think it's important to keep in mind that projects like WGPU don't really have direct equivalents in C++, and they have a lot of other things on their plate for their existing users, like tracking WebGPU (still in draft!). Being a target for high-performance rendering isn't really a goal:

wgpu is a safe and portable graphics library for Rust based on the WebGPU API. It is suitable for general purpose graphics and compute on the GPU.

Safe, portable, general.

I would have said that you're probably as well placed as anyone else in this space to start something, if that wouldn't have sounded really accusatory, so I'm glad to see you are giving it a stab!

2 Likes

I'm still collecting data and designing. Basic concepts for rust-vulkan-bindless:

  • Safe API.
  • Bindless is primary mode.
  • Includes buffer allocation and bindless descriptor table management.
  • API exposes opaque buffer handle types, not raw GPU addresses. Relies on Rust's ownership model for most of the checking, rather than trying to validate raw GPU addresses coming across the API. That validation is apparently a bottleneck in Vulkano, and maybe WGPU.
  • API will generally follow WGPU/Vulkano where possible.
  • Supports only Vulkan 1.2 or above via Ash.
  • Code borrowed from Rend3, Vulkano, and WGPU where possible. Minimize new code.
  • Adapt my rend3-hp, a fork of Rend3, to use it in bindless mode.
  • Have my Sharpview metaverse viewer use that and get its frame rate up.

Too early to know if this will work.

5 Likes

I just discovered that WGPU, unlike Vulkan, does have an allocator inside. It uses gpu-alloc, and notes indicate a proposed change to gpu-allocator. When you create a texture via WGPU's create_texture call, you get a sub-buffer. Which makes sense. Vulkan itself has a much rawer interface: you allocate "heaps" and must bring your own suballocator. Adding bindless to WGPU looks potentially possible, but because WebGPU won't have bindless before December 2026 (per a Google Chrome announcement), the WGPU people are holding off.

Vulkano has an allocator. But it's just a basic arena allocator - you allocate sequential blocks and have to release them all at once. That's a mismatch with what I need.
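For contrast, this is the arena (bump) allocation pattern described above, in sketch form with made-up names: sequential allocations, no per-allocation free, everything released at once with a reset. That model works for per-frame scratch data, but not for textures with independent lifetimes.

```rust
// Hypothetical bump/arena allocator: a single cursor advances through
// the block; individual allocations cannot be freed.
struct ArenaAllocator {
    size: u64,
    cursor: u64,
}

impl ArenaAllocator {
    fn new(size: u64) -> Self {
        Self { size, cursor: 0 }
    }

    /// Bump the cursor; returns the offset of the new allocation.
    fn alloc(&mut self, bytes: u64) -> Option<u64> {
        if self.cursor + bytes > self.size {
            return None;
        }
        let off = self.cursor;
        self.cursor += bytes;
        Some(off)
    }

    /// The only way to free: release everything at once.
    fn reset(&mut self) {
        self.cursor = 0;
    }
}
```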

Still researching a good solution that doesn't require re-implementing too much.

1 Like