Help me solve this problem I have with Vulkan

As some of you may know, I'm currently writing a safe wrapper around Vulkan. I've already solved a lot of problems, but I've been stuck on something for a few days now. I'm posting this here because I'm starting to get desperate, and I'd like to hear suggestions from people.

Description of the problem

Vulkan is a low-level API for controlling the GPU. Unlike older graphics APIs, Vulkan gives you lots of responsibilities. One of these responsibilities is handling memory safety.

Vulkan has this concept of queues on the GPU to which you submit commands. A queue is very similar to a CPU thread, and the same challenges arise. Take this for example:

We submit some tasks to queues 1 and 2. One of the tasks of queue 2 writes to a buffer, and one of the tasks of queue 1 reads from the same buffer. If the two command executions happen to overlap, you get a data race.

To prevent this, Vulkan provides semaphores:

Now the memory safety problem is solved. Doing this with the Vulkan API takes three steps:

  1. Create a semaphore object.
  2. When you submit the command that writes to the buffer, you must pass the semaphore and ask the implementation to signal it at the end of the task.
  3. When you submit the command that reads from the buffer, you must pass the semaphore and ask the implementation to wait for it to be signaled before starting.
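
For concreteness, here's a minimal sketch of those three steps in Rust. Every type and function here (Device, Queue, CommandBuffer, Semaphore, submit) is a hypothetical thin wrapper, not a real API; the wait/signal slices mirror the pWaitSemaphores/pSignalSemaphores arrays of VkSubmitInfo:

// All types here are hypothetical placeholders, not a real API.
fn synchronized_write_then_read(
    device: &Device,
    writer: &Queue, // e.g. queue 2
    reader: &Queue, // e.g. queue 1
    write_cmd: CommandBuffer,
    read_cmd: CommandBuffer,
) {
    // 1. Create a semaphore object.
    let sem = Semaphore::new(device);

    // 2. Submit the write, asking the implementation to signal `sem` at the end.
    writer.submit(write_cmd, /* wait: */ &[], /* signal: */ &[&sem]);

    // 3. Submit the read, asking the implementation to wait on `sem` before starting.
    reader.submit(read_cmd, /* wait: */ &[&sem], /* signal: */ &[]);
}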

Furthermore, there are a few constraints:

  • You can't signal a semaphore once and wait upon it multiple times. One signal = one wait.
  • You should minimize the number of semaphores that are created, because creating them can be expensive. Semaphores can be reused once they have been waited upon.
  • Even if the write and the read happen on the same queue, a semaphore must still be used. This is because the implementation allows commands to overlap.

The problem I have is: how should semaphores be handled in my Vulkan wrapper? This may not look difficult, but it is. Keep reading to see why.

Idea #1: lock each resource exclusively every time

The idea is to assign a semaphore to each resource. Whenever a task uses the resource, it must wait on the corresponding semaphore at the beginning (except the first time the resource is used) and signal it at the end.
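
As a sketch, with the same hypothetical types as above (a real implementation would also have to recycle the semaphores rather than create one per task):

struct Resource {
    sem: Option<Semaphore>, // None until the resource is used for the first time
}

fn submit_task_using(device: &Device, queue: &Queue, cmd: CommandBuffer, res: &mut Resource) {
    let signal = Semaphore::new(device); // signaled at the end of this task
    match res.sem.take() {
        // Wait for the previous task that touched this resource...
        Some(prev) => queue.submit(cmd, &[&prev], &[&signal]),
        // ...except the first time the resource is used.
        None => queue.submit(cmd, &[], &[&signal]),
    }
    res.sem = Some(signal);
}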

This should work, but the problem is this:

Just like a Mutex compared to a RwLock, we waste a lot of time serializing accesses to resources that could safely be read concurrently, which happens often in graphics programming.

Idea #2: since a mutex-like isn't good, let's try a rwlock-like

For each resource, keep track of whether it is currently accessed mutably or immutably, just like a RwLock on the CPU.

However, this approach doesn't work in practice, because you have to allocate the semaphores and signal them when you submit the write command. This means that at the time you submit the write command, you already have to know the number of read commands that are going to be executed. That is really not practical.

Another approach would be to always create several semaphores and only use some of them when needed. But as stated in the constraints, we should avoid over-allocating semaphores. And we need more than one semaphore per queue, because each read needs to signal a different semaphore in case it is followed by a write.

Idea #3: specialize what happens for each resource

Most of the time, resources follow a few patterns.

For example, a buffer that contains a 3D model is usually modified only once and then only read. This means that we can just create one semaphore per queue, signal them all in the write, and then have the first read from each queue wait on the appropriate semaphore.
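
Sketched with the same placeholder types as above, this pattern could be as simple as:

struct WriteOnceBuffer {
    // One semaphore per queue, all signaled by the single write that fills the buffer.
    per_queue: Vec<Option<Semaphore>>,
}

impl WriteOnceBuffer {
    // The first read on each queue takes that queue's semaphore and waits on it;
    // every later read on the same queue doesn't need to wait at all.
    fn wait_semaphore_for_read(&mut self, queue_index: usize) -> Option<Semaphore> {
        self.per_queue[queue_index].take()
    }
}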

It could be argued that the first two solutions don't work well because they are too generic. By using a dedicated algorithm for each pattern, we can do things correctly.

However, this collides with the way an application is usually organized. The fact that the code that creates resources needs to know in advance exactly how each resource is going to be used makes things really difficult to organize. For example, you can no longer use a pool of textures, since you would have to know how each individual texture is going to be used at the time it is created.

Idea #4: let the library user handle this and add checks

One thing that would work for sure is to let the user manually create, signal and wait upon semaphores. But since the library has to be safe to use, correct semaphore usage has to be checked.

The problem is that these checks would add overhead, and not just any overhead: overhead that duplicates the information the user is already giving us.

Furthermore, this makes the API impractical. It would also tie different parts of the application together, because, for example, the code that submits the write command needs to know how the resource is going to be used afterwards.

Handling synchronization automatically really brings a lot of benefits compared to manual handling.

Idea #5: leave that part of the API unsafe

Same as the previous idea, but without checking that the usage is correct. This is obviously the C++ approach.

Handling synchronization is a major point of the Vulkan API, and if this part is left unsafe you might as well leave the entire library unsafe. For example the entire design of command buffers must be adapted to these synchronization issues. In the end only very few functions would be safe.

Idea #6: check the online literature about this topic

Since Vulkan was released only a month ago, there are no good resources about it yet. As far as I know, only three game developers have ported their games to Vulkan so far, and all three said that they haven't yet optimized their engines for next-gen APIs because they still need to support older ones. For example, the developers of The Talos Principle report a framerate 30% lower than with older APIs.

Conclusion

As I said in the opening, I really don't know what to do. Please share your remarks and suggestions!


Hm, I don't quite understand how this is supposed to work in the case of a single writer and multiple readers.

Suppose we have one write command and three read commands for a single buffer. How many semaphores are needed in this case? And how should they be used?

This

When you submit the command that writes to the buffer, you must pass the semaphore and ask the implementation to signal it at the end of the task.

and

You can't signal a semaphore once and wait upon it multiple times. One signal = one wait.

seem to make it impossible to synchronize a writer with several readers. I must have misunderstood something. Can we submit several semaphores with a single write command?

Yes! You can signal multiple semaphores and wait on multiple semaphores.

With one write followed by three reads, if the three reads are on three different queues you need three semaphores. If multiple reads are on the same queue you can "merge" them. For example if the three reads are on the same queue you only need one semaphore.

That's what I was talking about at the end of idea #2. You might think that you only need one semaphore per possible queue, but in the "multiple reads followed by a write" situation you need more than that.

Can you elaborate on this a bit more? It looks like if you use one semaphore per queue in this situation, then reads and writes will be isolated, but some reads will happen after the write. It is not obvious that this behavior is always wrong.

Are semaphores the only primitive available for synchronization? This particular situation sounds a lot like a barrier, and the Vulkan spec does include the word "barrier".

Imagine that you're doing Write -> Read.

For the moment you only need one semaphore. But what is the next operation going to be? If the next operation is a read (Write -> Read -> Read), then you don't need to create a new semaphore and you only need one in total. If the next operation is a write (Write -> Read -> Write), then you need a second semaphore to ensure that the read is over when the write starts.

Since you can't predict the future, the only solution here is to always create a semaphore at each read operation. If the next operation is a read, ignore the semaphore. If the next operation is a write, use it.
If you have, for example, Write -> Read -> Read -> Read -> Write, you have to create 4 semaphores even though only 2 would be needed in theory.
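
To make the counting concrete, here is that rule as plain Rust (this only models the bookkeeping; no Vulkan types are involved):

#[derive(Clone, Copy, PartialEq)]
enum Op { Write, Read }

// Count the semaphores allocated by the "always signal at each read" rule.
fn semaphores_created(ops: &[Op]) -> usize {
    (0..ops.len())
        .filter(|&i| match ops[i] {
            // A write signals one semaphore if a read follows it.
            Op::Write => ops.get(i + 1) == Some(&Op::Read),
            // Every read signals one, just in case the next operation is a write.
            Op::Read => true,
        })
        .count()
}

fn main() {
    use Op::*;
    // Write -> Read -> Read -> Read -> Write creates 4 semaphores, even though
    // only 2 (write -> first read, last read -> write) do any real work.
    assert_eq!(semaphores_created(&[Write, Read, Read, Read, Write]), 4);
}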

Yes, they are the only mechanism. Commands are in fact grouped into command buffers. Barriers are used to handle synchronization within command buffers, while semaphores handle synchronization between command buffers. There's no problem with handling barriers automatically, since you can know in advance all the operations that are going to happen.

This topic triggered some discussions on IRC.
(I had already asked for suggestions on IRC, but I guess my explanations here helped clarify the problem)

If you take a look at slides 37 to 39 here, the DICE developers suggest grouping multiple command buffers into jobs and only handling synchronization between these jobs instead of between individual command buffers.

There are two things in Vulkan that are analogous to jobs:

  • It is strongly encouraged to submit multiple command buffers at once, because submitting has a fixed overhead. Submissions to a queue happen one by one, so a group of command buffers submitted together can be considered a "job".
  • Vulkan has primary and secondary command buffers. Primary command buffers can call secondary command buffers. It is possible to create many secondary command buffers and a few big primary command buffers, in which case each primary command buffer is similar to a "job". This is something that only exists in Vulkan as far as I know.

This leads me to a new idea, which is to use two levels in the synchronization process.
If the wrapper knows in advance the list of command buffers that are going to be used in a "job" (either when submitting several CBs at once or when creating a primary CB that calls multiple secondary CBs), it can create semaphores in the most optimized fashion. Then it can handle synchronization between "jobs" (i.e. submissions or primary command buffers) in a different way, like ideas #3 or #4, which are now less cumbersome.

With this design, instead of having lots of small tasks that need to synchronize with each other, you'd have few big tasks that don't interact with each other.


Expanding on that.

The reason why you need 4 semaphores instead of 2 is that you want the reads to be able to overlap.
However, if you accept doing the reads one by one, you can make each read signal a semaphore that is then waited upon by the next read. Then only one semaphore per queue is needed.

Which means that you can have the equivalent of an RwLock with just one semaphore per queue per resource, if you accept a performance loss between sequences of reads.

But if we use that job-based design where you have big chunks of tasks instead of small ones, this performance cost is probably totally acceptable.

New idea.

Each queue has a semaphore dedicated to preventing jobs submitted to the same queue from overlapping. Each job submitted to a queue both waits on and signals the semaphore of that queue.

In addition to this, each job submission would signal one individual semaphore for each possible other queue.

When a new job is submitted, the resources involved are queried to determine which previous jobs must be finished before the new one starts (each resource remembers the last job that wrote to or read from it). For each dependency there are two possibilities: either the inter-queue semaphore it signaled is still available, and we wait on it, or the semaphore has already been waited upon. In the second case, we know for sure that it was waited upon by a previous job of the queue we are currently submitting to, which means we don't need to wait on anything extra.
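
Here's that resolution step as a rough sketch (placeholder types again; inter_queue_sems is indexed by the destination queue):

struct Job {
    queue_index: usize,
    // One semaphore signaled for each other queue; taking one out of its
    // Option marks it as already waited upon.
    inter_queue_sems: Vec<Option<Semaphore>>,
}

// Collect the semaphores a new job on `dest_queue` must wait on, given the
// previous jobs it depends on (as reported by the resources it touches).
fn wait_list(dest_queue: usize, dependencies: &mut [Job]) -> Vec<Semaphore> {
    let mut waits = Vec::new();
    for dep in dependencies {
        if dep.queue_index == dest_queue {
            continue; // same queue: the per-queue semaphore already orders us
        }
        if let Some(sem) = dep.inter_queue_sems[dest_queue].take() {
            waits.push(sem); // still available: wait on it
        }
        // Otherwise an earlier job of `dest_queue` already waited on it, and
        // the per-queue semaphore orders us after that job: nothing to do.
    }
    waits
}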

In the example above, when the second "Read buffer" job is submitted, we find out that it depends on the "Write buffer" job. But since the inter-queue semaphore has already been waited upon, we know for sure that "Write buffer" was already a dependency of an earlier job of queue 1. This means we don't need to wait on anything other than the per-queue semaphore.

I think that's probably the best approach.

The drawback is that there are a lot of semaphore destructions, but:

  • There should be only a few dozen jobs per frame and a typical program doesn't use more than 4-5 queues, so that would be a few hundred semaphores. It's not that much.
  • A pool of semaphores could be maintained to avoid creating and destroying them. I'm not exactly sure how to handle that yet, but it looks definitely possible.
  • I think a good idea would be to let the user provide a hint of how many queue transitions there are going to be. Then we only need to create the appropriate number of semaphores (one plus the hint). If the hint happens to be too low, we can submit a dummy command buffer to the source queue to signal the missing semaphore. That is obviously an overhead, but that's what you get for not providing the right hint.

This is a very interesting topic. A few weeks ago, I started working on a barebones Vulkan binding and simple example. Progress has been slow due to school, but I've messed with it for long enough to start having a vague idea of how Vulkan works.

These are the questions I've thought of as I read through your ideas:

  • How expensive is it really to create/destroy semaphores? Is it more expensive than having our own thread-safe semaphore pool in Rust? I imagine that if a pool were faster, the Vulkan driver might already do it internally. Also, is it bad to have a lot of semaphores allocated at once? Could it affect the performance of working with them?
  • How important is it that we allow multiple operations on the same queue to overlap? I would assume this is very important for performance. (My understanding is that a per-queue semaphore would force strict ordering of all operations within a queue.) So I think that having a semaphore for every queue would be quite non-ideal.

I'm thinking that it might be okay to have strictly-ordered queues, as long as users can express operations that can run concurrently. For example, I imagine a Write -> Read -> Read -> Read -> Write could be programmatically grouped up (by the user) into a collection of concurrent operations prior to execution, but I don't know whether this would be a flexible enough programming pattern.

Plus, I don't know whether we can do enough with Vulkan's semaphores to express particular orderings in this way:

  • How much can we do with a semaphore? It seems there are very few operations which work on semaphores - and none of them are primitive operations.
    • vkQueueSubmit (via VkSubmitInfo) (also takes a VkFence).
    • vkQueueBindSparse (via VkBindSparseInfo).
    • vkQueuePresentKHR (via VkPresentInfoKHR).
    • vkCreateSemaphore
    • vkDestroySemaphore
    • vkAcquireNextImageKHR

I think the "Start" and "End" blocks above might be implementable using vkQueueSubmit with commandBufferCount = 0. Although I think something similar to what I described is already part of the API - as VkSubmitInfo and VkBindSparseInfo can both take arrays of operations to perform between semaphore guards.

Thanks for the reply!

The Vulkan driver probably doesn't do it. Vulkan is supposed to be all about performance, and driver writers don't want to add overhead that some applications won't need.

In my previous post, the idea was to have one pool per queue. Vulkan queues are not thread-safe and have to be externally synchronized (i.e. the app needs to put a mutex around them) before we can submit to them. Therefore the semaphore pool doesn't need to be thread-safe either and can be locked at the same time as the queue.

The pool can consist of just a Vec (or even a SmallVec) where we push and pop the last element. I don't think you can do much faster than that.
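
Something along these lines (Semaphore and Device are still placeholder types):

struct SemaphorePool {
    free: Vec<Semaphore>,
}

impl SemaphorePool {
    // Reuse a semaphore that has already been waited upon, or create a new one.
    fn take(&mut self, device: &Device) -> Semaphore {
        self.free.pop().unwrap_or_else(|| Semaphore::new(device))
    }

    // Return a semaphore to the pool once it has been waited upon.
    fn put_back(&mut self, sem: Semaphore) {
        self.free.push(sem);
    }
}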

That is, I think, a necessary drawback of the approach. We'd have to tell users to group commands into a few large command buffers and/or a few large submissions.

Both the Battlefield developers (see the slides above) and NVIDIA suggest grouping command buffers into tasks, so I think that's probably a good thing to do.

That's an interesting approach.

I think the performance problem is this: if you do Write -> Something unrelated -> Something unrelated -> Something unrelated, and only then does the user add the Start -> Read -> ... on another queue, the semaphore that Start waits upon will only be signaled after the last Something unrelated, which would delay the operations by a lot.

I'm also not sure about usability. Do I have to add a Start for each immutable buffer and texture I upload?

Hi, I got this reply from a guy working at AMD:

"Looks like he needs something like a resource container that transitions/synchronizes all together. Solves G-Buffers, etc."

I'm not sure how much that helps you.

Oh, it's great that you're asking professionals.

Unfortunately that doesn't really help, unless I'm missing something.
Handling transitions and synchronization together is already part of my design (along with queue family ownership). Each buffer and image is created with an object that defines its allocation, synchronization and layout strategy. I'm currently changing that system slightly to make it easier to use, but it's already there.

Anyway, I'm leaning towards the design above, but the more suggestions the better.

Yeah, I suspected as much. I was at a bunch of Vulkan/DX12 talks at GDC last week, and it was suggested there that using a "RenderPass" in Vulkan is a good way to synchronize dependent reads/writes together, so it might be interesting to look at that as well.

I can ask if these presentations will be available online soon also.

Related Twitter thread, if you (@tomaka) want to ask more: https://twitter.com/daniel_collin/status/711580538562334721

Aha! I see. Then a semaphore pool seems like it's probably useful.

I'm thinking something along the lines of:

let mut reads = vec![];
reads.push(read1);
reads.push(read2);
reads.push(read3);

let ex = ExecutionSequence::new()
    .enqueue(write1)
    .enqueue_concurrent(&reads) // creates Start/End automatically
    .enqueue(write2);

queue.enqueue(ex);

Although I haven't thought this through.

I'm settling for the solution with few submissions and per-queue semaphores.

In that design, Buffer is an unsafe trait that is implemented by various structs. When you create a command buffer that uses a buffer, the buffer is added to a list in the form of an Arc<Buffer>.

Then, when the command buffer is submitted, the gpu_access method of the Buffer trait is called on each buffer used by the command buffer to determine the dependencies (this applies to images too, but that part isn't implemented yet). The rest is as described above.
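
In rough strokes, the trait could look like the sketch below; the actual vulkano signatures may well differ, and Semaphore is again a placeholder:

use std::sync::Arc;

// Unsafe trait: an incorrect implementation breaks the synchronization guarantees.
unsafe trait Buffer {
    // Called at submission time for every buffer the command buffer uses;
    // reports how the submission must synchronize with previous accesses.
    fn gpu_access(&self, write: bool, queue_family: u32) -> GpuAccessResult;
}

struct GpuAccessResult {
    wait_semaphores: Vec<Arc<Semaphore>>,   // wait on these before starting
    signal_semaphores: Vec<Arc<Semaphore>>, // signal these when finished
}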

Compared to raw Vulkan, the noteworthy performance costs of submitting a command buffer are:

  • One virtual function call (gpu_access) for each buffer and image used by each command buffer. This is mitigated by the fact that immutable buffers and images (like 3D models and textures) don't need to be synchronized, so the method isn't called for them.
  • One mutex lock for each earlier submission that the new submission depends on. This could block other threads, but it's probably okayish, as we only modify two Vecs while holding the lock.
  • Depending on the implementation of the Buffer trait, calling gpu_access may require locking mutexes as well. In ImmutableBuffer and StagingBuffer we only swap the content of the mutex, so it's probably okayish as well.
  • Contrary to what I said above, the semaphore pool (which is not implemented yet) does need to be thread-safe because once a semaphore is no longer needed it needs to be returned to the pool in a thread-safe way.

Overall I think the cost is okay. The most crucial thing is that ImmutableBuffer is fast, since it probably accounts for 95% of the buffers used in a typical engine.

I think I may be worrying too much about the performance costs. As a reminder, every single OpenGL function call took at least 2µs, and sometimes up to around 20µs. One command buffer submission replaces dozens, if not hundreds, of OpenGL function calls. Two mutex locks and a virtual function call per resource are nowhere near what things cost in OpenGL.


AMD has posted their GDC presentations at http://gpuopen.com/gdc16-wrapup-presentations/ and you may find the "Vulkan Fast Paths" one interesting.
