Simultaneous read and write into a shared buffer

I understand that having simultaneous read and write to the same memory is 'insta-UB' in Rust, but my domain is asking for it. I want to have a shared buffer where one thread is writing (and reading), and the other is reading regardless of the state of the writer.

What I do is 'image rendering', which can take an arbitrary amount of time, and I want to show progress (the work done so far) without incurring atomic overhead. It's done in a fixed-size buffer.

My idea is to have two threads: one is pure math, calculating and writing at max speed; the second is a 'drawing' thread, which just copies to the window canvas whatever is available in the buffer right now. The buffer is preinitialized with 0 (black with 0 alpha), it's perfectly fine to have a few pixels be 'outdated', and none of the possible data races can hurt.

How can this be done in Rust? This kind of 'forbidden sharing' between threads? Should I try to write it myself, or are there libraries for that?

You have to do this with atomics. Luckily, if you use Relaxed ordering with the store/load operations, they compile down to completely ordinary MOV instructions, at least on x86 machines. See e.g. this page.
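
A minimal sketch of what that looks like (the function names are mine, just for illustration):

    use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

    // Writer side: a Relaxed store works through a *shared* reference,
    // and on x86 it compiles to a plain MOV.
    fn put_pixel(buf: &[AtomicU32], idx: usize, val: u32) {
        buf[idx].store(val, Relaxed);
    }

    // Reader side: a Relaxed load, also a plain MOV on x86. It may observe
    // a slightly stale value, which is exactly what you described as acceptable.
    fn get_pixel(buf: &[AtomicU32], idx: usize) -> u32 {
        buf[idx].load(Relaxed)
    }

Both threads can hold a reference to the same [AtomicU32] at the same time; that's the whole point of the atomic types.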

What about the optimization where the compiler notices that you are reading through an immutable reference, which allows it to assume the data is unchanged between any two uses of that reference, at which point it optimizes out any access after the first one, meaning that you never see any updates?


Yes, it sounds less impressive than I expected it to be. I'm sold on atomics (if they perform as fast as I hope).

Are you sure you really have the need for speed here?

If the image rendering is going to take seconds, minutes, whatever, then surely if the thread doing the rendering took time out every second or so and posted its progress somewhere for something else to read, that would take such a small amount of time, microseconds perhaps, compared to the total job that no one would ever notice.

Typically I would write the code first and then see if I have a performance problem.

Actually, I'm not aiming for speed here, but for convenience. If I split the 'display' and 'draw' threads like this, I can write my 'draw' thread without thinking about how to split my algorithm into chunks or where to put synchronization points; there's no need to handle events in a message queue, etc. Just for x in pixelarray { x = calculate_pixel() }.

My naive dream is to recreate the experience of the ZX Spectrum, where you can just write a loop and 'put pixel' on the screen without any lore. I want the same with Rust. One thread does the boilerplate and initialization; the other works with the buffer at a comfortable speed.

(Before that I tried to make 'put_pixel' into a message to the main thread, but it was crazy slow to do it per pixel, and every other idea just made the 'draw' thread more complicated.)

I want the simplest possible API for the draw thread without compromising the speed and responsiveness of the app (mostly for 'redraw' and 'close' events).

Not so naive. I think it's a noble objective.

The closest I have come to doing that sort of thing since the "Speccy" days is to open the framebuffer device on Linux and write pixels directly into the memory space you get. They immediately appear on the screen. I did that on a Raspberry Pi with no X Windows running. Worked a treat.

Here is an example of someone doing that: Writing To The Framebuffer | Seena Burns

I guess the next closest is writing the pixels into a texture buffer and then getting OpenGL to put it on the screen.

Still, I would not worry about getting progress reports out of the render loop. I would be tempted to put that loop in a function and pass it a channel to write to as a parameter. Then call it from a thread. Only a line or two of code is required to send a progress message after every line, or every some number of lines, or whatever.
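
Something like this sketch (HEIGHT and the per-row rendering are placeholders):

    use std::sync::mpsc;
    use std::thread;

    const HEIGHT: usize = 1080; // placeholder size

    fn main() {
        let (tx, rx) = mpsc::channel();

        let worker = thread::spawn(move || {
            for row in 0..HEIGHT {
                // ... render one row of pixels here ...
                let _ = tx.send(row); // one cheap progress message per row
            }
            // tx is dropped here, closing the channel and ending the loop below
        });

        for row in rx {
            println!("progress: row {row} done");
        }
        worker.join().unwrap();
    }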

By the way, how were you running threads on a Speccy?

I already tried the framebuf library (there was too much event-loop code and it used 100% CPU) and piston (it has terrible docs and I wasn't able to understand it well enough).

The best thing I've found so far is rust-sdl2 (which is a very nice wrapper over SDL2). It got my silly demo running at 150-250 fps (depending on specifics of the code) and was almost simple to write. The 'not simple' part was around texture types and double buffering, but 30% CPU with vsync() at 2560x1440@60FPS was my childhood dream (well, my actual childhood dream was 320x240 with an interleaved bitmap and an 8x8 colormap, but that's more of a childhood horror story...). Anyway, I can draw whatever I want and it's fast enough that I don't notice anything.

My current 'flash and random' demo is here: GitHub - amarao/sdl_random: SDL2 Rust demo with pixel drawing.

Anyway, my next step is to wrap all the SDL-specific code into a separate thread and to have a simple 'draw here' object with no nuances.
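
Something like this shape, maybe (a sketch with made-up names, using the atomics suggested above and an Arc to share the buffer):

    use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
    use std::sync::Arc;

    // The SDL thread keeps the event loop and a clone of `data`;
    // the math thread only ever sees this handle.
    struct DrawHere {
        data: Arc<[AtomicU32]>,
        width: usize,
    }

    impl DrawHere {
        fn put_pixel(&self, x: usize, y: usize, color: u32) {
            self.data[y * self.width + x].store(color, Relaxed);
        }
    }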

I already benchmarked atomic.store(_, Relaxed), and criterion can't differentiate between a loop over Vec<u32> and one over Vec<AtomicU32>, so it's really nice.
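
The comparison was along these lines (a sketch, not my exact benchmark code):

    use criterion::{black_box, criterion_group, criterion_main, Criterion};
    use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

    fn bench_loops(c: &mut Criterion) {
        let plain: Vec<u32> = vec![1; 1 << 16];
        let atomic: Vec<AtomicU32> = (0..(1 << 16)).map(|_| AtomicU32::new(1)).collect();

        c.bench_function("sum Vec<u32>", |b| {
            b.iter(|| black_box(&plain).iter().map(|x| *x as u64).sum::<u64>())
        });
        c.bench_function("sum Vec<AtomicU32> (Relaxed)", |b| {
            b.iter(|| black_box(&atomic).iter().map(|x| x.load(Relaxed) as u64).sum::<u64>())
        });
    }

    criterion_group!(benches, bench_loops);
    criterion_main!(benches);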

I'm far from claiming to understand Rust (and Rust lore) well enough to write a public library, but if I ever get to a silky-smooth foundation for my toys, I'll try anyway.

Compared to my childhood, computers have become slick and complex. It's really hard to get started.

I talked to a guy on a corner and he sold me a vial with some diluted 'unsafe' inside for $50.
This is my small library (for now with only get and set functions), which allows one thread to write and another to read. I used AtomicU32 and Ordering::Relaxed to address the issues I heard about here.

I ran it and it seems to work. How bad is it? How many laws have I broken? Does it have any relation to 'soundness'?

I don't understand why you are doing the thing with the slice. Share the vector with an Arc.

It's a prototype. In the future, there is going to be a loop with an endless 'set' (based on some computation), and a 'copy_buff' which would copy the whole slice into another buffer (a texture) to display on screen.

As far as I understand, Arc does not let me have get_mut and get simultaneously in different threads (because that would be unsafe).

    fn set(&mut self, idx: usize, val: u32) {
        self.slice[idx].store(val, Relaxed);
    }

You can use &self in place of &mut self here. AtomicU32 doesn't need the exclusive reference to make changes; that's the entire point of atomics.
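
In other words:

    fn set(&self, idx: usize, val: u32) {
        self.slice[idx].store(val, Relaxed);
    }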


Edit: I’m pretty sure that your code as written is unsound. The call to Vec::as_mut_ptr in spawn() will invalidate the pointer you got inside new, and therefore the slice reference derived from it.

Just make your set method take &self and the Arc will work.

Wow, I never thought about that. If I can write through non-mut references, I can do it without the unsound unsafe {} code.

Thank you, I'll try it again.

Arc is working, indeed (rust_dual/main.rs at c6577e63cffc0d7ce866ec61e7e5524ac44c82d0 · amarao/rust_dual · GitHub)

but my humble benchmark (writing counts 10 times from 0 up to the u32 limit in one thread and reading them in another) gives these numbers on my PC:

arc:

    real  0m19.216s
    user  0m38.430s

unsound double ptr:

    real  0m13.736s
    user  0m27.469s

It's almost the same code, but one uses Arc to get into the Vec, and the other (the faster one) uses two slices into the same vector. That's 40%!

The unsound version is cranking through 12.5 GB/s of integers; the sound (Arc) one does 8.9 GB/s.

I would like to make it sound, but not at the expense of speed. I'll try to play with two non-mut refs (which, as far as I understand, I can have) to make it less scary.

It's probably because of the double indirection, since with an Arc<Vec<AtomicU32>> you have to first go to the arc's allocation, and then to the vector's allocation.

Try using the type Arc<[AtomicU32]> instead. You can make one by doing Arc::from(vec_of_atomic_u32) (the From<Vec<T>> impl has no Clone bound, so it works even though AtomicU32 isn't Clone).
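
For example (a minimal, self-contained sketch; the sizes and names are illustrative):

    use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
    use std::sync::Arc;
    use std::thread;

    fn main() {
        let vec: Vec<AtomicU32> = (0..16).map(|_| AtomicU32::new(0)).collect();
        // From<Vec<T>> for Arc<[T]> moves the elements; no Clone needed.
        let data: Arc<[AtomicU32]> = Arc::from(vec);

        let writer = {
            let data = Arc::clone(&data);
            thread::spawn(move || data[0].store(42, Relaxed))
        };
        writer.join().unwrap();

        // Arc<[T]> is one allocation: the refcounts and the slice live
        // together, so indexing needs only a single pointer hop.
        assert_eq!(data[0].load(Relaxed), 42);
    }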

-    data:  Arc<Vec<AtomicU32>>,
+    data:  Arc<[AtomicU32]>,
...
-                data: Arc::new(vec),
+                data: Arc::from(vec),

And the result is...

    real  0m9.846s
    user  0m19.690s

Wow! It beats the hell out of my unsound unsafe version (9.8 s vs 13.7 s)! Another 40% speedup! 17.5 GB/s! My memory is capable of 25.6 GB/s per channel, and this is really close to it. (I checked with .get_unchecked instead of [], but the difference is within 2%, negligible.)

I suspect that this speed (17.5 GB/s) is bound more by CPU IPIs than by anything else.

Thanks a lot for the help (and for saving me from an unsound unsafe mess)!

Update, just for clarity: I tried my code without atomics, and indeed, it does not work at all. I think every CPU is using its local cache and can't see changes from other processors without a proper IPI (which is what atomics do under the hood).

It should not differ from the other solutions on this point. Are you sure you are not just creating a different Arc for each thread?

No, no, sorry for being unclear.

For the sake of argument, I tried my 'hacky' version with two pointers to the same memory area. It works with AtomicU32, but when I changed it to u32 (just to see how fast it would go), it stopped working. The 'main' thread never saw any updates from the other thread.

I'm not a guru of multithreading, but I think my tiny knowledge of how CPUs work is actually enough to propose a simple explanation.

There is shared memory between processors. If one processor writes to it, another can read it, if it wants to.

My code, with a single u32 in a slice of size 1, begs for caching. Every processor core uses its own cache and has no need to look into main memory. Write operations are flushed (eventually) from the writer CPU's cache, but the reader definitely has no reason to re-read from memory. The cache line is hot, so it's never updated.

AtomicU32::store() is a way to force one CPU to send an IPI (inter-processor interrupt) to force a cache update, so the other CPU must read the updated version.

Basically, it's a low-level explanation of why a u32 on the heap is not read/write-shareable between threads (without even involving the Rust memory model).

On an Intel x86 processor, relaxed reads and writes should be equivalent to non-atomic reads and writes. My guess would be that the optimizer optimized out your reads.
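
Something like this hypothetical pattern would be affected:

    // Through a plain shared reference the compiler may assume the value
    // never changes, so it is allowed to hoist the load out of the loop:
    fn wait_for_flag(flag: &u32) {
        while *flag == 0 {
            // may be compiled as if it were: if *flag == 0 { loop {} }
        }
    }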

It looks like it's still doing a real load/cmp:

    .LBB344_10:
            xor     eax, eax
            mov     esi, eax
            lea     rdi, [rsp + 72]
            mov     eax, dword ptr [rsp + 52]
            cmp     eax, -2147483648
    ; while a.get(0) < 2147483648 {
            jb      .LBB344_13
    ...
    .LBB344_13:
    ; while a.get(0) < 2147483648 {
            jmp     .LBB344_10

The loop with set is working too:

    .LBB29_2:
    ; mw.set(0, cnt);
            mov     rcx, qword ptr [rbx + 24]
    ; self.slice[idx] = val;
            mov     dword ptr [rcx], eax
    ; cnt += 1;
            add     eax, 1
    ; self.slice[idx] = val;
            cmp     qword ptr [rbx + 32], 0
            jne     .LBB29_2

So it looks like it wasn't optimized out.

But I can't explain the differences. I looked into the asm for atomic/Relaxed, and it's really just a mov (compared to Release/Acquire, which puts a lock prefix in, dropping performance by about 40%, or SeqCst, which I was never able to wait for even one loop to finish).

It's a bit of a mystery to me now...