Unsafe code review for coi's Resolvable

Nashenas88 · July 4, 2020, 2:29am

I'm writing my first piece of unsafe code and would really appreciate some other eyes on this if anyone has the time. I've moved it to its own module (https://github.com/Nashenas88/coi/blob/15bfa293d01bfb67e335d31b06827c3059c3c42f/src/resolvable.rs) to limit the scope of unsafe.

What I'm trying to accomplish with that code:
This is for a dependency injection crate, and I'm supporting multiple threads possibly attempting to resolve the same item at the same time. In the past I dealt with this by hiding the resolutions behind a lot of mutexes, but it was hard to manage properly, and prone to deadlocks whenever I would make any sort of involved change.

I switched to using an atomic u8 to manage the states of resolution: unresolved, building, and resolved. The idea I had was that a single thread could guarantee that it's the one doing the building by validating that it is able to update the atomic, and in doing so, I could use UnsafeCell to ensure that no other thread is writing. However, I also needed to make sure that no other thread attempts to read the value in the UnsafeCell during the building time, so there's also a (Mutex<bool>, Condvar) tuple that's used to block all other threads until the building is done, and the value has been resolved. This way I can ensure all of the other threads are notified at once and the CPU isn't blocked with busy waiting.

I wanted to know if I may have gone overboard and don't need unsafe. Have I gotten too much tunnel vision and maybe I'm missing something more obvious?

HeroicKatora · July 4, 2020, 12:02pm

In resolve_inner at lead the order of updates to is_building and variant isn't what you intended. You first do a SeqCst write to variant to release it as Unresolved and then lock the condition variable to set it to false. This means some other thread could have grabbed the resolution and ran the code that sets *is_building = true and now its update is running while having is_building = false. I don't think this leads to further problems other than quite a lot of cpu activity and lock contention on the building lock though.

I think you do have a concurrency bug though. If we examine the possible paths a thread could take towards unwrapping a built Option then we find that there one where no acquire load of variant takes place, this happens when both loads return RESOLVED immediately. Now, if that is the first time this thread has examined the Resolvable then it has not yet synchronized with the operation that wrote the item, which means its loads are racey. Note for example the impl of once_cell.

Nashenas88 · July 7, 2020, 2:43pm

Thank you for your quick response! I took some time to read the nomicon, especially the parts around ordering. I'll make sure to apply your changes and also add some tests on an arm device to see if I run into any issues. I really appreciate the feedback!

Nashenas88 · July 13, 2020, 10:59pm

@HeroicKatora I think I solved both issues you brought up. I replace the compare_and_swap orderings from Relaxed to AcqRel. I employed double-checked locking around the switch to building to avoid the second issue you brought up. The updated code is here: https://github.com/Nashenas88/coi/blob/atomic-updates/src/resolvable.rs. I got rid of the condvar and use a RwLock instead, this I believe would allow the waiting threads to all proceed at once.

After writing that, I wondered if I could get away from unsafe by using RwLock and I have that implementation here: https://github.com/Nashenas88/coi/blob/rw-lock-resolvable/src/resolvable.rs

I have never done performance testing with threads before, so I'm not sure of the best way to compare the difference in performance between the two. I wrote this benchmark for that purpose: https://github.com/Nashenas88/coi/blob/atomic-updates/benches/threads.rs (which is the same in both branches). In the atomic-updates branch, the performance seems to be fairly consistent, but in the rw-lock-resolvable branch, I get timings that jump around. The former on my current laptop had timings around 3.4ms, with not too much variantion, while the latter would get down to almost 3.5ms, but sometimes jump up to 6 ms. I closed all other apps running on my laptop, to hopefully reduce noise from the system, but it wasn't clear to me how I could analyze this more easily/clearly.

system · October 11, 2020, 11:00pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Atomic-synchronized UnsafeCell code review	6	824	June 4, 2022
Unsafe code review request for cycle collector code review	2	277	July 13, 2023
Unsafe Cell async help	5	530	March 1, 2023
Alternative to Mutexs, When code is sound without them	4	445	February 20, 2023
Anouncing async_executors, a building block for executor agnostic libraries announcements	10	1029	May 28, 2020

Unsafe code review for coi's Resolvable

Related Topics