I'm working on a concurrent slab allocator that operates on a pre-allocated memory block and has O(1) memory overhead (for types at least as large as a usize) and O(1) initialization time with respect to the backing storage size. The current implementation uses a Treiber stack to maintain the free-list, with the next pointers being usize indices bit-tagged with the slot's deallocation count to mitigate the ABA problem. I also plan to implement backoff strategies using thread-local state to reduce contention on the free-list head.
The API is quite simple:
Call AtomicSlab::<T>::new(storage) with a *mut [u8] pointer to pre-allocated storage
Allocate with slab.alloc(), getting an AtomicSlabPtr<T> which stores both a direct *mut T to the allocated slot, and the slot's bit-tagged index.
Free with slab.free(ptr), where ptr is the AtomicSlabPtr given by slab.alloc().
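For illustration, usage looks roughly like this (the backing-storage setup is just an example, error handling is omitted, and ptr() is the raw-pointer accessor mentioned later in the thread):

let mut backing = vec![0u8; 4096];
let storage = &mut backing[..] as *mut [u8];

let slab = AtomicSlab::<u64>::new(storage);

// Allocate a slot; the returned AtomicSlabPtr carries both the raw
// *mut u64 and the slot's bit-tagged index.
let ptr = slab.alloc();
unsafe { ptr.ptr().write(42) };

// The caller is responsible for freeing the slot exactly once.
slab.free(ptr);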
In order to achieve O(1) memory overhead, I store the free-list next pointer in a union with the slot's value storage, which can safely be done in a non-concurrent slab allocator. In a concurrent scenario, however, the following sequence creates a data race:
1. Thread A and thread B both call alloc() and enter the CAS loop, reading the same free-list head.
2. Thread A's CAS succeeds, so it becomes the owner of the memory slot corresponding to head.
3. Thread A writes a value to head's memory slot.
4. Thread B reads the now-stale head's next pointer to prepare for its CAS.
5. Thread B attempts the CAS on the free-list head, which fails because its value of head is outdated.
So, steps 3 and 4 constitute a data race, which both Miri and TSAN are able to detect. I'm quite confident that when the CAS succeeds, such a race cannot have happened, so the data race is "safe" (see a proof sketch at line 221 in the playground code). But this doesn't change the fact that, from what I understand about the Rust/C++11 memory model, data races are always UB.
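To make the discussion concrete, the racy part looks roughly like this (heavily simplified: the generation tag, the empty-free-list case, and the AtomicSlabPtr wrapper are all omitted):

use std::mem::ManuallyDrop;
use std::sync::atomic::{AtomicUsize, Ordering};

// repr(C) keeps both fields at offset 0: a free slot stores the index of
// the next free slot in the same bytes that later hold the user's value,
// which is what gives the O(1) memory overhead.
#[repr(C)]
union Slot<T> {
    value: ManuallyDrop<T>,
    free_next: usize,
}

struct AtomicSlab<T> {
    free_head: AtomicUsize,
    slots: *mut Slot<T>,
}

impl<T> AtomicSlab<T> {
    fn alloc(&self) -> *mut T {
        let mut head = self.free_head.load(Ordering::Acquire);
        loop {
            // Step 4 above: if another thread already won the CAS for
            // `head` and wrote a T into the slot, this read races with
            // that write.
            let next = unsafe { (*self.slots.add(head)).free_next };
            match self.free_head.compare_exchange_weak(
                head,
                next,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                // CAS succeeded: we own the slot and hand out a *mut T.
                Ok(_) => return self.slots.wrapping_add(head) as *mut T,
                // CAS failed: `next` may be garbage, but it is discarded.
                Err(actual) => head = actual,
            }
        }
    }
}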
So, the question is: Is there a way to make this code UB-free while keeping things lock-free and with constant memory overhead? All the strategies I can think of involve some kind of spin locking.
There is also the possibility of making it impossible to get a mutable pointer from the AtomicSlabPtr<T>, instead providing Cell-like methods that copy values of T in and out while using a relaxed atomic operation for the lower part that overlaps the next pointer. However, this prevents holding a Rust reference to the allocated value, so any safe API built on top of it would be quite limited.
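What I mean is roughly the following sketch (write path only, assuming size_of::<T>() >= size_of::<usize>() and usize-aligned slots; the names are made up):

use std::sync::atomic::{AtomicUsize, Ordering};

// Copy a value into a slot, treating the first word (which overlaps the
// free-list next pointer) as an atomic and the rest as plain memory.
unsafe fn slot_set<T>(slot: *mut T, value: T) {
    const WORD: usize = std::mem::size_of::<usize>();
    let src = &value as *const T as *const u8;
    // First usize-sized chunk: relaxed atomic store, because alloc() may
    // be racing on these bytes through its free_next load.
    let first_word = (src as *const usize).read_unaligned();
    (*(slot as *const AtomicUsize)).store(first_word, Ordering::Relaxed);
    // The remaining bytes do not overlap the link, so a plain copy is fine.
    std::ptr::copy_nonoverlapping(
        src.add(WORD),
        (slot as *mut u8).add(WORD),
        std::mem::size_of::<T>() - WORD,
    );
    // Ownership of `value` has been moved into the slot byte-by-byte.
    std::mem::forget(value);
}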
I am not a specialist on this kind of low-level stuff, but I have two observations:
Firstly, what happens here?
let slab = AtomicSlab::<T>::new(storage);
let ptr1 = slab.alloc();
let ptr2 = ptr1.clone(); // There is a Clone implementation which just copies the pointer and its index.
slab.free(ptr1);
// Is using ptr2 from here on UB?
Secondly, the issues in the Wikipedia article you linked apply, since Rust does not have the same guarantees as the Java VM. Looking at the article on the ABA problem, you may want to implement some kind of hazard pointer or another means of synchronization.
Yes, using a clone of the pointer after slab.free() would be UB, just like using a pointer returned by std::alloc::alloc() after it has been deallocated is UB. AtomicSlabPtr is meant to be a thin wrapper around a *mut T that leaves the user fully responsible for the lifecycle of their allocation. Hence I don't think hazard pointers are required here -- if the user needs some form of concurrent GC, they are responsible for implementing it themselves, e.g. by wrapping the AtomicSlabPtr in an atomically reference-counted object or their own hazard pointer implementation.
As for the ABA problem, I mitigate it using tagged references. The top 10 bits of a slot index are reserved for the generation tag, which is incremented on every deallocation. Hence you'd need the same slot to be re-allocated a multiple of 1024 times while a thread is stalled in the CAS loop to exhibit the ABA problem. Granted, it's probably a good idea to use more generation bits, but on 32-bit systems without a CAS2 primitive, there aren't that many available.
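Roughly, the tagging works like this (the constant names are made up, and the real code folds this into alloc() and free()):

const GEN_BITS: u32 = 10;
const INDEX_BITS: u32 = usize::BITS - GEN_BITS;
const INDEX_MASK: usize = (1 << INDEX_BITS) - 1;

fn pack(index: usize, generation: usize) -> usize {
    (generation << INDEX_BITS) | (index & INDEX_MASK)
}

fn unpack(tagged: usize) -> (usize, usize) {
    (tagged & INDEX_MASK, tagged >> INDEX_BITS)
}

// On every free() the generation is bumped; it wraps after
// 1 << GEN_BITS = 1024 deallocations of the same slot, which is when
// ABA becomes possible again.
fn bump_generation(tagged: usize) -> usize {
    let (index, generation) = unpack(tagged);
    pack(index, generation.wrapping_add(1) & ((1 << GEN_BITS) - 1))
}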
I am not sure how useful unsafe APIs are in general and I don't know what the purpose of this crate is in particular, but I'd recommend changing the API to
return some kind of pointer wrapper that is not clonable
do not expose a ptr() method, but implement AsMut for the pointer wrapper
store a shared back-reference with internal mutability to the original slab in the pointer wrapper, to be able to
deregister the pointer by a Drop implementation, rather than a C-esque call to Slab::free()
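Something like this sketch (the names are placeholders, and I'm glossing over how exactly the wrapper borrows the slab):

pub struct SlabBox<'a, T> {
    ptr: AtomicSlabPtr<T>,
    slab: &'a AtomicSlab<T>,
}

impl<'a, T> AsMut<T> for SlabBox<'a, T> {
    fn as_mut(&mut self) -> &mut T {
        // SAFETY: SlabBox is deliberately not Clone, so this is the only
        // handle to the allocation, and it stays live until Drop runs.
        unsafe { &mut *self.ptr.ptr() }
    }
}

impl<'a, T> Drop for SlabBox<'a, T> {
    fn drop(&mut self) {
        // Deregister the slot automatically instead of requiring a
        // C-esque explicit free(). The clone is needed because Drop only
        // gets &mut self, while free() takes the pointer by value.
        self.slab.free(self.ptr.clone());
    }
}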
PS: As for the pointer clone usage: by allowing the user to Clone the pointer wrapper, you suggest to the user that it is a new object, independent of the object it originated from. I don't think silently introducing UB like this is good practice. In a complex enough code base, a user may forget that ptr2 originated as a clone of ptr1, which was freed 100 lines earlier.
Thanks for the suggestions. These are things that I considered, and plan to do in a separate type that wraps around the AtomicSlab. The reason the current code provides a C-esque unsafe API is that it can be used as a building block for multiple different safe abstractions:
A concurrent slab that returns owned mutable reference wrappers, much like what you described;
A concurrent slab that creates Arc-like reference-counted immutable references to data on the slab. This is why the Clone impl is useful to have on AtomicSlabPtr: For this higher-level abstraction, it's necessary to be able to clone the underlying raw pointer.
Anyway, this is not related to my original question. To reiterate, I'm wondering if there's a way to prevent the data race I describe (even though it has no observable effects, it's still UB), and to do so without adding linear memory overhead or making the data structure non-lock-free.
I’m not following the entire discussion, but note that unions don't have “active” fields; that's a concept from C++. The effect of writing one union field and reading another should be understood as exactly the same kind of thing as a pointer cast or transmute — what matters to the type is whether the bytes read (including their uninitializedness) are valid for that type or not, not what type those bytes were previously written from.
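For example (purely illustrative):

union Example {
    n: usize,
    bytes: [u8; std::mem::size_of::<usize>()],
}

let u = Example { bytes: [0xAB; std::mem::size_of::<usize>()] };
// Reading `n` here is effectively a transmute of the stored bytes to
// usize; every bit pattern is a valid usize, so this is fine even though
// the bytes were written through the other field.
let _value = unsafe { u.n };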
This is not a missing feature in bytemuck, because T: Pod implies the ability to safely convert from &T to &[u8] (or in this case, even &AtomicUsize to &usize), which is unsound for AtomicUsize, Cell<U>, or any other type with interior mutability because it would allow the &[u8] to observe mutations of the bytes.
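A sketch of why it cannot be sound (the cast below is NOT something bytemuck provides; it only illustrates the hole):

use std::sync::atomic::{AtomicUsize, Ordering};

fn hypothetical_pod_cast(x: &AtomicUsize) -> &usize {
    // If AtomicUsize were Pod, safe code could obtain this reference.
    unsafe { &*(x as *const AtomicUsize as *const usize) }
}

fn observe(x: &AtomicUsize) {
    let plain: &usize = hypothetical_pod_cast(x);
    // The bytes behind `plain` can now change while the shared borrow is
    // live (here, or from another thread), which a plain &usize with no
    // interior mutability promises cannot happen.
    x.store(1, Ordering::Relaxed);
    let _ = *plain;
}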
Thanks for the corrections. On atomics and bytemuck, I must have been thinking of the change to implement Zeroable for atomics, and I hadn't realized it had already been released. Of course that doesn't give the same capabilities as Pod.
I don't quite follow your proof sketch, but I don't think your conclusion is valid:
hence the load below will read the correct free_next value.
I believe it is totally possible to load a garbage value (written as type T by the winning thread) for free_next.
however, I think your allocation algorithm is logically "correct", in that even if the load of free_next gets a garbage value back, the following CAS is guaranteed to fail because the head was updated by the other thread (assuming the ABA problem cannot occur), so the garbage value is discarded on the next retry and cannot be observed from outside.
but the problem is that merely the racy load itself is already UB in Rust (and C++); it doesn't matter that the loaded (garbage) value was guaranteed to be discarded: the signature of compare_exchange(&self, current, new, ...) requires both current and new to be valid values! although we know that, semantically, the (garbage) value new is transient between the load and the CAS, this cannot be expressed in the language. (no, MaybeUninit doesn't help)
probably the only way to make this UB-free is to do both the loading of next and the CAS operation "outside" the Rust memory model, e.g. with FFI or inline assembly.
I believe it is totally possible to load a garbage value (written as type T by the winning thread) for free_next.
however, I think your allocation algorithm is logically "correct", in that even if the load of free_next gets a garbage value back, the following CAS is guaranteed to fail because the head was updated by the other thread (assuming the ABA problem cannot occur),
Yeah, this is precisely my argument -- it's possible for the free_next relaxed load to read garbage that the winning thread wrote, but then the CAS fails so the data race is "harmless" (but still UB, so not-so-harmless). In the comment in the code I was exploring the path where the CAS succeeds, and trying to show that if it does, the last write to free_next was the one done by the free() call that installed the current free-list head.
the signature of compare_exchange(&self, current, new, ...) requires both current and new to be valid values! although we know that, semantically, the (garbage) value new is transient between the load and the CAS, this cannot be expressed in the language (no, MaybeUninit doesn't help). probably the only way to make this UB-free is to do both the loading of next and the CAS operation "outside" the Rust memory model, e.g. with FFI or inline assembly.
Thanks for the idea of using inline assembly, I hadn't thought of it. If I did that, would loading free_next with inline assembly, but still doing the compare-exchange in Rust, be UB? You seem to say that it is, but once it's loaded by inline assembly, as far as Rust is concerned it is just a plain usize obtained from outside the memory model, which would be a "valid" value, no?
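Something like this is what I have in mind (x86_64 only, and purely illustrative; I'm not sure this is the right way to express it):

use core::arch::asm;

// Load a usize from a potentially racing location, with the read itself
// performed in inline assembly so that it happens "outside" the Rust
// memory model. On x86_64 a plain aligned mov is an atomic load.
unsafe fn asm_load_usize(ptr: *const usize) -> usize {
    let value: usize;
    asm!(
        "mov {value}, qword ptr [{ptr}]",
        ptr = in(reg) ptr,
        value = out(reg) value,
        options(nostack, readonly),
    );
    value
}

// In alloc(), the racy read of free_next would become
//     let next = unsafe { asm_load_usize(slot as *const usize) };
// and `next` is then fed to compare_exchange as an ordinary usize.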
that's an interesting thought, and I think it can actually work (Rust should just treat it as a foreign value), but I'm not enough of an expert to assert that. I overlooked this possibility because my mind was pre-conditioned for a hypothetical "usize, but possibly containing a random value" type, which does not exist in Rust, so I assumed it had to be handled outside Rust.
however, inline assembly is a black box to MIR, so Miri does not support emulating code that contains inline assembly.