Code review for thread-safe arena implementation

I've coded the following thread-safe arena for fun and to practice atomics (thanks @Mara for the excellent book!). I know there is a bug, because if I set RANGE to a very large number (I recommend 10_000_000) it sometimes fail with STATUS_ACCESS_VIOLATION (on Windows), but I can't spot it and I have no idea how to debug that:

#![forbid(unsafe_op_in_unsafe_fn, rust_2018_idioms)]

use std::alloc::Layout;
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicPtr, Ordering};

/// A non-dropping thread-safe arena.
pub struct Arena<T> {
    current_chunk: AtomicPtr<T>,
    current_chunk_end: AtomicPtr<T>,
    /// `null` if we are currently allocating a new chunk.
    next_to_alloc: AtomicPtr<T>,
    /// The number of `T`s every chunk can contain.
    capacity: usize,
    allocated_chunks: UnsafeCell<Vec<*mut T>>,
}

// SAFETY: This is the point.
unsafe impl<T: Send> Sync for Arena<T> {}

const DEFAULT_CAPACITY: usize = 1000;

impl<T> Arena<T> {
    const EMPTY_ARENA_SENTINEL: *mut T = 1 as *mut T;

    #[inline]
    pub const fn new() -> Self {
        Self::with_capacity(DEFAULT_CAPACITY)
    }

    #[inline]
    #[track_caller]
    /// Takes the number of `T`s a chunk will contain.
    pub const fn with_capacity(capacity: usize) -> Self {
        // This is not required, but put here for better error reporting. This is equivalent to
        // `Layout::array::<T>(capacity).unwrap()`, but const.
        match capacity.checked_mul(std::mem::size_of::<T>()) {
            Some(v) if v > isize::MAX as usize => panic!("too big capacity"),
            None => panic!("too big capacity"),
            _ => {}
        }

        assert!(capacity > 0, "capacity cannot be zero");
        assert!(
            std::mem::size_of::<T>() != 0,
            "zero-sized types are not supported",
        );

        Self {
            current_chunk: AtomicPtr::new(Self::EMPTY_ARENA_SENTINEL),
            current_chunk_end: AtomicPtr::new(Self::EMPTY_ARENA_SENTINEL),
            next_to_alloc: AtomicPtr::new(Self::EMPTY_ARENA_SENTINEL),
            capacity,
            allocated_chunks: UnsafeCell::new(Vec::new()),
        }
    }

    fn alloc_new_chunk(capacity: usize) -> (*mut T, *mut T) {
        let layout = Layout::array::<T>(capacity).expect("too big capacity");
        // SAFETY: We verified in `with_capacity()` that the capacity is not zero and the type is not a ZST.
        let chunk = unsafe { std::alloc::alloc(layout) }.cast::<T>();
        if chunk.is_null() {
            // We cannot call `handle_alloc_error()` because it can cause a recoverable panic but we are left
            // in invalid state as the current chunk is duplicated on `current_chunk` and `allocated_chunks`.
            // We could recover to a valid state, this is a matter of putting a guard in the right place, but
            // I didn't bother as this is not production code.
            std::process::abort();
        }

        // Shouldn't happen, but just to be extra safe.
        assert_ne!(
            chunk,
            Self::EMPTY_ARENA_SENTINEL,
            "got allocation at address {}??",
            Self::EMPTY_ARENA_SENTINEL as usize,
        );

        // SAFETY: It is one past the end.
        let chunk_end = unsafe { chunk.add(capacity) };
        (chunk, chunk_end)
    }

    #[inline]
    fn alloc_place(&self) -> *mut T {
        let mut next_to_alloc = self.next_to_alloc.load(Ordering::Acquire);
        // We can get an incorrect `chunk_end` during allocation of a new chunk. However,
        // if this happens `next_to_alloc` will be null, so we will spin loop until the
        // chunk has been fully allocated then update `chunk_end`.
        let mut chunk_end = self.current_chunk_end.load(Ordering::Relaxed);
        loop {
            if next_to_alloc.is_null() {
                // Somebody is allocating a new chunk, wait for them.
                loop {
                    std::hint::spin_loop();
                    next_to_alloc = self.next_to_alloc.load(Ordering::Relaxed);
                    if !next_to_alloc.is_null() {
                        break;
                    }
                }
                // fence(Ordering::Acquire);
                // For thread sanitizer, as it doesn't support fences.
                self.next_to_alloc.load(Ordering::Acquire);
                chunk_end = self.current_chunk_end.load(Ordering::Relaxed);
            }

            // This must come after the previous check, so that if we were in the middle of allocating
            // a new chunk but it quickly ran out of space before we had a chance to use it we will
            // allocate a new chunk.
            if next_to_alloc == chunk_end {
                return self.grow_by_chunk(next_to_alloc);
            }

            // SAFETY: This is either in bounds or one past the end.
            let after = unsafe { next_to_alloc.add(1) };
            // `Ordering::Relaxed` is not enough for the success ordering, we need `Release`. This is
            // because we need to form a happens-before relationship between all updates of `next_to_alloc`
            // and the load of it as part of the `compare_exchange()` in `grow_by_chunk()`. If the
            // stores will not happen-before this load, we may observe a corrupted `chunk_end`
            // in `alloc_place()` because of later `grow_by_chunk()`.
            match self.next_to_alloc.compare_exchange_weak(
                next_to_alloc,
                after,
                Ordering::Release,
                Ordering::Acquire,
            ) {
                Ok(_) => return next_to_alloc,
                Err(v) => next_to_alloc = v,
            }
        }
    }

    // Allocates a new chunk and does all bookkeeping.
    #[cold]
    #[inline(never)]
    fn grow_by_chunk(&self, expected_next_to_alloc: *mut T) -> *mut T {
        if self
            .next_to_alloc
            .compare_exchange(
                expected_next_to_alloc,
                std::ptr::null_mut(),
                Ordering::Acquire,
                Ordering::Relaxed,
            )
            .is_err()
        {
            // If the chunk is not equal to the chunk end anymore, that means someone was faster than
            // us and already took care to allocate a new chunk. So let's wait for it to finish
            // (`alloc_place()` will do that) and retry allocating (it cannot be someone who further
            // increased `next_to_alloc` because once we reached the chunk end we don't advance anymore).
            return self.alloc_place();
        }

        let (chunk, chunk_end) = Self::alloc_new_chunk(self.capacity);
        let result = chunk;
        // SAFETY: We verify in `with_capacity` that the capacity is at least one, so this is either
        // in bounds or one past the bounds.
        let chunk_rest = unsafe { chunk.add(1) };

        let old_chunk = self.current_chunk.load(Ordering::Relaxed);
        if old_chunk != Self::EMPTY_ARENA_SENTINEL {
            // SAFETY: We're holding a "lock" as `next_to_alloc` is null and we have exclusive access
            // to `allocated_chunks` during this time period.
            unsafe {
                (*self.allocated_chunks.get()).push(old_chunk);
            }
        }

        self.current_chunk.store(chunk, Ordering::Relaxed);
        self.current_chunk_end.store(chunk_end, Ordering::Relaxed);
        self.next_to_alloc.store(chunk_rest, Ordering::Release);

        result
    }

    #[inline]
    pub fn alloc(&self, v: T) -> &mut T {
        let place = self.alloc_place();
        // SAFETY: `alloc_place()` always gives us a valid pointer.
        unsafe {
            place.write(v);
            &mut *place
        }
    }
}

impl<T> Drop for Arena<T> {
    fn drop(&mut self) {
        let layout = Layout::array::<T>(self.capacity).unwrap();
        self.allocated_chunks
            .get_mut()
            .iter()
            .copied()
            .chain(
                Some(*self.current_chunk.get_mut())
                    .filter(|&chunk| chunk != Self::EMPTY_ARENA_SENTINEL),
            )
            .for_each(|chunk| {
                // SAFETY: We allocated each chunk with the global allocator and `Layout::array::<T>(self.capacity)`.
                unsafe {
                    std::alloc::dealloc(chunk.cast::<u8>(), layout);
                }
            });
    }
}

fn main() {
    let arena = Arena::new();

    dbg!(arena.alloc(123u64));

    let arena = &arena;
    let counter = &std::sync::atomic::AtomicU64::new(0);
    const RANGE: u64 = 10_000_000;
    std::thread::scope(|s| {
        for i in 0..10 {
            s.spawn(move || {
                counter.fetch_add(
                    (i * RANGE..(i * RANGE) + RANGE)
                        .map(|j| arena.alloc(j))
                        .collect::<Vec<_>>()
                        .into_iter()
                        .map(|&mut v| v)
                        .sum::<u64>(),
                    Ordering::Relaxed,
                )
            });
        }
    });
    dbg!(counter.load(Ordering::Relaxed));
}

I also tried to replace all orderings with SeqCst but it still sometimes crashes, so this is probably not a problem with the ordering.

In addition, I will also appreciate any bug report/other comments.

So there was an ordering bug, Ordering::Relaxed should be replaced by Ordering::Release for the success case of the compare_exchange_weak() in alloc_place() (I explained why in a comment). But it still crashes, even though thread sanitizer no longer reports a data race...

Address sanitizer reports that there is an attempt to write 8 bytes exactly at the end of one of the chunks, so I believe the reason is that chunk_end has an incorrect value and we allocate at the chunk end.

I think I understood the problem. We may load next_to_alloc and chunk_end that are incompatible since we don't update the chunk_end after we load the new next_to_alloc, and so if a chunk allocation has completed behind our back we will see a different next_to_alloc but keep the same chunk_end.

Have you tried running your code with Miri? I immediately got an out of bounds write:

[src/main.rs:209] arena.alloc(123u64) = 123
error: Undefined Behavior: out-of-bounds pointer arithmetic: alloc1335516 has size 8000, so pointer to 8 bytes starting at offset 8000 is out-of-bounds
   --> src/main.rs:114:34
    |
114 |             let after = unsafe { next_to_alloc.add(1) };
    |                                  ^^^^^^^^^^^^^^^^^^^^ out-of-bounds pointer arithmetic: alloc1335516 has size 8000, so pointer to 8 bytes starting at offset 8000 is out-of-bounds
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
    = note: BACKTRACE:
    = note: inside `Arena::<u64>::alloc_place` at src/main.rs:114:34: 114:54
**snip**
2 Likes

I ran it with much smaller iterations. How many times did you run it? But I think this refers to the problem I fixed, here's the fixed code:

#![forbid(unsafe_op_in_unsafe_fn, rust_2018_idioms)]

use std::alloc::Layout;
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicPtr, Ordering};

/// A non-dropping thread-safe arena. Not entirely lock-free as it spins if another
/// thread was in the middle of allocating a page.
pub struct Arena<T> {
    current_chunk: AtomicPtr<T>,
    current_chunk_end: AtomicPtr<T>,
    /// `null` if we are currently allocating a new chunk.
    next_to_alloc: AtomicPtr<T>,
    /// The number of `T`s every chunk can contain.
    capacity: usize,
    allocated_chunks: UnsafeCell<Vec<*mut T>>,
}

// SAFETY: This is the point.
unsafe impl<T: Send> Sync for Arena<T> {}

const DEFAULT_CAPACITY: usize = 1000;

impl<T> Arena<T> {
    const EMPTY_ARENA_SENTINEL: *mut T = 1 as *mut T;

    #[inline]
    pub const fn new() -> Self {
        Self::with_capacity(DEFAULT_CAPACITY)
    }

    #[inline]
    #[track_caller]
    /// Takes the number of `T`s a chunk will contain.
    pub const fn with_capacity(capacity: usize) -> Self {
        // This is not required, but put here for better error reporting. This is equivalent to
        // `Layout::array::<T>(capacity).unwrap()`, but const.
        match capacity.checked_mul(std::mem::size_of::<T>()) {
            Some(v) if v > isize::MAX as usize => panic!("too big capacity"),
            None => panic!("too big capacity"),
            _ => {}
        }

        assert!(capacity > 0, "capacity cannot be zero");
        assert!(
            std::mem::size_of::<T>() != 0,
            "zero-sized types are not supported",
        );

        Self {
            current_chunk: AtomicPtr::new(Self::EMPTY_ARENA_SENTINEL),
            current_chunk_end: AtomicPtr::new(Self::EMPTY_ARENA_SENTINEL),
            next_to_alloc: AtomicPtr::new(Self::EMPTY_ARENA_SENTINEL),
            capacity,
            allocated_chunks: UnsafeCell::new(Vec::new()),
        }
    }

    fn alloc_new_chunk(capacity: usize) -> (*mut T, *mut T) {
        let layout = Layout::array::<T>(capacity).expect("too big capacity");
        // SAFETY: We verified in `with_capacity()` that the capacity is not zero and the type is not a ZST.
        let chunk = unsafe { std::alloc::alloc(layout) }.cast::<T>();
        if chunk.is_null() {
            // We cannot call `handle_alloc_error()` because it can cause a recoverable panic but we are left
            // in invalid state as the current chunk is duplicated on `current_chunk` and `allocated_chunks`.
            // We could recover to a valid state, this is a matter of putting a guard in the right place, but
            // I didn't bother as this is not production code.
            std::process::abort();
        }

        // Shouldn't happen, but just to be extra safe.
        assert_ne!(
            chunk,
            Self::EMPTY_ARENA_SENTINEL,
            "got allocation at address {}??",
            Self::EMPTY_ARENA_SENTINEL as usize,
        );

        // SAFETY: It is one past the end.
        let chunk_end = unsafe { chunk.add(capacity) };
        (chunk, chunk_end)
    }

    #[inline]
    fn alloc_place(&self) -> *mut T {
        let mut next_to_alloc = self.next_to_alloc.load(Ordering::Acquire);
        // We can get an incorrect `chunk_end` during allocation of a new chunk. However,
        // if this happens `next_to_alloc` will be null, so we will spin loop until the
        // chunk has been fully allocated then update `chunk_end`.
        let mut chunk_end = self.current_chunk_end.load(Ordering::Relaxed);
        loop {
            if next_to_alloc.is_null() {
                // Somebody is allocating a new chunk, wait for them.
                loop {
                    std::hint::spin_loop();
                    next_to_alloc = self.next_to_alloc.load(Ordering::Relaxed);
                    if !next_to_alloc.is_null() {
                        break;
                    }
                }
                fence(Ordering::Acquire);
                chunk_end = self.current_chunk_end.load(Ordering::Relaxed);
            }

            // This must come after the previous check, so that if we were in the middle of allocating
            // a new chunk but it quickly ran out of space before we had a chance to use it we will
            // allocate a new chunk.
            if next_to_alloc == chunk_end {
                return self.grow_by_chunk(next_to_alloc);
            }

            // SAFETY: This is either in bounds or one past the end.
            let after = unsafe { next_to_alloc.add(1) };
            // `Ordering::Relaxed` is not enough for the success ordering, we need `Release`. This is
            // because we need to form a happens-before relationship between the last update of
            // `next_to_alloc` and the load of it as part of the `compare_exchange()` in
            // `grow_by_chunk()`. If the store will not happen-before this load, we may observe a
            // corrupted `chunk_end` in `alloc_place()` because of later `grow_by_chunk()`.
            match self.next_to_alloc.compare_exchange_weak(
                next_to_alloc,
                after,
                Ordering::Release,
                Ordering::Acquire,
            ) {
                Ok(_) => return next_to_alloc,
                Err(v) => {
                    next_to_alloc = v;
                    // We need to update `chunk_end` too, because maybe someone allocated a complete
                    // chunk behind our back and already swapped `next_to_alloc` and this is the new
                    // value we're seeing. Even though the `chunk_end` and `next_to_alloc` may still not
                    // synchronized if somebody will allocate a new chunk after the loading of
                    // `next_to_alloc` but before the loading of `current_chunk_end` and now `chunk_end`
                    // will point at the new chunk while `next_to_alloc` is still pointing at the old
                    // chunk, this doesn't matter, when trying to compare-and-exchange `next_to_alloc`
                    // we'll notice it has changed and reload `chunk_end`.
                    chunk_end = self.current_chunk_end.load(Ordering::Relaxed);
                }
            }
        }
    }

    // Allocates a new chunk and does all bookkeeping.
    #[cold]
    #[inline(never)]
    fn grow_by_chunk(&self, expected_next_to_alloc: *mut T) -> *mut T {
        if self
            .next_to_alloc
            .compare_exchange(
                expected_next_to_alloc,
                std::ptr::null_mut(),
                Ordering::Acquire,
                Ordering::Relaxed,
            )
            .is_err()
        {
            // If the chunk is not equal to the chunk end anymore, that means someone was faster than
            // us and already took care to allocate a new chunk. So let's wait for it to finish
            // (`alloc_place()` will do that) and retry allocating (it cannot be someone who further
            // increased `next_to_alloc` because once we reached the chunk end we don't advance anymore).
            return self.alloc_place();
        }

        let (chunk, chunk_end) = Self::alloc_new_chunk(self.capacity);
        let result = chunk;
        // SAFETY: We verify in `with_capacity` that the capacity is at least one, so this is either
        // in bounds or one past the bounds.
        let chunk_rest = unsafe { chunk.add(1) };

        let old_chunk = self.current_chunk.load(Ordering::Relaxed);
        if old_chunk != Self::EMPTY_ARENA_SENTINEL {
            // SAFETY: We're holding a "lock" as `next_to_alloc` is null and we have exclusive access
            // to `allocated_chunks` during this time period.
            unsafe {
                (*self.allocated_chunks.get()).push(old_chunk);
            }
        }

        self.current_chunk.store(chunk, Ordering::Relaxed);
        self.current_chunk_end.store(chunk_end, Ordering::Relaxed);
        self.next_to_alloc.store(chunk_rest, Ordering::Release);

        result
    }

    #[inline]
    pub fn alloc(&self, v: T) -> &mut T {
        let place = self.alloc_place();
        // SAFETY: `alloc_place()` always gives us a valid pointer.
        unsafe {
            place.write(v);
            &mut *place
        }
    }
}

impl<T> Drop for Arena<T> {
    fn drop(&mut self) {
        let layout = Layout::array::<T>(self.capacity).unwrap();
        self.allocated_chunks
            .get_mut()
            .iter()
            .copied()
            .chain(
                Some(*self.current_chunk.get_mut())
                    .filter(|&chunk| chunk != Self::EMPTY_ARENA_SENTINEL),
            )
            .for_each(|chunk| {
                // SAFETY: We allocated each chunk with the global allocator and `Layout::array::<T>(self.capacity)`.
                unsafe {
                    std::alloc::dealloc(chunk.cast::<u8>(), layout);
                }
            });
    }
}

fn main() {
    let arena = Arena::new();

    dbg!(arena.alloc(123u64));

    let arena = &arena;
    let counter = &std::sync::atomic::AtomicU64::new(0);
    const RANGE: u64 = 10_000_000;
    std::thread::scope(|s| {
        for i in 0..10 {
            s.spawn(move || {
                counter.fetch_add(
                    (i * RANGE..(i * RANGE) + RANGE)
                        .map(|j| arena.alloc(j))
                        .collect::<Vec<_>>()
                        .into_iter()
                        .map(|&mut v| v)
                        .sum::<u64>(),
                    Ordering::Relaxed,
                )
            });
        }
    });
    dbg!(counter.load(Ordering::Relaxed));
}

Edit: Oops, not. Ran it and got the same error. Another thing to fix :slight_smile:

Edit 2: Ah, I got the problem. I used add where I should've used wrapping_add().

let chunk_end = unsafe { chunk.add(capacity) };

This can overflow. An array of length capacity fitting in memory doesn't mean that an array one larger (containing the end pointer) also fits.

// If the chunk is not equal to the chunk end anymore, that means someone was faster than
// us and already took care to allocate a new chunk. So let's wait for it to finish
// (`alloc_place()` will do that) and retry allocating (it cannot be someone who further
// increased `next_to_alloc` because once we reached the chunk end we don't advance anymore).
return self.alloc_place();

This looks like it can lead to infinite recursion and stack overflow under contention.

pub fn alloc(&self, v: T) -> &mut T;

This means that the value stored in the allocator will never be dropped. I guess it's not a big deal: it's a bump allocator which keeps all data alive until the arena is dropped. But for correctness' sake I would either include the contents drop logic in Drop impl, or restrict T: Copy, to ensure there is no drop glue.

allocated_chunks: UnsafeCell<Vec<*mut T>>,

You can just use Cell here, since you only modify it under lock. Use Cell::take, push the chunk, use Cell::set. And if you mess up your synchronization, at least you'll get a memory leak rather than a data race.

This could have a tiny performance penalty due to extra memory copies, but it should be negligible compared to a chunk allocation.

2 Likes
// fence(Ordering::Acquire);
// For thread sanitizer, as it doesn't support fences.
self.next_to_alloc.load(Ordering::Acquire);

Are you sure it acts as a fence, and isn't removed as a useless read? I don't see anything on that topic in its docs.

Actually, I'm not sure fence would have any effect here at all. The docs say

A fence 'A' which has (at least) [Release] ordering semantics, synchronizes
with a fence 'B' with (at least) [Acquire] semantics, if and only if there
exist operations X and Y, both operating on some atomic object 'M' such
that A is sequenced before X, Y is sequenced before B and Y observes
the change to M. This provides a happens-before dependence between A and B.

But there are no atomic operations which would synchronize this fence with anything else (by the way, shouldn't there also be a fence(Release) somewhere? the docs are unclear whether a fence can synchronize with atomics without any other matching fences, I suppose it can't).

The atomic operation preceding this fence is

next_to_alloc = self.next_to_alloc.load(Ordering::Relaxed);

but Ordering::Relaxed doesn't provide any synchronization. It doesn't synchronize with Release or Acquire, so I have a feeling that it can return unexpected values here (e.g. non-monotonic, in the allocation order, sequence of chunk pointers).

It feels very iffy to access the same variable with both AcqRel and Relaxed semantics. Is it ever sound? I doubt it, but perhaps you have a reference which says so. There doesn't seem anything preventing the processor, for example, from always returning the same value in that load, assuming that you have only 2 running threads, so one of them spinloops on that Relaxed load, while the other sets the variables via normal Acquire/Release operations.

This is one past the end of an allocated object, which is allowed. And it can't overflow because no allocated object can be bigger than isize::MAX.

I have a feeling it should optimize to a tail call? Not sure. Anyway, this will only be retrying for long times in a very heavy contention and very small chunk size, a very unrealistic scenario.

The comment on the top says explicitly that it does not drop. I don't want to restrict it to T: Copy and not even to assert!(!needs_drop::<T>()) since it can also be useful to allocate a Drop-needing object with some kind of wrapper responsible for the dropping. This is the same situation as in bumaplo.

I don't see a big difference, since Cell would still have a data race if I mess with the lock.

I took this trick from std, so it probably works :slight_smile: .

It should not synchronize with this read, but with the Release store to next_to_alloc in grow_by_chunk(). The problem this comes to solve is that after the spinning, I am guaranteed that I'm loading the new, correct value for next_to_alloc, but I can't know about current_chunk_end - it could be that someone (compiler, processor) is arranging the writes so that the update to current_chunk_end in grow_by_chunk() becomes after the update of next_to_alloc and after the read of current_chunk_end in alloc_place(), and I'm reading the old value. To eliminate this risk I have this fence, which forms a happens-before relationship between the store to next_to_alloc in grow_by_chunk() and the load of it in alloc_place(), so now the load of current_chunk_end that is after the load of next_to_alloc is guaranteed to come after the store to next_to_alloc that is guaranteed to come before the store to current_chunk_end.

Essentially, this fulfills the same role as if the load you mentioned was Acquire instead of Relaxed, but more efficiently, because instead of constantly spinning with Acquire we spin with Relaxed and Acquire only once.

About the synchronization of fences with regular atomic accesses, my information comes from Mara's book (which I highly recommend).

Yes, this is fine. Any ordering can observe the result of any other ordering on the same variable. The only difference is with other things, that are not the variable itself.

That's true only on 64 bit, but I guess it's reasonable to ignore in exercise code.

Doubt it. Rust doesn't have any guaranteed tail call elimintations, and LLVM has very unreliable behaviour on that part. Tail calls aren't a common thing in C/C++/ObjC.

You just need heavy contention. Have 10 000 threads, and it's easy to hit. Your thread can also get unluckily preempted a number of times. Your synchronization is basically spinlock-based, which can lead to some nasty priority inversion. One thread gets the lock, is preempted. A hundred of other threads all get a fair time slice, but can't allocate anything since the chunk is full, and so all block on the spinlock, exhausting their time slices. The locking thread finally gets a time share, unlocks the allocator, does some work. Some other 100 threads do some work and exhaust the chunk, the locking thread gets preempted. The first 100 threads get scheduled, loop on a spinlock, exhaust their time slices, maybe recurse due to glitchy Relaxed loads. Etc until you get a stack overflow, or at least a very long livelock.

Bumpalo is a bit different, because it allocates chunks of arbitrary size. The liveness and drop glue of allocated objects must always be tracked externally, bumpalo has literally no way to do it, and no way to drop the contents when it goes out of scope. But note that bumpalo::Box provides a way to do allocation with proper resource semantics.

Your example is more like typed-arena, which serves as a growable storage for items of a single type. Since the type is known, the container can trivially deallocate the allocated objects when it is dropped. Although I see that typed-arena also doesn't do drop on deallocation... At least there is into_vec method which allows to do it.

Cell is safe, you can't get a data race on it using its safe methods. You could get an ordinary race where two threads update the field at the same time, resulting in some information loss, but it should be easier to debug.

No, this is true on all targets Rust supports.

10,000 threads is already rare, 10,000 threads that do enough allocation so there will be always contention - basically just allocating - is extremely rare and unrealistic.

Yes, you're right. I part that was omitted from my answer wrongly was "but dropping in Drop is a good idea, I haven't thought about that".

But I have a unsafe impl Sync and I access it concurrently from different threads. If different threads try to take() it at the same time (or put back, or one take and one put back) we got a data race, not a race condition, as the operation is not atomic