Unsafe code review: thread local stack allocator `second_stack2` (Stacked Borrows Violation)

I've been working on second_stack2, a library which manages a thread local second stack on the heap. It can be used to store temporary slices and large values, and when trying to replace recursion with a loop. It is heavily inspired by second_stack but tries to be more efficient by only using const initialized, trivially droppable thread locals on the fast path as well as having a more flexible API.

The key type of this library is StackVec<'a, T> which acts as a typed handle to push and pop elements to the second stack. A StackVec<'a, T> can be converted into a StackBox<'a, [T]> (a library version of &'a own [T]/&'a move [T]). When the heap allocation for the second stack is full, a new allocation of twice the size is allocated and all of the contents are copied over, but the old allocation isn't freed until the thread terminates. This is necessary since any StackBoxs that were created will still be referencing the first allocation. Existing StackVecs will switched to using the new allocation so the memory does need to be copied. Unfortunately this does lead to a stacked borrows violation, although my current test suite does pass under tree borrows.

Example of stacked borrows violation:

// LEN: 0, CAP: 0, ALLOCATION#0
reserve_buffer_capacity(16);
// LEN: 0, CAP: 16, ALLOCATION#1
with_stack_vec(|mut v1: StackVec<Box<bool>>| {
    // LEN: 0, CAP: 16, ALLOCATION#1
    v1.push(Box::new(true));
    // LEN: 8, CAP: 16, ALLOCATION#1
    with_stack_vec(|mut v2: StackVec<Box<char>>| {
        // LEN: 8, CAP: 16, ALLOCATION#1
        v2.push(Box::new('🦀'));
        // LEN: 16, CAP: 16, ALLOCATION#1
        let mut s2: StackBox<[Box<char>]> = v2.into_slice();
        let s2: &mut [Box<char>] = &mut *s2;
        // s2 is pointing in to ALLOCATION#1
        with_stack_vec(|mut v3: StackVec<u8>| {
            // LEN: 16, CAP: 16, ALLOCATION#1
            v3.push(0);
            // LEN: 17, CAP: 32, ALLOCATION#2
        });
        // LEN: 16, CAP: 32, ALLOCATION#2
        assert_eq!(*s2[0], '🦀')
        // ALLOCATION#1 is still exists but stacked borrows complains the box has been moved out
    });
    // LEN: 8, CAP: 32, ALLOCATION#2
    v1.push(Box::new(false));
    // LEN: 8, CAP: 32, ALLOCATION#2
    let mut s1: StackBox<[Box<bool>]> = v1.into_slice();
    let s1: &mut [Box<bool>] = &mut *s1;
    // s1 is pointing in to ALLOCATION#2 since this is where the second box was allocated
    // it relies on the first box being copied from ALLOCATION#1 to ALLOCATION#2
    assert_eq!((*s1[0], *s1[1]), (true, false));
})
// LEN: 0, CAP: 32, ALLOCATION#2

In case your worried about how alternating pushes from two StackVecs are handled, the function passed to with_stack_vec has a Send bound, and StackVec doesn't implement Send so having two StackVecs at the same time should be impossible. Ideal the Send bound could be relaxed to a custom auto trait that is explicitly not implemented for StackVec, but hopefully the Send bound is a good enough compromise to allow things to work on stable. Theoretically if the auto trait was used and multiple copies of second_stack2 existed using both at the same time should still be sound since the two crates would use different heap allocated stacks.

I would appreciate if someone would look it over to see if I am missing any potential unsoundness.

I just realised that this is unsound when combined with SendWrapper since with_stack_vec doesn't actually use a different thread. RustBelt also seems to favour SendWrapper as the library being sound. The solution using a custom auto trait might still work, but then the library would only be usable on nightly

Doing even more reading I realized even the custom auto trait wouldn't work `Python::allow_threads` is unsound in the presence of `scoped-tls`. · Issue #3640 · PyO3/pyo3 · GitHub

I added some extra runtime checks to try to avoid the scoped-tls-hkt/SendWrapper unsoundness. I would still appreciate if someone would look it over to see if I am missing any other potential unsoundness.

Some extra context for this post, since it was a bit hard to grasp at first.

You are offering a

with_stack_vec(|mut s: StackVec<'_, T>| {
    …
})

which works by having the .push() operation on such an s increase some thread-local LEN so that further(ly nested) calls to with_stack_vec() yield a new handle which points after that new len.

Hence why, very quickly, a problem emerges:

with_stack_vec(|mut s1: StackVec<'_, u8>| {
    // capacity over the `0..` span.
    with_stack_vec(|mut s2: StackVec<'_, bool>| {
        // capacity over the `(0 + used_by_s1)..` span, _i.e._,
        // the `0..` span, since previous call has not `.push()`ed
        // anything yet.
        s2.push(false); // believes it can write to `0..`, so it writes `0b00` to `[0]`.
        s1.push(0b10); // believes it can write to `0..`, so it writes `0b10` to `[0]`.
        // believes it can read from `0..1`, so it reads from `[0]`.
        let b: bool = s2.pop().unwrap(); // UB!
    });
});

So, indeed, you wanted to prevent usage of s1 within the scope of s2, hence the auto-trait approach.

  • It does have limitations, as you have yourself observed, so, alas, you do need a runtime check here of some sort to avoid this issue. We'll go back to this later.

Now, even assuming no direct usage of s1 in that scope of s2, there appears to be a Stacked Borrows violation of aliasing invariants, since, in fact, a .push() call may require reällocating a whole new thread-local buffer, which, in turn, the current implementation performs by copying the "prefix" of the current buffer to the new bigger buffer:

// `(*BASE)[..len]` is the thread-local buffer before the reälloc,
// `(*new)[..len_or_bigger]` is the new thread-local buffer.
unsafe { copy_nonoverlapping(BASE.get(), new, len) };

In your example, you did trigger such a condition within your v3.push() call:

And the problem is that such a copy_nonoverlapping() operation expects, aliasing-wise, to have read-access to the (*BASE)[..len] range using the provenance of the pointer stored in BASE.

This is a strict parent/ancestor of whichever pointer is used in your StackVecs, such as the one obtained from doing:

So this operation invalidates any exclusivity aliasing assertion, so this s2 becomes invalidated/unusable once the v3.push()-triggered reälloc occurs.

Hence the issue.

  • Aside: debugging this was easier to spot by taking the "tag" of the problematic memory which cargo miri test reported:

      --> src/lib.rs:32:13
       |
    32 |             assert_eq!(*s2[0].0, '🦀')
       |             ^^^^^^^^^^^^^^^^^^^^^^^^^^
       |             |
       |             attempting a read access using <183562>
    
    MIRIFLAGS="-Zmiri-track-pointer-tag=183562" cargo miri test
    
    note: tracking was triggered
      --> …/second-stack2/src/core.rs:61:14
       |
    61 |     unsafe { copy_nonoverlapping(BASE.get(), new, len) };
       |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ popped tracked tag for item [Unique for <183562>] due to Read access for <182364>
    

Solutions

PhantomPinned

If you intend to keep the API exactly like you currently have, you'd need to use something like the currently-being-suggested UnsafePinned RFC. That is, StackVec<'_, T> should only offer, be it via Deref{,Mut} or .into_slice() (since StackBox itself offers such a Deref{,Mut}), access to UnsafePinned<>-wrapped Ts.

  • You can currently, very hackily, test this hypothesis, by using a DIY-UnsafePinned, e.g., by wrapping the T in question (in your example, T = Box<char>), with a PhantomPinned sibling field (e.g., new T = (Box<char>, PhantomPinned)). I tried with that and miri stops complaining, looking the other way.

Type-level restrictions of the API

I personally believe that it could be better to try and prevent such overlapping usages of parent StackVec handles, or derived handles (the &mut-ness of that s2 stuff being a form of a "hidden" handle being captured in the inner scope), within the nested with_stack_vec().

The way to do so, would be by preventing with_stack_vec() altogether from being called reëntrantly, via some runtime check there indeed.

  • Since such a check would probably be implemented using a thread-local, that alone would justify making StackVecs non-Send.

Once you do that, you can still offer reëntrant with_stack_vec() calls, but only by requiring them to operate on/start off/stem from an already existing StackVec<'_, T> handle, which they would keep invalidated for the duration of the inner scope, by requiring an exclusive borrow to it for the duration of the call.

with_stack_vec(|mut s1: StackVec<'_, …>| {
    // ^ first layer, checks thread-local, it's OK.
    if false {
        with_stack_vec(…) // <- checks thread-local, not first, would fail.
    }

    s1.push(…);
    let first = &mut s1[0];

    s1.with_stack_vec(|mut s2: StackVec<'_, …>| {
        // ^ uses the `.with…()` *method* API, it can *skip the check*
        // and unconditionally succeed
        …
        /* s1 cannot be used here, since `&mut`-borrowed. */
        /* derived handles such as `first` cannot be used either. */
    });

    …
});
1 Like

Thanks for taking the time to look at this, the stacked borrows violation is unfortunate, and unfortunately I can create a similar tree borrows violation as well:

#[test]
fn test_forum() {
    // LEN: 0, CAP: 0, ALLOCATION#0
    reserve_buffer_capacity(16);
    // LEN: 0, CAP: 16, ALLOCATION#1
    with_stack_vec(|mut v1: StackVec<Box<bool>>| {
        // LEN: 0, CAP: 16, ALLOCATION#1
        v1.push(Box::new(true));
        // LEN: 8, CAP: 16, ALLOCATION#1
        with_stack_vec(|mut v2: StackVec<Box<char>>| {
            // LEN: 8, CAP: 16, ALLOCATION#1
            v2.push(Box::new('🦀'));
            // LEN: 16, CAP: 16, ALLOCATION#1
            let mut s2: StackBox<[Box<char>]> = v2.into_slice();
            let s2: &mut [Box<char>] = &mut *s2;
            s2[0] = Box::new(' '); // Compiler may want to optimize this out
            // s2 is pointing in to ALLOCATION#1
            with_stack_vec(|mut v3: StackVec<u8>| {
                // LEN: 16, CAP: 16, ALLOCATION#1
                v3.push(0);
                // LEN: 17, CAP: 32, ALLOCATION#2
            });
            // LEN: 16, CAP: 32, ALLOCATION#2
            s2[0] = Box::new('🦀');
            // s2 is tagged as frozen because of foreign read from during copy_nonoverlapping
            // Tree borrow violation
        });
        // LEN: 8, CAP: 32, ALLOCATION#2
        v1.push(Box::new(false));
        // LEN: 8, CAP: 32, ALLOCATION#2
        let mut s1: StackBox<[Box<bool>]> = v1.into_slice();
        let s1: &mut [Box<bool>] = &mut *s1;
        // s1 is pointing in to ALLOCATION#2 since this is where the second box was allocated
        // it relies on the first box being copied from ALLOCATION#1 to ALLOCATION#2
        assert_eq!((*s1[0], *s1[1]), (true, false));
    })
    // LEN: 0, CAP: 32, ALLOCATION#2
}

Your solutions are interesting but seem to come with significant trade-offs:

PhantomPinned

I'm not sure how this would keep the API exactly the same since it would seem to require changing the return type of into_slice to return a StackBox<[UnsafePinned<T>]> which could lead to difficulties if there is an external API that requires &mut T or [T].

Type-level restrictions of the API

I'd considered that idea in second-stack-vec, but it has its own limitations. For example using the current API you could:

with_stack_vec(|mut s1| {
    v1.extend(...)
    let s1 = v1.into_slice();
    with_stack_vec(|mut s2| {
        do_something_with_stackvec_and_slice(v2, &mut * s1)
    }
}

The other advantage of having a reëntrant with_stack_vec() is that library's could make use of it without needing to change their API or worry whether their client may be calling them in an existing with_stack_vec scope.

I could be wrong about this, but it seems like even though it can cause stacked-borrows/tree-borrows violations, using second-stack2 doesn't break any optimizations, since even though the call to copy_nonoverlapping can "read" bytes that have been borrowed using StackVec::into_slice, these bytes may as well be treated as uninitialized since there value is never used.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.