Atomicity, granularity, and Send/Sync: parallel execution over vectors/slices

[ETA: I connected with some hardware architecture experts, who reassured me that this wasn't a problem.]

Suppose I have two threads that are updating the contents of a
`Vec<u8>`. If the two threads update areas that do not share any
words in common, then they don't need any synchronization. But
let's say that thread #1 updates `&v[0..5]` and thread #2 updates
`&v[5..8]` of an eight-byte vector.

I remember from back when I worked on the JVM that there were
important constraints on this sort of thing; in short, you had to
act as if these threads were updating the same word, not
non-overlapping sequences of bytes.

I'm not a Rust expert, and certainly not even an LLVM newbie. So I
don't know where to start looking to learn whether this sort of thing
is a problem or not, but geez, at least from what I've seen of the way
Rust allows me to cut a slice apart into two subslices at basically
any point, it doesn't appear that Rust makes any provision for this.

I should add that it's not a big deal: as long as you deal with stuff
at word size (64-bit quantities) and up, it's simply a non-problem.
But I do wonder whether this problem actually occurs for smaller
quantities.

It would seem somewhat straightforward to solve in the internals of
`copy_from_slice`: just do a little address arithmetic at the
endpoints of the destination slice to figure out whether the ends are
partial words, and if so, use compare-and-swap to move the data
there.
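For concreteness, here is a rough sketch of the kind of endpoint handling I have in mind. It is purely hypothetical (the function name and all details are mine, for illustration only), and per the ETA above, unnecessary:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative sketch only: splice `new_bytes` into the word containing
// `dst` via a CAS loop, so that only the addressed bytes are logically
// modified even though the hardware operation is word-wide.
unsafe fn cas_store_partial_word(dst: *mut u8, new_bytes: &[u8]) {
    let word_size = std::mem::size_of::<usize>();
    let addr = dst as usize;
    let word_addr = addr & !(word_size - 1); // align down to a word boundary
    let offset = addr - word_addr;           // byte offset of `dst` in that word
    assert!(offset + new_bytes.len() <= word_size);

    let word = &*(word_addr as *const AtomicUsize);
    let mut current = word.load(Ordering::Relaxed);
    loop {
        // Overlay the new bytes onto a copy of the current word contents.
        let mut bytes = current.to_ne_bytes();
        bytes[offset..offset + new_bytes.len()].copy_from_slice(new_bytes);
        let updated = usize::from_ne_bytes(bytes);
        match word.compare_exchange_weak(current, updated, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => break,
            Err(actual) => current = actual, // another thread raced us; retry
        }
    }
}
```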

Lest this seem completely wacky, I guess I should explain how this
came up. I've written a fast "concat slices" function that uses
rayon to parallelize copying from source slices into subslices of a
destination vector. I'm only using it on word-size-and-larger
element types, but it occurred to me that maybe one ought to check
that it would work correctly on smaller element types.
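Here's a minimal sketch of the shape of that function (not my actual code; the name `par_concat` and the `T: Default` bound, used only to pre-fill the destination, are conveniences for the illustration):

```rust
use rayon::prelude::*;

fn par_concat<T: Copy + Default + Send + Sync>(sources: &[&[T]]) -> Vec<T> {
    let total: usize = sources.iter().map(|s| s.len()).sum();
    let mut out = vec![T::default(); total];

    // Carve the destination into disjoint &mut [T] subslices, one per source.
    let mut chunks = Vec::with_capacity(sources.len());
    let mut rest = out.as_mut_slice();
    for src in sources {
        let (head, tail) = std::mem::take(&mut rest).split_at_mut(src.len());
        chunks.push(head);
        rest = tail;
    }

    // Each rayon task copies into its own disjoint subslice.
    chunks
        .into_par_iter()
        .zip(sources.par_iter())
        .for_each(|(dst, src)| dst.copy_from_slice(src));

    out
}
```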

Hence my question.

ETA: of course, my suggestion for what to do in `copy_from_slice` only
works for that function; the more general issue of when two adjacent
`&mut [u8]` slices won't interfere if they're being modified by different
threads seems… complicated, at least.

I'm not sure what you are asking. If you have two disjoint `&mut [T]`, you can send both to different threads and update them concurrently (even in parallel). The type system of Rust (in particular, the non-overlap guarantee of `&mut`) ensures that there is no need for any special handling in `copy_from_slice()` or anywhere else. If your code is safe (i.e., no `unsafe`) and it typechecks, then it's also free of data races.
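For example, something like this is guaranteed to be fine, splitting an eight-byte buffer mid-word at byte 5 just like in your question (a minimal sketch using `split_at_mut` and the `std::thread::scope` API):

```rust
use std::thread;

fn main() {
    let mut v = vec![0u8; 8];
    // Two disjoint &mut [u8], split mid-word at byte 5.
    let (left, right) = v.split_at_mut(5);
    thread::scope(|s| {
        s.spawn(move || left.copy_from_slice(&[1, 2, 3, 4, 5]));
        s.spawn(move || right.copy_from_slice(&[6, 7, 8]));
    });
    assert_eq!(v, [1, 2, 3, 4, 5, 6, 7, 8]);
}
```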


One of Rust’s safety guarantees is that safe code cannot cause data races. If I understand your question correctly, you’re asking whether two threads accessing the same word (but not the same bytes within that word) in parallel can cause a data race. The answer is of course no, because, as you noticed, you can split up e.g. a `&mut [u8]` into two parts at arbitrary byte boundaries and use the parts in two threads in parallel, all in safe code, using rayon or scoped threads (which are also coming to the standard library very soon).

I don’t know about LLVM either, but if there were any problems like you describe, that would be considered a bug in the Rust compiler and would get fixed. Rust takes its safety guarantees pretty seriously.

I don’t know about the JVM situation you describe; perhaps it’s some optimization where copies of data always operate word by word or the like. Of course, that wouldn’t be allowed in Rust; methods such as `copy_from_slice` are all implemented in a way that doesn’t access any data beyond what you’re allowed to access. This function in particular is implemented using `ptr::copy_nonoverlapping`, which seems to be essentially a compiler intrinsic (i.e. a function magically provided by the compiler itself / baked into the language).

You can see how that function compiles by looking at the LLVM IR or assembly in the playground:

(source)

pub unsafe fn copy_u8(src: *const u8, dst: *mut u8, count: usize) {
    std::ptr::copy_nonoverlapping(src, dst, count)
}

(LLVM IR)

; ModuleID = 'playground.97fed5fe-cgu.0'
source_filename = "playground.97fed5fe-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; playground::copy_u8
; Function Attrs: mustprogress nofree nosync nounwind nonlazybind uwtable willreturn
define void @_ZN10playground7copy_u817h30a98a53a5a06a70E(i8* nocapture readonly %src, i8* nocapture writeonly %dst, i64 %count) unnamed_addr #0 {
start:
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %dst, i8* align 1 %src, i64 %count, i1 false) #2
  ret void
}

; Function Attrs: argmemonly mustprogress nofree nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #1

attributes #0 = { mustprogress nofree nosync nounwind nonlazybind uwtable willreturn "probe-stack"="__rust_probestack" "target-cpu"="x86-64" }
attributes #1 = { argmemonly mustprogress nofree nounwind willreturn }
attributes #2 = { nounwind }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 7, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}

(assembly)

playground::copy_u8:
	movq	%rdi, %rax
	movq	%rsi, %rdi
	movq	%rax, %rsi
	jmpq	*memcpy@GOTPCREL(%rip)

As you can see, this goes through the LLVM intrinsic `llvm.memcpy` and ultimately tail-calls the system `memcpy` function.

Based on my explanations above, I have reason to believe that memcpy does respect the boundaries of the area to be copied and does not access anything beyond it in any form. Regarding your idea to “just do a little address-arithmetic at the endpoints of the destination slice to figure out if the ends are partial words, and if so, then use compare-and-swap to move the data there”: note that CPUs do have instructions for loading or storing single bytes. [1] Of course, an optimized memcpy implementation will not operate byte by byte for the bulk of the operation, but I assume that at the boundaries it does (and has to) operate potentially even on single bytes in this manner, in order to prevent data races.
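For instance, a plain byte store in Rust uses exactly such a byte-wide instruction. A minimal example (my addition, easy to verify in the playground) that should compile in release mode to a single `movb` on x86-64:

```rust
// Expected to compile to a single byte-wide store (`movb` on x86-64),
// touching only the one addressed byte, with no read-modify-write of
// the surrounding word.
pub fn store_byte(dst: &mut u8, value: u8) {
    *dst = value;
}
```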

(On the other hand, it may well be the case that CPUs need to use effectively “atomic operations” internally to implement such single-byte memory accesses, or perhaps even all memory accesses; at the very least, the way caching works is probably somewhat comparable to atomic operations anyway.)


  1. according to e.g. the first answer in this stackoverflow thread ↩︎

Maybe this has something to do with atomic support across different hardware? In Rust, that's something you opt into, and some sizes aren't available on some platforms (but `AtomicU8` is supported on most, whether natively or by emulation).
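For reference, a minimal sketch of that opt-in, using `AtomicU8` and scoped threads (the specific orderings here are just for illustration):

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::thread;

fn main() {
    // Byte-granular atomic access is explicit and portable via AtomicU8.
    let flag = AtomicU8::new(0);
    thread::scope(|s| {
        s.spawn(|| flag.store(1, Ordering::Release));
        s.spawn(|| {
            // Reads 0 or 1; never a torn value, and never touches
            // neighboring bytes.
            let _seen = flag.load(Ordering::Acquire);
        });
    });
    assert_eq!(flag.load(Ordering::Relaxed), 1);
}
```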

I finally connected with some hardware-architecture experts, who reassured me that the cache-coherence hardware will "do the right thing" in detail. So my worry was unfounded. Sorry for the ruckus.
