Let's say I have a largish data structure (mostly a struct with some Vecs in it) and share it read-only between multiple threads by putting it into an Arc and giving each thread a clone of the Arc. The number of threads would probably be the number of CPUs, and there is a possibility I might be using large machines (64?). Each thread will do a lot of reading from the data.
My question is how will the threads perform compared to each having their own copy of the data?
Since the data can only be read and doesn't change, I'm hoping different threads can read it without noticing each other. Is it so?
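For concreteness, here is a minimal sketch of the setup being asked about, with a made-up `Data` struct standing in for the real structure; each thread gets its own clone of the Arc and then only reads:

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for the "largish" structure from the question.
struct Data {
    values: Vec<u64>,
}

fn main() {
    let data = Arc::new(Data { values: (0..1_000).collect() });

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let data = Arc::clone(&data); // one refcount bump per thread
            thread::spawn(move || {
                // Plain reads, no locking: the threads only ever take
                // shared references into the immutable data.
                data.values.iter().sum::<u64>()
            })
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), 499_500);
    }
}
```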
Compared to each thread owning its own copy, Arc adds two potential overheads:

a) Bumping the reference count when it is cloned and decrementing it when a clone goes out of scope, which is probably unnoticeable for a thread working on a big chunk of data.
b) A double indirection when reading what is behind it: first to the Arc, which is on the heap, then to the struct it points to. Again, probably not a significant overhead, depending on what your threads are doing.
Rust's Arc is different from C++'s shared_ptr: there's no double indirection. The data is colocated with the reference counter.
EDIT: C++ shared_ptr is also only one level of indirection, but the data is not always allocated together with the reference counter (search: aliasing constructor). So C++ shared_ptr is always "fat", while Rust's Arc is fat for DSTs and thin for Sized types.
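That thin-vs-fat distinction is directly observable with `size_of`; a small check (the types chosen here are just for illustration):

```rust
use std::mem::size_of;
use std::sync::Arc;

fn main() {
    // Arc<T> for a Sized T is a single (thin) pointer into one
    // allocation that holds both refcounts and the data.
    assert_eq!(size_of::<Arc<u64>>(), size_of::<usize>());

    // For a DST like a slice, the Arc must also carry the length,
    // so it is a "fat" two-word pointer.
    assert_eq!(size_of::<Arc<[u8]>>(), 2 * size_of::<usize>());

    println!("Arc<u64>: {} bytes, Arc<[u8]>: {} bytes",
             size_of::<Arc<u64>>(), size_of::<Arc<[u8]>>());
}
```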
Yup. For further reading, the basic idea is captured by the idea of the MESI protocol (and its various optimizations): each chunk of cache memory can be Modified, Exclusive (and not modified), Shared, or Invalid / missing.
Changing between each of these is moderately expensive, and the most expensive thing is cores constantly bouncing a Modified line between each other instead of doing useful work.
But isn't that what Arc is doing if you clone it? It looks as if Arc should be pretty expensive to clone. Well… relatively speaking: about 50 or 100 simple operations' worth, still many times cheaper than an actual DRAM access, but much slower than Rc in the uncontended case.
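One common way to keep that clone cost (and the shared counter cache line) off the hot path is to clone the Arc exactly once per thread at spawn time and then work through a plain reference inside the loop. A sketch, with made-up data:

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    let shared = Arc::new(vec![1u64; 1_000]);

    let handles: Vec<_> = (0..4)
        .map(|_| {
            // One refcount bump per thread, at spawn time...
            let local = Arc::clone(&shared);
            thread::spawn(move || {
                // ...then read through a plain slice reference, so the
                // contended counter cache line is never touched again
                // until the clone is dropped at thread exit.
                let data: &[u64] = &local;
                data.iter().sum::<u64>()
            })
        })
        .collect();

    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(total, 4_000);
}
```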
And the first Vec would be in the same cache line as the counters... which means the first Vec would be slow to read and the others fast. Maybe better to use repr(C) and put something else at the beginning of your type.
Don't do that without benchmarking, though: all these edge extra-close-to-the-metal cases are very fragile and unpredictable.
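A related benchmark-first idea is to over-align the payload so it starts on its own cache line, pushing it away from the counters at the front of the Arc allocation. This is a sketch, not a recommendation: the 64-byte line size is an assumption (common on x86_64, not universal), and `Padded` is a hypothetical name.

```rust
use std::mem::align_of;
use std::sync::Arc;

// Assumed 64-byte cache line. With this alignment, the Vec header at
// the start of the struct cannot share a line with Arc's reference
// counters, which sit just before the data in the same allocation.
#[repr(align(64))]
struct Padded {
    values: Vec<u64>,
}

fn main() {
    let shared = Arc::new(Padded { values: vec![1, 2, 3] });
    assert_eq!(align_of::<Padded>(), 64);
    assert_eq!(shared.values.len(), 3);
}
```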
Ah yes, I forgot: a main memory access is indeed several times worse than a cache line trade with another CPU, and it's even worse when the page is swapped out to disk. The danger with thrashing is when you've got all the cores in the system queued up waiting for the same cache line, and you end up with low throughput despite high CPU usage.