Uuid by value or Uuid by reference? Help me digest this

I wrote this microbenchmark using criterion, but I wonder if I'm fooling myself somehow?

I assume some of the time, the answer to this question will be "it depends", but do I generally need to be passing Uuids by reference instead of by value in my code?

Uuid is Copy.

use criterion::*;
use uuid::Uuid;

pub fn criterion_benchmark(c: &mut Criterion) {
    let uuid: Uuid = Uuid::new_v4();
    c.bench_function("uuid references", |b| b.iter(|| by_ref(black_box(&uuid))));
    c.bench_function("uuid values", |b| b.iter(|| by_val(black_box(uuid))));
}

fn by_ref(id: &Uuid) -> u128 {
    id.as_u128()
}

fn by_val(id: Uuid) -> u128 {
    id.as_u128()
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Here are the results:

uuid references         time:   [2.1596 ns 2.2106 ns 2.2582 ns]
                        change: [-35.693% -34.510% -33.251%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  5 (5.00%) low mild

uuid values             time:   [11.945 ns 12.215 ns 12.482 ns]
                        change: [+1.5659% +3.9638% +6.2375%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) low severe
  7 (7.00%) low mild

Should I always be passing Uuid by reference? Am I drawing the wrong conclusion?

No. A Uuid is a wrapper around a [u8; 16]. It fits in two registers on most modern machines, and it's Copy. Passing it by reference is therefore probably useless in the overwhelming majority of cases.
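You can convince yourself of the size and the Copy semantics with a stand-in type (MyUuid here is a hypothetical newtype mimicking uuid::Uuid, so the snippet compiles without the crate):

```rust
use std::mem::size_of;

// Stand-in for uuid::Uuid, which is a thin wrapper around [u8; 16].
#[derive(Clone, Copy, PartialEq, Debug)]
struct MyUuid([u8; 16]);

fn take_by_val(id: MyUuid) -> MyUuid {
    id
}

fn main() {
    // 16 bytes of payload: two 64-bit registers' worth of data.
    assert_eq!(size_of::<MyUuid>(), 16);

    let id = MyUuid([7; 16]);
    // Because MyUuid is Copy, passing it by value does not move it;
    // `id` remains usable afterwards.
    let same = take_by_val(id);
    assert_eq!(id, same);
}
```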

What you are measuring here is meaningless. Since the functions basically don't do anything other than copy some bytes around, it's likely that all of your code is optimized away completely, even in the presence of black_box().

In this godbolt, you can see that the two functions are compiled to the exact same machine code, to the point where they are merged by the linker.

You must add #[inline(never)] to the functions to guarantee they will actually exist and receive arguments.

Thanks, that was very enlightening. This is why I asked in the first place.

Thanks, it didn't change the outcome of the tests.

So two follow up questions to whoever wants to answer --

  1. What is the difference between the tests, given that I'm consistently seeing the by-value one being 5x slower? Is it because there is an implicit drop at the end of by_val? Or what? I'd really like to learn.
  2. @H2CO3 What about WebAssembly? Do Uuids still have the same properties there? Registers aren't really a thing in wasm, right?

That can't possibly be the difference. Dropping Copy types must be a no-op, and the compiler enforces this, otherwise it would be terrifyingly easy to cause double-free and similar resource management errors.
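To illustrate (using a hypothetical Id newtype rather than the real Uuid): a Copy type cannot also implement Drop, so passing it by value can never run a destructor, and the original stays usable after the call:

```rust
#[derive(Clone, Copy)]
struct Id(u128);

// The compiler rejects `impl Drop for Id` while Id is Copy
// (error E0184), so there is no drop glue to run here.
fn consume(id: Id) -> u128 {
    id.0
} // dropping the Copy parameter at the end is a no-op

fn main() {
    let id = Id(42);
    let a = consume(id); // a bitwise copy is passed in
    let b = consume(id); // `id` is still valid: no move happened
    assert_eq!(a, b);
}
```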

Whatever difference you are seeing, it is an artifact of the testing setup and is completely meaningless in any sort of real-world usage.

What matters is only the physical machine the resulting code will be executed on. Browsers JIT the WebAssembly you feed them anyway, so you won't be "executing WebAssembly" – you will be executing x64 machine code on an x64 machine, ARM machine code on an ARM processor, etc. (And if not, because you are running the thing under an actual WebAssembly interpreter, then the performance of that interpretation will be orders of magnitude worse anyway, so worrying about this sort of micro-optimization will be even more meaningless than it already is.)

I want to believe you but I also want to get to the bottom of it.

I have some ideas for changes to the tests, which I will make and then rerun them.

Thanks, great info.

So I modified my code to look like this instead, which levels the playing field a bit. Obviously most of the new cost for the "uuid references" test is that it needs to construct a Uuid.

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use uuid::Uuid;

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("uuid references", |b| b.iter(|| {
        let uuid = Uuid::from_u128(0);
        by_ref(black_box(&uuid))
    }));
    c.bench_function("uuid values", |b| b.iter(|| {
        let uuid = Uuid::from_u128(0);
        by_val(black_box(uuid))
    }));
}

#[inline(never)]
fn by_ref(id: &Uuid) -> u128 {
    id.as_u128()
}

#[inline(never)]
fn by_val(id: Uuid) -> u128 {
    id.as_u128()
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Results (ran multiple times), one representative run is here:

uuid references         time:   [7.6366 ns 7.6687 ns 7.7065 ns]
                        change: [-1.5775% -1.0179% -0.4569%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

uuid values             time:   [13.790 ns 13.840 ns 13.896 ns]
                        change: [-1.1393% -0.0666% +0.8301%] (p = 0.90 > 0.05)
                        No change in performance detected.

As you can see, "uuid references" is still winning by a good margin.

But your godbolt link made me curious, so I played around a bit, and modified your code by adding two functions: main_by_ref and main_by_val which calls the respective by_ref and by_val functions: godbolt

main_by_ref generates this code:

example::main_by_ref:
        sub     rsp, 40
        xorps   xmm0, xmm0
        movaps  xmmword ptr [rsp + 16], xmm0
        lea     rax, [rsp + 16]
        mov     qword ptr [rsp + 8], rax
        mov     rdi, qword ptr [rsp + 8]
        call    qword ptr [rip + example::by_ref@GOTPCREL]
        add     rsp, 40
        ret

But main_by_val generates this monstrosity:

example::main_by_val:
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 56
        xorps   xmm0, xmm0
        movaps  xmmword ptr [rsp + 16], xmm0
        movzx   eax, byte ptr [rsp + 31]
        mov     byte ptr [rsp + 15], al
        movzx   edi, byte ptr [rsp + 30]
        movzx   r8d, byte ptr [rsp + 29]
        movzx   r9d, byte ptr [rsp + 28]
        movzx   r10d, byte ptr [rsp + 27]
        movzx   r11d, byte ptr [rsp + 26]
        movzx   ebp, byte ptr [rsp + 25]
        movzx   r14d, byte ptr [rsp + 24]
        movzx   r15d, byte ptr [rsp + 23]
        movzx   r12d, byte ptr [rsp + 22]
        movzx   r13d, byte ptr [rsp + 21]
        movzx   esi, byte ptr [rsp + 20]
        movzx   edx, byte ptr [rsp + 19]
        movzx   ebx, byte ptr [rsp + 18]
        movzx   eax, byte ptr [rsp + 16]
        movzx   ecx, byte ptr [rsp + 17]
        mov     byte ptr [rsp + 40], al
        mov     byte ptr [rsp + 41], cl
        mov     byte ptr [rsp + 42], bl
        mov     byte ptr [rsp + 43], dl
        mov     byte ptr [rsp + 44], sil
        mov     byte ptr [rsp + 45], r13b
        mov     byte ptr [rsp + 46], r12b
        mov     byte ptr [rsp + 47], r15b
        mov     byte ptr [rsp + 48], r14b
        mov     byte ptr [rsp + 49], bpl
        mov     byte ptr [rsp + 50], r11b
        mov     byte ptr [rsp + 51], r10b
        mov     byte ptr [rsp + 52], r9b
        mov     byte ptr [rsp + 53], r8b
        mov     byte ptr [rsp + 54], dil
        movzx   eax, byte ptr [rsp + 15]
        mov     byte ptr [rsp + 55], al
        lea     rdi, [rsp + 40]
        call    qword ptr [rip + example::by_val@GOTPCREL]
        add     rsp, 56
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret

So you're right, @H2CO3: both the by_ref and by_val functions generate the same assembly, but the caller generates a lot more!

So we're kind of back at square one: Is by_ref faster? Why?

I assume black_box from criterion is somehow causing all of this code to be generated when the argument isn't already a pointer, so the reply will probably be "black_box is the problem", which is fine.

And I also assume that (as @H2CO3 pointed out in the first reply) this is a meaningless test. I maybe won't get to the bottom of this here and now, but if there is some material for reading up on microbenchmarks and how to avoid fooling oneself, that would interest me.

The difference is apparently still that the caller needs to set up a copy of the passed value on the stack. However, this completely disappears the very moment you neuter black_box().

Needless to say, not inlining the construction and accessors of Uuid is unrealistic. Being able to contort the code to an unrecognizable extent only in order to force the compiler not to perform trivial optimizations doesn't really count as "Uuid by-reference is faster than Uuid by-value".


@H2CO3 Thanks, I preempted you by about a second =)

A follow-up, theoretical, question - If the function you're calling is accepting many arguments, say a bunch of references and other values, then the registers will be full, is there a good strategy for how to think about those cases?

No. The only meaningful way to benchmark your code is to benchmark it in the context it is actually used in, unless it is doing some heavy (or at least non-trivial) computation that very obviously dominates any calling overhead.

The compiler will still go out of its way to put as much data into registers as possible. Even though most modern calling conventions don't allow more than a couple of integers to go into registers, inlining still means that smaller functions will be reasoned about and optimized in the context of their callers, and so register allocation can usually extend to a reasonable amount of code.

Of course, all of this is not magic. If you pass hundreds of arguments, you pass [u64; 65536] by value, and you annotate every single function with #[inline(never)], then the optimizer will eventually run out of registers and will have to spill values to the stack. This is pretty much physically inevitable, it is not specific to Rust, and it's not something you can defend against in basically any language.

However, since you specifically mentioned "many arguments", I'd like to point out that any function that takes a number of arguments for which you would need more than a single digit to count raises serious concerns about code quality. I personally really dislike (and am bad at) reading code that relies on functions with more than, say, 4 arguments. In this case, the readability of the code would be the first thing I would complain about, before even thinking about its potential performance implications.

Thanks @H2CO3, I appreciate your answers, very high quality :+1:

(Tongue-in-cheek answer follows, adding emojis to convey meaning) With all due respect, I wasn't looking for your concerns regarding the number of function arguments I use :joy_cat:, and let me reassure you that I do stick to single-digit argument counts ... most of the time :wink:. (End)

So to conclude, with your help, I have realized that my benchmark was flawed.

I'm happy!

EDIT: Previously I assumed that size_of::<&[u8]>() was the same as size_of::<Uuid>(), but that isn't true, because &Uuid isn't &[u8]; it's more like &[u8; 16]. I had a paragraph stating that they were the same. I'm not as happy anymore :expressionless:. So when a function has a few arguments, I guess it may be faster to use a &Uuid argument.
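For reference, the relevant sizes on a typical 64-bit target, using a stand-in newtype instead of the real uuid::Uuid:

```rust
use std::mem::size_of;

// Stand-in for uuid::Uuid: a wrapper around [u8; 16].
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct MyUuid([u8; 16]);

fn main() {
    // The value itself is 16 bytes.
    assert_eq!(size_of::<MyUuid>(), 16);
    // &MyUuid is a thin pointer: 8 bytes on 64-bit.
    assert_eq!(size_of::<&MyUuid>(), 8);
    // &[u8; 16] is also thin (the length is part of the type).
    assert_eq!(size_of::<&[u8; 16]>(), 8);
    // &[u8] is a fat pointer (pointer + length): 16 bytes.
    assert_eq!(size_of::<&[u8]>(), 16);
}
```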


My TL;DR here is that this will be such a small difference that you shouldn't even think about it. Just pass it by value, the same way you would pass a u64 by value even on a 32-bit machine.

If you're on a 64-bit machine, a Uuid is smaller than a String. But you pass owned Strings without worrying about that.

And in general, I'm skeptical that you'd ever need to pass so many Uuids that this could show up in macrobenchmarks.


My main domain is games, and there, due to the soft real-time requirements, some people get really involved in arguing minor microoptimization conventions like pass-by-reference versus pass-by-value.

Of course, C++ has further complications, in that copy constructors exist, so pass-by-value may silently introduce an expensive clone where pass-by-const-reference would have avoided it. Also, C++'s compilation model means making things inlineable is a much more active decision by the developer (putting it in the header versus the source file), so calling-convention cost adds up more (and also because C++ compilers refuse to do any transparent ABI optimization).

I bring it up merely because this means gamedev shops have cargo culted conventions for when to pass-by-value versus pass-by-reference. (Reminder for the C++-weary: this has no syntactic impact on the calling syntax, as C++ references are transparently introduced and are not normal types.) My current shop's guidelines are relatively reasonable and simple (only the parts about const references, ignoring mutating references, out pointers, etc):

  • If it's a primitive, always pass by-value.
  • If a type is not a primitive-like POD, always pass by-reference.
    • In C++ terms: it has non-defaulted “rule of 5” special member functions (dtor, copy/move constructor/assignment).
    • In Rust terms: it doesn't implement Copy.
  • STL types, where used, are always passed by reference.
    • We've not adopted C++17 std::string_view, let alone C++20 std::span. Who needs anything beyond const char* for non-owning strings? :upside_down_face:
  • If a type is non-primitive POD:
    • If it's on our list of primitive-like POD structs (math/physics vectors), always pass by-value.
      • Everyone agrees vec4f gets treated by-value. There's some debate about vec4d (256 bit f64x4), but consistency with vec4f has always won out, and the f64 vectors just aren't used much in general.
    • If it's not on the list, always pass it by-reference.
      • “Strong newtypes” just don't get used, the primitives and sometimes type aliases get used directly.
      • This means e.g. mat2f, despite being f32x4, is always passed by-reference, for consistency with e.g. mat4f (f32x16).
  • Fixed size arrays just basically don't exist. For the rare cases small arrays actually get used, they almost always get treated as vectors (the math/physics kind, not the dynamic array).
  • There's further rules for the use of nonconst references and pointer arguments, as well as rvalue references (C++ move semantics). These are elided for brevity here, but it's perhaps worth adding that the style this guide is for doesn't utilize RAII much at all; objects are managed via raw pointers and centralized arena ownership most of the time (and sometimes generational ids when telling people to discard destroyed objects' pointers fails too much).

To explicitly repeat this: these guidelines ignore when you'd pass by-move. E.g. in Rust if you're going to take ownership of a Clone value, you should take the owned value directly, so the caller can give it to you if they already have it and no longer need it, or write .clone() themselves if a clone is necessary. You shouldn't ever be taking a value by-reference and then unconditionally calling .clone() on it.
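A quick sketch of that last point, with hypothetical store_good/store_bad helpers:

```rust
// Preferred: take ownership when you will keep the value.
// The caller hands over the String if they no longer need it,
// or writes .clone() themselves when they do.
fn store_good(name: String) -> Vec<String> {
    vec![name]
}

// Anti-pattern: borrow and then unconditionally clone, forcing
// a clone even when the caller no longer needed the value.
fn store_bad(name: &String) -> Vec<String> {
    vec![name.clone()]
}

fn main() {
    let owned = String::from("abc");
    let v = store_good(owned); // no clone: ownership handed over

    let kept = String::from("def");
    let w = store_good(kept.clone()); // explicit clone, caller's choice
    assert_eq!(v[0], "abc");
    assert_eq!(w[0], "def");
    assert_eq!(kept, "def"); // still usable

    let _ = store_bad(&v[0]); // always clones, regardless
}
```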


For Rust, rustc is willing to do ABI optimization; e.g. the compiler will transparently pass large values "by reference" instead of "by value". What exactly this means depends on the ABI[1] (i.e. the default extern "Rust" versus extern "C"), but you can generally expect that rustc is choosing the best approach for you.

However, the other hand to that is that while rustc can introduce a transparent by-reference on function boundaries, removing them (e.g. treating fn(&i32) as fn(i32)) generally can't happen (without inlining), because the address could be semantically meaningful.

These two factors combine such that pretty much every Rust convention listing I've seen says to just always prefer passing Copy values by copy, until stack copies show up in a profiler as something to worry about. (The exception of course being generic code/impls or otherwise mirroring APIs which handle non-Copy values, and thus introduce references.)

The one caveat is that different guides will put a “large values” caveat on this, where for some value of “large,” you no longer #[derive(Copy)] even though it's semantically possible. The smallest maximum I've seen anyone legitimately argue for is 256 bits / 32 bytes / f64x4, matching the largest common SIMD register size. The largest cutoff I've seen arguments for is f64x16 / 128 bytes, though I've seen a few larger non-generic Copy types (typically enums of relatively large generic Copy types).

There's a reason for this: Rust/LLVM aren't great at eliminating redundant stack-to-stack copies of nontrivial[2] values. By-reference passing of by-copy values currently introduces a copy at the MIR level, even if the value is not used again (notably introducing a copy even just for a pass-through function); combine that suboptimal codegen with overly conservative optimization (AIUI typical C++ compilation doesn't create this kind of bad codegen without explicitly trying to hit this missed optimization) and unfortunate usage patterns, and the redundant copies can add up.

My personal advice is to always #[derive(Copy)] where it's semantically allowed[3], and always pass copy values by-copy in non-generic environments. If this becomes a problem (e.g. you blow your stack without optimizations, or maybe even with) it'll typically be relatively obvious what big types are the culprit. However, even in this case, I'd still err on just adding references to culprit functions (and/or replacing recursion with “dynamic” and/or iterative approaches) rather than removing the Copy implementation, unless there's a semantic justification for making duplication of the type explicit.

If you're worried about accidentally using way too much stack space, you can add a rule that recursive functions should take “large” arguments by reference instead of by value. But this applies equally to non-Copy values! I'd also generally suggest not compromising pub APIs; instead write the most desirable API and have it forward to an internal non-exposed helper with the adjusted interface. (I'll often even do the same for dyn Trait where dyn is only used for implementation reasons... though in that case, try to avoid ending up introducing an extra indirection if the caller already has a trait object; always prefer for<T: Trait + ?Sized> fn(&T) instead of for<T: Trait> fn(T) wherever possible, so T=dyn Trait is allowed. If you can add ?Sized to your generics, you probably should.)
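As a small illustration of that last ?Sized point, with a made-up Describe trait: the ?Sized bound lets T be dyn Describe (or str) directly, so a caller holding a trait object pays no extra indirection.

```rust
trait Describe {
    fn describe(&self) -> String;
}

struct Point;
impl Describe for Point {
    fn describe(&self) -> String {
        "point".to_string()
    }
}

impl Describe for str {
    fn describe(&self) -> String {
        format!("str of len {}", self.len())
    }
}

// `?Sized` allows T = dyn Describe or T = str, not just sized types.
fn print_it<T: Describe + ?Sized>(t: &T) -> String {
    t.describe()
}

fn main() {
    let p = Point;
    let d: &dyn Describe = &p;
    // No `&&dyn Describe` double indirection needed:
    assert_eq!(print_it(d), "point");
    // Unsized str works too:
    assert_eq!(print_it("abcd"), "str of len 4");
}
```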


  1. If there's a difference on the platform C ABI, it'll be that register-spilled by-value arguments are passed at a statically known offset from the stack pointer (e.g. copied onto the callee stack by the caller), whereas by-reference arguments are handled by passing by-value (i.e. in registers or spilled to known stack pointer offsets) the address of the value. There's also the question of whether ABI-by-reference values are allowed to be modified by the subroutine or if it has to copy it into its own stack space if it wants to modify it. With C++, you're getting the platform conventions, no matter how good a fit they're considered for modern software. With Rust, you're getting semantics reasonably tuned for Rust conventions. ↩︎

  2. My standard example here is that a redundant copy of a temporary [0x0101_u16; 10_000] will get optimized out, but [0x0102_u16; 10_000] won't. ↩︎

  3. If the fields and lack of Drop mean the language would allow the derive, but the described type doesn't make semantic sense to implicitly duplicate, then the Copy impl isn't semantic. ↩︎


I did some playing around yesterday, using the new std::hint::black_box instead of the one from criterion, and it seems differences start showing up at 8 Uuids, but that might be yet another benchmarking flaw on my part.
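For reference, a minimal version of that kind of no-criterion loop, with a stand-in newtype so it compiles without the uuid crate, and std::hint::black_box in place of criterion's:

```rust
use std::hint::black_box;
use std::time::Instant;

// Stand-in for uuid::Uuid so the sketch needs no external crate.
#[derive(Clone, Copy)]
struct MyUuid([u8; 16]);

#[inline(never)]
fn by_ref(id: &MyUuid) -> u128 {
    u128::from_le_bytes(id.0)
}

#[inline(never)]
fn by_val(id: MyUuid) -> u128 {
    u128::from_le_bytes(id.0)
}

fn main() {
    let id = MyUuid([1; 16]);
    // Both paths compute the same thing.
    assert_eq!(by_ref(&id), by_val(id));

    // Crude timing loop; criterion's statistics are far more
    // trustworthy, so treat this only as a sanity check.
    let start = Instant::now();
    let mut acc = 0u128;
    for _ in 0..1_000_000 {
        acc = acc.wrapping_add(by_ref(black_box(&id)));
    }
    black_box(acc);
    println!("by_ref loop: {:?}", start.elapsed());
}
```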

And big thanks to @CAD97 for that comprehensive answer, really useful :pray: