Which is faster: `Copy` or passing a reference?

I don't know much about low-level programming.

Which of the following functions is faster?

fn copy<X: Copy>(x: X) -> X { x }
fn reference<X>(x: &X) -> &'_ X { x }
  • X could be u8, u64, or [u8; 1024].
  • opt-level could be 3 or z.

The only real way to know which is faster is to benchmark them both, e.g. using criterion.
That's pretty much always true btw. Always benchmark.
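
For a quick first look before setting up criterion, something like the following std-only sketch could be used. It is not a replacement for a real benchmark harness: `time_it` is a hypothetical helper made up for this illustration, the iteration count is arbitrary, and the numbers will be noisy. `std::hint::black_box` is used to discourage the optimizer from deleting the calls entirely.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// The two functions from the question.
fn copy<X: Copy>(x: X) -> X { x }
fn reference<X>(x: &X) -> &X { x }

/// Hypothetical helper: runs `f` `iters` times and returns the total elapsed time.
fn time_it<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed()
}

fn main() {
    let big = [0u8; 1024];
    let iters = 1_000_000; // arbitrary; tune for your machine

    // black_box on both the input and the output keeps the calls observable.
    let by_copy = time_it(iters, || {
        black_box(copy(black_box(big)));
    });
    let by_ref = time_it(iters, || {
        black_box(reference(black_box(&big)));
    });

    println!("copy: {by_copy:?}, reference: {by_ref:?}");
}
```

Build it with the opt-level you care about (for example `-C opt-level=3` or `-C opt-level=z`) and compare. Treat the result as a hint only; criterion handles warm-up, outliers, and statistics that this sketch ignores.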

EDIT, additional observations:

Note that, as defined, the parameter to reference() is always the size of a borrow (one pointer), while the argument to copy() can be arbitrarily large.
Thus it could be a primitive that fits in a register, but it could, in principle, also be a really large value.
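
The size difference described above can be checked directly with `std::mem::size_of`. In this small sketch, `arg_sizes` is a helper name invented for the illustration; it returns the size of the by-value argument and the size of the reference for a given type.

```rust
use std::mem::size_of;

/// Hypothetical helper: (size of `X` passed by value, size of `&X`) in bytes.
fn arg_sizes<X>() -> (usize, usize) {
    (size_of::<X>(), size_of::<&X>())
}

fn main() {
    // The second number is always pointer-sized; the first grows with the type.
    println!("u8:         {:?}", arg_sizes::<u8>());
    println!("u64:        {:?}", arg_sizes::<u64>());
    println!("[u8; 1024]: {:?}", arg_sizes::<[u8; 1024]>());
}
```

This only shows how many bytes each argument occupies. It says nothing about how the compiler actually passes them once optimizations and inlining kick in.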


It also depends a lot on the context of the code. Unless there is a need to pass by reference, the compiler will almost surely optimize passing around small values by-value (so that it can put them into registers instead of main memory). So I'd expect zero difference in the case of u8, u64, and anything small enough to fit in a few registers.

Generally, when passing around primitives and pure values, you shouldn't worry about whether referencing or copying is faster. It doesn't matter because practically everything else in your program (e.g. allocations) will be slower and will dominate trivial copies.


There's often another performance boundary at 4kB, because that's the granularity of the virtual memory system. So [u8; 4096] might end up faster to copy than [u8; 1024] if the compiler/OS chooses to use page table trickery instead of actually copying the bytes.


Do you actually know of any such OS? It's not impossible to create such an OS, but I wonder if such a beast exists even as a theoretical thing. In practice such things surely don't exist: TLB invalidation is expensive, so this would be a pessimization 10 times out of 10.

For something like [u8; 1048576] this would be a win (and I have even done that myself), but, again, I have no idea if any compiler may do that automatically for you.

I know it’s built-in to glibc’s memcpy implementation, but I don’t know what conditions (platform/size/alignment/etc) trigger that code path.


I know because I’ve looked at the glibc code myself. Please refrain from this sort of ad hominem attack in your posts.

In particular, I inferred it from the structure of memcpy.c, most notably the use of PAGE_COPY_FWD_MAYBE(…).


After a bit more digging, this appears to only be used by the mach platform, once the copy exceeds 16kB.

(cf generic/pagecopy.h, mach/pagecopy.h)


Good, good, good. You have found a failed experiment which was never actually used by normal people. The version of memcpy which you have found is designed for the “primary GNU platform: GNU/Hurd”. Of course I have never looked there.

Which is based around PAGE_COPY_FWD, which is only ever defined in sysdeps/mach/pagecopy.h.

I apologize. Sorry. For that:

I should have known better. Of course real OSes created by normal people for real needs wouldn't do such craziness. But I forgot about Architecture Astronauts. Sorry.

Remember when that was written: An initial kernel exists but many more features are needed to emulate Unix. When the kernel and compiler are finished, it will be possible to distribute a GNU system suitable for program development.

1985! After Stallman gave his initial crude code to Architecture Astronauts to finish, they created something so convoluted that even the name is unique. It's funny that ten years ago they asked "Is it vaporware, like Duke Nukem Forever?" and answered "Fortunately not."

But Duke Nukem was released the very next year after fifteen years of development. Hurd is not yet released, after 37 years. Sorry, I really forgot about that abomination.

I apologize for my words but not for my stance. What you have found was a joke (anything related to Hurd is a joke) and I shouldn't be angry at you for not recognising it. Nonetheless my point still stands: if you exclude non-vaporware-just-simply-thing-which-takes-half-century-to-develop [/sarcasm-off] then all other OSes don't do that. Thus from a practical viewpoint this [u8; 4096] barrier doesn't exist.


Thanks for explaining that, because this has come up before, and I for one never realised this was not a practical operating system.

It seems that way, at least for the present.

There's no practical barrier, though, to a future Rust compiler doing this for types where it would be a benefit (though the cutoff would be much bigger than 4kB). It would just take someone motivated enough to write a patch and demonstrate where the performance cutoff actually lies in practice, across a representative sample of CPUs that run Rust code¹.

As always, the main message is that the compiler has lots of opportunity to optimize things that you might not think of. Write for clarity first, and don't manually optimize until you've verified that you can do a better job than the compiler.

¹ As this is necessarily platform-dependent code, the exercise would need to be repeated for each platform the new optimization gets added to.


I don't see anyone ever implementing this "virtual memory copying" though, because practical programs take care not to make identical copies of large objects. So I don't ever see it happening.

8 posts were split to a new topic: When to do optimization?

According to Compiler Explorer both functions are eliminated (squashed) which, I guess, makes them both infinitely fast.

I assume you meant the bodies to do something non-trivial. However, for reference() it's unclear what non-trivial things can be done given the list of potential arguments.

Given the fact that the optimizers are very good at what they do, two things come to mind...

  • The details (the function bodies) are important to understanding which is going to be "faster".
  • It is possible that all the potential arguments will be passed "in the small"; that u8 and u64 will be passed by value using registers and that [u8; 1024] will be passed by reference for both functions.

Both functions are generic and as such only codegened if you actually call (or otherwise reference) them with a concrete type as generic argument. See for example Compiler Explorer
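
To see this in Compiler Explorer, one option is to add non-generic public wrappers so that concrete instantiations exist. This is only a sketch: the wrapper names are made up for illustration, and `#[inline(never)]` is a hint to the compiler, so the exact assembly you see may still vary between compiler versions and opt-levels.

```rust
use std::hint::black_box;

// The generic functions from the question. On their own they produce no
// machine code, because nothing instantiates them with a concrete type.
#[inline(never)]
pub fn copy<X: Copy>(x: X) -> X { x }

#[inline(never)]
pub fn reference<X>(x: &X) -> &X { x }

// Hypothetical wrappers: each one forces an instantiation for a concrete type.
pub fn copy_u64(x: u64) -> u64 { copy(x) }
pub fn reference_u64(x: &u64) -> &u64 { reference(x) }
pub fn copy_1k(x: [u8; 1024]) -> [u8; 1024] { copy(x) }
pub fn reference_1k(x: &[u8; 1024]) -> &[u8; 1024] { reference(x) }

fn main() {
    let small = 42u64;
    let big = [1u8; 1024];
    // black_box keeps the results "used" so the calls are not trivially dead.
    black_box(copy_u64(small));
    black_box(reference_u64(&small));
    black_box(copy_1k(big));
    black_box(reference_1k(&big));
}
```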


Thank you for the reply.

I used println! to call them. Without optimizations both are included. With the optimizations the original poster specified both functions are eliminated.

I don't think it's meaningful to discuss that part here.

If developers don't want to create an efficient program then Rust cannot help them.

I don't think I can imagine any language which can do that.

We have to assume they want to do that, just want to pick the most efficient (from human resources expenditure POV) way. And in that case this advice is actively harmful:

I would probably only recommend writing for clarity first in cases where the task is so complicated that any other approach means you risk not finishing anything at all.

Otherwise the order should be:

  1. Think about your data structures and think about the cost of using them.
  2. Write code and don't think about its speed.
  3. Optimize code in places where you hit the bottleneck.

Because in my experience, on many occasions I have gotten a much more efficient implementation (and often one that is easier to support) by investing a few days (or a couple of weeks in complex cases) in designing data structures on paper before writing code than my colleagues have gotten from your approach of profiling and rewriting some procedures.

Of course if you cannot even get a proper description of what you need to achieve from a customer, you don't really have a choice: you have to make an inefficient system because efficiency has to be sacrificed for flexibility.

But even then I face architecture astronautics more often than genuine flexibility. Take that HURD with its crazy CoW memory copy support: they embedded so many theoretically nice and clever things into it, and achieved crazy good extensibility and other such buzzwordy things.

Now… where are these GNU/HURD based desktops, servers and mobile phones, hmm?

I played once on godbolt, and noticed that anything that isn't a primitive or a slice is passed in as a pointer to the stack. E.g. if I have under 8 arguments, they will all be registers, but structs are pointers. Thus Foo and &Foo and Box<Foo> are identical assembly-wise. All are pointers to something. The only difference is that &Foo doesn't run drop().

This might not always be true of course - just that one time playing around with noinline simple functions.


That's not true. For instance:

pub struct Foo {
    a: u8,
    b: u8
}

pub fn pass_by_value(foo: Foo) -> u8 {
    foo.a + foo.b
}

pub fn pass_by_reference(foo: &Foo) -> u8 {
    foo.a + foo.b
}

In pass_by_value the argument is passed in two registers, in pass_by_reference a pointer is passed.

example::pass_by_value:
        lea     eax, [rsi + rdi]
        ret

example::pass_by_reference:
        mov     al, byte ptr [rdi + 1]
        add     al, byte ptr [rdi]
        ret

If you edit that to be Box<Foo> it goes back to being pointer arithmetic in pass_by_value. If you swap u8 with [u8; 1024] then it's also pointer arithmetic. So, while I agree the generalization probably hits edge cases like you described, the common case will not be by value.

I did try u128 and it did very nicely use registers in pass_by_value; so that's awesome.

u128 is passed as a pair of registers in this compiler, so going to three u128 fields (a, b, c) it flips over to pass-by-ref even in pass_by_value.

Conversely, using six u8 fields (a..f) it still keeps them in registers. I have to get all the way to 'j', TEN fields, before it flips back to pass-by-ref. That's impressive. (Of course this is way down in the specifics of the hardware and compiler version, I'm sure.)
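
For anyone who wants to repeat this kind of experiment, here is a sketch of test structs to paste into Compiler Explorer. The struct and function names are invented for this illustration. The Rust calling convention is unspecified, so where the register/pointer cutoff lands depends on target and compiler version; the only thing the program itself can report is the size of each struct.

```rust
use std::mem::size_of;

// Hypothetical test types of increasing size.
pub struct TwoBytes { pub a: u8, pub b: u8 }
pub struct TenBytes {
    pub a: u8, pub b: u8, pub c: u8, pub d: u8, pub e: u8,
    pub f: u8, pub g: u8, pub h: u8, pub i: u8, pub j: u8,
}
pub struct ThreeWide { pub a: u128, pub b: u128, pub c: u128 }

// #[inline(never)] is a hint so each function shows up in the assembly.
// wrapping_add avoids overflow panics in debug builds.
#[inline(never)]
pub fn sum_two(v: TwoBytes) -> u8 { v.a.wrapping_add(v.b) }

#[inline(never)]
pub fn sum_ten(v: TenBytes) -> u8 {
    [v.a, v.b, v.c, v.d, v.e, v.f, v.g, v.h, v.i, v.j]
        .iter()
        .fold(0u8, |acc, x| acc.wrapping_add(*x))
}

#[inline(never)]
pub fn sum_wide(v: ThreeWide) -> u128 { v.a.wrapping_add(v.b).wrapping_add(v.c) }

fn main() {
    println!("TwoBytes:  {} bytes", size_of::<TwoBytes>());
    println!("TenBytes:  {} bytes", size_of::<TenBytes>());
    println!("ThreeWide: {} bytes", size_of::<ThreeWide>());
    println!("{}", sum_two(TwoBytes { a: 1, b: 2 }));
}
```

Compare the assembly of the three `sum_*` functions at your chosen opt-level to see which arguments arrive in registers and which arrive as a pointer on your target.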