I am developing a game tree search that uses a large struct, and I am suffering from copy performance.
I compared C and Rust with a simple program.
Rust version
#[derive(Copy)]
struct Example {
    a: [u8; 400],
}

impl Clone for Example {
    fn clone(&self) -> Self {
        *self
    }
}

fn main() {
    let a = Example { a: [0; 400] };
    let mut b = Example { a: [0; 400] };
    for i in 0..10000000 {
        b = a.clone();
        b.a[0] = i as u8;
    }
    println!("{}", b.a[0]);
}
C version
#include <stdio.h>

struct Example {
    unsigned char a[400];
};

int main() {
    struct Example a;
    struct Example b;
    int i;
    for (i = 0; i < 10000000; i++) {
        b = a;
        b.a[0] = (unsigned char)i;
    }
    printf("%d", (int)b.a[0]);
}
Checking with Instruments on OS X, the Rust version built with -O takes 164 ms and the C version built without -O takes 121 ms, but looking at the memmove part alone, the former takes 134 ms and the latter 79 ms.
The sizes of the structs should be the same, so why does the Rust version take longer?
One thing I wonder about: when I compile without -O, the memmove in the Rust version takes about the same time as C's, but the memswap done via the iterator takes about the same amount of time again.
Also, if you know any articles about techniques for writing fast Rust code, please let me know.
For example, I am not quite sure when bounds checks are imposed. If I fully use a functional programming style, can I avoid bounds checks? I would like to read articles of that kind.
If the performance of Rust code comes close to C's, I will definitely love Rust! ^^
Just to note that the two programs are not equivalent. The Rust code has to initialise a and b, but the C program doesn't. I believe the C program is reading uninitialised memory, so any sort of performance comparison is probably bunk.
Also, the Rust code notionally contains a bounds check on both array accesses. Are you certain those have been optimised out?
And just to note: when I compile that code on playpen in release mode (after moving the println! into its own #[inline(never)] function), it appears to eliminate the structs entirely, and just calculates the value of the last b.a[0].
I know about the initialization difference, and I know it is negligible compared with 10000000 copies ^^
And I have now confirmed that Rust has very heavy overhead when copying a large struct.
I think that is a big problem, given that my Rust code itself has no explicit overhead.
I think the point about initialization is that the C compiler is not required to copy the struct, making the comparison moot, assuming a clever C compiler. In fact, I would hope that the Rust compiler will some day also be able to eliminate the loop, since it has no effect.
I want to focus on my current issue.
My project copies structs a lot, and a large part of its execution time is spent doing so.
Even in the simplest case I showed, current Rust takes almost twice as long as C.
So if I can avoid this situation with a coding technique, I will definitely be happy; if not, it is a serious problem for me.
Please give me hints for improvements, not criticisms against my manner.
Benchmarking is hard. The code you provided doesn't actually demonstrate the problem, so it's impossible for anyone else to know whether what you're describing really exists. That makes it difficult for anyone to offer real advice.
I'll try to get the actual disassembly for the functions, but it turns out I don't have a usable gdb on this machine, so I need to sort that out first...
Here's a better comparison that yields identical runtimes on my system (rust: -C opt-level=3, gcc: -O2). You need the black boxes to make sure the compiler doesn't just throw away the copy operations.
#![feature(test)]
extern crate test;

#[derive(Copy)]
struct Example {
    a: [u8; 400],
}

impl Clone for Example {
    fn clone(&self) -> Self {
        *self
    }
}

fn main() {
    let aa = Example { a: [0; 400] };
    let mut bb = Example { a: [0; 400] };
    let a = test::black_box(&aa);
    let b = test::black_box(&mut bb);
    for i in 0..10000000 {
        *b = *a;
        b.a[0] = i as u8;
    }
    println!("{}", test::black_box(b).a[0]);
}
Versus:
#include <stdio.h>

struct Example {
    unsigned char a[400];
};

struct Example * __attribute__((optimize("O0"))) black_box(struct Example *v) {
    return v;
}

int main() {
    struct Example aa = { .a = {0} };
    struct Example bb = { .a = {0} };
    struct Example *a = black_box(&aa);
    struct Example *b = black_box(&bb);
    int i;
    for (i = 0; i < 10000000; i++) {
        *b = *a;
        b->a[0] = (unsigned char)i;
    }
    printf("%d", (int)black_box(b)->a[0]);
}
The real problem was the lack of black boxing. Basically, you have to trick the compilers into "forgetting" that the two arrays are filled with zeros. Otherwise, they'll just optimize everything away (which gcc appears to be better at doing).
Just a note: I got around to pulling out the compiled assembly for the copy functions:
The biggest difference I saw was that the Rust functions were calling memcpy, whereas the C ones weren't. Note that this is without doing LTO (since I'm getting these from the respective compilers), but LTO would probably make analysing the functions in isolation difficult anyway.
I can't really speculate on why they're different.
It looks like LLVM is having trouble optimizing the branch condition for copy_struct_many_rust. I think there's a bug filed somewhere about that, but I can't seem to find it (IIRC explicitly slicing the operands so they're the same length improves code generation a bit).