Performance of copy of struct


#1

Hi.

I am developing game tree search using large struct and amd suffering from copy performance.

I compared C and Rust by simple program.

Rust version

#[derive(Copy)]
struct Example {
    a: [u8; 400],
}

impl Clone for Example {
    fn clone(&self) -> Self {
        *self
    }
}

fn main() {
    let a = Example { a: [0; 400] };
    let mut b =Example { a: [0; 400] };
    for i in 0..10000000 {
        b = a.clone();
        b.a[0] = i as u8;
    }
    println!("{}", b.a[0]);
}

C version

#include <stdio.h>

struct Example {
    unsigned char a[400];
};

int main() {
    struct Example a;
    struct Example b;
    int i;
    for (i = 0; i < 10000000; i++) {
        b = a;
        b.a[0] = (unsigned char)i;
    }
    printf("%d", (int)b.a[0]);
}

Checking by Instruments on OS X, Rust version with -0 option consumes 164ms and C version without -O consumes 121ms, but when looking at memmove part, the former consumes 134ms and the latter consumes 79ms.
The sizes of structs should be same, why does Rust version consume more?

One thing that I wonder is, when I compiled without -O, memmove of Rust version consumes almost same as C’s but memswap by iterator consumes almost same time.

And, if you know any articles about techniques to write fast Rust code, please let me know.
For example, I am not quite sure when being imposed bound check. If I use functional programming style fully, can I avoid bound check? I want to read this kind of articles.

If performance of Rust code become close to C, I must love Rust!^^


#2

Just to note that the two programs are not equivalent. The Rust code has to initialise a and b, but the C program doesn’t. I believe the C program is reading uninitialised memory, so any sort of performance comparison is probably bunk.

Also, the Rust code notionally contains a bounds check on both array accesses. Are you certain those have been optimised out?

Microbenchmarks are hard. :stuck_out_tongue:


#3

And just to note: when I compile that code on playpen in release mode (after moving the println! into its own #[inline(never)] function), it appears to eliminate the structs entirely, and just calculates the value of the last b.a[0].


#4

I know that initialization difference and I know that it is negligible compared with 10000000 copies^^

And currently I confirm that Rust has very heavy overhead about copy of large struct.
I think that it is a big problem if my Rust code itself has no explicit overhead.


#5

I think the point about initialization is that the c compiler is not required to copy the struct, making the comparison moot, assuming a clever c compiler. In fact, I would hope that the rust compiler some day would also be able to eliminate the loop, since it has no effect.


#6

Thank you for you guy’s interest in this topic.

I want to focus my current issue.
My project is using a lot of copy of struct and large part of execution is for it.
Even if simplest case I suggested, current Rust consumes almost twice time of C.

So, if I can avoid this situation by coding technique, I will become definitely happy, but If not, it is a serious problem for me.

Please give me hints for improvements, not criticisms against my manner.


#7

I would recommend first improving your test, so you’ll know that you are fixing something!


#8

Sorry, your information is no hints for me, just criticisms…


#9

Try writing a timing program that makes use of the dates that is being copied, and requires the data to actually be copied.

One suggestion already made is to initialize your structs in C.


#10

Benchmarking is hard. The code you provided to show the problem doesn’t, so it’s impossible for anyone else to know whether or not what you’re describing really exists or not. That makes it difficult for anyone to offer any real advice.

Here’s my attempt at a more usable benchmark:

Gists are Git repos, so you should be able to clone this and just cargo bench to get results. Here’s a gist with some results:

I’ll try to get the actual disassembly for the functions, but it turns out I don’t have a usable gdb on this machine, so I need to sort that out first…


#11

You can get it straight from rustc, e.g. rustc --emit=asm (or, with cargo, cargo rustc --bench NAME -- --emit=asm).


#12

That doesn’t provide the disassembly of the C code, though; since my goal was to compare them, that kinda makes it difficult.

Also, it looks like MSYS2 is just completely stuffed, so I’m giving up on getting a working gdb tonight. sigh


#13

If you’re using gcc or clang, then -S will give the ASM for that too. (No idea about other compilers.)


#14

Here’s a better comparison that yields identical runtimes on my system (rust: -C opt-level=3, gcc: -O2). You need the black boxes to make sure the compiler doesn’t just throw away the copy operations.

#![feature(test)]
extern crate test;
#[derive(Copy)]
struct Example {
    a: [u8; 400],
}

impl Clone for Example {
    fn clone(&self) -> Self {
        *self
    }
}

fn main() {
    let aa = Example { a: [0; 400] };
    let mut bb = Example { a: [0; 400] };
    let a =test::black_box(&aa);
    let mut b =test::black_box(&mut bb);
    for i in 0..10000000 {
        *b = *a;
        b.a[0] = i as u8;
    }
    println!("{}", test::black_box(b).a[0]);
}

Versus:

#include<stdio.h>

struct Example {
    unsigned char a[400];
};

struct Example * __attribute__((optimize("O0"))) black_box(struct Example * v) {
    return v;
}

int main() {
    struct Example aa = { .a = {0} };
    struct Example bb = { .a = {0} };
    struct Example * a = black_box(&aa);
    struct Example * b = black_box(&bb);
    int i;
    for (i = 0; i < 10000000; i++) {
        *b = *a;
        b->a[0] = (unsigned char)i;
    }
    printf("%d", (int)black_box(b)->a[0]);
}

#15

DanielKeep-san, steballen-san,
Thank you for your educational comments!

I am one-month Rustacean, so it is still hard to understand what you are using in your code, but I really appreciate time that you spent for me.

Trying your codes, I wonder if my original statement “b = a.clone();” is serious cause.
Does it mean clone then copy?

I was happy that I found my problem, but it gave me false hope.
Variables using clone method are references in my real project…

I will study your codes step by step and I hope that my real project will improve.

Thank you!


#16

The real problem was the lack of black boxing. Basically, you have to trick the compilers into “forgetting” that the two arrays are filled with zeros. Otherwise, they’ll just optimize everything away (which gcc appears to be better at doing).


#17

re: amd suffering

Don’t worry, Zen is coming!


#18

Just a note: I got around to pulling out the compiled assembly for the copy functions:

The biggest difference I saw was that the Rust functions were calling memcpy, whereas the C ones weren’t. Note that this is without doing LTO (since I’m getting these from the respective compilers), but LTO would probably make analysing the functions in isolation difficult anyway.

I can’t really speculate on why they’re different.


#19

It looks like LLVM is having trouble optimizing the branch condition for copy_struct_many_rust. I think there’s a bug filed somewhere about that, but I can’t seem to find it (IIRC explicitly slicing the operands so they’re the same length improves code generation a bit).

In terms of generating memcpy versus rep movs, that’s controlled by a threshold in LLVM, which is apparently much smaller in LLVM than it is in gcc; see https://github.com/llvm-mirror/llvm/blob/a026cdc11a64e7c87b44d09a53c72fdf9e2163ea/lib/Target/X86/X86Subtarget.cpp#L294 . It’s possible the threshold should be tweaked; IIRC rep movs has gotten faster on recent processors.