Yes, at last, my Rust is faster than C!

I have sometimes mentioned here my attempts to reimplement a solution to Project Euler problem #256, "Tatami-Free Rooms", in Rust. The original is in C, written by E. J. Olsen.

Today I have arrived at a version that is faster than the original C, both on my x86_64 PC and on the ARM of a Raspberry Pi 3B.

A simple makefile builds and times them:

On the Pi 3B

pi@debian-buster-64:~/tatami-rust $ make run
gcc -Wall -O3 -o prune prune.c -march=native -mtune=native
gcc -Wall -O3 -o limited limited.c -march=native -mtune=native
RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo build --release --features=use_i32
   Compiling tatami_rust v0.1.0 (/home/pi/tatami-rust)
    Finished release [optimized] target(s) in 3.56s
time ./limited
T(85765680)=200
4.34user 0.00system 0:04.36elapsed 99%CPU (0avgtext+0avgdata 1220maxresident)k
0inputs+0outputs (0major+69minor)pagefaults 0swaps
time ./prune
Pr(1300)=10657
T(85765680)=200
4.14user 0.00system 0:04.14elapsed 99%CPU (0avgtext+0avgdata 1164maxresident)k
0inputs+0outputs (0major+68minor)pagefaults 0swaps
time ./target/release/tatami_rust 200
Using 32 bit integers.
PNUM = 1300
FNUM = 10
SMAX = 100000000
T(85765680)=200
4.04user 0.00system 0:04.04elapsed 99%CPU (0avgtext+0avgdata 1528maxresident)k
0inputs+0outputs (0major+85minor)pagefaults 0swaps

On the x86_64 PC:

zicog@monster:/mnt/c/Users/zicog/tatami-rust$ make run
gcc -Wall -O3 -o prune prune.c -march=native -mtune=native
gcc -Wall -O3 -o limited limited.c -march=native -mtune=native
RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo build --release --features=use_i32
   Compiling libc v0.2.66
   Compiling tatami_rust v0.1.0 (/mnt/c/Users/zicog/tatami-rust)
   Compiling time v0.1.42
    Finished release [optimized] target(s) in 6.02s
time ./limited
T(85765680)=200
0.90user 0.01system 0:00.92elapsed 100%CPU (0avgtext+0avgdata 584maxresident)k
0inputs+0outputs (0major+175minor)pagefaults 0swaps
time ./prune
Pr(1300)=10657
T(85765680)=200
0.81user 0.00system 0:00.83elapsed 96%CPU (0avgtext+0avgdata 584maxresident)k
0inputs+0outputs (0major+177minor)pagefaults 0swaps
time ./target/release/tatami_rust 200
Using 32 bit integers.
PNUM = 1300
FNUM = 10
SMAX = 100000000
T(85765680)=200
0.60user 0.00system 0:00.64elapsed 94%CPU (0avgtext+0avgdata 876maxresident)k
0inputs+0outputs (0major+257minor)pagefaults 0swaps

All code is here if anyone wants to play along: https://github.com/ZiCog/tatami-rust

Next up, somebody has written a multi-threaded version that is much faster. Anyone know the best way to queue work into a thread pool in Rust?

Thanks to all who have helped with this along the way.

Merry Christmas.


Neat, congrats! Could you (also) compile the C code with clang for your comparison? Using the same LLVM backend will let us compare just the languages rather than language + backend :slightly_smiling_face:


OK.

On the PC there is not much difference using clang. Rust is a tiny bit ahead:

$ make
clang -Wall -O3 -o prune prune.c -march=native -mtune=native
clang -Wall -O3 -o limited limited.c -march=native -mtune=native
RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo build --release --features=use_i32
   Compiling tatami_rust v0.1.0 (/mnt/c/Users/zicog/tatami-rust)
    Finished release [optimized] target(s) in 1.26s
time ./limited
T(85765680)=200
0.71user 0.00system 0:00.73elapsed 98%CPU (0avgtext+0avgdata 584maxresident)k
0inputs+0outputs (0major+177minor)pagefaults 0swaps
time ./prune
Pr(1300)=10657
T(85765680)=200
0.64user 0.00system 0:00.66elapsed 95%CPU (0avgtext+0avgdata 584maxresident)k
0inputs+0outputs (0major+176minor)pagefaults 0swaps
time ./target/release/tatami_rust 200
Using 32 bit integers.
PNUM = 1300
FNUM = 10
SMAX = 100000000
Pr(1300)=10657
T(85765680)=200
0.60user 0.01system 0:00.63elapsed 98%CPU (0avgtext+0avgdata 880maxresident)k
0inputs+0outputs (0major+257minor)pagefaults 0swaps

On the Pi 3's ARM, Rust is a little bit behind, about 7% slower :frowning:

$ make
clang -Wall -O3 -o prune prune.c -mtune=native
clang -Wall -O3 -o limited limited.c  -mtune=native
RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo build --release --features=use_i32
   Compiling tatami_rust v0.1.0 (/home/pi/tatami-rust)
    Finished release [optimized] target(s) in 6.41s
time ./limited
T(85765680)=200
4.02user 0.00system 0:04.02elapsed 99%CPU (0avgtext+0avgdata 1076maxresident)k
0inputs+0outputs (0major+66minor)pagefaults 0swaps
time ./prune
Pr(1300)=10657
T(85765680)=200
3.74user 0.00system 0:03.74elapsed 99%CPU (0avgtext+0avgdata 1124maxresident)k
0inputs+0outputs (0major+67minor)pagefaults 0swaps
time ./target/release/tatami_rust 200
Using 32 bit integers.
PNUM = 1300
FNUM = 10
SMAX = 100000000
Pr(1300)=10657
T(85765680)=200
4.03user 0.00system 0:04.04elapsed 100%CPU (0avgtext+0avgdata 1488maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps

The amazing thing about this is that the Rust code is a simple transliteration from C to Rust with a bit of rustification thrown in to keep clippy happy.

Note however that I cheat a bit by generating a static table of prime numbers from build.rs. But that is only a very small part of the run time, especially when we scale the thing up to larger problems.
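For anyone curious how build-time table generation looks, here is a rough sketch (the file layout, names, and trial-division sieve are illustrative, not the actual build.rs in the repo):

```rust
// build.rs sketch (hypothetical names): generate a static table of
// primes at build time and write it into OUT_DIR for include!().
use std::env;
use std::fs;
use std::path::Path;

// Trial division: return the first `n` primes.
fn first_primes(n: usize) -> Vec<u32> {
    let mut primes: Vec<u32> = Vec::with_capacity(n);
    let mut c: u32 = 2;
    while primes.len() < n {
        // c is prime if no prime p with p*p <= c divides it.
        if primes.iter().take_while(|&&p| p * p <= c).all(|&p| c % p != 0) {
            primes.push(c);
        }
        c += 1;
    }
    primes
}

fn main() {
    let primes = first_primes(1300); // PNUM in the posted output
    let dest = Path::new(&env::var("OUT_DIR").unwrap()).join("primes.rs");
    fs::write(
        dest,
        format!("pub static PRIMES: [u32; {}] = {:?};", primes.len(), primes),
    )
    .unwrap();
}
```

The crate source then pulls the table in with `include!(concat!(env!("OUT_DIR"), "/primes.rs"));`, so the sieve cost is paid at compile time, not at run time.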


For fun, you can hack your way into doing this in C with something like:

# Makefile

.PHONY: all clean # generated.c # Add it if you want the code generation to always run

all: main

main: main.o other_deps.o
    $(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)

main.o: main.c generated.c
    $(CC) -I. $(CPPFLAGS) $(CFLAGS) -c $<

%.o: %.c
    $(CC) $(CPPFLAGS) $(CFLAGS) -c $<

generated.c: build
    { echo "/* Auto-generated file: do not edit manually */" && ./$< ; } > $@ 

build: build.c
    $(CC) -o $@ $<

clean:
    rm -f main build generated.c *.o

while having

#include "generated.c"

in main.c, and a build.c that uses printf to generate the C code.

  • (if the generated code has a fixed API, such as a static array of integers, you can instead use a fixed generated.h as with any other lib (i.e., having #include "generated.h" in the files and generated.o be used when linking the final binary), where generated.c is created by ./build).

Cool.

I'm all up for hacking, but that is not my idea of fun. I hate makefiles.

My plan is to never use C again, unless I really have to...


In general, the answer is almost always "use Rayon," but it will depend on the pattern of work generated. Rayon provides things like parallel iterators which will, behind the scenes, handle distributing chunks of work across threads. In my experience, the results are roughly on par with what I would produce by hand (in terms of throughput). However, Rayon also has a low-level API with operations like ThreadPool::spawn.


Ah thanks.

The C code I am wanting to implement in Rust is not amenable to splitting iteration over arrays into chunks.

It's a recursive divide and conquer algorithm. I have no idea what work load each 'arm' of the recursion gets.

I guess after a few levels of recursion one has run out of cores to run the threads on, so all that work queuing is a useless overhead anyway.

But the Rayon ThreadPool::spawn looks like a good place to start to get this code running at all.

That's okay. Rayon will balance the load by work-stealing across threads. I'd give it a try if possible.

OK. Sounds promising.

I'm working on it here: https://github.com/ZiCog/tatami-rust

I'll be sure to be back with results.

I now have a version of the threaded Tatami problem, queue.c, coded up in Rust as queue.rs. It is using a rayon thread pool:

let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(8)
    .build()
    .unwrap();

and rayon's spawn_fifo:

pool.spawn_fifo(move || {
    Twork(&mut yp, Tisn, &mut g);
});

With a touch of AtomicPtr to replace the gcc built-in call "__atomic_compare_exchange_n(&gMin, &smin, xp.s, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED)":

fn __atomic_compare_exchange_n(ptr: &mut u32, expected: &mut u32, mut desired: u32) -> bool {
    let some_ptr = AtomicPtr::new(ptr);
    let value =
        some_ptr.compare_exchange(expected, &mut desired, Ordering::Relaxed, Ordering::Relaxed);
    match value {
        Ok(_expected) => true,
        Err(_expected) => false,
    }
}

It compiles without error and clippy does not complain. There is no 'unsafe' anywhere.

But it crashes and burns with this inscrutable error message:

$ cargo run   --features=use_i32 200
Running Rust translation of queue.c...
thread 'thread 'thread 'thread 'thread 'thread '<unnamed><unnamed><unnamed>thread '<unnamed><unnamed><unnamed>thread '' panicked at '' panicked at '' panicked at '<unnamed>' panicked at 'T(100000000)=200
' panicked at '' panicked at '<unnamed>attempt to subtract with overflowattempt to subtract with overflowattempt to subtract with overflow' panicked at 'attempt to subtract with overflowattempt to subtract with overflowattempt to subtract with overflow

Likely there is something I don't understand about rayon and/or AtomicPtr. And there is always the possibility of a silly mistake elsewhere.

Any brave soul up for giving it a little review?


I tried to keep the code as much the same "shape" as the original C and kept the bad style names for easy comparison.

There is a makefile there to build and run all the C and Rust. Just type "make".


Regarding your __atomic_compare_exchange_n, it is a no-op. All that happens is that a new variable some_ptr is created (and its value is a copy of ptr), and then you modify the contents of that variable and throw it away.

I assume you intended to modify the contents of ptr? Since you have a mutable reference, the compiler guarantees that no other references alias with that pointer, so you can just write to it using *ptr = desired.

This would be a correct implementation according to your contract, and it has no data races.

fn __atomic_compare_exchange_n(ptr: &mut u32, expected: &mut u32, desired: u32) -> bool {
    if *ptr == *expected {
        *ptr = desired;
        true
    } else {
        *expected = *ptr;
        false
    }
}

Did you by any chance intend to share some data but somehow fail to actually share it?

What I need there is something that does the same as the gcc built-in function '__atomic_compare_exchange_n', which does the following:

This built-in function implements an atomic compare and exchange operation.
This compares the contents of *ptr with the contents of *expected and:
    if they are equal, writes desired into *ptr;
    if they are not equal, the current contents of *ptr are written into *expected.
    True is returned if desired was written into *ptr,
    False is returned otherwise.

https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/_005f_005fatomic-Builtins.html

The doc for AtomicPtr.compare_exchange looks like what I want:

Stores a value into the pointer if the current value is the same as the `current` value.

The return value is a result indicating whether the new value was written and containing the previous value. On success this value is guaranteed to be equal to `current`.

https://doc.rust-lang.org/std/sync/atomic/struct.AtomicPtr.html#method.compare_exchange

This is of course for updating a u32 that is shared by many threads.

The original code calls __atomic_compare_exchange_n in a loop and tests the returned boolean.

The AtomicPtr type is for swapping out pointers; if you want to change the value they point to, you need AtomicU32.

By having a mutable reference to the u32, you have written a proof that it is not shared between threads, and by the compiler not rejecting the code, this proof has been verified by the compiler.

(or alternatively if it really is shared between several threads, you have written a proof that your code is unsound, which I find unlikely if you have not used unsafe)
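Assuming the value really is shared, the gcc builtin's contract can be sketched on top of AtomicU32::compare_exchange like this (the function name and the little demo are illustrative, not code from the repo):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Same contract as the builtin: if the current value equals *expected,
// store `desired` and return true; otherwise write the current value
// back into *expected and return false.
fn compare_exchange_u32(val: &AtomicU32, expected: &mut u32, desired: u32) -> bool {
    match val.compare_exchange(*expected, desired, Ordering::Relaxed, Ordering::Relaxed) {
        Ok(_) => true,
        Err(current) => {
            *expected = current;
            false
        }
    }
}

fn main() {
    let g_min = AtomicU32::new(100);
    let mut expected = 100;
    assert!(compare_exchange_u32(&g_min, &mut expected, 42)); // succeeds
    assert!(!compare_exchange_u32(&g_min, &mut expected, 7)); // fails...
    assert_eq!(expected, 42); // ...and `expected` now holds the current value
}
```

Note the shared reference `&AtomicU32`: interior mutability is what lets several threads hold it at once, which `&mut u32` forbids.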


Please see this playground that demonstrates that your function is a no-op.

I read your code, and this is where it went wrong:

if (pow(log(pMax.into()), sqrtOf2) / log(p.into())) < fifteen {
    let mut yp: Factors = *xp;
    let mut g = *gMin;
    pool.spawn_fifo(move || {
        Twork(&mut yp, Tisn, &mut g);
    });
    return;
}

Since Factors is copy, the yp variable now contains a copy of xp. This copy is then moved into the closure, and the spawned closure now modifies this copy of xp, which of course will not modify xp itself. The same happened to gMin.

If you wish to share it, you should, well, share it. Of course, the compiler will prevent you from having mutable references to Factors in several threads, so you will have to resort to immutable references. Luckily you can define Factors like this:

pub struct Factors {
    s: u32,
    fmax: usize,
    i: usize,
    p: [AtomicU32; FNUM],
    n: [AtomicU8; FNUM],
}

(i.e. make every field you want to modify behind the shared reference atomic, I'm not quite sure which you are modifying, so I just changed some of them arbitrarily)

This will mean that since you have a &Factors, you can get an &AtomicU32 and then you can use AtomicU32::compare_and_swap to modify it through the shared reference.


Hmm... as far as I can tell the Factors structure is not shared across threads. In the original code threads are spawned with:

factors* yp = new factors;
*yp = xp;
Lc++;
pool->enqueue([yp] {
    Twork(*yp);
    delete yp;
});

Where 'pool' is a hand-crafted thread pool with a queue in which those lambda functions are lined up for execution. Surely that 'yp' is a copy of xp, which is what I do in Rust with 'let mut yp: Factors = *xp;'.

From there on the Factors gets passed down via mutable borrows.

The only thing that is shared across threads is gMin. Which I guess I should make into an 'AtomicU32'

Ah! In that case you are right: make gMin an atomic integer, and modification should work :slight_smile:
You may run into lifetime problems when spawning the thread, because sending references across threads is difficult to do correctly. It's possible with rayon, but it may require a refactor of the code; to avoid that, you can wrap gMin in an Arc. This only adds a runtime cost when creating or cloning the Arc, so if you have just one clone per thread, it should be pretty cheap.
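A sketch of that Arc<AtomicU32> pattern (plain std threads here for brevity; a task handed to rayon's spawn_fifo has a 'static bound that forces the same Arc dance; the minimum-finding workload is illustrative):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Several threads race to lower a shared minimum, as the C code does
// with gMin. Each thread gets one cheap Arc clone (a refcount bump).
fn par_min(candidates: &[u32]) -> u32 {
    let g_min = Arc::new(AtomicU32::new(u32::MAX));
    let handles: Vec<_> = candidates
        .iter()
        .copied()
        .map(|candidate| {
            let g_min = Arc::clone(&g_min); // one clone per thread
            thread::spawn(move || {
                // CAS loop: only ever lower the shared minimum.
                let mut cur = g_min.load(Ordering::Relaxed);
                while candidate < cur {
                    match g_min.compare_exchange(cur, candidate, Ordering::Relaxed, Ordering::Relaxed) {
                        Ok(_) => break,
                        Err(now) => cur = now,
                    }
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    g_min.load(Ordering::Relaxed)
}

fn main() {
    println!("{}", par_min(&[90, 30, 70, 10, 50])); // prints 10
}
```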

Thanks for the help Alice.

Durp, durp, of course. AtomicU32. No wonder the documentation was making no sense when I was reading about AtomicPtr and expecting it to be like the gcc built-in!

I had a feeling Arc would sneak in here eventually....


If you can spawn every thread using either join or inside the same rayon scope, you can avoid the Arc.


Hmm... not sure how that would look just now.

For now I'd like the code to remain as much the same "shape" as the C original so that I don't get lost. If Arc and that spawn_fifo thing work as I hope they do, perhaps I can at least get it running properly.

Then I might think about rearranging things...