Yes, at last, my Rust is faster than C!

I believe you can rearrange the code like this:

pub fn Tinv(n: u32) -> u32 {
    let mut x = Factors::new();
    // AtomicU32 uses interior mutability, so no `mut` is needed here.
    let gMin = AtomicU32::new(SMAX);

    rayon::scope(|scope| {
        Tqueue(&mut x, n, &gMin, scope);
    });

    gMin.into_inner()
}

fn Tqueue<'scope>(xp: &mut Factors, Tisn: u32, gMin: &'scope AtomicU32, scope: &Scope<'scope>) {
    // ...
    if (pow(log(pMax.into()), sqrtOf2) / log(p.into())) < fifteen {
        let mut yp: Factors = *xp;
        scope.spawn(move |_scope| {
            Twork(&mut yp, Tisn, gMin);
        });
        return;
    }
    // ...
}

(my link to the rayon scope in the previous post was wrong, sorry about that, I fixed it now)

Grrr.... more "line noise". I hate line noise in my code.

Let me try and get my head around that a while...

Normally, when using spawn functions that run something concurrently, the variables borrowed by the closure must live for the 'static lifetime; otherwise the spawned thread could outlive the reference.

However the rayon::scope function that I called in Tinv guarantees that all threads spawned inside the scope using the provided scope variable will have finished when the call to rayon::scope returns. Thus it is okay for those threads to borrow things from outside the scope, because e.g. gMin cannot be dropped while the rayon::scope function is still running.

In the type &Scope<'scope> we are saying that we are dealing with a scope, and the lifetime 'scope represents the lifetime of everything inside the surrounding rayon::scope call. Thus, if we have a reference with the lifetime 'scope, that reference must be valid for the entirety of the scope. Threads spawned by scope.spawn will finish before the rayon::scope call returns, so such a thread is included in this lifetime, and gMin is guaranteed to be valid for the duration of that thread.

So by adding 'scope to Tqueue we're basically saying “I want to be able to use the reference gMin inside threads spawned from the scope”, and the compiler guarantees you can only call Tqueue if gMin actually is valid for that long.

Thanks for that great explanation alice. It might have taken me a long while to glean all that from the docs. I must say that all this "scope: &Scope<'scope>" is a bit mind numbing.

The upshot is that I now have your suggestion in place and it all builds without complaint. It does not run, though. It seems to start threads and run work on them, then crashes out with a billion "subtract with overflow" errors.

I'm not at all clear about that atomic exchange thing. In the C code I found that this worked fine:

    while (xp.s < smin) {
        smin = __atomic_exchange_n (&gMin, xp.s, __ATOMIC_RELAXED);
    }

Which I now have in Rust as:

    while xp.s < smin {
        smin = gMin.swap(xp.s, Ordering::Relaxed);
    }

Is that appropriate?

I have to go through the code with a fine tooth comb. I likely have mistakes in there somewhere.

Basically the atomic swap is trying to perform this operation:

gMin = min(gMin, xp.s);

but atomically. The while loop works this way:

  1. The smin variable contains the last seen value of gMin (but some other thread may have changed it since then).
  2. We know that no other thread will ever increase the value of gMin, so if smin <= xp.s, we can stop as we then have gMin <= smin <= xp.s, thus our operation would leave gMin unchanged.
  3. Swapping unconditionally like you do in your new Rust code would not work, because what if we have this situation: gMin < xp.s < smin? This can happen if some other thread decreased gMin since we last saw it, and if we replace the value at gMin with xp.s, we performed our operation incorrectly (we increased gMin).
  4. However if we are in this situation: xp.s < smin = gMin, then replacing gMin with xp.s is fine, because that definitely decreases it. Luckily the atomic compare_and_swap method does exactly this:
fn compare_and_swap(&self, current: u32, new: u32, order: Ordering) -> u32 {
    // this but atomic
    if *self == current {
        *self = new;
        return current;
    } else {
        return *self;
    }
}
  5. The point is that self is only changed if it is equal to what we thought it was. So in this case we say: replace it with xp.s, but only if it's currently equal to smin, which is the last value we saw. If gMin is different from smin, we just update smin, re-evaluate whether we still need to update it, and try again if so.
  6. This does not deadlock, because if one thread fails to update, it means some other thread succeeded.

Thus you want this:

while xp.s < smin {
    smin = gMin.compare_and_swap(smin, xp.s, Ordering::Relaxed);
}

Note that this technique for collecting many values into their minimum could be encapsulated in a new type like this.

That said your panic is probably from something else, so hopefully Rust can tell you which line underflowed and maybe you can find your mistake using that.

OK. compare_and_swap it is.

Rust is trying to tell me something with a few hundred lines of messages like this:

thread 'thread 'thread 'thread 'thread '<unnamed><unnamed><unnamed><unnamed>thread '<unnamed>' panicked at 'thread '' panicked at '' panicked at '' panicked at '<unnamed>' panicked at 'attempt to subtract with overflow<unnamed>attempt to subtract with overflowattempt to subtract with overflowattempt to subtract with overflow' panicked at 'attempt to subtract with overflow', ' panicked at '', ', ', attempt to subtract with overflow', src/queue.rsattempt to subtract with overflowsrc/queue.rssrc/queue.rssrc/queue.rs', src/queue.rs:', :::src/queue.rs:46src/queue.rs464646:46:::::46:2146212121:21
:


21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
21

thread 'thread 'thread 'thread 'thread '<unnamed>thread 'thread '<unnamed><unnamed><unnamed><unnamed>' panicked at '<unnamed><unnamed>' panicked at '' panicked at '' panicked at '' panicked at 'attempt to subtract with overflow' panicked at '' panicked at 'attempt to subtract with overflowattempt to subtract with overflowattempt to subtract with overflowattempt to subtract with overflow', attempt to subtract with overflowattempt to subtract with overflow', ', ', ', src/queue.rs', ', src/queue.rssrc/queue.rssrc/queue.rssrc/queue.rs:src/queue.rssrc/queue.rs::::46::46464646:4646::::21::21212121
2121

I gather it is complaining about the subtract in queue.rs line 46. That is in a harmless little helper function, which is good; it's the same as in the non-threaded version.

Enabling RUST_BACKTRACE produces 20 thousand lines of output. This could take some time....

Try with:

# Cargo.toml
[profile.dev]
panic = "abort"

(which will make the first panic abort the program to avoid having panics from other threads polluting the output), and then:

RUST_BACKTRACE=full cargo run

Great tip. Unfortunately it removes useful information.

In the full trace I can make out the following in the noise, repeated many times:

  14: tatami_rust::queue::tfree
             at src/queue.rs:46
  15: tatami_rust::queue::T
             at src/queue.rs:79
  16: tatami_rust::queue::Twork
             at src/queue.rs:111
  17: tatami_rust::queue::Twork
             at src/queue.rs:134
  18: tatami_rust::queue::Twork
             at src/queue.rs:122
  19: tatami_rust::queue::Twork
             at src/queue.rs:134
  20: tatami_rust::queue::Twork
             at src/queue.rs:122
  21: tatami_rust::queue::Twork
             at src/queue.rs:122
  22: tatami_rust::queue::Tqueue::{{closure}}
             at src/queue.rs:162

Using abort the only line in my code I see is:

  14:     0x7fc2a01f83e6 - tatami_rust::queue::tfree::he9ddb25726b4164c
                               at src/queue.rs:46

Either way it does not help much. It just shows that threads start, they run through the expected chain of calls and then fail on subtract in a little leaf function.

Clearly I have something wrong in that chain, just got to find it...


BINGO! Got it!

On a Raspberry Pi 3B: The threaded Rust solution:

$ time ./target/release/tatami_rust 200
T(85765680)=200

real    0m1.243s
user    0m4.826s
sys     0m0.032s

The original threaded C solution

$ time ./queue
T(85765680)=200

real    0m1.127s
user    0m4.148s
sys     0m0.012s

A speed-up of about 3.4 over the single-threaded version, on a 4-core machine. Not bad!

Once again we find Rust on the Pi is about 10% slower than C on ARM. On the x86_64 PC they are neck and neck.

This will no doubt get Rust very near the top of the rankings in the little performance challenge going on here: Tatami Pi Fun. - Page 10 - Raspberry Pi Forums

It also inspires further confidence that I won't be let down for having bet our little company's business on Rust.

Thanks for all the help alice and all.

You are computing (k - 1) * (n + 1), which panics when k = 0 (a u32 subtraction that underflows panics in debug builds). You may need to special-case that situation, for instance, doing

k   .checked_sub(1)
    .map(|it| it * (n + 1))
    .unwrap_or(value_when_k_is_0)

or just k.saturating_sub(1) * (n + 1) if value_when_k_is_0 == 0


EDIT: I just noticed there is a division by k just before, so nevermind this post. Still, it seems that for different reasons, the case k = 0 ought to be handled.

Hell will freeze over before I want to replace a simple arithmetic expression with all that indecipherable map and lambda functional programming stuff.

I'm not sure if it is by accident or design but the case that k = 0 never happens when things are working right. It might as well just panic and make it known something is wrong.

I guess an assertion there would be appropriate.

Yes, go and use debug_assert!, they help debug code while not affecting --release performance :wink:

debug_assert_ne!(k, 0, "Hell is freezing!!");

Ha! Love it.


I'm really enjoying this thread! Learning a lot about, well, threads.


News just in...

It's reported that this solution now runs on a Pi 4 in almost exactly the same time as the winning C solution in the performance chart here: Tatami Pi Fun. - Page 10 - Raspberry Pi Forums

pi@debian-buster-64:~/tatami-rust $ time target/release/tatami_rust 200
T(85765680)=200

real    0m0.549s
user    0m2.137s
sys     0m0.004s

Well done everybody !


Scaling up the original Project Euler problem #256 Tatami-Free Rooms - Project Euler to solve T(S) = 1000 needs a move to 64-bit primes. The Rust solution using a rayon thread pool handily beats the current champion written in C and using pthreads. On my x86_64 PC:

pqplum64 (C + pthreads)
T(63405342000)=1000

real 1m49.223s
user 12m14.781s
sys 0m0.219s

tatami_rust_threaded64 (Rust + rayon)
Using 64 bit integers.
PNUM = 40000
FNUM = 20
SMAX = 100000000000
Pr(40000)=479909
Running Rust translation of queue.c...
Using 8 cores.
T(63405342000)=1000

real 1m36.383s
user 11m53.078s
sys 0m0.281s

That is again about 3.5 times faster than the fastest single-threaded C code we have.

Once again, thanks for the input everybody.

However, the situation on ARM is dire. There the threaded Rust solution is slower than the single-threaded one!

That is perhaps a topic for another thread....


Punny, although the report is not funny.

