So in a recent conversation here I was reminded of an old Rust project in my GitHub account and thought I would check it out again, just for fun: fibo_4784969 in Rust.
So I cloned it to my new Mac M4 and tried to build it. Sadly, out of the box it fails with errors that are unfathomable to me:
Compiling num-bigint v0.4.0
error[E0308]: mismatched types
--> /Users/Michaael/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/num-bigint-0.4.0/src/biguint/convert.rs:70:19
|
70 | .div_ceil(&big_digit::BITS.into())
| -------- ^^^^^^^^^^^^^^^^^^^^^^^ expected `u64`, found `&_`
| |
| arguments to this method are incorrect
|
= note: expected type `u64`
found reference `&_`
note: method defined here
--> /Users/Michaael/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/num/mod.rs:1151:5
|
1151 | / uint_impl! {
1152 | | Self = u64,
1153 | | ActualT = u64,
1154 | | SignedT = i64,
... |
1168 | | bound_condition = "",
1169 | | }
| |_____^
= note: this error originates in the macro `uint_impl` (in Nightly builds, run with -Z macro-backtrace for more info)
help: consider removing the borrow
|
70 - .div_ceil(&big_digit::BITS.into())
70 + .div_ceil(big_digit::BITS.into())
|
error[E0308]: mismatched types
--> /Users/Michaael/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/num-bigint-0.4.0/src/biguint/convert.rs:585:19
|
585 | .div_ceil(&u64::from(bits))
| -------- ^^^^^^^^^^^^^^^^ expected `u64`, found `&u64`
| |
| arguments to this method are incorrect
|
note: method defined here
--> /Users/Michaael/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/num/mod.rs:1151:5
|
1151 | / uint_impl! {
1152 | | Self = u64,
1153 | | ActualT = u64,
1154 | | SignedT = i64,
... |
1168 | | bound_condition = "",
1169 | | }
| |_____^
= note: this error originates in the macro `uint_impl` (in Nightly builds, run with -Z macro-backtrace for more info)
help: consider removing the borrow
|
585 - .div_ceil(&u64::from(bits))
585 + .div_ceil(u64::from(bits))
|
error[E0308]: mismatched types
--> /Users/Michaael/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/num-bigint-0.4.0/src/biguint/convert.rs:613:19
|
613 | .div_ceil(&u64::from(bits))
| -------- ^^^^^^^^^^^^^^^^ expected `u64`, found `&u64`
| |
| arguments to this method are incorrect
|
note: method defined here
--> /Users/Michaael/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/num/mod.rs:1151:5
|
1151 | / uint_impl! {
1152 | | Self = u64,
1153 | | ActualT = u64,
1154 | | SignedT = i64,
... |
1168 | | bound_condition = "",
1169 | | }
| |_____^
= note: this error originates in the macro `uint_impl` (in Nightly builds, run with -Z macro-backtrace for more info)
help: consider removing the borrow
|
613 - .div_ceil(&u64::from(bits))
613 + .div_ceil(u64::from(bits))
|
error[E0308]: mismatched types
--> /Users/Michaael/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/num-bigint-0.4.0/src/biguint.rs:398:54
|
398 | let root_scale = extra_bits.div_ceil(&n64);
| -------- ^^^^ expected `u64`, found `&u64`
| |
| arguments to this method are incorrect
|
note: method defined here
--> /Users/Michaael/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/num/mod.rs:1151:5
|
1151 | / uint_impl! {
1152 | | Self = u64,
1153 | | ActualT = u64,
1154 | | SignedT = i64,
... |
1168 | | bound_condition = "",
1169 | | }
| |_____^
= note: this error originates in the macro `uint_impl` (in Nightly builds, run with -Z macro-backtrace for more info)
help: consider removing the borrow
|
398 - let root_scale = extra_bits.div_ceil(&n64);
398 + let root_scale = extra_bits.div_ceil(n64);
|
For more information about this error, try `rustc --explain E0308`.
error: could not compile `num-bigint` (lib) due to 4 previous errors
warning: build failed, waiting for other jobs to finish...
Trying to load the project into my Zed editor results in rust-analyzer hanging forever with the message "Building compile-time-deps".
As if by magic, cargo update and cargo clean fixed it. Thanks!
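For anyone else who hits this: my understanding (an assumption, not checked against the num-bigint changelog) is that Rust 1.73 stabilized an inherent `u64::div_ceil` that takes its divisor by value, and inherent methods win method resolution over the `num-integer` trait method `div_ceil(&self, &Self)` that num-bigint 0.4.0 was calling with a reference. `cargo update` pulls in a patch release where those calls were fixed. A minimal illustration of the inherent method:

```rust
fn main() {
    // The inherent method (stable since Rust 1.73) takes the
    // divisor by value: 100 / 64 rounded up is 2.
    let bits: u64 = 64;
    assert_eq!(100u64.div_ceil(bits), 2);
    // Writing `100u64.div_ceil(&bits)` would fail with exactly the
    // E0308 "expected `u64`, found `&u64`" above, because method
    // resolution now picks the inherent method, not the trait one.
    println!("100.div_ceil(64) = {}", 100u64.div_ceil(bits));
}
```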
Now the question is: Why is the Rust version 25% slower than my C++ solution?
% time ./c++/fibo_karatsuba > /dev/null
./c++/fibo_karatsuba > /dev/null 0.40s user 0.00s system 99% cpu 0.401 total
% time ./rust/target/release/fibo_4784969 > /dev/null
computing F(4784969): 0.523s
printing F(4784969): 0.002s
./rust/target/release/fibo_4784969 > /dev/null 0.53s user 0.00s system 99% cpu 0.530 total
Note that they are both using handcrafted big integer maths, as was the spirit of the original challenge.
On my Jetson AGX Orin, where I can use OpenMP for parallel processing, it's even worse:
$ time ./c++/fibo_karatsuba_omp > /dev/null
real 0m0.097s
user 0m0.638s
sys 0m0.009s
$ time ./rust/target/release/fibo_4784969 > /dev/null
computing F(4784969): 1.005s
printing F(4784969): 0.004s
real 0m1.015s
user 0m0.999s
sys 0m0.013s
The Million Digit Fibonacci Challenge is still open for anyone bored enough to tackle it.
I have a question about target-cpu=native: I read somewhere that the compiler can emit multiple different implementations of code blocks in the binary and use runtime detection of CPU features to select which implementation to run (allowing a more portable binary, at the cost of some extra runtime checks and a slightly larger binary). Is this true?
Back in the day, IIRC, the C compiler I used had a similar feature, but I believe the compiler identified the CPU type and optimized accordingly (meaning a much less portable binary).
target-cpu=native is the option that tells the compiler to target the CPU type of the build host. It would not make sense for it to add run-time detection, because by definition the option trades away portability for whatever the host CPU can offer.
My understanding is that if you want to do multiple variants then there isn't a complete story there, but it’s going to look something like building on top of #[target_feature] — explicitly asking for variants of a function to be compiled, and then explicitly dispatching to them after feature detection. Implicit dispatching would run the risk of putting the branching in the wrong place (inside rather than outside a hot loop).
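A minimal sketch of what that explicit story looks like today, assuming x86_64 and an illustrative function name (nothing here is a rustc-provided mechanism): one copy of the function compiled with a feature force-enabled via #[target_feature], plus a fallback, with the dispatch done once, outside the hot path.

```rust
// Explicit multiversioning: compile an AVX2 variant and a plain
// variant, then branch once on a runtime CPU-feature check.
#[cfg(target_arch = "x86_64")]
fn sum(xs: &[u64]) -> u64 {
    // Compiled with AVX2 enabled regardless of the crate-wide
    // target; the autovectorizer may use AVX2 in the body.
    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(xs: &[u64]) -> u64 {
        xs.iter().sum()
    }

    if is_x86_feature_detected!("avx2") {
        // Safe: we just verified the CPU supports AVX2.
        unsafe { sum_avx2(xs) }
    } else {
        xs.iter().sum() // portable fallback
    }
}

// Non-x86 targets just get the portable version.
#[cfg(not(target_arch = "x86_64"))]
fn sum(xs: &[u64]) -> u64 {
    xs.iter().sum()
}

fn main() {
    let xs: Vec<u64> = (1..=10).collect();
    assert_eq!(sum(&xs), 55);
    println!("sum = {}", sum(&xs));
}
```

In real code you would hoist the `is_x86_feature_detected!` check out of any loop (or cache it in a function pointer), which is exactly the "branching in the right place" concern above.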
There's no automation for this at the compiler level in general (not in C either). You can get some library routines that select appropriate optimized implementations either via runtime detection or linker magic (e.g. memcpy on macOS uses this), but that's implemented manually by the libraries.
Automatically you may get multiple versions of a loop, with or without SIMD, selected by iteration count or alignment, but the CPU instruction set must already be known at compile time. Runtime CPU detection is too tricky and potentially slow to just sprinkle all over the place.
While I prefer target-cpu=znver5 for online compilers (for any language) because it's short, I believe the correct default target should be x86-64-v3. It covers Intel processors since Haswell from 2013, and virtually every AMD processor people care about. (Is there any Bulldozer lover?) Ideally it should be x86-64-v4, but Intel being a troll with its little cores makes this target unusable for most of their customer lineups.
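For reference, these are the standard rustc flags being discussed; a config fragment, nothing project-specific assumed:

```shell
# Tune for the build host only: fastest, least portable.
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Portable baseline covering Haswell (2013) and later, per the
# x86-64-v3 suggestion above.
RUSTFLAGS="-C target-cpu=x86-64-v3" cargo build --release
```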
Sorry, I have not understood your actual performance issue. You had C++ code which used OpenMP, and compared that to plain Rust code? And are your C++ code and Rust code both using the LLVM backend, or are you compiling your C++ code with a different compiler?
It's unrelated to what we are discussing here, but just to clarify what probably happened in the broken-phone story: the Intel compiler always had this “split functions with auto-selection” trick, and it was half-implemented by both GCC and LLVM, but, importantly, not by Clang (it's called multiversioning and requires manual annotations in GCC and Clang). Intel nowadays uses LLVM as the backend for its own compiler, and it can do these “autosplit” tricks, but AFAIK no one added them to rustc or Clang.
Sorry, I have not been very clear. This all started back in 2020 with the "Million Digit Fibonacci Challenge" on the Raspberry Pi user forums, which was just a bunch of us challenging each other to show solutions in our favourite languages that ran on a Raspberry Pi 4 computer. A rule of the challenge was that no non-standard libraries could be used.
Later a Rust solution was offered by Pi forum user gvissers.
Nowadays I have no Raspberry Pi around, but Mac M4 laptops and Nvidia Jetsons.
On the Macs compiling C++ with g++ is actually using Clang and LLVM (Apple clang version 17.0.0). On Jetsons g++ will be using GCC.
So, the performance issue is that the Rust solution is 25% slower than the C++ one. That may be down to things like the fact that I implemented my own arrays/vectors for storing the big integers, which use the stack rather than heap allocation for numbers below a certain size. I don't know.
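For anyone curious what that small-size optimization looks like, here is a rough sketch in Rust: small numbers keep their digits inline on the stack, and only spill to the heap past a threshold. The type and names are purely illustrative, not taken from either solution.

```rust
// Hypothetical small-buffer storage for big-integer digit limbs:
// up to 4 limbs live inline; larger numbers move to a heap Vec.
enum Digits {
    Inline { len: usize, buf: [u64; 4] }, // small: no allocation
    Heap(Vec<u64>),                       // large: heap-backed
}

impl Digits {
    fn new() -> Self {
        Digits::Inline { len: 0, buf: [0; 4] }
    }

    fn push(&mut self, d: u64) {
        match self {
            // Room left in the inline buffer: store on the stack.
            Digits::Inline { len, buf } if *len < buf.len() => {
                buf[*len] = d;
                *len += 1;
            }
            // Inline buffer full: spill everything to the heap once.
            Digits::Inline { len, buf } => {
                let mut v = buf[..*len].to_vec();
                v.push(d);
                *self = Digits::Heap(v);
            }
            Digits::Heap(v) => v.push(d),
        }
    }

    fn as_slice(&self) -> &[u64] {
        match self {
            Digits::Inline { len, buf } => &buf[..*len],
            Digits::Heap(v) => v,
        }
    }
}

fn main() {
    let mut d = Digits::new();
    for i in 0..6 {
        d.push(i); // spills from stack to heap at the 5th limb
    }
    assert_eq!(d.as_slice(), &[0, 1, 2, 3, 4, 5]);
    println!("limbs = {}", d.as_slice().len());
}
```

The win is that the vast majority of intermediate values in a recursive Fibonacci/Karatsuba computation are small, so most of them never touch the allocator; whether that fully explains a 25% gap is another question.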
There is no competing with GMP; that is massively faster. As it happens, the Rust solution has a command line parameter to select from various big number libs, including GMP. Using GMP gets the run time down from 438ms with the handcrafted big integers to 74ms!
But using GMP is against the rules of the challenge.