I'm definitely showing my systems programming ignorance here with this question. What I'm wondering is if there is a performance hit when using a u32 (or any x32 int, etc) when the chip is x86_64? It would seem like the OS or chip would have to do additional work.
For x86_64 there almost certainly is not a performance hit. If you have multiple
u32s packed in an array, slice or struct, there many be a performance improvement due to better cache locality (since twice as many
u32s fit in each cache line as
Does this caching come for free, or is it something you have to explicitly ask for?
The caching is automatically provided by the CPU chipset, and may not work if you bough your CPU before 1990. But even without CPU cache, loading millions of u64s from memory would takes double amount of time than loading millions of u32s.
You may also gain a performance benefit from vectorising and SIMD.
For example, SSE gives you 128-bit registers for SIMD operations... This means the processor could load 4x
u32s into a register and execute an operation on all 4 integers in one instruction. In comparison if you used
u64 it could only apply the operation to 2 integers at a time. Later SIMD instruction sets (SSE2, SSE3, AVX-512, etc.) can work with larger numbers of values.
The compiler will automatically apply these sorts of optimisations when it believes they're valid and would improve performance.
Certainly I have seen significant slow downs when operating on large arrays of 64 bit data. Most likely attributable to cache pressure.
Thank y'all for all the information. It's really good stuff.
As usual, don't forget to measure the performance of whatever you are doing yourself. Better not to take anyone's word for it.
Perhaps there is a performance hit. Perhaps in your application the impact of that is negligible and taking the hit is more convenient.
I'm sure we could make micro-benchmarks to demonstrate that either is faster than the other in different circumstances.
There is a performance hit when
u32 is used for array indexing inside loops. That's because overflow behavior of
u32 is different than overflow of
usize, and the compiler has to pointlessly emulate that.
Could you give an example for that? Some small toy examples I wrote don't support that claim?
Thanks, that sounds like an interesting read
Half a year ago some guys and I were having fun pitting C against Rust in creating solutions to the Project Euler problem number 256 : Tatami-Free Rooms problem : https://projecteuler.net/problem=256
I ended up with a Rust solution that matched the performance of C. It could be built to use 32 or 64 bit arithmetic.
32 bit math was faster on smaller problems. 64 bit bit was required for bigger problems.
My Rust code and the 32 bit and 64 bit solutions are here https://github.com/ZiCog/tatami-rust if you want to check it out and play with it.
Along with instructions as to how to build the same Rust code so that it uses either 32 or 64 bit math. Basically I generate code containing types and constants form a build.rs file. If anyone knows a better way to do that it would be great to hear.
This is not the first time I have found 32 bit math to be faster.
For the most part, the compiler's actually free to upgrade your integer types to whatever's best, at least for local usage.
When you tell rustc to use
u32, it will almost certainly be a 32 bit integer when stored in (observable) memory. But when it's loaded into registers to actually do work with, the optimizer can pick any way of doing the arithmetic that gives the same result.
If adding two
u32, the compiler is free to choose between a 32 bit add (that preserves the upper 32 bits of the 64 bits of scratch space) or a 64 bit add (which doesn't) with a 32 bit store (which discards the upper 32 bits), depending on which it expects will be faster.
Compilers are really good at optimizing your code, especially when it's mostly math. They're almost certainly better at microoptimising your code than you are. Write the code that you want to read and expresses intent first, then you can adjust it if profiling shows that it's necessary.
As a general rule, though, smaller numbers are typically easier to work with because the processor is capable of doing more with its limited memory budget.
If you're curious, here's a good video overview of what awesome things compilers do every compile:
The only trouble I have with using the most appropriate integer type for "mostly maths" is when a result of the mathematics is going to be used as indexes into an array at some point. For example using
u16 as an index requires explicit casts. So I can end up using
usize everywhere even if that doesn't best express the intent, simply because it's more readable overall.
I would use the compact storage representation, together with a short macro to do the
usize conversion at the point of indexing. See my earlier post on this.
Just as a reminder, the Euler folks respectfully request that solutions to problems > 100 should not be posted online.
That's interesting. I'm guessing most people would store their solutions in version control, so are they referring to anywhere online (e.g. GitHub) or just public forums?
Their wording is very broad: Solutions are not to be posted online where they can be found by others. Compliance is generally excellent. It's a real badge of honor to say that you've solved all the Euler problems, and obviously that wouldn't be the case if you could just Google the answers.
Ah, oops, Had I read that before I would have respected it their wishes. But I had not read that. In fact I have not read any of the Project Euler pages much. This was just a piece of code that was under discussion for many weeks on another forum https://www.raspberrypi.org/forums/viewtopic.php?t=240287&p=1563671 where people were comparing implementations in various programming languages.
I don't claim any badge of honor for that solution. It was just piece of code that looked interesting to recast from C to Rust as an exercise in becoming familiar with Rust. And to demonstrate to myself that Rust would perform as well as C.
To be honest I don't fully understand how it works.
What to do? I guess I should take down my repo there. But the code is all over the place in various forms now.
Or in the spirit of Euler I might just say "meh, here is a solution, without any rigorous proof".