My new year's resolution was to become a Rustacean! So I started reading the Rust book.
So far I've learnt that there are various integer types and what their characteristics are (signed/unsigned; 8-, 16-, 32-, 64-, and 128-bit; the 'size' types).
However, what is still unclear to me is how these types are used in practice. Do you really always choose the smallest type (in bits) that's big enough to get the job done? Or do you just use 32-bit integers because there isn't really a performance difference between these types anyway?
What about the 'size' types? I read that they should be used for numbers that have something to do with memory. What? Why? I understand the characteristic, that usize becomes 32-bit or 64-bit depending on the platform, but how is this even beneficial, and why wouldn't we just use the size types everywhere?
By far, the most common type you'll see in the standard library and most public APIs is usize, but that shouldn't be taken as a general recommendation; it's only because a lot of numbers in public APIs are either indices or lengths.
usize is for indices and lengths
32-bit integers are useful for indices into things that you know won't exceed 2^32 elements, when many of them need to be stored somewhere and performance is required. When this is done, it's not uncommon to define a newtype wrapper or a type alias (e.g. in petgraph, codespan) for your indices so that you can enforce this choice more easily and document the reason behind it.
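A minimal sketch of such an index newtype; the name `NodeIdx` and its methods are illustrative, not petgraph's actual API:

```rust
// A tiny index newtype, loosely modeled on what crates like petgraph do.
// The u32 choice is enforced in exactly one place, and documented here.
#[derive(Copy, Clone, PartialEq, Eq, Hash, Debug)]
struct NodeIdx(u32);

impl NodeIdx {
    fn new(i: usize) -> Self {
        // Fallible conversion instead of a silently truncating `as` cast.
        NodeIdx(u32::try_from(i).expect("index exceeds u32::MAX"))
    }
    fn index(self) -> usize {
        self.0 as usize
    }
}

fn main() {
    let nodes = vec!["a", "b", "c"];
    let idx = NodeIdx::new(2);
    // Convert back to usize only at the point of actual indexing.
    println!("{}", nodes[idx.index()]);
}
```

The type also stops you from accidentally mixing up indices into different collections, which a bare `u32` wouldn't.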
If you're doing integer math, like number theory and stuff, use i64.
I see no reason to ever use isize other than for the (extremely few) standard library functions that ask for one. If I need to subtract two usizes and the result could be negative, i64 could only be more correct than isize, never less correct.
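A sketch of that subtraction; `signed_diff` is a made-up helper name:

```rust
// Widen both operands to i64 before subtracting, so a negative result
// is representable. Assumes both values fit in i64, which holds for
// any real in-memory length or index.
fn signed_diff(a: usize, b: usize) -> i64 {
    a as i64 - b as i64
}

fn main() {
    // `3usize - 7usize` would panic in debug builds (and wrap in release);
    // the widened version just yields -4.
    println!("{}", signed_diff(3, 7));
}
```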
i32 is a reasonable default -- it's even what Rust will assume a literal is, if nothing forces it to be something else.
So you might as well use it, unless something else pushes you to a different type. The two common places where that will show up:

1. You're indexing by it: use usize for things that you index by, because needing to cast any other type is enough of a pain to just not bother.

2. You're storing a large enough set of them for the memory and cache usage to matter, at which point you should think more about the range you actually need. (A heightmap in a game probably doesn't need 32-bit resolution for the heights, for example.)
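Those two points can be sketched together; the heightmap layout here is a made-up illustration:

```rust
fn main() {
    // A heightmap stored as u8: 256 height levels is plenty here, and the
    // map takes a quarter of the memory an i32 version would.
    let width: usize = 4;
    let heights: Vec<u8> = vec![10, 20, 30, 40, 50, 60, 70, 80];

    // Indexing is done with usize, so no casts are needed at the call site.
    let (x, y) = (1usize, 1usize);
    let h = heights[y * width + x];
    println!("{h}");
}
```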
Honestly, I find that I never need signed integers in normal computer science stuff, so I basically always use unsigned numbers. (And when I do need negatives, it's more often geometry-type stuff that fits better as floating-point than as integers anyway.)
But that's a more controversial opinion. C#, for example, says that you should basically always use (32-bit) int no matter what, and even for things where that's not OK, like file positions, it uses a signed long instead of the unsigned ulong that would seem more logical to me.
So I basically just avoid the question at the first level of advice.
Google's C++ style guide says to avoid unsigned types in most cases, even when negative values are unreasonable, iirc primarily because of C/C++'s dumb approach to signed vs. unsigned overflow and how this affects optimization. Rust thankfully has much better ways of dealing with this, so I see no reason to ever avoid u32 et al.
This is also due to C/C++ implicitly and silently converting between signed and unsigned integer types without checking their values, meaning that there is no real type safety in the notion of an integer that cannot be negative. Rust also fixes this quirk.
In C++ I’d only use unsigned types for bit manipulation, large values that wouldn’t fit in signed types (when simply widening the integer wasn’t an option), algorithms that depend on unsigned overflow, or where the standard library demanded it (i.e. std::size_t and *::size_type, whose unsignedness is widely acknowledged as a mistake in the C/C++ world).
In Rust, I can happily and safely use unsigned types where I want non-negative integers.
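A small sketch of what makes that safe in practice: unsigned underflow doesn't silently produce garbage, and the `checked_*`/`saturating_*` families make the failure case explicit.

```rust
fn main() {
    let len: u32 = 5;

    // `len - 10` would panic in debug builds instead of silently wrapping.
    // checked_sub makes the failure case a value you have to handle:
    assert_eq!(len.checked_sub(10), None);
    assert_eq!(len.checked_sub(3), Some(2));

    // saturating_sub is another option when clamping to 0 is the right answer.
    assert_eq!(len.saturating_sub(10), 0);
}
```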
It's worth pointing out that this is a constraint imposed by the runtime, not the C# language.
The Common Language Runtime (the VM that C# runs on) is used by other languages which don't have the concept of an unsigned integer like VB 6, so to maintain portability it became common practice to use signed integers everywhere in C#.
You can't use usize and isize when you have to use a specific size of integer. For example:
When interfacing with a C API that takes or returns uint8_t, int16_t, etc. If a function returns a uint8_t but you call it as if it returned a usize (which is specified to be at least 16 bits wide), you'll get Undefined Behavior.
When implementing a protocol that expects a given size of integer; e.g. binary serialization formats often use 32-bit integers regardless of the platform-native word size. If you use a usize on a 64-bit platform but the serialization format expects a 32-bit integer, you will end up with a corrupted payload.
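A sketch of that kind of fixed-width encoding; the 4-byte little-endian length field is a made-up format detail:

```rust
fn main() {
    // A length that lives as usize in memory, but must go on the wire as a
    // fixed 32-bit little-endian integer, on every platform.
    let len: usize = 300;

    // Convert explicitly and fallibly, instead of truncating with `as`.
    let wire: u32 = u32::try_from(len).expect("length exceeds u32::MAX");
    let bytes = wire.to_le_bytes(); // always exactly 4 bytes
    assert_eq!(bytes, [44, 1, 0, 0]); // 300 = 0x012C

    // The receiving side decodes back into its own native usize.
    let decoded = u32::from_le_bytes(bytes) as usize;
    assert_eq!(decoded, 300);
}
```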
Otherwise, when working with iterators and collections (counts, indexing, etc.), it's often worth sticking with usize and isize: they are guaranteed to be big enough (they don't limit everything to 4 billion elements on a 64-bit platform), but unlike u64, they don't incur the extra penalty of moving and operating on two separate integers on 32-bit platforms.
One interesting exception is: for database identifiers, I unconditionally use 64-bit integers, because I don't want my database IDs to be platform-dependent, and the performance penalty of serializing/deserializing 1 vs 2 integers is negligible compared to basically anything a database ever does.
I'm just left with one last question that I have difficulty wrapping my head around.
because it's both guaranteed to be big enough (it doesn't limit everything to 4 billion on a 64-bit platform)
I understand what you mean here, but on the other hand, isn't it often the case that my indexes go up to about the same number no matter what system they run on, simply because the business logic demands it, and that logic is the same on every machine?
So in that case, aren't I actually opening the program up to errors that are hard to detect (index overflow on 32-bit systems, but not on 64-bit systems)? I'd still have to check somewhere whether the number can exceed the 32-bit range or not. And if it can't anyway, I could also just directly use u32, right?
If indexes are real, consecutive, non-negative indexes or counts of in-memory objects, then there is no way you can create enough objects for an index overflow to occur, simply because there is not going to be enough memory. 32-bit systems can usually handle only 2^32 bytes, so unless you have zero-sized objects, one object will take up at least 1 byte, and you will run out of physical memory before being able to allocate enough objects for their index or count to overflow.
And even when this is not quite true because e.g. your OS is willing to use swap space on disk, Rust still relies extensively on something like this assumption, even in unsafe code in the implementation of the standard library. So I would think there's some sort of protection against it, for instance, the Rust allocator simply not giving you more memory than usize can represent, and OOM'ing instead of swapping, or something like that.
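One concrete form of that protection can actually be demonstrated: the standard collections refuse allocations whose size in bytes would exceed isize::MAX, so a Vec can never hold enough elements for a usize index to overflow. `try_reserve` surfaces this as an error instead of aborting:

```rust
fn main() {
    let mut v: Vec<u8> = Vec::new();

    // Asking for usize::MAX bytes exceeds the isize::MAX allocation cap,
    // so this fails with a capacity-overflow error before any allocation
    // is even attempted.
    let result = v.try_reserve(usize::MAX);
    assert!(result.is_err());
    println!("refused: {:?}", result);
}
```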
Where this problem can occur is already covered by the example I gave with respect to database keys. What you have to keep in mind is that within a given environment a Rust program runs in, using usize remains consistent out of necessity (and somewhat by design). The problem essentially arises when you want to externalize counts and indexes.
That's exactly what I meant by "I don't want my database IDs to be platform-dependent" in my previous post, there really is nothing more to it – and of course, databases are not the only situation in which platform-independent integer size is important. In fact, as soon as any data leaves the runtime environment you are developing and testing on, you have to ensure it's in a stable format digestible by others. This applies to serialization formats, files sent over the network, etc. This is why serialization formats and network protocols usually resort to fixed-width encoding for integers, but the principle applies more generally to all data types.
It's very unlikely I'll encounter such a case, but I've been thinking that, in theory, if the CPU/OS supports paging memory in/out, you could store more data in memory than you can address. Then it'd be the same argument as using u64 for DB ids, since the storage system can address more elements than the CPU can. That said, I'm not sure std could/should be ported to a memory-paged system.
There are lots of reasons to use the various integer types. If you have something you know will always fall in a specific range, deciding which integer type to use can depend on how many values of that type you expect to have at any given time. For example, if you have something that will only ever have fewer than 256 unique values, but you need millions of them, then using u8 vs. usize can have a huge impact on memory usage.
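A quick sketch of that memory impact (the usize figure assumes a typical 64-bit platform):

```rust
use std::mem::size_of;

fn main() {
    const N: usize = 1_000_000;

    // One million values that each fit in a byte:
    let small: Vec<u8> = vec![0u8; N];
    let big: Vec<usize> = vec![0usize; N];

    // On a 64-bit platform this is 1_000_000 vs 8_000_000 bytes of element
    // storage -- an 8x difference from the type choice alone.
    println!("u8:    {} bytes", small.len() * size_of::<u8>());
    println!("usize: {} bytes", big.len() * size_of::<usize>());
}
```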
Also, for things that naturally fall into the exact range of specific integer types, like variables in emulation software that represent actual computer hardware, using an appropriate type can help avoid problems caused by invalid values that an actual machine could never generate.
In a nutshell, they are all useful. It depends completely on the situation, and there is no one-size-fits-all approach.
I also wanted to point out that as a general rule, programmers are bad at accounting for arithmetic overflow possibilities in their code. This isn't because programmers are bad at math, but because the complexity explodes exponentially as you perform more arithmetic. Take for example the following simple equation:
x - y + 10
Let's assume x and y are unsigned integer types of some size. How can it overflow?
Well, y might be more than 10 larger than x, so the subtraction underflows (the mathematical result is negative, which an unsigned type can't represent, so it wraps around to a huge value instead). Another possibility is that x is big enough and y small enough that the result overflows when 10 is added to their difference. Obviously this is a simple formula, yet it has two different failure cases. More complex math results in more failure possibilities. To make matters worse, the order in which you perform the operations can matter for overflow purposes, even when it doesn't matter from a pure mathematical standpoint. These things conspire to make overflows a royal pain.
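Both failure cases can be made explicit with Rust's checked arithmetic; `expr` is just an illustrative name:

```rust
// Each step uses checked math, so either failure mode of `x - y + 10`
// surfaces as None instead of wrapping silently.
fn expr(x: u32, y: u32) -> Option<u32> {
    x.checked_sub(y)?.checked_add(10)
}

fn main() {
    assert_eq!(expr(7, 3), Some(14));
    assert_eq!(expr(3, 7), None);        // y exceeds x: subtraction underflows
    assert_eq!(expr(u32::MAX, 0), None); // adding 10 overflows
    // Note the ordering point from above: computed as `x + 10 - y`,
    // the inputs x = 3, y = 7 would succeed, but values of x near
    // u32::MAX would fail earlier.
}
```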