Why use usize for indexing (and lengths)?


#1

I have thought about the old discussions about polymorphic indexing recently, and I have come to wonder why indexing works via usize (also the length of slices and such). The reasons for my curiosity are that a) the specification states that “The theoretical upper bound on object and array size is the maximum isize value.”, while USIZE_MAX is double that, and b) it might introduce subtle bugs when the usize is finally cast to an isize for pointer offsetting. (The standard library’s RawVec guards against this and stays in conformance with the spec by checking that the length never exceeds ISIZE_MAX, but it would seem like an easy way to unintentionally introduce memory corruption bugs in third-party low-level crates.) Why not just use isize to avoid these problems and make the specification’s restrictions explicit in the type system? (Or use some kind of restricted usize that can not go above ISIZE_MAX?) I have been looking for discussions on this topic, but have found none that address this.


#2

The ISIZE_MAX restriction is present due to limitations in the current implementation of the compiler’s backend. In contrast, a negative slice index is always invalid.


#3

Then why not a sort of restricted usize? It would solve these problems all the same.


To me, that does not seem to be that strong an argument: For most indexing operations, not all values in the unsigned range are valid either, but this does not preclude arguments of unsigned type: The index function cannot avoid taking values for which a mapping does not exist in either case. However, while I would be curious to talk about this, I would not want to drag you into a debate: If you do not place a counter-argument, I will not hold it against you. In any case, thanks for your reply! :slight_smile:


#4
  1. The language is stable. We are not going to be changing the type used for indexing any time soon.
  2. Again, we don’t want to pin language semantics to a limitation in the compiler.
  3. While LLVM can support strangely sized integer types like u31 and u63, I am not aware of any language that actually exposes them.
  4. What is the actual problem being solved here? Allowing code like RawVec to avoid the size check? There’s a trade off here between providing some potential convenience to the very small number of people doing that kind of work and introducing another almost-usize-but-not-quite integer type for the much larger number of people using things like Vec.

“Most” is a dangerous road to go down! For example, Java uses 32 bit signed integers for indexing. This probably made sense in 1995 when Java was first designed (2GB arrays should be enough for everybody!), but that is no longer the case today, and size limitations on things like byte[] and MappedByteBuffer are serious problems in some contexts.


#5

I can understand that. Well, too bad I did not come upon this earlier :slight_smile:
Thinking about it, it also seems to me that every argument I could give for a (by default) restricted usize could just as well go in favor of simply using an isize, where the latter would still leave the normal usize option open for the rare cases (I assume they are rare, at least) where the second half of the range is required. However, I think I can still say something about:

I would have suggested making the restricted usize the default (see above). Normal users would not be bothered by it, while low-level programmers could profit from this, and it would tie this piece of library and application code closer together. I think that the latter probably boils down to a question of taste, and I cannot (by going on only one example) really estimate the validity of the former, so this is not a strong argument, but that was my reasoning.


If I see things correctly, this bound would only apply for 32-bit systems (if you were to use isize in the Rust case, not Java)… In that case, I do not think this restriction weighs too heavy: you have only 4GiB to work with in total, and IIRC normal Windows reserves half of that for the kernel, leaving only 2GiB for the application anyway (though it can be configured to allow up to 3: https://msdn.microsoft.com/en-us/library/windows/desktop/aa366912(v=vs.85).aspx). In the worst case, you would have to split up your data… I admit that again I can probably not give a very well founded opinion here, so I welcome corrections! (And thanks for the interesting point about Java’s indexing, which seems good to keep in mind.)

For 64 bit, on the other hand, I am pretty sure an isize would suffice.


#6

After more careful consideration, I have reached the conclusion that, although not very different, usize is indeed the better type for indexing with both current and potential future backends of rustc. To close this discussion, I have reproduced my thoughts below, and I hope that if someone else should ever wonder about this, I hope they can save themselves some time thinking about it by reading the following arguments. On this note, I would also like to thank @sfackler again for taking the time to engage with my arguments :slight_smile:


In the current framework, usize and isize are distributed symmetrically around the range of possible indices: isize extends below, while usize continues above. In this context, the current in-bounds check for indices seems actually quite elegant, because instead of requiring an extra comparison to check whether the index is in the right range (>= 0 for isize, < ISIZE_MAX for usize), it is simply contained in the index < self.len() comparison, due to the manual restriction on Slice/RawVec's length. This essentially turns the length into the restricted usize I suggested above. This method, although it seems more fragile to me, might also have higher runtime performance, because it requires only one instead of two comparisons in fn index. Additionally, while I still think that this might lead to subtle memory corruption bugs in low-level third-party code that returns slices (using the unsafe fn std::slice::from_raw_parts, it is currently possible to create slices with lengths above ISIZE_MAX), this case does seem very improbable to me, and most of the relevant code can probably be implemented with the secured RawVec.

In possible future implementations, where pointer offsetting may not be limited to isizes, usize is the clear choice for indices, covering the whole possible range, and without the potential for those memory corruption bugs.

I still think, however, that as a matter of convenience, offering the option to index with isize or other signed integers would be a very good thing. That topic, though, does not belong into this discussion, but into the one about polymorphic indexing.