"as" considered harmful?

Sorry yes. I can appreciate the need for alignment of objects. Not sure where I pulled that 128 I mentioned above from.

There's also a slight risk of introducing overflow errors that way. Say a function takes u16 X and Y coordinates and uses them to index into a grid. Right now you're forced to think about casting the types to array indices, and you're likely to do that correctly, before multiplying by the row/column strides. If the final type could just be u16 you could forget that and do all the arithmetic as u16, easily overrunning the range of the type.

(Slightly worried about this after having seen this error often in C, though nothing in Rust could possibly beat the platform-dependent danger of using C's char type as an index.)
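To make the risk concrete, here's a minimal sketch; the function and names are mine, not from the thread. Widening to usize before the stride multiplication is what the current rules force you to do, and what a hypothetical u16 index would let you skip:

    // Hypothetical example: indexing a grid with u16 coordinates.
    fn cell(grid: &[u8], width: u16, x: u16, y: u16) -> u8 {
        // Correct: widen to usize *before* multiplying by the row stride.
        grid[usize::from(y) * usize::from(width) + usize::from(x)]
        // If indexing accepted u16 directly, `y * width + x` computed in u16
        // would overflow for e.g. y = 300, width = 300 (90_000 > u16::MAX):
        // a panic in debug builds, a silent wrap in release builds.
    }

    fn main() {
        let grid = vec![0u8; 300 * 300];
        assert_eq!(cell(&grid, 300, 10, 20), 0);
    }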

I realize I'm a bit late to this party, but I didn't see something I consider really the logical reason for indexing an array with usize.

  1. usize is not a pointer. It is an integer. Its only relationship with pointers is that it has the same number of bits as a pointer on the architecture.
  2. If my array is smaller than 128 elements, i8 can be used to index it safely. If it is smaller than 256 elements, u8 can index into it safely. Etc.
  3. The theoretical maximum array length on your machine is reached when you fill all addressable RAM with an array of byte-sized, byte-aligned elements. If your machine is 32-bit you can hold 2^32 - 1 such elements, and the range of valid indices, 0..2^32 - 1, is conveniently also covered by usize on a 32-bit machine. Likewise for any other pointer width.
  4. Thus, usize is guaranteed to always work for any array.
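A minimal sketch of point 2 above, with illustrative values: any u8 is a valid index into a 256-element array once widened to usize.

    fn main() {
        let table = [0u32; 256];
        let i: u8 = 255;
        // Widening with usize::from is lossless, so this can never be out of
        // bounds for a 256-element array.
        assert_eq!(table[usize::from(i)], 0);
    }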
2 Likes

I think that was all mentioned or at least hinted at above. Probably lost in the noise.

My thoughts were that one could index arrays with other than usize and no "as" and the like if:

a) Whatever integer one used as an index was silently converted to a usize. But that smacks of the chaos one can get into in C with silent coercions.

b) Arrays were defined with two types. One being the type of the elements and the other being the type used for the length and indexing. But that just sounds horrible!

All in all, I'm sold on what we have now.

Arrays aside, it still worries me in general that "as" can throw away bits with no warning or run-time overflow error. I consider "as" harmful, or at least worthy of extra attention, like "unsafe".

1 Like

I see it as (hehe...) follows: it is never unsafe to throw away the bits. Sometimes it is even expected (e.g. a floating point value cast to an integer). The behaviour when casting lossily is well defined. Furthermore, if a bad cast would lead to a memory-unsoundness bug, that will be caught by the compiler somewhere else. Thus I'm quite happy to let as be safe and lossy. unsafe isn't the right keyword to use here, IMO; the semantics are different. as already tells me something needs to be checked carefully, but it cannot introduce memory unsafety, so unsafe is the wrong keyword. Well, that's my 10c anyway...
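A short sketch of that well-defined (if lossy) behaviour, with illustrative values:

    fn main() {
        let x: i32 = 300;
        assert_eq!(x as u8, 44);          // integer casts truncate to the low bits

        let big: f64 = 1e100;
        assert_eq!(big as u32, u32::MAX); // float-to-int casts saturate (since Rust 1.45)

        let neg: f64 = -3.7;
        assert_eq!(neg as u32, 0);        // negative floats saturate to zero
    }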

1 Like

No, no, I was not suggesting that use of "as" be wrapped in "unsafe". I'm happy for "unsafe" to only be about memory safety. As you say, the semantics are different.

Sounds like we agree then.

1 Like

Not quite correct: you can theoretically have an array of any nonnegative integer length iff the stride of the array is 0 (i.e. the memory size of the array is 0 no matter the number of elements). It's just that any reasonable array is going to fit into memory even with a nonzero stride.
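A small sketch of the zero-stride case: an array of zero-sized elements occupies no memory, so its length isn't limited by the address space.

    fn main() {
        let huge: [(); usize::MAX] = [(); usize::MAX];
        assert_eq!(huge.len(), usize::MAX);
        assert_eq!(std::mem::size_of_val(&huge), 0); // zero stride, zero total size
    }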

Rust is all about saving the programmer's ass. I hope I can help you understand how the compiler is saving your ass here in multiple ways.

Firstly, indexing in Rust is pretty rare - whenever I see code like your example, I always try to find a way to turn it into an iterator. Code using iterators can be faster than indexing, since the bounds checks can often be elided. The example above already showed a way to do it.

Secondly, you should probably understand the concept of what I call "type bubbling". Whenever a function has specific requirements encoded in a type, the caller must meet those requirements or pass them up to its own caller. While encoding an array's requirements perfectly in the type system would be extremely difficult, usize provides at least two sanity checks:

  1. The index must not be negative. A negative index makes zero logical sense. If this requirement wasn't encoded in the type, the indexing operator would have to check the lower bound too - roughly twice the cost of bounds checking! Indexing in a tight loop would become even more problematic.

So by type bubbling you've discovered that the input to your function must not be negative. You can either check it at the beginning, as other people suggested, or pass the requirement up. Which brings us to another observation: does it make logical sense for your fmax variable to be negative? Since it represents a maximum count, it doesn't seem so; negative counts are as illogical as mixing up strings and numbers. So your fmax type should clearly be unsigned - in Rust, its name must start with u (unless you're writing your own newtype wrapper).

  2. While Rust still needs bounds checks, it's useful to do basic sanity checks beforehand. For instance, on a 32-bit platform it's impossible to address more than 2^32 - 1 bytes, so any index bigger than that is inaccessible. It's a waste of time for the program to continue if someone enters 6 billion into fmax on a 32-bit architecture, because it's known beforehand that it will panic. Rust pushes you to move error checking earlier, and that is a good thing. Imagine if you were relying on bounds checks alone: the user enters 6 billion on an RPi 3 and, after a day of computation, finds that the program crashed, wasting time and resources. In contrast, if the user is immediately told that he can't use a number that big, there's no waste.

So since your algorithm can't possibly work on numbers that big, you should bubble the type up, making fmax a usize. Further, if you want your program to be really great, you should put an if fmax > p.len() check at the top of the function. This can also allow the compiler to eliminate further checks, making it faster!
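A minimal sketch of that suggestion, assuming p: &[u64] and fmax: usize (the names come from the post above; the body is purely illustrative):

    fn work(p: &[u64], fmax: usize) -> u64 {
        // Reject impossible inputs once, up front, instead of relying on the
        // per-access bounds checks inside the loop.
        assert!(fmax <= p.len(), "fmax must not exceed the table length");
        let mut sum = 0;
        for i in 0..fmax {
            // Thanks to the assert above, the optimizer can often elide this check.
            sum += p[i];
        }
        sum
    }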

But now you might say that whoever requested the program wanted it to work on numbers beyond the memory limit. That saves your ass too, because you now realize your program is doing no special work to make that possible. You might have forgotten to use files or some other trick to make it work for large numbers. This way you're reminded that you have a silent bug in your program - it doesn't do what's specified.

Conclusion: Rust just saved your ass/helped you write better code by reminding you of two checks you forgot about. as, as used in your example, is harmful, but the solution isn't to add another bounds check to every single indexing operator; it's to add a bounds check before your computation begins. These checks can (and should) be manifested in types (at least the lower bound).

6 Likes

I think as is fine for casts, but it should not be used for number conversions. As suggested before, it is preferable to use From and TryFrom instead. For some number conversions (e.g. integer to float) this does not work though, so you either have to use as, create your own conversion methods, or use a crate. For the latter I have used conv in the past, but it does have some pitfalls and seems to be no longer maintained. Nowadays, num_traits::cast::NumCast might be a better choice.

Edit: However, num_traits does not provide an option for converting a float to an integer by rounding. In this sense, it is not possible to replace conv with num_traits.
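For the plain From/TryFrom route, here's a small sketch of what the checked conversions look like (values illustrative):

    use std::convert::TryFrom;

    fn main() {
        let small: u16 = 40_000;

        let wide: u32 = u32::from(small); // infallible widening via From
        assert_eq!(wide, 40_000);

        let narrow = u8::try_from(small); // fallible narrowing via TryFrom
        assert!(narrow.is_err());         // 40_000 does not fit in a u8
    }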

That is exactly why I am here.

After decades of writing software in all kinds of languages (assembler, PL/M, Coral, Pascal, Ada, C, C++), I know that I need all the ass saving I can get from a programming language.

I have only known one "ass saving" language before: Lucol. In Lucol, not only were you memory safe, you were time safe: the compiler reported the maximum run time of your modules. That's important when you don't want to overrun a 100 ms time budget in an avionics control loop. Of course nobody has heard of Lucol nowadays.

Always eager to learn new tricks, I have been thinking about that iterator example above for almost a week. It's a lot more opaque than any loop a normal programmer would recognize, what with all that .iter.zip.zip stuff.

With the promise of "faster", today I knuckled down, figured it out and made it work. Sure enough it saves 1 second off a 4 minute execution time. As crudely as I time it.

Sadly it also introduces an issue - it only runs in release builds:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.05s
     Running `target/debug/tatami_rust`

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

As a result I'm not sold on that "idiomatic Rust" way just yet.

The code is here: GitHub - ZiCog/tatami-rust: A solution to the Project Euler problem number 256 : Tatami-Free Rooms. if anyone would like to help "Rustify" it.

1 Like

A bit tangential, but remember that iterators might very well be slower than a loop in debug mode. Can't find the quote right now.

That looks like a compiler bug; I see no reason why it should cause a stack overflow. Zip isn't recursive.

Yeah, unfortunately, combinators are not very well known from other languages (I think Python has them too, though). I avoided them for a while (a few months, I think) too, but eventually learned to love them.

Anyway, sorry for not being very clear that the speed difference might not be huge and it's probably actually slower in debug mode.

I recall seeing in another thread (which I can't find right now) that debug mode has a smaller default stack size. It's possible to set a larger value, but I can't find that either.
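In case it helps, one option that's always available is to run the work on a thread whose stack size you set explicitly. A sketch (the 32 MiB figure is arbitrary):

    use std::thread;

    fn main() {
        let handle = thread::Builder::new()
            .stack_size(32 * 1024 * 1024)
            .spawn(|| {
                // call the deep or stack-hungry computation here
            })
            .expect("failed to spawn worker thread");
        handle.join().expect("worker thread panicked");
    }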

I think the problem here is that I am what they affectionately call a "boomer" nowadays.

When I learned to program we had variables, arrays, assignments, conditionals, loops, and functions. It was all the craze. It was called "structured programming".

Basically all of computer science is sequence, selection and iteration, starting from the first known algorithm over two thousand years ago: Euclid's GCD.

Now, I'm open to suggestions as to how to express sequence, selection and iteration in different ways. Perhaps with a view to making them less prone to programmer error. Or enable better compiler optimizations. And so on.

So far I'm not convinced. An array and a loop and indices are clear enough. The bounds checking of Rust will tell me if I get it wrong. The "iter.zip.zip.whatever" style is obfuscation to me. In my, admittedly limited, experiments so far it has no space or time performance benefit either.

The code is in the github link above if anyone would like to show otherwise.

I have no idea of course. I would not expect that ".iter.zip.zip" thing to be recursive.

I might naively expect that it builds a new array on the stack that is an array of structs made of the corresponding input elements. Then it builds another new array whose elements are a struct of those structs plus the next array's elements.

After all that is how the source reads.

The whole algorithm it is embedded in is recursive. So boom! It dies.

I guess the optimized release build sorts that out.

Is this a bug I should report somewhere?

1 Like
thread 'main' has overflowed its stack

I think this is because the debug build inlines PR into every function that uses it, which overflows the stack quickly. The release build realizes that there only needs to be one global. You can fix this by making PR static instead of const; there is only ever one copy of a static.

Constants and immutable static variables might seem similar, but a subtle difference is that values in a static variable have a fixed address in memory. Using the value will always access the same data. Constants, on the other hand, are allowed to duplicate their data whenever they’re used.

(from the book)

edit: This might still be a compiler bug, but I'm pretty sure that's the reason: when building on my local machine, the stack overflow happens when work() tries to allocate a 640344-byte stack frame; the 40k i64 primes take up 320k bytes, and PR is referenced twice in work(). Also, changing to static definitely fixes it.
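A stand-in sketch of the one-line fix (the real PR table comes from the linked repository; this one holds dummy values):

    // const: the value has no fixed address and may be duplicated at each use,
    // which in an unoptimized build can mean a full copy in the current stack frame.
    const PR_AS_CONST: [i64; 40_000] = [0; 40_000];

    // static: one object with a fixed address; every use refers to that single copy.
    static PR_AS_STATIC: [i64; 40_000] = [0; 40_000];

    fn work() -> i64 {
        // Two uses of the const table: in a debug build this is the kind of thing
        // that can blow past the default stack size.
        PR_AS_CONST[0] + PR_AS_CONST[1] + PR_AS_STATIC[0] + PR_AS_STATIC[1]
    }

    fn main() {
        println!("{}", work());
    }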


By the way:

b) Arrays were defined with two types. One being the type of the elements and the other being the type used for the length and indexing. But that just sounds horrible!

This is somewhat true: the Index<T> trait is implemented for an indexable type and says that it can be indexed by T. But there can be more than one such implementation, which is why you can index a slice by a usize or by a Range.
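As a sketch of that flexibility, here's a hypothetical grid type indexed by a (row, col) pair via its own Index implementation (the type and its layout are illustrative, not from the thread):

    use std::ops::Index;

    struct Grid {
        width: usize,
        cells: Vec<f64>,
    }

    impl Index<(usize, usize)> for Grid {
        type Output = f64;
        fn index(&self, (row, col): (usize, usize)) -> &f64 {
            assert!(col < self.width, "column out of range");
            &self.cells[row * self.width + col]
        }
    }

    fn main() {
        let g = Grid { width: 2, cells: vec![1.0, 2.0, 3.0, 4.0] };
        assert_eq!(g[(1, 0)], 3.0);
    }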

2 Likes

When I learned to program, we had loops and go to; structured programming came later. I had never run into iterators until I started with Rust, so I used loops everywhere. After seeing some code written in functional style, I went back and tried it. The main benefit to me is that I don't need match when I iterate over Option or Result. Most of the time I now find the functional style more readable, but I use a comment to explain why for the few cases where it isn't.

I agree that reading some of the samples people provide (iter.zip.zip.whatever) is hard to understand as one big chunk. My strategy is to look at the types produced one step at a time. That being said, I sometimes find myself writing the loop version first, a habit I hope to break one day.
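For what it's worth, here's a sketch of that one-step-at-a-time reading applied to a zip chain (the data is made up):

    fn main() {
        let xs = [1u32, 2, 3];
        let ys = [10u32, 20, 30];
        let zs = [100u32, 200, 300];

        let sum: u32 = xs.iter()          // Iterator<Item = &u32>
            .zip(ys.iter())               // Iterator<Item = (&u32, &u32)>
            .zip(zs.iter())               // Iterator<Item = ((&u32, &u32), &u32)>
            .map(|((x, y), z)| x + y + z) // Iterator<Item = u32>
            .sum();

        assert_eq!(sum, 666);
    }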

6 Likes

I guess this is a pretty unfortunate example for explaining the power of iterators. Once you try out things like filter, find, etc., especially with something that isn't an array, you might get it. I think of them as Unix pipelines.

But as I said, one check at the top while also using usize should be a pretty good alternative.

1 Like

Yay, it works!

Changing that 'const' to 'static' got the debug build working.

Turns out that the loop style is significantly slower than the functional style in a debug build:

Loop style : 155m 11s
Functional style : 143m 28s

I had a similar discussion here a few months back when creating my first ever Rust program. In that case I ended up preferring the functional style.

2 Likes

Yeah, exactly. For any array A, if I were to retrieve element x, I'd go to the address of A (a pointer), offset it by x*M bytes (M being the memory allocated to each element), and read the element there. That's exactly what usize is for: pointer-sized arithmetic. And the reason usize exists is that pointer sizes vary across different processor architectures.
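A tiny sketch of that arithmetic: ptr::add takes a usize offset in elements, which is exactly the pointer-sized arithmetic involved (the data is illustrative).

    fn main() {
        let a = [10u32, 20, 30, 40];
        let x: usize = 2;
        // a[x] is, conceptually, a read of size_of::<u32>() bytes at
        // (base address of a) + x * size_of::<u32>():
        let via_ptr = unsafe { *a.as_ptr().add(x) };
        assert_eq!(via_ptr, a[x]);
    }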

Here's your answer to why usize and isize exist. As you have probably noticed, isize and usize do not have a fixed size, because pointer sizes vary across platforms.

So why can't we index simply using u32?

Safety. It's all about safety. Because the size of usize/isize varies with the architecture, it can represent every memory address on the architecture the program is running on. Hence the maximum possible length of a vector on a given architecture is the maximum value that can be contained in an isize. (Note: isize::MAX == usize::MAX / 2.)

Hence using usize for indices prevents you from creating and using vectors larger than the address space of that specific architecture.
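A quick sanity check of that note (a minimal sketch):

    fn main() {
        // Holds on every platform: isize::MAX == usize::MAX / 2 (integer division).
        assert_eq!(isize::MAX as usize, usize::MAX / 2);
    }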

PS: I do realize I'm a bit late here