Does Rust really need higher-kinded types?

Well, that's a good thing :slight_smile: In that case you can find my implementation (so far) of the algorithms from Elements of Programming here: https://github.com/keean/elements_in_rust If you can improve on the performance of any of them I would be very interested to know about it (and so, I would think, would Alexander Stepanov).

My advice would be to look at a lot of different algorithms before you start trying to abstract patterns from them.

A lot of this may be true, but it's getting away from my point. Or, I suppose, it's supporting my point. The point was that I think there should be more agreed-upon traits for various types to implement, but that effort is mysteriously blocked on HKT.

Using these common traits as a protocol, we can actually use the genericity that people have worked so hard to develop. For example, I have made a shonky Generalised Search Tree. The GeoRust team has made rust-geo. If the rust-geo package implements std::traits::Contains on its various polygon types, as well as various intersection and union functions, then the GST could store those types with little effort.
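For example (a hypothetical sketch of mine: std::traits::Contains does not exist today, and all names here are illustrative), the shared protocol might look like:

// Hypothetical sketch of a shared trait that both the GST and rust-geo's
// polygon types could agree on. Nothing here is an actual std item.
trait Contains<Rhs = Self> {
    fn contains(&self, other: &Rhs) -> bool;
}

struct Point { x: f64, y: f64 }
struct Polygon { exterior: Vec<Point> }

impl Contains<Point> for Polygon {
    fn contains(&self, _p: &Point) -> bool {
        // rust-geo would supply its real point-in-polygon test here.
        unimplemented!()
    }
}

A generalised search tree written against Contains (plus intersection and union traits) could then store any conforming type.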

If we don't have these agreed-upon traits (in an agreed-upon place, i.e. the std namespace), then we end up with a Tower of Babel of types, where rust-geo implements its own intersections and unions and a bridge type is needed to convert them to GST intersections and unions. It's more of a social problem than a technical one.

The starting point should be algorithms, not collections. So what might a good algorithm be to look at? Perhaps determining whether a point is inside or outside an arbitrary polygon. We might start with an analysis like this: Point in Polygon Strategies

A point, for the purposes of generality, can probably be just a tuple. Looking at these algorithms, some of them access the points in the polygon sequentially, so these should be implemented using a single-pass forward iterator. Let's focus on those, maybe the cross-product-sign algorithm. This algorithm can then be implemented independently of the collection type used.
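As a rough sketch (mine, and assuming a convex polygon, since the plain cross-product-sign test is only valid there), the algorithm needs nothing more than a single-pass iterator of vertex tuples:

use std::iter::once;

// Point-in-convex-polygon by checking that the cross product of each edge
// with the vector to the point never changes sign. One pass, no indexing.
fn polygon_contains_point<I>(mut vertices: I, p: (f64, f64)) -> bool
where
    I: Iterator<Item = (f64, f64)>,
{
    let first = match vertices.next() {
        Some(v) => v,
        None => return false,
    };
    let mut prev = first;
    let mut sign = 0.0;
    // chain(once(first)) closes the polygon back to its starting vertex.
    for cur in vertices.chain(once(first)) {
        let cross = (cur.0 - prev.0) * (p.1 - prev.1)
                  - (cur.1 - prev.1) * (p.0 - prev.0);
        if cross != 0.0 {
            if sign != 0.0 && cross.signum() != sign {
                return false; // sign flipped: p is outside
            }
            sign = cross.signum();
        }
        prev = cur;
    }
    true
}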

Anyway, my point is that the categorisation should start from the algorithms, not from the collections. polygon_contains_point is a very different algorithm from vector_contains_value, and sits another layer of abstraction higher, as it involves knowing what the values in a container represent, as opposed to a simple search for a value.

Stepanov puts it much better than I can:

It is as if mathematicians would start with axioms. You do not start with axioms - you start with proofs. Only when you have found a bunch of related proofs, can you come up with axioms. You end with axioms. The same thing is true in programming: you have to start with interesting algorithms. Only when you understand them well, can you come up with an interface that will let them work.

I was rushing out the door on the prior reply, so I wasn't able to fully explain. I will do so now.

My upthread post linked three times to the pertinent post in the other thread. In that post, I pointed out that Oleg had pointed out that iterators (afaics he refers to them as 'cursors') are unable to model some general invariants, for example when the collection is a remote database. Additionally, Oleg points out that iterators can't encapsulate the resource lifetimes that the iterator allocates. And although one could argue that Rust's lifetimes can be applied, I view them as a PITA that will make libraries so complex that, just as with C++, it may take a programmer two years to become proficient, delaying the diffusion of adoption well beyond a decade, compared to managed/functional languages that require only 3 to 6 months.

However, there is something that inverting the control with iterators can do which an enumerable cannot:

There may be other ways to solve that case without an iterator.

The general abstraction we are trying to model is two distinct algorithms: 1) a data resource; and 2) the consumer of the former. But no single interface design choice joins these two algorithms optimally in all cases.

So while the iterator may not strictly be an anti-pattern in all cases, it is not the be-all generalized interface for joining two distinct algorithms. Either Alexander Stepanov overstated his claims of full generality, or you are misinterpreting his claims.

Stepanov says to start with the algorithms. If you look at the algorithms implemented here they are categorised by their access patterns (a.k.a. iterators).

Each algorithm is optimal for the purpose for which it is intended. Notice how many algorithms share the same access pattern. We abstract this access pattern and then require that collections that can provide it efficiently do so, and instantly all those collections can use those algorithms.
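To make that concrete (a toy example of mine, not one from the EoP port): write the algorithm against the access pattern alone, and every collection that can supply that pattern gets the algorithm for free:

// An algorithm written against the single-pass access pattern (std's
// Iterator). Vec, slices, LinkedList, BTreeSet, ... can all use it.
fn all_equal<I>(mut it: I) -> bool
where
    I: Iterator,
    I::Item: PartialEq,
{
    match it.next() {
        None => true, // an empty sequence is trivially all-equal
        Some(first) => it.all(|x| x == first),
    }
}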

Could any of those algorithms be implemented more efficiently with enumerators? Would the code be easier to read and understand, would it allow more general algorithms, would the machine code output be any better?

Oleg argued there are cases that can be more efficient with enumerables.

And again it seems you are ignoring the lack of encapsulation of the resource lifetimes that iterators carry with them?

Edit: it appears you were addressing the resource lifetime issue in the other thread:

Iterators do not own the data, they only borrow it, so they don't have any concern with lifetimes. They should not encapsulate lifetime (ownership); that is the responsibility of the collection.

I was perhaps misunderstanding what you mean by an iterator. I thought the iterator could leak mutable references to the items enumerated, and thus I assumed an iterator can't borrow.

But I still don't think you are correct in any case, because the iterator can't entirely own its own data structure: if the data resource is mutated through another interface, then the iterator must be updated by the data resource.

I can't see how you can have a cursor into the data resource without allocating a resource that the iterator can't completely own. We also need to remember to deallocate the iterator. There appear to be resource lifetime issues.

Try/catch is a control-flow feature and it is not waiting on HKTs. There were people who were against it on the grounds that they just wanted to have Monad instead.

Traits for collections want HKTs of the same sort as StreamingIterator, so as to define their iter(&'a self) -> Self::Iter<'a>, iter_mut(&'a mut self) -> Self::IterMut<'a>, and drain(&'a mut self) -> Self::Drain<'a> methods.
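Spelled out (a sketch of mine using lifetime-parameterized associated types, the feature being asked for; the trait and method names are illustrative, not std's):

// The 'a parameter on each associated type is the HKT-shaped feature;
// Rust later stabilized this as generic associated types (GATs).
trait Collection {
    type Item;
    type Iter<'a>: Iterator<Item = &'a Self::Item> where Self: 'a;
    type IterMut<'a>: Iterator<Item = &'a mut Self::Item> where Self: 'a;
    type Drain<'a>: Iterator<Item = Self::Item> where Self: 'a;

    fn iter<'a>(&'a self) -> Self::Iter<'a>;
    fn iter_mut<'a>(&'a mut self) -> Self::IterMut<'a>;
    fn drain<'a>(&'a mut self) -> Self::Drain<'a>;
}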

Here is the definition of the Iterator trait in Rust, just including the one method that must be implemented (the other methods have default implementations, though they can be overridden if there's a more efficient way to do them; I've elided many irrelevant details below):

pub trait Iterator {
    type Item;
    fn next(&mut self) -> Option<Self::Item>;
}

Item can be any type; in particular, for many collections, you can choose to get an iterator over owned items, immutably borrowed items, or mutably borrowed items. Here's the IntoIterator implementation for Vec that yields owned items:

impl<T> IntoIterator for Vec<T> {
    type Item = T;
    type IntoIter = IntoIter<T>;
    fn into_iter(mut self) -> IntoIter<T> { ... }
}

impl<T> Iterator for IntoIter<T> {
    type Item = T;
    fn next(&mut self) -> Option<T> { ... }
}

There are also the iter and iter_mut methods on slices (which you can call on a Vec because Vec derefs to a slice), which give you iterators over immutable or mutable references:

fn iter(&self) -> Iter<T> { ... }
fn iter_mut(&mut self) -> IterMut<T> { ... }

Where those types implement the Iterator trait like so:

impl<'a, T> Iterator for Iter<'a, T> {
    type Item = &'a T;
    fn next(&mut self) -> Option<&'a T> { ... }
}
impl<'a, T> Iterator for IterMut<'a, T> {
    type Item = &'a mut T;
    fn next(&mut self) -> Option<&'a mut T> { ... }
}

So, if by "leak a mutable reference", you mean that you can get mutable references during iteration, and save them somewhere else, then yes, you can do that:

fn main() {
    let mut v = vec![1, 2, 3];
    let mut n: Option<&mut i32> = None;

    // Drive the iterator for its side effect: each step stashes the yielded
    // mutable reference in `n`, so `n` ends up holding the last item.
    v.iter_mut().map(|x| n = Some(x)).count();

    let x = n.unwrap();
    *x += 1;
    println!("{}", x); // prints 4
}

So as you can see, you can indeed borrow in an iterator.

Actually, Rust used to have what Oleg calls enumerables as its primary form of iteration. In the Rust lexicon, the terminology has generally been "internal" vs. "external" iteration: "internal" iteration is what Oleg refers to as enumerables, and "external" is what he refers to as cursors. (In Rust there is also another concept called Cursors, discussed but never implemented, which is a somewhat more flexible iterator that can go forwards and backwards, but can only act by reference rather than being able to consume the collection like iterators can.)

You should probably read this thread about the reasons for the switch from internal to external iteration, and in particular this message, which explains the reasoning.

It turns out that in Rust, external iteration is a lot easier to make work with the borrow checker, works better with code that does things like early return from loops, and is more efficient and easier to optimize.
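For instance (my sketch, not code from the linked thread): with external iteration, early exit is ordinary control flow, while a callback-driven (internal) design must thread a stop signal through the callback:

// External: early return is just a return.
fn first_even(v: &[i32]) -> Option<i32> {
    for &x in v { // sugar over repeated Iterator::next calls
        if x % 2 == 0 {
            return Some(x);
        }
    }
    None
}

// Internal: a hypothetical `each` drives the callback, which must return
// a flag to say "stop early".
fn each<T>(v: &[T], mut f: impl FnMut(&T) -> bool) {
    for x in v {
        if !f(x) {
            return;
        }
    }
}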

The downsides are discussed there too, in particular the greater difficulty of implementing external iteration, as you have to write a struct by hand that encapsulates the iteration state. Python- or C#-style generators with a yield keyword would make this easier; the idea has been tossed around, and some people have experimented with compiler plugins for doing so.


One of the good things about Rust is the ability to compile without the standard library, as it enables people to experiment with other styles while keeping the base language. In Stepanov's "Elements of Programming" he goes through what are properly called coordinate structures, which got renamed "iterators" by the C++ standard library people, and that name kind of stuck. He starts with the single-pass forward iterator (Iterator), moving on to multi-pass forward iterators (ForwardIterator), BidirectionalIterator, IndexedIterator, RandomIterator, and then more complex ones like BifurcateCoordinate (the iterator for binary trees) and multi-dimensional iterators. One of the interesting things Stepanov has done is separate the dereference traits (Readable, Writable, Mutable) from the iterator itself.

You can see how far I have got porting this to Rust here. My plan is to first translate the code as it is, and then look at how it might be improved to take advantage of Rust's specific features. I am interested in general comments and style tips etc. I haven't needed HKT yet.
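In rough Rust terms, that hierarchy might be sketched like this (illustrative names only; the repository linked above is the authority for the actual port):

// The dereference capability is its own trait, separate from movement.
trait Readable {
    type Value;
    fn source(&self) -> &Self::Value;
}

// Single-pass forward movement (EoP's 'Iterator').
trait SinglePass: Readable {
    fn successor(self) -> Self;
}

// Multi-pass forward: Clone lets saved positions be revisited.
trait Forward: SinglePass + Clone {}

trait Bidirectional: Forward {
    fn predecessor(self) -> Self;
}

trait Indexed: Forward {
    fn advance(self, n: usize) -> Self;
}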

Robert Harper seems to think Haskell's monad approach to exceptions is "unsafe". Reactions here and here. I haven't studied the issue yet.

So HKT for iterators is only necessary because of Rust's default resource lifetime tracking. I think @keean had also pointed that out.

I should unpack my mental model that I implied upthread:

  1. That when borrowing mutable references, iterators do not encapsulate resource lifetimes. An enumerable could also leak resource lifetimes if the callback function isn't pure (i.e. if it takes a mutable reference). But they are not equivalent categories, because the iterator can be used in many more varied imperative coding scenarios where it can leak. The compiler can't make a simple check on whether the callback function is pure, because with iterators there is no callback function (i.e. no constraint requiring functional programming), just imperative spaghetti.

  2. Afaik, your trait Iterator does not prevent the implementation from borrowing a mutable reference from the collection. There is nothing in the trait's signature that lets the consumer of the iterator verify at compile time whether the iterator is leaking resource lifetimes, other than assuming it employs Rust's resource lifetime checking. For example, if we have an iterator instance and then add, delete, or rearrange items in the collection, that iterator has to be updated by the collection, so the collection has to allocate some resource, such as a stored callback function.

Well, yeah: why should it be surprising that imperative spaghetti requires low-level tsuris for resource lifetime checking?

Edit: Afaics, the functional programming alternative to the imperative spaghetti of iterators, one that can still implement all the algorithms @keean refers to from Elements of Programming, is to have a separate trait for each such algorithm. While one might implement these traits with iterators, collections can also be optimized to implement some of them natively, with encapsulated resource lifetimes. So then @keean gets what he wants and I get what I want. Win-win. And we then discourage most users from dealing with the low-level tsuris of iterators and resource lifetimes.

Sounds like that requires multiple mutable aliases to the same location in memory, which of course, Rust prevents.

In general, there are a lot of differences in efficiency and ergonomics when comparing analogous interfaces between Rust and Scheme, given that the former has linear types, mutability, and (importantly) guaranteed eager invocation of destructors, but lacks garbage collection or even tail-call elimination. Certainly Kiselyov's proposed construction could not work without the last of those.

But it's not just that: some of the statements in that paper are either highly Scheme- and dynamic-typing-specific without mentioning it (despite the fact that the paper also brings up Haskell), or just inaccurate:

Another problem of cursors is telling the user that there are no more values to access. When a collection is exhausted, cursor operations typically return an "out-of-band" value such as EOF (cf. getc() in C) or raise an exception (cf. iterator.next() in Python). None of these approaches appear entirely satisfactory. Unlike cursors, an enumerator does not need to do anything special to tell the iteratee that the traversal is finished: the enumerator will simply return. There is no longer a problem with out-of-band values and user's disregarding such values.

This is completely solved by Option (aka Maybe), which Rust is hardly the first language to feature.
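In Rust the exhaustion signal is an ordinary, type-checked value:

fn main() {
    let v = [10, 20, 30];
    let mut it = v.iter();
    // Exhaustion is in-band: no EOF sentinel, no exception, and the type
    // system will not let the caller ignore the None case.
    while let Some(x) = it.next() {
        println!("{}", x);
    }
    assert_eq!(it.next(), None);
}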

Cursors are simply less efficient. A cursor must maintain the state of the traversal. That state may be invalid. Each operation on a cursor, therefore, must first verify that the cursor is in the valid state. The cost of the checks quickly adds up.

With encapsulation there is no need to check for invalid states.

This issue of the inefficiency of cursors is well known. For example, it is highly advisable to use "native" operations to move large sections of an array (ArrayCopy in Java) rather than copy element-by-element. The cost of an array bound check on each access to an element is one of the considerations.

This conflates two issues: the cost of bounds checks, and the advantage of using something that lowers to memcpy (typically implemented in assembly with vectorized code) over manual copying. While it says that bounds checks are only "one of the considerations", the latter has nothing to do with the choice of iteration mechanism, except insofar as a sufficiently powerful optimizer (like LLVM's) can automatically transform suitable manual copy loops to memcpy, which is possible for both styles.
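In Rust terms (my illustration of the two issues being conflated): the guaranteed bulk copy and the element-by-element loop are available regardless of iteration style, and the optimizer can often rewrite the loop as memcpy anyway:

// Guaranteed bulk copy: one length check, then effectively memcpy.
fn copy_bulk(dst: &mut [u8], src: &[u8]) {
    dst.copy_from_slice(src); // panics if the lengths differ
}

// Element-by-element with iterators: zip bounds the loop itself, so there
// is no per-element bounds check here either.
fn copy_loop(dst: &mut [u8], src: &[u8]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d = *s;
    }
}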

It is often said that the key to the best performance is to do more on the server. To find the maximum value of a database table column, it's far faster to evaluate

   select MAX(col) from collection

than to open a cursor on the collection and retrieve all values from 'col' searching for the maximum. Stored procedure languages (some of which are quite sophisticated) were introduced to make the server-side processing easier and powerful. We should stress that

   select MAX(col) from collection

is precisely equivalent to

   foldl max lower_bound collection

It is certainly not equivalent: if there is a suitable index on col, select MAX(col) from collection is a quick O(log n) tree traversal, if not O(1), rendering the comparison nonsensical. If there is no index, you end up with a complex comparison between a dynamically optimized (but typically interpreted rather than compiled) query engine within the SQL server, and an IPC mechanism (cursors) with relatively high latency, where in both cases the process will be I/O bound most of the time.
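For reference, the client-side fold Kiselyov writes as foldl max lower_bound collection looks like this in Rust (toy data of mine), and it is O(n) no matter what indexes exist:

fn main() {
    let collection = vec![3i64, 9, 4];
    let lower_bound = i64::MIN;
    let max = collection.iter().fold(lower_bound, |acc, &x| acc.max(x));
    assert_eq!(max, 9); // same answer as select MAX(col), different cost model
}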

This has nothing whatsoever to do with micro-optimization of an in-process, typically memory- or compute-bound iteration interface, possibly subject to a generic optimizer but not to algorithmic special-casing based on the callback function. The factors affecting performance are completely unrelated.

The only connection I can imagine is if Kiselyov is implying (very vaguely) that a hypothetical database system could, in fact, sort of merge the client language with stored procedure languages, and do things like introspect on the callback, identify common functions, even automatically send snippets of the user's code to the server to execute there. If so, that's a fairly special case, which could justify use of enumerators, but is unlikely to ever apply to statically compiled languages like Rust that tend to value explicitness over magic. I haven't heard of an actual system doing this in any language.


Remember, afaik Rust's lifetimes aren't a magic bullet and the programmer has to remember to enforce them. I am positing that the signature of the trait Iterator obfuscates whether lifetimes are being enforced by the implementation, given my claim that the collection has to maintain a mutable reference to the iterator; and when the iterator is a mutable reference into the collection, mutation of the collection can't occur through another code path. So either we have imperative spaghetti gridlock (if the lifetimes are strictly enforced) or we don't have resource checking. Either way it appears to be a mess that naive programmers are best not subjected to; it should be cordoned off for expert (e.g. library) programmers.

P.S. readers please note my edit at the bottom of my prior post pontificating on a "win-win" suggestion.

You may be misunderstanding that post: only unsafe code has to remember to enforce lifetimes. Safe code will not compile if it violates lifetime rules. (More precisely, references in unsafe code are subject to the same rules as in safe code, but there are other tools available to unsafe code such as raw pointers, transmute, etc. that can violate the rules.)

The collection doesn't have to maintain a reference to the iterator; usually it's the other way around.

Are you sure? Please re-read my reasoning. How can the collection be changed and not tell the iterator that some items were inserted before its current position without violating the invariants of an iterator?

Mutability and iteration is a can-of-worms.

I don't think lifetimes are a magic bullet and didn't imply they were. Nevertheless, Rust disallows aliasing mutable memory in safe code, so your scenario can't happen in Rust today, plain and simple.

In general, the collection can't be changed, as that requires an &mut reference to it, which cannot be taken while the iterator already has a (mutable or immutable) reference. This is statically verified by the compiler. A collection specifically designed to allow concurrent mutation could have a different API (based on RefCell etc.) that might have those complications, but typical Rust containers just forbid it.
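For example, this sketch of mine is rejected at compile time (it deliberately does not compile):

fn main() {
    let mut v = vec![1, 2, 3];
    let it = v.iter(); // the iterator immutably borrows `v`
    v.push(4);         // ERROR: cannot borrow `v` as mutable while `it`
                       // still holds its borrow
    for x in it {
        println!("{}", x);
    }
}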