Managing lifetime references is miserable. Easier to copy by value and avoid the whole thing?

I am working with this clustering code library: clustering/src/lib.rs at main · xgillard/clustering · GitHub

The basic principle is you run a function like:

let mut samples: Vec<Vec<f32>> = vec![]; //, ie [[x,y], [x,y], [x,y]]
let clustering = kmeans(k, &samples, max_iter); //clustering = Clustering<'_, Vec<f32>> 

And you get back:

/// This is the result of a kmeans clustering
pub struct Clustering<'a, T> {
    /// The set of elements that have been clustered (in that order)
    pub elements: &'a [T],
    /// The membership assignment
    /// membership[i] = y means that element[i] belongs to cluster y
    pub membership: Vec<usize>,
    /// The centroids of the clusters in this given clustering
    pub centroids: Vec<Centroid>,
}
pub struct Centroid(pub Vec<f64>);

This is very nice except for when you start trying to save the Clustering result object into another struct, which has a Mutex on it, and is wrapped as a Resource you are accessing.

The absolute "Rust torture" begins as you must now propagate all those lifetime <'a> mentions all over everything that now touches this things. And I don't know where it all stops or how to even look at all the visual mess and errors it is creating.

The misery of trying to manage this and all the endless errors does not seem frankly worth the hassle. I don't know how to figure it all out, and what for? All this obsession over managing a reference to an array of floats? Perhaps this is necessary in other situations but here I can just copy by value and be done with it. It will take no computational effort compared to the actual work the algorithm must perform (which is thousands to millions of times more complex than a simple array copy).

Then I can escape the ugly syntax and horrible bizarre <'a> Rust idiosyncracy that perhaps some might love but I would rather avoid.

My inclination is then just to edit the package and remove the & reference and <'a> (ie. pass the Vector by value or copy) so I don't have all this insane lifetime management to deal with for no good reason. Is that crazy or rational in your opinion?

It seems like lifetimes of references just becomes code pollution very quickly and the only reason you should ever do it is if you actually truly have to.

If you look at the code it seems the elements which are taken as a reference setting off this whole catastrophe are never mutated (because you would not want to) so besides the minimal cost of copying the data once at the start, there is no downside to just doing that. It shouldn't matter.

Am I understanding this correctly? I am asking this partly philosophically as someone who has no specific "passion" for any coding language on earth (including Rust) and just wants to get a job done without learning every niche intricacy of every language that has no bearing on my life. There is not enough time in the day.

I don't know how you guys can enjoy the concept of lifetime management like this and unless you are doing some incredibly performance intensive task or you need something shared around and mutated in many places it seems irrational.

This is me after a day of trying to get a basic K-Means clustering system running. :slightly_smiling_face:

In most cases, lifetime-carrying types (& or any struct that contains them) are expected to be short-lived and not stored long-term. In this case, the library author probably envisioned that you would want to immediately process the clustering result into your own domain-specific structs instead of keeping the result object directly.


There are certainly reasonable changes that can be made along these lines. Personally, I wouldn't feel comfortable making the assumption that copying and transferring ownership of the items is always the right call, but making it possible seems like a good idea.

One possible way to do this would be to define Clustering along these lines instead:

pub struct Clustering<TArr> {
    /// The set of elements that have been clustered (in that order)
    pub elements: TArr,
    /// The membership assignment
    /// membership[i] = y means that element[i] belongs to cluster y
    pub membership: Vec<usize>,
    /// The centroids of the clusters in this given clustering
    pub centroids: Vec<Centroid>,
}

pub fn kmeans<T: Elem + Sync, TArr: AsRef<[T]>>(k: usize, elems: TArr, iter: usize) -> Clustering<TArr> { ... }

That way, the existing use cases are handled by Clustering<&'a [T]>, but you can use Clustering<Vec<T>> or Clustering<Rc<[T]>> if you don't want code to get infected by the 'a lifetime.


An alternative approach, if you don't want to modify the library, would be to extract the membership and centroids fields from the Clustering struct as soon as you have it and store them directly.

2 Likes

Those are good ideas. Thanks. I started modifying the code in the mean time before I saw your reply and ran into an issue which I think you were anticipating. If I change it to

pub struct Clustering<T> {
    pub elements: [T],
    pub membership: Vec<usize>,
    pub centroids: Vec<Centroid>,
}

Then I suffer because the generic "T" system can't work anymore. Unlike a reference, [T] has varying sizes so it doesn't know what to do:

error[E0277]: the size for values of type `[T]` cannot be known at compilation time
  --> src/clustering_lib.rs:94:19
   |
94 |     pub elements: [T],
   |                   ^^^ doesn't have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `[T]`
   = note: only the last field of a struct may have a dynamically sized type
   = help: change the field's type to have a statically known size
help: borrowed types always have a statically known size
   |
94 |     pub elements: &[T],
   |                   +
help: the `Box` type always has a statically known size and allocates its contents in the heap
   |
94 |     pub elements: Box<[T]>,
   |                   ++++   +

So I would have to hardcode the type of array (which is fine for me, but just a point to make).

I am not sure what you meant by TArr ie. in my case Vec<Vec<f32>>.

I can't use that code as written. I get

error[E0412]: cannot find type `TArr` in this scope
  --> src/clustering_lib.rs:94:19
   |
94 |     pub elements: TArr,
   |                   ^^^^ not found in this scope

I would be interested for an easy fix to still keep these methods generic if one exists. Otherwise I can hard code the type as it is always Vec<Vec<f32>> for me.

But probably you are correct and the easiest method is to just remove the

    pub membership: Vec<usize>,
    pub centroids: Vec<Centroid>,

out into my parent struct and discard the Clustering object. I didn't think of that.

I changed the name of the type variable to TArr here, to emphasize that it's going to be a collection of items rather than a single one:

pub struct Clustering<TArr> {

And then the actual element type T only shows up as a type variable on the kmeans function (and likely others) alongside TArr.

Also, it's worth noting that I didn't actually implement anything; I meant that for illustrative purposes only, and there will be other code changes required to adapt the library for the new definition.

I do agree that lifetime management can be a pain. However the need for it is there for very good reason.

My approach is to minimise any use of lifetime tick marks in my code. I take the view that as soon as I find myself trying to propagate lifetime annotations all over my code it is a clue that I'm probably doing something wrong, that my code is (or more correctly my data) is not structured well for the purpose I want to use it for. At which point it's time to step back from all the line by line syntactic details and review the over all plan.

I have no idea what you are trying to do but a few things hit me after a glance at your code.

  1. You have some vector samples and you end up with a vector of references into it, elements. Straight way there is a lifetime issue. those references are only valid as long as samples is alive and one has to convince the compiler that is always true.

  2. That bunch of references is then buried in a struct, which is itself is now only valid as long as samples is alive.

  3. Then, I infer from the mention of Mutex, you intend to pass that struct to another thread. This sounds like an impossibility already. The second thread would have references to the samples vector which is still owned by the first thread. Who's to say the first thread does not drop the samples while the second thread is still running and trying to use them?

  4. Then, I infer from your mention of rustler, that you intend to pass all this into Erlang world. Now there would be Erlang code that depends on the continued existence of samples somewhere in the Rust code.

I think I agree with 2e71828, getting into lifetimes is useful for short-lived or short term usage of data preferably in a confined area of code. When it comes to sharing data between threads and code far away in time and space, especially in a different language things become impossible.

It's time to take ownership of your data and pass that ownership to whoever needs it.

2 Likes

Oh I see. I get what you meant now. Thanks. I will take your last suggestion then, which I think is the correct one for my goals (simplest solution). Move the centroids and membership into the persistent object and throw out the 'lifetime polluted' object.

ie. Like

pub struct MyCluster {

    my_membership: Vec<usize>,
    my_centroids: Vec<Centroid>,
}

Then:

    let clustering = kmeans(k, &samples, max_iter); 
    
    let mut my_cluster_container = MyCluster::new();
    my_cluster_container.my_membership = clustering.membership;
    my_cluster_container.my_centroids = clustering.centroids;

Then I keep my_cluster_container which has moved the important things over to it. Anything I need from &samples I could also copy in there at this point.

Much better and easier. Seems to work that I can tell. Thanks.

3 Likes

All references except &'static are temporary. The lifetime annotation keeps track of where the struct is borrowing the data from (when you use a reference, the data is not stored in the struct).

This lifetime misery and spamming of <'a> everywhere is intentional, because structs containing temporary references are not normal data types. They are a special case, where the usage of the struct must always be limited to the scope of the data source the struct is borrowing from, and Rust will never extend lifetime of any loan — it can't, because it doesn't have a garbage collector. So if you have a struct that became a temporary view of data in a local variable, that struct is as inflexible and as limited as the scope of that local variable.

In most cases the right approach is to not use temporary references in structs at all. In this case, the function is merely clustering an existing data set, to it does make sense to reference the data instead of copying it. However, if you want to keep that data, you need to create a struct that has its own permanent storage of this data, not just a temporary view into a local variable elsewhere.

For such cases, Rust has Cow<'a, [T]> type, which can contain either a temporary view &[T] into data elsewhere, or you can call Cow::Owned(cow.into_owned()) on it to make a copy of the data as owning Vec<T>, which turns Cow<'temporary, [T]> into Cow<'static, [T]> (in this case lifetime becomes irrelevant, and 'static is a placeholder that makes borrowing irrelevant). You could also just make another struct type with a regular Vec<T> in it, and copy the data in there.


BTW, Rust has a very annoying limitation here — a struct is not allowed to borrow data from itself, or any place that stores the data and the struct together (these are called self-referential). Structs can only borrow data that is independent of them, and is already stored elsewhere. This means you can't keep clustering: Clustering<'a, T> and samples variable in the same struct or tuple.

5 Likes

Hmm, doesn't Cow::into_owned() consumes self and returns T as ToOwned?

A method like:

fn detach_cow<'a, T: ToOwned>(cow: Cow<'_, T>) -> Cow<'a, T> {
    Cow::Owned(cow.into_owned())
}

is missing on Cow, isn't?

2 Likes

Yes, you're right.