Help Needed with KMeans.centroids() Method in Rust

Hello Rust Community,

I'm currently working on a project involving the linfa-clustering crate in Rust, specifically using the KMeans algorithm to cluster data. I'm encountering an issue with the centroids() method, which is returning centroids in an unexpected format.

According to the documentation and examples I've seen, centroids() should return centroids as arrays of two columns (for 2D data). However, when I apply it to a basic Iris dataset for example, I get centroids like this:

[[63.49999999870944, 6.034615384556769, 2.784615384593053, 4.315384615375798],
 [13.00000000161835, 5.027999999996514, 3.479999999940242, 1.4600000000174267],
 [114.50000128197149, 6.645833287602458, 2.9333333196899436, 5.6624999115742085],
 [38.00000000162139, 4.9839999999982885, 3.3560000000479704, 1.4679999999829425],
 [138.5000012823333, 6.575000041839055, 3.0125000236583857, 5.441666738200124],
 [89.49999999870744, 5.846153846225979, 2.7730769230986314, 4.303846153855346]]

Instead of the expected format:

[[ 6.25559436e+01 -1.97357959e-02],
 [-6.30286820e+01 -5.90974586e-01],
 [-1.35102612e+01 -7.83418521e-02],
 [ 1.19636167e+01  1.08604407e+00],
 [-3.85579122e+01  5.89060663e-01],
 [ 3.75776032e+01 -1.05313324e+00]]

I'm unsure why the centroids are returned in four columns instead of two. I've double-checked my dataset and clustering parameters, but I can't seem to find the issue.

If anyone has experience with linfa-clustering or the KMeans algorithm in Rust and can provide insights or suggestions on how to correctly retrieve centroids in the expected format, I would greatly appreciate your help.

Here is my source code:

Thank you in advance!

Best regards

Victor Rodriguez

I'm going to start out by saying that I don't have any experience with this crate, and I don't have a ton of experience with these algorithms. I know what they are, but I haven't used them specifically. Given that...

It looks to me like the step that is supposed to reduce the dimensionality doesn't actually do so. It seems that

    // reduce dimensionality of the dataset
    let new_points = embedding.predict(dataset.clone());

makes new_points a DatasetBase that holds both records, which keeps your original shape of (n, 4), and targets, which has the desired shape of (n, 2). But when I look at the implementations of fit and predict, both appear to operate on records, which explains why you're getting centroids for the original dataset with 4 features. So I think your dimensionality reduction isn't working as you expect.
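To make the records/targets distinction concrete, here is a minimal std-only sketch. `ToyDataset` and `centroid` are hypothetical stand-ins written for this illustration, not linfa types or functions, and the numbers are made up. It only shows that anything computed from records keeps 4 columns, while the 2-column data sits untouched in targets.

```rust
// Hypothetical stand-in for a dataset with separate `records` and `targets`.
// This is NOT linfa's `DatasetBase`; it only mirrors the two fields discussed above.
struct ToyDataset {
    records: Vec<Vec<f64>>, // original features, shape (n, 4)
    targets: Vec<Vec<f64>>, // reduced features, shape (n, 2)
}

/// Column-wise mean of a set of rows, i.e. the centroid of a single cluster.
fn centroid(rows: &[Vec<f64>]) -> Vec<f64> {
    let n = rows.len() as f64;
    let dims = rows.first().map_or(0, |r| r.len());
    let mut mean = vec![0.0; dims];
    for row in rows {
        for (m, x) in mean.iter_mut().zip(row) {
            *m += *x / n;
        }
    }
    mean
}

fn main() {
    // Made-up values, chosen only to give the two fields different widths.
    let data = ToyDataset {
        records: vec![vec![1.0, 2.0, 3.0, 4.0], vec![3.0, 4.0, 5.0, 6.0]],
        targets: vec![vec![0.5, -0.5], vec![1.5, 0.5]],
    };

    // A model that reads `records` produces a 4-column centroid...
    println!("from records: {:?}", centroid(&data.records));
    // ...while the 2-column result only appears when `targets` is used.
    println!("from targets: {:?}", centroid(&data.targets));
}
```

If that picture is right, the fix is to hand the clustering step the 2-column data explicitly instead of the whole dataset.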

I went searching through the repo to find out how dimensionality reduction is supposed to work, and the only thing I found was this blog post for a prior release: Linfa Toolkit. The post uses the transform method to actually finish the dimensionality reduction. I tried calling that on the embedding struct, and it seemed to create a DatasetBase with the correct dimensionality. But at that point I decided I was past my expertise to know whether that's what you want to do.

I hope my vague stumbling was helpful, but if it wasn't, I would suggest you open an issue in the GitHub repo for the linfa project and ask there how to do what you're looking to do; the author should be able to give you clearer instructions.

Thanks a lot, @KSwanson! Your help allowed me to identify the bug. I have updated the GitHub repository of my experiments with the correct code to match centroids with the Python implementation. Here is the snippet that made the difference:

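// `targets` holds the 2D embedding produced by predict; `records` is not needed here.
let DatasetBase { targets, .. } = new_points.clone();

// Wrap the embedding in a new dataset so that it becomes the records KMeans fits on.
let targets_db = DatasetBase::from(targets);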

I can then pass this to KMeans:

let _model = KMeans::params(optimal_k)
    .fit(&targets_db)
    .expect("KMeans fitted");

Now, the centroids match:

Rust centroids:

[[0.0459256061926112, 0.578095490863391],
[-2.6408407583131304, -0.19051995278137163],
[2.0672799105736805, -0.11279797423926441],
[3.1282010704615657, -0.6264387171050151],
[1.0991107682005696, 0.1354053664731982]]

Python centroids:

[[ 0.04592561 -0.57809549]
[ 1.09911077 -0.13540537]
[-2.52124909  0.5281888 ]
[ 2.06727994  0.11279799]
[ 3.12820371  0.62644023]
[-2.78123098 -0.20587391]]

PC2 is multiplied by -1 compared with the Python result, but that is fine for me. Maybe the Rust library uses a different implementation.
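One way to check the sign-flip observation is to flip the second column of the Rust centroids and, for each row, look up the nearest Python centroid. The sketch below is std-only; `flip_column` and `nearest_distance` are hypothetical helper names introduced for this illustration, and the numbers are copied from the two listings above.

```rust
/// Return a copy of the centroids with the sign of column `col` flipped.
fn flip_column(centroids: &[[f64; 2]], col: usize) -> Vec<[f64; 2]> {
    centroids
        .iter()
        .map(|c| {
            let mut flipped = *c;
            flipped[col] = -flipped[col];
            flipped
        })
        .collect()
}

/// Euclidean distance from `point` to its nearest neighbour in `others`.
fn nearest_distance(point: &[f64; 2], others: &[[f64; 2]]) -> f64 {
    others
        .iter()
        .map(|o| ((point[0] - o[0]).powi(2) + (point[1] - o[1]).powi(2)).sqrt())
        .fold(f64::INFINITY, f64::min)
}

fn main() {
    // Centroids as posted above (Rust: 5 rows, Python: 6 rows).
    let rust_centroids = [
        [0.0459256061926112, 0.578095490863391],
        [-2.6408407583131304, -0.19051995278137163],
        [2.0672799105736805, -0.11279797423926441],
        [3.1282010704615657, -0.6264387171050151],
        [1.0991107682005696, 0.1354053664731982],
    ];
    let python_centroids = [
        [0.04592561, -0.57809549],
        [1.09911077, -0.13540537],
        [-2.52124909, 0.5281888],
        [2.06727994, 0.11279799],
        [3.12820371, 0.62644023],
        [-2.78123098, -0.20587391],
    ];

    // Flip PC2 (column index 1), then report how close each row gets.
    let flipped = flip_column(&rust_centroids, 1);
    for c in &flipped {
        let d = nearest_distance(c, &python_centroids);
        println!("{:?} -> nearest Python centroid at distance {:.6}", c, d);
    }
}
```

Rows that print a distance near zero agree with Python up to the sign of PC2. Because the two listings have different numbers of clusters (5 versus 6), not every row can be expected to pair up exactly.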

This is a great community, and everyone is very kind to help.

Thanks a lot

I'm glad my rubber ducking helped!
