Help getting a simple k-means clustering implementation running in Rust (mismatched types error with rkm)

GENERAL GOAL

I am trying to get a simple K-Means Clustering implementation running.

My goal is simply this:

  • Provide a list RawPoints of [x, y] points, i.e. (long, lat), in any simple argument format
  • Run a k-means function on the list of data points for n iterations
  • Return the [x, y] positions of the solved centroids
  • Return a list of which RawPoints indices are assigned to which centroid

I am not sure how people usually structure the output, but I imagine that for RawPoints indexed i = 0 to n - 1, you could return a list of centroids, where each entry in that list is itself a list of the RawPoints indices i assigned to that centroid.

So in pseudocode, you would have:

fn convertData(rawData: Vec<[f32; 2]> ) -> type {
    return readyForKMeans; 
}

fn runKMeans(rawData: Vec<[f32; 2]>) -> whatever {
    let convertedData = convertData(rawData);
    let result = processKMeans(convertedData);
    let centroids = result.centroids; // list of [x, y] points, one per centroid (N centroids)
    let pointsIndicesPerCentroid = result.indices; // length N (one entry per centroid); each entry is a list of indices in 0..rawData.len()
}

So hypothetically if you had 9 points that were assigned to 3 centroids, you might get a result of:

centroids = [(0,3), (2,5), (10,3)]; //some type of list of the 3 centroid [x,y] positions
pointIndices[0] = [0,1,2]; // points 0, 1, 2 of the original 9-point list belong to centroid 0, i.e. are clustered around (0,3)
pointIndices[1] = [3,8,7]; // points 3, 8, 7 of the original list belong to centroid 1, i.e. around (2,5)
pointIndices[2] = [4,5,6]; // points 4, 5, 6 of the original list belong to centroid 2, i.e. around (10,3)

Is this generally how k-means results are computed and returned?
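On the return shape: the `kmeans_lloyd` signature quoted in the error output further down returns a `Vec<usize>`, which appears to be one cluster label per point rather than one index list per centroid. Assuming that reading is right, regrouping is a short loop. The sketch below uses a hypothetical helper named `group_indices`, which is not part of any library:

```rust
/// Hypothetical helper: turn one cluster label per point into
/// one list of point indices per cluster.
fn group_indices(assignments: &[usize], k: usize) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = vec![Vec::new(); k];
    for (point_index, &cluster) in assignments.iter().enumerate() {
        // Indexing panics if a label is >= k, i.e. if the input is inconsistent.
        groups[cluster].push(point_index);
    }
    groups
}

fn main() {
    // Labels matching the 9-point example: point i belongs to cluster labels[i].
    let labels: [usize; 9] = [0, 0, 0, 1, 2, 2, 2, 1, 1];
    let groups = group_indices(&labels, 3);
    println!("{:?}", groups); // prints [[0, 1, 2], [3, 7, 8], [4, 5, 6]]
}
```

The indices inside each group come out in ascending order, so cluster 1 is listed as [3, 7, 8] rather than [3, 8, 7]; the membership is the same.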

I am open to any good or simple K Means approach or library that is effective and easy to use.
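As one dependency-free option, the goal above can be sketched with the standard library alone. This is a minimal version of Lloyd's algorithm under some simplifying assumptions: the names `KMeansResult` and `run_k_means` are made up for illustration, centroids are seeded with the first k points (real libraries usually seed randomly or with k-means++), and an empty cluster simply keeps its previous centroid.

```rust
/// Centroid positions plus, for each centroid, the indices of the
/// input points assigned to it.
#[derive(Debug)]
struct KMeansResult {
    centroids: Vec<[f32; 2]>,
    indices: Vec<Vec<usize>>,
}

/// Squared Euclidean distance between two 2-D points.
fn dist_sq(a: [f32; 2], b: [f32; 2]) -> f32 {
    let dx = a[0] - b[0];
    let dy = a[1] - b[1];
    dx * dx + dy * dy
}

/// Index of the centroid closest to `p` (first one wins on ties).
fn nearest(p: [f32; 2], centroids: &[[f32; 2]]) -> usize {
    let mut best = 0;
    for (c, &centroid) in centroids.iter().enumerate() {
        if dist_sq(p, centroid) < dist_sq(p, centroids[best]) {
            best = c;
        }
    }
    best
}

/// Hypothetical helper: runs Lloyd's algorithm for `iterations` rounds.
/// Assumes 1 <= k <= points.len(); seeds centroids with the first k points.
fn run_k_means(points: &[[f32; 2]], k: usize, iterations: usize) -> KMeansResult {
    assert!(k >= 1 && k <= points.len(), "need 1 <= k <= number of points");
    let mut centroids: Vec<[f32; 2]> = points[..k].to_vec();

    for _ in 0..iterations {
        // Assign each point to its nearest centroid and accumulate sums.
        let mut sums = vec![[0.0f32; 2]; k];
        let mut counts = vec![0usize; k];
        for &p in points {
            let c = nearest(p, &centroids);
            sums[c][0] += p[0];
            sums[c][1] += p[1];
            counts[c] += 1;
        }
        // Move each centroid to the mean of its points.
        for c in 0..k {
            if counts[c] > 0 {
                let n = counts[c] as f32;
                centroids[c] = [sums[c][0] / n, sums[c][1] / n];
            }
        }
    }

    // Final assignment against the final centroids.
    let mut indices: Vec<Vec<usize>> = vec![Vec::new(); k];
    for (i, &p) in points.iter().enumerate() {
        indices[nearest(p, &centroids)].push(i);
    }
    KMeansResult { centroids, indices }
}

fn main() {
    let points: [[f32; 2]; 6] = [
        [0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
        [1.0, 0.0], [10.0, 11.0], [11.0, 10.0],
    ];
    let result = run_k_means(&points, 2, 10);
    println!("centroids: {:?}", result.centroids);
    println!("indices:   {:?}", result.indices);
}
```

With the six sample points, indices 0, 2, 3 end up in one cluster and 1, 4, 5 in the other. For (long, lat) data spread over large distances, plain Euclidean distance is only an approximation, so treat this as a starting point rather than a finished solution.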

SPECIFIC ATTEMPT

I attempted this using the rkm crate, as it looks simple and I think it works roughly this way.

However I am hitting errors even just trying to replicate their demo code.

Their Code

Their example code reads:

fn read_test_data() -> Array2<f32> {
    let mut data_reader = csv::Reader::from_path("data/iris.data.csv").unwrap();
    let mut data: Vec<f32> = Vec::new();
    for record in data_reader.records() {
        for field in record.unwrap().iter() {
            let value = f32::from_str(field);
            data.push(value.unwrap());
        }
    }
    Array2::from_shape_vec((data.len() / 2, 2), data).unwrap()
}

pub fn main() {
    let data = read_test_data();
    let (means, clusters) = rkm::kmeans_lloyd(&data.view(), 3);
    println!(
        "data:\n{:?}\nmeans:\n{:?}\nclusters:\n{:?}",
        data, means, clusters
    );
    plot_means_clusters(&data.view(), &means.view(), &clusters);
}
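The demo builds its array from a flat, row-major `Vec<f32>` and a `(rows, 2)` shape. Starting from the `Vec<[f32; 2]>` input described in the goal, that flat buffer could be produced as sketched below. `flatten_points` is a hypothetical helper name, and the sketch deliberately avoids ndarray so it runs on its own:

```rust
/// Hypothetical helper: flatten [x, y] pairs into a row-major buffer
/// plus the (rows, cols) shape that describes it.
fn flatten_points(points: &[[f32; 2]]) -> (Vec<f32>, (usize, usize)) {
    // Each [x, y] pair contributes two consecutive values: x, then y.
    let flat: Vec<f32> = points.iter().flatten().copied().collect();
    (flat, (points.len(), 2))
}

fn main() {
    let raw: Vec<[f32; 2]> = vec![[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]];
    let (flat, shape) = flatten_points(&raw);
    println!("flat = {:?}, shape = {:?}", flat, shape);
}
```

The resulting pair is in the same form the demo hands to `Array2::from_shape_vec`, so it should slot into that call in place of the CSV-derived data.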

My Code

fn get_test_data(count : u32) -> Array2<f32> {
    //create some meaningless data here to test by formula     
    let arr: Array2<f32> = Array2::from_shape_fn((5, 2), |(row, col)| {
        (row as f32) * 2.0 + (col as f32)
    });
    return arr;
}

fn kmeans_test(count : u32) -> Result<String, String> {
    let data = get_test_data(count);
    let (means, clusters) = rkm::kmeans_lloyd(&data.view(), 3);
    let out_string = format!("data:\n{}\nmeans:\n{}", data, means);
    return Ok(out_string.to_string());
}

Error

With the above I get

error[E0308]: mismatched types
    --> src/lib.rs:108:47
     |
108  |     let (means, clusters) = rkm::kmeans_lloyd(&data.view(), 3);
     |                             ----------------- ^^^^^^^^^^^^ expected `&ArrayBase<ViewRepr<&_>, Dim<...>>`, found `&ArrayBase<ViewRepr<&f32>, ...>`
     |                             |
     |                             arguments to this function are incorrect
     |
     = note: `ArrayBase<ViewRepr<&f32>, Dim<...>>` and `ArrayBase<ViewRepr<&_>, Dim<...>>` have similar names, but are actually distinct types
note: `ArrayBase<ViewRepr<&f32>, Dim<...>>` is defined in crate `ndarray`
    --> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ndarray-0.16.1/src/lib.rs:1280:1
     |
1280 | pub struct ArrayBase<S, D>
     | ^^^^^^^^^^^^^^^^^^^^^^^^^^
note: `ArrayBase<ViewRepr<&_>, Dim<...>>` is defined in crate `ndarray`
    --> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ndarray-0.12.1/src/lib.rs:1026:1
     |
1026 | pub struct ArrayBase<S, D>
     | ^^^^^^^^^^^^^^^^^^^^^^^^^^
     = note: perhaps two different versions of crate `ndarray` are being used?
note: function defined here
    --> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rkm-0.8.1/src/lib.rs:393:8
     |
393  | pub fn kmeans_lloyd<V: Value>(data: &ArrayView2<V>, k: usize) -> (Array2<V>, Vec<usize>) {
     |        ^^^^^^^^^^^^

I am not sure what the problem is.

Both my "data" function and theirs create an Array2<f32>, so why does the compiler say the argument to rkm::kmeans_lloyd is wrong in my case?

The error also says "note: perhaps two different versions of crate ndarray are being used?", but I am not sure what to do about this. I have to add ndarray = "0.16.1" to my Cargo.toml, otherwise the build fails because it can't find ndarray.

I should note that I am running Rust inside another language, but I am not aware of any specific reason this should or shouldn't work.

Any obvious explanation?

Or, if you know a better or simpler working way to do k-means clustering as I describe, I am open to it.

Thanks for any help.

Try specifying version 0.13.0 instead.

The latest version of ndarray is 0.16.1, but rkm has a dependency on 0.13.0. You are creating an Array2 with one version of ndarray and passing it to rkm, which was built against another, so I guess that is where the mismatch is coming from...

I should have mentioned that I tried that and it didn't work either. Man, Rust is a tough language. I have given up on this k-means package at this point and will try another. Thanks.
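One detail in the compiler output above may explain why pinning 0.13.0 did not help: the second note places the `ArrayBase` that rkm-0.8.1 expects in `ndarray-0.12.1`, not 0.13.0. A sketch of a Cargo.toml entry that matches that output, assuming rkm 0.8.1 and that nothing else in the project requires a newer ndarray (neither assumption has been tested here):

```toml
[dependencies]
rkm = "0.8.1"
# Assumption: match the ndarray release named in the compiler output for rkm 0.8.1.
ndarray = "0.12.1"
```

After changing the version, running `cargo tree -d` lists any crate that still appears in more than one version, which is a quick way to check whether a single ndarray remains in the build.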