GENERAL GOAL
I am trying to get a simple K-Means Clustering implementation running.
My goal is simply this:
- Provide a list RawPoints of [x,y] points, i.e. (long, lat), in any simple argument format
- Run a k-means function on the list of data points for n iterations
- Return the [x,y] positions of the centroids solved
- Return a list of which indices of the RawPoints should be assigned to which centroids.
I am not sure how people usually handle the data, but I could imagine that for RawPoints i=0 to i=n, you could return a list of Centroids, and each value in the Centroids list would be a list of the i indices of RawPoints that belong to that Centroid.
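To make the shape I mean concrete, here is a minimal sketch in plain Rust. It assumes the clustering step hands back one cluster label per point (which is what the Vec<usize> in the kmeans_lloyd signature further down seems to be) and regroups those labels into one index list per centroid. The name group_indices is my own hypothetical helper, not part of any library.

```rust
/// Groups point indices by cluster label.
/// `labels[i]` is the cluster assigned to point `i`; `k` is the number of clusters.
/// Hypothetical helper for illustration, not taken from any crate.
fn group_indices(labels: &[usize], k: usize) -> Vec<Vec<usize>> {
    // One (initially empty) index list per centroid.
    let mut groups = vec![Vec::new(); k];
    for (i, &label) in labels.iter().enumerate() {
        // Assumes every label is in 0..k; an out-of-range label panics here.
        groups[label].push(i);
    }
    groups
}
```

For example, calling group_indices(&[0, 0, 1, 2, 1], 3) would give [[0, 1], [2, 4], [3]].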
So in pseudocode, you would have:
fn convertData(rawData: Vec<[f32; 2]>) -> type {
    return readyForKMeans;
}
fn runKMeans(rawData: Vec<[f32; 2]>) -> whatever {
    let convertedData = convertData(rawData);
    let result = processKMeans(convertedData);
    let centroids = result.centroids; // list of [x,y] points, one per centroid (N centroids)
    let pointsIndicesPerCentroid = result.indices; // length N (one entry per centroid); each entry is a list of indices in 0..rawData.len()
}
So hypothetically if you had 9 points that were assigned to 3 centroids, you might get a result of:
centroids = [(0,3), (2,5), (10,3)]; //some type of list of the 3 centroid [x,y] positions
pointIndices[0] = [0,1,2]; //first three i points in original 9 point data list go to centroid 0 ie. are clustered around (0,3)
pointIndices[1] = [3,8,7]; //these three i points in original 9 point data list go to centroid 1 ie. around (2,5)
pointIndices[2] = [4,5,6]; //these three i points in original 9 point data list go to centroid 2 ie. (10,3)
Is this generally how a K Means would be done and returned?
I am open to any good or simple K Means approach or library that is effective and easy to use.
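For reference, this is roughly the behaviour I am after, written as a dependency-free sketch of the standard Lloyd iteration. Everything here is an assumption on my part: the names KMeansResult, run_kmeans, nearest and dist2 are hypothetical, the starting centroids are naively taken as the first k points, and distances are plain squared Euclidean (which is only approximate for long/lat data).

```rust
/// Result shape from the pseudocode above (hypothetical struct).
#[derive(Debug)]
struct KMeansResult {
    centroids: Vec<[f32; 2]>, // one [x, y] per centroid
    indices: Vec<Vec<usize>>, // indices[c] = indices of the points assigned to centroid c
}

/// Squared Euclidean distance between two points.
fn dist2(a: [f32; 2], b: [f32; 2]) -> f32 {
    let (dx, dy) = (a[0] - b[0], a[1] - b[1]);
    dx * dx + dy * dy
}

/// Index of the centroid closest to `p` (ties go to the lower index).
fn nearest(p: [f32; 2], centroids: &[[f32; 2]]) -> usize {
    let mut best = 0;
    for (c, &centroid) in centroids.iter().enumerate() {
        if dist2(p, centroid) < dist2(p, centroids[best]) {
            best = c;
        }
    }
    best
}

/// Runs a fixed number of Lloyd iterations on 2D points.
fn run_kmeans(points: &[[f32; 2]], k: usize, iterations: usize) -> KMeansResult {
    assert!(k > 0 && k <= points.len(), "need 1 <= k <= number of points");
    // Naive initialisation (an assumption): the first k points are the starting centroids.
    let mut centroids: Vec<[f32; 2]> = points[..k].to_vec();
    for _ in 0..iterations {
        // Assignment step: accumulate the coordinate sums of each cluster.
        let mut sums = vec![[0.0f32; 2]; k];
        let mut counts = vec![0usize; k];
        for &p in points {
            let c = nearest(p, &centroids);
            sums[c][0] += p[0];
            sums[c][1] += p[1];
            counts[c] += 1;
        }
        // Update step: move each non-empty cluster's centroid to its mean.
        for c in 0..k {
            if counts[c] > 0 {
                let n = counts[c] as f32;
                centroids[c] = [sums[c][0] / n, sums[c][1] / n];
            }
        }
    }
    // Final assignment, returned as one index list per centroid.
    let mut indices = vec![Vec::new(); k];
    for (i, &p) in points.iter().enumerate() {
        indices[nearest(p, &centroids)].push(i);
    }
    KMeansResult { centroids, indices }
}
```

Calling run_kmeans(&points, 3, 10) on a slice of [x, y] pairs would then return both the centroid positions and the per-centroid index lists in one struct. A real library would presumably use a smarter initialisation and a convergence check instead of a fixed iteration count.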
SPECIFIC ATTEMPT
I have attempted to do something like this using RKM as it looks simple and I think works roughly this way.
However I am hitting errors even just trying to replicate their demo code.
Their Code
Their demo code is:
fn read_test_data() -> Array2<f32> {
    let mut data_reader = csv::Reader::from_path("data/iris.data.csv").unwrap();
    let mut data: Vec<f32> = Vec::new();
    for record in data_reader.records() {
        for field in record.unwrap().iter() {
            let value = f32::from_str(field);
            data.push(value.unwrap());
        }
    }
    Array2::from_shape_vec((data.len() / 2, 2), data).unwrap()
}
pub fn main() {
    let data = read_test_data();
    let (means, clusters) = rkm::kmeans_lloyd(&data.view(), 3);
    println!(
        "data:\n{:?}\nmeans:\n{:?}\nclusters:\n{:?}",
        data, means, clusters
    );
    plot_means_clusters(&data.view(), &means.view(), &clusters);
}
My Code
fn get_test_data(count: u32) -> Array2<f32> {
    // create some meaningless data here to test by formula
    let arr: Array2<f32> = Array2::from_shape_fn((5, 2), |(row, col)| {
        (row as f32) * 2.0 + (col as f32)
    });
    return arr;
}
fn kmeans_test(count: u32) -> Result<String, String> {
    let data = get_test_data(count);
    let (means, clusters) = rkm::kmeans_lloyd(&data.view(), 3);
    let out_string = format!("data:\n{}\nmeans:\n{}", data, means);
    return Ok(out_string.to_string());
}
Error
With the above I get
error[E0308]: mismatched types
--> src/lib.rs:108:47
|
108 | let (means, clusters) = rkm::kmeans_lloyd(&data.view(), 3);
| ----------------- ^^^^^^^^^^^^ expected `&ArrayBase<ViewRepr<&_>, Dim<...>>`, found `&ArrayBase<ViewRepr<&f32>, ...>`
| |
| arguments to this function are incorrect
|
= note: `ArrayBase<ViewRepr<&f32>, Dim<...>>` and `ArrayBase<ViewRepr<&_>, Dim<...>>` have similar names, but are actually distinct types
note: `ArrayBase<ViewRepr<&f32>, Dim<...>>` is defined in crate `ndarray`
--> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ndarray-0.16.1/src/lib.rs:1280:1
|
1280 | pub struct ArrayBase<S, D>
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
note: `ArrayBase<ViewRepr<&_>, Dim<...>>` is defined in crate `ndarray`
--> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ndarray-0.12.1/src/lib.rs:1026:1
|
1026 | pub struct ArrayBase<S, D>
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
= note: perhaps two different versions of crate `ndarray` are being used?
note: function defined here
--> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rkm-0.8.1/src/lib.rs:393:8
|
393 | pub fn kmeans_lloyd<V: Value>(data: &ArrayView2<V>, k: usize) -> (Array2<V>, Vec<usize>) {
| ^^^^^^^^^^^^
I am not sure what the problem is.
Both my and their "data" functions create an Array2<f32>. So why is it telling me the argument to rkm::kmeans_lloyd is wrong in my case?
The error also says "note: perhaps two different versions of crate ndarray are being used?" But I am not sure what to do about this. I have to add ndarray = "0.16.1" to my Cargo.toml or the build fails because it can't find ndarray.
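Going by the paths in the error, rkm-0.8.1 is compiled against ndarray-0.12.1 while my own crate pulls in ndarray-0.16.1. If that mismatch really is the cause, I assume (untested) that pinning my dependency to the version rkm uses would look something like this in Cargo.toml:

```toml
[dependencies]
rkm = "0.8.1"
# Assumption: match the ndarray version that rkm 0.8.1 resolved to in the error output.
ndarray = "0.12.1"
```

The idea would be that with a single ndarray version in the build, the Array2<f32> I create is the same type that kmeans_lloyd expects.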
I should note I am running Rust inside another language like this but I am not aware of any specific reason this should or shouldn't work.
Any obvious explanation?
Or if you know a better or simpler working way to do a K Means clustering like I describe I am open to it.
Thanks for any help.