I have a 6-row / 3-column matrix, and for each of the last 3 rows (the 4th, 5th, and 6th) I want to find the 2 most similar rows among the rows before it. Take the 4th row as an example: I want to know which 2 of the 1st, 2nd, and 3rd rows are most similar to the 4th row, and then I want to take the mean of the last column of those most similar rows.
Is there a smart method to do this?
"similar" is a very vague term, so "most similar" does not adequately describe what exactly you are after. Either explain the exact notion of "similarity" you want, or tell us more about your use case if you need help coming up with a fitting notion of "similar".
So points that are close under Euclidean distance, got it. For further clarification: you explained the 4th-row example well, but for the 5th and 6th rows, the 2 most "similar" rows are chosen out of which rows exactly? All preceding rows, or only the first 3? Given your description, I would assume all preceding rows, but I'm not 100% sure. So for the 5th row, the 2 most similar rows out of the 1st, 2nd, 3rd, and 4th rows would be found (and then the mean of their last entries taken), and for the 6th row, all 5 previous rows would be considered. Is this correct?
Yes, you are right:
for the 5th row, the 2 most similar rows should come from the 1st/2nd/3rd/4th rows;
for the 6th row, the 2 most similar rows should come from the 1st/2nd/3rd/4th/5th rows.
I am now trying to do this with the 'smartcore' crate, but without success so far...
This is a naive approach, but I think it does what you want. I'm not very familiar with the best and simplest way to use ndarray, so maybe someone can improve my code.
I basically iterate over every row the way I think you want, i.e. for row 3 I look at rows 0 to 2, for row 4 at rows 0 to 3, and for row 5 at rows 0 to 4. I compute the squared Euclidean distance for each pair and store it together with the row index in a vector. I then sort the vector by distance, take the two row indices with the lowest distances, and compute the mean of the elements in the last column at those indices:
fn main() {
    let arr = ndarray::Array2::<f32>::from_shape_vec(
        (6, 3),
        vec![
            1.0, 2.3, 2.5, 3.2, 3.5, 6.5, 4.5, 6.5, 2.5, 5.8, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
            9.0,
        ],
    )
    .unwrap();
    for (i, row1) in arr.rows().into_iter().skip(3).enumerate() {
        let mut similarity: Vec<(f32, usize)> = Vec::new();
        // Only compare against the rows preceding the current target row.
        for (j, row2) in arr.rows().into_iter().take(3 + i).enumerate() {
            let mut d = 0.;
            for (c1, c2) in row1.into_iter().zip(row2.into_iter()) {
                d += (c1 - c2).powi(2);
            }
            similarity.push((d, j));
        }
        similarity.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
        let most_similar: Vec<usize> = similarity.into_iter().take(2).map(|v| v.1).collect();
        let last_column = arr.columns().into_iter().last().unwrap();
        let mean = (last_column[most_similar[0]] + last_column[most_similar[1]]) / 2.;
        println!("{mean}");
    }
}
You can use array-broadcasting operations to calculate an r×r matrix of distances, but I think you might need to hand-write a loop to do the searching:
// Calculate all element-wise differences
let diff = &arr.view().insert_axis(Axis(0))
    - &arr.view().insert_axis(Axis(1));
assert_eq!(diff.shape(), &[6, 6, 3]);

// Square those differences and sum along the original columns
let dist = (&diff * &diff).sum_axis(Axis(2));
assert_eq!(dist.shape(), &[6, 6]);
Here's what I came up with, though it's a bit lengthy because I didn't want to use sorting to find the two smallest distances, nor did I want to produce any new allocations (apart from storing the mean values in a new data structure).
use ndarray::{azip, s, Array2};

fn main() {
    let arr = Array2::<f64>::from_shape_vec(
        (6, 3),
        vec![
            1.0, 2.3, 2.5, 3.2, 3.5, 6.5, 4.5, 6.5, 2.5, 5.8, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
            9.0,
        ],
    )
    .unwrap();
    let means: Vec<_> = (3..arr.nrows())
        .map(|i| {
            let row = arr.row(i);
            // Track the two smallest distances (and their last-column
            // values) in a single pass, instead of sorting.
            let mut closest_dist = f64::INFINITY;
            let mut closest_last = f64::NAN;
            let mut second_closest_dist = f64::INFINITY;
            let mut second_closest_last = f64::NAN;
            arr.slice(s![..i, ..])
                .rows()
                .into_iter()
                .for_each(|other_row| {
                    let mut d = 0.0;
                    azip!((&x in &row, &y in &other_row) d += (x - y).powi(2));
                    if d < second_closest_dist {
                        if d < closest_dist {
                            second_closest_dist = closest_dist;
                            second_closest_last = closest_last;
                            closest_dist = d;
                            closest_last = *other_row.last().unwrap();
                        } else {
                            second_closest_dist = d;
                            second_closest_last = *other_row.last().unwrap();
                        }
                    }
                });
            (closest_last + second_closest_last) / 2.0
        })
        .collect();
    dbg!(&means);
}