2D ndarray: Select most 2 similar rows before for some last rows

I have a 2D ndarray whick looks like below:

let arr = ndarray::Array2::from_shape_vec((6, 3), vec![1.0, 2.3, 2.5, 3.2, 3.5, 6.5, 4.5, 6.5, 2.5, 5.8, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]).unwrap();

This is a 6 rows/ 3 columns matrix and I want to know the most similar 2 rows before the each target row for the last 3 rows(that is 4th,5th,6th). Take the 4th row as example, I want to know which 2 rows in 1th, 2th, 3th rows are most similar with the 4th row and after that I want to get the mean of last column of the most similar rows to be collect.
Is there some smart method to do this?

"similar" is a very vague term, so "most similar" does not adequately describe what exactly you are after. Either try to explain the exact notion or "similarity" you are after, or maybe tell us more about your use case if you need help coming up with a fitting notion of "similar".

3 Likes

Two rows, r1 and r2;
"similar" means the value of (r1[0] - r2[0]) ^ 2 + (r1[1] - r2[1]) ^ 2 + (r1[2] - r2[2]) ^ 2 is as possible as small.

So points that are close under euclidean distance, got it. For further clarification, you explained the 4th row example well, but for 5th and 6th row, the 2 most “similar” out of which rows exactly are considered? All preceding or the 3 preceding? Given your description, I would assume that all preceding is more likely, but I’m not 100%. So for 5th row, the 2 most similar out of 1st, 2nd, 3rd, and 4th row would be found (and then the mean of the last entry taken), and for 6th row, all 5 previous rows would be considered. Is this correct?

1 Like

yes you are right;
for the 5th row, the most 2 similar rows should come from 1th/2th/3th/4th rows;
for the 6th row, the most 2 similar rows should come from 1th/2th/3th/4th/5th rows;

Now I am trying to do this by 'smartcore' crate, but not be ok...

This is a naive approach, but I think it does what you want. I'm not very familiar with how to use ndarray the best and simplest way, maybe someone can improve my code?

I basically iterate over every row the way I think you want, i.e. for row 3 I look at rows 0 to 2, for row 4 I look at row 0 to 3 and for row 5 I look at row 0 to 4. I compute the squared euclidean distance for each and save them with the row index in a vector. I sort the vector based on the distance and take the two row indices with the lowest distance, computing the mean of the elements in the last column at these respective indices:

fn main() {
    let arr = ndarray::Array2::<f32>::from_shape_vec(
        (6, 3),
        vec![
            1.0, 2.3, 2.5, 3.2, 3.5, 6.5, 4.5, 6.5, 2.5, 5.8, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
            9.0,
        ],
    )
    .unwrap();

    for (i, row1) in arr.rows().into_iter().skip(3).enumerate() {
        let mut similarity: Vec<(f32, usize)> = Vec::new();

        for (j, row2) in arr.rows().into_iter().take(3 + i).enumerate() {
            let mut d = 0.;

            for (c1, c2) in row1.into_iter().zip(row2.into_iter()) {
                d += (c1 - c2).powi(2)
            }

            similarity.push((d, j));
        }

        similarity.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());

        let most_similar: Vec<usize> = similarity.into_iter().take(2).map(|v| v.1).collect();

        let last_column = arr.columns().into_iter().last().unwrap();

        let mean = (last_column[most_similar[0]] + last_column[most_similar[1]]) / 2.;

        println!("{mean}");
    }
}

Playground.

1 Like

You can use array-broadcasting operations to calculate a r×r matrix of distances, but I think you might need to hand-write a loop to do the searching:

    // Calculate all element-wise differences
    let diff = &arr.view().insert_axis(Axis(0))
             - &arr.view().insert_axis(Axis(1));
    assert_eq!(diff.shape(), &[6,6,3]);
    
    // Square those differences and sum along the original columns
    let dist = (&diff * &diff).sum_axis(Axis(2));
    assert_eq!(dist.shape(), &[6,6]);
2 Likes

Thanks, much better:

use ndarray::Axis;

fn main() {
    let arr = ndarray::Array2::<f32>::from_shape_vec(
        (6, 3),
        vec![
            1.0, 2.3, 2.5, 3.2, 3.5, 6.5, 4.5, 6.5, 2.5, 5.8, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
            9.0,
        ],
    )
    .unwrap();

    // Calculate all element-wise differences
    let diff = &arr.view().insert_axis(Axis(0))
             - &arr.view().insert_axis(Axis(1));
    assert_eq!(diff.shape(), &[6,6,3]);
    
    // Square those differences and sum along the original columns
    let dist = (&diff * &diff).sum_axis(Axis(2));
    assert_eq!(dist.shape(), &[6,6]);

    for i in 3..6 {
        let mut similarity: Vec<(usize, &f32)> = dist.row(i).into_iter().take(i).enumerate().collect();

        similarity.sort_by(|a, b| a.1.partial_cmp(b.1).unwrap());

        let most_similar: Vec<usize> = similarity.into_iter().take(2).map(|v| v.0).collect();

        let last_column = arr.columns().into_iter().last().unwrap();

        let mean = (last_column[most_similar[0]] + last_column[most_similar[1]]) / 2.;

        println!("{mean}");
    }
}

Playground.


This still feels like it can be handled better:

let last_column = arr.columns().into_iter().last().unwrap();

let mean = (last_column[most_similar[0]] + last_column[most_similar[1]]) / 2.;

And if performance is important than the sorting should be replaced with a basic linear search for the two smallest distances.

1 Like

Here’s what I came up with, though it’s a bit lengthy because I didn’t want to use sorting to find the two smallest distances, and neither did I want to produce any new allocations (apart from storing the mean values in some new data structure).

use ndarray::{azip, s, Array2};

fn main() {
    let arr = Array2::<f64>::from_shape_vec(
        (6, 3),
        vec![
            1.0, 2.3, 2.5, 3.2, 3.5, 6.5, 4.5, 6.5, 2.5, 5.8, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
            9.0,
        ],
    )
    .unwrap();

    let means: Vec<_> = (3..arr.nrows())
        .map(|i| {
            let row = arr.row(i);
            let mut closest_dist = f64::INFINITY;
            let mut closest_last = f64::NAN;
            let mut second_closest_dist = f64::INFINITY;
            let mut second_closest_last = f64::NAN;
            arr.slice(s![..i, ..])
                .rows()
                .into_iter()
                .for_each(|other_row| {
                    let mut d = 0.0;
                    azip!((&x in &row, &y in &other_row) d += (x - y).powi(2));
                    if d < second_closest_dist {
                        if d < closest_dist {
                            second_closest_dist = closest_dist;
                            second_closest_last = closest_last;
                            closest_dist = d;
                            closest_last = *other_row.last().unwrap();
                        } else {
                            second_closest_dist = d;
                            second_closest_last = *other_row.last().unwrap();
                        }
                    }
                });
            (closest_last + second_closest_last) / 2.0
        })
        .collect();
    dbg!(&means);
}
4 Likes

I tried to clean this up a bit. It won't be as efficient a @steffahn's solution, but it might be a little easier to modify if requirements change:

fn main() {
    const R:usize = 6;
    const C:usize = 3;
    const N:usize = 2;

    let arr = ndarray::Array2::<f32>::from_shape_vec(
        (R, C),
        vec![
            1.0, 2.3, 2.5, 3.2, 3.5, 6.5, 4.5, 6.5, 2.5, 5.8, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
            9.0,
        ],
    )
    .unwrap();

    // Calculate all element-wise differences
    let diff = &arr.view().insert_axis(Axis(0))
             - &arr.view().insert_axis(Axis(1));
    assert_eq!(diff.shape(), &[R,R,C]);
    
    // Square those differences and sum along the original columns
    let dist = (&diff * &diff).sum_axis(Axis(2));
    assert_eq!(dist.shape(), &[R,R]);

    // Storage for our search results
    let mut most_similar: [[usize; N];R-N] = [[0;N];R-N];
    
    for i in N..R {
        let mut similarity: BinaryHeap<(Reverse<NotNan<f32>>, usize)> =
            dist.slice(s![i, ..i]).indexed_iter()
                .map(|(i,&f)| (Reverse(NotNan::new(f).unwrap()), i))
                .collect();

        for v in most_similar[i-N].iter_mut() {
            *v = similarity.pop().unwrap().1;
        }
    }
    
    let last_col = arr.slice(s![.., C-1]);
    let result = most_similar.map(|ids| last_col.select(Axis(0), &ids).mean().unwrap());
    
    dbg!(result);
}
2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.