I had a simple projection algorithm written with both ndarray and nalgebra, and noticed the latter performs substantially faster. I then tested that even simple indexing operations are much faster in nalgebra. Is there a trick I'm missing, or is it simply not recommended to use ndarray for low-dimensional cases? What is the main source of this performance difference? I thought it might be that ndarray has to perform one extra check to validate the index, but even then it should at worst be 2x the runtime of nalgebra, not over 5x. I also tried rewriting the code to e.g. *array.uget_mut((0,0)) += 1.0 inside an unsafe block, but even then ndarray is over 2x slower.
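To clarify what I mean by the unsafe rewrite: it's the ndarray analogue of std's unchecked slice indexing, roughly this sketch (plain std types standing in for the array):

```rust
fn main() {
    let mut v = vec![0.0f64, 0.0];
    // Checked indexing: performs a bounds check and can panic.
    v[0] += 1.0;
    // Unchecked indexing: no bounds check, so the caller must
    // guarantee the index is valid (here it clearly is).
    unsafe {
        *v.get_unchecked_mut(0) += 1.0;
    }
    println!("{}", v[0]);
}
```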

use nalgebra::Point2; // 0.32.2
use std::time::Instant;
use std::hint::black_box;
use ndarray::prelude::*;

fn index_vector(vector: &mut Vec<Point2<f64>>) {
    for _ in 0..10000 {
        vector[0].x += 1.0;
        vector[0].y += 1.0;
        vector[0].x += 0.5;
        vector[0].y += 0.5;
        vector[0].x -= 1.0;
        vector[0].y -= 1.0;
        vector[0].x -= 0.5;
        vector[0].y -= 0.5;
        vector[0].x += 1.0;
        vector[0].y += 1.0;
        vector[0].x -= 0.5;
        vector[0].y -= 0.5;
    }
}

fn index_array(array: &mut Array2<f64>) {
    for _ in 0..10000 {
        array[[0, 0]] += 1.0;
        array[[0, 1]] += 1.0;
        array[[0, 0]] += 0.5;
        array[[0, 1]] += 0.5;
        array[[0, 0]] -= 1.0;
        array[[0, 1]] -= 1.0;
        array[[0, 0]] -= 0.5;
        array[[0, 1]] -= 0.5;
        array[[0, 0]] += 1.0;
        array[[0, 1]] += 1.0;
        array[[0, 0]] -= 0.5;
        array[[0, 1]] -= 0.5;
    }
}

fn main() {
    let mut v = vec![Point2::new(0.0, 0.0)];
    let start = Instant::now();
    for _ in 0..10 {
        let _ = black_box(index_vector(&mut v));
    }
    let elapsed = start.elapsed();
    println!("Elapsed time: {:?}", elapsed);

    let mut v = array![[0.0, 0.0]];
    let start = Instant::now();
    for _ in 0..10 {
        let _ = black_box(index_array(&mut v));
    }
    let elapsed = start.elapsed();
    println!("Elapsed time: {:?}", elapsed);
}

You are running in debug mode, without optimizations. Comparing the performance of non-optimized builds is meaningless. If you run it in release mode, with optimizations turned on, there is no substantial difference between the two: both logged time deltas are on the order of ~550µs.

I think the example I gave was too simplistic, and the original question was unfortunately clouded by my use of a debug build for timing purposes.

Here's a slightly more complex example function which finds the longest edge in a list of consecutive points, and returns both the edge and its length:

use nalgebra::Vector2; // 0.32.2
use std::time::Instant;
use std::hint::black_box;
use ndarray::prelude::*;

fn longest_edge_nalgebra(vector: &mut Vec<Vector2<f64>>) -> (f64, Vector2<f64>) {
    let begin_idx = 0;
    let end_idx = vector.len() - 2;
    let mut longest_edge = vector[0].clone();
    let mut longest_length = longest_edge.norm();
    for i in begin_idx..=end_idx {
        let edge = vector[i + 1] - vector[i];
        let edge_length = edge.norm();
        if edge_length > longest_length {
            longest_length = edge_length;
            longest_edge = edge;
        }
    }
    (longest_length, longest_edge)
}

fn longest_edge_ndarray(array: &mut Array2<f64>) -> (f64, Array1<f64>) {
    let begin_idx = 0;
    let end_idx = array.nrows() - 2;
    let mut longest_edge = array.row(0).to_owned();
    let mut longest_length = longest_edge.dot(&longest_edge).sqrt();
    for i in begin_idx..=end_idx {
        let edge = &array.row(i + 1) - &array.row(i);
        let edge_length = edge.dot(&edge).sqrt();
        if edge_length > longest_length {
            longest_length = edge_length;
            longest_edge = edge;
        }
    }
    (longest_length, longest_edge)
}

fn main() {
    let mut v = Array::from_shape_fn((5000, 2), |(i, j)| if j == 0 { i as f64 } else { 0. });
    let start = Instant::now();
    for _ in 0..10 {
        let _ = black_box(longest_edge_ndarray(&mut v));
    }
    let elapsed = start.elapsed();
    println!("Elapsed time ndarray: {:?}", elapsed);

    let mut v: Vec<Vector2<f64>> = vec![];
    for n in 0..5000 {
        v.push(Vector2::new(n as f64, 0.0));
    }
    let start = Instant::now();
    for _ in 0..10 {
        let _ = black_box(longest_edge_nalgebra(&mut v));
    }
    let elapsed = start.elapsed();
    println!("Elapsed time nalgebra: {:?}", elapsed);
}

Output in release mode:

Elapsed time ndarray: 1.207955ms
Elapsed time nalgebra: 291.768µs

There is over a 4x difference. For even more complex algorithms, such as projecting points, the gap widens further in my implementations. So my original question still stands (unless I'm still messing up the timing): what is the reason for this difference? Am I using the ndarray crate wrong somehow, and can its performance be improved?

One thing I noticed is that if I return just the longest length, not the edge itself, the two are comparable in performance (Playground). Based on this, changing just the initialization in the original function to let mut longest_edge = array![array[[0,0]], array[[0,1]]]; brings it down to ~750µs (Playground). I would hope the more readable library functions could still be used to get at least that much improvement. Changing how the longest edge is stored yields results faster than nalgebra's (Playground), but again I would hope this is obtainable with the library functions. I guess using compile-time knowledge of the array sizes matters here, and that is the fundamental source of the difference.

ndarray's internal representation is much more complex, since it allows slicing/striding. Additionally, nalgebra uses const generics internally, which further reduces allocation and bounds checking.
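You can see the representation difference with plain std types (here [f64; 2] stands in for nalgebra's Vector2&lt;f64&gt; and Vec&lt;f64&gt; for ndarray's owned 1-D array; this is a sketch of the idea, not either crate's exact layout):

```rust
use std::mem::size_of;

fn main() {
    // Stand-in for nalgebra's Vector2<f64>: length known at compile
    // time, stored inline on the stack, Copy, no allocation.
    let fixed: [f64; 2] = [1.0, 2.0];
    assert_eq!(size_of::<[f64; 2]>(), 16);

    // Stand-in for ndarray's owned 1-D array: length only known at
    // run time, so the handle holds pointer + length + capacity and
    // the data lives on the heap.
    let dynamic: Vec<f64> = vec![1.0, 2.0];
    assert_eq!(size_of::<Vec<f64>>(), 3 * size_of::<usize>());

    println!("{} {}", fixed[0] + fixed[1], dynamic[0] + dynamic[1]);
}
```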

In general, indexing into an ndarray::Array is much more costly. But if you find a way to use ndarray::Zip instead of indexing directly, the difference won't be large.

Here's a version of longest_edge_ndarray that's within 10% of the nalgebra one.

fn longest_edge_ndarray(array: &Array2<f64>) -> (f64, Array1<f64>) {
    let mut longest_length = array.row(0).dot(&array.row(0)).sqrt();
    let mut longest_edge_i = None;
    array
        .axis_windows(Axis(0), 2)
        .into_iter()
        .enumerate()
        .for_each(|(i, window)| {
            // Compute the squared edge length with Zip, without
            // materializing the edge vector (no allocation).
            let edge_length = ndarray::Zip::from(window.row(1))
                .and(window.row(0))
                .fold(0., |sum, a, b| {
                    let diff = a - b;
                    sum + diff * diff
                })
                .sqrt();
            if edge_length > longest_length {
                longest_length = edge_length;
                longest_edge_i = Some(i);
            }
        });
    // Allocate the returned edge only once, for the winner.
    let longest_edge = if let Some(i) = longest_edge_i {
        &array.row(i + 1) - &array.row(i)
    } else {
        array.row(0).to_owned()
    };
    (longest_length, longest_edge)
}

One thing nalgebra does is store a Vector2<T> as [T; 2], because the length is known at compile time; in ndarray, an Array1<T> of length 2 is similar to a Vec<T> of length 2, which requires a heap allocation. In your version of longest_edge_ndarray, let edge = &array.row(i+1) - &array.row(i); unfortunately creates a fresh Array1<f64>, which allocates on every iteration. Skipping that saves a lot of time.
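For comparison, here's a sketch of the same algorithm over plain [f64; 2] points (std only, no crates): the edge is a stack value, so the loop body never allocates — the same property nalgebra gets from its compile-time length.

```rust
// Each point is a stack [f64; 2]; subtracting two points yields
// another [f64; 2], so the loop body never touches the heap.
fn longest_edge(points: &[[f64; 2]]) -> (f64, [f64; 2]) {
    let mut longest_edge = points[0];
    let mut longest_length =
        (longest_edge[0] * longest_edge[0] + longest_edge[1] * longest_edge[1]).sqrt();
    for w in points.windows(2) {
        let edge = [w[1][0] - w[0][0], w[1][1] - w[0][1]];
        let len = (edge[0] * edge[0] + edge[1] * edge[1]).sqrt();
        if len > longest_length {
            longest_length = len;
            longest_edge = edge;
        }
    }
    (longest_length, longest_edge)
}

fn main() {
    let points = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]];
    let (len, edge) = longest_edge(&points);
    println!("{} {:?}", len, edge); // longest edge is [3.0, 0.0], length 3
}
```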

Yeah, I guess it's about knowing the dimension along the various axes. The version I wrote, which allocates and saves the longest edge in every iteration but explicitly indexes [0] and [1], still seems to be the fastest (Playground):

Elapsed time ndarray: 226.857µs
Elapsed time ndarray allocating: 165.475µs
Elapsed time nalgebra: 230.937µs

Then again, I'm not sure about measuring time in the playground, because just reusing the input array v and commenting out the line of code I marked in the playground code yields:

Elapsed time ndarray: 216.636µs
Elapsed time ndarray allocating: 1.817783ms
Elapsed time nalgebra: 226.927µs
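In case it helps, the harness pattern I've settled on for these micro-measurements is roughly this sketch (workload is a hypothetical stand-in for the longest_edge functions; a warm-up pass plus black_box on both input and output keeps the optimizer from eliding the work):

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical workload standing in for longest_edge_*.
fn workload(data: &[f64]) -> f64 {
    data.iter().map(|x| x * x).sum::<f64>().sqrt()
}

fn main() {
    let data: Vec<f64> = (0..5000).map(|i| i as f64).collect();
    // Warm-up pass so the first measured iteration isn't paying
    // one-time costs (page faults, cold caches).
    black_box(workload(black_box(&data)));

    let runs = 100u32;
    let start = Instant::now();
    for _ in 0..runs {
        black_box(workload(black_box(&data)));
    }
    let per_run = start.elapsed() / runs;
    println!("{:?} per run over {} runs", per_run, runs);
}
```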