Very fast initialization of a Vec of Vecs

A crate which follows this thread of "use a single array with fancy indexing" quite a long way down the rabbit hole is ndarray:

use ndarray::Array2;

fn main() {
    let arr = Array2::<f32>::zeros((60, 10));
    println!("{:?}", arr);
}

Playground link

5 Likes

Talk is cheap, let's check the code. Compiler Explorer

It seems the vec![0; 60 * 1000_000_000] takes the specialized fast path using the __rust_alloc_zeroed() function, while vec![[0; 60]; 1000_000_000] fails to take that path and manually initializes each element over 1000_000_000 iterations. We may improve it later, but for now, use ndarray :smiley:
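
To see the gap without reading assembly, here's a rough, scaled-down timing sketch of the two forms (the size is a placeholder, reduced from the original; still needs about 2.4 GB of RAM):

use std::time::Instant;

fn main() {
    // Scaled down from the 1_000_000_000 rows discussed above.
    const N: usize = 10_000_000;

    // Hits the zeroed-allocation fast path: one zeroed allocation, no loop.
    let t = Instant::now();
    let flat: Vec<u16> = vec![0; 60 * N];
    println!("vec![0; 60 * N]:  {:?} ({} elements)", t.elapsed(), flat.len());

    // The post above reports this form missed that path, so each
    // [u16; 60] element gets written out in a loop instead.
    let t = Instant::now();
    let arrays: Vec<[u16; 60]> = vec![[0; 60]; N];
    println!("vec![[0; 60]; N]: {:?} ({} rows)", t.elapsed(), arrays.len());
}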

2 Likes

While creating a flat vec using an index is extremely fast and very memory efficient, one side effect is that the memory seek of a flat Vec is 100 times slower than a Vec of Vecs. I might end up allocating this way at startup while loading the vec of vec memory structure and then switching to that. In that way, I can have fast startup and then fast seek.

I have generally found that the reverse is true: reading from a flat Vec<T> is much faster in real-world use cases than reading from a Vec<Vec<T>>. The former involves just one pointer indirection and possible cache miss, while the latter involves two. The former will also be more cache-friendly than the latter for workloads that read multiple rows in order. It also reduces the number of bounds checks.

3 Likes

That is very surprising to me. Unless you happen to get really unlucky with cache misses, the nested Vec should always be slower or at best about the same speed as the flat one. Can you make a self-contained example that shows such a dramatic speed difference?

1 Like

I quite like ndarray. However, there is some wrapper overhead: it is slightly slower when creating the data structure and slower on reads than a vec of vecs.

Somehow, for reasons I can't explain yet and that must be due to some Rust / LLVM SIMD optimizations, I can loop over each element of every vec in the set of 1B x 60 vecs of vecs in 27ns on a single core, which seems impossible on the surface.

Vecs in Rust in general, are crazy fast; faster than I can replicate in C. Amazing.

Using a flat indexed vec, this slows to 13ms, which is also fast, but slower than vec of vecs.

1 Like

I'm trying to understand this myself using perf, and I don't understand it. The easiest explanation in my mind is a malloc fault in Rust, but I doubt that's the case. Creating the vec of vecs allocates up to 150GB, so it's not small. Maybe Rust is taking advantage of SIMD instructions to do up to 128 ADDs per cycle? That's the only way the math could work, and it could hypothetically allow up to 128B operations / second.

That kind of dramatic difference in the opposite direction from what we'd expect (making N small allocations should be slower than 1 large allocation) is an extremely strong sign that something went wrong with the benchmark, and that you're mostly measuring something unrelated to the data structure choice. Could you show us your exact code and exactly how you're running it and evaluating the times? In particular, are you using a proper benchmarking framework that takes care of warming caches, aggregating thousands of runs, and so on, to avoid drawing conclusions from random noise? (From our perspective reading this thread, that sort of basic methodology error is by far the most likely explanation for what you're seeing.)

6 Likes

Oh, one possible explanation is that vec![0; m*n] is optimized to use calloc, which on Linux for example uses copy-on-write pages to avoid actually zeroing memory until you write to it. This means that you may pay the initialization cost on first access, rather than on creation. Assuming you write to every page, then you should still pay the same cost overall, but it will show up in a different part of your code.
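
One rough way to see this effect (a sketch, not from the thread; exact numbers depend on the OS and allocator) is to time a first pass over a freshly zero-allocated Vec against a second pass over the same memory:

use std::time::Instant;

fn main() {
    // Big enough that the allocator gets fresh zero pages from the OS.
    let v: Vec<u16> = vec![0; 600_000_000];

    // First pass: each page is faulted in (and actually zeroed) on first touch.
    let t = Instant::now();
    let first: u64 = v.iter().map(|&x| x as u64).sum();
    println!("first pass:  {:?} (sum = {})", t.elapsed(), first);

    // Second pass over the same memory: the pages are already resident.
    let t = Instant::now();
    let second: u64 = v.iter().map(|&x| x as u64).sum();
    println!("second pass: {:?} (sum = {})", t.elapsed(), second);
}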

8 Likes

I'm not using proper benchmarking in my code for sure. Here's the test code:

use std::time::Instant;

fn main() {
    // index flat vector using index vs. vec of vecs
    let _now = Instant::now();
    let mut _sequence_matrix3: Vec<u16> = vec![0; 60 * 1000000000];
    let _elapsed = _now.elapsed();
    println!("Initialized flat 1B * 60: {:?}", _elapsed);


    // 1B x 60 loop
    let _now = Instant::now();
    let mut _counter = 0;
    for _i in 0..1000000000 {
        for _j in 0..60 {
            _counter += _sequence_matrix3[_j * 1000000000 + _i];
        }
    }
    let _elapsed = _now.elapsed();
    println!("Looped through 1 Billion conversations (flat vec): {:?}", _elapsed);


    // create 1 billion x 60 vec of vecs
    let _now = Instant::now();
    let mut _sequence_matrix2: Vec<Vec<u16>> = vec![vec![0; 60]; 1000000000];
    let _elapsed = _now.elapsed();
    println!("Loaded 1 Billion sequences into memory: {:?}", _elapsed);


    // loop of 1b sequences single threaded
    let mut _counter = 0;
    let _now = Instant::now();
    for _sequence in _sequence_matrix2.iter_mut() {
        for _turn in _sequence.iter_mut() {
            _counter += *_turn;
        }
    }
    let _elapsed = _now.elapsed();
    println!("Looped through 1 Billion conversations (vec of vecs): {:?}", _elapsed);
}

That tradeoff on fast allocation vs fast read makes sense to me. Thank you.

What mbrubeck wrote wasn't a tradeoff of fast allocation vs fast reads.
It was just improving the allocation time at the cost of the first access of every page when reading.
So when doing many reads, this option should be faster in both allocation and reads.

5 Likes

Well, I checked it again and it seems right. Also, my code has to potentially visit every element (boil the ocean), in which case the index function is inefficient, requiring 2 CPU cycles. Still, I love this index approach for other things. I guess taking the allocation and assignment penalty up front and having faster read times is the tradeoff that is right for what I'm working on.

1 Like

Just to double check again, are you sure your benchmarking is sound?

I tried to compile and run your benchmarks, but unfortunately I couldn't allocate enough memory. After reducing the size by 2 orders of magnitude, I got ~50ns for the vec of vecs, which seems like the compiler has optimized it out entirely. If this could go wrong on my computer, are you sure everything's testing what you think it is on yours?

You're of course free to make any decision you want! But as I would consider it common knowledge that flat vecs perform better than nested vecs, I want to investigate this further.


Besides the benchmarks, I had a question. You said the index function is inefficient requiring 2 CPU cycles - could you explain your reasoning there?

My understanding is that random indexing into both will require multiple CPU cycles. For the flat vec, it needs to do math. But for a nested vec, I would assume it needs to do even more: 1 memory lookup to get the address in memory of the inner vec, and then another extra lookup in distinct RAM (not in the same cache line) to get the actual value.

For iterated non-random lookups, that second memory lookup can be elided. But wouldn't the math in the flat-vec loop be equally elided? I would 100% assume the compiler turns your entire loop into for x in 0..(1000_000_000 * 60) { ... matrix[x] }. It should even elide that index operation, and directly increment the pointer, if it's optimizing correctly.
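
To make that concrete, a random access into each layout looks roughly like this (a sketch with placeholder names, assuming the 60-element rows from this thread):

// Flat layout: index arithmetic plus one bounds check, then a single load.
fn get_flat(flat: &[u16], i: usize, j: usize) -> u16 {
    flat[i * 60 + j]
}

// Nested layout: one load to reach the inner Vec's pointer and length, a second
// bounds check, then another load from a separate heap allocation.
fn get_nested(nested: &[Vec<u16>], i: usize, j: usize) -> u16 {
    nested[i][j]
}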


In the interest of seeing if I could write a benchmark and get different data, I copied your code, but architected it using criterion. This allows us to surround values with black_box to prevent the compiler from completely eliding loops, gives us warmup time to ensure that both tests get a fair, warm CPU, and produces statistics about multiple iterations of each benchmark.
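
A stripped-down sketch of that shape (placeholder names and a reduced size; not the exact code in the repo) might look like this:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

const N: usize = 10_000_000; // reduced from the original 1_000_000_000

fn bench_loops(c: &mut Criterion) {
    let flat: Vec<u16> = vec![0; N * 60];
    c.bench_function("flat vec loop", |b| {
        b.iter(|| {
            // black_box keeps the compiler from optimizing the loop away entirely
            let v = black_box(&flat);
            let mut counter: u64 = 0;
            for i in 0..N {
                for j in 0..60 {
                    // same indexing as the code posted above
                    counter += v[j * N + i] as u64;
                }
            }
            black_box(counter)
        })
    });

    let nested: Vec<Vec<u16>> = vec![vec![0; 60]; N];
    c.bench_function("nested vec loop", |b| {
        b.iter(|| {
            let v = black_box(&nested);
            let mut counter: u64 = 0;
            for row in v {
                for &x in row {
                    counter += x as u64;
                }
            }
            black_box(counter)
        })
    });
}

criterion_group!(benches, bench_loops);
criterion_main!(benches);

This would live under benches/ with criterion as a dev-dependency and harness = false set for the bench target in Cargo.toml.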

I can't run with your original size on my computer, so my benches use a reduced order of magnitude - and that might not be the most relevant. But they could give some insight, and if you want, I think you could also run these benchmarks on your hardware to see if you get different data.

I say slightly more reliable, as microbenchmarks are always wrong to some degree. But without warming up the CPU, and without putting input/output values in a black box to prevent the compiler from optimizing them out, it's easy to get very wrong results from code that looks like it'd obviously provide the correct ones. Benchmarking on modern CPUs is hard, to say the least.

In any case, I've uploaded those criterion benchmarks at GitHub - daboross/vec-nested-vs-flat-benchmarks at 2e2610cf0e21c36ff807e3839c71f355c1cf862b - the actual code at src/benches/vec_benchmarks.rs. Even with the reduced magnitude, it's taking a while to run on my computer, so I'll edit this post when I get my results. And if my results match yours, I'll just have to change my fundamental assumption that flat is better :slight_smile:


Edit: finished my benchmark, and it agrees with your results! Here's my output:

     Running target/release/deps/vec_benchmarks-b47d5f89b8856b0c
Gnuplot not found, using plotters backend
flat vec of 10000000/alloc                                                                           
                        time:   [9.1287 us 9.2967 us 9.5547 us]
flat vec of 10000000/loop                                                                            
                        time:   [362.57 ms 372.24 ms 385.14 ms]
flat vec of 10000000/loop with assert                                                                            
                        time:   [345.59 ms 359.19 ms 368.29 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) high mild

nested vec of 10000000/alloc                                                                           
                        time:   [991.16 ms 1.0137 s 1.0483 s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
nested vec of 10000000/loop                                                                            
                        time:   [157.88 ms 170.09 ms 184.04 ms]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) low mild

I find this super interesting, as it really doesn't agree with my previous assumptions. I thought having everything inline would always give better results, as there'd be more caching alignment and less indirection.

If I understand it correctly, these results also strike out @mbrubeck's idea:

Since the loop benchmarks are run multiple times on the same vec, including warmup runs, it doesn't look like just a cost on first access.

If anyone has any ideas for why this might be faster, I'm all ears. I don't think the difference is just the indexing math, as LLVM should be able to optimize that out, but maybe even with that I'm misunderstanding your statement, eisen?

7 Likes

I would double-check the assembly before making such a statement. LLVM's ability to optimize out array bounds checks is relatively poor, and that has bitten me several times before.

After a quick look, shouldn't the assert be assert!(v.len() >= OUTER_VEC_SIZE * 60); rather than assert!(v.len() >= OUTER_VEC_SIZE); ?
Also, the flat versions seem to iterate the u16s in this order (by index):
0, OUTER_VEC_SIZE, 2*OUTER_VEC_SIZE, ...
In contrast to the nested vecs, which do it in this order:
0, 1, 2, 3, ...
If I'm not mistaken? (Which would also result in much worse cache behavior, I think.)

I will fix this (if it's wrong) and then post my benchmark results.

2 Likes

Yep

That's a really good call - and that's an explicit difference from the nested vecs! This could definitely cause the slowdown; indexing wildly different parts of memory like that causes the kind of slowdown I expected from the nested vecs.

To match the behavior of the nested vecs loop, it should be _i * OUTER_VEC_SIZE + _j, not _j * OUTER_VEC_SIZE + _i. I didn't even realize that when I copied the benchmarking code in.

I've also fixed those in the repository (and edited the links in the previous post to point to the old versions).

Edit:

I don't know enough about all the x86_64 instructions to really tell this, but if anyone else wants to, I've stuck the benchmarking code in godbolt: Compiler Explorer

1 Like

Close but not right. It has to be: i * 60 + j :slight_smile:
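
So a flat loop that visits memory in the same order as the nested one would look roughly like this (a sketch, with the row length pulled out as a constant):

const ROW: usize = 60;

// Walks the flat vec row by row, matching the memory order of the nested
// Vec<Vec<u16>>: consecutive values of j touch adjacent elements.
fn sum_flat_row_major(flat: &[u16], rows: usize) -> u64 {
    let mut counter: u64 = 0;
    for i in 0..rows {
        for j in 0..ROW {
            counter += flat[i * ROW + j] as u64;
        }
    }
    counter
}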

1 Like

I'm apparently not thinking. Thanks!

Another small detail: when allocating the nested Vec you didn't use the constant; I guess the constant was introduced later in the source :stuck_out_tongue:

Small benchmarks are done; I will now start the benchmarks with the correct input size :slight_smile:
I also added some more benchmarks in order to simplify reasoning about what's going on.
The first one is for the flat vec but uses unsafe in order to avoid all bounds checks (to see if the compiler fails to optimize them out).
The second is also for the flat vec but uses a functional approach in order to avoid bounds checks.
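
Roughly, those two variants might look like this (a sketch; the names and the exact code in the repo may differ):

const OUTER_VEC_SIZE: usize = 10_000_000; // placeholder size
const ROW: usize = 60;

// Variant 1: unsafe indexing, skipping bounds checks entirely.
fn sum_flat_unchecked(flat: &[u16]) -> u64 {
    assert!(flat.len() >= OUTER_VEC_SIZE * ROW);
    let mut counter: u64 = 0;
    for i in 0..OUTER_VEC_SIZE {
        for j in 0..ROW {
            // SAFETY: the assert above guarantees i * ROW + j is in bounds.
            counter += unsafe { *flat.get_unchecked(i * ROW + j) } as u64;
        }
    }
    counter
}

// Variant 2: iterator-based ("functional") traversal; slice iterators need no
// per-element bounds checks.
fn sum_flat_iter(flat: &[u16]) -> u64 {
    flat.chunks_exact(ROW)
        .flat_map(|row| row.iter())
        .map(|&x| x as u64)
        .sum()
}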