Very fast initialization of a Vec of Vecs

I'm not using proper benchmarking in my code for sure. Here's the test code:

use std::time::Instant;

fn main() {
    // index flat vector using index vs. vec of vecs
    let _now = Instant::now();
    let mut _sequence_matrix3: Vec<u16> = vec![0; 60 * 1000000000];
    let _elapsed = _now.elapsed();
    println!("Initialized flat 1B * 60: {:?}", _elapsed);


    // 1B x 60 loop
    let _now = Instant::now();
    let mut _counter = 0;
    for _i in 0..1000000000 {
        for _j in 0..60 {
            _counter += _sequence_matrix3[_j * 1000000000 + _i];
        }
    }
    let _elapsed = _now.elapsed();
    println!("Looped through 1 Billion conversations (flat vec): {:?}", _elapsed);


    // create 1 billion x 60 vec of vecs
    let _now = Instant::now();
    let mut _sequence_matrix2: Vec<Vec<u16>> = vec![vec![0; 60]; 1000000000];
    let _elapsed = _now.elapsed();
    println!("Loaded 1 Billion sequences into memory: {:?}", _elapsed);


    // loop of 1b sequences single threaded
    let mut _counter = 0;
    let _now = Instant::now();
    for _sequence in _sequence_matrix2.iter_mut() {
        for _turn in _sequence.iter_mut() {
            _counter += *_turn;
        }
    }
    let _elapsed = _now.elapsed();
    println!("Looped through 1 Billion conversations (vec of vecs): {:?}", _elapsed);
}

That tradeoff on fast allocation vs fast read makes sense to me. Thank you.

What mbrubeck wrote wasn't a tradeoff of fast allocation vs fast reads.
It was just improving the allocation time at the cost of the first access of every page when doing a read.
So when doing many reads, this option should be faster in allocation and reads.

5 Likes

Well I checked it again and it seems right. Also, my code has to potentially visit every element (boil the ocean) in which case the index function is inefficient requiring 2 CPU cycles. Still I love this index approach for other things. I guess taking the allocation and assignment penalty up front and having faster read times is the tradeoff that is right for what I'm working on.

1 Like

Just to double check again, are you sure your benchmarking is sound?

I tried to compile and run your benchmarks, but unfortunately I couldn't allocate enough memory. After reducing the size by 2 orders of magnitude, I got ~50ns for the vec of vecs, which seems like the compiler has optimized it out entirely. If this could go wrong on my computer, are you sure everything's testing what you think it is on yours?

You're of course free to make any decision you want! But as I would consider it common knowledge that flat vecs perform better than nested vecs, I want to investigating this further.


Besides the benchmarks, I had a question. You said the index function is inefficient requiring 2 CPU cycles - could you explain your reasoning there?

My understanding is that random indexing into both will require multiple CPU cycles. For the flat vec, it needs to do math. But for a nested vec, I would assume it needs do even more - 1 memory lookup to get the address in memory of the inner vec, and then another extra lookup in distinct ram (not in the same cache line) to get the actual value.

For iterated non-random lookups, that second memory lookup can be elided. But wouldn't the math in the flat-vec loop be equally elided? I would 100% assume the compiler turns your entire loop into for x in 0..(1000_000_000 * 60) { ... matrix[x] }. It should even elide that index operation, and directly increment the pointer, if it's optimizing correctly.


In the interest of seeing if I could write a benchmark and get different data, I copied your code, but architected it using criterion. This allows us to surround information with a black_box to prevent the compiler from completely elliding loops, gives us warmup time to ensure that both tests get a fair, warm CPU, and produces statistics about multiple iterations of each benchmark.

I can't run with your original size on my computer, so my benches have reduced order of magnitude - and that might not be the most relevant. But they could give some insight, and if you want, I think you could also run these benchmarks on your hardware, to see if you get different data.

I say slightly more reliable, as microbenchmarks are always wrong to some degree. But without warming up the CPU, and without putting input/output values in a black box to prevent the compiler from optimizing them out, it's easy to very wrong results from code that looks like it'd obviously provide the correct ones. Benchmarking on modern CPUs is hard, to say the least.

In any case, I've uploaded those criterion benchmarks at https://github.com/daboross/vec-nested-vs-flat-benchmarks/tree/2e2610cf0e21c36ff807e3839c71f355c1cf862b - the actual code at src/benches/vec_benchmarks.rs. Even with the reduced magnitude, it's taking a while to run on my computer, so I'll edit this post when I get my results. And if my results match yours, I'll just have to change my fundamental assumption that flat is better :slight_smile:


Edit: finished my benchmark, and it agrees with your results! Here's my output:

     Running target/release/deps/vec_benchmarks-b47d5f89b8856b0c
Gnuplot not found, using plotters backend
flat vec of 10000000/alloc                                                                           
                        time:   [9.1287 us 9.2967 us 9.5547 us]
flat vec of 10000000/loop                                                                            
                        time:   [362.57 ms 372.24 ms 385.14 ms]
flat vec of 10000000/loop with assert                                                                            
                        time:   [345.59 ms 359.19 ms 368.29 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) high mild

nested vec of 10000000/alloc                                                                           
                        time:   [991.16 ms 1.0137 s 1.0483 s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
nested vec of 10000000/loop                                                                            
                        time:   [157.88 ms 170.09 ms 184.04 ms]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) low mild

I find this super interesting, as it really doesn't agree with my previous assumptions. I thought having everything inline would always give better results, as there'd be more caching alignment and less indirection.

If I understand it correctly, these results also strikes out @mbrubeck's idea:

Since the loop benchmarks are run multiple times on the same vec, including warmup runs, it doesn't look like just a cost on first access.

If anyone has any ideas for why this might be faster, I'm all ears. I don't think the difference is just the indexing math, as LLVM should be able to optimize that out, but maybe even with that I'm misunderstanding your statement, eisen?

7 Likes

I would double-check the assembly before making such a statement. LLVM's ability to optimize out array bound checks is relatively poor, and that has bitten me several times before.

After a quick look, shouldn't the assert be assert!(v.len() >= OUTER_VEC_SIZE * 60); rather than assert!(v.len() >= OUTER_VEC_SIZE); ?
Also, the flat versions seem to iterate the u16 in this order (index):
0, OUTER_VEC_SIZE, 2*OUTER_VEC_SIZE,...
In contrast to the nested vecs, which do it in this order:
0,1,2,3...
If I'm not mistaken? (Which would also result in much worse cache behavior I think)

I will post my benchmark results after fixing (if its wrong) this and post them

2 Likes

Yep

That's a really good call - and that's an explicit difference from the nested vecs! This could definitely cause the slowdown, indexing really different parts of memory like that causes the kind of slowdown I expected from the nested vecs.

To match the behavior of the nested vecs loop, it should be _i * OUTER_VEC_SIZE + _j, not _j * OUTER_VEC_SIZE + _i. I didn't even realize that when I copied the benchmarking code in.

I've also fixed those in the repository (and edited the links in the previous post to point to old versions).

Edit:

I don't know enough about all the x86_64 instructions to really tell this, but if anyone else wants to, I've stuck the benchmarking code in godbolt: Compiler Explorer

1 Like

Close but not right. It has to be: i * 60 + j :slight_smile:

1 Like

I'm apparently not thinking. Thanks!

Another small detail: When allocating the nested Vec you didn't use the constant, I guess the constant was introduced later in the source :stuck_out_tongue:

Small benchmarks are done, I will now start the benchmarks with the correct input size :slight_smile:
I also added some more benchmarks in order to simplify reasoning about what's going on.
The first one is for the flat vec but uses unsafe in order to avoid all bound checks (to see if the compiler fails to optimize them).
The second is also for the flat vec but uses a functional approach in order to avoid bound checks.

Yep! I added the constant as an attempt to clean up my modification that reduced the vector size - the original benchmark code was directly copied from @eisen's post. The original had 1 billion elements, but I don't have 16GiB ram.

I'm interested to see your benchmark results! I'm rerunning the benchmarks on my computer too, with both the bad and fixed iteration order, and an iterator version.

Well, I have the ram, but I'm not going to benchmark it for two main reasons:

  1. The ten million in your code is already way beyond the size of my CPU cache, so it should be enough to run into cache misses.
  2. Assuming that because of 1. the results are somehow the same, more loop iterations for the benchmarks should be better than longer benchmarks (from a statistics point of view)
1 Like

Here are my benchmark results (shortened a bit to the relevant parts):

     Running target/release/deps/vectest-bdb22a6b9926b5f4

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target/release/deps/vec_benchmarks-f8cf4d7741232e37
flat vec of 10000000/alloc
                        time:   [4.7095 us 4.7128 us 4.7144 us]
flat vec of 10000000/loop
                        time:   [204.42 ms 204.42 ms 204.42 ms]
flat vec of 10000000/loop with assert
                        time:   [204.42 ms 204.42 ms 204.42 ms]
flat vec of 10000000/loop with assert and unchecked
                        time:   [36.688 ms 36.688 ms 36.688 ms]
flat vec of 10000000/loop functional
                        time:   [37.505 ms 37.505 ms 37.505 ms]

nested vec of 10000000/alloc
                        time:   [996.71 ms 996.83 ms 996.94 ms]
nested vec of 10000000/loop
                        time:   [106.13 ms 106.13 ms 106.13 ms]

What we can see:

  1. The allocation of the flat vec is much faster.
  2. The loop does not get optimized better with the assert
  3. the unsafe unchecked version optimizes much better
  4. the functional version is basically as fast as the unsafe version
  5. The version using nested vecs is indeed faster than the obviously not well optimized flat versions
  6. The version using nested vecs is much slower than the optimized flat version (either unsafe or functional)

I would also post the code here, but I think that would violate daboross license (as I would post it in the playground without the license)

5 Likes

Ok, so I was wondering why the loop versions (especially the one with the assert) optimize so poorly.
As it turns out: assert!(v.len() >= ...) does not optimize well, but assert_eq!(v.len(), ...) does optimize nearly as well as the unchecked version (around 48ms, in a much shorter benchmark).

I have no idea what the problem of llvm is optimizing it as i have no knowledge about how it works.
Does someone know if something like this would be considered a bug in llvm?

1 Like

Yup, there are unelided bound checks in all indexing-based versions. Which also prevents these versions from getting the auto-vectorization that the iterator-based version gets, and that in turn makes a big difference in performance (rustc uses SSE vectorization by default, which allows processing 8x u16 integers per CPU instruction, leading to a 8x speedup in the best-case scenario).

For the record, here's a basic summary of how I check the assembly for those issues.
  • Panic conditions are grouped at the end of a function's assembly by rustc, after the return instruction (ret on x86), so I always look there first.
  • If you see a call (call instruction) to a function called panic_bounds_check in this part of the assembly, it means that you have unelided indexing bound checks around.
    • Another one you may want to watch out for is slice_index_len_fail, which occurs when the compiler fails to prove that a slicing expression stuff[i..j] is always in bounds.
  • Now you can look at the rest of the assembly. Every place where you see a conditional jump to the bound check panic that has just been located in the previous step, is a location where bound checks are ongoing in the assembly.
    • Conditional jump instructions can be located by leveraging the fact that their name starts with j, and that they are usually preceded by a comparison (cmp instruction).
  • If those jumps happens in the middle of a loop, the bound check happens on every iteration of a loop, and this is where you have a potential performance problem to take care of.
    • I locate loops in assembly by leveraging the fact they are pretty much the only circumstances under which a compiler will generate an assembly (conditional) jump that goes backwards in the function's code.
  • If I am lucky enough to also have a version of the code which shouldn't have bound checks (e.g. unchecked version) to compare with, I can also use the opportunity to check if the indexing-based version underwent the same kind of vectorization as the reference version.
    • The beginner's way to spot x86 vectorization is to look for bunches of assembly instructions with unusually complicated names (5+ letters). Of course, there are also the Intel and AMD manuals if you want a more rigorous instruction set reference.

After fixing the loop_flat_explicit_size_assert_fixed with the correct indexing expression (_i * 60 + _j), the bound checks remain.

But if I replace your assert!(v.len() >= OUTER_VEC_SIZE * 60); by an almost semantically equivalent let v = &v[0..OUTER_VEC_SIZE*60]; indexing trick, the bound check goes away and the generated code is as good as in the iterator-based version.

This is what I was referring to when I said that the bound check elisions of rustc (which AFAIK are mostly taken care of by LLVM) are very, very fragile. They only work in very special cases, and it takes very little code perturbations to break them.

11 Likes

Yes I wasn't using proper benchmarks but will try your code tonight with proper critierion measurements. Thank you for doing that.

I've tried it both with and without --release flag using default optimization level (3). The difference is replicated in both optimized and unoptimized regardless of how small or large I make the vector. A vector size of 3 billion eats up a ridiculous amount of memory (over 300GB), but then performs almost as well as a vector size of 1 billion once the vector has been initialized.

I think that a rational explanation for an LLVM optimized build (ala --release) can be explained by the loop vectorization optimization. It doesn't explain the unoptimized behavior however. There is some Rust/LLVM magic here that I can't explain and the more I stare at these results, the more I am suspect.

Now that others can replicate this behavior, I think I have to try to see where I can break this. Perhaps there is an optimization related to the value of every vector being the same and it's actually NOT doing the operation when it sees the same calculation? I will try building randomized vectors tonight and see what that does.

No problem!

@raidwas improved much further on the benchmarks I made - I've pushed their changes so if you grab the lastest it should have a few more interesting cases (their exact version is here, version with fixed asserts is here).

I look forward to it!

It's worth noting that while I replicated the behavior, @raidwas's and @HadrienG both found a flaw in both our benchmarks. Specifically, the manual non-iterator loop introduces bounds checking overhead, and that significantly slows down the flat-vec loop. Since the benchmark for the nested vec use iterators, they don't have any bounds check, but the non-nested one does.

@raidwas sent me their benchmark code, which I've posted on the github (see link above). This is the functional, non-loop-based flat vec method from that code:


fn loop_flat_functional(v: &Vec<u16>) {
    let _counter = v
        .chunks_exact(60)
        .fold(0u16, |acc, chunk| chunk.iter().sum::<u16>() + acc);
    black_box(_counter);
}

which is, from their benchmarks above, about 3 times faster than both the loop-based flat-vec and nested vec loops.

If/when you conduct further benchmarking, I highly recommend starting from @raidwas's version rather than my initial one :slight_smile:

3 Likes

Ok, I saw that and was interesting. Based on that, I would expect that both approaches should be relatively similar but will see.

1 Like

OK here are my results for a 100M x 60 matrix:

                        time:   [9.8032 us 9.9039 us 10.033 us]
                        change: [+0.9201% +1.7466% +2.8092%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
flat vec of 100000000/loop                                                                           
                        time:   [2.6696 s 2.6729 s 2.6764 s]
                        change: [+0.2982% +0.5109% +0.7376%] (p = 0.00 < 0.05)
                        Change within noise threshold.
flat vec of 100000000/loop with assert                                                                           
                        time:   [725.18 ms 725.74 ms 726.24 ms]
                        change: [+0.4857% +0.6252% +0.7773%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking flat vec of 100000000/loop with assert and unchecked: Collecting 10 samples in estimated 418.54 s (990 itera                                                                                                                         flat vec of 100000000/loop with assert and unchecked                        
                        time:   [422.76 ms 423.05 ms 423.35 ms]
                        change: [+0.1470% +0.3246% +0.5103%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
flat vec of 100000000/loop functional                                                                           
                        time:   [424.00 ms 424.34 ms 424.71 ms]
                        change: [+0.2020% +0.4208% +0.6468%] (p = 0.00 < 0.05)
                        Change within noise threshold.

Benchmarking nested vec of 100000000/alloc: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 400.0s. You may wish to increase target time to 446.9s.
nested vec of 100000000/alloc                                                                          
                        time:   [6.8803 s 6.9322 s 6.9864 s]
                        change: [-14.064% -8.0087% -0.4160%] (p = 0.05 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
nested vec of 100000000/loop                                                                           
                        time:   [1.2310 s 1.2315 s 1.2319 s]
                        change: [-1.6667% -1.2868% -0.8667%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

There's no question that a flat vector with assert and unchecked is faster than a vec of vecs in the benchmark test.

I took the benchmark code to write a small app that also included randomized numbers to compare the flat vector with assert and unchecked with a vec of vecs, just to make sure there isn't some LLVM optimization that optimizes for same calculations. It did not change the results.

I know it's not a proper benchmark, but once I build with --release with opt-level= 3, the loop effectively runs equally fast when comparing the flat vector to a vec of vecs. However, the flat vec memory allocation is definitely faster and more memory efficient; so faster overall.

Thanks again to everyone for your comments, code, and time. I'm really enjoying my Rust learning journey. All of you in the community continue to make it a great process.

See below the test code I mentioned above:

use rand;

const OUTER_VEC_SIZE: usize = 10_000_000;

fn main() {
    println!("Beginning test.");
    //test flat vec test
    let flat_vec = alloc_flat();
    loop_flat_explicit_size_assert_unchecked(&flat_vec);

    // test vec of vecs
    let vec_of_vecs = alloc_nested();
    loop_nested(&vec_of_vecs);

    println!("Done.");
}

fn alloc_flat() -> Vec<u16> {
    let _now = Instant::now();
    let mut _this_vec: Vec<u16> = vec![0; 60 * OUTER_VEC_SIZE];
    for _i in 0.._this_vec.capacity() {
        _this_vec[_i] = rand::random();
    }
    let _elapsed = _now.elapsed();
    println!("Allocated flat vec of size {:?} in {:?}", OUTER_VEC_SIZE, _elapsed);
    _this_vec
}

fn alloc_nested() -> Vec<Vec<u16>> {
    let _now = Instant::now();
    let mut _this_vec: Vec<Vec<u16>> = vec![vec![0; 60]; OUTER_VEC_SIZE];
    for _i in 0..OUTER_VEC_SIZE {
        for _j in 0..60 {
            _this_vec[_i][_j] = rand::random();
        }
    }
    let _elapsed = _now.elapsed();
    println!("Allocated vec of vecs of size {:?} in {:?}", OUTER_VEC_SIZE, _elapsed);
    _this_vec
}

fn loop_flat_explicit_size_assert_unchecked(v: &Vec<u16>) {
    assert_eq!(v.len(), OUTER_VEC_SIZE * 60);
    let _now = Instant::now();
    let mut _counter = 0;
    for i in 0..OUTER_VEC_SIZE {
        for j in 0..60 {
            _counter += unsafe { v.get_unchecked(i * 60 + j) };
        }
    }
    let _elapsed = _now.elapsed();
    println!("Iterated over {:?} sequences (flat vec): {:?}", OUTER_VEC_SIZE, _elapsed);
}

fn loop_nested(v: &Vec<Vec<u16>>) {
    let _now = Instant::now();
    let mut _counter = 0;
    for _sequence in v.iter() {
        for _turn in _sequence.iter() {
            _counter += *_turn;
        }
    }
    let _elapsed = _now.elapsed();
    println!("Iterated over {:?} sequences (vec of vecs): {:?}", OUTER_VEC_SIZE, _elapsed);
}
2 Likes