Just to double check again, are you sure your benchmarking is sound?
I tried to compile and run your benchmarks, but unfortunately I couldn't allocate enough memory. After reducing the size by 2 orders of magnitude, I got ~50ns for the vec of vecs, which seems like the compiler has optimized it out entirely. If this could go wrong on my computer, are you sure everything's testing what you think it is on yours?
You're of course free to make any decision you want! But as I would consider it common knowledge that flat vecs perform better than nested vecs, I want to investigate this further.
Besides the benchmarks, I had a question. You said the index function is inefficient, requiring 2 CPU cycles - could you explain your reasoning there?
My understanding is that random indexing into both will require multiple CPU cycles. For the flat vec, it needs to do math. But for a nested vec, I would assume it needs to do even more - one memory lookup to get the address of the inner vec, and then another, extra lookup in distinct RAM (not in the same cache line) to get the actual value.
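To make that concrete, here's a sketch of the two access patterns (hypothetical helper names and sizes, not the benchmark's actual code): the flat access is one multiply-add followed by a single load, while the nested access has to load the inner `Vec`'s header first before it can load the element.

```rust
// Hypothetical illustration of flat vs nested random access.
fn flat_get(matrix: &[u64], width: usize, x: usize, y: usize) -> u64 {
    // one multiply + one add, then a single memory load
    matrix[y * width + x]
}

fn nested_get(matrix: &[Vec<u64>], x: usize, y: usize) -> u64 {
    // first load: the inner Vec's (pointer, len, capacity) header,
    // second load: the element itself, usually in a different cache line
    matrix[y][x]
}

fn main() {
    let width = 4;
    let flat: Vec<u64> = (0..12).collect();
    // build the nested form from the same data, row by row
    let nested: Vec<Vec<u64>> = flat.chunks(width).map(|c| c.to_vec()).collect();
    assert_eq!(flat_get(&flat, width, 2, 1), nested_get(&nested, 2, 1));
}
```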
For iterated non-random lookups, that second memory lookup can be elided. But wouldn't the math in the flat-vec loop be equally elided? I would 100% assume the compiler turns your entire loop into `for x in 0..(1_000_000_000 * 60) { ... matrix[x] }`, and it should even elide that index operation and directly increment the pointer, if it's optimizing correctly.
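The pointer-increment form can also be written explicitly with iterators; the two functions below (a sketch, not the benchmark code) should compile to essentially the same straight pointer walk once the optimizer runs.

```rust
fn sum_indexed(matrix: &[u64]) -> u64 {
    let mut total = 0;
    // the optimizer should turn this bounds-checked indexing
    // into a plain pointer increment over the slice
    for x in 0..matrix.len() {
        total += matrix[x];
    }
    total
}

fn sum_iter(matrix: &[u64]) -> u64 {
    // iterator form: no index math written at all, just pointer increments
    matrix.iter().sum()
}

fn main() {
    let m: Vec<u64> = (0..1_000).collect();
    assert_eq!(sum_indexed(&m), sum_iter(&m)); // both: 499500
}
```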
In the interest of seeing if I could write a benchmark and get different data, I copied your code, but architected it using criterion. This lets us wrap inputs and outputs in a `black_box` to prevent the compiler from completely eliding loops, gives us warmup time to ensure that both tests get a fair, warm CPU, and produces statistics across multiple iterations of each benchmark.
I can't run with your original size on my computer, so my benches are reduced by an order of magnitude - and that might not be the most relevant. But they could give some insight, and slightly more reliable data; if you want, I think you could also run these benchmarks on your hardware, to see if you get different results. I say only slightly more reliable, as microbenchmarks are always wrong to some degree. Without warming up the CPU, and without putting input/output values in a black box to prevent the compiler from optimizing them out, it's easy to get very wrong results from code that looks like it would obviously produce correct ones. Benchmarking on modern CPUs is hard, to say the least.
In any case, I've uploaded those criterion benchmarks to GitHub - daboross/vec-nested-vs-flat-benchmarks at 2e2610cf0e21c36ff807e3839c71f355c1cf862b - with the actual code in src/benches/vec_benchmarks.rs. Even with the reduced magnitude, it's taking a while to run on my computer, so I'll edit this post when I get my results. And if my results match yours, I'll just have to change my fundamental assumption that flat is better.
Edit: I finished my benchmark, and it agrees with your results! Here's my output:
```
Running target/release/deps/vec_benchmarks-b47d5f89b8856b0c
Gnuplot not found, using plotters backend
flat vec of 10000000/alloc
                        time:   [9.1287 us 9.2967 us 9.5547 us]
flat vec of 10000000/loop
                        time:   [362.57 ms 372.24 ms 385.14 ms]
flat vec of 10000000/loop with assert
                        time:   [345.59 ms 359.19 ms 368.29 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) high mild
nested vec of 10000000/alloc
                        time:   [991.16 ms 1.0137 s 1.0483 s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
nested vec of 10000000/loop
                        time:   [157.88 ms 170.09 ms 184.04 ms]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) low mild
```
I find this super interesting, as it really doesn't agree with my previous assumptions. I thought having everything inline would always give better results, since there'd be better cache locality and less indirection.
If I understand it correctly, these results also strike out @mbrubeck's idea:
Since the loop benchmarks are run multiple times on the same vec, including warmup runs, it doesn't look like just a cost on first access.
If anyone has any ideas for why this might be faster, I'm all ears. I don't think the difference is just the indexing math, as LLVM should be able to optimize that out - but then again, maybe I'm misunderstanding your statement, eisen?