I'm looking for a faster way to create a large vector in memory. I'm not sure that it is possible but I wanted to see if anyone has recommendations. For example, the below takes more than 60s. That's not surprising considering the size; however, if there is an optimization, I sure would love to use it.
let mut _sequence_matrix2: Vec<Vec<u16>> = vec![vec![0; 60]; 1_000_000_000];
The Vec of Vecs created in that code would allocate 1,000,000,000 separate Vecs on the heap, which would be incredibly slow.
I would recommend flattening to a one dimensional Vec, then calculating the index manually. For instance:
let mut _sequence_matrix2: Vec<u16> = vec![0; 60 * 1_000_000_000];
// To access row `3`, column `2`, the index is 3 * number_of_columns + 2
let elem = _sequence_matrix2[3 * 60 + 2];
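The index arithmetic is easy to get wrong when it's scattered around the code, so a thin wrapper type can centralize it. Here's a minimal sketch (the `Matrix` name and its methods are my own, not anything from the standard library), scaled down so it runs quickly:

```rust
// Hypothetical wrapper around a flat Vec with row-major 2-D indexing.
struct Matrix {
    data: Vec<u16>,
    cols: usize,
}

impl Matrix {
    fn new(rows: usize, cols: usize) -> Self {
        // One contiguous zeroed allocation instead of `rows` separate Vecs.
        Matrix { data: vec![0; rows * cols], cols }
    }

    // Element (row, col) lives at row * cols + col.
    fn get(&self, row: usize, col: usize) -> u16 {
        self.data[row * self.cols + col]
    }

    fn set(&mut self, row: usize, col: usize, value: u16) {
        self.data[row * self.cols + col] = value;
    }
}

fn main() {
    let mut m = Matrix::new(1_000, 60); // much smaller than 1B rows, for demo purposes
    m.set(3, 2, 42);
    assert_eq!(m.get(3, 2), 42);
}
```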
vec! already does exactly this! When you call vec![x; N], it will use the given N to allocate the correct size initially. I think reimplementing that functionality would just make the code more verbose without much benefit.
If you do need a jagged array, but can tolerate slightly higher base memory usage, and don't need to do all the allocation upfront, you might create a Vec<Cow<'static, [u16]>> instead.
let mut _matrix: Vec<Cow<'static, [u16]>> = vec![Cow::Borrowed(&[0; 60]); 1_000_000_000];
Be aware that if you end up populating all the "rows" anyway, this will both end up taking more memory and, most likely, be slower than a Vec of Vecs. But it came to mind so I figured I'd just mention it.
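To make the trade-off concrete, here's a runnable sketch of the idea (scaled down to 1,000 rows): every row starts as a cheap borrow of one shared zeroed array, and a row is only heap-allocated when you first mutate it via `to_mut()`.

```rust
use std::borrow::Cow;

fn main() {
    // One shared zeroed row; no per-row heap allocation up front.
    static ZEROS: [u16; 60] = [0; 60];
    let mut matrix: Vec<Cow<'static, [u16]>> = vec![Cow::Borrowed(&ZEROS); 1_000];

    // Writing to a row clones it into an owned Vec on first mutation.
    matrix[3].to_mut()[2] = 42;

    assert!(matches!(matrix[3], Cow::Owned(_)));   // this row now owns its data
    assert!(matches!(matrix[4], Cow::Borrowed(_))); // untouched rows stay borrowed
    assert_eq!(matrix[3][2], 42);
}
```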
175 microseconds with this approach. Wow. Thank you. It's not clear how I would use the wrapper you provided but I will spend some time understanding this gift so that I can use it. This is so fast that I will happily abandon the vec of vecs approach. Thanks again and everyone else for your responses. I really appreciate everyone's time to help.
It seems that vec![0; 60 * 1_000_000_000] takes the specialized fast path using the __rust_alloc_zeroed() function, while vec![vec![0; 60]; 1_000_000_000] fails to take that path and manually initializes each element over 1,000,000,000 iterations. We may improve this later, but for now, use ndarray.
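The difference is easy to observe directly. The sketch below (scaled down to 1,000,000 rows so it finishes quickly) times both forms; on a typical system the flat zeroed allocation returns far faster, since the nested form performs one heap allocation per row:

```rust
use std::time::Instant;

fn main() {
    const ROWS: usize = 1_000_000; // scaled down from 1_000_000_000

    // Fast path: a single zeroed allocation for the whole buffer.
    let t = Instant::now();
    let flat: Vec<u16> = vec![0; 60 * ROWS];
    println!("flat:   {:?}", t.elapsed());

    // Slow path: each inner Vec is a separate heap allocation,
    // initialized one row at a time in a loop.
    let t = Instant::now();
    let nested: Vec<Vec<u16>> = vec![vec![0; 60]; ROWS];
    println!("nested: {:?}", t.elapsed());

    // Both hold the same number of elements.
    assert_eq!(flat.len(), nested.len() * 60);
}
```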
While creating a flat vec using an index is extremely fast and very memory efficient, one side effect is that the memory seek of a flat Vec is 100 times slower than a Vec of Vecs. I might end up allocating this way at startup while loading the vec of vec memory structure and then switching to that. In that way, I can have fast startup and then fast seek.
I have generally found that the reverse is true: Reading from a flat Vec<T> is much faster in real-world use cases than reading from a Vec<Vec<T>>. The former involves just one pointer indirection and possible cache miss, while the latter involves two. The former will also be more cache-friendly than the latter, for workloads that read multiple rows in order. It also reduces the number of bounds checks.
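The two access patterns can be compared with a pair of simple summation functions. This is a sketch, not a rigorous benchmark; the point is the shape of the memory traffic each one generates:

```rust
// Two indirections per row: follow the outer slice, then each
// inner Vec's own heap pointer.
fn sum_nested(m: &[Vec<u16>]) -> u64 {
    m.iter().flat_map(|row| row.iter()).map(|&x| x as u64).sum()
}

// One contiguous buffer: a single linear scan, friendly to the
// cache and to autovectorization.
fn sum_flat(m: &[u16]) -> u64 {
    m.iter().map(|&x| x as u64).sum()
}

fn main() {
    let (rows, cols) = (1_000, 60);
    let nested: Vec<Vec<u16>> = vec![vec![1; cols]; rows];
    let flat: Vec<u16> = vec![1; rows * cols];
    // Same data, same result; only the layout differs.
    assert_eq!(sum_nested(&nested), sum_flat(&flat));
}
```

Wrapping these in a proper benchmark harness (criterion or similar) is the reliable way to see the difference, since a naive timing loop is easily dominated by noise.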
That is very surprising to me. Unless you happen to get really unlucky with cache misses, the nested Vec should always be slower or at best about the same speed as the flat one. Can you make a self-contained example that shows such a dramatic speed difference?
I quite like ndarray. However, there is some wrapper overhead: it is slightly slower when creating the data structure, and slower on reads than a vec of vecs.
Somehow, for reasons I can't explain yet and which must be due to some Rust / LLVM SIMD optimizations, I can loop over each element of every vec in the set of 1B x 60 vecs of vecs in 27 ns on a single core, which seems impossible on the surface.
Vecs in Rust, in general, are crazy fast; faster than I can replicate in C. Amazing.
Using a flat indexed vec, this slows to 13 ms, which is also fast, but slower than the vec of vecs.
I'm trying to understand this myself using perf. I don't understand it. The easiest explanation in my mind is a malloc fault in Rust, but I doubt that's the case. Creating the vec of vecs allocates up to 150GB so it's not small. Maybe Rust is taking advantage of SIMD instructions to do up to 128 ADDs per cycle? That's the only way the math could work, which could allow up to 128B operations / second hypothetically.
That kind of dramatic difference in the opposite direction we'd expect (making N small allocations should be slower than 1 large allocation) is an extremely strong sign that something went wrong with the benchmark, and you're mostly measuring something unrelated to the data structure choice. Could you show us your exact code and exactly how you're running it and evaluating the times? In particular, are you using a proper benchmarking framework that takes care of warming caches and aggregating thousands of runs and so on to avoid drawing conclusions from random noise? (from our perspective reading this thread, that sort of basic methodology error is by far the most likely explanation for what you're seeing)
Oh, one possible explanation is that vec![0; m*n] is optimized to use calloc, which on Linux for example uses copy-on-write pages to avoid actually zeroing memory until you write to it. This means that you may pay the initialization cost on first access, rather than on creation. Assuming you write to every page, then you should still pay the same cost overall, but it will show up in a different part of your code.
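That deferral is observable with a rough sketch like the one below (scaled down to about 120 MB so it runs quickly). On a Linux system with this optimization, the "allocation" typically returns almost immediately, and the cost shows up in the first pass that actually writes each page:

```rust
use std::time::Instant;

fn main() {
    const N: usize = 60 * 1_000_000; // ~120 MB of u16 zeros, scaled down

    // Likely served by calloc: the kernel can hand back copy-on-write
    // zero pages, so this usually returns almost immediately.
    let t = Instant::now();
    let mut v: Vec<u16> = vec![0; N];
    println!("alloc:       {:?}", t.elapsed());

    // The real cost appears on first write, when each page is faulted in.
    let t = Instant::now();
    for x in v.iter_mut() {
        *x = 1;
    }
    println!("first touch: {:?}", t.elapsed());

    assert!(v.iter().all(|&x| x == 1));
}
```

The exact numbers depend on the allocator and the OS, so treat this as an illustration of where the cost lands rather than a benchmark.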