So… I want to load a very big dataset into memory, which might even fit, but here is the catch: the default behavior of Vec<_> is to double its capacity whenever its allocation limit is reached. My program got aborted when the vector tried to grow to 10GB, which is just over the size of my RAM…
… but the vector itself was only 5GB, with room to spare for the rest of the dataset (even with an OS and a database running). This doubling behavior of Vec<_> is completely undesirable for my use case.
How can I solve this problem? I need to iterate over this vector linearly many times, avoiding cache misses at all costs.
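For anyone hitting the same wall: if you can estimate the row count up front, you can sidestep the doubling policy entirely with `Vec::with_capacity` and `reserve_exact`. A minimal sketch (the `estimated_rows` value is hypothetical; the allocator may still round requests up, but there is no doubling):

```rust
fn main() {
    // Assumption: an upper bound on the row count is known or estimated.
    let estimated_rows = 1_000_000;
    let mut data: Vec<u64> = Vec::with_capacity(estimated_rows);

    // Pushing up to the pre-allocated capacity never reallocates.
    for i in 0..estimated_rows as u64 {
        data.push(i);
    }
    assert!(data.capacity() >= estimated_rows);

    // If a few more rows arrive, `reserve_exact` asks for only the extra
    // space needed, rather than doubling the whole 5GB allocation.
    data.reserve_exact(10);
    assert!(data.capacity() >= estimated_rows + 10);
    println!("len = {}, capacity = {}", data.len(), data.capacity());
}
```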
It looks like the playground claims 8GB of total RAM (I’m guessing this is some container behind the scenes, I didn’t check). overcommit_memory is set to 0, which means the kernel uses a heuristic to reject an “obvious overcommit” - I guess this falls into that category.
If overcommit is not entirely disabled, one could mmap with MAP_NORESERVE to bypass swap reservation (this is basically like manually opting into overcommit mode 1), but (naturally) libc’s malloc implementation does not use that flag. You can see this in play via this playground: if you remove MAP_NORESERVE, the mmap fails (the printed value is -1; I didn’t bother checking errno).
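A dependency-free sketch of that experiment, assuming Linux on x86_64 (the constants and symbols are declared inline rather than pulled from the `libc` crate, and the 64 GiB figure is just an example of asking for more address space than RAM):

```rust
use std::os::raw::{c_int, c_void};

extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: c_int, flags: c_int,
            fd: c_int, offset: i64) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

// Linux x86_64 values, declared here to keep the sketch self-contained.
const PROT_READ: c_int = 0x1;
const PROT_WRITE: c_int = 0x2;
const MAP_PRIVATE: c_int = 0x02;
const MAP_ANONYMOUS: c_int = 0x20;
const MAP_NORESERVE: c_int = 0x4000;
const MAP_FAILED: isize = -1;

fn main() {
    // Ask for 64 GiB of address space. With MAP_NORESERVE the kernel hands
    // out virtual pages without charging them against swap, so this can
    // succeed on a machine with far less RAM (unless overcommit is fully
    // disabled, i.e. overcommit_memory = 2). Drop the flag and the heuristic
    // in mode 0 will likely reject it as an "obvious overcommit".
    let len = 64usize << 30;
    let ptr = unsafe {
        mmap(std::ptr::null_mut(), len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0)
    };
    if ptr as isize == MAP_FAILED {
        eprintln!("mmap failed");
    } else {
        println!("mapped {} bytes at {:p}", len, ptr);
        unsafe { munmap(ptr, len) };
    }
}
```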
Can the DB not return a rowcount? Or is the DB itself streaming/cursoring?
This is likely not a helpful answer, but you may want to reevaluate your design. If you’re buffering all rows from such queries, and I assume the result set isn’t somehow limited to 5GB, you’re likely setting yourself up for future failure if the dataset size continues to grow.
One option might be to stream the data to a (temp) file (assuming you’ll have more disk space than memory), and then mmap that file for processing (I’m assuming you’re doing readonly operations on the data?). This assumes that you cannot actually stream data from the DB in chunks and need to materialize all data before actually doing anything.
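The spill-to-file half of that can be sketched with just the standard library (the mmap step would need libc or a crate such as memmap2; `fetch_batch` below is a hypothetical stand-in for one cursor fetch from the DB driver, filled with dummy rows):

```rust
use std::fs::File;
use std::io::{BufReader, BufWriter, Read, Seek, SeekFrom, Write};

// Hypothetical cursor: three batches of dummy rows, then the cursor drains.
fn fetch_batch(batch: usize) -> Option<Vec<u64>> {
    if batch < 3 {
        Some((0..4).map(|i| (batch * 4 + i) as u64).collect())
    } else {
        None
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("spill.bin");

    // Stream fixed-size records to a temp file while cursoring the DB, so
    // nothing is ever fully materialized in memory.
    {
        let mut w = BufWriter::new(File::create(&path)?);
        let mut batch = 0;
        while let Some(rows) = fetch_batch(batch) {
            for row in rows {
                w.write_all(&row.to_le_bytes())?;
            }
            batch += 1;
        }
        w.flush()?;
    }

    // Each pass is a plain sequential read; while the file still fits in
    // memory, the page cache makes repeated passes cheap.
    let mut file = BufReader::new(File::open(&path)?);
    for _pass in 0..2 {
        file.seek(SeekFrom::Start(0))?;
        let mut buf = [0u8; 8];
        let mut sum = 0u64;
        while file.read_exact(&mut buf).is_ok() {
            sum += u64::from_le_bytes(buf);
        }
        println!("sum = {}", sum); // rows are 0..=11, so this prints 66
    }
    std::fs::remove_file(&path)?;
    Ok(())
}
```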
Db is (at the moment) cursoring. I maybe could try to estimate the number of rows, although that would be tricky. However, after reading through the answers, it is apparent that I will indeed need to reevaluate my design.
Yes, my dataset keeps growing every day (and so do my ambitions to process all of it). Thanks @vitalyd, that is a sensible answer, although I will still have a look at @kornel’s Vec<Vec<_>> solution, which I found quite elegant, too.
you could try (as an experiment) allocating a large chunk with Vec<Vec<_>>, freeing that, and then allocating a large Vec<_> of the same total capacity
with Vec<Vec<_>>, you could take each cursored batch as the size of an inner Vec for easy iteration on the db read side as well
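to make that concrete, a rough sketch (the batch contents are dummy values standing in for one cursor fetch each): no single allocation ever grows past a batch, and `iter().flatten()` still gives in-order linear iteration

```rust
fn main() {
    // Each cursored batch lands in its own inner Vec, so the only thing that
    // ever reallocates is the small outer Vec of pointers.
    let batch_size = 1000u64;
    let mut chunks: Vec<Vec<u64>> = Vec::new();

    for b in 0..5u64 {
        // Stand-in for one cursor fetch from the DB.
        let batch: Vec<u64> = (b * batch_size..(b + 1) * batch_size).collect();
        chunks.push(batch);
    }

    // Linear iteration over all rows, in order, without flattening into one
    // giant contiguous allocation.
    let total: u64 = chunks.iter().flatten().sum();
    assert_eq!(total, (0..5 * batch_size).sum());
    println!("rows = {}", chunks.iter().map(Vec::len).sum::<usize>());
}
```

iteration stays cache-friendly within each inner Vec; you only pay a pointer chase at batch boundaries, which is negligible for batches of any reasonable size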
do you really need them all in memory at once? Can you iterate over each batch many times, or does each iteration depend on having completed a previous full iteration? Often the need for “everything in memory at once” comes from a random access pattern, whereas yours is linear.
And if your input dataset will continue to grow, what will you do when the total amount of data exceeds all available memory in your computer? At that point you will have to introduce an incremental processing strategy. So why not do that now, when you’re only half-way to exceeding your computer’s memory capacity, since you will be forced to do it eventually?
Repeated linear scans over this data will most likely result in the VM system paging one page at a time into a working set much smaller than the file. Maybe they’ll remain long enough in the buffer cache for the next pass, but this is still more expensive than the original concern of a cpu cache miss “at all costs”. Certainly, once the dataset becomes larger than the available memory, you’ll be paging off disk constantly.
This is certainly true to a large extent. I don’t think talking about cache misses as a major concern is really appropriate given the situation (i.e. loading GBs worth of data, possibly exceeding physical memory capacity, from a DB).
But, if accessing the mapping linearly, there should be some amount of prefetching going on by the kernel (and this can be further hinted via an madvise(MADV_SEQUENTIAL) call). If it’s a fast disk and processing is “slow enough” then it’s possible for the prefetch to hide (some of) the latency.
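A dependency-free sketch of that hint, assuming Linux (symbols and constant values declared inline; an anonymous mapping stands in for the file mapping you’d actually have):

```rust
use std::os::raw::{c_int, c_void};

extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: c_int, flags: c_int,
            fd: c_int, offset: i64) -> *mut c_void;
    fn madvise(addr: *mut c_void, len: usize, advice: c_int) -> c_int;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

// Linux values, declared here to keep the sketch self-contained.
const PROT_READ: c_int = 0x1;
const MAP_PRIVATE: c_int = 0x02;
const MAP_ANONYMOUS: c_int = 0x20;
const MADV_SEQUENTIAL: c_int = 2;

fn main() {
    let len = 16usize << 20; // 16 MiB; a real file mapping would go here
    unsafe {
        let ptr = mmap(std::ptr::null_mut(), len, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert_ne!(ptr as isize, -1, "mmap failed");
        // Tell the VM system we'll scan this range front-to-back, so it can
        // read ahead more aggressively and drop pages behind the scan.
        let rc = madvise(ptr, len, MADV_SEQUENTIAL);
        assert_eq!(rc, 0, "madvise failed");
        munmap(ptr, len);
    }
}
```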
Of course, this is all speculative/general and @tokahuke should do some benchmarks with different strategies to see what works best for their specific workload. IMO, the most pressing issue to address would be the need for materializing the full dataset on the client - that’s not going to scale, and cache misses will be the least of their concerns.