Want to create HUGE Vec<_> that fits half the memory: got failAborted

tokahuke · January 30, 2019, 9:27pm

So.... I want to fit a very big dataset in memory, which might even fit, but here is the catch: the default behavior of Vec<_> is to double its size when its allocation limit is reached. I got my program aborted when it tried to grow to 10GB, which is just over the size of my RAM...

... but the vector was only 5GB, with room to spare for the rest of the dataset (even with a OS and a database running). This doubling property of Vec<_> is compeletely undesirable for my usecase.

How can I solve this problem? I need to iterate over this vector linearly many times, avoiding cache misses at all costs.

naim · January 30, 2019, 9:30pm

If you know the size ahead of time, you may want to try and call with_capacity, or reserve.

tokahuke · January 30, 2019, 9:35pm

No, I don't have that. It is a result set of a query, which I stream from the DB. I would be looking for a solution like a crate, a data structure.

cuviper · January 30, 2019, 9:35pm

I guess you're not on Linux? Overcommit would usually allow this until you actually access all of those memory pages.

tokahuke · January 30, 2019, 9:37pm

Well, I actually am on Linux (Ubuntu 16 Server). And no, I have no heard of overcommit yet.

... so... my understanding of the cause of my failAborted is after all wrong!?

naim · January 30, 2019, 10:05pm

LinkedList might be what you're looking for, it's less efficient than Vec though.

A thin wrapper around Vec might be good too, here is a small mock up that allows you too reserve space at your own instead of depending on Vec's implementation (reserve memory before Vec does).

OptimisticPeach · January 30, 2019, 10:24pm

But in that case, wouldn't it be inefficient given a small enough T? Every T would have a pointer to the next, bloating it alot for a large dataset with small components.

cuviper · January 30, 2019, 10:31pm

Oh, perhaps I'm wrong about how exactly overcommit works here. For instance on the playground, I can't allocate a single 40GB vector, but I can allocate a thousand 4GB vectors.

kornel · January 30, 2019, 11:10pm

Use Vec<Vec<T>>. With large enough inner Vec the overhead will be negligible. There's flat_map for iterating easily over this.

dcarosone · January 30, 2019, 11:11pm

The issue here will be that there isn't that much space available in one allocation, but the progressive allocation provides backpressure for the VM system to free up more space

cuviper · January 30, 2019, 11:36pm

It might be interesting to have a crate provide a nice API around Vec<Vec<T>>, behaving mostly like a flat Vec<T> apart from not being able to get a contiguous slice.

ckaran · January 30, 2019, 11:57pm

You might be able to tweak something like ropey to do what you want.

vitalyd · January 31, 2019, 1:27am

It looks like the playground claims 8GB of total RAM (I'm guessing this is some container behind the scenes, I didn't check). overcommit_memory is set to 0, which means the kernel uses a heuristic to fail an "obvious overcommit" - I guess this falls into that category .

If overcommit is not entirely disabled, one could mmap with MAP_NORESERVE to bypass swap reservation (this is basically like manually doing overcommit with mode 1), but (naturally) libc's impl does not use that flag. But one can see this in play via this playground - if you remove MAP_NORESERVE, mmap will fail (printed value is -1, I didn't bother checking errno).

vitalyd · January 31, 2019, 1:45am

Can the DB not return a rowcount? Or is the DB itself streaming/cursoring?

This is likely not a helpful answer, but, you may want to reevaluate your design. If you're buffering all rows from such queries, and I assume the resultset isn't somehow limited to being 5GB, you're likely setting yourself up for future failure if the dataset size continues to grow.

One option might be to stream the data to a (temp) file (assuming you'll have more disk space than memory), and then mmap that file for processing (I'm assuming you're doing readonly operations on the data?). This assumes that you cannot actually stream data from the DB in chunks and need to materialize all data before actually doing anything.

tokahuke · January 31, 2019, 2:00am

Db is (at the moment) cursoring. I maybe could try to estimate the number of rows, although that would be tricky. However, after reading through the answers, it is apparent that I will indeed need to reevaluate my design.

Yes, my dataset keeps growing every day (and my ambitions at processing everything too). Thnaks @vitalyd, that is a sensible answer, although I will still have a look at @kornel's Vec<Vec<_>> solution, which I found quite elegant, too.

dcarosone · January 31, 2019, 4:22am

So, some (hopefully) obvious points:

you could try (for experiment) allocating a large chunk with Vec<Vec<_>>, freeing that, and then alloc a large Vec<_> of the same total capacity
with Vec<Vec<_>>, you could take each cursored batch as the size of an inner Vec for easy iteration on the db read side as well
do you really need them all in memory at once? Can you iterate over each batch many times, or does each iteration depend on having completed a previous full iteration? Often the need for "everything in memory at once" comes from a random access pattern, whereas yours is linear.

TomP · January 31, 2019, 4:26am

And if your input dataset will continue to grow, what will you do when the total amount of data exceeds all available memory in your computer? At that point you will have to introduce an incremental processing strategy. So why not do that now, when you're only half-way to exceeding your computer's memory capacity, since you will be forced to do it eventually?

dcarosone · January 31, 2019, 4:31am

Repeated linear scans over this data will most likely result in the VM system paging one page at a time into a working set much smaller than the file. Maybe they'll remain long enough in the buffer cache for the next pass, but this is still more expensive than the original concern of a cpu cache miss "at all costs". Certainly, once the dataset becomes larger than the available memory, you'll be paging off disk constantly.

vitalyd · January 31, 2019, 2:06pm

This is certainly true to a large extent. I don't think talking about cache misses as a major concern is really appropriate given the situation (i.e. loading GBs worth of data, possibly exceeding physical memory capacity, from a DB).

But, if accessing the mapping linearly, there should be some amount of prefetching going on by the kernel (and this can be further hinted via MAP_SEQUENTIAL madvise() call). If it's a fast disk and processing is "slow enough" then it's possible for the prefetch to hide (some of) the latency.

Of course, this is all speculative/general and @tokahuke should do some benchmarks with different strategies to see what works best for their specific workload. IMO, the most pressing issue to address would be the need for materializing the full dataset on the client - that's not going to scale, and cache misses will be the least of their concerns .

p-kraszewski · February 4, 2019, 6:04pm

It definitely looks like a job for a server-side processing (probably some sort of map-reduce), not like a job for "pull everything and process in memory".

Topic		Replies	Views
Using most of a computers memory with a single Vec of unknown size help	3	376	January 12, 2023
Rust memory allocation questions help	4	1003	December 17, 2022
What is the maximum length of a vector? help	11	7069	August 22, 2020
Vec::with_capacity + read_to_end = overallocation help	13	1046	January 3, 2022
What is the efficient way of handling large vec's help	10	1152	August 14, 2020

Want to create HUGE Vec<_> that fits half the memory: got failAborted

Related Topics