Want to create HUGE Vec<_> that fits half the memory: got failAborted

So.... I want to fit a very big dataset in memory, which might even fit, but here is the catch: the default behavior of Vec<_> is to double its capacity when it runs out of space. I got my program aborted when it tried to grow to 10GB, which is just over the size of my RAM...

... but the vector was only 5GB, with room to spare for the rest of the dataset (even with an OS and a database running). This doubling behavior of Vec<_> is completely undesirable for my use case.

How can I solve this problem? I need to iterate over this vector linearly many times, avoiding cache misses at all costs.

If you know the size ahead of time, you may want to try calling with_capacity or reserve.
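
For example, something like this (a minimal sketch; expected_rows here is just a stand-in for whatever size you know up front):

```rust
// Pre-size the Vec so it never has to grow, and therefore never doubles past
// what is actually needed. `expected_rows` is a placeholder value.
fn main() {
    let expected_rows: usize = 10_000_000;
    let mut data: Vec<u64> = Vec::with_capacity(expected_rows);

    for i in 0..expected_rows as u64 {
        data.push(i); // never triggers a reallocation
    }
    assert!(data.capacity() >= expected_rows);

    // Or, on an existing Vec, ask for room for `n` more elements up front:
    // data.reserve(n);
}
```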


No, I don't have that. It is the result set of a query, which I stream from the DB. I'm looking for a solution along the lines of a crate or a data structure.

I guess you're not on Linux? Overcommit would usually allow this until you actually access all of those memory pages.

Well, I actually am on Linux (Ubuntu 16 Server). And no, I hadn't heard of overcommit yet.

... so... my understanding of the cause of my failAborted is wrong after all!?

LinkedList might be what you're looking for, though it's less efficient than Vec.

A thin wrapper around Vec might be good too. Here is a small mock-up that lets you reserve space on your own terms instead of depending on Vec's growth strategy (i.e. reserve memory before Vec does).
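
Roughly along these lines (a sketch of the idea rather than the exact mock-up): grow by a caller-chosen step via reserve_exact instead of letting Vec double.

```rust
// A Vec wrapper that grows in fixed increments chosen by the caller instead
// of relying on Vec's doubling strategy.
struct SteppedVec<T> {
    inner: Vec<T>,
    step: usize, // how many extra elements to reserve whenever we run out
}

impl<T> SteppedVec<T> {
    fn new(step: usize) -> Self {
        SteppedVec { inner: Vec::with_capacity(step), step }
    }

    fn push(&mut self, value: T) {
        // Reserve before Vec would, and only by `step`, not by doubling.
        if self.inner.len() == self.inner.capacity() {
            self.inner.reserve_exact(self.step);
        }
        self.inner.push(value);
    }

    fn as_slice(&self) -> &[T] {
        &self.inner
    }
}
```

Note that each growth step still reallocates (and possibly copies) the buffer, so this trades less wasted capacity for more frequent reallocations.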

But in that case, wouldn't it be inefficient for a small enough T? Every element would carry a pointer to the next, which bloats things a lot for a large dataset with small items.

Oh, perhaps I'm wrong about how exactly overcommit works here. For instance on the playground, I can't allocate a single 40GB vector, but I can allocate a thousand 4GB vectors.

Use Vec<Vec<T>>. With large enough inner Vecs, the overhead will be negligible. There's flat_map for iterating easily over this.
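
For example (a minimal sketch; the chunk size is arbitrary):

```rust
// Build the data as fixed-size chunks and iterate over them as one sequence.
fn main() {
    let chunk_size: usize = 1_000_000;
    let mut chunks: Vec<Vec<u64>> = Vec::new();

    for start in (0..10_000_000u64).step_by(chunk_size) {
        // Each inner Vec is built at its final size, so it never doubles past it.
        let chunk: Vec<u64> = (start..start + chunk_size as u64).collect();
        chunks.push(chunk);
    }

    // Iterate as if it were one flat Vec<u64>:
    let sum: u64 = chunks.iter().flat_map(|chunk| chunk.iter()).sum();
    println!("{}", sum);
}
```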


The issue here will be that there isn't that much space available as a single allocation, but the progressive allocation provides backpressure for the VM system to free up more space.


It might be interesting to have a crate provide a nice API around Vec<Vec<T>>, behaving mostly like a flat Vec<T> apart from not being able to get a contiguous slice.
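
A rough sketch of what such a wrapper's API could look like (hypothetical, not an existing crate):

```rust
// Pushes go into fixed-capacity inner Vecs; iteration flattens them. There is
// deliberately no way to hand out a single contiguous &[T].
struct ChunkedVec<T> {
    chunks: Vec<Vec<T>>,
    chunk_capacity: usize,
}

impl<T> ChunkedVec<T> {
    fn new(chunk_capacity: usize) -> Self {
        ChunkedVec { chunks: Vec::new(), chunk_capacity }
    }

    fn push(&mut self, value: T) {
        // Start a new inner Vec whenever the current one is full; each inner
        // Vec is allocated once at chunk_capacity and never reallocated.
        if self.chunks.last().map_or(true, |c| c.len() == self.chunk_capacity) {
            self.chunks.push(Vec::with_capacity(self.chunk_capacity));
        }
        self.chunks.last_mut().unwrap().push(value);
    }

    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }

    fn iter(&self) -> impl Iterator<Item = &T> {
        self.chunks.iter().flat_map(|c| c.iter())
    }
}
```

Indexing could be added on top (chunk = i / chunk_capacity, offset = i % chunk_capacity), but anything that hands out a slice spanning chunk boundaries is off the table.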

You might be able to tweak something like ropey to do what you want.

It looks like the playground claims 8GB of total RAM (I'm guessing this is some container behind the scenes, I didn't check). overcommit_memory is set to 0, which means the kernel uses a heuristic to fail an "obvious overcommit" - I guess this falls into that category :slight_smile:.
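
For reference, the setting can be checked by reading the sysctl from /proc (Linux-specific):

```rust
use std::fs;

// 0 = heuristic overcommit, 1 = always overcommit, 2 = never overcommit.
fn main() -> std::io::Result<()> {
    let mode = fs::read_to_string("/proc/sys/vm/overcommit_memory")?;
    println!("vm.overcommit_memory = {}", mode.trim());
    Ok(())
}
```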

If overcommit is not entirely disabled, one could mmap with MAP_NORESERVE to bypass swap reservation (this is basically like manually doing overcommit with mode 1), but (naturally) libc's malloc does not use that flag. One can see it in play via this playground - if you remove MAP_NORESERVE, mmap will fail (printed value is -1, I didn't bother checking errno).
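
Roughly what such a playground snippet looks like (a sketch assuming the libc crate, with minimal error handling):

```rust
// Ask for a 40 GB anonymous mapping with MAP_NORESERVE. With overcommit not
// fully disabled this typically succeeds; without the flag, the mode-0
// heuristic may refuse it.
fn main() {
    let len: usize = 40 * 1024 * 1024 * 1024; // 40 GB
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
            -1,
            0,
        )
    };
    if ptr == libc::MAP_FAILED {
        eprintln!("mmap failed: {}", std::io::Error::last_os_error());
    } else {
        println!("got a mapping at {:p}", ptr);
        unsafe { libc::munmap(ptr, len) };
    }
}
```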


Can the DB not return a rowcount? Or is the DB itself streaming/cursoring?

This is likely not a helpful answer, but, you may want to reevaluate your design. If you're buffering all rows from such queries, and I assume the resultset isn't somehow limited to being 5GB, you're likely setting yourself up for future failure if the dataset size continues to grow.

One option might be to stream the data to a (temp) file (assuming you'll have more disk space than memory), and then mmap that file for processing (I'm assuming you're doing readonly operations on the data?). This assumes that you cannot actually stream data from the DB in chunks and need to materialize all data before actually doing anything.
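
A sketch of that approach, assuming the memmap2 crate (including its advise method); fetch_next_batch is a made-up stand-in for the DB cursor, and rows are assumed to be plain u64 values written in a fixed-width encoding:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let path = "/tmp/dataset.bin";

    // 1) Stream the result set into a file instead of into one giant Vec.
    {
        let mut out = BufWriter::new(File::create(path)?);
        while let Some(batch) = fetch_next_batch() {
            for value in batch {
                out.write_all(&value.to_le_bytes())?;
            }
        }
        out.flush()?;
    }

    // 2) Map the file read-only and do the linear passes over the mapping.
    let file = File::open(path)?;
    let map = unsafe { memmap2::Mmap::map(&file)? };
    // Hint the kernel that access will be sequential so it can prefetch.
    map.advise(memmap2::Advice::Sequential)?;

    let sum: u64 = map
        .chunks_exact(8)
        .map(|bytes| u64::from_le_bytes(bytes.try_into().unwrap()))
        .sum();
    println!("{}", sum);
    Ok(())
}

// Hypothetical stand-in for pulling the next batch of rows off the cursor.
fn fetch_next_batch() -> Option<Vec<u64>> {
    None
}
```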


The DB is (at the moment) cursoring. I could maybe try to estimate the number of rows, although that would be tricky. However, after reading through the answers, it is apparent that I will indeed need to reevaluate my design.

Yes, my dataset keeps growing every day (and so do my ambitions of processing everything). Thanks @vitalyd, that is a sensible answer, although I will still have a look at @kornel's Vec<Vec<_>> solution, which I found quite elegant, too.

So, some (hopefully) obvious points:

  • you could try, as an experiment, allocating a large chunk of memory as a Vec<Vec<_>>, freeing it, and then allocating a single large Vec<_> of the same total capacity (see the sketch after this list)
  • with Vec<Vec<_>>, you could use each cursored batch as an inner Vec, which gives you easy iteration on the DB read side as well
  • do you really need them all in memory at once? Can you iterate over each batch many times, or does each iteration depend on having completed a previous full iteration? Often the need for "everything in memory at once" comes from a random access pattern, whereas yours is linear.
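
Here is a small sketch of the experiment from the first bullet (sizes scaled way down so it runs anywhere; for the real test you would use values close to the 5GB dataset):

```rust
fn main() {
    const CHUNK: usize = 16 * 1024 * 1024; // 16 MB per inner Vec
    const TOTAL: usize = 256 * 1024 * 1024; // 256 MB in total, for the demo

    // 1) Allocate the space as many smaller Vecs. Filling with a non-zero
    //    byte forces the pages to actually be written, rather than lazily
    //    zero-mapped.
    let chunked: Vec<Vec<u8>> = (0..TOTAL / CHUNK).map(|_| vec![1u8; CHUNK]).collect();
    assert_eq!(chunked.iter().map(|c| c.len()).sum::<usize>(), TOTAL);

    // 2) Free it again.
    drop(chunked);

    // 3) See whether a single flat allocation of the same total size works now.
    let flat: Vec<u8> = Vec::with_capacity(TOTAL);
    println!("flat capacity: {} bytes", flat.capacity());
}
```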

And if your input dataset will continue to grow, what will you do when the total amount of data exceeds all available memory in your computer? At that point you will have to introduce an incremental processing strategy. So why not do that now, when you're only half-way to exceeding your computer's memory capacity, since you will be forced to do it eventually?

Repeated linear scans over this data will most likely result in the VM system paging one page at a time into a working set much smaller than the file. Maybe the pages will remain in the buffer cache long enough for the next pass, but this is still far more expensive than the CPU cache misses you originally wanted to avoid "at all costs". Certainly, once the dataset becomes larger than the available memory, you'll be paging off disk constantly.

This is certainly true to a large extent. I don't think talking about cache misses as a major concern is really appropriate given the situation (i.e. loading GBs worth of data, possibly exceeding physical memory capacity, from a DB).

But, if you access the mapping linearly, the kernel should do some amount of prefetching (and this can be further hinted via a MADV_SEQUENTIAL madvise() call). If it's a fast disk and processing is "slow enough", then it's possible for the prefetch to hide (some of) the latency.

Of course, this is all speculative/general and @tokahuke should do some benchmarks with different strategies to see what works best for their specific workload. IMO, the most pressing issue to address would be the need for materializing the full dataset on the client - that's not going to scale, and cache misses will be the least of their concerns :slight_smile:.


It definitely looks like a job for server-side processing (probably some sort of map-reduce), not like a job for "pull everything and process in memory".