Data structure help

Suppose we have 10B u64's. So we have:

let x: Vec<u64> = ...;
x.len() == 10_000_000_000;

This takes 80 GB of RAM (maybe double), but it's okay.

Suppose now, instead of a single bucket we have 1M buckets, which store a total of 10B u64's; so now we have something like this:

let x: Vec<Vec<u64>> = ...;
x.len() == 1_000_000;
sum_i x[i].len() == 10_000_000_000;

So now we're at 80 GB + 1M * size_of::<Vec<u64>>(), still okay [possibly double due to Vec's expansion].
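
For scale, the 1M outer Vec headers are tiny next to the payload; a quick check, assuming a 64-bit target:

use std::mem::size_of;

fn main() {
    // A Vec<u64> header is pointer + length + capacity: 3 words = 24 bytes on 64-bit.
    let header = size_of::<Vec<u64>>();
    println!("outer headers: {} bytes", 1_000_000 * header); // ~24 MB
    println!("payload: {} bytes", 10_000_000_000u64 * 8);    // 80 GB
}

So the outer bookkeeping is on the order of tens of megabytes; the real risk is unused capacity inside the buckets.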

So now we start receiving a bunch of Move(i, j) commands, each meaning "let t = x[i].pop(); x[j].push(t);" -- i.e. move the last element from bucket i to bucket j.
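
In code, one Move(i, j) is roughly this (a sketch; pop returns an Option, so an empty source bucket has to be handled somehow):

fn apply_move(x: &mut Vec<Vec<u64>>, i: usize, j: usize) {
    // Move the last element of bucket i, if any, onto the end of bucket j.
    if let Some(t) = x[i].pop() {
        x[j].push(t);
    }
    // Note: x[i] keeps whatever capacity it had before the pop.
}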

These requests are not uniformly distributed; they may be adversarial.

So now we have this bad situation -- unless we regularly call Vec::shrink_to_fit, we can easily use more than 160 GB of memory. The adversary fills one bucket, then moves everything to the next bucket, and so on...
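
A small-scale sketch of that adversary (bucket and element counts shrunk so it actually runs):

fn adversary_demo(buckets: usize, n: usize) {
    let mut x: Vec<Vec<u64>> = vec![Vec::new(); buckets];
    x[0].extend(0..n as u64);
    for k in 0..buckets - 1 {
        // Drain bucket k entirely into bucket k + 1.
        while let Some(t) = x[k].pop() {
            x[k + 1].push(t);
        }
        // Bucket k is now empty, yet it still owns capacity for at least n elements.
        assert!(x[k].is_empty() && x[k].capacity() >= n);
    }
}

After sweeping through k buckets this way, roughly k times the peak capacity is still allocated for the same n live elements.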

Question: given the adversarial nature of this problem, what is a good data structure? I am looking for something that exploits the fact that there are only 10_000_000_000 elements in total, to keep the containers backing the 1M buckets "not too wasteful".

EDIT: We also need to be able to answer queries of the form "get me all elements in bucket i" efficiently, say in O(size of output).

Well, you could use shrink_to_fit whenever the array becomes smaller than half its capacity. Then you use at most twice the space you need.
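
That is, something like this on the pop path (a sketch; the helper name is mine):

fn pop_and_maybe_shrink(v: &mut Vec<u64>) -> Option<u64> {
    let t = v.pop();
    // Shrink whenever the bucket is less than half full, so the capacity
    // never stays above twice the length.
    if v.len() < v.capacity() / 2 {
        v.shrink_to_fit();
    }
    t
}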

I've been thinking about a LinkedList where each node is a [u64; 1000].

Iteration should not be too much worse given that each node is bigger than a 4 KB page, and as for wasted space, each bucket wastes at most one list node.
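
Something like this is what I have in mind -- hand-rolled rather than std's LinkedList, so each node can track how much of its chunk is in use (just a sketch):

const CHUNK: usize = 1000;

// One node: a fixed-size chunk plus how much of it is used.
struct Node {
    len: usize,
    data: [u64; CHUNK],
    next: Option<Box<Node>>,
}

// A bucket is a stack of chunks; push/pop only touch the front node, and
// only the front node is ever partially filled.
struct Bucket {
    head: Option<Box<Node>>,
}

impl Bucket {
    fn push(&mut self, v: u64) {
        // If the front chunk is missing or full, prepend a fresh one.
        if self.head.as_ref().map_or(true, |n| n.len == CHUNK) {
            self.head = Some(Box::new(Node { len: 0, data: [0; CHUNK], next: self.head.take() }));
        }
        let node = self.head.as_mut().unwrap();
        node.data[node.len] = v;
        node.len += 1;
    }

    fn pop(&mut self) -> Option<u64> {
        // Invariant: the front chunk is never empty, so len >= 1 here.
        let node = self.head.as_mut()?;
        node.len -= 1;
        let v = node.data[node.len];
        let now_empty = node.len == 0;
        if now_empty {
            // Free the emptied chunk so memory tracks the live element count.
            self.head = self.head.take().unwrap().next;
        }
        Some(v)
    }

    // "Get me all elements in bucket i" in O(size of output).
    fn collect_all(&self) -> Vec<u64> {
        let mut out = Vec::new();
        let mut cur = self.head.as_deref();
        while let Some(node) = cur {
            out.extend_from_slice(&node.data[..node.len]);
            cur = node.next.as_deref();
        }
        out
    }
}

Each bucket wastes at most one partially filled chunk (~8 KB), and the worst an adversary can force at a chunk boundary is allocating or freeing one fixed-size node, never an O(len) copy.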

Sure. A linked list would also work.

This would potentially suffer from thrashing: you add two elements and the capacity doubles, then you remove two elements and the capacity halves.

However, there is a simple solution to this: halve the capacity (using shrink_to) when the length becomes less than a quarter of the capacity.
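
Concretely (a sketch):

fn pop_with_hysteresis(v: &mut Vec<u64>) -> Option<u64> {
    let t = v.pop();
    // Only shrink when less than a quarter full, and only halve the capacity,
    // so the shrink threshold stays well away from the grow threshold.
    if v.len() < v.capacity() / 4 {
        v.shrink_to(v.capacity() / 2);
    }
    t
}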

Hmm, yes, you have to be careful if the push and pop thresholds are equal. It would also work to only increase capacity by 50% when full.
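
For the 50% growth variant you'd reserve ahead of the push yourself, e.g. (a sketch):

fn push_grow_by_half(v: &mut Vec<u64>, value: u64) {
    if v.len() == v.capacity() {
        // Ask for ~50% more space up front so the push below doesn't trigger
        // Vec's default growth; reserve_exact may still round up a little.
        let extra = (v.capacity() / 2).max(1);
        v.reserve_exact(extra);
    }
    v.push(value);
}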

At this scale, you should probably be thinking really carefully about how paging and allocators work if you're concerned about adversarial actions. To be clear, I'm not an expert, but I've got some suggestions based on what I've read.

Every bump over the capacity requires a second, double-sized allocation, then copying every element over, then deallocating the previous allocation, so you're not just potentially at double the size but temporarily at triple (old n + new 2n). The copying is also CPU-heavy and will completely trash your caches, though I'm guessing that's probably not that big an impact overall on current hardware.

Adversarial actions could also try to make you thrash pages and maybe hit allocator lock contention. I'm not sure how much you can do to protect yourself here, but I suspect existing allocators aren't really tuned for this scale, and at this size every allocator I'm familiar with simply delegates to the OS virtual memory allocator, so I wouldn't be too worried about rolling your own on that front. That might let you squeeze out a bit more item density, or avoid contention hazards by keeping things thread-local, depending on your application.

That said: that linked list of chunks would work pretty well for your concerns from what I can tell. It's roughly the internal structure of BTreeSet/Map, which are known for their excellent performance at scale. The main thing is tuning the total node size, including allocator overhead: my best guess would be either just under 4 kB or 16 kB depending on the OS, but the only rule I've seen is basically "it depends". You can also run into weird things like cache-line contention on the next pointers and other hard-to-predict effects, so it's not always obvious how to make things super fast once you're past Fortran-style big fat arrays. Well past my expertise here, so I'll only suggest looking around for other people doing things like this.
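
If it helps, this is the kind of arithmetic I mean, with made-up overhead numbers (the real values depend on your allocator and OS):

use std::mem::size_of;

// Aim for a whole node (chunk + bookkeeping + allocator overhead) per 4 KiB page.
const PAGE: usize = 4096;
const ALLOC_OVERHEAD_GUESS: usize = 16; // pure guess; varies by allocator
const NODE_OVERHEAD: usize = 8 /* next pointer */ + 8 /* chunk length */;
const CHUNK_CAP: usize = (PAGE - ALLOC_OVERHEAD_GUESS - NODE_OVERHEAD) / size_of::<u64>();
// CHUNK_CAP comes out to 508 u64s per node with these guesses.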
