[u128] -> unique list of u128

Suppose we are given N u128's, where 2^20 < N < 2^40 and we want to compute a list where each u128 appears precisely once. NOT required in sorted order.

On a single thread, this is fairly easy: dump into a HashSet, dump back out to a Vec.

With multicore, is this problem expected to be faster ?

You can collect into multiple hashsets in parallel, and then join them. I think this is exactly that's what rayon does if you par_iter().collect() into Hashset. If you're going to to this this way, I suspect you should use BTreeSet instead HashSet, as joining btreesets should have more cache-locality than joining hashsets, and with many threads, you'll have a lot of joining.. But I guess this approach will only allow to speed things up when you have significant amount of duplication within your u128s, otherwise merging partial results will probably outweight the parallelized insertion.

At this sizes you might be memory-bound (or IO-bound even), so threads won't buy you much. 2^40 of u128s is 8TiB, so not sure if this HashSet will fit in your RAM :smiley: And if it will, you have a constant stream of cache-misses anyway. So, perhaps sorting your input might turn out to be fastest solution after all? :man_shrugging:

1 Like

Fair point, let's assume (1) everything fits in RAM or (2) we somehow got our hands on those 12 TB AWS EC2 instances.

You could use something like a Bloom filter to track which numbers you've already added to the result list. This works best if you don't expect there to be many duplicates: The Bloom filter has a small chance of flagging a false positive, which needs to be double-checked with the actual output list.

Alternatively, you can relax your requirements to allow a small proportion of non-duplicates to be improperly skipped. You can control the probability by changing the Bloom filter's parameters, but it can't ever be zero (cf this figure on Wikipedia).