Hello,
I have a sorted, normally distributed list of data that I need to find an element in. Interpolation search is the obvious answer; however, the approach I'm currently using is VERY computationally expensive. "Normal" interpolation search assumes the data is uniformly distributed. To get the starting point, a linear lookup is done, which essentially means plugging the value to be found into the inverse of the CDF (the CDF is the cumulative distribution function; for a uniform distribution it is linear, so inverting it and plugging in the desired value gives an approximate address cheaply). However, the normal (Gaussian) CDF is computationally expensive. Any way to fix that? I was thinking a piecewise-linear approach could work: split the distribution into 4 or so "buckets", and within each bucket the distribution is approximately linear, so normal interpolation search can be used.
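For context, this is the kind of "normal" interpolation search I mean - a minimal sketch assuming uniformly distributed i64 keys, not my actual code:

```rust
// Textbook interpolation search on a sorted slice (overflow handling
// omitted for brevity).
fn interpolation_search(data: &[i64], target: i64) -> Option<usize> {
    let mut lo = 0usize;
    let mut hi = data.len().checked_sub(1)?;
    while lo <= hi && target >= data[lo] && target <= data[hi] {
        if data[hi] == data[lo] {
            return if data[lo] == target { Some(lo) } else { None };
        }
        // The "linear inverse-CDF" guess: where would `target` land if the
        // keys between data[lo] and data[hi] were spread out uniformly?
        let pos = lo
            + ((target - data[lo]) as usize * (hi - lo)) / (data[hi] - data[lo]) as usize;
        match data[pos].cmp(&target) {
            std::cmp::Ordering::Equal => return Some(pos),
            std::cmp::Ordering::Less => lo = pos + 1,
            std::cmp::Ordering::Greater => hi = pos.checked_sub(1)?,
        }
    }
    None
}
```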
Also, this is more of an algorithms-type question (not directly related to Rust), so should I move it somewhere else? The moment I move this to Stack Overflow I'm going to get downvoted into oblivion.
I guess you should try to find an approximation of the function that fits your needs. Inherently it's a tradeoff between accuracy and performance, so I'd make a representative benchmark and test multiple options.
Piecewise-linear seems to be a reasonable algorithmic starting point. Linear search forward/backward from the midpoint makes sense if the number of segments is non-trivial but reasonably small, because a normal distribution clusters around the midpoint, with matches progressively less likely as one moves away from it.
You already sorted the data, so finding a quantile is fast. At that point you maintain a list of the quantiles you are interested in (say, divide the whole domain into n = 100 buckets), pick the quantile closest to the value you're searching for, and then do a linear scan.
Which is basically what you are proposing as well...
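A hypothetical sketch of what I mean (the bucket count and the exact scan strategy are up to your benchmark):

```rust
// Remember the value at the start of each of n equal-count buckets,
// jump to the matching bucket, then scan it linearly.
struct QuantileIndex {
    bucket_starts: Vec<usize>, // index where each bucket begins
    bucket_values: Vec<f64>,   // data value at that index
}

impl QuantileIndex {
    fn new(data: &[f64], n_buckets: usize) -> Self {
        assert!(!data.is_empty() && n_buckets > 0);
        let bucket_starts: Vec<usize> =
            (0..n_buckets).map(|i| i * data.len() / n_buckets).collect();
        let bucket_values = bucket_starts.iter().map(|&i| data[i]).collect();
        Self { bucket_starts, bucket_values }
    }

    // Returns the index of `target` in `data`, if present (exact f64 match).
    fn find(&self, data: &[f64], target: f64) -> Option<usize> {
        // Closest bucket from below: the last bucket whose first value is <= target.
        let b = self
            .bucket_values
            .partition_point(|&v| v <= target)
            .saturating_sub(1);
        let start = self.bucket_starts[b];
        let end = self.bucket_starts.get(b + 1).copied().unwrap_or(data.len());
        (start..end).find(|&i| data[i] == target)
    }
}
```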
Have you really confirmed that the search is doing so poorly? How many data points are there?
You'll take O(log N) time regardless of your approach so it'll be hard to beat bisection. But it sounds like you're not using bisection and that perhaps is your problem? Bisection makes no assumption about the distribution of the data and will perform no worse with any distribution so it's worth trying.
Is your data in an array or some linked data structure?
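For a contiguous array, bisection is just slice::binary_search from std, e.g.:

```rust
fn main() {
    let data = [1, 3, 4, 7, 9, 11];
    // std's bisection: O(log N), no assumptions about the distribution.
    assert_eq!(data.binary_search(&7), Ok(3));
    assert_eq!(data.binary_search(&8), Err(4)); // insertion point when absent
}
```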
I'm confused by the scale of the problem you are dealing with:
You're considering splitting the distribution into buckets. So presumably the data is large, and your performance issue boils down to cache misses. But then you say that computing the CDF of the Gaussian distribution is expensive. Surely it is far less expensive than those cache misses?
The data is sorted. Somehow that sort is not more costly than the search itself? How frequently are you searching the same data? If the number of searches is on the order of N, then you have O(N log N) total work and would do far better (especially memory-access-wise) with a single O(N) scan of the sorted data for all of the values sought.
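For example, if the values being looked up can themselves be sorted first, a single merge-style pass answers all of them (a hypothetical sketch):

```rust
// Answer many lookups with one O(N + M) pass, assuming both `data` and
// `queries` are sorted ascending. Each result is the index of a matching
// element, if any.
fn batch_find(data: &[f64], queries: &[f64]) -> Vec<Option<usize>> {
    let mut results = Vec::with_capacity(queries.len());
    let mut i = 0;
    for &q in queries {
        // The data cursor only ever moves forward.
        while i < data.len() && data[i] < q {
            i += 1;
        }
        results.push(if i < data.len() && data[i] == q { Some(i) } else { None });
    }
    results
}
```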
It's hard to beat a logarithmic search. I mean, you certainly can't beat it with just some prep-work; reducing the search space by a constant factor only subtracts a constant from the overall time.
This is starting to look like a tree, i.e. do a binary search over 100 quantiles, then within the range of the matching quantile, do a binary search of the underlying data.
Alternatively, a BTreeMap would do the same thing, just with a few more levels of tree. This is optimised for cache efficiency in Rust, so maybe also try that to see how it works out.
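A rough sketch of that two-level idea (the boundary layout is just a placeholder; any ~100-entry quantile table works):

```rust
// `boundaries` holds the start index of each quantile bucket, e.g.
// (0..100).map(|i| i * data.len() / 100).collect::<Vec<_>>().
fn two_level_search(data: &[f64], boundaries: &[usize], target: f64) -> Option<usize> {
    // Level 1: binary search over the ~100 bucket boundaries.
    let b = boundaries
        .partition_point(|&i| data[i] <= target)
        .saturating_sub(1);
    let start = boundaries[b];
    let end = boundaries.get(b + 1).copied().unwrap_or(data.len());
    // Level 2: binary search within the matching subrange only.
    data[start..end]
        .binary_search_by(|v| v.partial_cmp(&target).unwrap())
        .ok()
        .map(|i| start + i)
}
```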
Are you sure about inverting the normal CDF being expensive? Wikipedia says that inverting the error function can be performed with Newton's method, which converges very fast.
Anyway, the first things that need to be known are the size of the list and the number of searches that will be performed (as @ExpHP already indicated).
It may also be relevant that, knowing the distribution of the data, it should be possible to sort it in less than O(n log n) - in linear time under some conditions.
I'm splitting the data into buckets because it's already known that the data is approximately normally distributed. Within each of the 4 or so buckets, the distribution is approximately linear, so linear interpolation search can be used to quickly find a value. (If I were to use Gaussian interpolation search, convergence would take very few iterations, but there's a lot more expensive math.)
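Concretely, something like this is what I have in mind for the cheap starting-point estimate - a sketch using a handful of standard Phi(z) table values as breakpoints, assuming the mean and standard deviation are known:

```rust
// Piecewise-linear approximation of the standard normal CDF; values outside
// [-2, 2] are clamped.
const CDF_KNOTS: [(f64, f64); 5] = [
    (-2.0, 0.0228),
    (-1.0, 0.1587),
    (0.0, 0.5),
    (1.0, 0.8413),
    (2.0, 0.9772),
];

fn approx_normal_cdf(z: f64) -> f64 {
    if z <= CDF_KNOTS[0].0 {
        return CDF_KNOTS[0].1;
    }
    for w in CDF_KNOTS.windows(2) {
        let ((z0, p0), (z1, p1)) = (w[0], w[1]);
        if z <= z1 {
            // Linear interpolation inside this segment.
            return p0 + (p1 - p0) * (z - z0) / (z1 - z0);
        }
    }
    CDF_KNOTS[4].1 // clamp above +2 sigma
}

// Estimated index of `x` in a sorted array of n samples from N(mean, sd^2);
// a short local (or galloping) search from here settles the exact position.
fn estimated_index(x: f64, mean: f64, sd: f64, n: usize) -> usize {
    let p = approx_normal_cdf((x - mean) / sd);
    ((p * n as f64) as usize).min(n.saturating_sub(1))
}
```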
If your data is sorted already, then the distribution does not matter; O(log n) is pretty much the fastest you can have in a "standard container". I'm not aware of anything significantly faster than binary search here.
I'm not sure, but reading between your lines, I think you are looking for a quantile in a normally distributed array, and that's why you sorted your array.
If you know the full parametrization (e.g. mean and variance) of your data, then you can use Halley's method to numerically invert the CDF reasonably quickly. For a starting point, I would have to experiment - there are polynomial approximations of degree n to the CDF, which are n-times differentiable, so high-order solving schemes are available there as well. If you have "some guess" of the variance, there are quantile tables for the normal distribution: take the values there, scale them up, and do the rest with Halley. You should be there within two or three steps.
But all that is just extra overhead if you already have your data and it is sorted. And I agree with the other voices here - that is normally not the bottleneck in a typical numerical program. For calculating a quantile of a normal distribution (or, let's say, of a uni-modal continuous distribution with a known density), the Monte Carlo approach you are describing is computationally inefficient. Sorting plus binary search costs O(n log(n)) even in a good case - dominated by the sorting. So if it is just about the quantile of an analytically (somewhat) known distribution, do not create data via inverse transform sampling just to calculate a quantile; Newton-based methods on the CDF will be faster. If you have the data due to some other process and it happens to be sorted already, then use binary search or a really carefully tuned interpolation search. But that is micro-optimizing for most reasonably sized problems (e.g. array size smaller than 10 million).
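For reference, a minimal sketch of the Newton variant of that inversion (Halley just adds a second-derivative correction term). The erf approximation below is Abramowitz & Stegun 7.1.26, good to roughly 1e-7; a quantile-table starting guess, as mentioned, converges faster for p near 0 or 1:

```rust
use std::f64::consts::{PI, SQRT_2};

// Abramowitz & Stegun 7.1.26 polynomial approximation of erf
// (absolute error on the order of 1e-7).
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t
        + 0.254829592)
        * t;
    sign * (1.0 - poly * (-x * x).exp())
}

// Standard normal CDF and PDF.
fn phi(x: f64) -> f64 {
    0.5 * (1.0 + erf(x / SQRT_2))
}
fn pdf(x: f64) -> f64 {
    (-0.5 * x * x).exp() / (2.0 * PI).sqrt()
}

// Newton iteration for the standard normal quantile: solve phi(x) = p.
// Starting from 0 is crude; a table lookup would cut the iteration count.
fn inverse_cdf(p: f64) -> f64 {
    debug_assert!(p > 0.0 && p < 1.0);
    let mut x = 0.0;
    for _ in 0..20 {
        let step = (phi(x) - p) / pdf(x);
        x -= step;
        if step.abs() < 1e-10 {
            break;
        }
    }
    x
}
```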
I know it's annoying but maybe describe the actual problem you want to solve. I bet we can find the bottleneck.
We are talking about big-O notation, which gives you an approximate measure of speed as the dataset gets arbitrarily large (O(log N)).
You are talking about a specific use case: fewer than 128 elements. So yes, if you have fewer than 128 elements, you could be right and the SIMD code could be faster. But if we are talking about 1 million elements, binary search is definitely faster.
If the keys were a permutation of the first n natural numbers, then sorting them would create a vector v with v[i] = i. In that case, knowing the distribution has made an O(1) search algorithm possible. Do you have some proof that, in general, knowing the distribution does not help? Because I think something similar should work for sensible distributions.
@nakacristo Yes, in that trivial case we have O(1), that's why I mentioned 'pretty much the fastest'
I couldn't come up with another case though. Doesn't mean that there is another one, just 'for a fairly general problem' O(log n) will be it, unless you know very much about the problem and can fall back to a very specialized sorting algorithm (some interpolation sorting type for example).
But as mentioned, I don't think that the search itself is the actual problem of @ndrewxie .
I'm familiar with complexity.
I just made a mistake and replied to a message instead of to the OP.
BTW Proving that worst case scenario is O(log(n)) regardless of the info about the distribution is trivial, but is it proved that the average case complexity is O(log(n)) too?
[...] Doesn't mean that there is another one, just 'for a fairly general problem' O(log n) will be it, unless you know very much about the problem and can fall back to a very specialized sorting algorithm (some interpolation sorting type for example). [...]
Yes. As mentioned, the more you know about your problem, the better you can choose your algorithm. For a fairly generic problem, log n will be it. For solving a quantile problem, you technically would not even need to sort at all if you knew enough about the underlying process.
In a huge array, every step of a binary search is a cache miss. So something tree-like would make better use of the cache. The idea is that if you can make use of other values fetched in the same cache line, then the overall search will be faster. Effectively, we want the first cache line fetched to narrow down the search to one of N subranges, then the next cache line to do the same; with binary search, each cache line fetched only chooses between 2 subranges. Anyway, I say it's at least worth trying a BTreeMap with range queries over a binary search if the array is huge.
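A quick sketch of what I mean (integer keys for brevity; f64 keys would need an Ord wrapper such as the ordered-float crate):

```rust
use std::collections::BTreeMap;

fn main() {
    // Map sorted values to their original indices; note that duplicate keys
    // collapse to the last index.
    let data: Vec<i64> = vec![-3, -1, 0, 2, 2, 5, 9];
    let index: BTreeMap<i64, usize> =
        data.iter().enumerate().map(|(i, &v)| (v, i)).collect();

    // Exact lookup:
    assert_eq!(index.get(&5_i64), Some(&5));

    // Nearest key at or below a query, via a range query:
    let below = index.range(..=4_i64).next_back();
    assert_eq!(below, Some((&2, &4)));
}
```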