Yeah, that is almost to be considered dense; 98% is 'already very sparse'.
But interesting! All the suggestions I had were of course quite specific to the normal distribution; if you want to generalize this to other distributions, fair enough, that is also possible.
Good to see that your interpolation search works and is fast enough. If I understand you right you interpolate the CDF piecewise linearly (so every subpart of the CDF is linear, hence locally uniformly distributed) and apply a specialized search routine there.
On that basis I can also see why you do the 'unnecessary' data vector access: you want the actual element and its location, not the location of some element that might be within 10e-15 of it.
Just in case, I'll underline it one last time: you can speed up the computation in that part by several orders of magnitude if you only need an element close by rather than the 'float-exact' element at that point of the sparse struct!
Now, as it looks like this isn't necessarily the case and you want/need to work with the data vector, you can still reduce the access time further by using theoretical quantiles as starting points. Careful here, as this can lead to unstable access (if the data is only 'somewhat close' to a normal distribution), so keep that in mind. I think from your original post this might still be of interest to you.
Let N be the CDF of the normal distribution and Q be the quantile function, which is the inverse of N. There are many very efficient implementations for both, so I'll keep those as given for now.
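Just so the later sketches are self-contained, here is a rough Rust version of the two; `norm_cdf` / `norm_quantile` are placeholder names of mine, the CDF uses the classic Abramowitz-Stegun erf approximation and the quantile is a plain bisection on it (in practice you would rather pull both from a stats crate):

```rust
// N: standard normal CDF via the Abramowitz-Stegun erf approximation (abs. error ~1.5e-7).
fn norm_cdf(x: f64) -> f64 {
    let z = x.abs() / std::f64::consts::SQRT_2;
    let t = 1.0 / (1.0 + 0.3275911 * z);
    let poly = t * (0.254829592
        + t * (-0.284496736
        + t * (1.421413741
        + t * (-1.453152027 + t * 1.061405429))));
    let erf = 1.0 - poly * (-z * z).exp();
    if x >= 0.0 { 0.5 * (1.0 + erf) } else { 0.5 * (1.0 - erf) }
}

// Q = N^{-1}: plain bisection on norm_cdf; slow, but it only runs when buckets are (re)built.
fn norm_quantile(p: f64) -> f64 {
    assert!(p > 0.0 && p < 1.0);
    let (mut lo, mut hi) = (-10.0_f64, 10.0_f64);
    for _ in 0..80 {
        let mid = 0.5 * (lo + hi);
        if norm_cdf(mid) < p { lo = mid } else { hi = mid }
    }
    0.5 * (lo + hi)
}
```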
You should at some point make the effort and calculate the mean m and variance v of your data vector. I'll pretend that m = 0, v = 1, but this applies to any values. Calculating these is an O(n) operation and does not need to be redone every time you insert or delete elements if your vector is large; recomputing when, say, 1% of your data has changed is enough.
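For completeness, that O(n) pass might look something like this (a minimal sketch; `mean_and_variance` and the plain population variance are my choices):

```rust
// One O(n) pass over the (non-empty) value vector: mean and variance.
fn mean_and_variance(data_vec: &[f64]) -> (f64, f64) {
    let n = data_vec.len() as f64;
    let m = data_vec.iter().sum::<f64>() / n;
    let v = data_vec.iter().map(|x| (x - m) * (x - m)).sum::<f64>() / n;
    (m, v)
}
```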
Edit: If you're really sure that you're dealing with a normally distributed data set, you have a large enough sample (> 5,000-10,000) and you really need the speed, you can even save that O(n) operation, since you're sorting your array anyway. Mean and median coincide for a normal distribution, hence you just select the middle element, which is not surprising - but the standard deviation can just as well be read off the array! By the empirical rule (the 68-95-99.7 rule), 68.2689492% of all values fall within one 'sigma interval around the mean', i.e. with s = sqrt(v) we know that 68.2689492% of all sampled values fall between m - s and m + s. This means we get s as data_vec[(0.5 + 0.682689492 / 2) * data_vec.len()] - m, i.e. the value at the ~84.13% quantile minus the median. Both are fast O(1) operations, which brings the overall complexity down to O(log(log(n))).
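A hedged sketch of that shortcut, assuming the vector is already sorted and the data really is close to normal (`estimate_m_s` is just my name for it):

```rust
// O(1) estimates from an already sorted, roughly normal data vector:
// m = median, s = value at the ~84.13% quantile minus m (the one-sigma point).
fn estimate_m_s(sorted: &[f64]) -> (f64, f64) {
    let n = sorted.len();
    let m = sorted[n / 2];
    let one_sigma_idx = ((0.5 + 0.682689492 / 2.0) * n as f64) as usize;
    let s = sorted[one_sigma_idx.min(n - 1)] - m;
    (m, s)
}
```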
Let's say you want to have the address of an element x in that vector. If x > 0, we theoretically don't need to look in the lower half of the vector at all; that's what we know from the distribution. Same for the other case. That covers the so-called median (which is the same as the mean for a normal distribution). Now, depending on the size of your vector, take 2k further search points, with k somewhere in the range between 1 and 4, and distribute them to the left and right of the median. I'll assume k = 1, which is what you did.
If we now divide the data vector into four chunks as you did, we are effectively placing quartiles into our vector. As before with smaller/larger than zero, we can just as well deduce in which of the four resulting bins the element should lie: we can approximately calculate the value of the first border as Q(0.25; m, v), and so decide, without a single lookup, in which of the four buckets we need to be.
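To make the quartile case concrete, a small sketch (`norm_quantile` is the helper from above; `quartile_boundaries` / `bucket_of` are my names): the three boundary values are computed once from m and s, and a query x is then assigned to one of the four buckets with at most three comparisons, without touching the big data vector.

```rust
// Boundary values of the four quartile buckets, computed once from m and s (no data access).
fn quartile_boundaries(m: f64, s: f64) -> [f64; 3] {
    [0.25, 0.5, 0.75].map(|p| m + s * norm_quantile(p))
}

// Decide in which quarter of the sorted data vector x should live: 0, 1, 2 or 3.
fn bucket_of(x: f64, bounds: &[f64; 3]) -> usize {
    bounds.iter().take_while(|&&b| x > b).count()
}
```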
So to speed up your routine, do the following:
- Calculate m and v.
- Depending on the number of elements, decide on a number of quantiles (buckets) you want to place in your data vector.
- Calculate the approximate value and, with that, also the address of those boundaries. The first quarter in the data vector also means the first quarter of the address range. Easy-peasy with Q. Save them in another, much smaller vector and do lookups on that vector first.
- Like that you identify the right bucket with just a few comparisons on the small vector. Now you can use a local and efficient lookup at that level.
Upside: random access is very fast and efficient. Downside: inserts might be a little bit more complex as you have to update your boundaries from time to time.
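Putting the steps together, this is roughly how I would wire it up; only a sketch under the assumptions above, with names of my own choosing, and with a plain binary search standing in for your local interpolation search inside the bucket:

```rust
// Small index living next to the big sorted data vector.
struct QuantileIndex {
    bounds: Vec<f64>,   // approximate boundary values, one per inner bucket border
    starts: Vec<usize>, // start address of each bucket in the data vector
}

fn build_index(data_vec: &[f64], m: f64, s: f64, buckets: usize) -> QuantileIndex {
    assert!(buckets >= 1);
    let n = data_vec.len();
    let mut bounds = Vec::with_capacity(buckets - 1);
    let mut starts = vec![0];
    for i in 1..buckets {
        let p = i as f64 / buckets as f64;
        bounds.push(m + s * norm_quantile(p)); // boundary value, straight from Q
        starts.push(n * i / buckets);          // i-th quantile of the address range
    }
    QuantileIndex { bounds, starts }
}

// Lookup: pick the bucket from the small vector, then search only that slice.
fn lookup(data_vec: &[f64], idx: &QuantileIndex, x: f64) -> Result<usize, usize> {
    let b = idx.bounds.iter().take_while(|&&v| x > v).count();
    let lo = idx.starts[b];
    let hi = idx.starts.get(b + 1).copied().unwrap_or(data_vec.len());
    // Your piecewise-linear interpolation search would go here; a plain binary
    // search on the bucket is used as a stand-in. If the data is only roughly
    // normal, an element can spill into a neighbouring bucket, so a real
    // implementation should widen the range or fall back to a full search on a miss.
    data_vec[lo..hi]
        .binary_search_by(|v| v.partial_cmp(&x).unwrap())
        .map(|i| lo + i)
        .map_err(|i| lo + i)
}
```

The small bounds / starts vectors are all you have to refresh when m and s drift, which is exactly where the extra insert complexity mentioned above comes from.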