Faster alternative to binary search?

In this case, I really don't think so. @ndrewxie mentioned this:

I have a sorted, normally distributed list of data that I need to find an element in.

He basically already knows everything he needs: this is a quantile problem. Unless I really misunderstand the thread creator, if he already knows the distribution of his data it sounds like he is overcomplicating the solution. With a quick quantile approximation he can calculate analytically where in his list the element must be and start a search from there. If he has exact parameters he doesn't even need any data access at all to find the value of the element he is looking for, since this is an analytically very tractable problem. (Unless he is looking for the exact float value, is experimenting with a very specific random number generator, and needs to build something on that level. That doesn't sound like it, though.)

If you want to convert the list (a Vec or something similar, I guess) into a BTreeMap, that is either a not-super-cheap O(n) operation or a somewhat okay O(n log n) operation with lots of copies (Vec to BTreeMap needs a clone if I'm not mistaken), and then you still haven't started finding the element; the search is much easier afterwards, but you still have to do it. Calculating mean and variance is cheaper (if he doesn't have them yet), and that brings you directly into the close neighborhood of the element, or onto the element itself, without creating a copy of the data in a different format (and hence a larger memory footprint).
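To make the "calculate analytically where in the list the element must be" idea concrete, here is a minimal sketch. The names are mine, not the thread creator's, and the normal CDF is approximated inline (Abramowitz & Stegun 7.1.26):

    // Approximate standard normal CDF (Abramowitz & Stegun 7.1.26, error < 1.5e-7).
    fn phi(z: f64) -> f64 {
        let x = z / std::f64::consts::SQRT_2;
        let (sign, x) = if x < 0.0 { (-1.0, -x) } else { (1.0, x) };
        let t = 1.0 / (1.0 + 0.3275911 * x);
        let poly = t * (0.254829592
            + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
        let erf = sign * (1.0 - poly * (-x * x).exp());
        0.5 * (1.0 + erf)
    }

    /// Guess where `x` should sit in a sorted slice that is roughly N(mean, sd^2) distributed.
    fn estimated_index(data: &[f64], x: f64, mean: f64, sd: f64) -> usize {
        let frac = phi((x - mean) / sd); // expected quantile of x
        ((frac * data.len() as f64) as usize).min(data.len().saturating_sub(1))
    }

From estimated_index you would then do a short local search (e.g. a galloping search outward) to land on the exact element.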

1 Like

Alright. I'm using the search routine for a sparse array: I have two vectors, one containing the data and the other containing the addresses, and I need the lookup routine to implement indexing. So, for example, the vector [1, 2, 0, 0, 2, 3] would have an address vector of [0, 1, 4, 5] and a data vector of [1, 2, 2, 3] (assuming that 0 is ignored). If I were to try to find, let's say, the 4th element, my program would run a search routine on the address vector and find 4 (4 is at index 2 in the address vector), then return the data at the corresponding index of the data vector (index 2 in the data vector is 2, so 2 is returned).

The whole vec is used in a machine learning routine, so it needs to be pretty fast (I'm working with VERY, VERY sparse matrices here, so there's no point in wasting memory storing all those zeroes). Right now, my code is about 2.5 times slower than Vec, which kind of works for the application, but is still too slow. The bottleneck is the sparse Vec (I used to use a normal Vec, and the program would take 1.5 gigabytes of memory, even with a small dataset).
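For readers following along, a minimal sketch of the lookup described above, using a plain binary search on the address vector (names are illustrative, not the poster's actual code):

    /// A sparse vector: `addresses[i]` is the logical index of the i-th nonzero, `data[i]` its value.
    struct SparseVec {
        addresses: Vec<usize>, // kept sorted
        data: Vec<f32>,
    }

    impl SparseVec {
        /// Value at logical index `idx`; zero if that index is not stored.
        fn get(&self, idx: usize) -> f32 {
            match self.addresses.binary_search(&idx) {
                Ok(pos) => self.data[pos],
                Err(_) => 0.0,
            }
        }
    }

    // Example from the post: [1, 2, 0, 0, 2, 3]
    // addresses = [0, 1, 4, 5], data = [1.0, 2.0, 2.0, 3.0]
    // get(4) finds 4 at position 2 in `addresses` and returns data[2] == 2.0.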

Binary search has O(log(n)) time, but interpolation search has O(log(log(n))) time, at least for data that grows roughly linearly with index (i.e. is uniformly distributed). So I'm dividing the data into sections (currently 4) and making the linearity assumption within each section, for fast lookup.
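(For other readers: a generic interpolation-search sketch over a sorted slice, not the poster's actual code, which is linked below. It assumes values grow roughly linearly with index within the searched range.)

    /// Returns the position of `target` in the sorted slice, if present.
    fn interpolation_search(a: &[usize], target: usize) -> Option<usize> {
        let (mut lo, mut hi) = (0usize, a.len().checked_sub(1)?);
        while lo <= hi && target >= a[lo] && target <= a[hi] {
            if a[hi] == a[lo] {
                return if a[lo] == target { Some(lo) } else { None };
            }
            // Linear interpolation: guess where `target` should sit between lo and hi.
            let guess = lo + (target - a[lo]) * (hi - lo) / (a[hi] - a[lo]);
            match a[guess].cmp(&target) {
                std::cmp::Ordering::Equal => return Some(guess),
                std::cmp::Ordering::Less => lo = guess + 1,
                std::cmp::Ordering::Greater => hi = guess.checked_sub(1)?,
            }
        }
        None
    }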

Code is here

1 Like

How sparse are we talking here? Like a 1000x1000 matrix with 3 nonzero elements per row, or something like 100 nonzero elements in total?

Also, the typical approach when working with sparse matrices is not to employ a generic sparse vector structure like the one you have implemented, but to store the nonzero elements of each row. Perhaps reading about how GSL implements different kinds of sparse matrices (see "Sparse Matrices" in the GSL 2.7 documentation) can help you.

Since you mention machine learning, I guess the fundamental operation is just multiplication of a matrix by a vector. Is this the case? Because that fits very well with storage by rows.
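For illustration, a minimal sketch of the row-wise storage meant above (compressed sparse row, one of the formats described in the GSL docs), together with the matrix-vector product that benefits from it; the names are illustrative:

    /// Compressed sparse row storage: row i's nonzeros are
    /// values[row_ptr[i]..row_ptr[i + 1]] at columns col_idx[same range].
    struct CsrMatrix {
        row_ptr: Vec<usize>, // length = nrows + 1
        col_idx: Vec<usize>,
        values: Vec<f32>,
        ncols: usize,
    }

    impl CsrMatrix {
        /// y = A * x
        fn matvec(&self, x: &[f32]) -> Vec<f32> {
            assert_eq!(x.len(), self.ncols);
            let nrows = self.row_ptr.len() - 1;
            let mut y = vec![0.0; nrows];
            for i in 0..nrows {
                for k in self.row_ptr[i]..self.row_ptr[i + 1] {
                    y[i] += self.values[k] * x[self.col_idx[k]];
                }
            }
            y
        }
    }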

Yeah, there are about 100,000 total elements, and about 60% of them are zeroes.

You might be interested in things like the Eytzinger layout, or other options from this paper:
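(For readers unfamiliar with it, a minimal sketch of the Eytzinger layout itself, following the usual construction and not taken from the linked paper: the sorted array is rewritten in breadth-first order of an implicit binary tree, so a search touches indices 1, 2-3, 4-7, ..., which is very cache- and prefetch-friendly.)

    /// Rewrites `sorted` into Eytzinger (BFS) order. `out` must have length sorted.len() + 1;
    /// out[0] is unused. Call with src = 0, dst = 1; returns how many elements were consumed.
    fn eytzinger_build(sorted: &[u32], out: &mut [u32], mut src: usize, dst: usize) -> usize {
        if dst < out.len() {
            src = eytzinger_build(sorted, out, src, 2 * dst);     // left subtree
            out[dst] = sorted[src];
            src += 1;
            src = eytzinger_build(sorted, out, src, 2 * dst + 1); // right subtree
        }
        src
    }

    /// Index (into the Eytzinger array) of the first element >= key, or 0 if there is none.
    fn eytzinger_lower_bound(eytz: &[u32], key: u32) -> usize {
        let mut i = 1;
        while i < eytz.len() {
            // Branchless descent: go right while the current element is too small.
            i = 2 * i + (eytz[i] < key) as usize;
        }
        // Undo the final run of right turns to recover the answer's position.
        i >> (i.trailing_ones() + 1)
    }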

1 Like

Well, that is barely sparse, not "VERY, VERY sparse" as you claimed. If you are worried about memory, note that instead of the 100,000 total elements you can store just the 40,000 nonzeros plus some extra information to locate them. Regardless of the encoding, it seems that at least one integer is required per nonzero entry. Thus, if a data value and an index are of similar size, you can only reduce the memory by about 20%. And this comes at the expense of increased access time, to a degree that depends on the chosen encoding and on the operation to perform.

Also, I am assuming that the zeroes are randomly located. Otherwise, a specialized scheme might be used, e.g. for triangular, banded, or circulant matrices. Other properties such as symmetry are also relevant.

1 Like

Yeah, I've heard of the Eytzinger layout. Is there any way it can be extended to work with interpolation search (which has O(log(log(n))) complexity, vs the O(log(n)) complexity of binary search)? Or does the better cache efficiency of the Eytzinger layout compensate for the higher complexity of binary search?

Binary search will use log2(40,000) ≈ 15.3 < 16 memory lookups. Since you want to beat binary search, how many lookups are you trying to reduce this to?

1 Like

Actually, using a B-tree (see B-tree - Wikipedia), this can be done in 4 "cache line loads", as follows:

Assuming:

  • we are dealing with 32-bit (4 byte) floats
  • cache lines are 64 bytes // holds 16 floats

Each node contains 16 floats that divide its "range" into 16 equal-sized sub-ranges.

Then, each node (cache line) lookup allows us to recover log_2(16) = 4 bits of info.

===

For the problem at hand we only need fewer than 16 bits (log2(40,000) ≈ 15.3), so we can locate the item with at most 4 node (cache line) lookups.
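A minimal sketch of that layout (following the standard implicit "static B-tree" construction, not anyone's production code): values are packed into 16-key nodes, one node per 64-byte cache line, children are addressed arithmetically, and a lower-bound query touches at most 4 nodes for 40,000 elements.

    const B: usize = 16; // keys per node: 64-byte cache line / 4-byte f32

    /// Index of child `i` (0..=B) of node `k` in the implicit (B+1)-ary tree.
    fn child(k: usize, i: usize) -> usize {
        k * (B + 1) + i + 1
    }

    /// Keys laid out in implicit B-tree order; padding slots hold +infinity.
    /// (Real code would also align each node to a 64-byte boundary.)
    struct CacheLineTree {
        nodes: Vec<[f32; B]>,
    }

    impl CacheLineTree {
        fn build(sorted: &[f32]) -> Self {
            let nblocks = (sorted.len() + B - 1) / B;
            let mut nodes = vec![[f32::INFINITY; B]; nblocks];
            let mut next = 0; // next element of `sorted` to place (in-order traversal)
            fill(&mut nodes, sorted, 0, &mut next);
            CacheLineTree { nodes }
        }

        /// Smallest stored key >= x, or None. Each loop iteration loads one node,
        /// i.e. one cache line; for 40,000 keys that is at most 4 iterations.
        fn lower_bound(&self, x: f32) -> Option<f32> {
            let (mut k, mut res) = (0, None);
            while k < self.nodes.len() {
                let node = &self.nodes[k];
                // Rank of x inside the node: a linear scan of one cache line (SIMD-friendly).
                let i = node.iter().position(|&v| v >= x).unwrap_or(B);
                if i < B {
                    res = Some(node[i]);
                }
                k = child(k, i);
            }
            res.filter(|v| v.is_finite()) // strip the +infinity padding
        }
    }

    /// In-order traversal of the implicit tree, writing `sorted` into the node slots.
    fn fill(nodes: &mut [[f32; B]], sorted: &[f32], k: usize, next: &mut usize) {
        if k < nodes.len() {
            for i in 0..B {
                fill(nodes, sorted, child(k, i), next);
                if *next < sorted.len() {
                    nodes[k][i] = sorted[*next];
                    *next += 1;
                }
            }
            fill(nodes, sorted, child(k, B), next);
        }
    }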

Interpolation search is how I can reduce it - if the data is uniform, interpolation search takes 1 memory lookup. If it's not, then more is required.

Okay, this makes a fair bit more sense.

There is one thing I still don't fully understand: What are you doing that requires you to perform indexing? I've used sparse layouts for many things and have hand-rolled a number of such types, but I've never found that I needed a method like fn at(&self, usize) -> &T except for use by unit tests. (Normally all I need is something like a dot product for floats, or union/intersection operators for bools)

I can't help but suspect that a better solution can be found by attacking that problem instead.

1 Like

Yeah, it's for an ML routine, but it's used more as a datapool-type thing, so I need very efficient random access and insert ability. Matrix multiplication, not so much.

1 Like

lmao it's not perfectly uniform, that's why there's the if. Locally, it's close enough though

Yeah, that is almost dense; something like 98% zeros would count as 'already very sparse'.
But interesting! All the suggestions I had were of course quite specific to the normal distribution; if you want to generalize this to other distributions, fair enough, that is also possible.
Good to see that your interpolation search works and is fast enough. If I understand you right, you interpolate the CDF piecewise linearly (so every subpart of the CDF is linear, hence locally uniformly distributed) and apply a specialized search routine there.

On that basis I can also see why you do the 'unnecessary' data vector access: you want the actual element and its location, not the location of an element that could be within 10e-15 of it.
Just in case, I'll underline it one last time: you can speed up that part of the computation by several orders of magnitude if you only need an element close by, not the 'float-exact' element at that point of the sparse struct!

Now, as it looks like that isn't the case and you want/need to work with the data vector, you can still reduce the access time further by using theoretical quantiles as a starting point. Be careful here, as this can lead to unstable access times if the data is only 'somewhat close' to a normal distribution, so keep that in mind. From your original post I think this might still be of interest to you.

Let N be the CDF of the normal distribution and Q be the quantile function, which is the inverse of N. There are many very efficient implementations of both, so I'll take them as given for now.
At some point you should make the effort to calculate the mean m and variance v of your data vector. I'll pretend that m = 0, v = 1, but this applies to any values. Calculating them is an O(n) operation, and they do not need to be updated every time you insert or delete elements if your vector is large; recomputing them when, say, 1% of your data has changed is enough.

Edit: If you're really sure that you're dealing with a normally distributed data set, you have a large enough sample (> 5,000-10,000), and you really need the speed, you can even save that O(n) pass, since you're sorting your array anyway. Mean and median of a normal distribution coincide, so you just select the middle element. The variance can just as well be read off the array: by the 68-95-99.7 rule, 68.2689492% of all values fall within one sigma of the mean, i.e. with s = sqrt(v), 68.2689492% of the sampled values lie between m - s and m + s. This means we get s by selecting the element at roughly the (0.5 + 0.682689492 / 2) ≈ 0.8413 quantile, i.e. s = data_vec[(0.8413 * data_vec.len() as f64) as usize] - m. Both are fast O(1) operations, which keeps the whole lookup at O(log(log(n))) without any O(n) preprocessing.
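A tiny sketch of that trick on a sorted slice (assuming the data really is roughly normal; the constant 0.8413 is the standard normal CDF at one sigma, and the names are illustrative):

    /// Estimate mean and standard deviation of a sorted, roughly normal sample in O(1),
    /// using the median and the ~84.13th percentile (one sigma above the mean).
    fn estimate_mean_sd(sorted: &[f64]) -> (f64, f64) {
        assert!(sorted.len() >= 2, "needs a reasonably large sample");
        let m = sorted[sorted.len() / 2]; // median == mean for a normal distribution
        let p84 = sorted[(0.8413447 * sorted.len() as f64) as usize];
        (m, p84 - m)                      // one-sigma point minus the mean
    }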

Let's say you want the address of an element x in that vector. If x > 0, we theoretically don't need to look in the lower half of the vector; that's what we know from the distribution. Same for the other case. That covers the so-called median (which is the same as the mean for N). Now, depending on the size of your vector, take 2n further split points, with n somewhere between 1 and 4, and distribute them to the left and right of the median. I'll assume n = 1, which is roughly what you did.

If we now divide the data vector into four chunks, as you did, we are effectively placing quartiles into our vector. Just as with the smaller/larger-than-zero case, we can deduce in which of the four resulting bins the element should lie: the approximate value of each boundary is Q(0.25; m, v), Q(0.5; m, v), and so on, so we can decide, without a single lookup, which of the four buckets we need to search.
So, to speed up your routine, do the following (a code sketch follows after the list):

  1. Calculate m and v.
  2. Decide on the number of quantiles (buckets) you want to place in your data vector.
  3. Calculate the approximate value, and with that also the address, of those boundaries. The first quarter of the data vector also corresponds to the first quarter of the address vector. Easy-peasy with Q. Save the boundaries in another, much smaller vector and do lookups on that vector first.
  4. That way you identify the bucket with a handful of comparisons (or none at all, if you evaluate N(x) analytically). Then use a local, efficient lookup inside that bucket.
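A minimal sketch of steps 2-4, with the normal CDF N approximated inline (Abramowitz & Stegun) and a plain binary-search fallback in case the distributional guess points at the wrong bucket; all names are illustrative:

    // Approximate standard normal CDF N (Abramowitz & Stegun 7.1.26).
    fn normal_cdf(z: f64) -> f64 {
        let x = z.abs() / std::f64::consts::SQRT_2;
        let t = 1.0 / (1.0 + 0.3275911 * x);
        let poly = t * (0.254829592
            + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
        let erf = 1.0 - poly * (-x * x).exp();
        if z < 0.0 { 0.5 * (1.0 - erf) } else { 0.5 * (1.0 + erf) }
    }

    /// Position of `x` in the sorted slice: pick a bucket from the distribution,
    /// binary-search inside it, and fall back to the whole slice if the guess was off.
    fn quantile_lookup(sorted: &[f64], x: f64, m: f64, s: f64, buckets: usize) -> Result<usize, usize> {
        let n = sorted.len();
        // Choose the bucket analytically, without touching the data.
        let b = ((normal_cdf((x - m) / s) * buckets as f64) as usize).min(buckets - 1);
        let (lo, hi) = (b * n / buckets, ((b + 1) * n / buckets).min(n));
        match sorted[lo..hi].binary_search_by(|v| v.partial_cmp(&x).unwrap()) {
            Ok(i) => Ok(lo + i),
            // Not in the predicted bucket (data not exactly normal): search everything.
            Err(_) => sorted.binary_search_by(|v| v.partial_cmp(&x).unwrap()),
        }
    }

Within the bucket you could of course also use interpolation search instead of binary search, which is exactly the piecewise-linear assumption discussed above.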

Upside: random access is very fast and efficient. Downside: inserts might be a little bit more complex as you have to update your boundaries from time to time.

2 Likes