How could one make this search more cache friendly?


#1

I’ve written this “search” function(that can blow in somebody’s face with a stack overflow because of recursion), more akin to contains but returns the position of the “first found” element instead of a bool.
This function won’t be used anywhere, it’s written just for fun… Here it is:

extern crate rayon;

fn search<T>(slice: &[T], search_for: &T) -> Option<usize>
    where T: PartialEq + Sync
{
    let len = slice.len();

    if len == 0 {
        return None;
    }
    
    let it = len / 2;
    
    if &slice[it] == search_for {
        return Some(it);
    } else {
        if len == 1 {
            return None;
        } else {
            // __builtin_prefetch(&slice[it + ((len - it) / 2)], 0, 1);
            // __builtin_prefetch(&slice[it / 2], 0, 1);
            let (left, right) = slice.split_at(it);
            
            let (left_result, right_result) = if it > 64 {
                rayon::join(|| search(&left, &search_for),
                            || search(&right, &search_for))
            } else {
                (search(&left, &search_for),
                 search(&right, &search_for))
            };
            
            match (left_result, right_result) {
                (Some(position), _) => return Some(position),
                (_, Some(position)) => return Some(it + position),
                _                   => return None,
            }
        }
    }
}
  1. I imagine the cache doesn’t like all those references. Because Rust moves by default you basically have to do it this way or ask for T: Clone(or T: Copy?). This will impose more limitations on who can use the function.

  2. Of course I would like to provide 2 functions called search, T: PartialEq + Sync and T: PartialEq + Sync + Clone but from what I know, Rust doesn’t allow that. Would it be possible to overpass that through traits or there is just no way of using the same function name in the same module?

  3. Is there a way in Rust to ask the compiler to put calls to prefetch? I’ve found llvmint but the prefetch function wants *mut i8 or int8_t *(C/C++) instead of intptr_t(C/C++).


#2

Ok, I will do a shameless bump too cause I really hoped somebody had answers to these questions(How cache friendly are all these pointers, do you need to copy instead then?)


#3

Is your input sorted in any way? If not, why do you do a recursive splitting into threads? You could split right away and then let the threads do their work.

If your input is sorted you could first check out binary trees and if that is not enough you could look at b-trees with slices that fit nicely into you cache. And do make use of cachegrind :slight_smile:.

Firstly I hope that I understood your situation correctly and secondly I hope I could give some useful advice :slight_smile:.


#4

It’s not sorted, otherwise I would of not searched in both directions and go only into one direction, towards the value I am searching for(binary search).
I wanted to play with rayon, that’s why I split the workload and search in parallel(until there are to few elements to use threads, and instead make it run on the same thread).

Rayon offers now another API. Now I could split the workload without doing it recursively.


#5

Yes, I would avoid doing it recursively. Try to detect a good size for slices that fit into your cache and send them to your search threads. That should work nicely :slight_smile:.


#6

While it is more efficient than jumping through a sorted array for binary search, I’ve seen somewhere that with __builtin_prefetch, jumping through a sorted array is the fastest and most reliable when the number of elements grow.
I could probably do that with llvmint but I don’t understand why it asks the address to be *mut i8


#7

I have no experience with llvmint, so I cannot help you in that respect. By I have a feeling (and of course I can be misled) that your algorithm could need some more improvement. Do you do benchmarking? What is your speed gain over plain old single-thread linear search? Or a 2-thread linear search? How does it behave in regard to your threshold for thread-splitting? If you do a lot of individual searches in the same array you should try to sort it in advance. (This is of course not benefical if you only search once in every array.)


#8

I will make a claim that the parallel version is considerably faster than single-threaded or 2-threaded search(depends on the size of the array) and now I will go and test… :slight_smile:
Btw, linear search you mean, 1-by-1 instead of a divide-and-conquer approach?
For very small arrays I expect a linear search to be faster. I could change on the small slice to do a linear search instead of splitting.


#9

Preliminary tests…

test benches::bench_par_dac_search_100                   ... bench:         727 ns/iter (+/- 38)
test benches::bench_par_dac_search_1_000                 ... bench:       7,335 ns/iter (+/- 457)
test benches::bench_par_dac_search_10_000                ... bench:      73,319 ns/iter (+/- 5,150)
test benches::bench_par_dac_search_100_000               ... bench:     736,736 ns/iter (+/- 36,512)
test benches::bench_par_dac_search_1_000_000             ... bench:   3,753,264 ns/iter (+/- 15,903,998)
test benches::bench_par_dac_search_10_000_000            ... bench:  27,317,432 ns/iter (+/- 22,288,705)
test benches::bench_par_dac_search_100_000_000           ... bench: 208,881,578 ns/iter (+/- 6,585,320)
test benches::bench_single_thread_dac_search_100         ... bench:         509 ns/iter (+/- 32)
test benches::bench_single_thread_dac_search_1_000       ... bench:       4,937 ns/iter (+/- 230)
test benches::bench_single_thread_dac_search_10_000      ... bench:      49,399 ns/iter (+/- 2,188)
test benches::bench_single_thread_dac_search_100_000     ... bench:     500,447 ns/iter (+/- 45,231)
test benches::bench_single_thread_dac_search_1_000_000   ... bench:   5,097,678 ns/iter (+/- 383,958)
test benches::bench_single_thread_dac_search_10_000_000  ... bench:  54,605,203 ns/iter (+/- 1,891,877)
test benches::bench_single_thread_dac_search_100_000_000 ... bench: 535,013,469 ns/iter (+/- 9,935,891)

First of all, that if there that chooses between parallel join or non-parallel seems to bring some overhead, probably the compiler can’t optimize something there(my guess would be that the match statement directly uses the temporary without the if statement).

Without __builtin_prefetch(for which this was written for initially) the cache doesn’t like at all the jumping around… Without prefetching I should really consider a linear search when the slices get smaller.


#10

And final results:

100_000_000 parallel divide-and-conquer + same thread dac on small slices:
208,881,578 ns/iter (+/- 6,585,320)

100_000_000 single-threaded divide-and-conquer:
535,013,469 ns/iter (+/- 9,935,891)

100_000_000 parallel divide-and-conquer + same thread linear search on small slices: 
26,680,350 ns/iter (+/- 20,480,476)

100_000_000 single-threaded linear search: 
68,723,322 ns/iter (+/- 6,588,568)

I will try to use par_iter with chunks that fit in L2, I bet it will outperform any of these, will see…
I should try to make prefetch from llvmint work for this, I have a feeling it will perform the best then.


#11

You might find this paper on “Array Layouts for Comparison-Based Searching” interesting.