How could one make this search more cache friendly?

LilianMoraru · February 28, 2016, 9:28pm

I've written this "search" function(that can blow in somebody's face with a stack overflow because of recursion), more akin to contains but returns the position of the "first found" element instead of a bool.
This function won't be used anywhere, it's written just for fun... Here it is:

extern crate rayon;

fn search<T>(slice: &[T], search_for: &T) -> Option<usize>
    where T: PartialEq + Sync
{
    let len = slice.len();

    if len == 0 {
        return None;
    }
    
    let it = len / 2;
    
    if &slice[it] == search_for {
        return Some(it);
    } else {
        if len == 1 {
            return None;
        } else {
            // __builtin_prefetch(&slice[it + ((len - it) / 2)], 0, 1);
            // __builtin_prefetch(&slice[it / 2], 0, 1);
            let (left, right) = slice.split_at(it);
            
            let (left_result, right_result) = if it > 64 {
                rayon::join(|| search(&left, &search_for),
                            || search(&right, &search_for))
            } else {
                (search(&left, &search_for),
                 search(&right, &search_for))
            };
            
            match (left_result, right_result) {
                (Some(position), _) => return Some(position),
                (_, Some(position)) => return Some(it + position),
                _                   => return None,
            }
        }
    }
}

I imagine the cache doesn't like all those references. Because Rust moves by default you basically have to do it this way or ask for T: Clone(or T: Copy?). This will impose more limitations on who can use the function.
Of course I would like to provide 2 functions called search, T: PartialEq + Sync and T: PartialEq + Sync + Clone but from what I know, Rust doesn't allow that. Would it be possible to overpass that through traits or there is just no way of using the same function name in the same module?
Is there a way in Rust to ask the compiler to put calls to prefetch? I've found llvmint but the prefetch function wants *mut i8 or int8_t *(C/C++) instead of intptr_t(C/C++).

LilianMoraru · March 13, 2016, 11:30am

Ok, I will do a shameless bump too cause I really hoped somebody had answers to these questions(How cache friendly are all these pointers, do you need to copy instead then?)

dawirstejeck · March 13, 2016, 12:16pm

Is your input sorted in any way? If not, why do you do a recursive splitting into threads? You could split right away and then let the threads do their work.

If your input is sorted you could first check out binary trees and if that is not enough you could look at b-trees with slices that fit nicely into you cache. And do make use of cachegrind .

Firstly I hope that I understood your situation correctly and secondly I hope I could give some useful advice .

LilianMoraru · March 13, 2016, 12:18pm

It's not sorted, otherwise I would of not searched in both directions and go only into one direction, towards the value I am searching for(binary search).
I wanted to play with rayon, that's why I split the workload and search in parallel(until there are to few elements to use threads, and instead make it run on the same thread).

Rayon offers now another API. Now I could split the workload without doing it recursively.

dawirstejeck · March 13, 2016, 12:22pm

Yes, I would avoid doing it recursively. Try to detect a good size for slices that fit into your cache and send them to your search threads. That should work nicely .

LilianMoraru · March 13, 2016, 12:31pm

While it is more efficient than jumping through a sorted array for binary search, I've seen somewhere that with __builtin_prefetch, jumping through a sorted array is the fastest and most reliable when the number of elements grow.
I could probably do that with llvmint but I don't understand why it asks the address to be *mut i8...

dawirstejeck · March 13, 2016, 12:40pm

I have no experience with llvmint, so I cannot help you in that respect. By I have a feeling (and of course I can be misled) that your algorithm could need some more improvement. Do you do benchmarking? What is your speed gain over plain old single-thread linear search? Or a 2-thread linear search? How does it behave in regard to your threshold for thread-splitting? If you do a lot of individual searches in the same array you should try to sort it in advance. (This is of course not benefical if you only search once in every array.)

LilianMoraru · March 13, 2016, 12:51pm

I will make a claim that the parallel version is considerably faster than single-threaded or 2-threaded search(depends on the size of the array) and now I will go and test...
Btw, linear search you mean, 1-by-1 instead of a divide-and-conquer approach?
For very small arrays I expect a linear search to be faster. I could change on the small slice to do a linear search instead of splitting.

LilianMoraru · March 13, 2016, 5:49pm

Preliminary tests...

test benches::bench_par_dac_search_100                   ... bench:         727 ns/iter (+/- 38)
test benches::bench_par_dac_search_1_000                 ... bench:       7,335 ns/iter (+/- 457)
test benches::bench_par_dac_search_10_000                ... bench:      73,319 ns/iter (+/- 5,150)
test benches::bench_par_dac_search_100_000               ... bench:     736,736 ns/iter (+/- 36,512)
test benches::bench_par_dac_search_1_000_000             ... bench:   3,753,264 ns/iter (+/- 15,903,998)
test benches::bench_par_dac_search_10_000_000            ... bench:  27,317,432 ns/iter (+/- 22,288,705)
test benches::bench_par_dac_search_100_000_000           ... bench: 208,881,578 ns/iter (+/- 6,585,320)
test benches::bench_single_thread_dac_search_100         ... bench:         509 ns/iter (+/- 32)
test benches::bench_single_thread_dac_search_1_000       ... bench:       4,937 ns/iter (+/- 230)
test benches::bench_single_thread_dac_search_10_000      ... bench:      49,399 ns/iter (+/- 2,188)
test benches::bench_single_thread_dac_search_100_000     ... bench:     500,447 ns/iter (+/- 45,231)
test benches::bench_single_thread_dac_search_1_000_000   ... bench:   5,097,678 ns/iter (+/- 383,958)
test benches::bench_single_thread_dac_search_10_000_000  ... bench:  54,605,203 ns/iter (+/- 1,891,877)
test benches::bench_single_thread_dac_search_100_000_000 ... bench: 535,013,469 ns/iter (+/- 9,935,891)

First of all, that if there that chooses between parallel join or non-parallel seems to bring some overhead, probably the compiler can't optimize something there(my guess would be that the match statement directly uses the temporary without the if statement).

Without __builtin_prefetch(for which this was written for initially) the cache doesn't like at all the jumping around... Without prefetching I should really consider a linear search when the slices get smaller.

LilianMoraru · March 13, 2016, 6:37pm

And final results:

100_000_000 parallel divide-and-conquer + same thread dac on small slices:
208,881,578 ns/iter (+/- 6,585,320)

100_000_000 single-threaded divide-and-conquer:
535,013,469 ns/iter (+/- 9,935,891)

100_000_000 parallel divide-and-conquer + same thread linear search on small slices: 
26,680,350 ns/iter (+/- 20,480,476)

100_000_000 single-threaded linear search: 
68,723,322 ns/iter (+/- 6,588,568)

I will try to use par_iter with chunks that fit in L2, I bet it will outperform any of these, will see...
I should try to make prefetch from llvmint work for this, I have a feeling it will perform the best then.

jonhoo · October 22, 2017, 10:13pm

You might find this paper on "Array Layouts for Comparison-Based Searching" interesting.

Topic		Replies	Views
How can i improve my rust code organization? help	5	342	May 30, 2023
Rust book chapter 13: difficulty understanding how to improve the Cacher type help	6	1847	October 29, 2019
Rust book chapter 13: try introducing more generic parameters to increase the flexibility of the `Cacher` functionality help	2	453	November 16, 2021
Freelancer needed for a function (pay $80) help	17	754	August 3, 2019
How to implement searching a given key in Bianry Search Tree? help	9	865	January 12, 2023

How could one make this search more cache friendly?

Related Topics