Optimising my binary search routine

I decided this week to look at BTreeMap::get performance for my BTreeMap, compared to the standard implementation, and see if there was any tuning I could do to make it faster. The key function is the search routine; my current code is this:

    pub fn search<Q>(&self, key: &Q) -> Result<usize, usize>
    where
        K: Borrow<Q> + Ord,
        Q: Ord + ?Sized,
    {
        unsafe {
            let (mut i, mut j) = (0, self.len());
            let p = self.p.as_ptr().cast::<K>();
            let mut m = (i + j) >> 1;
            while i != j {
                match (*p.add(m)).borrow().cmp(key) {
                    Ordering::Equal => return Ok(m),
                    Ordering::Less => i = m + 1,
                    Ordering::Greater => j = m,
                }
                m = (i + j) >> 1;
            }
            Err(i)
        }
    }

(From btree_experiment/src/vecs.rs at main · georgebarwood/btree_experiment · GitHub)

You can see I compute m = (i + j) >> 1 in two places; for some reason this makes it faster (by about 10%), but only for "small" BTreeMaps. I have a vague idea why it might be faster, but I don't really know for sure.

I have tried unrolling the loop (that was slower) and various other things that didn't work. I would appreciate any ideas I could try. I haven't managed to look at the assembly code yet, and even if I did I am not sure it would tell me much, but if you can get a disassembly and post it, I would appreciate it (or I can work on this myself).

Here are my current criterion benchmark results for Get:

     Running benches\crit_bench.rs (target\release\deps\crit_bench-0d8ee347214130b6.exe)
Gnuplot not found, using plotters backend
Get/Exp/50              time:   [316.46 ns 317.26 ns 318.17 ns]
                        change: [-1.1394% +0.1916% +1.4191%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  7 (7.00%) high severe
Get/Std/50              time:   [373.89 ns 376.35 ns 379.24 ns]
                        change: [-3.9764% -0.5858% +2.7481%] (p = 0.75 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
Get/Exp/100             time:   [743.96 ns 748.37 ns 754.08 ns]
                        change: [-5.3205% +0.2289% +5.7422%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high severe
Get/Std/100             time:   [897.95 ns 901.67 ns 906.53 ns]
                        change: [-4.9771% -1.2141% +2.8100%] (p = 0.55 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
Get/Exp/200             time:   [2.1381 µs 2.1501 µs 2.1623 µs]
                        change: [-7.1591% -1.5093% +4.2636%] (p = 0.66 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
Get/Std/200             time:   [2.0529 µs 2.0653 µs 2.0819 µs]
                        change: [-5.4777% -0.7770% +5.5166%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
Get/Exp/500             time:   [10.405 µs 10.471 µs 10.545 µs]
                        change: [-11.610% -8.1001% -4.1150%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
Get/Std/500             time:   [11.559 µs 11.596 µs 11.637 µs]
                        change: [-8.5353% -5.6441% -2.9301%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe
Get/Exp/1000            time:   [35.821 µs 35.862 µs 35.906 µs]
                        change: [+0.0180% +0.9955% +1.9643%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
Get/Std/1000            time:   [35.607 µs 36.024 µs 36.596 µs]
                        change: [+0.2618% +1.3021% +2.4252%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

Summary: my code is faster for small maps; for bigger maps the times are pretty close.

The benchmark code is here. It simply makes maps of different sizes (50, 100, 200, ...), then does a get for each key and checks the result.

m = (i + j) >> 1 is the classic way to write a binary search, but it's worse because it means two loop induction variables, and thus the compiler doesn't understand it as well. (It's also just straight-up wrong for ZSTs, as the addition overflows.) Use a single width that shrinks as the loop variable instead.

If you want a good binary search, here's a 46-page paper to read all about it:


…of which the implementation is here.

I was dimly aware of the Eytzinger approach, but it seemed to me it wouldn't work out for a general implementation of a BTree where elements may be inserted at any time? Or is that not the case?

Umm, i and j are just usize, and they are never at all big (the maximum branch in my BTree is 512, so neither can be more than that), so overflow is not a concern.

I did try to use w = j - i, but so far it has been slower rather than faster. Code:

        unsafe {
            let p = self.p.as_ptr().cast::<K>();
            let (mut i, mut w) = (0, self.len());
            while w != 0 {
                let m = i + w / 2;
                match (*p.add(m)).borrow().cmp(key) {
                    Ordering::Equal => return Ok(m),
                    Ordering::Less => {
                        w -= m + 1 - i;
                        i = m + 1;
                    }
                    Ordering::Greater => {
                        w = m - i;
                    }
                }
            }
            Err(i)
        }

As for the "branch-free" code, I am not sure how I can implement that in Rust; it seems to need subtraction of pointers, and I don't see how to do that. Also, I am not sure how to get Rust to emit a conditional move. Hmm...!

Here's the last binary search I did:

I never fully finished that project, so don't necessarily trust it, but IIRC I confirmed that it let the compiler completely unroll the searches over the data slices, when the (i + j)/2 versions didn't.

Remember that if you write the search over slices with raw indices, LLVM doesn't necessarily know their bounds, and you can end up pessimizing things: it sees i + j, doesn't necessarily have a way to know that it can't overflow, and thus has to optimize assuming that it might wrap around to 0 sometimes, making the loop continue for longer than it otherwise looks like it might.

Ah, this seems to work quite well, by having less branching. Not sure if using multiplication is the best way here, but the results look promising...

        unsafe {
            let (mut i, mut j) = (0, self.len());
            let p = self.p.as_ptr().cast::<K>();
            let mut m = (i + j) >> 1;
            while i != j {
                let c = (*p.add(m)).borrow().cmp(key);
                if c == Ordering::Equal { return Ok(m); }
                let x1 = usize::from(c == Ordering::Less);
                let x2 = 1 - x1;
                i = x1 * (m + 1) + x2 * i;
                j = x2 * m + x1 * j;
                m = (i + j) >> 1;
            }
            Err(i)
        }

Benchmark results:

     Running benches\crit_bench.rs (target\release\deps\crit_bench-0d8ee347214130b6.exe)
Gnuplot not found, using plotters backend
Get/Exp/50              time:   [234.74 ns 235.35 ns 236.09 ns]
                        change: [-27.494% -26.574% -25.682%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
Get/Std/50              time:   [373.22 ns 374.89 ns 376.73 ns]
                        change: [-2.0734% +0.1124% +2.1867%] (p = 0.92 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Get/Exp/100             time:   [555.64 ns 568.52 ns 583.05 ns]
                        change: [-32.408% -28.048% -24.050%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe
Get/Std/100             time:   [901.58 ns 922.58 ns 964.23 ns]
                        change: [-2.7798% +0.0266% +2.7040%] (p = 0.99 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
Get/Exp/200             time:   [1.8307 µs 1.8392 µs 1.8487 µs]
                        change: [-19.544% -13.413% -6.3986%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Get/Std/200             time:   [2.0511 µs 2.0601 µs 2.0695 µs]
                        change: [-6.3215% +0.3292% +7.0143%] (p = 0.93 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Get/Exp/500             time:   [16.188 µs 16.220 µs 16.254 µs]
                        change: [+46.847% +52.195% +56.657%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
Get/Std/500             time:   [11.607 µs 11.661 µs 11.730 µs]
                        change: [-4.8282% -2.3463% -0.0228%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
Get/Exp/1000            time:   [36.896 µs 37.135 µs 37.449 µs]
                        change: [-0.2234% +0.8199% +1.8450%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
Get/Std/1000            time:   [35.530 µs 35.601 µs 35.680 µs]
                        change: [-0.5364% +0.2184% +0.9229%] (p = 0.57 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

But oddly there is a regression for n=500 but not n=1,000 (or n = 50, 100, 200, which all show a fairly dramatic speed-up). That may just be some random event on my computer, will check... nope, it seems to be real. That is pretty odd.

[I wonder if the extra code has caused a relatively catastrophic change in what remains cached for n=500. If so, I think it would be fairly specific to a narrow range of cases. I will run the benchmark for different n to see what happens.]

... well, as seen below, it seems to affect quite a wide range of sizes. Hard to decide what to do on this basis... the performance "drops off a cliff" going from n=300 to n=350.

 Running benches\crit_bench.rs (target\release\deps\crit_bench-0d8ee347214130b6.exe)
Gnuplot not found, using plotters backend
Get/Exp/50              time:   [234.75 ns 236.04 ns 237.83 ns]
                        change: [-0.0699% +0.2927% +0.7159%] (p = 0.16 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe
Get/Std/50              time:   [375.37 ns 377.45 ns 379.74 ns]
                        change: [-3.8720% -0.8865% +2.1937%] (p = 0.58 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
Get/Exp/100             time:   [541.31 ns 543.12 ns 545.04 ns]
                        change: [-1.3200% +1.9891% +6.4894%] (p = 0.37 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Get/Std/100             time:   [901.38 ns 904.95 ns 909.08 ns]
                        change: [-3.4053% +0.4917% +4.6150%] (p = 0.82 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Get/Exp/200             time:   [1.8206 µs 1.8271 µs 1.8343 µs]
                        change: [-10.650% -4.1413% +1.1593%] (p = 0.24 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
Get/Std/200             time:   [2.0540 µs 2.0838 µs 2.1224 µs]
                        change: [-6.1608% +0.1793% +6.8513%] (p = 0.96 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
Get/Exp/250             time:   [2.3760 µs 2.3862 µs 2.3968 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Get/Std/250             time:   [2.6708 µs 2.6873 µs 2.7075 µs]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
Get/Exp/300             time:   [3.2298 µs 3.2893 µs 3.3520 µs]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Get/Std/300             time:   [3.3939 µs 3.4384 µs 3.5007 µs]
Found 22 outliers among 100 measurements (22.00%)
  1 (1.00%) high mild
  21 (21.00%) high severe
Get/Exp/350             time:   [6.6261 µs 6.7114 µs 6.8349 µs]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
Get/Std/350             time:   [4.4833 µs 4.5178 µs 4.5534 µs]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Get/Exp/400             time:   [10.012 µs 10.046 µs 10.083 µs]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
Get/Std/400             time:   [7.0445 µs 7.0739 µs 7.1030 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
Get/Exp/450             time:   [13.382 µs 13.443 µs 13.514 µs]
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
Get/Std/450             time:   [9.3293 µs 9.5827 µs 9.8885 µs]
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) high mild
  12 (12.00%) high severe
Get/Exp/500             time:   [16.190 µs 16.239 µs 16.301 µs]
                        change: [-2.2310% -1.0622% -0.0654%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
Get/Std/500             time:   [11.561 µs 11.607 µs 11.668 µs]
                        change: [-1.3452% -0.4819% +0.2363%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
Get/Exp/550             time:   [18.932 µs 18.961 µs 18.993 µs]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
Get/Std/550             time:   [13.934 µs 13.977 µs 14.035 µs]
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe
Get/Exp/600             time:   [20.907 µs 20.938 µs 20.971 µs]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
Get/Std/600             time:   [15.916 µs 15.965 µs 16.024 µs]
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
Get/Exp/650             time:   [23.142 µs 23.182 µs 23.226 µs]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
Get/Std/650             time:   [18.733 µs 18.815 µs 18.926 µs]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
Get/Exp/700             time:   [25.260 µs 25.305 µs 25.357 µs]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Get/Std/700             time:   [21.060 µs 21.094 µs 21.132 µs]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Get/Exp/1000            time:   [36.762 µs 36.843 µs 36.930 µs]
                        change: [-0.2305% +0.4036% +1.1171%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
Get/Std/1000            time:   [35.573 µs 35.643 µs 35.722 µs]
                        change: [-0.6492% +0.0465% +0.7119%] (p = 0.90 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe



The way to get a conditional move is to calculate both things, then do `x = if c { a } else { b };`. The important thing is just to calculate them before the branch, so LLVM knows you're fine with calculating both (and you're not depending on the condition for correctness, somehow).

Don't make LLVM have to figure out that you're trying to fake a select; it just turns all these things back into a select anyway. There are probably some it doesn't know about yet, but don't use those, because they'll just fix them (`(-1 + b) & x` → `select b, 0, x` should work for `!range {0, 2}` too, not just for `i1` · Issue #63321 · llvm/llvm-project · GitHub).


Ok, like this?

        unsafe {
            let (mut i, mut j) = (0, self.len());
            let p = self.p.as_ptr().cast::<K>();
            let mut m = (i + j) >> 1;
            while i != j {
                let c = (*p.add(m)).borrow().cmp(key);
                if c == Ordering::Equal { return Ok(m); }
                i = if c == Ordering::Less { m + 1 } else { i };
                j = if c == Ordering::Less { j } else { m };
                m = (i + j) >> 1;
            }
            Err(i)
        }

Performance is similar to my previous "branch-less" code: good for n < 350, worse for n = 500, similar for n = 1,000. Hmm, well, at least I have some ideas. The difficult question seems to be how to devise realistic benchmarks that correspond to what would be happening in a real, useful program. Maybe there is no way, and the answer is to measure the performance of real programs (which is probably not easy, unfortunately!).

Figure 7 here shows something a little similar, although the switch-over seems to be at a higher level for n than I am seeing: https://arxiv.org/pdf/1509.05053

My latest try is to duplicate the re-calculation of m, like this:

    #[inline]
    pub fn search<Q>(&self, key: &Q) -> Result<usize, usize>
    where
        K: Borrow<Q> + Ord,
        Q: Ord + ?Sized,
    {
        self.search_to(self.len as usize, key)
    }

    pub fn search_to<Q>(&self, mut j: usize, key: &Q) -> Result<usize, usize>
    where
        K: Borrow<Q> + Ord,
        Q: Ord + ?Sized,
    {
        unsafe {
            let mut i = 0;
            let p = self.p.as_ptr().cast::<K>();
            let mut m = j >> 1;
            while i != j {
                match (*p.add(m)).borrow().cmp(key) {
                    Ordering::Equal => return Ok(m),
                    Ordering::Less => {
                        i = m + 1;
                        m = (i + j) >> 1;
                    }
                    Ordering::Greater => {
                        j = m;
                        m = (i + j) >> 1;
                    }
                }
            }
            Err(i)
        }
    }

This produces a significant speed-up for n=500 and not much change otherwise.

Bench results:

     Running benches\crit_bench.rs (target\release\deps\crit_bench-0d8ee347214130b6.exe)
Gnuplot not found, using plotters backend
Get/Exp/50              time:   [299.24 ns 299.73 ns 300.29 ns]
                        change: [-22.674% -18.609% -14.370%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Get/Std/50              time:   [374.90 ns 376.42 ns 378.01 ns]
                        change: [-1.8933% +0.9594% +3.8434%] (p = 0.53 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
Get/Exp/100             time:   [712.81 ns 715.81 ns 719.22 ns]
                        change: [-4.7275% -0.9854% +2.0751%] (p = 0.62 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
Get/Std/100             time:   [903.31 ns 908.09 ns 913.56 ns]
                        change: [-4.2717% -2.3433% -0.6821%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Get/Exp/200             time:   [2.1426 µs 2.1684 µs 2.2080 µs]
                        change: [-6.8339% -1.2516% +5.5570%] (p = 0.70 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
Get/Std/200             time:   [2.0615 µs 2.0800 µs 2.1048 µs]
                        change: [-2.9414% +2.2927% +8.5069%] (p = 0.50 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
Get/Exp/500             time:   [8.5478 µs 8.6176 µs 8.6949 µs]
                        change: [-9.3115% -4.4286% -0.1545%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Get/Std/500             time:   [11.614 µs 11.647 µs 11.681 µs]
                        change: [-3.1544% -1.0322% +1.5158%] (p = 0.38 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Get/Exp/1000            time:   [36.678 µs 36.760 µs 36.847 µs]
                        change: [-5.7959% -4.2036% -2.7871%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
Get/Std/1000            time:   [35.557 µs 35.642 µs 35.733 µs]
                        change: [-1.3522% -0.5770% +0.2180%] (p = 0.14 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

I managed to find the assembler for the search loop. Here it is searching for the constant key 1946 (chosen to help me figure out where the code was...). I am a little rusty on the x86 instruction set; it has changed a bit since I knew it about 40 years ago!

.LBB19_35:
	leaq	(%r11,%rax), %rsi
	movq	%rax, %r10
	movq	%rsi, %rax
	cmpq	%r10, %r11
	je	.LBB19_49
.LBB19_34:
	shrq	%rax
	xorl	%esi, %esi
	cmpl	$1946, (%rcx,%rax,4)
	setne	%sil
	cmovll	%r9d, %esi
	cmpb	$1, %sil
	je	.LBB19_35
	movzbl	%sil, %r11d
	cmpl	$255, %r11d
	jne	.LBB19_50
	leaq	(%rax,%r10), %rsi
	incq	%rsi
	incq	%rax
	movq	%rax, %r11
	movq	%rsi, %rax
	cmpq	%r10, %r11
	jne	.LBB19_34
.LBB19_49:

I think I understand most of what it is doing here. This corresponds to the last code I posted, but here it is again:

            let mut i = 0;
            let p = self.p.as_ptr().cast::<K>();
            let mut m = j >> 1;
            while i != j {
                match (*p.add(m)).borrow().cmp(key) {
                    Ordering::Equal => return Ok(m),
                    Ordering::Less => {
                        i = m + 1;
                        m = (i + j) >> 1;
                    }
                    Ordering::Greater => {
                        j = m;
                        m = (i + j) >> 1;
                    }
                }
            }
            Err(i)

Remember that LLVM is in SSA form; it'll run common subexpression elimination and make these use the same SSA variable anyway. So my bet is that what you're seeing from things like this is more evidence of unreliable benchmarking than any real difference.

If you're trying to instruction-peek, you might want to use llvm-mca (the LLVM Machine Code Analyzer) instead of trying to time things, because μbenchmarking is nigh impossible to do well.


You may well be right; a lot of strange things go on, weird regressions for apparently no reason. You can get a kind of local stability, with bench runs giving consistent results for a while, but come back the next day and everything has changed. Even the performance of the standard library BTreeMap is highly unstable from day to day, with huge swings in performance, even though it hasn't changed.

Standard talk link: https://youtu.be/r-TLSBdHe1A?t=717

Turns out that things like how many environment variables you have can make more of a difference than the code permutations you're doing.


The latest strange thing I am seeing is that when I run a particular test directly, my code is slightly faster (by maybe 10%), but when I run the same test under criterion, criterion shows the std code is slightly faster (by maybe 10%).

The test:

    fn std_split_off_test(n: usize) {
        let mut m = std::collections::BTreeMap::<usize, usize>::new();
        let mut c = m.lower_bound_mut(Bound::Unbounded);
        for i in 0..n {
            c.insert_before(i, i).unwrap();
        }
        assert!(m.len() == n);
        let m2 = m.split_off(&(n / 2));
        assert!(m2.len() == n / 2);
        assert!(m.len() + m2.len() == n);
    }

Criterion reports 390.97 µs for std (n=10,000), but when I run a direct test, I get more like 500 µs. I guess this could be due to the inlining being done differently, due to the test suite having more tests.

If you haven't already, it's worth inspecting the assembly for those functions. Even if you don't have any particular skill at reading assembly, it will directly tell you which things are getting inlined.

cargo-show-asm is a good tool for this, and with the --rust option it will interleave the original Rust source code (as far as this is possible).

I have looked at some assembly. I was just able to confirm that, by removing all the other tests, the run-time of the standard library test being run (std_split_off_test) completely changes, by almost a factor of two (from 55 microseconds to 30 microseconds).

In general, benchmarking a generic container module like BTreeMap seems of questionable value; the times seem to depend not so much on what the test is doing or how it is coded as on what other code is included in the executable (even if it isn't executed). Still, it has helped me think about the code and make some good changes. But what the benchmark results themselves actually mean is very dubious.
