Macro to do binary search

Suppose I have a fixed, constant, sorted list of numbers, for example:

11,13,15,17, 19,23,27,31

and I want a function(x) ( x in 11..=31 ) which computes the index of the last number <= x using binary search. I could write it out by hand:

fn find_index(x: u8) -> u8 {
    // hand-written comparison tree over 11,13,15,17, 19,23,27,31
    if x < 19 {
        if x < 15 {
            if x < 13 { 0 } else { 1 }
        } else {
            if x < 17 { 2 } else { 3 }
        }
    } else {
        if x < 27 {
            if x < 23 { 4 } else { 5 }
        } else {
            if x < 31 { 6 } else { 7 }
        }
    }
}

What I was wondering is whether a Rust macro could automate this. My actual list is 30 numbers, and writing it out by hand is a little messy.

Not 100% sure, but I think this is basically the code that a match statement will generate? (It can match ranges using a..=b => {}.)
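
For the original eight-number list, that would look something like this (a sketch, assuming x stays in the stated 11..=31 range):

fn find_index(x: u8) -> u8 {
    match x {
        11..=12 => 0,
        13..=14 => 1,
        15..=16 => 2,
        17..=18 => 3,
        19..=22 => 4,
        23..=26 => 5,
        27..=30 => 6,
        _ => 7, // 31 and anything above
    }
}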

Rather than code expansion for 30 numbers, why not just use slice::binary_search?


Hopefully to go faster. Also, the standard binary search fails if the number is not in the list.

I should have mentioned: if the list has runs of contiguous numbers, ideally it should use arithmetic to speed things up. My actual list ( 29 numbers, not 30 as I stated ) is:

3,4,5,6, 7,8,9,10, 11,13,15,17, 19,23,27,31, 35,43,51,59, 67,83,99,115, 131,163,195,227, 258

But at this stage I was just wondering if it is theoretically possible for a Rust macro to do something this complicated.

Both kinds of Rust macro (macro_rules! and procedural macros) are Turing-complete, so of course it is possible. (Which does not mean that it would be easy. :grinning:)


I'd measure that, because code bloat can hurt too.

The "failure" of binary search is up to you -- the Err is an index where you could insert sorted, so the prior index would hold your number less than x.


I just tried this ( I hadn't thought of it ). It doesn't appear to do a binary search ( which doesn't necessarily mean it's sub-optimal ). Code:

fn get_off_code(x: u16) -> usize {
    match x {
        3..=10 => x as usize - 3,
        11..=12 => 8,
        13..=14 => 9,
        15..=16 => 10,
        17..=18 => 11,
        19..=22 => 12,
        23..=26 => 13,
        27..=30 => 14,
        31..=34 => 15,
        35..=42 => 16,
        43..=50 => 17,
        51..=58 => 18,
        59..=66 => 19,
        67..=82 => 20,
        83..=98 => 21,
        99..=114 => 22,
        115..=130 => 23,
        131..=162 => 24,
        163..=194 => 25,
        195..=226 => 26,
        227..=257 => 27,
        _ => 28,
    }
}

It’s not exactly a binary search, but on 1.46 (the current beta), you can use const evaluation to calculate a lookup table at compile time. I haven’t profiled it, but I expect it to be reasonably competitive performance-wise: (Playground)

const fn find_index(x: u8) -> u8 {
    const VALS: &[u8] = &[11, 13, 15, 17, 19, 23, 27, 31];
    const MAX: u8 = VALS[VALS.len() - 1];
    // built entirely at compile time: for in-range x, LUT[x] holds
    // the index of the last value <= x
    const LUT: [u8; MAX as usize] = {
        let mut lut: [u8; MAX as usize] = [0; MAX as usize];
        let mut i_val: usize = 0;
        let mut i_lut: usize = 0;
        while i_lut < lut.len() {
            // advance i_val while the next value still fits under i_lut
            while VALS[i_val + 1] <= i_lut as u8 {
                i_val += 1;
            }
            lut[i_lut] = i_val as u8;
            i_lut += 1;
        }
        lut
    };
    // x == MAX (and anything above) falls outside the table
    if x as usize >= LUT.len() {
        (VALS.len() - 1) as u8
    } else {
        LUT[x as usize]
    }
}

I did a quick test, comparing match to binary search. Binary search was marginally faster ( 29 msec for 10 million iterations versus 31 msec ).

Playground.
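
The harness had roughly this shape (a hypothetical sketch, not the exact playground code; summing the results keeps the optimizer from deleting the loops):

use std::time::Instant;

// time ~10 million lookups over the valid input range,
// accumulating results so the calls cannot be optimized away
fn bench(name: &str, f: impl Fn(u16) -> usize) {
    let start = Instant::now();
    let mut total = 0usize;
    for _ in 0..40_000 {
        for x in 3..=258u16 {
            total += f(x);
        }
    }
    println!("{}: {:?}, total={}", name, start.elapsed(), total);
}

fn main() {
    bench("match", get_off_code); // get_off_code from the earlier post
}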

I had no idea you could do that, cool!

I updated your playground link to include the const LUT, and it appears significantly faster (Playground):

time elapsed=28 milli sec.
time elapsed=26 milli sec.
time elapsed=4 milli sec.
t1=176250000 t2=176250000 t3=176250000

This is a pretty ideal scenario for it, though: the entire table fits in 5 cache lines and is unlikely to be evicted in such a tight loop.


Correct me if I'm wrong, but shouldn't the compiler inline the result of the const fn, eliminating the lookup table completely? At least when the parameter supplied is also const.

Yes, if the argument is const it will simply inline the result. In this case, I don’t know if the optimizer unrolled the loop enough to do that.
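
For example, with the const fn find_index above, a const argument forces the whole computation at compile time (IX is a hypothetical name):

// evaluated entirely during compilation; no table lookup remains
// in the generated code for this constant
const IX: u8 = find_index(20);

fn main() {
    assert_eq!(IX, 4); // 19 is the last value <= 20, at index 4
}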

I think this will be hard to beat. I do have another list, where the lookup table might be hogging too much memory:

1,2,3,4, 5,7,9,13, 17,25,33,49, 65,97,129,193, 257,385,513,769, 1025,1537,2049,3073, 4097,6145,8193,12289, 16385,24577

[ These numbers relate to RFC 1951 - DEFLATE Compressed Data Format ]

A plain lookup table for that one could be cumbersome for memory footprint, but you might be able to come up with a hybrid approach that works well: something like a lookup table for 0-255 and a search for the rest.
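
A rough sketch of that hybrid (hypothetical; a 256-entry table covers the dense low end of the list, binary_search covers the rest):

const VALS: [u16; 30] = [
    1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
    257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145, 8193,
    12289, 16385, 24577,
];

fn build_lut() -> [u8; 256] {
    let mut lut = [0u8; 256];
    let mut i = 0;
    for x in 1..256u16 {
        // advance to the last value <= x
        while i + 1 < VALS.len() && VALS[i + 1] <= x {
            i += 1;
        }
        lut[x as usize] = i as u8;
    }
    lut
}

fn find_index(x: u16, lut: &[u8; 256]) -> usize {
    if x < 256 {
        lut[x as usize] as usize
    } else {
        match VALS.binary_search(&x) {
            Ok(i) => i,
            Err(i) => i - 1, // entry before the insertion point
        }
    }
}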

With such short lists, a sequential scan might be faster than a binary search: it has less potential to mess up the processor’s branch predictor.

It may also be possible to structure the problem so that SIMD can be applied, working on multiple comparison values in parallel.
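
One way to set that up is to phrase the scan as counting how many values are <= x; that is branch-free, and it is the same shape a SIMD version would use with packed compares (a scalar sketch, reusing VALS from the hybrid sketch above):

fn find_index_scan(x: u16) -> usize {
    // each comparison contributes 0 or 1, with no data-dependent
    // branches, so the autovectorizer can compare several entries at once
    VALS.iter().map(|&v| (v <= x) as usize).sum::<usize>() - 1 // assumes x >= VALS[0]
}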


For the second list ( 1,2,3,4, 5,7,9,13, 17,25,33,49, 65,97,129,193, 257,385,513,769, 1025,1537,2049,3073, 4097,6145,8193,12289, 16385,24577 ), based on regularities in the numbers, I came up with this:

fn get_dist_code(mut x0: u16) -> usize {
    x0 -= 1;
    let x = x0 as usize;
    let mut bits = 16 - x0.leading_zeros() as usize;
    if bits < 3 {
        x
    } else {
        bits -= 1;
        (x + ((bits << 1) << (bits - 1)) - (1 << bits)) >> (bits - 1)
    }
}

All Rust macros can do this fairly easily; in fact, as easily as a function can, by emulating that function's body.

However, the First Rule of macrology states: "Never use a macro when a function will do". The reasons for that are:

  1. Macros don't compose as well as functions.
  2. Sometimes people reach for macros to avoid function-call overhead. The response to this is that it's probably premature optimization. But even if it's not, function calling is cheap, and if it's still too expensive then an inlined function is almost always the superior choice.
  3. Macros aren't typed like functions are, meaning that generally macro inputs and outputs are not typechecked as such (though the code generated by the macro is typechecked of course). Functions provide safety through that feature.
  4. Macros are also more difficult to write, comprehend and debug than functions. For example, using a debugger to debug a macro is like using a neutered WebView (think a browser UI widget without any fancy features like dev tools) to debug a web client: a lot of information is lost in the deployment process, and you can't really access that while trying to use the debugger / WebView.

The better solution here seems to be using an iterative function or method that does the binary search.

So what are syntactic macros useful for?

  1. Cutting boilerplate code. If you see a code pattern repeating over and over, and it can't be captured by a function, then a macro is a good fit (see the sketch after this list). This is especially the case when the code pattern involves control flow, which cannot be captured by functions.
  2. Compile-time computation. Basically you compute a bunch of stuff while the macro is executing (i.e. at macro-expansion time, during compilation), and then use the results in some form in the generated binary.
  3. Code generation. For example, if you need to compute some types (not values of those types, but the types themselves), impl blocks (which happens often with derive macros), functions, methods etc., then you'll almost immediately end up in macro land.
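
As an illustration of point 1, a hypothetical sketch that wraps the iterative binary search so each table is declared exactly once:

macro_rules! make_find_index {
    ($name:ident, $t:ty, [$($v:expr),* $(,)?]) => {
        fn $name(x: $t) -> usize {
            const VALS: &[$t] = &[$($v),*];
            match VALS.binary_search(&x) {
                Ok(i) => i,
                Err(i) => i - 1, // last value <= x; underflows if x < VALS[0]
            }
        }
    };
}

make_find_index!(find_off_code, u8, [11, 13, 15, 17, 19, 23, 27, 31]);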

I haven’t tested the performance, but you could also try using a BTreeMap:

use lazy_static::lazy_static;
use std::borrow::Borrow;
use std::collections::BTreeMap;
use std::convert::TryInto;

fn make_table<T: Ord, Ix, E>(v: impl IntoIterator<Item = T>) -> Result<BTreeMap<T, Ix>, E>
where
    usize: TryInto<Ix, Error = E>,
{
    v.into_iter()
        .enumerate()
        .map(|(i, x)| Ok((x, i.try_into()?)))
        .collect()
}

fn find_index<'a, T, Ix, Q>(k: &Q, table: &'a BTreeMap<T, Ix>) -> Option<&'a Ix>
where
    T: Borrow<Q> + Ord,
    Q: Ord,
{
    table.range(..=k).next_back().map(|(_, ix)| ix)
}

lazy_static! {
    static ref TABLE1: BTreeMap<u8, u8> =
        make_table([11, 13, 15, 17, 19, 23, 27, 31].iter().copied()).unwrap();
}

fn find_ix1(x: u8) -> u8 {
    *find_index(&x, &TABLE1).unwrap()
}

lazy_static! {
    static ref TABLE2: BTreeMap<u16, u8> = make_table(
        [
            1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193, 257, 385, 513, 769, 1025,
            1537, 2049, 3073, 4097, 6145, 8193, 12289, 16385, 24577
        ]
        .iter()
        .copied()
    )
    .unwrap();
}

fn find_ix2(x: u16) -> u8 {
    *find_index(&x, &TABLE2).unwrap()
}

fn main() {
    for &x in &[11, 12, 13, 14, 15, 29, 30, 31] {
        println!("{} has ix {}", x, find_ix1(x));
    }
    println!("--------------");
    for &x in &[1, 2, 1000, 5000, 10000, 24576, 24577] {
        println!("{} has ix {}", x, find_ix2(x));
    }
    
    /* OUTPUT:
    
    11 has ix 0
    12 has ix 0
    13 has ix 1
    14 has ix 1
    15 has ix 2
    29 has ix 6
    30 has ix 6
    31 has ix 7
    --------------
    1 has ix 0
    2 has ix 1
    1000 has ix 19
    5000 has ix 24
    10000 has ix 26
    24576 has ix 28
    24577 has ix 29
    
    */
}

I am a bit surprised that my arithmetic method on the second list seems to come out slower than binary search ( by about 50% ). I haven't figured out why yet; the assembly instructions look fairly OK to me. It just goes to show it's very hard to predict what will be fast. Can anyone explain it?

Playground

Edit: I got it to go a bit faster, but still about 20% slower than standard binary search:

fn get_dist_code(mut x0: u16) -> usize {
    x0 -= 1;
    let x = x0 as usize;
    if x < 5 {
        x
    } else {
        let bits = 15 - x0.leading_zeros() as usize;
        (x + (((bits << 1) - 2) << (bits - 1))) >> (bits - 1)
    }
}

Revised playground