Function to compare bytes taking Closure parameter

I just wrote this function to find differences between two byte lists. Would appreciate any comments on the code in detail, is it correct for example, can it be improved, also on whether there is a better way ( perhaps avoiding a closure parameter, maybe write it as an iterator instead??? ). I haven't defined a function that takes a closure parameter before, so I am still a bit unfamiliar with this.

/// Function to compare bytes. Length is taken from a. Calls d for each range that is different.
/// Interior equal ranges less than min_eq are ignored.
pub fn diff<F>(a: &[u8], b: &[u8], min_eq: usize, mut d: F)
where
    F: FnMut(usize, usize),
{
    let mut i = 0;
    let n = a.len();
    while i < n && a[i] == b[i] {
        i += 1;
    }
    while i < n {
        let start = i;
        let mut end;
        loop {
            while i < n && a[i] != b[i] {
                i += 1;
            }
            end = i;
            // Check that following equal range is at least min_eq.
            while i < n && a[i] == b[i] {
                i += 1;
            }
            if i - end >= min_eq || i == n {
                break;
            }
        }
        if end > start {
            d(start, end - start);
        }
    }
}

( Published here: diff in rustdb::util - Rust )

Untested. Does this work:

pub enum Foo {
    Same_Started_At(usize),
    Diff,
}

impl Foo {
    fn process(&mut self, i: usize, t: bool, min_eq: usize, d: &fn(usize, usize)) {
        use Foo::*;
        match (t, &self) {
            (true, Diff) => *self = Same_Started_At(i),
            (false, Same_Started_At(start)) => {
                let start = *start;
                *self = Diff;
                if (i - start >= min_eq) {
                    d(start, i)
                };
            }
            _ => { /* no op*/ }
        }
    }
}

fn foo(a: &[u8], b: &[u8], min_eq: usize, d: &fn(usize, usize)) {
    let mut f = Foo::Diff;
    let n = a.len().min(b.len());
    for i in 0..n {
        f.process(i, a[i] == b[i], min_eq, d);
    }
    f.process(n, false, min_eq, d); // tg force process last chunk
}

The goal here is to refactor the logic into Foo, so the main driver loop just runs through the list, sending it a[i] == b[i].

I think you need a bit more state for Foo, the start of the output range. Also the parameter d should be a closure not a function.

[ Since posting I noticed that my condition if end > start is always true so can be removed. ]

Sorry, why is a usize not enough? What else do we need to store ?

As i increases, we need to remember where the current diff range began, and also where the current equal range began. This compiles, but doesn't work:

enum Foo {
    Init,
    Same(usize,usize),
    Diff(usize),
}

impl Foo {
    fn process<F>(&mut self, i: usize, eq: bool, min_eq: usize, d: &mut F) where
    F: FnMut(usize, usize), 
    {
        use Foo::*;
        match (eq, &self) {
            (false, Init) => *self = Diff(i),
            (true, Diff(dstart)) => *self = Same(i,*dstart),
            (false, Same(start,dstart)) => {
                let (start,dstart) = (*start,*dstart);                
                if i - start >= min_eq {
                    d(dstart, i);
                    *self = Diff(i);
                }
                else
                {
                    *self = Diff(dstart);
                }
            }
            _ => { /* no op*/ }
        }
    }
}

///
pub fn diff<F>(a: &[u8], b: &[u8], min_eq: usize, mut d: F)
where
    F: FnMut(usize, usize),
{
    let mut f = Foo::Init;
    let n = a.len().min(b.len());
    for i in 0..n {
        f.process(i, a[i] == b[i], min_eq, &mut d);
    }
    f.process(n, false, min_eq, &mut d); // tg force process last chunk
}

Sorry, I mis read the problem. If d is called on regions that differ, what is the point of min_eq? (Why do we need to ignore those ranges, if d is only called on ranges that differ) ?

I think I understand the problem now. Sorry, I don't think I can make this cleaner by turning it into a state machine.

What I am doing with it is to minimise byte size of the representation of updates to the database.

If there is a short range of equality, it is not worth representing that range explicitly.
See here for where it is called:

I got it to work ( at least it passes my tests ), by having a special function to flush at the end:

enum Foo {
    Init,
    Same(usize,usize),
    Diff(usize),
}

impl Foo {
    fn process<F>(&mut self, i: usize, eq: bool, min_eq: usize, d: &mut F) where
    F: FnMut(usize, usize), 
    {
        use Foo::*;
        match (eq, &self) {
            (false, Init) => *self = Diff(i),
            (true, Diff(dstart)) => *self = Same(i,*dstart),            
            (false, Same(start,dstart)) => {
                let (start,dstart) = (*start,*dstart);                
                if i - start >= min_eq {
                    d(dstart, start-dstart);
                    *self = Diff(i);
                }
                else
                {
                    *self = Diff(dstart);
                }
            }
            _ => { /* no op*/ }
        }
    }
    fn flush<F>(&mut self, i: usize, d: &mut F) where
    F: FnMut(usize, usize), 
    {
        use Foo::*;
        match &self {
            Init => {},
            Diff(dstart) => d(*dstart, i-dstart),            
            Same(start,dstart) => {
                d(*dstart, *start-*dstart);
            }
        }
     }
}

///
pub fn diff<F>(a: &[u8], b: &[u8], min_eq: usize, mut d: F)
where
    F: FnMut(usize, usize),
{
    let mut f = Foo::Init;
    let n = a.len().min(b.len());
    for i in 0..n {
        f.process(i, a[i] == b[i], min_eq, &mut d);
    }
    f.flush(n, &mut d); // tg force process last chunk
}

Not entirely elegant though!

I agree. There is not much to gain by the stated machine inversion I proposed; I think your original code is cleaner.

1 Like