Performance help between c code and rust code

I try to port a c-implementation of a content defined chunking algorythm based on a rolling rabin checksum.

I have got it to work but I don't get the same performance as the c code. Maybe someone can look into my rust code and can give me some advises for a faster implementation.

If something is horrible wrong, so let me know.

You can find the c code at GitHub - silvio/rabin-cdc at master and the rust-code can find at GitHub - silvio/rabin-cdc: Fast implementation of Content Defined Chunking (CDC) based on a rolling Rabin Checksum in C.
The performance for the c-code with a 50MB file is real(0.38s) user(0.33s) sys(0.05s) and for the rust code real(0.58s) user(0.53s) sys(0.06s). Measured with time on a shell. The executable and testfile are on a tmpfs.

The C-code was optimized via -O2, the rust binary was compiled via --release.

Here are my rust implementation:

#![feature(test)]

use std::io::{stdin, Read};
use std::fmt;

const POLYNOMIAL: u64 = 0x3DA3358B4DC173;
const POLYNOMIAL_DEGREE: usize = 53;
const WINSIZE: usize = 64;
const AVERAGE_BITS: usize = 20;
const MINSIZE: usize = 512*1024;
const MAXSIZE: usize = 8 * 1024 * 1024;

const MASK: u64 = (1<<AVERAGE_BITS)-1;
const POLSHIFT: u64 = (POLYNOMIAL_DEGREE-8) as u64;

struct Table {
    modt: [u64;256],
    outt: [u64;256],
}

impl Table {
    fn new() -> Table {
        let mut t = Table {
            modt: [0u64;256],
            outt: [0u64;256],
        };

        t.outt = Table::generate_outt(POLYNOMIAL, WINSIZE);
        t.modt = Table::generate_modt(POLYNOMIAL);

        return t;
    }

    fn generate_outt(pol: u64, winsize: usize) -> [u64;256] {
        let mut outt = [0u64;256];

        for b in 0usize .. 256 {
            let mut hash = 0u64;

            hash = Table::append_byte(hash, b as u8, pol);
            for _ in 0 .. (winsize-1) {
                hash = Table::append_byte(hash, 0, pol);
            }
            outt[b as usize] = hash;
        }

        return outt;
    }

    fn generate_modt(pol: u64) -> [u64;256] {
        let mut modt = [0u64;256];

        let k = Table::deg(pol);

        for b in 0usize .. 256 {
            modt[b] = Table::modulo(((b << k) as u64), pol);
            modt[b] |= (b << k) as u64;
        }

        return modt;
    }

    fn deg(p: u64) -> i64 {
        let mut mask = 0x8000000000000000u64;
        for i in 0 .. 64 {
            if (mask & p) > 0 {
                return 63-i;
            }

            mask >>= 1;
        }

        return -1;
    }

    fn modulo(x: u64, p: u64) -> u64 {
        let mut out = x;
        while Table::deg(out) >= Table::deg(p) {
            let shift = Table::deg(out) - Table::deg(p);
            out = out ^ (p << shift);
        }

        return out;
    }

    fn append_byte(hash: u64, b: u8, pol: u64) -> u64 {
        let mut out = hash.clone();

        out <<= 8 as u64;
        out |= b as u64;

        return Table::modulo(out, pol);
    }
}

struct Chunk {
    start: usize,
    length: usize,
    cutfp: u64,
}

impl Chunk {
    fn new() -> Chunk {
        Chunk {
            start: 0,
            length: 0,
            cutfp: 0,
        }
    }
}

impl fmt::Display for Chunk {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "Chunk(s:{:010} l:{:08} c:{:016x})", self.start, self.length, self.cutfp)
    }
}

struct Rabin<'a> {
    window: [u8; WINSIZE],
    wpos: usize,
    count: usize,
    pos: usize,
    start: usize,
    digest: u64,
    table: &'a Table,
}

impl<'a> Rabin<'a> {
    fn new(table: &'a Table) -> Rabin {
        let mut r = Rabin {
            window: [0u8; WINSIZE],
            wpos: 0,
            count: 0,
            pos: 0,
            start: 0,
            digest: 0,
            table: table,
        };

        r.reset();
        return r;
    }

    fn reset(&mut self) {
        for el in self.window.iter_mut() {
            *el = 0;
        }
        self.wpos = 0;
        self.count = 0;
        self.digest = 0;

        self.rabin_slide(1);
    }

    #[inline]
    fn rabin_slide(&mut self, b: u8) {
        let out = self.window[self.wpos];
        self.window[self.wpos] = b;
        self.digest ^= self.table.outt[out as usize];
        self.wpos = (self.wpos+1) % WINSIZE;
        self.rabin_append(b);
    }

    #[inline]
    fn rabin_append(&mut self, b: u8) {
        let index: u64 = self.digest >> POLSHIFT;
        self.digest <<= 8;
        self.digest |= b as u64;
        self.digest ^= self.table.modt[index as usize];
    }

    fn rabin_next_chunk(&mut self, buf: &[u8], start: usize) -> (Chunk, i64) {
        for i in start .. buf.len() {
            let b = buf[i];

            self.rabin_slide(b);

            self.count += 1;
            self.pos += 1;

            if ((self.count >= MINSIZE) && (self.digest & MASK) == 0) || self.count >= MAXSIZE {
                let c = Chunk {
                    start: self.start,
                    length: self.count,
                    cutfp: self.digest,
                };

                let pos = self.pos;
                self.reset();
                self.start = pos;
                self.pos = pos;

                return (c, (i+1) as i64);
            }
        }

        return (Chunk::new(), -1);
    }

    fn rabin_finalize(&self) -> Option<Chunk> {
        if self.count == 0 {
            return None;
        }

        Some(Chunk {
            start: self.start,
            length: self.count,
            cutfp: self.digest,
        })
    }
}

impl<'a> fmt::Display for Rabin<'a> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "Rabin(wp:{}, c:{}, p:{} s:{} d:{:x})", self.wpos, self.count, self.pos, self.start, self.digest)
    }
}


fn main() {

    let t = Table::new();
    let mut r = Rabin::new(&t);

    let mut chunks = 0usize;

    let mut buf_obj: Vec<u8> = Vec::new();

    let mut bytes = 0usize;
    let mut stdin = stdin();
    let mut calc_length = 0usize;

    let len = match stdin.read_to_end(&mut buf_obj) {
        Ok(t) => { t },
        Err(why) => { panic!(why) },
    };

    if len <= 0 {
        panic!("read problem");
    }

    for buf in buf_obj.chunks(1024*1024) {
        bytes += buf.len();

        let mut start = 0usize;

        loop {
            let (chunk, remaining) = r.rabin_next_chunk(buf, start);
            if remaining < 0 {
                break;
            }

            println!("{:7} {:016x}", chunk.length, chunk.cutfp);

            start = remaining as usize;

            calc_length += chunk.length;

            chunks += 1;
        }
    }

    let mut last_chunk = false;
    let chunk = match r.rabin_finalize() {
        Some(c) => { chunks += 1; last_chunk = true; c },
        None => { Chunk::new() },
    };

    if last_chunk {
        println!("{:7} {:016x}", chunk.length, chunk.cutfp);
        calc_length += chunk.length;
    }

    let mut avg = 0usize;
    if chunks > 0 {
        avg = bytes / chunks;
    }

    println!("{} chunks, average chunk size {}, sum(chunk.len)={}", chunks, avg, calc_length);
}

#[cfg(test)]
mod test {
    extern crate test;

    use super::*;

    #[bench]
    fn bench_refimpl_outt(b: &mut test::Bencher) {
        let mut x = [0u64;256];
        b.iter(|| {
            x = ::Table::generate_outt(::POLYNOMIAL, ::WINSIZE);
        })
    }

    #[bench]
    fn bench_refimpl_modt(b: &mut test::Bencher) {
        let mut x = [0u64;256];
        b.iter(|| {
            x = ::Table::generate_modt(::POLYNOMIAL);
        })
    }
}

I must say that the difference is relatively small, but I see what you mean. I see that you are taking the input from stdin. Make sure that the measurements are not significantly affected by the IO speed. Other than that, I can't find anything obvious, except that you are using the indexing operator where you can use iterators. Array indexing is bounds checked and this will give a small overhead each time an array element is accessed. I don't know how big the impact on this particular program is, but I'll still give a couple of examples:

for b in 0usize .. 256 {
    modt[b] = Table::modulo(((b << k) as u64), pol);
    modt[b] |= (b << k) as u64;
}

can be changed to

for (b, modt) in modt.iter_mut().enumerate() {
    *modt = Table::modulo(((b << k) as u64), pol);
    *modt |= (b << k) as u64;
}

See that .enumerate() part? It will add an index counter, so what you get on each iteration is (index, &mut element) and it should be optimized to the same thing as a classic C loop. You can do more things like this. For example:

for i in start .. buf.len() {
    let b = buf[i];
    //...

can be rewritten as

for (i, b) in buf.iter().enumerate().skip(start) {
    //...

or

for (i, b) in buf[start..].iter().enumerate() {
    //...

The second variant takes a smaller slice from buf and may actually be better, depending how .skip is optimized.

An other thing I notices is in modulo, where you calculate Table::deg(..) twice for each loop. Once again, I have no idea how big the impact is, but still...

fn modulo(x: u64, p: u64) -> u64 {
    let mut out = x;
    let mut deg_out = Table::deg(out); //we will update this each time out changes
    let deg_p = Table::deg(p); //p doesn't change, so we will only do this once
    while deg_out >= deg_p {
        let shift = deg_out - deg_p;
        out = out ^ (p << shift);
        deg_out = Table::deg(out);
    }

    return out;
}

This is what I could find by looking at your code, like this. I haven't tested it, so take it with a grain (or maybe a scoop) of salt. It may, at least, give you some ideas :smile:

2 Likes

As @ogeon suggested I would make sure to take out IO as part of the problem. For example change both C and Rust version to read the file to memory so the program is bound on memory and CPU only (not on IO)

Next steps:

  • Try to do a bit more fine graned performance testing of the code by using timers around the various code parts
  • I like to look at disassembly to make sure the compiler is doing what I expect it to do. I'm not sure how much experience you have with that but sometimes the compiler may generate something unexpected that might slow down your code. This is true for C/C++ compilers also.

Hope this helps a bit.

1 Like

Yeah, I would recommend to do the timing from within the program and only measure the calculations and not the IO. It's also a good idea to make multiple measurements and average them, since it tends to fluctuate a little.