Performance issue with M1 using thread pools and reading binary files, compared with Linux

Hi everybody,

I'm having a performance issue with a small test program that reads 200 pcap files of 100 MB each, looking for a specific IPv4 address. If I run this code on my MacBook Pro M1 Pro it takes 27.7 s, while my System76 Lemur Pro with a Core i5 runs it in 3.4 s. I don't know what would cause such a big performance difference!

I'm using rayon for the thread pool. Here is a code sample:

use std::fs::File;
use std::io::Read;
use std::time::SystemTime;
use byteorder::{BigEndian, ByteOrder}; // byteorder crate
use rayon::prelude::*; // rayon crate, for into_par_iter

// PCAP_PATH and PacketRef are defined elsewhere in my crate.
fn search_packet(file_id: usize) -> Result<(), std::io::Error> {
    let mut file = File::open(&format!("{}/{}.pcap", PCAP_PATH, file_id))?;
    let mut buffer = [0; 24];
    let mut data = Vec::new();
    
    let mut psize: usize;
    let mut read_size: usize = 0;
    let mut packet_count: usize = 0;

    // Skip the 24-byte pcap global header.
    file.by_ref().take(24).read(&mut buffer)?;

    let t_init = SystemTime::now();
    loop {
        // Read the 16-byte per-packet record header.
        read_size = file.by_ref().take(16).read(&mut buffer)?;
        if read_size != 16 {
            break;
        }
        psize = BigEndian::read_u32(&buffer[12..16]) as usize;
        data.resize(psize, 0);
        file.read_exact(&mut data).unwrap();

        let mut packet = PacketRef::new(0, 0, 0, 0, false);
        packet.set_packet(&data);

        if packet.src_ip() == 0xc0a803e6 { // 192.168.3.230
            packet_count += 1;
        }
    }
    println!("Packet count: {} time: {}", packet_count, t_init.elapsed().unwrap().as_secs_f32());
    Ok(())
}

fn main() {
    println!("Packet Reader testing");
    let t_init = SystemTime::now();

    (0..200).into_par_iter().for_each(|i| {
        search_packet(i).unwrap();
    });

    println!(
        "DB Execution time: {}s",
        t_init.elapsed().unwrap().as_secs_f32()
    );
    println!("Execution terminated");
}
  • Do you compile and run the program with optimizations enabled? (cargo run --release)
  • Do you verify that all iterations of the loop are executed on each platform? In particular, the following looks wrong:
    if read_size != 16 {
        break;
    }
    
    If the size you happen to read isn't exactly 16, this will exit the loop. You should read the documentation for read(): it isn't guaranteed to read as many bytes as the length of the destination slice. You are probably looking for read_exact() instead (see the sketch after this list).
  • Did you look at the run times using a profiler? Are you sure that most of the additional time is spent in threading and it's not simply the filesystem syscalls that are much slower?
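For illustration, a minimal sketch of that loop head using read_exact() (assuming file is any Read impl; the 16-byte record-header layout is taken from the code above):

use std::io::{ErrorKind, Read};

fn read_records(mut file: impl Read) -> std::io::Result<()> {
    let mut header = [0u8; 16];
    loop {
        // read_exact either fills all 16 bytes or returns an error;
        // ErrorKind::UnexpectedEof here is the normal end-of-file case.
        if let Err(e) = file.read_exact(&mut header) {
            if e.kind() == ErrorKind::UnexpectedEof {
                break;
            }
            return Err(e);
        }
        // ...parse the packet size from header[12..16] and read the body...
    }
    Ok(())
}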
5 Likes

Definitely use BufReader. You're doing 2 OS calls per loop iteration.
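For example, a minimal sketch of the change (the 64 KiB capacity is an arbitrary choice of mine, not something from the original post):

use std::fs::File;
use std::io::{BufReader, Read};

fn open_buffered(path: &str) -> std::io::Result<impl Read> {
    let file = File::open(path)?;
    // BufReader pulls large blocks into an in-memory buffer, so the many
    // small 16-byte header reads no longer cost a syscall each.
    Ok(BufReader::with_capacity(64 * 1024, file))
}

The rest of the loop stays the same, since BufReader implements Read too.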

Make sure you're using cargo build --release when testing - I get a 20x performance difference in micro-loops for a lot of my IO parsing code. I think you can run cargo test --release as well.

About the data.resize(psize, 0), I'm torn on what's the best thing to do. I THINK you're triggering dynamic allocation work on EVERY pass, whether shrinking or growing. When shrinking, it calls truncate internally, which frees the tail. When growing, it calls reserve, which should be fine, but it goes through a whole zero-init unnecessarily given the subsequent read_exact. I'd say use a grow-only model, if data.len() < psize { data.resize(psize, 0) }; that at least avoids the expensive truncation call and keeps the 'shunt' in the outer function.
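Concretely, a sketch of the grow-only read (a hypothetical helper of mine; data is the reused Vec<u8> from the loop):

use std::io::Read;

fn read_packet(file: &mut impl Read, data: &mut Vec<u8>, psize: usize) -> std::io::Result<()> {
    // Grow-only: keep the allocation across packets; zero-init only the
    // newly added tail, and never shrink (so no per-packet truncate).
    if data.len() < psize {
        data.resize(psize, 0);
    }
    // Fill exactly this packet's bytes at the front of the buffer.
    file.read_exact(&mut data[..psize])
    // Callers must then parse only &data[..psize]; bytes past psize are stale.
}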

I'm not 100% sure, but do you have a memory leak? Your resize will set the length to psize, retaining prior garbage and zero-extending. But then you read into a ref to the Vec<u8>?? That's a compiler error for me; you'd need &mut data[..] or equivalent (which I assume you have).

I assume the packet.set_packet is efficiently parsing the buffer.

Just a stylistic change: move all your lets to the line where they're used. It avoids them being mut, and years of SonarQube warnings can't all be wrong. :slight_smile:
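For instance, the first read in the loop could be declared at first use (same line as in the post above, just without the up-front mut):

// `read_size` no longer needs to be a `mut` declared before the loop:
let read_size = file.by_ref().take(16).read(&mut buffer)?;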

2 Likes

Hi,

Thank you for your reply. I always do my perf tests with --release. I understand your point about the memory re-allocation. My previous version used a fixed-size slice that was big enough to hold any packet, and the performance hit was not there. Instead of using rayon, I wrote a small function that splits the files into one chunk per core and runs the packet search on each thread (sketch below). I ran the code on Linux and the performance is excellent. But on the M1, running on all cores, it's very slow, like 15x slower than on my Linux laptop. If we compare the two laptops, the M1 Pro with 10 cores vs. the Core i5 with 4 cores/8 threads, the performance difference does not make sense. I notice on the M1 that as I increase the number of threads from 1 to 4 the performance is good, but above that it's very sluggish.
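Roughly, the manual chunking looks like this (a minimal sketch, not my exact code; search_packet is the function from the first post):

use std::thread;

fn run_chunked(num_threads: usize) {
    let ids: Vec<usize> = (0..200).collect();
    // One contiguous chunk of file ids per thread (last chunk may be shorter).
    let chunk_len = (ids.len() + num_threads - 1) / num_threads;
    thread::scope(|s| {
        for chunk in ids.chunks(chunk_len) {
            s.spawn(move || {
                for &id in chunk {
                    search_packet(id).unwrap();
                }
            });
        }
    }); // scope joins every thread before returning
}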

Just for the fun of it I wrote a C version with pthreads, and I have the same issue: the M1 is 10x slower, but single-threaded the performance is equal!!! At first I thought Rust was the problem, but now I think there is an issue with macOS and/or threads and/or I/O.

Again, thank you for taking the time to reply; I will try your BufReader suggestion.

.clear() (and .truncate(n) and similar) don't release allocated capacity. It's totally ok to use them in a loop.

It's https://doc.rust-lang.org/std/vec/struct.Vec.html#method.shrink_to_fit that can actually reduce capacity.
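A quick way to see this (a standalone snippet of mine, not from the code above):

fn main() {
    let mut v: Vec<u8> = vec![0u8; 1024];
    let cap_before = v.capacity();
    v.truncate(16);
    assert_eq!(v.len(), 16);
    assert_eq!(v.capacity(), cap_before); // truncate kept the allocation
    v.shrink_to_fit();
    assert!(v.capacity() >= 16); // only shrink_to_fit may give memory back
    println!("len {}, cap {}", v.len(), v.capacity());
}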

1 Like

I think I see my confusion (thanks for the comment):

pub fn truncate(&mut self, len: usize) {
    unsafe {
        if len > self.len {
            return;
        }
        let remaining_len = self.len - len;
        let s = ptr::slice_from_raw_parts_mut(self.as_mut_ptr().add(len), remaining_len);
        self.len = len;
        ptr::drop_in_place(s);
    }
}

It's dropping the individual elements, not resizing the allocation; you're right. Buuut... it looks like it's an effective no-op for a Vec<u8>, just with a lot of code. I'd be curious to see the output assembly.

When the element type is Copy (and thus !Drop), Vec::truncate easily compiles down to just updating self.len: https://rust.godbolt.org/z/97bzjTcba
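For reference, this is the kind of function that comparison is about; with optimizations, the body reduces to the length check plus storing the new length:

// For Copy elements like u8, drop_in_place on the tail is a no-op,
// so truncate is effectively just `if n <= len { len = n }`.
pub fn trunc_u8(v: &mut Vec<u8>, n: usize) {
    v.truncate(n);
}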

3 Likes