Reading from stdin: performance

I've been trying to improve Rust's performance on the Shootout benchmarks, and I'm currently looking at the reverse-complement test. It essentially boils down to reading in a large amount of data from stdin and iterating over it backwards. The first thing I've noticed is that Rust is getting creamed, in significant part just due to the time taken to read in the data. Compare the following programs:

C (cut down from the shootout):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main() {
        size_t buflen = 1024, end = 0;
        ssize_t len;
        char *buf = malloc(buflen);

        int in = fileno(stdin);
        while ((len = read(in, buf + end, buflen - end)) > 0) {
            end += len;
            if (end < buflen) break;
            buf = realloc(buf, buflen *= 2);
        }
        printf("%zu", end);
    }

Rust:

    use std::io::Read;

    fn main() {
        let mut stdin = std::io::stdin();
        let mut data = Vec::with_capacity(1024);
        stdin.read_to_end(&mut data).unwrap();
        println!("{}", data.len());
    }

With both compiled with optimisations on and run over a 250MB file, the C code runs in 0.16s and the Rust code in 0.5s. If I increase the initial buffer in the Rust program to 300MB, to avoid any reallocations while reading, it still takes 0.36s.

If, on the other hand, I initialise a small buffer early on, and then just overwrite its contents (excuse the poor error handling)...

    use std::io::Read;

    fn main() {
        let mut stdin = std::io::stdin();
        let mut data = [0u8; 100000];
        let mut len = 0;
        while let Ok(n) = stdin.read(&mut data) {
            if n == 0 {
                break;
            }
            len += n;
        }
        println!("{}", len);
    }

...it takes just 0.08s to read the file.

My conclusion from this is that the work needed to zero-initialise the buffer is hurting us. Indeed, if I run the following program, it takes 0.6s to run, without even reading from stdin!

    use std::iter;

    fn main() {
        let mut stdin = std::io::stdin();
        let mut data = Vec::with_capacity(1024 * 1024 * 250);
        data.extend(iter::repeat(0u8).take(1024 * 1024 * 250));
        println!("{}", data.len());
    }

So, all this said, is there a quick way to read a large buffer from stdin (or from a file)? Reusing the same buffer repeatedly as in my third code block isn't practical here, because I need to work backwards from the end of the file.

If there isn't such a way, is something likely to get added to the stdlib to do so? Or possibly to allocate a large zeroed buffer very fast? It would seem like something that would be a substantial benefit in this case.

Vec::with_capacity just sets a capacity; it doesn't put anything in the Vec&lt;T&gt;, so there's nothing to zero. However, it is heap-allocated, and your array is stack-allocated. Maybe that's the actual issue here?

Oh, and in the array version, you're reading 100000 bytes at a time, rather than 1024 bytes at a time, right?

Yeah, I realise with_capacity doesn't initialise anything - sorry, I should have been clearer. The zero-initialisation happens under the covers in the read_to_end method. read_to_end also adaptively increases the number of bytes read per iteration as it goes along, so it's not just reading 1024 bytes at a time.

edit: without any unsafe trickery, my understanding is that there's no way around this - because you can't get an array of uninitialised data in Rust. While that's obviously right and proper under normal circumstances, I was hoping for some pre-canned faster way for common performance sensitive operations like this.

edit2: I'm pretty confident that stack vs heap is not the issue here - the C buffer is heap allocated. Increasing the size of the Vec as things go along clearly has a cost, but even with an entirely preallocated buffer, the program still takes twice as long as the C one.
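
To illustrate, here's roughly the shape read_to_end has to take (a simplified sketch, not the actual stdlib code - the real thing also grows its chunk size adaptively - but it shows where the zeroing comes in):

    use std::io::{self, Read};

    fn read_to_end_sketch<R: Read>(r: &mut R, buf: &mut Vec<u8>) -> io::Result<usize> {
        let start = buf.len();
        loop {
            let len = buf.len();
            // Grow the Vec with zeroes so that `read` is only ever handed
            // initialised memory. This is the memset-like cost in question.
            buf.resize(len + 64 * 1024, 0);
            match r.read(&mut buf[len..]) {
                Ok(0) => {
                    buf.truncate(len);
                    return Ok(buf.len() - start);
                }
                Ok(n) => buf.truncate(len + n),
                Err(e) => {
                    buf.truncate(len);
                    return Err(e);
                }
            }
        }
    }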

Can you test your hypothesis that this is indeed because of zeroing memory? (It seems plausible to me, but testing it is a good idea.)

e.g.,

    const BUF_SIZE: usize = 250 * 1024 * 1024;
    let mut buf: Vec<u8> = Vec::with_capacity(BUF_SIZE);
    unsafe { buf.set_len(BUF_SIZE); }
    // loop to copy `stdin` bytes into `buf`

(Of course, this is not safe unless you know a priori that stdin has at least BUF_SIZE bytes, but it should at least confirm your suspicions.)

Also, you'll want to make sure that your buf.extend(iter::repeat(0).take(BUF_SIZE)) is actually compiling down to a memset. (It should.)

std::mem::uninitialized is there for when you absolutely need it. This case might be one of those times.

(I would probably do this rather than set_len(), what do you think @burntsushi?)

OK, I tried the following:

    let mut stdin = std::io::stdin();
    let mut data = Vec::with_capacity(BUF_SIZE);
    unsafe { data.set_len(BUF_SIZE); }
    let mut len = 0;
    while let Ok(n) = stdin.read(&mut data[len..len + 100000]) {
        if n == 0 {
            break;
        }
        len += n;
    }
    unsafe { data.set_len(len); }
    println!("{}", len);

It gave a runtime of 0.21s, which is a substantial improvement!

@steveklabnik: yeah, I'm aware that unsafe rust allows us to get uninitialised buffers - my concern here is that reading data in from a file/stdin is common enough that I would think there ought to be a fast, safe way to do it.

See the "Uninitialized memory" thread on Rust Internals for discussion about uninitialised memory and IO.

Thanks, that's helpful - so my digestion of that is that one could reasonably expect a safe fast read at some point in the future, just not right now?

@BurntSushi Any chance you could tell me how to check if the 0-initialisation is compiling down to memset? I kinda suspect it probably isn't for the standalone iterator example I gave, as 0.6s seems suspiciously high - and it would be nice to learn how to check this out.

I think this is mostly an ambition, not an expectation? Has it compiled to memset before?

You can write code like this in isolation, then check what it compiles to: playpen gist

Make sure release mode is enabled, hit the asm button and try to find out what it does. It looks like it does not compile to memset let alone a vectorized loop.
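
(If you'd rather check locally than on the playpen, the same thing works with a small standalone file - the file name here is just for illustration. Compile with `rustc -O --emit=asm fill.rs` and look for a call to memset in the generated `fill.s`.)

    // fill.rs - isolated test case for inspecting the generated asm.
    use std::iter;

    const BUF_SIZE: usize = 250 * 1024 * 1024;

    fn main() {
        let mut buf: Vec<u8> = Vec::with_capacity(BUF_SIZE);
        buf.extend(iter::repeat(0u8).take(BUF_SIZE));
        // Print the length so the buffer isn't optimised away entirely.
        println!("{}", buf.len());
    }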

If you want to know something funny, I made a quick contest of three safe ways to fill a vector, and guess which won on the benchmark? The push while loop. Then I tested two unsafe blocks too. Code here.

    running 5 tests
    test fillvec_extend ... bench:   2,609,647 ns/iter (+/- 60,692)
    test fillvec_memset ... bench:     273,759 ns/iter (+/- 44,029)
    test fillvec_push   ... bench:   1,031,940 ns/iter (+/- 14,751)
    test fillvec_resize ... bench:   2,611,206 ns/iter (+/- 52,026)
    test fillvec_setlen ... bench:     270,985 ns/iter (+/- 32,862)
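
For reference, roughly what the named variants do - these are my simplified sketches, not the exact benchmarked code (that's behind the link above); fillvec_resize uses Vec::resize, and fillvec_memset uses a direct memset-style write:

    use std::iter;

    const N: usize = 1024 * 1024;

    // All three sketches assume `v` was created with Vec::with_capacity(N).
    fn fillvec_extend(v: &mut Vec<u8>) {
        v.extend(iter::repeat(0u8).take(N));
    }

    fn fillvec_push(v: &mut Vec<u8>) {
        while v.len() < N {
            v.push(0);
        }
    }

    fn fillvec_setlen(v: &mut Vec<u8>) {
        unsafe { v.set_len(N); }
        for b in &mut v[..] {
            *b = 0;
        }
    }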

There's a tension here between pragmatism and trying to do it right. The right thing would be to make the Rust code that we want to write compile down to what we hope it would produce :smile: Pragmatism would have us add a more specialized, less generic API.

Ah - I was thinking that as memset is (afaik?) a libc function rather than an asm instruction, maybe there was a way to look at some LLVM output and get something interesting out of that.

It's a pretty striking difference in the benchmarks! Having read through huon's link, I can certainly understand the desire not to pass uninitialised memory to safe code - but do you think it would be a reasonable compromise to alter the read_to_end function to contain a bit of unsafe code that uses memset instead of extend? I'm assuming here that it's a much heavier job to make the iterator correctly compile down to memset.

edit: that said, vector 0-initialisation in general is a pretty serious performance concern, so perhaps (again, assuming the iterator optimisation is tough) it might make sense to have a method on Vec specifically for zero-initialisation? That does mean committing to an API change rather than just altering a bit of internal code though.

It is a libc function but it is kind of lower level than that — llvm must have it present, so if you don't have libc you will have to supply memset yourself anyway. The gain of calling it is of course that you call into a well optimized routine that will do all the platform appropriate optimizations like vectorization. The call to memset will be visible in the asm output.

You can see that llvm recognizes the for loop that sets the element to zero (testcase fillvec_setlen) and replaces it by a call to memset.

Oh I agree. I think we can definitely be a lot more pragmatic when it comes to libstd internals. No reason to not use memset (or equivalent for loop) here.

It could be as simple as a private function like this, which at least turned into memset for T=u8. T: Copy makes it safe to use (panic safety).

    pub fn fill<T: Copy>(v: &mut Vec<T>, element: T) {
        unsafe {
            let cap = v.capacity();
            v.set_len(cap);
            for ptr in &mut v[..] {
                *ptr = element;
            }
        }
    }
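
Usage for the case in this thread would then be something like this (sketch, assuming a u8 buffer and the BUF_SIZE constant from earlier):

    let mut data: Vec<u8> = Vec::with_capacity(BUF_SIZE);
    fill(&mut data, 0); // len == capacity, all zeroes, via memset
    // ...then read into &mut data[..] and truncate to the number of bytes read.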

I think I was just being overly hopeful. :slight_smile:

Ah we have an official (but unstable) interface to memset. It's std::intrinsics::write_bytes (unsafe, of course), but also std::slice::bytes::MutableByteVector::set_memory (safe, for &[u8]).
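
For completeness, a sketch of the memset route on a fresh buffer, here using std::ptr::write_bytes (which, as far as I know, forwards to the same intrinsic):

    const BUF_SIZE: usize = 250 * 1024 * 1024;

    let mut data: Vec<u8> = Vec::with_capacity(BUF_SIZE);
    unsafe {
        // memset the whole capacity, then claim it as initialised.
        std::ptr::write_bytes(data.as_mut_ptr(), 0, BUF_SIZE);
        data.set_len(BUF_SIZE);
    }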

Why not just override read_to_end/read_to_string for File and Stdin like we did for nth, count, and last for various Iterators?

That's a good idea actually, to remove the initialization where we know the client is well behaved. Do the traits allow this?

I made a PR just to fix the inefficient zeroing.

Unfortunately, it requires copying quite a bit of logic around (or somehow plumbing through a helper function). An alternative is to provide:

    trait Read {
        // ...
        unsafe fn read_uninitialized(&mut self, buf: *mut u8, len: usize) -> io::Result<usize> {
            let slice = slice::from_raw_parts_mut(buf, len);
            for ptr in &mut slice[..] {
                *ptr = 0;
            }
            self.read(slice)
        }
    }

And override this to not zero for well-behaved types.
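
For a well-behaved reader - one known never to read from `buf` before writing to it - the override would then be roughly this (hypothetical WellBehavedReader type, assuming the trait sketched above):

    impl Read for WellBehavedReader {
        unsafe fn read_uninitialized(&mut self, buf: *mut u8, len: usize) -> io::Result<usize> {
            // Skip the zeroing entirely; just hand read() the raw slice.
            self.read(slice::from_raw_parts_mut(buf, len))
        }
        // ...
    }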

Fab, thank you!

So, a side note here: reallocing seems to be really expensive compared to the C code. If I set up the Rust code so that it's performing as many allocations as the C code (and not memsetting), the Rust runtime is still twice as long as the C one. On profiling, half of the time is spent in jemalloc reallocing. In the C code, 1ms is spent in realloc.

Is this something that just happens to be a bad case for jemalloc over the default allocator (I'm on OSX fyi), or is there something else going on here? Short of changing allocator is there any way to fix it?
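
For reference, this is roughly the shape of the doubling read loop I mean (a sketch, not necessarily the exact code that was timed):

    use std::io::Read;

    fn main() {
        let mut stdin = std::io::stdin();
        let mut data: Vec<u8> = Vec::with_capacity(1024);
        let mut len = 0;
        loop {
            if len == data.capacity() {
                // Double the allocation, mirroring the C realloc(buf, buflen *= 2).
                let cur = data.capacity();
                data.reserve_exact(cur);
            }
            let cap = data.capacity();
            // Expose the uninitialised tail to read() - unsafe, sketch only.
            unsafe { data.set_len(cap); }
            match stdin.read(&mut data[len..]) {
                Ok(0) | Err(_) => break,
                Ok(n) => len += n,
            }
        }
        unsafe { data.set_len(len); }
        println!("{}", len);
    }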

Just a guess, but it may be related to this issue: "allow disabling dirty page purging via an API, and consider disabling it by default" (#18236). It at least provides more background on jemalloc.