Fast I/O in Rust

In a different thread the question arose how to perform fast line-based or token-based I/O, with a focus on standard input/output in Rust, as would be required in a competitive programming setting. In such settings, problems must be solved within a limited time budget (often 1s with inputs/outputs of 1MB or more), so even a few 100's of ms spent in unnecessary I/O operations can matter.

This is a common problem in many languages. Typical problems to be addressed include:

  • Input + Output: ensuring that efficient buffering and locking is used
  • Input: ensuring that line endings are efficiently recognized
  • Input + Output: ensuring that stdout and stdin aren't unnecessarily tied, i.e., that reading from stdin doesn't cause stdout to be flushed.
  • Input + Output: reducing unnecessary encoding/decoding from chars to bytes and vice versa
  • Input: fast parsing of numbers and strings using delimiters while avoiding the use of backtracking-based regular expression engines

For instance, in a C++ program, you may find cin.tie(nullptr); ios::sync_with_stdio(false);; in a Java program, you may find people avoiding the use of java.util.Scanner in lieu of special-purpose templates, and wrapping System.out with a PrintWriter() wrapper to avoid the write(2) system call that would otherwise result from each println() call (or using a StringBuffer to collect the output in memory).

How does Rust fare in this regard? Do the default facilities work "well enough," or is additional plumbing required? Does it make a significant difference?

6 Likes

So to follow up on my own post, a first experiment with "out of the box" Rust shows that on Linux it doesn't seem to follow the usual convention to not flush stdout on a newline unless stdout is connected to a terminal. This program

fn main() {
    println!("Hello");
    println!("World");
}

issues 2 write() system calls when standard output does not refer to a terminal :frowning:
(By comparison,

#include <stdio.h>
int
main()
{
    printf("Hello\n");
    printf("World\n");
}

does a single system call.)

This means that calling println!() in a loop will be very slow when the amount to be output is large. This defect isn't mentioned in the documentation - rather, it's recommended there to be used for the "primary output of the program."

Following the documentation to std::io::stdout also doesn't show it, and this version:

use std::io::{self, Write};
  
fn main() {
    let stdout = std::io::stdout();
    let mut stdout = stdout.lock();
    stdout.write_all(b"Hello\n");
    stdout.write_all(b"World\n");
}

still suffers from a lack of buffering.

Using a BufWriter like so:

use std::io::{self, Write, BufWriter};

fn main() {
    let stdout = std::io::stdout();
    let mut stdout = BufWriter::new(stdout.lock());
    stdout.write_all(b"Hello\n");
    stdout.write_all(b"World\n");
}

on the other hand buffers unconditionally.

ps: I will note that the convention of fully buffering "if and only if the stream can be determined not to refer to an interactive device" in ISO C and POSIX is at least one decade older than Rust and has been taught in textbooks since 2003. :older_man:

1 Like

There's some related discussion in this issue:

5 Likes

With respect to the fastest way to iterate over lines, depending on your requirements, the default approach in std is likely not the fastest. The bstr crate has some routines for that, which will get you pretty close to an optimal generic answer for that problem.

3 Likes

crates aren't available in competitive programming. It has to be compilable just with rustc. If additional logic is required to obtain decent performance, the contestant has to type it in, which costs valuable time. An exception are online (virtual) contests where contestants can make use of canned code, but external crates are off limit even here.

So we're looking for a solution that's based only on the std:: namespace (I assume, though I'm not quite certain how rustc works.)

Sounds pretty weird to me. You're going to have a pretty rough time, given that std (I believe) doesn't use a fully vectorized memchr implementation to find newline bytes. That's only available in the crate ecosystem.

If competitive programming bans all crates, then either Rust probably isn't a good fit for performance oriented competitions beyond certain tasks, or the competition should revise its rules.

4 Likes

Those are the rules for all languages, Rust is no exception.

Ok, here's a naive attempt at a classic problem, which is to read a graph and perform a DFS. The problem is Where is my Internet; here's my naive attempt at it:

// Solution to https://open.kattis.com/problems/wheresmyinternet
// @author godmar
//
// In naive Java (using java.util.Scanner, ArrayDeque<Integer>, and System.out.println()), 
// this problem times out with the 1sec limit.
//
// How will it perform in naive Rust?
//
use std::io::{self, BufRead};

// would love to generalize into n ints
fn read2ints<R: BufRead>(stdin: &mut R) -> (usize, usize)
{
    let mut line = String::new();
    stdin.read_line(&mut line).unwrap();
    let line = line.trim(); // trim trailing newline

    // how do I destructure the iterator returned by split?
    let lv : Vec<usize> = line.split(" ").map(|x| x.parse().unwrap()).collect();
    match lv[..] {
        [n, m] => return (n, m),
        _ => panic!("wrong input")
    };
}

fn main() {
    let stdin = io::stdin();
    let mut stdin = stdin.lock();   // only the locked version implements BufRead
    let (n, m) = read2ints(&mut stdin);

    // allocate n vectors
    let mut adj : Vec<Vec<usize>> = vec![vec![]; n];

    // read adjacency lists
    for _i in 0..m {
        let (a, b) = read2ints(&mut stdin);
        adj[a-1].push(b-1);
        adj[b-1].push(a-1);
    }

    // DFS
    let mut marked = vec![false; n];
    let mut dfs = vec![0];
    marked[0] = true;
    let mut left = n - 1;
    while !dfs.is_empty() {
        let u = dfs.pop().unwrap();
        for v in &adj[u] {
            if !marked[*v] {
                left -= 1;
                marked[*v] = true;
                dfs.push(*v);
            }
        }
    }

    // naive, repeated calls to println!
    if left == 0 {
        println!("Connected");
    } else {
        for i in 0..n {
            if !marked[i] {
                println!("{}", i+1);
            }
        }
    }
}

I couldn't figure out how to destructure the iterator returned by the split into a tuple or array, hence the awkward read2ints function. I also couldn't find a token-based reader (a la scanf or cin or java.util.Scanner or java.util.StringTokenizer) in the std library.

The good news is that this naive implementation passes in 0.28s (the limit is 1s). By contrast, an equally naive Java implementation (via java.util.Scanner and System.out.println(), which similarly to Rust fails to buffer) times out due to I/O overhead.
So this is tentative good news, though I'm curious how to write the I/O code more elegantly.

I'd probably skip the collect (which allocates) and write it like this. You also want to have a reusable buffer instead of allocating a new String every time.

use std::io::BufRead;

fn read2ints<R: BufRead>(stdin: &mut R, buffer: &mut String) -> Option<(usize, usize)> {
  buffer.clear();
  stdin.read_line(buffer).ok()?;

  let mut words = buffer.split(" ");
  let first = words.next()?.parse().ok()?;
  let second = words.next()?.parse().ok()?;

  if words.next().is_some() { todo!("Do we care about extra bits?"); }
  
  Some((first, second))
}

Correct. That sort of parsing is usually left to 3rd party crates like regex or nom. Although you can get pretty far using std::str::FromStr (the thing powering some_string.parse()) and could write your own token-based parser in about 100 lines.

Another version. I tested not at all. The main gist is the same as above, reuse your buffers and don't allocate things you don't need.

Mainly I concentrated on the I/O function, but see the change to writeln! as well. There are a couple other minor adjustments, where I left the original code commented. You could maybe gain a tiny bit by preallocating dfs some space (or maybe not).

You could write wildly unsafe code that ignores the possibility of bad input by assuming UTF8 without checking and doing unchecked Vec accesses. (Probably your C competitors aren't doing bounds checks.)

This one is 0.23s instead of 0.28s.

Generally, microoptimizations like reusing buffers in a language that does uses explicit memory management with a general purpose allocator (or the mentioned vectorized memchr, etc.) are not needed in the sense that they won't tilt the scale. It's the major things - like using a backtracking regex engine for detecting line endings (java.util.Scanner), or running with line-buffered stdout, etc. etc. that impact the competitive programmer to the degree that they must code around it.

I would still have hoped for an equivalent to Python's

n, m, k = map(int, input().split())

in the standard library.

For cases where you care more about conciseness than micro-optimization, you can do things like this:

fn read2ints<R: BufRead>(stdin: &mut R) -> Option<[usize; 2]> {
    stdin.lines().next()?.ok()?
        .trim().split(" ").map(str::parse)
        .collect::<Result<Vec<_>,_>>().ok()?
        .try_into().ok()
}

Certainly not as elegant as the Python version, since Rust's explicit error handling and noisy generics still get in the way a bit.

I know nothing of organised competitive programming.

But on a couple of occasions I have gotten involved in informal coding battles. Mostly spurred by friendly "language war" discussions. Some task would be thought up that would allow making comparisons of performance, readabillty, code size, portability, etc, etc.

Very soon the informal rules of our challenge included not using any libraries that were not a standard part of the language. This arises naturally as the idea is to compare the languages. If one can just pull in a library that does the job and end up writing only a couple of lines of code to use it that does not say very much about actually writing code in whatever language.

It also ensures the code being compared is actually written in the language in question. For example using a module in Python or Javascript that is actually written on C is not making the comparison required.

Later I discovered things like Rosetta Code, same idea there.

Of course. There is more nuance to what I'm saying. But I'm not going to derail this thread by turning it into a list of complaints about the rules of competitive programming.

1 Like

It's been around far longer than that, and it's been causing bugs the entire time. It's not a great idea to have your code behave differently in testing and in production.

It can be error prone, correct. I'm not sure what it has to do with "testing vs production" though.

I would have stuck to the convention. The default case would have worked efficiently, and errors would have been limited to certain scenarios.

Also, keep in mind that the #1 source of errors (failing to flush before a newline) is retained in Rust's default line buffering.

That's another problem. To solve it we need to disable the buffering at all by default or flush the buffer after the every print!() which would makes it slower. I'm confused, isn't it the opposite direction of your argument in this thread?

I think that droundy's argument was that the traditional way of dealing with stdio (as it's known from C) is error prone, and they are correct. That's why it has its own section in CSApp3e, for instance. A lot of things can go wrong/not match the user's expectations, for instance, ./prg > file &; tail -f file works differently than some users may expect. Other examples include when a program does not expect to be used as part of a pipe, which can lead to deadlock, etc. Or users use printf style debugging and then wonder why their statements appear to execute out of order (they didn't include a newline).

The traditional choice - line buffer if standard output is connected to a terminal where a human is watching, and do efficient full buffering otherwise - still seems an ok choice to me, particularly in a Unix environment, see Kernighan. The alternative Rust chose - always line buffer - imposes an unexpected performance penalty in pretty much most scenarios (programs whose standard output goes to an interactive terminal are in the extreme minority, and I doubt anyone does testing in such an environment either, other than in playground situations). But at least it's consistent, and not making a bad choice in the other direction as Python 2 did (which changed its I/O encoding based on whether the program was connected to an interactive terminal or not, which truly led to chaos.)

So assuming it's not just noise, an 18% improvement (5% of your total budget). It's definitely not order-of-magnitude scale savings, so I guess the answer to your question is that the default facilities work well enough, and this additional plumbing doesn't make a significant difference (at least for the given problem).

I realized later that you could let _ = writeln!(...) to avoid a potential unwrap branch (but that won't change the scale of the improvement).

If you can assume your output is small enough to buffer entirely to memory before outputting, and think it would be significantly better than line buffering, you could avoid the line buffering with:

    // 🙂
    buffer.clear();
    let mut buffer = buffer.into_bytes();
    // ...
    let _ = writeln!(&mut buffer, "...");
    // ...
    let _ = stdout.write_all(&buffer);

(See PR 72808.)

There has been some work recently on making stdout/stderr line buffering configurable, see PR 78515.

3 Likes