How to do low-friction text input in Rust?


#1

Hi! What is the most concise way of reading text data from the standard input if I am allowed to use only the standard library? Suppose I need to read a matrix in the following format (height and width on the first line, then the elements row by row):

2 3
92 42 62
0  1  2

In C++, I would do the following:

#include <iostream>
#include <vector>
#include <cstdint>

std::vector<std::vector<int32_t>> read_matrix() {
    size_t n;
    std::cin >> n;
    size_t m;
    std::cin >> m;

    std::vector<std::vector<int32_t>> result(n, std::vector<int32_t>(m, 0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < m; ++j) {
            std::cin >> result[i][j];
        }
    }

    return result;
}

In Java, there are Scanner and StreamTokenizer. In Python, there are .splitlines(), .split() and map(int, line). What should I do in Rust?

I’ve come up with

use std::io;
use std::io::prelude::*;

fn read_matrix() -> Vec<Vec<i32>> {
    let stdin = io::stdin();
    let handle = stdin.lock();
    let mut lines = handle.lines();

    let dimensions_line = lines.next().unwrap().unwrap();
    let mut dimensions = dimensions_line.split_whitespace();
    let n: usize = dimensions.next().unwrap().parse().unwrap();
    let m: usize = dimensions.next().unwrap().parse().unwrap();

    let mut result = vec![vec![0i32; m]; n];
    for row in 0..n {
        let line = lines.next().unwrap().unwrap();
        let mut words = line.split_whitespace();
        for col in 0..m {
            result[row][col] = words.next().unwrap().parse().unwrap();
        }
    }
    result
}

But this is horribly ugly because of all the double unwraps and “useless” temporaries. Is there a better way?

For context, I want to use Rust to solve typical problems from an algorithms class, so I am constrained to the standard library, but I can trust the input and just panic! if it does not conform to the expected format.


#2

First, I’d separate reading from parsing if you can. Implementing FromStr for your own type lets you call .parse() on any string to produce a value of that type.
For the rows themselves, you could do something like:

// Assumes `s: &str` holds only the element rows (dimension line already consumed).
let matrix: Vec<Vec<i32>> = s.lines().map(|line| {
    line.split_whitespace().map(|n| n.parse().unwrap()).collect()
}).collect();
// check the size of your matrix...
// return a Result
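
To make the FromStr suggestion concrete, here is a minimal sketch. The `Matrix` wrapper type and the choice of `String` as the error type are my own assumptions for illustration, not anything provided by the standard library. It parses the dimension line first, then checks that the rows match it:

```rust
use std::str::FromStr;

/// Hypothetical wrapper type, introduced only for this illustration.
#[derive(Debug, PartialEq)]
struct Matrix(Vec<Vec<i32>>);

impl FromStr for Matrix {
    // Assumption: a plain String is good enough as an error type here.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut lines = s.lines();

        // First line: "<height> <width>".
        let header = lines.next().ok_or("missing dimensions line")?;
        let dims = header
            .split_whitespace()
            .map(|t| t.parse::<usize>().map_err(|e| e.to_string()))
            .collect::<Result<Vec<_>, _>>()?;
        if dims.len() != 2 {
            return Err(format!("expected 2 dimensions, got {}", dims.len()));
        }
        let (n, m) = (dims[0], dims[1]);

        // Next `n` lines: one matrix row each.
        let rows = lines
            .take(n)
            .map(|line| {
                line.split_whitespace()
                    .map(|t| t.parse::<i32>().map_err(|e| e.to_string()))
                    .collect::<Result<Vec<_>, _>>()
            })
            .collect::<Result<Vec<_>, _>>()?;

        // Check the shape against the declared dimensions.
        if rows.len() != n || rows.iter().any(|r| r.len() != m) {
            return Err(format!("matrix is not {} x {}", n, m));
        }
        Ok(Matrix(rows))
    }
}
```

With this in place, reading becomes a separate step: fill a String from stdin, then call `buffer.parse::<Matrix>()` and unwrap or handle the Result as you see fit.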

#3

I’d use scan-rules, the crate. The reason that I’m answering with a crate is that libstd doesn’t have a low-friction way to do this.


#4

Yep, there are nice solutions for parsing in the ecosystem, but unfortunately it is not that easy to add external crates to the testing systems :(


#5

I use whiteread (shameless plug). It has a single-file implementation, so one can just copy-paste it. (It’s quite long, but mostly because of comments.)

There’s also a short macro implementation, rust-si, which is also a single file and under 100 lines. Its small disadvantages are that it parses the format string at runtime and allocates a String for each scanned value. Usually that’s not a problem, I guess, but when you have millions of ints to read, it could become one.


#6

Hm, actually splitting everything on whitespace is a great idea!

It is possible to write a 15-line library to do this!

use std::io::{self, Read};
use std::str::FromStr;

struct TextReader<'a> {
    tokens: std::str::SplitWhitespace<'a>
}

impl<'a> TextReader<'a> {
    fn new(text: &'a str) -> TextReader<'a> {
        TextReader { tokens: text.split_whitespace() }
    }

    fn read<T: FromStr>(&mut self) -> T {
        let token = self.tokens.next().expect("EOF reached");
        match token.parse() {
            Ok(x) => x, 
            Err(_) => panic!("Failed to parse {:?}", token),
        }
    }
}

fn read_matrix() -> Vec<Vec<i32>> {
    let mut buffer = String::new();
    io::stdin().read_to_string(&mut buffer).unwrap();
    let mut r = TextReader::new(&buffer);

    let n = r.read::<usize>();
    let m = r.read::<usize>();

    let mut result = vec![vec![0; m]; n];
    for row in 0..n {
        for col in 0..m {
            result[row][col] = r.read::<i32>();
        }
    }
    result
}

I wonder if it is somehow possible to simplify the API to just

let mut r = TextReader::from_stdin();

If I try to do it in the straightforward way, I run into the “store a value and a reference to it inside the same struct” problem…
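
One possible way around the self-reference problem, as a sketch rather than a definitive answer: leak the input buffer with `Box::leak`, which is safe and gives a `&'static str` for the tokens to borrow. The `from_string` and `from_stdin` constructors below are hypothetical additions to the reader above, and the obvious assumption is that the program reads its input once and exits, so never freeing the buffer is acceptable.

```rust
use std::io::{self, Read};
use std::str::{FromStr, SplitWhitespace};

/// Variant of the reader above without a lifetime parameter.
struct TextReader {
    tokens: SplitWhitespace<'static>,
}

impl TextReader {
    /// Hypothetical constructor: takes ownership of the text and leaks it,
    /// so the token iterator can borrow it for the 'static lifetime.
    /// The buffer is never freed.
    fn from_string(text: String) -> TextReader {
        let leaked: &'static str = Box::leak(text.into_boxed_str());
        TextReader { tokens: leaked.split_whitespace() }
    }

    /// Hypothetical constructor: reads all of stdin, then leaks it.
    fn from_stdin() -> TextReader {
        let mut buffer = String::new();
        io::stdin()
            .read_to_string(&mut buffer)
            .expect("failed to read stdin");
        TextReader::from_string(buffer)
    }

    fn read<T: FromStr>(&mut self) -> T {
        let token = self.tokens.next().expect("EOF reached");
        match token.parse() {
            Ok(x) => x,
            Err(_) => panic!("Failed to parse {:?}", token),
        }
    }
}
```

The trade-off is deliberate: no unsafe code and no lifetime in the type, at the cost of memory that lives until the process ends.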


#7

That’s a really interesting problem whose solution I would love to know.

What we really need is for BufRead to provide a split_at_whitespace() method. It already has split(), which splits on a single specific byte and yields Vec<u8>s, each of which String::from_utf8() can turn into a String without an additional allocation. Adding split_at_whitespace() here would be far more memory-efficient than loading a whole file into memory.
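
Until something like that exists, a rough approximation can be built on `BufRead::read_line`. This is only a sketch: `Scanner` and its `pending` field are hypothetical names, and it keeps one line in memory at a time instead of the whole input.

```rust
use std::io::BufRead;
use std::str::FromStr;

/// Hypothetical helper: hands out whitespace-separated tokens while
/// buffering only the current input line.
struct Scanner<R> {
    reader: R,
    // Tokens of the current line, stored in reverse so `pop` yields them in order.
    pending: Vec<String>,
}

impl<R: BufRead> Scanner<R> {
    fn new(reader: R) -> Self {
        Scanner { reader, pending: Vec::new() }
    }

    fn read<T: FromStr>(&mut self) -> T {
        loop {
            if let Some(token) = self.pending.pop() {
                return match token.parse() {
                    Ok(x) => x,
                    Err(_) => panic!("Failed to parse {:?}", token),
                };
            }
            // No tokens left: pull in the next line (blank lines are skipped).
            let mut line = String::new();
            let bytes = self.reader.read_line(&mut line).expect("read failed");
            assert!(bytes > 0, "EOF reached");
            self.pending = line.split_whitespace().rev().map(String::from).collect();
        }
    }
}
```

It works with any BufRead, for example a locked stdin handle or a byte slice in tests. Note that it allocates a String per token, so it trades the per-value allocation mentioned earlier in the thread for the lower peak memory use.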


#8

I tried a few ways to make (owned, self-borrowed) structs generically usable, but found myself at a loss after various attempts. So I gave up and did the simple thing, specialized for your case. The TextReader below should be safely movable into a different context from the one where you read your data into a string.

struct TextReader {
    source: String,
    tokens: std::str::SplitWhitespace<'static>
}

impl TextReader {
    fn new(text: String) -> TextReader {
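        // SAFETY: the &str points into the String's heap buffer, which stays
        // put when the String (or the TextReader holding it) is moved.
        // `source` must never be mutated while `tokens` is alive.
        let tokens = unsafe {
            std::mem::transmute::<&str, &'static str>(text.as_str()).split_whitespace()
        };
        TextReader { source: text, tokens }
    }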

    // ...
}


Safety

  • This technique is safe only if moving the owned value does not invalidate the borrowed reference. Specifically, it is not safe if, instead of String, you had [T; N] or a string representation using the short-string optimization, since moving the TextReader struct would then invalidate the (unsafe) borrow.
  • It is also important for safety that SplitWhitespace does not implement Drop (which it has absolutely no need to do here).
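
For comparison, here is a sketch of a fully safe alternative that sidesteps both caveats by storing a byte offset instead of a borrowed iterator. `OwnedReader` and its `pos` field are hypothetical names chosen for this illustration; the tokenizing logic is meant to mirror split_whitespace, not to reproduce its implementation.

```rust
use std::str::FromStr;

/// Hypothetical safe variant: owns the text and remembers how far it has read.
struct OwnedReader {
    source: String,
    // Byte offset of the first unread character; always on a char boundary.
    pos: usize,
}

impl OwnedReader {
    fn new(source: String) -> OwnedReader {
        OwnedReader { source, pos: 0 }
    }

    fn read<T: FromStr>(&mut self) -> T {
        // Skip leading whitespace before the next token.
        let rest = &self.source[self.pos..];
        let trimmed = rest.trim_start();
        let start = self.pos + (rest.len() - trimmed.len());

        // The token runs until the next whitespace character or the end of input.
        let len = trimmed.find(char::is_whitespace).unwrap_or(trimmed.len());
        assert!(len > 0, "EOF reached");

        let token = &self.source[start..start + len];
        self.pos = start + len;
        match token.parse() {
            Ok(x) => x,
            Err(_) => panic!("Failed to parse {:?}", token),
        }
    }
}
```

Because the struct holds no references, it can be moved freely and needs no lifetime parameter; the cost is a little index arithmetic on each read.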