Gripes with filling variables from input

The plan is to write a simple method which does exactly what std::cin >> from the C++ standard library does. Here is what was come up with.

use std::io::BufRead;
pub fn input<T: std::str::FromStr>(handle: &std::io::Stdin) -> Result<T, T::Err> {
    let mut x = String::new();
    let mut guard = handle.lock();
    loop {
		let mut trimmed = false;
		let available = guard.fill_buf().unwrap();
        let l = match available.iter().position(|&b| !(b as char).is_whitespace()) {
            Some(i) => {
				trimmed = true;
				i
			}
            None => available.len()
		};
		guard.consume(l);
		if trimmed {
			break;
		}
    }
    let available = guard.fill_buf().unwrap();
    let l = match available.iter().position(|&b| (b as char).is_whitespace()) {
        Some(i) => i,
        None => available.len()
    };
    x.push_str(std::str::from_utf8(&available[..l]).unwrap());
    guard.consume(l);
    T::from_str(&x)
}

Let me help you understand. The loop block is meant to trim away all the whitespaces before valid input begins. The match block outside loop is where the length of the valid input (that is, before trailing whitespaces begin or EOF is reached) is calculated.

Here is an example using the above method.

let handle = std::io::stdin();
let x: i32 = input(&handle).unwrap();
println!("x: {}", x);
let y: String = input(&handle).unwrap();
println!("y: {}", y);

When I tried a few simple tests, the method works as intended. However, when I use this in online programming judges like the one in codeforces, sometimes I get a complaint telling that the program stays idle or that the wrong input has been taken, among other issues, which leads to suspecting that I missed a corner case or something like that. This usually happens when the input is a few hundreds of lines long.

Okay, I am asking here hoping one of you knows of an input which is going to break the method, or even better, one of you points a correction. Help me?

Edit:
Replacing the loop containing fill_buf() + consume() with something like a single read_to_string() or a loop containing read_line() seems to solve the issue, but I am not certain why.

Not sure it's the reason of the failing but I see some bugs in your code. Have you tested some non-ASCII input?

This treats bytes of UTF-8 representation of non-ascii character as a separate unicode character between 0x80 and 0xFF. Maybe some characters in this range are considered whitespace by the unicode framework?

1 Like

You talking about this?

    #[stable(feature = "rust1", since = "1.0.0")]
    #[inline]
    pub fn is_whitespace(self) -> bool {
        match self {
            ' ' | '\x09'..='\x0d' => true,
            c => c > '\x7f' && unicode::White_Space(c),
        }
    }

Hmm.. makes sense, but I don't think online judges give anything but ASCII as input and it's only alphabets, numbers, spaces, newlines and sometimes tabs, as far as I know.

Also, here is part of the impl str block.

    #[must_use = "this returns the trimmed string as a slice, \
                  without modifying the original"]
    #[stable(feature = "rust1", since = "1.0.0")]
    pub fn trim(&self) -> &str {
        self.trim_matches(|c: char| c.is_whitespace())
    }

You see, even trim() uses is_whitespace() internally. So I don't know if it makes a difference.

Obviously, @Hyeonu didn't mean that is_whitespace() contains a bug.

He meant that you are using it wrong. You are iterating over the bytes of the string, and treating each byte as an ASCII character. You are not iterating over the code points, which are described by potentially more than one byte. That would be the .chars() iterator of str.

Of course str::trim() uses is_whitespace() internally, but it iterates over the string correctly, i.e. it takes each complete code point, not the raw bytes. So it's not doing the same thing as your code above.

1 Like

Since you said you are using it on codeforces, I will give you some advice that is specific to solving problems like that on codejudges.

  1. Read the entire input into memory.
  2. Use split_ascii_whitespace on that string.

This advice is useful because it is rather fast, and telling the difference between line breaks and spaces is virtually never necessary on codeforces. Here's an example that reads some lines:

use std::io::Read;

let mut s = String::new();
std::io::stdin().lock().read_to_string(&mut s).unwrap();

let mut iter = s.split_ascii_whitespace();

let len: usize = iter.next().unwrap().parse().unwrap();
let mut input = Vec::with_capacity(len);
for _ in 0..len {
    let line: u32 = iter.next().unwrap().parse().unwrap();
    input.push(line);
}

In the real world you would of course want to handle unicode, but that is not necessary with code-judge problems.

3 Likes

I kind of forgot about utf-8 not always fitting in a single byte. Here is the corrected code.

pub fn input<T: std::str::FromStr>(handle: &std::io::Stdin) -> Result<T, T::Err> {
    let mut x = String::new();
    let mut guard = handle.lock();
    loop {
		let mut trimmed = false;
		let available = std::str::from_utf8(guard.fill_buf().unwrap()).unwrap();
        let l = match available.chars().position(|c| !c.is_whitespace()) {
            Some(i) => {
				trimmed = true;
				i
			}
            None => available.len()
		};
		guard.consume(l);
		if trimmed {
			break;
		}
    }
    let available = std::str::from_utf8(guard.fill_buf().unwrap()).unwrap();
    let l = match available.chars().position(|c| c.is_whitespace()) {
        Some(i) => i,
        None => available.len()
    };
    x.push_str(&available[..l]);
    guard.consume(l);
    T::from_str(&x)
}

The issue persists though.

@alice, I agree with what you are saying. I am just wondering what causes the program to "stay idle", as the judge complains.

Text processing is hard. Why don't you use existing stdlib functionalities?

In competitive programming, inputs are usually small enough (under few megabytes) so I read them as a whole using std::io::stdin().read_to_string().unwrap() and make an iterator using input.trim().split_whitespace(), and parse it token by token. If the input is too large I do so in line-by-line with .read_line().unwrap().

The process blocks forever on read_to_string() when reading from the standard input. Is this intended?

It blocks until the input stream is closed. If you pipe in a file, it will close the stream when the file ends. You can typically write an EOF symbol with ctrl+D.

On some sites they include common crates, a great one for this is text_io, which makes it easy to write code like this:

use text_io::read;

fn main() {
    let i: i32 = read!();
}

You still have two UTF-8 related bugs:

  1. available.chars().position() does not return a byte index suitable for indexing into available because it counts characters instead of bytes.
  2. The buffer returned by fill_buf() is not guaranteed to start or end on a UTF-8 character boundary.

I suggest going back to .iter() instead of .chars() and just using u8::is_ascii_whitespace instead of char::is_whitespace. But to completely solve problem #2 you also need to accumulate the bytes in a Vec<u8> and convert it to a String only at the end, because you can't call from_utf8 on a buffer containing partial chars.

As for your actual bug -- shouldn't you be saving the bytes you've consumed inside the loop? I haven't tested it but it looks to me as if, whenever you call guard.consume(l) inside the loop, the bytes consumed are just lost. That would cause input to sometimes skip (potentially large) chunks of the input data, which might be the cause of your problem. Sorry, no, I misunderstood that part. The next post has a better guess as to what the problem might be.

Isn't the real problem that this code doesn't consume the entire input? Copy-pasta from the BufRead::fill_buf docs:

An empty buffer returned indicates that the stream has reached EOF.

This code is not looking for EOF, it just returns anything in the buffer and calls it a day.

Given this, the sample usage code is invalid; these can retrieve partial contents, or worse; could be unable to convert the byte slice to a valid string:

2 Likes

Yes, that's what I intended: to get a reference for the whole buffer using fill_buf(), then consume() only what's required for one variable.

This code doesn't do that, though. It's conceivable that the buffer will contain only part of the actual input, and not enough to satisfy your requirement to populate an entire variable. It could be as little as 1 byte, when you actually expect a full paragraph of text.

1 Like

I've seen the implementations of some of the methods from the Read and BufRead traits. In fact, I wrote the method based on these.

So from what I understand about the standard input stream, hitting Return signals the end of buffer whereas hitting Ctrl+D signals EOF. Is this correct?

Here's a taste of what's going wrong...

use std::io::BufRead;

fn main() {
    let handle = std::io::stdin();
    let mut guard = handle.lock();
    let available = guard.fill_buf().unwrap();
    let length = available.len();
    guard.consume(length);

    println!("Read {} bytes", length);
}

You want this to consume the entire input, right? So if hypothetically the input was infinite sized, this command would never finish. Well, we have /dev/zero which produces zero bytes forever, so we can test it!

$ cat /dev/zero | cargo run --release
Read 8192 bytes

So at least on my system, the input function you wrote will only read up to 8 KiB. (And if that 8 KiB boundary happens to fall within a multi-byte UTF-8 sequence, it will panic.)

The usual way to write code that will consume from a buffered stream forever is to loop over the stream until you reach EOF. Or until you have consumed enough bytes to fully deserialize your message or packet.

1 Like

How about this?

use std::io::BufRead;

fn main() {
    let handle = std::io::stdin();
    let mut guard = handle.lock();
    let mut total_length = 0;
    loop {
        let available = guard.fill_buf().unwrap();
        let length = available.len();
        guard.consume(length);
        total_length += length;
        if <some_condition> {
            break;
        }
    }
    println!("Read {} bytes", length);
}

One more thing (not related to the above code), I AM taking care not to end the buffer in the middle of a UTF-8 character by doing a consume() only until a known, valid character (like a whitespace or a valid character before it). This is also why I used chars() instead of something like iter() or something like that.

Let me break it down a little further.

pub fn input<T: std::str::FromStr>(handle: &std::io::Stdin) -> Result<T, T::Err> {
    let mut x = String::new();
    let mut guard = handle.lock();
    loop {
		let mut trimmed = false;
		let available = std::str::from_utf8(guard.fill_buf().unwrap()).unwrap();

This is your first problem. Suppose that the input stream contains "Hello 😃" and the buffer only has room for 8 characters. fill_buf() will read "Hello " followed by the first two bytes of 😃 (which is a four-byte character), then stop, and return the buffer containing a partial character (leaving the last two bytes of 😃 in the input stream). str::from_utf8 will find the partial character and return Err and the unwrap() will panic. You never even get to call chars() on it so it makes no difference whether you tried to break on a character boundary or not.

        let l = match available.chars().position(|c| !c.is_whitespace()) {

This is the second problem. Forget about the buffer size for the moment; let's just say it's big enough to contain the whole input, which is " world". But there's a catch: the space in that string is a U+00A0 NO_BREAK SPACE character, which is 2 bytes in size and for which char::is_whitespace returns true.

The line above technically works fine. position keeps a count of characters which starts at 0. It looks at the non-breaking space and does nothing. Then the counter increments to 1, position looks at w and, because it is not whitespace, returns 1: the position of the first non-whitespace character in the iterator, but not the byte index at which it started in the original string...

            Some(i) => {
				trimmed = true;
				i
			}
            None => available.len()
		};
		guard.consume(l);

... which sets this guard.consume up to consume only 1 byte of the non-breaking space, which means the second byte will remain in the buffer...

		if trimmed {
			break;
		}
    }
    let available = std::str::from_utf8(guard.fill_buf().unwrap()).unwrap();

... and because it's not correctly encoded UTF-8, str::from_utf8 will return Err again which you immediately unwrap().

    let l = match available.chars().position(|c| c.is_whitespace()) {

If you manage to avoid the above problem, it can actually happen again here because if the input stream contains multi-byte non-whitespace characters, the position will be wrong...

        Some(i) => i,
        None => available.len()
    };
    x.push_str(&available[..l]);

... which may lead to panicking here when you index on a non-character-boundary, or worse, successfully return the wrong amount of bytes.

    guard.consume(l);
    T::from_str(&x)
}

To fix the problem of position, you could use char_indices instead of chars and find instead of position. However, you can still end up with problems where the stdin buffer does not begin or end on a character boundary, and it's quite tricky to solve those correctly (don't forget that the buffer may not have enough free space for even a single multibyte character). The simplest thing is to restrict your "whitespace" characters to ASCII-only, which makes it much easier to prove that str::from_utf8 will always succeed.

2 Likes

Thanks for taking the time, @trentj. The documenting community could use more people like you (the others too). I totally forgot to adjust the code after changing iter() to char().