Splitting a byte string into words

I have a byte string of output from a subprocess that consists of a series of whitespace-separated words, and I want to split it up into words for processing. Ideally, I don't want to convert it to a string, as I can't be 100% sure the data is valid UTF-8 (I'm 99% sure, but why take the risk?). It wasn't too hard to write my own iterator for this:

#[derive(Debug)]
struct Words<'a> {
    str: &'a [u8],
    start: usize,
    end: usize,
}

impl<'a> Iterator for Words<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        // Skip any whitespace that follows the previous word.
        self.start = self.end;
        while self.start < self.str.len() && self.str[self.start].is_ascii_whitespace() {
            self.start += 1;
        }
        if self.start >= self.str.len() {
            return None;
        }
        // Advance end to just past the last byte of the current word.
        self.end = self.start;
        while self.end < self.str.len() && !self.str[self.end].is_ascii_whitespace() {
            self.end += 1;
        }
        Some(&self.str[self.start..self.end])
    }
}

but it feels like this is something I should be able to get from the standard library. I did spend a bit of time before writing my own, but I couldn't find anything that looked straightforward.

Is there an easy way of doing this that I should have found? My reasons are two-fold:

  1. I'm just learning Rust, and I want to learn to do things the right way, which to me means using the features available rather than rewriting things from scratch where possible.
  2. While I'm reasonably sure my implementation is OK, I'd much rather use an existing, tested answer instead of adding an extra potential source of bugs by writing my own implementation.

You could use split:

fn main() {
    let bytes = b"hello world";
    
    let words = bytes.split(|b| b.is_ascii_whitespace());
    
    for word in words {
        println!("{}", ::std::str::from_utf8(word).unwrap());
    }
}

Playground.


I like @jofas' answer if it works for you, but it won't handle the case of multiple adjacent whitespace characters as nicely. The bstr crate provides a [u8]::fields method that probably does what you want. It also provides a number of other routines for treating &[u8] as if it were a string. That is, it provides a library for conventionally UTF-8 encoded strings, whereas a &str is for mandated UTF-8 encoded strings. On top of that, [u8]::fields will handle Unicode whitespace for you, even if your data isn't all valid UTF-8.
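A minimal sketch of what that looks like (it assumes bstr is added as a dependency in Cargo.toml; the to_str_lossy call is only there for printing):

use bstr::ByteSlice;

fn main() {
    let txt: &[u8] = b"  a    bbb   ccc d e";
    // fields() splits on runs of Unicode whitespace and never yields empty
    // slices, even if the input isn't valid UTF-8.
    for word in txt.fields() {
        println!("{}", word.to_str_lossy());
    }
}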

Popping up a level, you can even do full-blown Unicode word segmentation on possibly-invalid UTF-8 with [u8]::words, also in bstr.
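A quick sketch of that, too (it assumes bstr's default "unicode" feature is enabled):

use bstr::ByteSlice;

fn main() {
    let txt: &[u8] = b"hello, world! this is word segmentation";
    // words() runs Unicode word segmentation over the byte string,
    // skipping the whitespace and punctuation between words.
    let words: Vec<_> = txt.words().collect();
    println!("{:?}", words);
}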


Thanks. I did find this, and it does nearly what I want, but it returns empty "words" when there are multiple spaces:

let txt = b"  a    bbb   ccc d e";
let h: Vec<_> = txt
    .split(|b| b.is_ascii_whitespace())
    .map(|b| std::str::from_utf8(b).unwrap())
    .collect();
println!("{:?}", h);

-->

["", "", "a", "", "", "", "bbb", "", "", "ccc", "d", "e"]

I guess I could filter out entries where the length is zero, which I didn't think of originally, so that's a good solution. In my head I think of the problem as "split on runs of whitespace" rather than "split on space characters and throw away empty strings", but I guess that's a minor detail.
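Something like this, I suppose (a sketch; the lossy conversion is just so I can print the result):

fn main() {
    let txt = b"  a    bbb   ccc d e";
    let words: Vec<&[u8]> = txt
        .split(|b| b.is_ascii_whitespace())
        .filter(|w| !w.is_empty())
        .collect();
    for word in &words {
        // Prints each word on its own line: a, bbb, ccc, d, e - no empty entries.
        println!("{}", String::from_utf8_lossy(word));
    }
}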

I did look at that, and it was definitely an option. But it splits based on the Unicode whitespace property, which means it does a bunch of work I don't need: I know that in this case the only word separators I'm interested in are ASCII whitespace. So I think the split/filter solution is probably better for what I need (it fits my mental model of doing only the bare minimum of string processing, to avoid adding risk associated with funky Unicode stuff).

To be clear, I do realise that I'm being very obsessive here about very trivial details. All of these solutions work just fine. To an extent, what I'm doing is working through the biases I've acquired through working with Python for years (learning to care enough, but not too much, about performance, and making sure I don't make invalid assumptions around Unicode). So apologies for that!

Thanks to both of you, though - this discussion has been very informative for me, even though it probably seems pretty basic to you. I appreciate your time helping me 🙂

It's probably cheaper than you expect, but that's fair. You can do fields_with(|ch| ch.is_ascii_whitespace()) instead if you really just want to restrict to ASCII whitespace.
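For example (a small sketch, again assuming bstr as a dependency):

use bstr::ByteSlice;

fn main() {
    let txt: &[u8] = b"  a    bbb   ccc d e";
    // fields_with splits on runs of characters matching the predicate,
    // here restricted to ASCII whitespace only.
    for word in txt.fields_with(|ch| ch.is_ascii_whitespace()) {
        println!("{}", word.to_str_lossy());
    }
}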

FWIW though, bstr is designed such that, as long as you can assume "conventionally UTF-8" data (i.e., files on Unix systems), there shouldn't be much you can do to misuse it. But you're right to worry about Unicode in terms of its cost model, because it almost always has some kind of overhead somewhere.
