How to parse command output and split it into OsStr slices?

I am trying to parse the output of Command::new("cmd").output()?.stdout, splitting it into lines of whitespace-separated fields (whitespace roughly as in the C locale: tab, space, \n). Basically I want a Vec (lines) of Vecs (fields) of OsStr slices (or any other practical type) that point into the initial buffer.

I would like to avoid converting back and forth to Strings, since I can't be sure about the encoding of the data, and it's just data that I will most likely feed into other operating system utilities. So I decided that the best type is OsStr, but I am open to other possibilities.

How can I do this best? I have had quite a few hurdles on my way:

  • working with OsStrs is a pain since they don't provide all String utilities (like split_whitespace()).
  • what is the idiomatic way to write this cross-platform? ATM I convert stdout to OsStr using OsStr::from_bytes() (rough sketch below), but this apparently only works on UNIX.
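
Roughly what I have now, as a Unix-only sketch (the command name and error handling are just placeholders):

use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: provides OsStr::from_bytes
use std::process::Command;

fn main() -> std::io::Result<()> {
    let output = Command::new("cmd").output()?;
    // Reinterpret the raw stdout bytes as an OsStr without copying.
    let stdout: &OsStr = OsStr::from_bytes(&output.stdout);
    // ...and here I get stuck: OsStr has no split_whitespace(), and
    // OsStrExt::from_bytes doesn't exist on Windows.
    dbg!(stdout);
    Ok(())
}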

Thank you in advance.

  1. to_str only checks that the OsStr is valid UTF-8; if it is, you get Some(&str) borrowing the same bytes, so this doesn't allocate (short sketch after point 2).
  2. If what you were going to wrap in an OsStr is already a &[u8], you can simply do something like this:
    fn main() {
        let mystr: &[u8] = b"This (could be) a [u8] slice from FFI";
        mystr
            .split(|n: &u8| n.is_ascii_whitespace())
            .for_each(|s: &[u8]| {
                // do something with this part
                dbg!(std::str::from_utf8(s));
            });
    }
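
For point 1, a minimal sketch (the literal string is just an example):

use std::ffi::OsStr;

fn main() {
    let s: &OsStr = OsStr::new("foo bar");
    // to_str() only validates UTF-8; it returns Some(&str) borrowing the
    // same bytes, or None, and never allocates.
    match s.to_str() {
        Some(utf8) => println!("valid UTF-8: {}", utf8),
        None => println!("not valid UTF-8"),
    }
}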
    

Regarding cross-platform: AFAIK on Linux/Unix strings are conventionally UTF-8, meaning this should work, but so should converting to a &str. On Windows, depending on the API, strings are either ASCII (which will work, since UTF-8 is backwards compatible with ASCII) or UTF-16, which will require converting to/from UTF-8...
Note that the OS usually requires a null-terminated string, so this doesn't always save you an allocation (Rust strings are not null-terminated).
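
For example, handing a path to a C API usually means allocating a NUL-terminated copy anyway (a sketch; the path is made up):

use std::ffi::CString;

fn main() {
    // CString::new copies the bytes and appends a trailing NUL byte,
    // so an allocation happens no matter what type the data started as.
    let c = CString::new("/some/path").expect("no interior NUL bytes");
    println!("{:?}", c);
}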


OsStr doesn't really seem right for this. If you truly don't know the encoding of the data, then you're kind of stuck there. That is, if the output could be UTF-8 or it could be UTF-16, then I think you really need to address that explicitly. One simple way to do that is with encoding_rs_io, which will wrap any io::Read and automatically handle transcoding from UTF-16 to UTF-8 for you. For invalid UTF-16, you'll get the Unicode replacement codepoint. The wrapper is effectively zero cost for non-UTF-16 or ASCII compatible text, which is probably what you're dealing with if this is just a generic CLI utility.
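
A minimal sketch of what that looks like (this assumes encoding_rs_io's DecodeReaderBytes wrapper; the input bytes are made up):

use std::io::Read;

use encoding_rs_io::DecodeReaderBytes;

fn main() -> std::io::Result<()> {
    // Pretend these bytes came from a child process and might be
    // UTF-16 (with a BOM) or plain ASCII/UTF-8.
    let raw: &[u8] = b"foo bar baz\n";
    let mut rdr = DecodeReaderBytes::new(raw);
    let mut text = String::new();
    // If a UTF-16 BOM is detected, the reader transcodes to UTF-8;
    // otherwise the bytes are passed through as-is.
    rdr.read_to_string(&mut text)?;
    print!("{}", text);
    Ok(())
}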

If the data you're dealing with is ASCII compatible but not necessarily valid UTF-8, then bstr is probably what you want. This is what it was designed for.


Thank you. I've given up on working with OsStr slices since they don't provide many utility functions. I am now mostly trying to split the command output, working with &[u8] slices right out of stdout. I'll get back to you when I have it working. But it still looks much harder than using strings. Some notes and further questions:

  • I do not know the encoding: not all UNIX locales use UTF-8 by default, and not all paths are validly encoded. In the end I do not care about the encoding (for most fields); I want to store them as bytes.
  • I am now using split() with is_ascii_whitespace(), as @naim suggested, but this splits on every single whitespace byte, so runs of whitespace produce empty fields (see the sketch after this list). How would I go about skipping sequential whitespace?
  • I find it a pity that it's so hard to work with OsStr or &[u8] slices. Shouldn't they support most of the String methods via a trait?
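
Here is roughly what I mean in the second point (a minimal sketch; the input is made up):

fn main() {
    let line: &[u8] = b"foo\t\tbar    baz";
    // Splitting on every whitespace byte yields an empty slice between
    // each pair of consecutive separators:
    let fields: Vec<&[u8]> = line.split(|b| b.is_ascii_whitespace()).collect();
    assert_eq!(fields.len(), 7); // "foo", "bar", "baz" plus four empty slices
}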

And a different but related topic. After I split the first line of output (the headers), I would know the offset of each word on the line, and I could use those offsets to directly split the following lines. How do I get that offset? I.e. after doing line.split().collect() and ending up with a vector of words, how do I get the starting index of each word within line?

P.S. I haven't yet looked into external crates like bstr or osstrtools. I'm hoping to manage this seemingly easy task using just std. :slight_smile:

That's exactly what bstr gives you:

use bstr::ByteSlice;

fn main() {
    let line: &[u8] = &b"foo\t\tbar    baz quux\n"[..];
    for field in line.fields() {
        println!("{:?}", field.as_bstr());
    }
}

Output:

"foo"
"bar"
"baz"
"quux"

You do care, though. If you're OK using is_ascii_whitespace, then that's a declaration that you at least care that the encoding is ASCII compatible. So to that end, this is exactly what bstr was designed for.

Sure, I guess the simplest way is with a little bit of pointer arithmetic:

use bstr::ByteSlice;

fn main() {
    let line: &[u8] = &b"foo\t\tbar    baz quux\n"[..];
    for field in line.fields() {
        let start = field.as_ptr() as usize - line.as_ptr() as usize;
        let end = start + field.len();
        println!("({:?}, {:?}): {:?}", start, end, field.as_bstr());
    }
}

Output:

(0, 3): "foo"
(5, 8): "bar"
(12, 15): "baz"
(16, 20): "quux"

The bstr crate docs address this somewhat. It's not clear how widely applicable "ASCII-compatible encoding agnostic but conventionally UTF-8" strings actually are. They definitely apply at least to CLI tooling on Unix, where one often sees a mix of latin-1 and UTF-8 floating around.

Note that you can disable some of bstr's features to reduce the number of dependencies it brings in.
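
For example, something like bstr = { version = "0.2", default-features = false, features = ["std"] } in your Cargo.toml (the version and the exact feature names here are illustrative; check the crate's documentation for the current feature list).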
