Splitting string on white space but preserving quoted substring

I'm trying to split a string that contains quoted text so I can use it as command args later in a separate process.

// this: "bash -c \"rm -f tmp/pids/server.pid && bundle exec rails s -p 3000 -b '0.0.0.0'\""

// to this:
args = [
    "bash",
    "-c",
    "rm -f tmp/pids/server.pid && bundle exec rails s -p 3000 -b '0.0.0.0'",
]

I've tried using regex for this, but I haven't managed to split only on the whitespace that sits outside the quotes. The only way I found to get the result above was to first split on the double quote and then use a peekable iterator to treat the last chunk specially: every chunk except the last is split on whitespace into a temporary vector, the temporary vector is flattened, and the last chunk is pushed without being split. This works for cases like the example above, where everything unquoted is an arg and the last part is the "command" passed to bash with the -c flag, but it would break in other scenarios.

This is what I've tried so far:

use regex::Regex;
fn main() {
    let cmd = "bash -c \"rm -f tmp/pids/server.pid && bundle exec rails s -p 3000 -b '0.0.0.0'\""
        .to_string();
    // This doesn't work because the quoted text itself matches the pattern, so it is
    // treated as a separator: the iterator returns everything except the quoted text.
    let re = Regex::new(r#""[^"]*"|\s+"#).unwrap();
    let args: Vec<&str> = re.split(&cmd).collect();
    dbg!(args);

    // This doesn't work because I'm splitting on spaces so the quoted text also
    // gets split.
    let re = Regex::new(r#"\s+"#).unwrap();
    let args: Vec<&str> = re.split(&cmd).collect();
    dbg!(args);

    // This works as desired but seems overly engineered. Maybe I'm just overthinking?
    // Perhaps this is almost good enough and I'm just missing some more idiomatic way
    // of expressing this?
    let cmd_split_quotes: Vec<&str> = cmd.split_terminator('"').collect();
    let mut cmd_split_spaces = cmd_split_quotes.iter().peekable();
    let mut arg_vec_temp = vec![];
    let mut arg_vec = vec![];
    while let Some(chunk) = cmd_split_spaces.next() {
        if cmd_split_spaces.peek().is_some() {
            let split_chunk: Vec<String> = chunk.split_whitespace().map(String::from).collect();
            arg_vec_temp.push(split_chunk);
        } else {
            arg_vec = arg_vec_temp.iter().flatten().map(String::from).collect();
            arg_vec.push(chunk.to_string());
        }
    }
    dbg!(arg_vec);

    println!("{}", cmd);
}


How would you guys go about doing this? Is this possible?

(Cross posted on reddit for more visibility, so here's the link for that just in case: https://www.reddit.com/r/rust/comments/tlvf5h/splitting_string_on_white_space_but_preserving/)

What you're writing is called a lexer or tokenizer. It is one of the core components of programming language implementations such as compilers and interpreters, and shell script is itself a programming language.

And why can't you simply split with a regex here? The Chomsky hierarchy strikes again. The lexer has to behave differently inside a string literal than outside it, and once you add escapes and the different quoting rules, the syntax goes beyond what a plain regular expression (regex) handles comfortably; regular grammars are only a subset of the context-free grammars. I'm not sure whether PCRE can technically parse it, since it can do more than true regular expressions, but please avoid writing dozens of lines of regex code nobody can read.

In general, a lexer implementation takes a sequence of bytes or characters and produces a sequence of tokens. You can refer to xshell's implementation, which supports a bash-like syntax.
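To make the "characters in, tokens out" idea concrete, here is a minimal sketch of such a loop. It is not xshell's code: `split_quoted` is a hypothetical name, it only understands single and double quotes, and it assumes you don't need backslash escapes or any other shell syntax.

```rust
/// Hypothetical helper for illustration: splits on whitespace, keeps text
/// inside single or double quotes together, and drops the quotes themselves.
/// Backslash escapes are NOT handled.
fn split_quoted(input: &str) -> Result<Vec<String>, String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    // True once a token has started, so that an empty "" still yields a token.
    let mut in_token = false;
    // The quote character we are currently inside, if any.
    let mut quote: Option<char> = None;

    for c in input.chars() {
        match quote {
            // The matching closing quote ends the quoted part.
            Some(q) if c == q => quote = None,
            // Anything else inside quotes is taken literally, spaces included.
            Some(_) => current.push(c),
            // An opening quote starts (or continues) a token.
            None if c == '"' || c == '\'' => {
                quote = Some(c);
                in_token = true;
            }
            // Unquoted whitespace ends the current token, if there is one.
            None if c.is_whitespace() => {
                if in_token {
                    tokens.push(std::mem::take(&mut current));
                    in_token = false;
                }
            }
            None => {
                current.push(c);
                in_token = true;
            }
        }
    }

    if quote.is_some() {
        return Err("missing closing quote".to_string());
    }
    if in_token {
        tokens.push(current);
    }
    Ok(tokens)
}

fn main() {
    let cmd = "bash -c \"rm -f tmp/pids/server.pid && bundle exec rails s -p 3000 -b '0.0.0.0'\"";
    let args = split_quoted(cmd).unwrap();
    println!("{:?}", args);
}
```

The important part is that the loop carries state (`quote`) from one character to the next, which is exactly what a single split pattern cannot do.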


So, in the end I used something pretty similar to what you suggested. I took the implementation from lib.rs in the tmiasko/shell-words repository on GitHub and made some minimal changes here and there so I can use it as a trait at the call site:

use core::mem;
use std::fmt::{Display, Formatter, Result as FmtResult};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ParseError;

impl Display for ParseError {
    fn fmt(&self, f: &mut Formatter) -> FmtResult {
        f.write_str("missing closing quote")
    }
}

impl std::error::Error for ParseError {}

enum State {
    /// Within a delimiter.
    Delimiter,
    /// After backslash, but before starting word.
    Backslash,
    /// Within an unquoted word.
    Unquoted,
    /// After backslash in an unquoted word.
    UnquotedBackslash,
    /// Within a single quoted word.
    SingleQuoted,
    /// Within a double quoted word.
    DoubleQuoted,
    /// After backslash inside a double quoted word.
    DoubleQuotedBackslash,
}

pub(crate) trait IntoArgs {
    fn try_into_args(&self) -> Result<Vec<String>, ParseError>;
}

impl<S: std::ops::Deref<Target = str>> IntoArgs for S {
    fn try_into_args(&self) -> Result<Vec<String>, ParseError> {
        use State::*;

        let mut words = Vec::new();
        let mut word = String::new();
        let mut chars = self.chars();
        let mut state = Delimiter;

        loop {
            let c = chars.next();
            state = match state {
                Delimiter => match c {
                    None => break,
                    Some('\'') => SingleQuoted,
                    Some('\"') => DoubleQuoted,
                    Some('\\') => Backslash,
                    Some('\t') | Some(' ') | Some('\n') => Delimiter,
                    Some(c) => {
                        word.push(c);
                        Unquoted
                    }
                },
                Backslash => match c {
                    None => {
                        word.push('\\');
                        words.push(mem::take(&mut word));
                        break;
                    }
                    Some('\n') => Delimiter,
                    Some(c) => {
                        word.push(c);
                        Unquoted
                    }
                },
                Unquoted => match c {
                    None => {
                        words.push(mem::take(&mut word));
                        break;
                    }
                    Some('\'') => SingleQuoted,
                    Some('\"') => DoubleQuoted,
                    Some('\\') => UnquotedBackslash,
                    Some('\t') | Some(' ') | Some('\n') => {
                        words.push(mem::take(&mut word));
                        Delimiter
                    }
                    Some(c) => {
                        word.push(c);
                        Unquoted
                    }
                },
                UnquotedBackslash => match c {
                    None => {
                        word.push('\\');
                        words.push(mem::take(&mut word));
                        break;
                    }
                    Some('\n') => Unquoted,
                    Some(c) => {
                        word.push(c);
                        Unquoted
                    }
                },
                SingleQuoted => match c {
                    None => return Err(ParseError),
                    Some('\'') => Unquoted,
                    Some(c) => {
                        word.push(c);
                        SingleQuoted
                    }
                },
                DoubleQuoted => match c {
                    None => return Err(ParseError),
                    Some('\"') => Unquoted,
                    Some('\\') => DoubleQuotedBackslash,
                    Some(c) => {
                        word.push(c);
                        DoubleQuoted
                    }
                },
                DoubleQuotedBackslash => match c {
                    None => return Err(ParseError),
                    Some('\n') => DoubleQuoted,
                    Some(c @ '$') | Some(c @ '`') | Some(c @ '"') | Some(c @ '\\') => {
                        word.push(c);
                        DoubleQuoted
                    }
                    Some(c) => {
                        word.push('\\');
                        word.push(c);
                        DoubleQuoted
                    }
                },
            }
        }

        Ok(words)
    }
}

then use it like so:

let cmd = match &service_config.command {
    Some(Command::Simple(cmd)) => cmd.try_into_args().ok(),
    None => None,
    _ => panic!("Unsupported command"),
};
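Since the goal was to use the words as command args in a separate process, here is a rough sketch of that last step. `command_from_args` is a hypothetical helper, not part of shell-words or of the code above; it assumes the first word is the program and the rest are its arguments.

```rust
use std::process::Command;

/// Hypothetical helper: turns already-split words into a `Command`.
/// Returns `None` for an empty list, since there is no program to run.
fn command_from_args(args: &[String]) -> Option<Command> {
    let (program, rest) = args.split_first()?;
    let mut command = Command::new(program);
    command.args(rest);
    Some(command)
}

fn main() {
    let args: Vec<String> = ["bash", "-c", "echo 'hello world'"]
        .iter()
        .map(|s| s.to_string())
        .collect();

    if let Some(command) = command_from_args(&args) {
        // Only inspect the command here so the sketch has no side effects;
        // bind it mutably and call `status()` or `spawn()` to actually run it.
        println!("{:?}", command);
    }
}
```

Because each word is passed as its own argument, no further quoting is needed at this point; the quoted part reaches bash as a single `-c` argument.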

You're absolutely right that, in the end, this was not the best fit for regex, and this solution feels a lot more robust.
Thank you for your help and guidance!