How to remove useless space in Rust string without using regex

Is there a way in Rust to do the following to a string:

  1. reduce double spaces to one and
  2. remove spaces before and after \n, \r\n and tabs?

As you can imagine, this is all text coming from form inputs such as text fields and textareas.

All this:

  1. without using regex
  2. with unicode chars

Some tests to satisfy are:

#[test]
fn test() {
    assert_eq!(magic("  ".to_string()), "");
    
    assert_eq!(
        magic("     a l    l   lower      ".to_string()),
        "a l l lower"
    );
    
    assert_eq!(
        magic("     i need\nnew  lines \n\nmany   times     ".to_string()),
        "i need\nnew lines\n\nmany times"
    );
    
    assert_eq!(magic("  à   la  ".to_string()), "à la");
}

In golang I'm using:

func Magic(s string) string {
	return strings.ReplaceAll(strings.Join(strings.FieldsFunc(s, func(r rune) bool {
		if r == '\n' {
			return false
		}

		return unicode.IsSpace(r)
	}), " "), " \n", "\n")
}

SO question: How to remove useless space in Rust string without using regex - Stack Overflow

This test appears to be inconsistent with the others, which remove all spaces from the beginning and end of the string. From the textual description, either this should be the empty string or the others should retain a single blank space at the beginning and end.

As this is possible to write as a regex, it's also possible to write it as a simple finite automaton. One possible solution would be something along these lines:

use std::iter::Peekable;

struct S<I:Iterator>(Peekable<I>);

impl<I> Iterator for S<I> where I: Iterator<Item=char> {
    type Item = char;
    fn next(&mut self)->Option<char> {
        loop {
            match self.0.next()? {
                ' ' => {
                    match self.0.peek()? {
                        ' ' | '\r' | '\n' | '\t' => (),
                        _ => return Some(' ')
                    }
                }
                w @ ('\r' | '\n' | '\t') => {
                    while let Some(' ') = self.0.peek() {
                        self.0.next();
                    }
                    return Some(w)
                }
                c => return Some(c)
            }
        }
    }
}

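fn magic(chars: String) -> String {
    // Annotate the target type so `collect` and the check below are unambiguous.
    let ret: String = S(chars.trim().chars().peekable()).collect();
    // All-blank input yields a single space, as in the question's original test.
    if ret.is_empty() { String::from(" ") } else { ret }
}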

How strict are your particulars, like the "retain single space" thing mentioned above?

Because this seems reasonable to me.

fn magic(input: &str) -> String {
    let mut output: String = input
        .trim()
        .lines()
        .flat_map(|line| {
            line.split_whitespace()
                .intersperse(" ")
                .chain(std::iter::once("\n"))
        })
        .collect();

    // Remove trailing '\n' (optional...)
    output.pop();
    output
}

If you are more particular and/or don't want to wait for `intersperse` to be stabilized:

fn magic(istr: &str) -> String {
    let mut output = String::new();
    for line in istr.lines() {
        let mut blank = true;
        output.extend(
            line.split_whitespace()
                .inspect(|_| blank = false)
                .flat_map(|word| [word, " "])
        );

        if !blank {
            // Remove extra trailing ' '
            output.pop();
        } else if !line.is_empty() {
            // For the "   " => " " case
            output.push(' ');
        }
        output.push('\n');
    }
    
    // Remove trailing '\n'
    output.pop();

    output
}

Why don't you want to use regexes?

I think a literal translation into Rust would look something like this:

fn magic(s: &str) -> String {
    s.replace(|c: char| if c == '\n' { false } else { c.is_whitespace() }, " ")
     .replace(" \n ", "\n")
}
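
A closer mirror of the Go snippet from the question might split into fields first and join afterwards, since `strings.FieldsFunc` drops empty fields while `str::replace` substitutes each whitespace character one-for-one. This is only a sketch: it assumes `char::is_whitespace` is an acceptable stand-in for Go's `unicode.IsSpace`, and, like the Go version, it leaves a single space after a `\n` if the input had spaces there.

```rust
fn magic(s: &str) -> String {
    // Split on every whitespace character except '\n', like the Go FieldsFunc callback.
    s.split(|c: char| c != '\n' && c.is_whitespace())
        // FieldsFunc never returns empty fields, so drop them here as well.
        .filter(|field| !field.is_empty())
        .collect::<Vec<_>>()
        // Equivalent of strings.Join(fields, " ").
        .join(" ")
        // Equivalent of strings.ReplaceAll(joined, " \n", "\n").
        .replace(" \n", "\n")
}
```

Because a newline is not treated as a separator, it stays attached to the neighbouring words, and the final `replace` only cleans up the space that `join` puts in front of it.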

You're perfectly right! I updated the question.

These are amazing!

I've benchmarked

The case " " => " " was wrong. How can I fix this? Should I only remove the else branch?

I never liked them and I don't find them readable. Also I am very afraid that they may be slower, but I have to benchmark to learn more.

Do you have any idea what a regex might look like for this case?

Thanks but this doesn't handle case like:

assert_eq!(magic("   ".to_string()), "");

Here is a solution using regex.

use regex::Regex;

fn magic(s: &str) -> String {
    let re = Regex::new("(?m)^ +| +$| +( )").unwrap();
    re.replace_all(s, "$1").into_owned()
}

It doesn't handle tabs because no test cases had tabs, but it should be easy to add tabs as appropriate.

I believe so, yeah.
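
For reference, dropping the `else if !line.is_empty()` branch from the loop-based `magic` would give roughly the following. Treat it as a sketch that has only been checked against the updated tests in the question, not as a complete answer to every requirement.

```rust
fn magic(istr: &str) -> String {
    let mut output = String::new();
    for line in istr.lines() {
        let mut blank = true;
        output.extend(
            line.split_whitespace()
                // Record that this line contained at least one word.
                .inspect(|_| blank = false)
                // Emit each word followed by a separating space.
                .flat_map(|word| [word, " "]),
        );

        if !blank {
            // Remove the extra trailing ' '.
            output.pop();
        }
        // A whitespace-only line now contributes nothing but its newline.
        output.push('\n');
    }

    // Remove the trailing '\n' (a no-op when the input had no lines).
    output.pop();

    output
}
```

With this change `magic("  ")` returns an empty string instead of a single space.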

Reading regexes is always a challenge, and you are not the only one who finds it hard. However, you can use a regex parser/explainer to see what one is doing (something like https://regex101.com/, for example). You should probably also write a comment describing what the regex is meant to do.
As for performance, regular expressions are often faster than chaining String search-and-replace methods. Regular expressions are a great tool and you shouldn't avoid them, especially when a regex would save you from writing a ton of code.

I created this SO question to get a good regex with these requirements, can you help me @sgrey, @tczajka, @jkugelman?

  1. two or more spaces not in a string removed
  2. two or more spaces in a string reduced to one (ex: " text other " -> "text other")
  3. one or more spaces removed after and before characters such as:
    1. \n
    2. \r\n
    3. \t
  4. replace \r\n with \n

I tried with +|\\n +|\t +\\r\n .+ but obviously this doesn't fully work.

We can use the test cases below to check that it works:

assert_eq!(not_useful_space("   "), "");
assert_eq!(not_useful_space("    a l    l   lower      "), "a l l lower");
assert_eq!(not_useful_space("    i need\n new lines\n\n many times     "), "i need\nnew lines\n\nmany times");
assert_eq!(not_useful_space("    i need  \n new lines \n\n many times     "), "i need\nnew lines\n\nmany times");
assert_eq!(not_useful_space("  i need \r\n new lines\r\nmany times   "), "i need\nnew lines\nmany times");
assert_eq!(not_useful_space("    i need \t new lines\t \t many times     "), "i need new lines many times");
assert_eq!(not_useful_space("  à   la  "), "à la");
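
For comparison, these particular test cases can also be met with the standard library alone. The following is a sketch rather than a tuned implementation: it relies on `str::lines` splitting on both `\n` and `\r\n`, and on `split_whitespace` treating tabs like spaces, which is what the tab test above expects.

```rust
fn not_useful_space(input: &str) -> String {
    input
        // `lines` splits on "\n" and "\r\n", so "\r\n" ends up as "\n" after the join.
        .lines()
        // Collapse every run of whitespace (spaces, tabs, ...) inside a line to one space;
        // this also trims the start and end of each line.
        .map(|line| line.split_whitespace().collect::<Vec<_>>().join(" "))
        .collect::<Vec<_>>()
        .join("\n")
}
```

One behaviour to be aware of: `lines` ignores a single trailing newline, so `"a\n"` comes back as `"a"`.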

This doesn't handle all the test cases: How to remove useless space in Rust string without using regex - #14 by frederikhors

Thanks. I'm benchmarking both.

On that SO question a user answered with this:

function not_useful_space(str) {
  return str.replace(/^[ \t]+|[ \t]+$|\r|([ \t]){2,}/mg, '$1');
}

which apparently works in JavaScript but not in Rust: Rust Playground

Do you know why?

Weren't you already given many solutions here? I think that this function by @tczajka works for your proposed cases, no?

Actually, isn't this incorrect? You mentioned before that you want to leave tabs in the string, but this one will remove tabs.

I found a solution: regexp replace - Regex for useless space in form's inputs - Stack Overflow.

Thank you all! I'm trying to benchmark the solutions now.