How to remove line breaks and spaces faster

Hello all,

I have to remove all newlines and all whitespace between > and < from a 2MB text file. My naive solution is shown below and takes about 400ms on my laptop:

let RE_SPACES: Regex = Regex::new(r">\s+?<").unwrap();

// remove all newlines and carriage returns
let text: String = text.replace('\n', "").replace('\r', "");

// remove all whitespace between > and <
let text: String = RE_SPACES.replace_all(&text, "><").to_string();

I am looking for a better and faster solution that does not create so many copies. Does anyone have a hint on how to make this faster?

Thanks in advance,

Maciej

This should be close to the fastest you can do, assuming you need Unicode support:

let mut s = std::fs::read_to_string(path)?;
s.retain(|c| !c.is_whitespace());

Or, if you only mean ASCII whitespace:

let mut s = std::fs::read(path)?;
s.retain(|c| !c.is_ascii_whitespace());
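To make the behavior of the two `retain` calls concrete, here is a small sketch that runs them on in-memory data instead of a file. The helper names `strip_all_whitespace` and `strip_ascii_whitespace` are made up for illustration. Note that `retain` drops every whitespace character, including whitespace inside text content, which is broader than "between > and <".

```rust
/// Removes every Unicode whitespace character, in place (no extra copy).
fn strip_all_whitespace(s: &mut String) {
    s.retain(|c| !c.is_whitespace());
}

/// Removes every ASCII whitespace byte, in place (no extra copy).
fn strip_ascii_whitespace(bytes: &mut Vec<u8>) {
    bytes.retain(|b| !b.is_ascii_whitespace());
}

fn main() {
    let mut s = String::from("<a>\r\n  <b>hello world</b>\n</a>");
    strip_all_whitespace(&mut s);
    // The space inside "hello world" is removed as well.
    println!("{s}"); // prints "<a><b>helloworld</b></a>"

    let mut bytes = b"<a>\r\n  <b>hello world</b>\n</a>".to_vec();
    strip_ascii_whitespace(&mut bytes);
    assert_eq!(bytes, s.as_bytes());
}
```

If whitespace inside text nodes must survive, this approach is too aggressive and a more targeted pass is needed.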

First: your code doesn't even do what you claim you want it to do: it's replacing all \r and \n, even if they're not between ><.

Second: are you compiling with --release?

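Third: you can handle newlines with the regex alone. \s already matches \r and \n, so the separate replace calls are not needed for whitespace between > and <. (The (?m) multi-line flag used in the benchmark only changes how ^ and $ match, so it is harmless here but not required.)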
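On the first point, a regex-free alternative is a single pass that drops a whitespace run only when it sits directly between a > and a <. The sketch below uses only the standard library; the function name `strip_between_tags` is hypothetical, and it has not been benchmarked against the regex version, so treat it as an illustration of the intended behavior rather than a proven speedup.

```rust
/// Copies `input` into a new String, dropping each run of whitespace that
/// sits directly between a '>' and a '<'. All other whitespace is kept.
fn strip_between_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find('>') {
        // Copy everything up to and including the '>' ('>' is one byte,
        // so pos + 1 is always a valid char boundary).
        let (head, tail) = rest.split_at(pos + 1);
        out.push_str(head);
        // Look past any whitespace that follows the '>'.
        let trimmed = tail.trim_start();
        rest = if trimmed.starts_with('<') {
            trimmed // the run ends at '<': drop it
        } else {
            tail // the run belongs to text content: keep it
        };
    }
    out.push_str(rest);
    out
}

fn main() {
    let cleaned = strip_between_tags("<a>\r\n  <b> hello world </b>\n</a>");
    println!("{cleaned}"); // prints "<a><b> hello world </b></a>"
}
```

This allocates one output buffer and leaves newlines inside text content alone, which matches the stated goal rather than the behavior of the original replace calls.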

With a simple benchmark:

#![feature(test)]

use regex; // 1.5.4
extern crate test;

use regex::Regex;
use std::iter;
use test::{black_box, Bencher};

#[bench]
fn with_regex_amortized(b: &mut Bencher) {
    let re = Regex::new(r"(?m)>\s+?<").unwrap();
    let s: String = iter::repeat(
        iter::once('>')
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars())
            .chain(iter::once('<'))
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars()),
    )
    .take(2000)
    .flatten()
    .collect();

    b.iter(|| re.replace_all(black_box(&s), "><").to_string())
}

#[bench]
fn with_regex(b: &mut Bencher) {
    let s: String = iter::repeat(
        iter::once('>')
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars())
            .chain(iter::once('<'))
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars()),
    )
    .take(2000)
    .flatten()
    .collect();

    b.iter(|| {
        let re = Regex::new(r"(?m)>\s+?<").unwrap();
        re.replace_all(black_box(&s), "><").to_string()
    })
}

#[bench]
fn maciej(b: &mut Bencher) {
    let re = Regex::new(r"(?m)>\s+?<").unwrap();
    let s: String = iter::repeat(
        iter::once('>')
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars())
            .chain(iter::once('<'))
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars()),
    )
    .take(2000)
    .flatten()
    .collect();

    b.iter(|| {
        let text: String = black_box(&s).replace('\n', "").replace('\r', "");
        re.replace_all(&text, "><").to_string()
    })
}

On my machine I get:

test maciej               ... bench:   5,020,080 ns/iter (+/- 749,409)
test with_regex           ... bench:   3,456,975 ns/iter (+/- 288,334)
test with_regex_amortized ... bench:   3,306,730 ns/iter (+/- 91,617)

That is, your implementation takes about 5.0ms (±0.7ms) to clean my very silly approx. 2MB test string, and just using (?m)>\s+?< takes about 3.3ms (±0.1ms) (3.5ms (±0.3ms) if you include Regex creation time).

I suspect this is a case of not measuring what you think you are.
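If you want to check your own numbers without the nightly-only test crate, a rough timing loop on stable Rust might look like the sketch below. The helper `time_it` is hypothetical, the workload is the `retain` approach from earlier in the thread, and the input imitates the shape of the benchmark string above. Run it with `cargo run --release`; a debug build can easily be many times slower.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the average wall-clock time per call.
/// `iters` must be greater than zero.
fn time_it<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the result.
        black_box(f());
    }
    start.elapsed() / iters
}

fn main() {
    // Roughly 2 MB of input: '>' + 500 spaces + CRLF + '<' + 500 spaces + CRLF.
    let unit = format!(">{}\r\n<{}\r\n", " ".repeat(500), " ".repeat(500));
    let input = unit.repeat(2000);

    let per_call = time_it(20, || {
        // The clone is part of the measured time in this sketch.
        let mut s = black_box(&input).clone();
        s.retain(|c| !c.is_whitespace());
        s
    });
    println!("{} bytes, {:?} per call", input.len(), per_call);
}
```

A loop like this is much cruder than a real benchmark harness, but it is enough to tell a 400ms result from a 5ms one.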


Hello CAD97,

Thank you very much. You are right, I tested it in debug mode. The hint about multi-line mode helped me a lot to remove the unneeded replace calls.

Thank you very much, this function was new to me and helped to increase performance.