How to remove line breaks and spaces faster

Hello all,

I need to remove all new lines and all spaces between > < from a 2MB text file. My naive solution is shown below and takes about 400ms on my laptop:

let RE_SPACES: Regex = Regex::new(r">\s+?<").unwrap();

// remove all new lines and CRs
let text : String = text.replace('\n', "").replace('\r', "");

// remove all whitespace between > and <
let text: String = RE_SPACES.replace_all(&text, "><").to_string();

I am seeking a better and faster solution without creating so many copies. Does an expert have a hint on how to make this faster?

Thanks in advance,

Maciej

This should be close to the fastest you can do, assuming Unicode support:

let mut s = std::fs::read_to_string(path)?;
s.retain(|c| !c.is_whitespace());

Or, if you only mean ASCII whitespace (this reads the file as bytes, so s is a Vec<u8>):

let mut s = std::fs::read(path)?;
s.retain(|c| !c.is_ascii_whitespace());
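Note that retain drops every whitespace character, including spaces inside text content. If only the whitespace runs sitting directly between > and < should go, a single pass over the string without a regex might look like the sketch below. The function name strip_between_tags is hypothetical (it is not from this thread), and the sketch assumes "whitespace" means what char::is_whitespace accepts.

```rust
/// Hypothetical helper (not from the thread): removes runs of whitespace
/// that sit directly between a '>' and the next '<', and copies
/// everything else unchanged into a single output buffer.
fn strip_between_tags(input: &str) -> String {
    // One allocation up front; the output is never longer than the input.
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(pos) = rest.find('>') {
        // Copy everything up to and including the '>'.
        let (head, tail) = rest.split_at(pos + 1);
        out.push_str(head);

        // Look past any whitespace that follows the '>'.
        let trimmed = tail.trim_start();
        if trimmed.len() < tail.len() && trimmed.starts_with('<') {
            // A whitespace run directly followed by '<': skip the run.
            rest = trimmed;
        } else {
            // Not followed by '<': keep the whitespace as it is.
            rest = tail;
        }
    }

    // Copy whatever is left after the last '>'.
    out.push_str(rest);
    out
}
```

For example, strip_between_tags("<a>  \r\n <b>x y</b>\n</a>") gives "<a><b>x y</b></a>": the space in "x y" survives because it is not between > and <.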

First: your code doesn't even do what you claim you want it to do: it's replacing all \r and \n, even if they're not between ><.

Second: are you compiling with --release?

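Third: You can handle newlines with the regex itself, without the separate replace calls: \s already matches \r and \n. (The (?m) multi-line flag I use below only changes how ^ and $ match, so it isn't strictly needed for this pattern.)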

With a simple benchmark:

Benchmark
#![feature(test)]

use regex; // 1.5.4
extern crate test;

use regex::Regex;
use std::iter;
use test::{black_box, Bencher};

#[bench]
fn with_regex_amortized(b: &mut Bencher) {
    let re = Regex::new(r"(?m)>\s+?<").unwrap();
    let s: String = iter::repeat(
        iter::once('>')
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars())
            .chain(iter::once('<'))
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars()),
    )
    .take(2000)
    .flatten()
    .collect();

    b.iter(|| re.replace_all(black_box(&s), "><").to_string())
}

#[bench]
fn with_regex(b: &mut Bencher) {
    let s: String = iter::repeat(
        iter::once('>')
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars())
            .chain(iter::once('<'))
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars()),
    )
    .take(2000)
    .flatten()
    .collect();

    b.iter(|| {
        let re = Regex::new(r"(?m)>\s+?<").unwrap();
        re.replace_all(black_box(&s), "><").to_string()
    })
}

#[bench]
fn maciej(b: &mut Bencher) {
    let re = Regex::new(r"(?m)>\s+?<").unwrap();
    let s: String = iter::repeat(
        iter::once('>')
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars())
            .chain(iter::once('<'))
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars()),
    )
    .take(2000)
    .flatten()
    .collect();

    b.iter(|| {
        let text: String = black_box(&s).replace('\n', "").replace('\r', "");
        re.replace_all(&text, "><").to_string()
    })
}

On my machine I get

test maciej               ... bench:   5,020,080 ns/iter (+/- 749,409)
test with_regex           ... bench:   3,456,975 ns/iter (+/- 288,334)
test with_regex_amortized ... bench:   3,306,730 ns/iter (+/- 91,617)

That is, your implementation takes about 5.0ms (±0.7ms) to clean my very silly approx. 2MB test string, and just using (?m)>\s+?< takes about 3.3ms (±0.1ms) (3.5ms (±0.3ms) if you include Regex creation time).

I suspect this is a case of not measuring what you think you are.
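One way to check what is actually being measured, without the nightly-only bench harness, is to time just the cleaning step with std::time::Instant in a release build. This is only a rough sketch: sample_input and time_once are hypothetical helper names, the input merely imitates the shape of the benchmark string above, and a single timed run is far noisier than a proper benchmark.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: builds a string shaped like the benchmark input,
/// i.e. '>' + 500 spaces + "\r\n" + '<' + 500 spaces + "\r\n", repeated.
/// With blocks = 2000 this is about 2 MB.
fn sample_input(blocks: usize) -> String {
    let block = format!(">{pad}\r\n<{pad}\r\n", pad = " ".repeat(500));
    block.repeat(blocks)
}

/// Hypothetical helper: runs `clean` once on `input` and returns its
/// output together with the wall-clock time of that single call.
/// File reading and regex construction stay outside the measurement
/// as long as they happen before this function is called.
fn time_once<F>(input: &str, clean: F) -> (String, Duration)
where
    F: FnOnce(&str) -> String,
{
    let start = Instant::now();
    let output = clean(input);
    (output, start.elapsed())
}
```

Calling time_once(&sample_input(2000), |s| s.replace('\n', "").replace('\r', "")) and printing the returned Duration with {:?} under cargo run --release, then again under a plain cargo run, should make the gap between debug and release builds visible.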

Hello CAD97,

Thank you very much. You are right, I was testing in debug mode. The multi-line hint helped me a lot to remove the unneeded replacements.

Thank you very much, this function was new to me and helped to increase performance.
