I have to remove all newlines and all whitespace between > and < from a 2MB text file. My naive solution is shown below; it takes about 400ms on my laptop:
let RE_SPACES: Regex = Regex::new(r">\s+?<").unwrap();
// remove all new lines and CRs
let text: String = text.replace('\n', "").replace('\r', "");
// remove all whitespace between > and <
let text: String = RE_SPACES.replace_all(&text, "><").to_string();
I am looking for a better, faster solution that doesn't create so many copies. Does anyone have a hint on how to make this faster?
First: your code doesn't do what you say you want it to do: it removes every \r and \n, even those that aren't between > and <.
Second: are you compiling with --release?
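Third: you can handle the newlines in the regex itself, since \s already matches \r and \n, so the two separate replace calls aren't needed. (The (?m) flag used in the benchmark only changes how ^ and $ match, so it has no effect on this pattern.)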
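To illustrate the first point and the "fewer copies" part of the question, here is a minimal std-only sketch, not a drop-in replacement for the regex. The function name `strip_between_tags` is made up for this example. It walks the input once, writes into a single pre-sized `String`, and only drops whitespace that sits directly between a > and a <. I haven't benchmarked it against the regex versions.

```rust
/// Hypothetical helper: removes runs of whitespace that sit directly
/// between '>' and '<', in one pass and with a single output allocation.
/// Whitespace anywhere else (e.g. inside text content) is left untouched.
fn strip_between_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find('>') {
        // Copy everything up to and including the '>'.
        let (head, tail) = rest.split_at(pos + 1);
        out.push_str(head);
        // Look past any whitespace that follows the '>'.
        let trimmed = tail.trim_start();
        if trimmed.starts_with('<') {
            // The whitespace run was between '>' and '<': drop it.
            rest = trimmed;
        } else {
            // Not followed by '<': keep the whitespace as it was.
            rest = tail;
        }
    }
    // Copy whatever follows the last '>'.
    out.push_str(rest);
    out
}
```

For example, `strip_between_tags("<a>\r\n  <b>x\ny</b> \n</a>")` keeps the newline between `x` and `y`, whereas removing every `\n` up front would merge them into `xy`.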
With a simple benchmark:
#![feature(test)]
use regex; // 1.5.4
extern crate test;
use regex::Regex;
use std::iter;
use test::{black_box, Bencher};
#[bench]
fn with_regex_amortized(b: &mut Bencher) {
    let re = Regex::new(r"(?m)>\s+?<").unwrap();
    let s: String = iter::repeat(
        iter::once('>')
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars())
            .chain(iter::once('<'))
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars()),
    )
    .take(2000)
    .flatten()
    .collect();
    b.iter(|| re.replace_all(black_box(&s), "><").to_string())
}
#[bench]
fn with_regex(b: &mut Bencher) {
    let s: String = iter::repeat(
        iter::once('>')
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars())
            .chain(iter::once('<'))
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars()),
    )
    .take(2000)
    .flatten()
    .collect();
    b.iter(|| {
        let re = Regex::new(r"(?m)>\s+?<").unwrap();
        re.replace_all(black_box(&s), "><").to_string()
    })
}
#[bench]
fn maciej(b: &mut Bencher) {
    let re = Regex::new(r"(?m)>\s+?<").unwrap();
    let s: String = iter::repeat(
        iter::once('>')
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars())
            .chain(iter::once('<'))
            .chain(iter::repeat(' ').take(500))
            .chain("\r\n".chars()),
    )
    .take(2000)
    .flatten()
    .collect();
    b.iter(|| {
        let text: String = black_box(&s).replace('\n', "").replace('\r', "");
        re.replace_all(&text, "><").to_string()
    })
}
On my machine I get:
test maciej ... bench: 5,020,080 ns/iter (+/- 749,409)
test with_regex ... bench: 3,456,975 ns/iter (+/- 288,334)
test with_regex_amortized ... bench: 3,306,730 ns/iter (+/- 91,617)
That is, your implementation takes about 5.0ms (±0.7ms) to clean my very silly approx. 2MB test string, and just using (?m)>\s+?< takes about 3.3ms (±0.1ms) (3.5ms (±0.3ms) if you include Regex creation time).
I suspect this is a case of not measuring what you think you are.
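One way to check that, sketched with only the standard library: time each stage on its own, so that file I/O isn't lumped in with the cleaning step. The names `timed` and `measure_stages` are made up for this example, and the cleaning step here is just a stand-in (the two `replace` calls from the question); swap in whatever you actually run. Build with --release before trusting the numbers.

```rust
use std::fs;
use std::io;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` once and returns its result together
/// with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

/// Hypothetical helper: times reading the file separately from cleaning it.
/// Returns (read time, clean time).
fn measure_stages(path: &str) -> io::Result<(Duration, Duration)> {
    let (text, read_time) = timed(|| fs::read_to_string(path));
    let text = text?;
    // Stand-in for the cleaning step; replace with the real regex-based code.
    let (cleaned, clean_time) = timed(|| text.replace('\n', "").replace('\r', ""));
    // Print the result size so the work is actually used.
    println!("cleaned {} bytes", cleaned.len());
    println!("read: {:?}, clean: {:?}", read_time, clean_time);
    Ok((read_time, clean_time))
}
```

If the read time dominates, or the clean time shrinks drastically in a release build, the 400ms was not the regex.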