Questions about performance in regards to my generic regular expressions learning project

I have made a crate for regular expression finding and replacing as a learning project for my first week with Rust.

Now I can compare it against the available regex crate. I do such by taking the benchmark in https://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexredux&lang=rust&id=1 as reference, using an input of 5000000 elements.

With regex it takes 3.51 seconds, and with my own one it takes 22.84 seconds.

Some notes of the two crates

  • Mine works over generic types. It has a few particular methods for some optimizations for char and u8.
  • regex uses SIMD instructions.
  • Both use an algorithm with lineal complexity on the input size.

Some questions.

  • Does the use of generics reduce the performance?
  • Is the difference in performance explained just by the use of SIMD instructions by the regex crate?
  • How is it the current state of SIMD instructions on Rust?
  • Does my crate have any interest for anyone else? I have just made it to learn about Rust's intricacies, but it could have some purpose.

Also, I want to thank this community for attending all questions I had during the development of this crate.

Creating a regular expression engine for fun is an awesome thing to do. I'd encourage anyone with interest to do it. Rust's regex crate started out as an insatiable itch of mine, and has grown quite a bit since then. :slight_smile: I'll do my best to answer your questions, but the lack of details provided on your part basically makes it impossible to give you anything concrete.

I think this question is too broad to answer, and in particular, it's impossible to give even a reasonably accurate answer if we can't see your code.

If your code uses parametric polymorphism to be generic, then that, in and of itself, won't cause performance to suffer. However, if you're using a generic algorithm that isn't specialized to the specific properties of the thing you're searching, then you will miss out on potential optimizations. For example, if you're searching a &[u8] or a &str, then you can apply memchr, which typically uses vector instructions. But if you limit yourself to searching an Iterator<Item=char>, then you can't apply memchr to that directly, as one example.

Since you didn't share any details of your performance comparison, this question is impossible to answer. The regex crate uses two different types of SIMD optimizations today:

  1. In some cases, it applies memchr on a byte that the regex crate knows must appear in a match. This tends to be vectorized, but ultimately depends on how your platform's libc is implemented and whether the memchr crate uses it or not. Again, since you didn't provide any details about your comparison, it's impossible to even know which implementation your comparison uses.
  2. In some other cases, specifically when there is a small number of small literals where at least one must match, then it's possible for the Teddy algorithm to be applied, which also uses a vectorized algorithm. But this only works on nightly Rust and when you provide the requisite compile time flags to the regex crate. Once again, since you didn't share the details of your comparison, it's impossible to say whether this optimization is being applied in your case. One thing worth noting is that since the benchmark game exclusively uses stable Rust (justifiably so), it is impossible for this particular optimization to have any effect.

What the regex crate does today is just barely scratching the surface of what SIMD can bring to regex engines.

With all of that said, focusing on SIMD here is a classic example of missing the forest for the trees. For these optimizations to even apply in the first place, your regex library needs to be capable of literal extraction. And even then, it's still missing details about the regex engines themselves. For example, Rust's regex engine actually has several different engines, where the split is mostly motivated by performance. There are more details here and here.

It requires nightly Rust at the moment. There is an RFC recently submitted that is the result of more than a year of work and appears close to merging. The code powering this is in the stdsimd repo and is on its way to getting into std as an unstable feature.

Previously, most SIMD code in Rust (including the regex crate, currently) used a combination of the simd crate (which only works on nightly Rust) and Rust's unstable platform intrinsics feature.

I don't see how anyone could answer this without being able to see it. In practice, building a production grade regex engine is a lot of work, not just in optimizations, but providing all the features that most come to expect from regex engines. I think there might be some demand for a regex crate that is faster to compile, even if it means runtime performance drops. This is something that's on my mind to do, but the regex crate internals must first be refactored (which I've already been working on for a year).

Note that the regex-redux benchmark only covers a very small slice of what a regex engine can do. To get a clearer picture (but still, perhaps, not completely accurate), you might hook into the regex crate's benchmark harness, which already handles PCRE1, PCRE2, D's regex library, RE2, Tcl and Oniguruma. You can see an example of how the benchmarks are defined here: https://github.com/rust-lang/regex/blob/9ee9943ec81c560978a51159d08ef670bb6d5d37/bench/src/sherlock.rs

4 Likes

Great to have here the developer of the regex crate. And yes, it has been a lot of fun to create it.

I did not want to overwhelm people with details in the first post.

The source to my engine is an Iterator, even when Item is char or u8, so I am not exploiting the vectorization of memchr.

I am using stable Rust, so the use of SIMD is not a factor.

You are right, I have been missing the forest. My problem must be on using an iterator as source, which makes impossible to apply vectorization.

The friend of mine that introduced me to Rust showed me the regex-redux benchmark (and the others of that site). Indeed, that gave me the idea of making the engine. I already know that it demonstrates a very small piece of features. They do not even require parenthesis (which I include, by the way). Also, it has been just one week; it cannot have everything.

I am not sure what other details could be of interest. The generic work flow is as follows. First, a regular expression is parsed into a tree of enums. Then the expression tree can be compiled into a state machine. Finally, one can give an iterator (two for replacing) to the state machine for finding/replacing instances of the regular expression.

I do not see this crate as a production grade regex engine, nor as a competitor of the regex crate. I have a very small set of features implemented. It can compile in about 9 seconds. Perhaps working over generic types has some interest? Or maybe I should just leave it alone and spent my time into something more constructive?

1 Like

I have made a few optimizations using your ideas, namely:

  • Allow to receive a '&Vec<T>'
  • Optimize the search of the first elements possible on the regular expression

Again, he regex-redux benchmark with the regex crate

        Command being timed: "cargo run --release"
        User time (seconds): 3.51
        System time (seconds): 0.08
        Percent of CPU this job got: 163%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.19
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 206740
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 58447
        Voluntary context switches: 137
        Involuntary context switches: 238
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

And using mine

        Command being timed: "cargo run --release"
        User time (seconds): 18.02
        System time (seconds): 0.10
        Percent of CPU this job got: 243%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.43
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 251444
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 69180
        Voluntary context switches: 78
        Involuntary context switches: 2134
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

I am not yet using memchr. It seems hard to apply for generic types. Is it possible to execute some code depending on if a type is char/u8?, something as

	fn next(&mut self) -> Option<T>
	{
		...
		if T==u8
		{
			fast_forward_with_memchr(...);
		}
		else
		{
			generic_fast_forward(...);
		}
	}

Yes, with traits. See the book: Traits: Defining Shared Behavior - The Rust Programming Language

I must be missing something. Looking at a sea of RFCs I found that the following works

#![feature(specialization)]
trait Searchable
{
	fn talk();
}

default impl<T> Searchable for T
{
	fn talk()
	{
		println!("Default talk");
	}
}
impl Searchable for u8
{
	fn talk()
	{
		println!("u8 talk");
	}
}

fn main()
{
	char::talk();
	u8::talk();
}

Is this what you meant? I do not see it in the book. Also the compiler says that "specialization is unstable", so I am in doubt about using it.

Uh, I don't see why you're trying to use specialization. Try impl Searchable for char instead. But, I don't really know how to guide you because you still haven't shared any code.

Very well, here is the code of the finder iterator that uses vectors as input. I hope it is clear enough.

impl<'a,T:'a+Clone,S:StateMachine<Item=T>+'a> Iterator for FinderVecIt<'a,T,S>
{
	type Item=Vec<T>;
	fn next(&mut self) -> Option<Vec<T>>
	{
		//println!(">>FinderVecIt::next");
		loop
		{
			if self.source_index==self.source.len()
			{
				return None;
			}
			if self.state.index==0
			{
				//Fast forward
				while !self.machine.zero_state_may_be_first(self.source[self.source_index].clone())
				{
					self.source_index+=1;
					if self.source_index==self.source.len()
					{
						return None;
					}
				}
				self.source_flush=self.source_index;
			}
			//let elem=&self.source[self.source_index];
			//let (news,action)=self.machine.trans(self.state.index,elem.clone());
			let (news,action)=self.machine.trans(self.state.index,self.source[self.source_index].clone());
			self.source_index+=1;
			self.state.index=news;
			//println!("action={:?}",action);
			match action
			{
				Action::Keep(v) =>
				{
					let nkept=self.state.kept.len();
					let added_cand=v.len()>nkept;
					let amount=self.state.kept[0];
					let mut i=0;
					for j in 0..nkept
					{
						if v[j]
						{
							self.state.kept[i]=self.state.kept[j]+1;
							i+=1;
						}
					}
					self.state.kept.resize(i,0);
					let new_amount=self.state.kept[0];
					if added_cand
					{
						self.state.kept.push(if v[v.len()-1] {1} else {0});
					}
					if new_amount<amount+1
					{
						//println!("leaking some data {}<{}+1",new_amount,amount);
						let flushing=amount-new_amount+1;
						self.source_flush+=flushing;
					}
				},
				Action::Flush =>
				{
					//println!("::ReplacerIt::next begin flush");
					let flushing=self.state.kept[0]+1;
					self.state.kept.clear();
					self.state.kept.push(0);
					self.source_flush+=flushing;
				},
				Action::Match(cand) =>
				{
					//self.state.store.clear();
					let amount=self.state.kept[cand]+1;
					let flushing=self.state.kept[0]-self.state.kept[cand];
					self.source_flush+=flushing;
					self.state.kept.clear();
					self.state.kept.push(0);
					//return (0..amount).map(|_|self.source_flush.next()).collect();
					let r=self.source[self.source_flush..self.source_flush+amount].to_vec();
					self.source_flush+=amount;
					return Some(r);
				},
			};
		}
	}
}

When self.state.index==0 we are before matching any element of the regular expression, so we may skip elements of the input until we find one which can be the first one on the regular expression. The question is, to use a special method for u8 (calling memchr). If there were a boolean expression "T is u8" I could write this.

			if self.state.index==0
			{
				//Fast forward
				if T is u8
				{
					let x=use_memchr_to_find_next();
					self.source_index+=x
				}
				else
				{
					while !self.machine.zero_state_may_be_first(self.source[self.source_index].clone())
					{
						self.source_index+=1;
						if self.source_index==self.source.len()
						{
							return None;
						}
					}
				}
				self.source_flush=self.source_index;
			}

With specializations is clearly possible. I could define the default method to return false and to u8 to return true. Then it would be simply "T::is_u8()".

You are starting with answer ("I want a boolean expression in code") and working your way backwards. Instead, start with your problem ("I want to execute different code based on the type") and work your way forwards.

For example:

fn do_something<T: Searchable>(t: T) {
    t.dosomething();
}

trait Searchable {
    fn dosomething(&self);
}

impl Searchable for u8 {
    fn dosomething(&self) {
        println!("u8: {:?}", self);
    }
}

impl Searchable for char {
    fn dosomething(&self) {
        println!("char: {:?}", self);
    }
}

fn main() {
    do_something(5);
    do_something('a');
}

Link to playground: Rust Playground

Since you are likely searching a &[u8], then you probably need to put your impls on &[u8] and maybe &[char] (or an iterator of char) instead.

Again, you didn't actually share all of your code. I can't compile it. I can't run it. I certainly don't know all of the types in your code, so I will repeat: it is extraordinarily hard to help someone when they don't provide the code.

1 Like

The only bad part of this is that anyone who would want to use my crate in a particular type of theirs would need to implement first the trait on their type. I suppose is not much of a problem, anyway.

Yes, I understand that, but the code is relatively large (1800 lines, but many commented out). Perhaps it would be easier if I just upload my crate. I did not want to upload it until knowing if it is useful, to not pollute the repository of crates.

You don't need to use crates.io to share code. Any git hosting service, or even a tar archive, will do.

I decided to upload the crate. Here https://crates.io/crates/generic_regex

I thought that it would create also a github repository, but it does not seem to be the case. I have null experience on uploading packages.

Of course, I appreciate any comments about it.

The only thing crates.io uses github for is logins; it won't do anything else for you.