Fast fuzzy string matching

I am maintaining libraries for fast fuzzy string matching in Python and C++. This includes implementations of well known algorithms like the Levenshtein distance. In most cases these are bitparallel algorithms that allow matching multiple characters in parallel. Over the last couple of months I had a couple of people reach out to me asking for a rust version of the library. Since I have heard a lot about the benefits of rust over the last years, I used this as an excuse to finally give it a try :slight_smile:

Since I haven't touched the language before, some kind of review would certainly be helpful. I am especially interested in review of the public API to reduce breaking changes down the line, but am interested in any suggestions for improvements.

The code can be found here: GitHub - maxbachmann/rapidfuzz-rs: Rapid fuzzy string matching in Rust using various string metrics. I currently have a couple of clippy warnings around wraparounds / precision in casting disabled, since I do not see a particularly good way of solving them. Essentially I sometimes need to convert usize to isize/f64 and back. This would only be an issue when someone tries to match sequences with a length of around min(52 bits, usize bits - 1), which in practice shouldn't be a problem.

API description

I try to have a very similar API for most of the algorithms which looks like this:

pub fn distance<Iter1, Iter2, Elem1, Elem2, ScoreCutoff, ScoreHint>(
    s1: Iter1,
    s2: Iter2,
    score_cutoff: ScoreCutoff,
    score_hint: ScoreHint,
) -> Option<usize>
where
    Iter1: IntoIterator<Item = Elem1>,
    Iter1::IntoIter: DoubleEndedIterator + Clone,
    Iter2: IntoIterator<Item = Elem2>,
    Iter2::IntoIter: DoubleEndedIterator + Clone,
    Elem1: PartialEq<Elem2> + HashableChar + Copy,
    Elem2: PartialEq<Elem1> + HashableChar + Copy,
    ScoreCutoff: Into<Option<usize>>,
    ScoreHint: Into<Option<usize>>,
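As a sketch of how callers interact with this signature, here is a self-contained mock with the same score_cutoff/score_hint shape. The metric here is a stand-in (positional mismatches plus length difference), not the crate's Levenshtein implementation; the point is how `Into<Option<usize>>` lets callers pass either a bare integer or None:

```rust
// Mock of the described API shape; the metric is a stand-in, not rapidfuzz-rs.
fn distance<ScoreCutoff, ScoreHint>(
    s1: &str,
    s2: &str,
    score_cutoff: ScoreCutoff,
    score_hint: ScoreHint,
) -> Option<usize>
where
    ScoreCutoff: Into<Option<usize>>,
    ScoreHint: Into<Option<usize>>,
{
    let _hint = score_hint.into();
    // stand-in metric: positional mismatches plus length difference
    let dist = s1.chars().zip(s2.chars()).filter(|(a, b)| a != b).count()
        + s1.chars().count().abs_diff(s2.chars().count());
    match score_cutoff.into() {
        Some(cutoff) if dist > cutoff => None,
        _ => Some(dist),
    }
}

fn main() {
    // a bare integer and None both work thanks to Into<Option<usize>>
    assert_eq!(distance("kitten", "sitten", None, None), Some(1));
    assert_eq!(distance("kitten", "sitten", 0, None), None);
}
```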

There are always multiple versions of this API to calculate a distance, a similarity or normalized versions of the two. In addition there is a struct to allow performing One x Many comparisons:

pub struct BatchComparator<Elem1> {...}

impl<Elem1> BatchComparator<Elem1>
where
    Elem1: HashableChar + Clone,
{
    pub fn new<Iter1>(s1: Iter1) -> Self
    where
        Iter1: IntoIterator<Item = Elem1>,
        Iter1::IntoIter: Clone;

    pub fn distance<Iter2, Elem2, ScoreCutoff, ScoreHint>(
        &self,
        s2: Iter2,
        score_cutoff: ScoreCutoff,
        score_hint: ScoreHint,
    ) -> Option<usize>
    where
        Iter2: IntoIterator<Item = Elem2>,
        Iter2::IntoIter: DoubleEndedIterator + Clone,
        Elem1: PartialEq<Elem2> + HashableChar + Copy,
        Elem2: PartialEq<Elem1> + HashableChar + Copy,
        ScoreCutoff: Into<Option<usize>>,
        ScoreHint: Into<Option<usize>>;
}


Some functions take a couple of extra arguments for algorithm dependent settings, like weights for the Levenshtein distance. The only algorithm that falls a bit out of line here is the Hamming distance, which currently returns an Error<Option<usize>>, since it fails when sequences of different length are passed.
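To illustrate the padding behavior discussed below, here is a toy Hamming distance. The signature and error type are hypothetical, not the crate's API; it only shows why different lengths are an error unless padding is enabled:

```rust
// Toy Hamming distance (illustrative, not the rapidfuzz-rs API).
fn hamming(s1: &str, s2: &str, pad: bool) -> Result<usize, &'static str> {
    let (a, b): (Vec<char>, Vec<char>) = (s1.chars().collect(), s2.chars().collect());
    if !pad && a.len() != b.len() {
        return Err("sequences have different lengths");
    }
    let mismatches = a.iter().zip(&b).filter(|(x, y)| x != y).count();
    // with padding, every position past the shorter string counts as a mismatch
    Ok(mismatches + a.len().abs_diff(b.len()))
}

fn main() {
    assert_eq!(hamming("karolin", "kathrin", false), Ok(3));
    assert_eq!(hamming("abc", "ab", false), Err("sequences have different lengths"));
    assert_eq!(hamming("abc", "ab", true), Ok(1));
}
```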

Pain points in the API

There are a couple of things I am not super happy with in this API:

  • the function returns an Option, since it returns None when the result is worse than the cutoff value passed into it. This makes sense when a cutoff value was provided. However when the cutoff isn't used, this just adds an extra unwrap/expect
  • hamming returning an Error<Option<usize>> is even uglier to me, since the user might tell the function to pad the shorter string, in which case strings of different length are not an error and they would have to call unwrap/expect on the Error as well. One option would be to join both the Option and the Error into just an Error, but this would still make it behave differently than everything else. Maybe still better than the current version though.
  • I am not really super happy with the combination of PartialEq and HashableChar. The algorithms have two requirements for any input:
    1. it has to be comparable

    2. I need to perform my own custom hash function on it to convert it to a value in the range i64::MIN - u64::MAX. This hash is basically just an identity function and solves two issues I had in rust:

      • char requires ch as u32 as u64 to convert to a u64
      • signed and unsigned types need special handling, since I can't just cast i64 to u64 without getting collisions in the internal hashmaps (In C++ this is handled using templated function overloads)
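A minimal sketch of such an identity-hash trait could look like this. The names (Hash, HashableChar, hash_char) are illustrative, not necessarily the actual rapidfuzz-rs definitions:

```rust
// Hypothetical sketch of an identity-hash trait; names are illustrative.
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
enum Hash {
    Unsigned(u64),
    Signed(i64),
}

trait HashableChar {
    fn hash_char(&self) -> Hash;
}

impl HashableChar for char {
    fn hash_char(&self) -> Hash {
        // char needs the double cast `as u32 as u64` mentioned above
        Hash::Unsigned(*self as u32 as u64)
    }
}

impl HashableChar for u8 {
    fn hash_char(&self) -> Hash {
        Hash::Unsigned(*self as u64)
    }
}

impl HashableChar for i64 {
    fn hash_char(&self) -> Hash {
        // signed types keep their own variant so i64/u64 keys can't collide
        Hash::Signed(*self)
    }
}

fn main() {
    assert_eq!('A'.hash_char(), Hash::Unsigned(65));
    assert_eq!(65u8.hash_char(), Hash::Unsigned(65));
    assert_eq!((-1i64).hash_char(), Hash::Signed(-1));
}
```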

However, having them stored in an enum, I have to be super careful how they are used so that the optimizer still largely optimizes them out again. E.g. I did a quick patch at some point to get rid of the PartialEq requirement, which would allow comparing things like Vec<u32> with &str, since I already define equality via the identity hash anyway. However this was significantly slower (I didn't have time to look into the reason yet and might just have butchered something). In the C++ implementation these are simply templates all the way through and this has never been an issue there. Not sure whether I am missing an obvious simpler way to achieve this though.

I think most users will never need to compare sequences for which rust does not provide PartialEq, but one of my personal uses for this is the Python wrapper of the C++ implementation. There I directly compare, for example, internal memory views of Python strings, which can be u8, u16 or u32 (even though in this specific case I would be the end user and could just live with implementing PartialEq for all these types)


I get largely the same performance as in the C++ implementation using LLVM when comparing byte strings. UTF-8 strings, which I expect from most users in rust, are a lot slower to iterate though. I made a couple of obvious improvements to reduce the number of iterations, since the C++ implementation was basically only used with random access iterators. Even with these improvements they still cause quite a large slowdown for short texts. For long texts the algorithms have to perform a lot more work anyway, so it doesn't really matter. E.g. for the Levenshtein distance:

  • around 1.5-2x the runtime of Chars vs Vec<char> for texts with <64 chars
  • around 1.05x the runtime of Chars vs Vec<char> for texts with around 100k chars

Some of this is alleviated by the fact that people comparing many short texts can likely use the BatchComparator, which caches quite a bit of work and only shows around a 5% overhead even for texts with <64 chars.


didn't look into the implementation yet, since the implementation can limit how the public API could be structured. but just some quick thoughts about the APIs purely based on your description:

this signature alone gives me a sense of over engineering. for a fuzzy finder, personally I would just take two slices as input (for utf codepoints, I would do the iteration internally instead). but without knowing your implementation, I can't say for sure DoubleEndedIterator is best suited in this case.

I see there are multiple Option parameters, which indicates the implementation probably works in different "modes", the return types are not identical in all "modes", and the None return value only makes sense in certain "modes". is my understanding correct?

I'm not familiar with the C++ version of your library, but I would guess this API mimics the C++ implementation, which probably provides those Option arguments with default values (or maybe dispatches through overloaded functions) so the caller doesn't have to specify all the None arguments all the time.

although Option types as function parameters are perfectly valid usage, in many cases it is more idiomatic to provide multiple public APIs for different "modes", especially if there are multiple "mode" selector parameters. then the public entry point functions can pre-process the arguments for the implementation (and post-process the return value when appropriate). rust doesn't support function overloading or default argument values, because the type checker works differently (say, than the C++ type system). people might have subjective opinions on this, but it's a deliberate design decision of the language.

if some algorithm has many configurable knobs, and they have default values (i.e. they are optional), a common trick to reduce the public API surface area is to use the "builder pattern", with a Builder type encapsulating the arguments (providing good default for optional ones) which allows the caller to build the arguments incrementally. builder patterns are similar to partial function applications in functional languages.

the builder pattern can be combined with typestate pattern in case different configuration "modes" have different type signatures. for example, one way to use a builder type (let's call it Comparator) for your distance() function above can be:

let s1 = "kitten";
let s2 = "sitting";
// default configuration with no cutoff returns usize
assert_eq!(Comparator::default().distance(s1, s2), 3);
// with_score_cutoff(value) returns different typestate
assert_eq!(Comparator::default().with_score_cutoff(2).distance(s1, s2), None);
// there are different styles of builder patterns. one alternative syntax could be:
assert_eq!(distance().between(s1, s2), 3);
assert_eq!(distance().with_score_cutoff(2).between(s1, s2), None);
// not using builder pattern, but different functions for different modes
assert_eq!(distance(s1, s2), 3);
assert_eq!(distance_with_cutoff(s1, s2, 2), None);

I'm assuming by Error you mean Result. again, I'm guessing you probably want to provide different entry points for different "modes" or "configurations". reducing multiple functions into a single function with more "knob parameters" doesn't reduce the API complexity, but often increases the chance of misuse for the user.

without looking into your implementation, this description is very confusing. why does the hash need to be in the range i64::MIN - u64::MAX? what kind of hashmap are you using?


The solution to these is either to provide two separate functions (with and without the cutoff, with and without the flag for padding the shorter string) or to make the API generic and use different types for signalling whether a cutoff and/or a padding is desirable, then make the function return type depend on that via an associated type. So signatures would be

  • fn levenshtein<NoCutoff>(...) -> usize
  • fn levenshtein<Cutoff>(...) -> Option<usize>

Also, in all honesty, this "cutoff" business seems to violate single responsibility big time. If you can return a score, why isn't it the caller's responsibility to compare it to whatever cutoff they want? This doesn't feel like it should be part of the API at all.

char requires ch as u32 as u64 to convert to a u64

I'm not sure why that is an issue that requires a custom trait. There is impl From<char> for u64.

It's not obvious from your brief description how these differ or how the Rust solution is inferior; one uses Rust's generics and trait system in the same situations when one would use templates in C++. (Save for some nasty C++-specific template hacks such as CRTP, which I do not think applies to primitives.)

I suspect your C++ version simply doesn't care about UTF-8 validity, and just assumes validity then compares raw bytes directly. In contrast, when you have a Rust string-like type and you ask for an iterator over its chars (=Unicode code points), then the iterator will decode the UTF-8 for you, so comparing strings by iterating through their chars does not just compare bytes. Hence, it's expected to be a lot slower than eg. memcmp().

If all you care about is bytes, you should probably operate on byte slices directly.

Also, more generally, you should probably set optimization flags such as target-cpu=native so that LLVM feels free to codegen in a way that is best for your machine; without this, very portable but usually non-vectorized code will be emitted.


The idea behind taking an IntoIterator with a DoubleEndedIterator bound is that a user can use the API with any type of integer sequence:

let s1 = "kitten";
let s2 = "sitting";
distance(s1.chars(), s2.chars())
distance(s1.bytes(), s2.bytes())

In the specific case of chars vs bytes this gives the user the option to iterate over the bytes for ASCII text, while using chars for unicode texts, which is slower but required in that case. The Python wrapper uses this with the C++ implementation to pass in a range over the underlying memory of Python strings, which is u8, u16 or u32 (Python stores strings differently depending on the highest char value used inside a string)

The cutoff score allows significant performance improvements in a lot of algorithms, since it allows exiting early if a score is higher than the cutoff score. E.g. with the Levenshtein distance it allows the Ukkonen optimization to calculate only parts of the Levenshtein table, or early exits when the length differences are already too big. The hint basically does the same, but keeps doubling if it was set too low. This can be helpful if you have a general idea of the similarity of texts and want the algorithm to be optimized for this expectation (while possibly being slower if the score is higher than the hint).
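The early-exit effect of a cutoff can be sketched with a toy (non-bitparallel) Levenshtein implementation. This is illustrative only, not the rapidfuzz-rs code: the length difference is a lower bound on the distance, so if it already exceeds the cutoff the whole DP table can be skipped:

```rust
// Toy illustration of cutoff-based early exits; not the rapidfuzz-rs code.
fn levenshtein_with_cutoff(s1: &str, s2: &str, cutoff: usize) -> Option<usize> {
    let (a, b): (Vec<char>, Vec<char>) = (s1.chars().collect(), s2.chars().collect());
    // the length difference is a lower bound on the distance
    if a.len().abs_diff(b.len()) > cutoff {
        return None;
    }
    // plain Wagner-Fischer as a fallback
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1; b.len() + 1];
        for (j, cb) in b.iter().enumerate() {
            let sub = prev[j] + usize::from(ca != cb);
            cur[j + 1] = sub.min(prev[j + 1] + 1).min(cur[j] + 1);
        }
        prev = cur;
    }
    let dist = prev[b.len()];
    (dist <= cutoff).then_some(dist)
}

fn main() {
    assert_eq!(levenshtein_with_cutoff("kitten", "sitting", 3), Some(3));
    assert_eq!(levenshtein_with_cutoff("kitten", "sitting", 2), None);
    // early exit: lengths differ by 6 > cutoff 2, no table computed
    assert_eq!(levenshtein_with_cutoff("a", "abcdefg", 2), None);
}
```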

The C++ implementation uses them as default arguments. However it always returns a score: it simply guarantees the returned score to be worse than the cutoff score if the real score is worse, instead of calculating the correct score. The reason for this is that the Python wrapper stores result matrices in numpy arrays and there is not really a fast + memory efficient way to store optionals in there.

Yes, I am definitely not used to this yet. I absolutely wanted to avoid something like distance_with_score_cutoff. The builder pattern sounds like a reasonable solution though. I would certainly need to look into how this is implemented in rust, but it would probably be more user friendly, since:

  • the parameters are named
  • only necessary parameters need to be set
  • the return type can be adjusted as appropriate

Ah yes I absolutely did mean Result

Hm, I must have missed this trait. I think char as u64 did not work though. Could just be me misremembering.

I have some kind of hybrid hashmap that uses a lookup table for any key in the range 0-255, since extended ASCII is by far the most common case. For characters outside of this range I use a simple perturbing hashmap with extremely limited features. Essentially these always have a u64 key and a fixed size of 128 elements (they can never hold more than 64 distinct elements). Now this has the issue that, at least in C++, a user can compare a sequence of i64 with one of u64, so I use two of these perturbing hashmaps along with the lookup table. Essentially I just need to find out if the value is:

  • in the range 0-255 -> lookup table
  • < 0 -> use one of the hashmaps
  • > 255 -> use the other hashmap

Currently this is achieved by adding a trait to all basic types that gives them this sort of identity hash function, which simply tells me whether they are signed/unsigned + either an i64 or u64 value. In C++ this just automatically works since all these types can be compared, but I couldn't figure out a simpler solution in rust.
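The three-way dispatch described above could look roughly like this. All names (Hash, Bucket, dispatch) are hypothetical, not the actual rapidfuzz-rs internals:

```rust
// Sketch of the three-way dispatch; names are hypothetical.
#[derive(Debug, PartialEq)]
enum Bucket {
    AsciiTable(u8),   // keys 0-255: direct lookup table
    SignedMap(i64),   // negative keys: one perturbing hashmap
    UnsignedMap(u64), // keys > 255: the other perturbing hashmap
}

fn dispatch_signed(key: i64) -> Bucket {
    if (0..=255).contains(&key) {
        Bucket::AsciiTable(key as u8)
    } else if key < 0 {
        Bucket::SignedMap(key)
    } else {
        Bucket::UnsignedMap(key as u64)
    }
}

fn dispatch_unsigned(key: u64) -> Bucket {
    if key <= 255 {
        Bucket::AsciiTable(key as u8)
    } else {
        Bucket::UnsignedMap(key)
    }
}

fn main() {
    assert_eq!(dispatch_unsigned(b'a' as u64), Bucket::AsciiTable(97));
    assert_eq!(dispatch_signed(-1), Bucket::SignedMap(-1));
    assert_eq!(dispatch_unsigned(0x1F600), Bucket::UnsignedMap(0x1F600));
}
```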

I wouldn't necessarily say inferior, and yes, the C++ implementation certainly uses CRTP and SFINAE in a few places :shushing_face: The difference is that I would only pass the enum result of the "hash" function into some functions, while in C++ I just have them as templates. With the enum version the compiler might decide that it is not worth inlining the function, and then the branches around the enum would stay in my hot loops. With the template on the other hand it can always deduce e.g. that a lot of branches are unneeded when comparing two sequences of u8. In practice this has largely worked out, but requires me to look at the assembly a lot more. E.g. the missing guaranteed NRVO in rust already hit me hard with a memcpy it introduced when returning a struct from a new function, which made some of the algorithms 40-50% slower.
One other thing that is quite a hack in my rust implementation: I internally sometimes use structs with members that should only exist if certain conditions are met, which I set in if constexpr branches in C++. In rust I currently store them in arrays of size 0 or 1 to make them go away. This seems to work, except for possible layout changes, but is a bit of a pain.
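One way to express that trick is with const generics: an array field of length 0 occupies no space, so the "conditional" member disappears from the layout. A minimal sketch (the State type is hypothetical):

```rust
// Sketch of the [T; 0] / [T; 1] trick: a generic struct whose optional
// field occupies zero bytes when the const parameter is 0.
#[allow(dead_code)]
struct State<const N: usize> {
    always: u64,
    maybe: [u64; N],
}

fn main() {
    // the zero-length array contributes no size
    assert_eq!(std::mem::size_of::<State<0>>(), 8);
    assert_eq!(std::mem::size_of::<State<1>>(), 16);
}
```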

I leave this decision to users. It's just that Python doesn't store strings as utf8, and in C++ most users either have ASCII, store them as utf32, or screw it up ... In theory they could pass in a utf8 iterator as well, I just haven't seen this yet.
Expecting rust users to often pass chars (which is certainly the right default), I simply tried to optimize this case as well as possible and reduce the number of iterations performed on each iterator.

I can't remember ever being happy with the auto vectorization, but then again I probably only notice it when I am unhappy with it. I have a whole set of SIMD implementations for Many x Many text comparisons that I simply didn't port over yet. It certainly helps if it can use instructions like popcount though.


I think when you want to restrict the Item type of an Iterator you don't need a generic parameter for that Item. But maybe you already know, since you did it with IntoIterator::IntoIter.

I.e. you can do

fn foo<It>(a: It)
where
    It: IntoIterator,
    It::Item: Clone,

An alternative to the Builder pattern is an argument struct. I don't know what the pattern is called, but basically you provide all arguments through a struct that implements Default:

foo(Args {
    a: arg1,
    b: arg2,
    ..Args::default()
});

I wrote this part of the API after looking at the API of a different crate in the beginning and never updated it. Good catch.
In a similar sense it should be pretty useless to specify both Iterator and DoubleEndedIterator as bounds. Just DoubleEndedIterator should be sufficient.

In your example you don't define Iterator and DoubleEndedIterator.

I do have the following in some of the internal functions:

Iter1: Iterator<Item = Elem1> + DoubleEndedIterator + Clone,

which really should just become

Iter1: DoubleEndedIterator + Clone,

this needs some clarification. I don't understand what you mean by:

In C++ this just automatically works since all these types can be compared

that it can be compared doesn't mean it has the desired semantics. for instance, signed and unsigned equality behave differently for types smaller than int (char and short) than for types at least int sized (long and long long), because of integer promotion.

depending on the expected behavior, there can be different solutions in rust.

I just looked into the hashmap code. from what I can understand, because you want to compare strings with different character types, you need the super type (or disjunction) of the two character types as key, and the lookup is done by first injecting the character type into the super type. for instance, (u8, u8) => u8; (u8, i8) => i16; (u8, char) => char. the "hash" function you mentioned is this injection? is my understanding correct? if so, I don't think you need such a wide key type, unless you want to compare a sequence of i64s with a sequence of u64s, that is.

I see a comment in the code, are you referring to this one?

I don't think this should generate different code, since the new() function is most likely inlined and the backend should see essentially the same code. can you share more information about how you measured?

well, the good news is, in rust it's as easy (or as hard) as in C++ to use simd primitives manually. unfortunately the portable simd feature still has a long way to go.

on the other hand, rust compiler might favor different code patterns for auto vectorization than C++, so the result might get better as you get more fluent and idiomatic in rust. maybe.

popcount is available when you compile with proper flags. specifically, it's -C target-feature=+popcnt, but typically you don't select individual target features, but instead select specific target cpu model. for instance, popcnt feature is included in x86-64-v2 and beyond. unfortunately, rustc's default is super conservative. you can check which features are enabled for a specific target cpu:

Y:\>rem list supported target cpus with `rustc --print target-cpus`
Y:\>rustc --print cfg -C target-cpu=x86-64-v2

as suggested already, you can set the target cpu to native to get the most of the backend optimizer, but personally, I like to at least use x86-64-v2 even for generic release builds.


Honestly I am pretty sure there are at least some places where this might be an issue. I will certainly have to recheck whether my C++ implementation handles this appropriately everywhere.

I essentially want to be able to compare any two sequences of integer types. Albeit this would currently require the user to implement PartialEq between these types. So e.g. a sequence of u64 with a sequence of i64, or a sequence of char with a sequence of u8 ...
In C++ I do something like this:

void insert_mask(size_t block, CharT key, uint64_t mask) noexcept
{
    assert(block < size());
    if (key < 0) {
        if (!m_map_signed) m_map_signed = new BitvectorHashmap[m_block_count];
        m_map_signed[block][static_cast<uint64_t>(key)] |= mask;
    } else if (key >= 0 && key <= 255) {
        m_extendedAscii[static_cast<uint8_t>(key)][block] |= mask;
    } else {
        if (!m_map_unsigned) m_map_unsigned = new BitvectorHashmap[m_block_count];
        m_map_unsigned[block][static_cast<uint64_t>(key)] |= mask;
    }
}

void insert_mask(size_t block, char key, uint64_t mask) noexcept
{
    // this is a hack to make access faster for the common char string,
    // since C++ doesn't define whether char is signed or unsigned ...
    insert_mask(block, static_cast<uint8_t>(key), mask);
}

I couldn't really find a way to make these kinds of comparisons in rust, since these types do not automatically cast. So my idea was to extend them all to either i64 or u64 in this "hash" function and hope the compiler will inline this stuff and realize, depending on the user provided iterator, which parts of the handling are actually required. If you have any better idea how to achieve this I am very much open to it.

To a certain extent, yes. Back when I had this issue I inlined the new functions of this style everywhere before looking deeper into it. I have since figured out that it's only a problem for the PatternMatchVector, which zeroes the array + inserts the string as bitflags in the new function. If I only zero the array in a default function and then perform insertions separately, it's no issue.

I simply looked at the assembler output to see how close I am to the assembler output of the C++ version. You can see the assembler here: Compiler Explorer. The second version produces an unnecessary memcpy of the struct.

In C++ I have my own abstraction over simd vectors of different sizes as well (currently only sse2 and avx2). In C++ portable SIMD in the standard lib is not there yet either and I didn't really need a lot of complex features, so I didn't want to add a huge external dependency for it.

Is it more conservative than C++ in this regard? I would have expected the default guarantee to be similar to the one in C++ that the binary by default is compiled to run on any targets supporting x64. For Rust / C++ (beyond benchmarking code) this is really an issue of the user. For Python where I ship binaries on PyPI I compile for x64 + AVX2 and then perform tag dispatching at runtime to select the appropriate implementation depending on the used CPU.

I did think about this a bit more.

  1. The first version wouldn't work, since these extra settings have different semantics + types depending on whether a distance, similarity or normalized distance/similarity is calculated (e.g. whether the cutoff score is between 0.0 and 1.0 or between 0 and usize::MAX). So they have to be applied after choosing the function type. This would mean I would have to go with the second version if I go with the builder pattern.
  2. The third version would work for the cutoff and so would somewhat improve the return types. However for hamming this would already be a mess, since you would end up with something like distance_with_padding_with_cutoff.
  3. The second version would work, but I am not sure what would be the best way to make it work with the batched implementations. For them I need to make sure the cached component doesn't get copied, while still allowing the user to call e.g. both distance and normalized_distance. I guess this would either pass around references to the data in the initial struct, or pass around Arc<data>.

I created a quick patch rewriting one of the metrics to the builder + type state pattern in experiment with builder pattern by maxbachmann · Pull Request #3 · maxbachmann/rapidfuzz-rs · GitHub. The API with this is:

// calculate distance and returns just a distance
osa::distance().compare(s1.chars(), s2.chars())

// calculate distance and returns None if distance > cutoff else Some(distance)
osa::distance().with_score_cutoff(5).compare(s1.chars(), s2.chars())

// cached works similar
let scorer = osa::BatchComparator::new(s1.chars());
// calculate distance and returns just a distance
scorer.distance().compare(s2.chars());
// calculate distance and returns None if distance > cutoff else Some(distance)
scorer.distance().with_score_cutoff(5).compare(s2.chars());
From my perspective:

+ return type can be adjusted, which is a lot more user friendly
+ optional parameters are set by name, which is less error prone (e.g. the last two parameters score_cutoff and score_hint both use the same type and could easily be swapped by mistake)
+ optional parameters do not need to be specified when the default is desired
- at least to me it looks like a much more complex API. This could just be because I am coming from languages where the builder pattern is used relatively rarely.
+/- somewhat more code, but nothing complex since it's basically all boilerplate.


The typestate and builder patterns are generally a pain to implement in every language. I can only second the recommendation that you should instead use a simple config/argument type with a sensible Default impl if possible, marking optional arguments as Option<T>. You will need a lot less boilerplate (it's basically just the list of default values in the constructor), and it will make the API simpler (AFAICT there's no need for typestate here, as the particular order in which you call the builder methods doesn't seem to be important).

Yes, the order doesn't matter. Basically the only advantage over the struct approach is that I can change the return type depending on which arguments the user sets. So e.g. something like the following works:

hamming::distance().enable_padding().compare(...) -> returns usize
hamming::distance().enable_padding().with_score_cutoff(5).compare(...) -> returns Option<usize>
hamming::distance().compare(...) -> returns Result<usize>
hamming::distance().with_score_cutoff(5).compare(...) -> returns Result<Option<usize>>

which would save the user from unwrapping Options / Results that are guaranteed to be Some/Ok.
Now what I am not really convinced about is whether that is really better from a user perspective. I do not particularly care about the pain of implementing it in the library. It's largely copy and paste boilerplate that wouldn't be an issue if it makes the interface better for users.

In regards to the struct implementation I assume it would be something like the following, and you would just live with the fact that unwrapping is sometimes necessary:

#[derive(Default)]
pub struct Args<T> {
    score_cutoff: Option<T>,
    score_hint: Option<T>,
}

pub fn distance<Iter1, Iter2>(
    s1: Iter1,
    s2: Iter2,
    args: Args<usize>,
) -> Option<usize>
where
    Iter1: IntoIterator,
    Iter1::IntoIter: DoubleEndedIterator + Clone,
    Iter2: IntoIterator,
    Iter2::IntoIter: DoubleEndedIterator + Clone,
    Iter1::Item: PartialEq<Iter2::Item> + Copy,
    Iter2::Item: PartialEq<Iter1::Item> + Copy;

fn main() {
    let s1 = "test";
    let dist = distance(s1.chars(), s1.chars(), Args {
        score_hint: Some(1),
        ..Args::default()
    }).expect("no score_cutoff used");
}

No, you could still make two (or more) separate argument types, or even make the argument struct generic over the fields affecting the return type.


That sounds interesting. The generic version would probably make sense insofar as otherwise I would need four versions of the struct (with/without score_cutoff and with/without padding). In C++20 I could write this as something like the following:

template<typename ScoreCutoff = std::nullopt_t>
struct Args {
    ScoreCutoff score_cutoff = std::nullopt;
};

template<typename ScoreCutoff>
requires (std::is_same_v<ScoreCutoff, std::nullopt_t> || std::integral<ScoreCutoff>)
auto distance(Args<ScoreCutoff> args)
{
    std::size_t score = 1;
    if constexpr (std::is_same_v<ScoreCutoff, std::nullopt_t>) {
        return score;
    } else {
        return std::optional<std::size_t>(score);
    }
}

int main()
{
    std::cout << distance(Args{}) << std::endl;
    std::cout << *distance(Args{.score_cutoff = 1}) << std::endl;
}

However I have absolutely no idea how this can be done in Rust :sweat_smile:

To make one type depend on another, you need to use an associated type on a trait. Playground.
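A minimal sketch of that technique, with hypothetical names: the type of the cutoff argument selects the return type through an associated type, so no cutoff yields a bare usize while a numeric cutoff yields an Option<usize>:

```rust
// Sketch: return type depends on an argument type via an associated type.
trait Cutoff {
    type Output;
    fn wrap(self, score: usize) -> Self::Output;
}

struct NoCutoff;

impl Cutoff for NoCutoff {
    type Output = usize;
    fn wrap(self, score: usize) -> usize {
        score
    }
}

impl Cutoff for usize {
    type Output = Option<usize>;
    fn wrap(self, score: usize) -> Option<usize> {
        (score <= self).then_some(score)
    }
}

// dummy distance: returns a fixed score, wrapped per the cutoff type
fn distance<C: Cutoff>(cutoff: C) -> C::Output {
    let score = 3; // stand-in for the real computation
    cutoff.wrap(score)
}

fn main() {
    let d: usize = distance(NoCutoff); // plain usize, no unwrap needed
    assert_eq!(d, 3);
    assert_eq!(distance(2usize), None);
    assert_eq!(distance(5usize), Some(3));
}
```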


haha, if you think about it, the Args struct and the Builder are essentially the same thing. you can make the function take the Builder as an argument, or you can give the Args struct builder-style methods. you can support both at the same time.

    // specify arguments using struct literal syntax
    let dist = distance(s1.chars(), s1.chars(), Args {
        score_hint: Some(1),
        ..Args::default()
    }).expect("no score_cutoff used");
    // using chained builder methods:
    let dist = distance(s1.chars(), s1.chars(),
        Args::default().score_hint(1),
    ).expect("no score_cutoff used");
    // or you can mix and match
    let dist = distance(s1.chars(), s1.chars(),
        Args {
            score_hint: Some(1),
            ..Args::default()
        }.score_cutoff(5),
    );
    // or perhaps the difference is only in the name?
    let dist = Args {
        score_hint: Some(1),
        ..Args::default()
    }
    .distance(s1.chars(), s2.chars());

Thanks. I suppose for hamming, where the return type can be anything between usize and Result<Option<usize>>, you would just chain these output types like I did here: Rust Playground