Fast fuzzy string matching

I am maintaining libraries for fast fuzzy string matching in Python and C++. This includes implementations of well known algorithms like the Levenshtein distance. In most cases these are bitparallel algorithms that allow matching multiple characters in parallel. Over the last couple of months I had a couple of people reach out to me asking for a rust version of the library. Since I have heard a lot about the benefits of rust over the last years, I used this as an excuse to finally give it a try :slight_smile:

Since I haven't touched the language before, some kind of review would certainly be helpful. I am especially interested in review of the public API to reduce breaking changes down the line, but am interested in any suggestions for improvements.

The code can be found here: GitHub - maxbachmann/rapidfuzz-rs: Rapid fuzzy string matching in Rust using various string metrics. I currently have a couple of clippy warnings around wraparounds / precision in casting disabled, since I do not see a particularly good way of solving them. Essentially I sometimes need to convert usize to isize/f64 and back. This would only be an issue when someone tries to match sequences with a length of around min(52 bits, usize bits - 1), which in practice shouldn't be a problem.

API description

I try to have a very similar API for most of the algorithms which looks like this:

pub fn distance<Iter1, Iter2, Elem1, Elem2, ScoreCutoff, ScoreHint>(
    s1: Iter1,
    s2: Iter2,
    score_cutoff: ScoreCutoff,
    score_hint: ScoreHint,
) -> Option<usize>
where
    Iter1: IntoIterator<Item = Elem1>,
    Iter1::IntoIter: DoubleEndedIterator + Clone,
    Iter2: IntoIterator<Item = Elem2>,
    Iter2::IntoIter: DoubleEndedIterator + Clone,
    Elem1: PartialEq<Elem2> + HashableChar + Copy,
    Elem2: PartialEq<Elem1> + HashableChar + Copy,
    ScoreCutoff: Into<Option<usize>>,
    ScoreHint: Into<Option<usize>>,
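As a sketch of how callers interact with this signature, here is a self-contained mock with the same score_cutoff/score_hint shape. The metric here is a stand-in (positional mismatches plus length difference), not the crate's Levenshtein implementation; the point is how `Into<Option<usize>>` lets callers pass either a bare integer or None:

```rust
// Mock of the described API shape; the metric is a stand-in, not rapidfuzz-rs.
fn distance<ScoreCutoff, ScoreHint>(
    s1: &str,
    s2: &str,
    score_cutoff: ScoreCutoff,
    score_hint: ScoreHint,
) -> Option<usize>
where
    ScoreCutoff: Into<Option<usize>>,
    ScoreHint: Into<Option<usize>>,
{
    let _hint = score_hint.into();
    // stand-in metric: positional mismatches plus length difference
    let dist = s1.chars().zip(s2.chars()).filter(|(a, b)| a != b).count()
        + s1.chars().count().abs_diff(s2.chars().count());
    match score_cutoff.into() {
        Some(cutoff) if dist > cutoff => None,
        _ => Some(dist),
    }
}

fn main() {
    // a bare integer and None both work thanks to Into<Option<usize>>
    assert_eq!(distance("kitten", "sitten", None, None), Some(1));
    assert_eq!(distance("kitten", "sitten", 0, None), None);
}
```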

There are always multiple versions of this API to calculate a distance, a similarity or normalized versions of the two. In addition there is a struct to allow performing One x Many comparisons:

pub struct BatchComparator<Elem1> {...}

impl<Elem1> BatchComparator<Elem1>
where
    Elem1: HashableChar + Clone,
{
    pub fn new<Iter1>(s1: Iter1) -> Self
    where
        Iter1: IntoIterator<Item = Elem1>,
        Iter1::IntoIter: Clone;

    pub fn distance<Iter2, Elem2, ScoreCutoff, ScoreHint>(
        &self,
        s2: Iter2,
        score_cutoff: ScoreCutoff,
        score_hint: ScoreHint,
    ) -> Option<usize>
    where
        Iter2: IntoIterator<Item = Elem2>,
        Iter2::IntoIter: DoubleEndedIterator + Clone,
        Elem1: PartialEq<Elem2> + HashableChar + Copy,
        Elem2: PartialEq<Elem1> + HashableChar + Copy,
        ScoreCutoff: Into<Option<usize>>,
        ScoreHint: Into<Option<usize>>;
}


Some functions take a couple of extra arguments for algorithm dependent settings, like weights for the Levenshtein distance. The only algorithm that falls a bit out of line here is the Hamming distance, which currently returns an Error<Option<usize>>, since it fails when sequences of different length are passed.
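To illustrate the padding behavior discussed below, here is a toy Hamming distance. The signature and error type are hypothetical, not the crate's API; it only shows why different lengths are an error unless padding is enabled:

```rust
// Toy Hamming distance (illustrative, not the rapidfuzz-rs API).
fn hamming(s1: &str, s2: &str, pad: bool) -> Result<usize, &'static str> {
    let (a, b): (Vec<char>, Vec<char>) = (s1.chars().collect(), s2.chars().collect());
    if !pad && a.len() != b.len() {
        return Err("sequences have different lengths");
    }
    let mismatches = a.iter().zip(&b).filter(|(x, y)| x != y).count();
    // with padding, every position past the shorter string counts as a mismatch
    Ok(mismatches + a.len().abs_diff(b.len()))
}

fn main() {
    assert_eq!(hamming("karolin", "kathrin", false), Ok(3));
    assert_eq!(hamming("abc", "ab", false), Err("sequences have different lengths"));
    assert_eq!(hamming("abc", "ab", true), Ok(1));
}
```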

Pain points in the API

There are a couple of things I am not super happy with in this API:

  • the function returns an Option, since it returns None when the result is worse than the cutoff value passed into it. This makes sense when a cutoff value was provided. However when the cutoff isn't used, this just adds an extra unwrap/expect
  • hamming returning an Error<Option<usize>> is even uglier to me, since the user might tell the function to pad the shorter string, in which case strings of different length are not an error and they would have to call unwrap/expect on the Error as well. One option would be to join both the Option and the Error into just an Error, but this would still make it behave differently than everything else. Maybe still better than the current version though.
  • I am not really super happy with the combination of PartialEq and HashableChar. The algorithms have two requirements for any input:
    1. it has to be comparable

    2. I need to perform my own custom hash function on it to convert it to a value in the range i64::MIN - u64::MAX. This hash is basically just an identity function and solves two issues I had in rust:

      • char requires ch as u32 as u64 to convert to a u64
      • signed and unsigned types need special handling, since I can't just cast i64 to u64 without getting collisions in the internal hashmaps (In C++ this is handled using templated function overloads)
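A minimal sketch of such an identity-hash trait could look like this. The names (Hash, HashableChar, hash_char) are illustrative, not necessarily the actual rapidfuzz-rs definitions:

```rust
// Hypothetical sketch of an identity-hash trait; names are illustrative.
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
enum Hash {
    Unsigned(u64),
    Signed(i64),
}

trait HashableChar {
    fn hash_char(&self) -> Hash;
}

impl HashableChar for char {
    fn hash_char(&self) -> Hash {
        // char needs the double cast `as u32 as u64` mentioned above
        Hash::Unsigned(*self as u32 as u64)
    }
}

impl HashableChar for u8 {
    fn hash_char(&self) -> Hash {
        Hash::Unsigned(*self as u64)
    }
}

impl HashableChar for i64 {
    fn hash_char(&self) -> Hash {
        // signed types keep their own variant so i64/u64 keys can't collide
        Hash::Signed(*self)
    }
}

fn main() {
    assert_eq!('A'.hash_char(), Hash::Unsigned(65));
    assert_eq!(65u8.hash_char(), Hash::Unsigned(65));
    assert_eq!((-1i64).hash_char(), Hash::Signed(-1));
}
```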

However, having them stored in an enum, I have to be super careful how they are used so that the optimizer still largely optimizes them out again. E.g. I did a quick patch at some point to get rid of the PartialEq requirement, which would allow comparing things like Vec<u32> with &str, since I already define equality via the identity hash anyway. However this was significantly slower (I didn't have time to look into the reason yet and might just have butchered something). In the C++ implementation these are simply templates all the way through and this has never been an issue there. Not sure whether I am missing an obvious simpler way to achieve this though.

I think most users will never need to compare sequences for which rust does not provide PartialEq, but one of my personal uses for this is the Python wrapper of the C++ implementation. There I directly compare, for example, internal memory views of Python strings, which can be u8, u16 or u32 (even though in this specific case I would be the end user and could just live with implementing PartialEq for all these types)


I get largely the same performance as in the C++ implementation using LLVM when comparing byte strings. UTF-8 strings, which I expect from most users in rust, are a lot slower to iterate though. I made a couple of obvious improvements to reduce the number of iterations, since the C++ implementation was basically only used with random access iterators. Even with these improvements they still cause quite a large slowdown for short texts. For long texts the algorithms have to perform a lot more work anyway, so it doesn't really matter. E.g. for the Levenshtein distance:

  • around 1.5-2x the runtime of Chars vs Vec<char> for texts with <64 chars
  • around 1.05x the runtime of Chars vs Vec<char> for texts with around 100k chars

Some of this is alleviated by the fact that people comparing many short texts can likely use the BatchComparator, which caches quite a bit of work and only shows around a 5% overhead even for texts with <64 chars.


didn't look into the implementation yet, since the implementation can limit how the public API could be structured. but just some quick thoughts about the APIs purely based on your description:

this signature alone gives me a sense of over engineering. for a fuzzy finder, personally I would just take two slices as input (for utf codepoints, I would do the iteration internally instead). but without knowing your implementation, I can't say for sure DoubleEndedIterator is best suited in this case.

I see there are multiple Option parameters, which indicates the implementation probably works in different "modes", the return types are not identical in all "modes", and the None return value only makes sense in certain "modes". is my understanding correct?

I'm not familiar with the C++ version of your library, but I would guess this API mimics the C++ implementation, which probably provides those Option arguments with default values (or maybe dispatches through overloaded functions) so the caller doesn't have to specify all the None arguments all the time.

although Option types as function parameters are perfectly valid usage, in many cases it is more idiomatic to provide multiple public APIs for different "modes", especially if there are multiple "mode" selector parameters. then the public entry point functions can pre-process the arguments for the implementation (and post-process the return value when appropriate). rust doesn't support function overloading or default argument values, because the type checker works differently (say, than the C++ type system). people might have subjective opinions on this, but it's a deliberate design decision of the language.

if some algorithm has many configurable knobs, and they have default values (i.e. they are optional), a common trick to reduce the public API surface area is to use the "builder pattern", with a Builder type encapsulating the arguments (providing good default for optional ones) which allows the caller to build the arguments incrementally. builder patterns are similar to partial function applications in functional languages.

the builder pattern can be combined with typestate pattern in case different configuration "modes" have different type signatures. for example, one way to use a builder type (let's call it Comparator) for your distance() function above can be:

let s1 = "kitten";
let s2 = "sitting";
// default configuration with no cutoff returns usize
assert_eq!(Comparator::default().distance(s1, s2), 3);
// with_score_cutoff(value) returns different typestate
assert_eq!(Comparator::default().with_score_cutoff(2).distance(s1, s2), None);
// there are different styles of builder patterns. one alternative syntax could be:
assert_eq!(distance().between(s1, s2), 3);
assert_eq!(distance().with_score_cutoff(2).between(s1, s2), None);
// not using builder pattern, but different functions for different modes
assert_eq!(distance(s1, s2), 3);
assert_eq!(distance_with_cutoff(s1, s2, 2), None);

I'm assuming by Error you mean Result. again, I'm guessing you probably want to provide different entry points for different "modes" or "configurations". reducing multiple functions into a single function with more "knob parameters" doesn't reduce the API complexity, but often increases the chance of misuse for the user.

without looking into your implementation, this description is very confusing. why does the hash need to be in the range i64::MIN - u64::MAX? what kind of hashmap are you using?


The solution to these is either to provide two separate functions (with and without the cutoff, with and without the flag for padding the shorter string) or to make the API generic and use different types for signalling whether a cutoff and/or a padding is desirable, then make the function return type depend on that via an associated type. So signatures would be

  • fn levenshtein<NoCutoff>(...) -> usize
  • fn levenshtein<Cutoff>(...) -> Option<usize>

Also, in all honesty, this "cutoff" business seems to violate single responsibility big time. If you can return a score, why isn't it the caller's responsibility to compare it to whatever cutoff they want? This doesn't feel like it should be part of the API at all.

char requires ch as u32 as u64 to convert to a u64

I'm not sure why that is an issue that requires a custom trait. There is impl From<char> for u64.

It's not obvious from your brief description how these differ or how the Rust solution is inferior; one uses Rust's generics and trait system in the same situations when one would use templates in C++. (Save for some nasty C++-specific template hacks such as CRTP, which I do not think applies to primitives.)

I suspect your C++ version simply doesn't care about UTF-8 validity, and just assumes validity then compares raw bytes directly. In contrast, when you have a Rust string-like type and you ask for an iterator over its chars (=Unicode code points), then the iterator will decode the UTF-8 for you, so comparing strings by iterating through their chars does not just compare bytes. Hence, it's expected to be a lot slower than eg. memcmp().

If all you care about is bytes, you should probably operate on byte slices directly.

Also, more generally, you should probably set optimization flags such as target-cpu=native so that LLVM feels free to codegen in a way that is best for your machine; without this, very portable but usually non-vectorized code will be emitted.


The idea behind taking an IntoIterator with a DoubleEndedIterator bound is that a user can use the API with any type of integer sequence:

let s1 = "kitten";
let s2 = "sitting";
distance(s1.chars(), s2.chars())
distance(s1.bytes(), s2.bytes())

In the specific case of chars vs bytes this gives the user the option to iterate over the bytes for ASCII text, while using chars for unicode texts, which is slower but required in that case. The Python wrapper uses this with the C++ implementation to pass in a range over the underlying memory of Python strings, which is u8, u16 or u32 (Python stores strings differently depending on the highest char value used inside a string)

The cutoff score allows significant performance improvements in a lot of algorithms, since it allows exiting early if a score is higher than the cutoff score. E.g. with the Levenshtein distance it allows the Ukkonen optimization to calculate only parts of the Levenshtein table, or early exits when the length differences are already too big. The hint basically does the same, but keeps doubling if it was set too low. This can be helpful if you have a general idea of the similarity of texts and want the algorithm to be optimized for this expectation (while possibly being slower if the score is higher than the hint).
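The early-exit effect of a cutoff can be sketched with a toy (non-bitparallel) Levenshtein implementation. This is illustrative only, not the rapidfuzz-rs code: the length difference is a lower bound on the distance, so if it already exceeds the cutoff the whole DP table can be skipped:

```rust
// Toy illustration of cutoff-based early exits; not the rapidfuzz-rs code.
fn levenshtein_with_cutoff(s1: &str, s2: &str, cutoff: usize) -> Option<usize> {
    let (a, b): (Vec<char>, Vec<char>) = (s1.chars().collect(), s2.chars().collect());
    // the length difference is a lower bound on the distance
    if a.len().abs_diff(b.len()) > cutoff {
        return None;
    }
    // plain Wagner-Fischer as a fallback
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1; b.len() + 1];
        for (j, cb) in b.iter().enumerate() {
            let sub = prev[j] + usize::from(ca != cb);
            cur[j + 1] = sub.min(prev[j + 1] + 1).min(cur[j] + 1);
        }
        prev = cur;
    }
    let dist = prev[b.len()];
    (dist <= cutoff).then_some(dist)
}

fn main() {
    assert_eq!(levenshtein_with_cutoff("kitten", "sitting", 3), Some(3));
    assert_eq!(levenshtein_with_cutoff("kitten", "sitting", 2), None);
    // early exit: lengths differ by 6 > cutoff 2, no table computed
    assert_eq!(levenshtein_with_cutoff("a", "abcdefg", 2), None);
}
```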

The C++ implementation uses them as default arguments. However it always returns a score: it simply guarantees the returned score to be worse than the cutoff score if the real score is worse, instead of calculating the correct score. The reason for this is that the Python wrapper stores result matrices in numpy arrays and there is not really a fast + memory efficient way to store optionals in there.

Yes, I am definitely not used to this yet. I absolutely wanted to avoid something like distance_with_score_cutoff. The builder pattern sounds like a reasonable solution though. I would certainly need to look into how this is implemented in rust, but it would probably be more user friendly, since:

  • the parameters are named
  • only necessary parameters need to be set
  • the return type can be adjusted as appropriate

Ah yes I absolutely did mean Result

Hm, I must have missed this trait. I think char as u64 did not work though. Could just be me misremembering.

I have some kind of hybrid hashmap that uses a lookup table for any key in the range 0-255, since extended ASCII is by far the most common case. For characters outside of this range I use a simple perturbing hashmap with extremely limited features. Essentially these always have a u64 key and a fixed size of 128 elements (they can never hold more than 64 distinct elements). Now this has the issue that, at least in C++, a user can compare a sequence of i64 with one of u64, so I use two of these perturbing hashmaps along with the lookup table. Essentially I just need to find out if the value is:

  • in the range 0-255 -> lookup table
  • < 0 -> use one of the hashmaps
  • > 255 -> use the other hashmap

Currently this is achieved by adding a trait to all basic types that gives them this sort of identity hash function, which simply tells me whether they are signed/unsigned + either an i64 or u64 value. In C++ this just automatically works since all these types can be compared, but I couldn't figure out a simpler solution in rust.
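The three-way dispatch described above could look roughly like this. All names (Hash, Bucket, dispatch) are hypothetical, not the actual rapidfuzz-rs internals:

```rust
// Sketch of the three-way dispatch; names are hypothetical.
#[derive(Debug, PartialEq)]
enum Bucket {
    AsciiTable(u8),   // keys 0-255: direct lookup table
    SignedMap(i64),   // negative keys: one perturbing hashmap
    UnsignedMap(u64), // keys > 255: the other perturbing hashmap
}

fn dispatch_signed(key: i64) -> Bucket {
    if (0..=255).contains(&key) {
        Bucket::AsciiTable(key as u8)
    } else if key < 0 {
        Bucket::SignedMap(key)
    } else {
        Bucket::UnsignedMap(key as u64)
    }
}

fn dispatch_unsigned(key: u64) -> Bucket {
    if key <= 255 {
        Bucket::AsciiTable(key as u8)
    } else {
        Bucket::UnsignedMap(key)
    }
}

fn main() {
    assert_eq!(dispatch_unsigned(b'a' as u64), Bucket::AsciiTable(97));
    assert_eq!(dispatch_signed(-1), Bucket::SignedMap(-1));
    assert_eq!(dispatch_unsigned(0x1F600), Bucket::UnsignedMap(0x1F600));
}
```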

I wouldn't necessarily say inferior, and yes, the C++ implementation certainly uses CRTP and SFINAE in a few places :shushing_face: The difference is that I would only pass the enum result of the "hash" function into some functions, while in C++ I just have them as templates. With the enum version the compiler might decide that it is not worth inlining the function, and then the branches around the enum would stay in my hot loops. With the template on the other hand it can always deduce e.g. that a lot of branches are unneeded when comparing two sequences of u8. In practice this has largely worked out, but requires me to look at the assembly a lot more. E.g. the missing guaranteed NRVO in rust already hit me hard with a memcpy it introduced when returning a struct from a new function, which made some of the algorithms 40-50% slower.
One other thing that is quite a hack in my rust implementation: I internally sometimes use structs with members that should only exist if certain conditions are met, which I set in if constexpr branches in C++. In rust I currently store them in arrays of size 0 or 1 to make them go away. This seems to work, except for possible layout changes, but is a bit of a pain.
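One way to express that trick is with const generics: an array field of length 0 occupies no space, so the "conditional" member disappears from the layout. A minimal sketch (the State type is hypothetical):

```rust
// Sketch of the [T; 0] / [T; 1] trick: a generic struct whose optional
// field occupies zero bytes when the const parameter is 0.
#[allow(dead_code)]
struct State<const N: usize> {
    always: u64,
    maybe: [u64; N],
}

fn main() {
    // the zero-length array contributes no size
    assert_eq!(std::mem::size_of::<State<0>>(), 8);
    assert_eq!(std::mem::size_of::<State<1>>(), 16);
}
```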

I leave this decision to users. It's just that Python doesn't store strings as utf8, and in C++ most users either have ASCII, store them as utf32, or screw it up ... In theory they could pass in a utf8 iterator as well, I just haven't seen this yet.
Expecting rust users to often pass chars (which is certainly the right default), I simply tried to optimize this case as well as possible and reduce the number of iterations performed on each iterator.

I can't remember ever being happy with the auto vectorization, but then again I probably only notice it when I am unhappy with it. I have a whole set of SIMD implementations for Many x Many text comparisons that I simply didn't port over yet. It certainly helps if it can use instructions like popcount though.


I think when you want to restrict the Item type of an Iterator you don't need a generic parameter for that Item. But maybe you already know, since you did it with IntoIterator::IntoIter.

I.e. you can do

fn foo<It>(a: It)
where
    It: IntoIterator,
    It::Item: Clone,

An alternative to the Builder pattern is an argument struct. I don't know what the pattern is called, but basically you provide all arguments through a struct that implements Default:

foo(Args {
    a: arg1,
    b: arg2,
    ..Args::default()
});

I wrote this part of the API after looking at the API of a different crate in the beginning and never updated it. Good catch.
In a similar sense it should be pretty useless to specify both Iterator and DoubleEndedIterator as bounds. Just DoubleEndedIterator should be sufficient.

In your example you don't define Iterator and DoubleEndedIterator.

I do have the following in some of the internal functions:

Iter1: Iterator<Item = Elem1> + DoubleEndedIterator + Clone,

which really should just become

Iter1: DoubleEndedIterator + Clone,

this needs some clarification. I don't understand what you mean by:

In C++ this just automatically works since all these types can be compared

that it can be compared doesn't mean it has the desired semantics. for instance, signed and unsigned equality behave differently for types smaller than int (char and short) than for types at least int sized (long and long long), because of integer promotion.

depending on the expected behavior, there can be different solutions in rust.

I just looked into the hashmap code. from what I can understand, because you want to compare strings with different character types, you need the super type (or disjunction) of the two character types as key, and the lookup is done by first injecting the character type into the super type. for instance, (u8, u8) => u8; (u8, i8) => i16; (u8, char) => char. the "hash" function you mentioned is this injection? is my understanding correct? if so, I don't think you need such a wide key type, unless you want to compare a sequence of i64s with a sequence of u64s, that is.

I see a comment in the code, are you referring to this one?

I don't think this should generate different code, since the new() function is most likely inlined and the backend should see essentially the same code. can you share more information about how you measured?

well, the good news is, in rust it's as easy (or as hard) as in C++ to use simd primitives manually. unfortunately the portable simd feature still has a long way to go.

on the other hand, rust compiler might favor different code patterns for auto vectorization than C++, so the result might get better as you get more fluent and idiomatic in rust. maybe.

popcount is available when you compile with proper flags. specifically, it's -C target-feature=+popcnt, but typically you don't select individual target features, but instead select specific target cpu model. for instance, popcnt feature is included in x86-64-v2 and beyond. unfortunately, rustc's default is super conservative. you can check which features are enabled for a specific target cpu:

Y:\>rem list supported target cpus with `rustc --print target-cpus`
Y:\>rustc --print cfg -C target-cpu=x86-64-v2

as suggested already, you can set the target cpu to native to get the most of the backend optimizer, but personally, I like to at least use x86-64-v2 even for generic release builds.


Honestly I am pretty sure there are at least some places where this might be an issue. I will certainly have to recheck whether my C++ implementation handles this appropriately everywhere.

I essentially want to be able to compare any two sequences of integer types. Albeit this would currently require the user to implement PartialEq between these types. So e.g. a sequence of u64 with a sequence of i64, or a sequence of char with a sequence of u8 ...
In C++ I do something like this:

void insert_mask(size_t block, CharT key, uint64_t mask) noexcept
{
    assert(block < size());
    if (key < 0) {
        if (!m_map_signed) m_map_signed = new BitvectorHashmap[m_block_count];
        m_map_signed[block][static_cast<uint64_t>(key)] |= mask;
    } else if (key >= 0 && key <= 255) {
        m_extendedAscii[static_cast<uint8_t>(key)][block] |= mask;
    } else {
        if (!m_map_unsigned) m_map_unsigned = new BitvectorHashmap[m_block_count];
        m_map_unsigned[block][static_cast<uint64_t>(key)] |= mask;
    }
}

void insert_mask(size_t block, char key, uint64_t mask) noexcept
{
    // this is a hack to make access faster for the common char string,
    // since C++ doesn't define whether char is signed or unsigned ...
    insert_mask(block, static_cast<uint8_t>(key), mask);
}

I couldn't really find a way to make these kinds of comparisons in rust, since these types do not automatically cast. So my idea was to extend them all to either i64 or u64 in this "hash" function and hope the compiler will inline this stuff and realize, depending on the user provided iterator, which parts of the handling are actually required. If you have any better idea how to achieve this I am very much open to it.

To a certain extent, yes. Back when I had this issue I inlined the new functions of this style everywhere before looking deeper into it. I have since figured out that it's only a problem for the PatternMatchVector, which zeroes the array + inserts the string as bitflags in the new function. If I only zero the array in a default function and then perform insertions separately, it's no issue.

I simply looked at the assembler output to see how close I am to the assembler output of the C++ version. You can see the assembler here: Compiler Explorer. The second version produces an unnecessary memcpy of the struct.

In C++ I have my own abstraction over simd vectors of different sizes as well (currently only sse2 and avx2). In C++ portable SIMD in the standard lib is not there yet either and I didn't really need a lot of complex features, so I didn't want to add a huge external dependency for it.

Is it more conservative than C++ in this regard? I would have expected the default guarantee to be similar to the one in C++ that the binary by default is compiled to run on any targets supporting x64. For Rust / C++ (beyond benchmarking code) this is really an issue of the user. For Python where I ship binaries on PyPI I compile for x64 + AVX2 and then perform tag dispatching at runtime to select the appropriate implementation depending on the used CPU.

I did think about this a bit more.

  1. The first version wouldn't work, since these extra settings have different semantics + types depending on whether a distance, similarity or normalized distance/similarity is calculated (e.g. whether the cutoff score is between 0.0 and 1.0 or between 0 and usize::MAX). So they have to be applied after choosing the function type. This would mean I would have to go with the second version if I go with the builder pattern.
  2. The third version would work for the cutoff and so would somewhat improve the return types. However for hamming this would already be a mess, since you would end up with something like distance_with_padding_with_cutoff.
  3. The second version would work, but I am not sure what would be the best way to make it work with the batched implementations. For them I need to make sure the cached component doesn't get copied, while still allowing the user to call e.g. both distance and normalized_distance. I guess this would either pass around references to the data in the initial struct, or pass around Arc<data>.

I created a quick patch rewriting one of the metrics to the builder + type state pattern in experiment with builder pattern by maxbachmann · Pull Request #3 · maxbachmann/rapidfuzz-rs · GitHub. The API with this is:

// calculate distance and returns just a distance
osa::distance().compare(s1.chars(), s2.chars())

// calculate distance and returns None if distance > cutoff else Some(distance)
osa::distance().with_score_cutoff(5).compare(s1.chars(), s2.chars())

// cached works similar
let scorer = osa::BatchComparator::new(s1.chars());
// calculate distance and returns just a distance
scorer.distance().compare(s2.chars());
// calculate distance and returns None if distance > cutoff else Some(distance)
scorer.distance().with_score_cutoff(5).compare(s2.chars());
From my perspective:

+ return type can be adjusted, which is a lot more user friendly
+ optional parameters are set by name, which is less error prone (e.g. the last two parameters score_cutoff and score_hint both use the same type and could easily be swapped by mistake)
+ optional parameters do not need to be specified when the default is desired
- at least to me it looks like a much more complex API. This could just be because I am coming from languages where the builder pattern is used relatively rarely.
+/- somewhat more code, but nothing complex since it's basically all boilerplate.


The typestate and builder patterns are generally a pain to implement in every language. I can only second the recommendation that you should instead use a simple config/argument type with a sensible Default impl if possible, marking optional arguments as Option<T>. You will need a lot less boilerplate (it's basically just the list of default values in the constructor), and it will make the API simpler (AFAICT there's no need for typestate here, as the particular order in which you call the builder methods doesn't seem to be important).

Yes, the order doesn't matter. Basically the only advantage over the struct approach is that I can change the return type depending on which arguments the user sets. So e.g. something like the following works:

hamming::distance().enable_padding().compare(...) -> returns usize
hamming::distance().enable_padding().with_score_cutoff(5).compare(...) -> returns Option<usize>
hamming::distance().compare(...) -> returns Result<usize>
hamming::distance().with_score_cutoff(5).compare(...) -> returns Result<Option<usize>>

which would save the user from unwrapping Options / Results that are guaranteed to be Some/Ok.
Now what I am not really convinced about is whether that is really better from a user perspective. I do not particularly care about the pain of implementing it in the library. It's largely copy and paste boilerplate that wouldn't be an issue if it makes the interface better for users.

In regards to the struct implementation I assume it would be something like the following, and you would just live with the fact that unwrapping is sometimes necessary:

#[derive(Default)]
pub struct Args<T> {
    score_cutoff: Option<T>,
    score_hint: Option<T>,
}

pub fn distance<Iter1, Iter2>(
    s1: Iter1,
    s2: Iter2,
    args: Args<usize>,
) -> Option<usize>
where
    Iter1: IntoIterator,
    Iter1::IntoIter: DoubleEndedIterator + Clone,
    Iter2: IntoIterator,
    Iter2::IntoIter: DoubleEndedIterator + Clone,
    Iter1::Item: PartialEq<Iter2::Item> + Copy,
    Iter2::Item: PartialEq<Iter1::Item> + Copy;

fn main() {
    let s1 = "test";
    let dist = distance(s1.chars(), s1.chars(), Args {
        score_hint: Some(1),
        ..Args::default()
    }).expect("no score_cutoff used");
}

No, you could still make two (or more) separate argument types, or even make the argument struct generic over the fields affecting the return type.


That sounds interesting. The generic version would probably make sense insofar as otherwise I would need four versions of the struct (with/without score_cutoff and with/without padding). In C++20 I could write this as something like the following:

template<typename ScoreCutoff = std::nullopt_t>
struct Args {
    ScoreCutoff score_cutoff = std::nullopt;
};

template<typename ScoreCutoff>
requires (std::is_same_v<ScoreCutoff, std::nullopt_t> || std::integral<ScoreCutoff>)
auto distance(Args<ScoreCutoff> args)
{
    std::size_t score = 1;
    if constexpr (std::is_same_v<ScoreCutoff, std::nullopt_t>) {
        return score;
    } else {
        return std::optional<std::size_t>(score);
    }
}

int main()
{
    std::cout << distance(Args{}) << std::endl;
    std::cout << *distance(Args{.score_cutoff = 1}) << std::endl;
}

However I have absolutely no idea how this can be done in Rust :sweat_smile:

To make one type depend on another, you need to use an associated type on a trait. Playground.
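A minimal sketch of that technique, with hypothetical names: the type of the cutoff argument selects the return type through an associated type, so no cutoff yields a bare usize while a numeric cutoff yields an Option<usize>:

```rust
// Sketch: return type depends on an argument type via an associated type.
trait Cutoff {
    type Output;
    fn wrap(self, score: usize) -> Self::Output;
}

struct NoCutoff;

impl Cutoff for NoCutoff {
    type Output = usize;
    fn wrap(self, score: usize) -> usize {
        score
    }
}

impl Cutoff for usize {
    type Output = Option<usize>;
    fn wrap(self, score: usize) -> Option<usize> {
        (score <= self).then_some(score)
    }
}

// dummy distance: returns a fixed score, wrapped per the cutoff type
fn distance<C: Cutoff>(cutoff: C) -> C::Output {
    let score = 3; // stand-in for the real computation
    cutoff.wrap(score)
}

fn main() {
    let d: usize = distance(NoCutoff); // plain usize, no unwrap needed
    assert_eq!(d, 3);
    assert_eq!(distance(2usize), None);
    assert_eq!(distance(5usize), Some(3));
}
```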


haha, if you think about it, the Args struct and the Builder are essentially the same thing. you can make the function take the Builder as an argument, or you can give the Args struct builder-style methods. you can support both at the same time.

    // specify arguments using struct literal syntax
    let dist = distance(s1.chars(), s1.chars(), Args {
        score_hint: Some(1),
        ..Args::default()
    }).expect("no score_cutoff used");
    // using chained builder methods:
    let dist = distance(s1.chars(), s1.chars(),
        Args::default().score_hint(1),
    ).expect("no score_cutoff used");
    // or you can mix and match
    let dist = distance(s1.chars(), s1.chars(),
        Args {
            score_hint: Some(1),
            ..Args::default()
        }.score_cutoff(5),
    );
    // or perhaps the difference is only in the name?
    let dist = Args {
        score_hint: Some(1),
        ..Args::default()
    }
    .distance(s1.chars(), s2.chars());

Thanks. I suppose for hamming, where the return type can be anything between usize and Result<Option<usize>>, you would just chain these output types like I did here: Rust Playground