Problems implementing jaro

My problem is less about Rust specifically, and more about problems implementing a specific algorithm in Rust.

I'm attempting to implement the jaro and jaro-winkler similarity algorithms in Rust, as a learning exercise. I've attempted to implement based on the wikipedia article and other resources I've found on the web. The jaro-winkler paper was more dense than I could stomach, but I took the test cases from there, to augment my own testing.

The code is available in my jaro-winkler github repo

It turns out my implementation is not passing those tests. It seems clear that I've misunderstood something, but without knowing what the s1, s2, m, t are supposed to be in the given instances...I've just gotten stuck trying to determine where I'm going wrong.

Some example output from my failing test:

[src/jaro.rs:80] s1 = 11.0
[src/jaro.rs:80] s2 = 11.0
[src/jaro.rs:80] m = 11.0  
[src/jaro.rs:80] t = 0.0                      
[src/jaro.rs:81] ((m / s1) + (m / s2) + ((m - t) / m)) / 3.0 = 1.0
[src/jaro.rs:101] &p = 0.1
[src/jaro.rs:102] &l = 4.0 
[src/jaro.rs:103] sim_j + (p * l * (1.0 - sim_j)) = 1.0
[src/jaro.rs:227] s1 = "SHACKLEFORD"
[src/jaro.rs:227] s2 = "SHACKELFORD"
[src/jaro.rs:228] &score_round = 1.0
[src/jaro.rs:228] score = 0.982    

I think what is tripping me up is that a lack of clarity regarding transpositions vs matches within the allowed range, maybe. In this example, I've clearly not caught the E/L transposition somehow.

Anyway, this isn't anything particularly important, I've just gotten really frustrated that what seemed like it was going to be a simple exercise has ended up turning into a brick wall :-/

If you see rust specific problems with my code, I'm interested in learning about that as well, of course!

First, I haven't considered your code, just attempted the algorithm. If you have any specific questions, let me know. I based it off of this C code, which is the original implementation according to this other article. It seems to do other comparisons for similarity as well, which I just ignored.

I'll assume you understand the "search radius" for what constitutes a match. Furthermore,

  • The algorithm further tracks which character positions matched.
  • Each position in either string can only be part of one match.
  • For transpositions, the independently ordered matches from both strings are compared:
    • If the are not the same, this is one half of a transposition.
    • The total number of transpositions is thus half the number of non-matches.

Here it is on the playground. Edit: See next post for corrected link.

I commented out the 0.000 test cases as they don't seem sensible.

I took a look at your code. I think it would be hard to match up the transposition logic within a single pass. For example, consider the pair ("abcdeghiJkl", "abcdefJghilk"). The first time you'll detect a transposition is at position 10 in the first string (J in the first string), but by the algorithm's definition, there are 3 transpositions due the ghi matches all being shifted out of place. So I don't think there's a way to avoid tracking match positions -- one match can "cause" multiple transpositions depending on the number of other matches between the two positions in the strings.

Another note, your max_trans calculation needs to check for underflow.

When trying to suss out an in-loop approach to transposition counting, I discovered a bug in my original version, so here's an updated link. (The original didn't adjust total search span size when the positions were close to the beginning of the string.)

Yeah. That makes sense. I'll abandon my attempt to do a single pass.

I'll fix max_trans as well.

Thanks!