How to accidentally miss checking errors

So I was chatting with my AI friend about determining whether two English words are anagrams of each other. Along the way it came up with the usual solutions: comparing the sorted letters of each word, counting the frequencies of letters in each word, removing matching letters until either something or nothing is left. Eventually I got it to use the fundamental theorem of arithmetic to do the job. All good idiomatic Rust solutions, but dog slow compared to mine:

pub fn are_anagrams_crust(word_1: &str, word_2: &str) -> Result<bool, &'static str> {
    fn word_product(word: &str) -> Result<u128, &'static str> {
        const PRIMES: [u128; 26] = [
            2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83,
            89, 97, 101,
        ];
        let mut pw1 = &raw const *word as *const u8;
        let mut product: u128 = 1;
        for _ in 0..word.len() {
            let mut c = unsafe { *(pw1) };

            if c.is_ascii_lowercase() {
                c -= 32;
            }
            if c.is_ascii_uppercase() {
                let prime = PRIMES[(c - b'A') as usize];
                product = product.checked_mul(prime).ok_or("First word too long")?;
            } else {
                return Err("Non alphabetic character");
            }
            pw1 = pw1.wrapping_add(1);
        }
        Ok(product)
    }
    Ok(word_product(word_1) == word_product(word_2))
}

Which has a smidgen of unsafe and is written in a more C-like fashion (Crust).

But can you spot the error above? It all compiles so it must work, right? Clippy does not complain, Miri does not complain. What could it be?

Well, I missed a couple of ? in returning the final result. So we can end up comparing an error with an error and producing Ok even if there are errors. Which Clippy thinks is checking the errors, I guess.

Took me a while to spot that.
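To make the failure mode concrete, here is a contrived toy example (not the function above): Result implements PartialEq, so two matching errors compare equal and the outer Ok happily wraps true.

fn parse_number(s: &str) -> Result<u32, &'static str> {
    s.parse().map_err(|_| "not a number")
}

// Missing the two `?`, just like the original: the errors are compared instead of propagated.
fn same_number(a: &str, b: &str) -> Result<bool, &'static str> {
    Ok(parse_number(a) == parse_number(b))
}

fn main() {
    // Both inputs are invalid, both calls return Err("not a number"),
    // Err == Err is true, so we get Ok(true) instead of an error.
    assert_eq!(same_number("x", "y"), Ok(true));
}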

Anyway, I'd be interested if anyone can do this as fast or faster in safe and idiomatic Rust? Just now Cargo bench puts it well in the lead.

4 Likes

Nice, very creative! And indeed, the compiler can't detect everything.

Note that you don't really need unsafe:

pub fn are_anagrams_crust(word_1: &str, word_2: &str) -> Result<bool, &'static str> {
    fn word_product(word: &str) -> Result<u128, &'static str> {
        static PRIMES: [u128; 26] = [
            2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83,
            89, 97, 101,
        ];
        let mut product: u128 = 1;
        for c in word.as_bytes() {
            let c = c.to_ascii_uppercase();
            if c.is_ascii_uppercase() {
                let prime = PRIMES[(c - b'A') as usize];
                product = product.checked_mul(prime).ok_or("Word too long")?;
            } else {
                return Err("Non alphabetic character");
            }
        }
        Ok(product)
    }
    Ok(word_product(word_1)? == word_product(word_2)?)
}

(The compiler output will be exactly the same)

2 Likes

Well, when I had the idea to use primes to find anagrams years ago I thought I had an original idea. Silly me. On suggesting to my AI friend that it was possible, it did quite a good job of it, and revealed that people have been doing this in different languages all over the net.

Given that the product of primes produces a kind of hash of a word that is guaranteed to be the same for the whole set of its anagrams, I have been wondering if this technique might have uses elsewhere, wherever we need to quickly test if a thing is a member of some set or not.

Brilliant. Thanks.
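For instance, the sort of thing I have in mind (just an illustrative sketch, not benchmarked): key a map by the product so a whole word list collapses into its anagram classes.

use std::collections::HashMap;

// Illustrative only: group a word list into anagram classes by keying a map
// with the prime product. Words with non-alphabetic characters or products
// that overflow u128 are simply skipped in this sketch.
fn anagram_classes<'a>(words: &[&'a str]) -> HashMap<u128, Vec<&'a str>> {
    const PRIMES: [u128; 26] = [
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83,
        89, 97, 101,
    ];
    let mut classes: HashMap<u128, Vec<&str>> = HashMap::new();
    'words: for &word in words {
        let mut product: u128 = 1;
        for &b in word.as_bytes() {
            let c = b.to_ascii_uppercase();
            if !c.is_ascii_uppercase() {
                continue 'words; // skip non-alphabetic words
            }
            match product.checked_mul(PRIMES[(c - b'A') as usize]) {
                Some(p) => product = p,
                None => continue 'words, // skip words whose product overflows
            }
        }
        classes.entry(product).or_default().push(word);
    }
    classes
}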

Reminds me of perfect hash functions, but your set here is the frequency of letters.

I wouldn't bother with something clever like this normally of course, but it's always good to get a reminder of the tools we have if we do want to boost performance.

If you wanted to support longer words without overflowing, you could go with this approach: Rust Playground

The original approach calculates the actual product of primes (which can overflow u128 at around 20 characters), while the modular approach calculates the product modulo several large primes instead. This keeps all intermediate results bounded and prevents overflow, allowing unlimited string length at the cost of an astronomically small chance of collision.
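Roughly, the idea looks like this (an illustrative sketch rather than the exact playground code; the moduli below are just well-known primes near 10⁹):

pub fn are_anagrams_modular(word_1: &str, word_2: &str) -> Result<bool, &'static str> {
    fn word_products(word: &str) -> Result<[u64; 4], &'static str> {
        const PRIMES: [u64; 26] = [
            2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79,
            83, 89, 97, 101,
        ];
        // Four large prime moduli; each running product stays below 2^30,
        // so multiplying by a letter prime (<= 101) can never overflow a u64.
        const MODULI: [u64; 4] = [1_000_000_007, 1_000_000_009, 999_999_937, 998_244_353];
        let mut products = [1u64; 4];
        for &b in word.as_bytes() {
            let c = b.to_ascii_uppercase();
            if !c.is_ascii_uppercase() {
                return Err("Non alphabetic character");
            }
            let prime = PRIMES[(c - b'A') as usize];
            for (p, &m) in products.iter_mut().zip(MODULI.iter()) {
                *p = (*p * prime) % m;
            }
        }
        Ok(products)
    }
    Ok(word_products(word_1)? == word_products(word_2)?)
}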

3 Likes

If I understand my theory, you would only be collision free for the smallest prime factor of the modulo. The sum of prime factors isn't a prime!

You're absolutely right about sums, but we're using products of primes, not sums. This preserves the unique factorization property even under modular arithmetic since our moduli (≈10⁹) are much larger than any individual prime factor (≤101). The collision probability across 4 independent large prime moduli is approximately 10⁻³⁶, making false positives astronomically unlikely.

1 Like

Ah, I misread the original algorithm as being a sum of prime factors, e.g. sum(l in a..z, n_l * p_l) for letter count n_l and prime coefficient p_l. But of course, 2+3=5. I have no idea why I thought that would work (other than thinking of perfect hash functions, which are often something like that).

Any reason why you didn't pick one single large prime? They're pretty well known, e.g. 2**128 - 159, the largest prime below 2**128.

1 Like

Multiple smaller moduli avoid edge cases near u128::MAX and make debugging easier if there's ever a collision. A single large prime modulus would work fine too - just a different engineering trade-off.

3 Likes

Why not compare the word lengths first? Then you wouldn't need prime numbers?

A length check is needed but not enough - "abc" and "def" have the same length but different letters. The prime product checks both the length AND which letters appear (and how often), all in one calculation.

1 Like

OK, 3 × 4 = 6 × 2.

This actually is a brilliant idea.

Your idea is brilliant too. It's hard because of the large primes.
However, reducing modulo a prime can still lead to mistakes.
For instance, suppose I have only one prime modulus:
2^17 - 1 is prime, and 2^17 % (2^17 - 1) = 1,
so multiplying the product by 2 seventeen times (seventeen extra letters mapped to the prime 2) changes nothing modulo that prime.

Interesting. Can you post more topics like this?

Interestingly with [u128;4] you use ~19.7 bits per letter of the alphabet.

So a solution that just counts the letters in a [u16; 26] should be faster while guaranteeing no collision up to 65,535 occurrences of the same character.

While I haven't tested it, this should be faster as it only needs increments, with neither multiplications nor modular operations.
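A minimal sketch of what I mean (untested, with the error handling borrowed from the versions above):

pub fn are_anagrams_counting(word_1: &str, word_2: &str) -> Result<bool, &'static str> {
    fn letter_counts(word: &str) -> Result<[u16; 26], &'static str> {
        let mut counts = [0u16; 26];
        for &b in word.as_bytes() {
            let c = b.to_ascii_uppercase();
            if !c.is_ascii_uppercase() {
                return Err("Non alphabetic character");
            }
            // Each counter is good for up to 65,535 occurrences of one letter.
            counts[(c - b'A') as usize] += 1;
        }
        Ok(counts)
    }
    // Equal length is necessary anyway, and lets us skip the counting for most non-anagrams.
    Ok(word_1.len() == word_2.len() && letter_counts(word_1)? == letter_counts(word_2)?)
}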

Also fun: the 10^-36 is equivalent to about 2^119 possible values, and a birthday attack is said to reach a 50% collision probability at around 2^(n/2), so 2^59, which is about 26^12.5. So if we check all possible combinations up to 13 letters in length we should find a collision - which only takes about 80 million years at 1000 words per second to find one (1) collision.

1 Like

With counting letters you don't need the product of primes, and it only costs you 26 × u16 of space to count up the letters in word_1, count down the letters in word_2, and return true if the whole array is 0. That's O(word_1.len() + word_2.len()).

But both functions should check the lengths first, because that check only costs O(1).

True. But it's also the slowest solution I have here. Forget all that Big O stuff, reality bytes. Here is a summary of my benchmark results for different approaches:

sort_anagrams               300 ns
freq_count_anagrams         515 ns
removal_anagrams            319 ns
prime_product_anagrams      126 ns
prime_product_u8_anagrams    73 ns
prime_product_crust          45 ns
prime_product_redglyph       45 ns
prime_product_keyosk        302 ns

Hope those benchmark names suggest the algorithm used and who suggested them. The first 4 work in Rust characters whereas the last 4 work on ASCII bytes.

The prime number trick wins by a wide margin even when using char. Working with ASCII bytes is a real boost over that.

Of course this optimisation only works because the problem is only concerned with the English language, so we can use ASCII, and word lengths are limited, which stops the products from getting too big.

STOP PRESS:

Well, it works for the Guinness Book of Records longest anagram:
representationism = misrepresentation,
but fails if we allow for the longest scientific word:
hydroxydeoxycorticosterones = hydroxydesoxycorticosterone

Last time I checked it worked for all the words in Debian's biggest English dictionary file.

You are right about checking the lengths first, of course. I didn't bother yet, as then I wouldn't be measuring anything.

1 Like

Turns out not. See above. I was surprised that such letter frequency counting is even slower than the prime product trick.

1 Like

The difference in time with the counting version looks suspicious. Can you post your implementation?

2 Likes

Regarding performance: it's very different when you test whether a collection of words contains an anagram of a given word. In that case, a simple bin algorithm would beat the prime method (unless it's transformed somewhat, maybe).

I did a quick test with pseudo-randomly generated (identical) lists, and both find an anagram of an 8-letter word after checking 73,870,264 candidates:

  • bins took 0.485 s
  • prime took 57.305 s

(compiled in release)

It comes from the bins being able to reject any word as soon as a letter isn't in the searched word.

I'm not sure it's easy to do a similar test in the prime algorithm; you'd need to perform a number of modulos to check if a prime is already included, at a high cost. Although you could store the constants to transform the division (modulo) by the primes into multiplications: perhaps worth trying? I haven't checked whether that'd work with the modular multiplications used for lengths > 19, but perhaps it accelerates the original prime algorithm in that context.

PS: Here's a draft of what I mean with the bins method: playground — I've commented out the constants for the "long" test, but don't try to run it from there! I haven't really tried to optimize it (the bin version is in find_anagram).
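And in case the link goes stale, here is a rough inline sketch of the per-candidate rejection (not the exact playground code; the names are made up):

// Precompute the letter counts of the searched word once; then each candidate
// is rejected as soon as a letter isn't available in the remaining bins.
fn is_anagram_of(candidate: &str, target_counts: &[u8; 26], target_len: usize) -> bool {
    if candidate.len() != target_len {
        return false;
    }
    let mut remaining = *target_counts;
    for &b in candidate.as_bytes() {
        let c = b.to_ascii_uppercase();
        if !c.is_ascii_uppercase() {
            return false;
        }
        let bin = &mut remaining[(c - b'A') as usize];
        if *bin == 0 {
            return false; // letter not (or no longer) available: early reject
        }
        *bin -= 1;
    }
    // Same length and every letter consumed a matching bin: it's an anagram.
    true
}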

2 Likes