How to optimize this code?

I have a function that takes a u128, converts it to a Vec<u32> of digits, sorts the vector, moves all of the zeros at the front of the vec to the back (i.e., 0000112335589 -> 1123355890000), then turns it back into a u128 (via a String).

Code:

fn sort(num: u128) -> u128 {
    let num_str = num.to_string();
    let mut digits: Vec<u32> = num_str.chars().map(|d| d.to_digit(10).unwrap()).collect();
    digits.sort();

    // count the zeros
    let zeros = digits.iter().filter(|&n| *n == 0).count();

    // remove the zeros
    while let Some(0) = digits.first() {
        digits.remove(0);
    }

    // add the zeros back at the end
    for _ in 0..zeros {
        digits.insert(digits.len(),0);
    }
    
    let mut buffer = String::from("");

    // build the number into a string so i can use `parse::<u128>()`
    for i in digits {
        buffer.push_str(&i.to_string());
    }

    // if there is a better way to do this please tell me
    buffer.parse::<u128>().unwrap()
}

The issue that I am having is that cargo bench says the function takes ~300 ms/iter (even though at runtime it does not feel that slow), and I need it to be much faster than that. What can I do to make the code faster?

There's no chance this takes a third of a full second. You are measuring something wrong.

A quick repro shows that even in a debug build, the runtime is in the single-digit microseconds.


With that said, there is a lot of room for improvement. In general, you should not use strings for solving numerical problems. Things like `push_str(&i.to_string());` just scream "unnecessary work".

A number can be decomposed into its digits by repeated division, and re-composed by repeated multiplication and addition. Your `.filter().count()` is just `.partition_point()` (a binary search, possible because the 0s are always at the beginning after sorting), and the two subsequent loops are just `.rotate_left()`. Furthermore, there's no need for dynamic allocation, because 2^128 < 10^39, so all values have at most 39 digits, so `ArrayVec` it is.
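The optimized code itself isn't quoted here, but a minimal sketch of that approach (assuming the `arrayvec` crate) might look like this:

use arrayvec::ArrayVec;

// A sketch of the approach described above, not the poster's original code.
pub fn sort(mut num: u128) -> u128 {
    // u128::MAX has 39 decimal digits, so a fixed-capacity buffer is enough.
    let mut digits: ArrayVec<u8, 39> = ArrayVec::new();
    while num > 0 {
        digits.push((num % 10) as u8);
        num /= 10;
    }
    digits.sort_unstable();

    // After sorting, all zeros sit at the front; find where they end (binary search)...
    let zeros = digits.partition_point(|&d| d == 0);
    // ...and move them to the back in one step.
    digits.rotate_left(zeros);

    // Re-compose by repeated multiplication and addition.
    digits.iter().fold(0u128, |acc, &d| acc * 10 + u128::from(d))
}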

This is down to ~65 ns from ~500 ns, an order of magnitude faster.


Why do you want to optimize this? 300ms is a lot. Are you sure about this number?

If you're sorting the digits, that means you didn't care about the initial order, which means you can just store the multiplicity of each digit -- aka do a counting sort (Counting sort - Wikipedia).

Something like this:

fn sort_alt(mut num: u128) -> u128 {
    // Count how many times each digit 0-9 occurs (the counting-sort pass).
    let mut digits = [0_u8; 10];

    while num > 0 {
        let r;
        (num, r) = (num / 10, num % 10);
        digits[r as usize] += 1;
    }

    // Rebuild the number: digits 1..=9 in ascending order, then the zeros at the end.
    for i in (1..=9).chain([0]) {
        for _ in 0..digits[i] {
            num *= 10;
            num += i as u128;
        }
    }

    num
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=e09b83aa27661f39c7ea1fd8293c8521

No dynamic allocation and no sorting, so should be much faster.


I made a quick cargo bench:


// requires a nightly toolchain, with `#![feature(test)]` and `extern crate test;` at the crate root
#[cfg(test)]
mod tests {
    use test::Bencher;
    
    fn sort(num: u128) -> u128 {
        let num_str = num.to_string();
        let mut digits: Vec<u32> = num_str.chars().map(|d| d.to_digit(10).unwrap()).collect();
        digits.sort();
    
        // count the zeros
        let zeros = digits.iter().filter(|&n| *n == 0).count();
    
        // remove the zeros
        while let Some(0) = digits.first() {
            digits.remove(0);
        }
    
        // add the zeros back at the end
        for _ in 0..zeros {
            digits.insert(digits.len(),0);
        }
        
        let mut buffer = String::from("");
    
        // build the number into a string so i can use `parse::<u128>()`
        for i in digits {
            buffer.push_str(&i.to_string());
        }
    
        // if there is a better way to do this please tell me
        buffer.parse::<u128>().unwrap()
    }

    #[bench]
    fn bench_pow(b: &mut Bencher) {
        b.iter(|| {
            // Inner closure, the actual test
            for i in 1..100000000 {
                sort(i);
            }
        });
    }
}

The result of the benchmark was something like 300,000,000 ns/iter, which I figured was 300 ms (0.3 seconds) when converted. Maybe I converted incorrectly.

You are benching the whole loop, that is, `sort` repeated 99,999,999 times.

`iter()` automatically runs the closure as many times as it needs. You are measuring 100 million executions per reported iteration. Just drop the inner loop; it's completely unnecessary.

(By the way, how could the benchmark harness possibly know how many iterations you are doing in a black-box closure?)
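For illustration, a corrected bench body might look something like this (the input value is arbitrary; `std::hint::black_box` just keeps the constant from being folded away):

#[bench]
fn bench_sort(b: &mut Bencher) {
    // Time a single call per iteration and let the harness pick the iteration count.
    b.iter(|| sort(std::hint::black_box(112_335_589_000_u128)));
}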


See also.

Treating the zero case specially got me another 10% speedup.

pub fn sort(mut num: u128) -> u128 {
    let mut digits = [0_u8; 10];
    while num > 0 {
        let r;
        (num, r) = (num / 10, num % 10);
        digits[r as usize] += 1;
    }
    for i in 1..=9 {
        for _ in 0..digits[i] {
            num *= 10;
            num += i as u128;
        }
    }
    num *= 10u128.pow(digits[0].into()); // special case "0"
    num
}

Using itoa probably helps as well because it does substantially fewer divisions (4 divisions per 4 digits, compared to 2 divisions per digit in the naive algorithm).
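For example, the digit-counting pass could lean on `itoa`'s allocation-free formatting instead of dividing a u128 repeatedly (a sketch, assuming the `itoa` crate; `digit_counts_itoa` is just an illustrative name):

// Sketch: count the decimal digits of a u128 by letting `itoa` do the integer-to-text work.
fn digit_counts_itoa(num: u128) -> [u8; 10] {
    let mut buf = itoa::Buffer::new(); // stack buffer, no heap allocation
    let mut counts = [0_u8; 10];
    for b in buf.format(num).bytes() {
        // `format` returns the digits as ASCII.
        counts[(b - b'0') as usize] += 1;
    }
    counts
}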


Reminder that a `/` plus a `%` is actually only one division on many common platforms, including x64. Demo: https://rust.godbolt.org/z/47W3csc47
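That is, a pattern like the following minimal example (not the linked demo verbatim) compiles to a single divide that yields both results:

// On x86-64, `div` produces the quotient and the remainder at once,
// so LLVM merges these two operations into one instruction.
pub fn div_rem(n: u64, d: u64) -> (u64, u64) {
    (n / d, n % d)
}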

But of course in this code there are actually no divisions at all in the assembly, because LLVM replaces division by a constant with multiply-and-shift tricks: https://rust.godbolt.org/z/nTezTdh1M

Probably the real reason that atoi would be faster is that it dispatches to smaller numbers rather than working on u128 the whole time. (Up to two .div_rem(10_000_000_000_000_000_000)s to do the work using up to three u64s will be much faster than needing 128-bit operations every time.)

Not for u128. Compiler Explorer

*itoa (yay, C!). Something like that. Here is the code for parsing 4 digits with 2 divs and 2 mods: one div and one mod on u128, and one div and one mod on isize. However, I do not understand the compiler output of the code section in the inner loop: Compiler Explorer. I see 3 imul instructions and a call (to a division?).

But: my benchmarks say that when "digitizing" small numbers, the tight loop is still faster.

This is such a weird computation I have to ask: why are you doing this and why does it have to be fast?


You can divide and conquer more aggressively. If you first reduce mod 10^19 you can work with a u64 in the inner loop.
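A rough sketch of that idea applied to the digit-counting pass (the name `digit_counts_chunked` and the constant name `CHUNK` are just for illustration):

// Peel off up to 19 decimal digits at a time: one expensive u128 division per chunk,
// then cheap u64 divisions for the digits inside the chunk.
fn digit_counts_chunked(mut num: u128) -> [u8; 10] {
    const CHUNK: u128 = 10_000_000_000_000_000_000; // 10^19, still fits in a u64
    let mut counts = [0_u8; 10];
    while num > 0 {
        let mut chunk = (num % CHUNK) as u64;
        let is_top_chunk = num < CHUNK;
        num /= CHUNK;
        if is_top_chunk {
            // Most significant chunk: stop at its leading digit.
            while chunk > 0 {
                counts[(chunk % 10) as usize] += 1;
                chunk /= 10;
            }
        } else {
            // Lower chunks: count all 19 positions, including interior zeros.
            for _ in 0..19 {
                counts[(chunk % 10) as usize] += 1;
                chunk /= 10;
            }
        }
    }
    counts
}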

Thus my second point -- even for u128, with a constant divisor 10 there might be 5 muls, but there's not a single div: https://rust.godbolt.org/z/qEG84ajW1

Computers are just terrible at division. On my Zen 2, for example, apparently a 64-bit mul is 3 cycles of latency, whereas a 64-bit div is 13-44 cycles. So those five muls for the 128-bit division by a constant are usually faster than a single 64-bit div, and probably still faster if you include a bunch of extra single-cycle subs and such.

There's no native 128-bit division, and it's so complicated that LLVM doesn't inline the code for it, but instead calls out to a function to do it.

Interesting that the `/ 10000` doesn't get the same optimization-to-mul-and-shift that `/ 10` does.

