How to optimize this code?

I have a function that takes a u128, converts it to a Vec<u32> of digits, sorts the vector, moves all of the zeros at the front of the vec to the back (i.e., 0000112335589 -> 1123355890000), then turns it back into a u128 (via a String).

Code:

fn sort(num: u128) -> u128 {
    let num_str = num.to_string();
    let mut digits: Vec<u32> = num_str.chars().map(|d| d.to_digit(10).unwrap()).collect();
    digits.sort();

    // count the zeros
    let zeros = digits.iter().filter(|&n| *n == 0).count();

    // remove the zeros
    while let Some(0) = digits.first() {
        digits.remove(0);
    }

    // add the zeros back at the end
    for _ in 0..zeros {
        digits.insert(digits.len(),0);
    }
    
    let mut buffer = String::from("");

    // build the number into a string so i can use `parse::<u128>()`
    for i in digits {
        buffer.push_str(&i.to_string());
    }

    // if there is a better way to do this please tell me
    buffer.parse::<u128>().unwrap()
}

The issue that I am having is that cargo bench says the function takes ~300 ms/iter (even though at runtime it does not feel that slow), and I need it to be much faster than that. What can I do to make the code faster?

There's no chance this takes a third of a full second. You are measuring something wrong.

A quick repro shows that even in a debug build, the runtime is in the single-digit microseconds.


With that said, there is a lot of room for improvement. In general, you should not use strings for solving numerical problems. Things like `push_str(&i.to_string());` just scream "unnecessary work".

A number can be decomposed into its digits by repeated division, and re-composed by repeated multiplication and addition. Your `.filter().count()` is just `.partition_point()` (a binary search, possible because the 0s are always at the beginning after sorting), and the two subsequent loops are just `.rotate_left()`. Furthermore, there's no need for dynamic allocation, because 2^128 < 10^39, so all values have at most 39 digits, so `ArrayVec` it is.
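The optimized code itself isn't quoted here, but a minimal sketch of that approach (assuming the `arrayvec` crate) might look like this:

use arrayvec::ArrayVec;

// A sketch of the approach described above, not the poster's original code.
pub fn sort(mut num: u128) -> u128 {
    // u128::MAX has 39 decimal digits, so a fixed-capacity buffer is enough.
    let mut digits: ArrayVec<u8, 39> = ArrayVec::new();
    while num > 0 {
        digits.push((num % 10) as u8);
        num /= 10;
    }
    digits.sort_unstable();

    // After sorting, all zeros sit at the front; find where they end (binary search)...
    let zeros = digits.partition_point(|&d| d == 0);
    // ...and move them to the back in one step.
    digits.rotate_left(zeros);

    // Re-compose by repeated multiplication and addition.
    digits.iter().fold(0u128, |acc, &d| acc * 10 + u128::from(d))
}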

This is down to ~65 ns from ~500 ns, an order of magnitude faster.


Why do you want to optimize this? 300ms is a lot. Are you sure about this number?

If you're sorting the digits, that means you didn't care about the initial order, which means you can just store the multiplicity of each digit -- aka do a counting sort (Counting sort - Wikipedia).

Something like this:

fn sort_alt(mut num: u128) -> u128 {
    // Count how many times each digit 0-9 occurs (the counting-sort pass).
    let mut digits = [0_u8; 10];

    while num > 0 {
        let r;
        (num, r) = (num / 10, num % 10);
        digits[r as usize] += 1;
    }

    // Rebuild the number: digits 1..=9 in ascending order, then the zeros at the end.
    for i in (1..=9).chain([0]) {
        for _ in 0..digits[i] {
            num *= 10;
            num += i as u128;
        }
    }

    num
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=e09b83aa27661f39c7ea1fd8293c8521

No dynamic allocation and no sorting, so should be much faster.


I made a quick cargo bench:


// requires a nightly toolchain, with `#![feature(test)]` and `extern crate test;` at the crate root
#[cfg(test)]
mod tests {
    use test::Bencher;
    
    fn sort(num: u128) -> u128 {
        let num_str = num.to_string();
        let mut digits: Vec<u32> = num_str.chars().map(|d| d.to_digit(10).unwrap()).collect();
        digits.sort();
    
        // count the zeros
        let zeros = digits.iter().filter(|&n| *n == 0).count();
    
        // remove the zeros
        while let Some(0) = digits.first() {
            digits.remove(0);
        }
    
        // add the zeros back at the end
        for _ in 0..zeros {
            digits.insert(digits.len(),0);
        }
        
        let mut buffer = String::from("");
    
        // build the number into a string so i can use `parse::<u128>()`
        for i in digits {
            buffer.push_str(&i.to_string());
        }
    
        // if there is a better way to do this please tell me
        buffer.parse::<u128>().unwrap()
    }

    #[bench]
    fn bench_pow(b: &mut Bencher) {
        b.iter(|| {
            // Inner closure, the actual test
            for i in 1..100000000 {
                sort(i);
            }
        });
    }
}

The result of the benchmark was something like 300,000,000 ns/iter, which I figured was 300 ms (0.3 seconds) when converted. Maybe I converted incorrectly.

You are benching the whole loop, that is, `sort` repeated 99,999,999 times.

`iter()` automatically runs the closure as many times as it needs. You are measuring 100 million executions per reported iteration. Just drop the inner loop; it's completely unnecessary.

(By the way, how could the benchmark harness possibly know how many iterations you are doing in a black-box closure?)
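For illustration, a corrected bench body might look something like this (the input value is arbitrary; `std::hint::black_box` just keeps the constant from being folded away):

#[bench]
fn bench_sort(b: &mut Bencher) {
    // Time a single call per iteration and let the harness pick the iteration count.
    b.iter(|| sort(std::hint::black_box(112_335_589_000_u128)));
}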


See also.

Treating the zero case specially got me another 10% speedup.

pub fn sort(mut num: u128) -> u128 {
    let mut digits = [0_u8; 10];
    while num > 0 {
        let r;
        (num, r) = (num / 10, num % 10);
        digits[r as usize] += 1;
    }
    for i in 1..=9 {
        for _ in 0..digits[i] {
            num *= 10;
            num += i as u128;
        }
    }
    num *= 10u128.pow(digits[0].into()); // special case "0"
    num
}

Using itoa probably helps as well because it does substantially fewer divisions (4 divisions per 4 digits, compared to 2 divisions per digit in the naive algorithm).
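For example, the digit-counting pass could lean on `itoa`'s allocation-free formatting instead of dividing a u128 repeatedly (a sketch, assuming the `itoa` crate; `digit_counts_itoa` is just an illustrative name):

// Sketch: count the decimal digits of a u128 by letting `itoa` do the integer-to-text work.
fn digit_counts_itoa(num: u128) -> [u8; 10] {
    let mut buf = itoa::Buffer::new(); // stack buffer, no heap allocation
    let mut counts = [0_u8; 10];
    for b in buf.format(num).bytes() {
        // `format` returns the digits as ASCII.
        counts[(b - b'0') as usize] += 1;
    }
    counts
}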


Reminder that a `/` plus a `%` is actually only one division on many common platforms, including x64. Demo: https://rust.godbolt.org/z/47W3csc47
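That is, a pattern like the following minimal example (not the linked demo verbatim) compiles to a single divide that yields both results:

// On x86-64, `div` produces the quotient and the remainder at once,
// so LLVM merges these two operations into one instruction.
pub fn div_rem(n: u64, d: u64) -> (u64, u64) {
    (n / d, n % d)
}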

But of course in this code there are actually no divisions at all in the assembly, because LLVM replaces division by a constant with multiply-and-shift tricks: https://rust.godbolt.org/z/nTezTdh1M

Probably the real reason that atoi would be faster is that it dispatches to smaller numbers rather than working on u128 the whole time. (Up to two .div_rem(10_000_000_000_000_000_000)s to do the work using up to three u64s will be much faster than needing 128-bit operations every time.)

Not for u128. Compiler Explorer

*itoa (yay, C!). Something like that. Here is the code for parsing 4 digits with 2 divs and 2 mods: one div and one mod on u128, and one div and one mod on isize. However, I do not understand the compiler output of the code section in the inner loop: Compiler Explorer. I see 3 imul instructions and a call (to a division?).

But: my benchmarks say that when "digitizing" small numbers, the tight loop is still faster.

This is such a weird computation I have to ask: why are you doing this and why does it have to be fast?


You can divide and conquer more aggressively. If you first reduce mod 10^19 you can work with a u64 in the inner loop.
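A rough sketch of that idea applied to the digit-counting pass (the name `digit_counts_chunked` and the constant name `CHUNK` are just for illustration):

// Peel off up to 19 decimal digits at a time: one expensive u128 division per chunk,
// then cheap u64 divisions for the digits inside the chunk.
fn digit_counts_chunked(mut num: u128) -> [u8; 10] {
    const CHUNK: u128 = 10_000_000_000_000_000_000; // 10^19, still fits in a u64
    let mut counts = [0_u8; 10];
    while num > 0 {
        let mut chunk = (num % CHUNK) as u64;
        let is_top_chunk = num < CHUNK;
        num /= CHUNK;
        if is_top_chunk {
            // Most significant chunk: stop at its leading digit.
            while chunk > 0 {
                counts[(chunk % 10) as usize] += 1;
                chunk /= 10;
            }
        } else {
            // Lower chunks: count all 19 positions, including interior zeros.
            for _ in 0..19 {
                counts[(chunk % 10) as usize] += 1;
                chunk /= 10;
            }
        }
    }
    counts
}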

Thus my second point -- even for u128, with a constant divisor 10 there might be 5 muls, but there's not a single div: https://rust.godbolt.org/z/qEG84ajW1

Computers are just terrible at division. On my Zen 2, for example, apparently a 64-bit mul is 3 cycles of latency, whereas a 64-bit div is 13-44 cycles. So those five muls for the 128-bit division by a constant are usually faster than a single 64-bit div, and probably still faster if you include a bunch of extra single-cycle subs and such.

There's no native 128-bit division, and it's so complicated that LLVM doesn't inline the code for it, but instead calls out to a function to do it.

Interesting that the `/ 10000` doesn't get the same optimization-to-mul-and-shift that `/ 10` does.

