I have a function that takes in a u128, converts it to a Vec<u32>, sorts the vector, takes all of the zeros at the front of the vec and moves them to the back (ie, 0000112335589 -> 1123355890000), then turns it back into a String.

Code:

fn sort(num: u128) -> u128 {
let num_str = num.to_string();
let mut digits: Vec<u32> = num_str.chars().map(|d| d.to_digit(10).unwrap()).collect();
digits.sort();
// count the zeros
let zeros = digits.iter().filter(|&n| *n == 0).count();
// remove the zeros
while let Some(0) = digits.first() {
digits.remove(0);
}
// add the zeros back at the end
for _ in 0..zeros {
digits.insert(digits.len(),0);
}
let mut buffer = String::from("");
// build the number into a string so i can use `parse::<u128>()`
for i in digits {
buffer.push_str(&i.to_string());
}
// if there is a better way to do this please tell me
buffer.parse::<u128>().unwrap()
}

The issue that I am having is that cargo bench says the the function takes ~300 ms/iter (even though at runtime, it does not feel that slow) and I need it to be much faster then that. What can I do to make the code faster?

There's no chance this takes a third of a full second. You are measuring something wrong.

A quick repro shows that even in a debug build, the runtime is in the single-digit microseconds.

With that said, there is a lot of room for improvement. In general, you should not use strings for solving numerical problems. Things like push_str(&i.to_string()); just scream "unnecessary work".

A number can be decomposed into its digits by repeated division, and re-composed by repeated multiplication and addition. Furthermore, your .filter().count() is just .partition_point() (a binary search, possible because 0s are always at the beginning), and the subsequent loop is just .rotate_left(). Furthermore, there's no need for dynamic allocation, because 2^128 < 10^36, so all numbers will be at most 36 digits, so ArrayVec it is.

This is down to ~65 ns from ~500, an order of magnitude faster.

If you're sorting the digits, that means you didn't care about the initial order, which means you can just store the multiplicity of the digits -- aka do a Counting sort - Wikipedia.

Something like this:

fn sort_alt(mut num: u128) -> u128 {
let mut digits = [0_u8; 10];
while num > 0 {
let r;
(num, r) = (num / 10, num % 10);
digits[r as usize] += 1;
}
for i in (1..=9).chain([0]) {
for _ in 0..digits[i] {
num *= 10;
num += i as u128;
}
}
num
}

#[cfg(test)]
mod tests {
use test::{Bencher};
fn sort(num: u128) -> u128 {
let num_str = num.to_string();
let mut digits: Vec<u32> = num_str.chars().map(|d| d.to_digit(10).unwrap()).collect();
digits.sort();
// count the zeros
let zeros = digits.iter().filter(|&n| *n == 0).count();
// remove the zeros
while let Some(0) = digits.first() {
digits.remove(0);
}
// add the zeros back at the end
for _ in 0..zeros {
digits.insert(digits.len(),0);
}
let mut buffer = String::from("");
// build the number into a string so i can use `parse::<u128>()`
for i in digits {
buffer.push_str(&i.to_string());
}
// if there is a better way to do this please tell me
buffer.parse::<u128>().unwrap()
}
#[bench]
fn bench_pow(b: &mut Bencher) {
b.iter(|| {
// Inner closure, the actual test
for i in 1..100000000 {
sort(i);
}
});
}
}

The result of the benchmark was something like 300,000,000 ns/iter, which I thought when converted to ms, was .3 seconds. Maybe I converted incorrectly.

Treating the zero-case extra got me another 10% speedup.

pub fn sort(mut num: u128) -> u128 {
let mut digits = [0_u8; 10];
while num > 0 {
let r;
(num, r) = (num / 10, num % 10);
digits[r as usize] += 1;
}
for i in 1..=9 {
for _ in 0..digits[i] {
num *= 10;
num += i as u128;
}
}
num *= 10u128.pow(digits[0].into()); // special case "0"
num
}

Using itoa probably helps as well because it does substantially fewer divisions. (4 divisions per 4 digits, compared to 2 divisions per digit in naive algorithm)

But of course in this code there's actually no divisions at all in the assembly, because LLVM replaces division-by-constant with multiply-and-shift tricks: https://rust.godbolt.org/z/nTezTdh1M

Probably the real reason that atoi would be faster is that it dispatches to smaller numbers rather than working on u128 the whole time. (Up to two .div_rem(10_000_000_000_000_000_000)s to do the work using up to three u64s will be much faster than needing 128-bit operations every time.)

*itoa (yay, C!). Something like that. here is the code for parsing 4 digits with 2 div, 2 mods. 1 div and one mod on u128, one div and one mod on isize. However, I do not understand the compiler output of the code section in the inner loop: Compiler Explorer. I see 3 imul instructions and a call (to a division?)

But: my benchmarks say that when "digitizing" small numbers, the thight loop is still faster.

Thus my second point -- even for u128, with a constant divisor 10 there might be 5 muls, but there's not a single div: https://rust.godbolt.org/z/qEG84ajW1

Computers are just terrible at division. On my zen2, for example, apparently a 64-bit mul is 3 cycles of latency, whereas a 64-bit div is 13-44. So those fivemuls for the 128-bit division-by-a-constant are usually faster than a single 64-bit div, and probably still faster if you include a bunch of extra single-cycle subs and such.

There's no native 128-bit division, and it's so complicated that LLVM doesn't inline the code for it, but instead calls out to a function to do it.

Interesting that the / 10000 doesn't get the same optimization-to-mul-and-shift that / 10 does.