Why is f32::from_str so slow


#1

Hi there,

I am currently writing a data processing program, which handles text files > 10GB.

One thing that is really interesting is that the program runs incredibly slowly, even in a release build.

I first thought it was BufReader converting to UTF-8 strings and making a lot of copies, so I used File directly with memchr and std::mem::transmute instead of BufRead::lines. That turned out to be a bit faster (roughly 50%), but still surprisingly slow.

However, I just realized that once I changed f32::from_str to atof, the program ran 8 times faster (148s vs 18s).

I am curious: is there any reason for Rust’s from_str being so much slower than atof?

It turns out Rust uses bignum for float parsing.


I have dug into the libcore code a little bit, and apparently there’s a pitfall in the string-to-float conversion in libcore.

The code first checks whether the string has more digits than an f64 can hold. If so, it uses bignum, which comes with a huge performance penalty. I suspect this is overkill, because we should be able to prove that the excess digits won’t affect the result, so I believe the bignum is avoidable.

And it seems all the overflow cases could be handled by comparing against a pre-computed max and min decimal.

Any thoughts ?


#2

There is some relevant information in PR #27307 and issue #53015.


#3

Hmm, interesting. It’s kind of similar, but my case doesn’t involve anything greater than 1e305.

Most of my decimals have roughly 10 digits, but the values are almost never larger than 10.

From the pull request description:

Numbers that need the AlgorithmM fallback (for f64, roughly everything below 1e-305 and above 1e305) take far, far longer, hundreds of microseconds. Note that my implementation is not quite as naive as the expository version in the paper (it needs one to four division instead of ~1000), but division is fundamentally pretty expensive and my implementation of it is extremely simple and slow.

But this is not quite true. The code actually checks the number of digits, so even a number well inside that range can fall onto the slow path.

Also, I suspect AlgorithmM is complete overkill here. It detects overflow, but there are faster ways to detect that. For example, it’s really easy to compute the decimal representation of the maximum IEEE 754 float, and every time the conversion falls onto the slow path, the first thing to check could be num.len() > max_double.len() || (num.len() == max_double.len() && strcmp(num, max_double) > 0).

After that check, there would be no remaining case that can cause overflow.
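Something like the following rough Rust sketch (certainly_overflows_f32 is just a made-up name for this example; it ignores signs, exponents, and the minimum end of the range):

// Rough sketch of the proposed pre-check.
// Assumes `int_digits` holds only the digits of the integer part: no sign,
// no exponent, no leading zeros.
fn certainly_overflows_f32(int_digits: &str) -> bool {
    // Exact decimal expansion of f32::MAX; Rust prints it exactly when an
    // explicit precision is requested.
    let max_digits = format!("{:.0}", f32::MAX);
    int_digits.len() > max_digits.len()
        || (int_digits.len() == max_digits.len() && int_digits > max_digits.as_str())
}

fn main() {
    assert!(!certainly_overflows_f32("12345"));
    assert!(certainly_overflows_f32(&"9".repeat(40))); // 40 digits, larger than f32::MAX
}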

For the precision problem, we can also easily prove that the number of decimal digits we need has an upper bound, so there’s definitely no need for a bignum to be involved.


#4

Sounds like you’ve found a performance bug.

In the meantime, an alternative way to serialize a double is to reinterpret the bits as a 64-bit integer and then serialize that. The con is that you can’t read the produced text as a number, but the pros are that it’s extremely fast and the deserialized value is bit-for-bit exact.
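A minimal sketch of that round trip, using the safe to_bits/from_bits helpers instead of a transmute:

fn main() {
    let x: f64 = 0.1;

    // Serialize: reinterpret the float's bits as a u64 and write that out as text.
    let encoded = x.to_bits().to_string();

    // Deserialize: parse the integer (cheap) and turn the bits back into an f64.
    let decoded = f64::from_bits(encoded.parse::<u64>().unwrap());

    assert_eq!(x.to_bits(), decoded.to_bits()); // bit-for-bit identical
    println!("{} -> {}", encoded, decoded);
}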


#5

You can have a look at this crate which addresses the issue:


#6

Interesting. I finally realized that if float parsing is defined as finding the best IEEE 754 representation of the decimal number, then the bignum isn’t avoidable.

However, if we weaken the definition to finding the largest IEEE 754 number that is smaller than the decimal, then we can avoid most of the CPU cost.

What makes it more interesting is that these two definitions actually yield slightly different values (the only difference is how the last bit is handled).

So I think that for 99% of the uses of this function, bignum is definitely overkill, and a much faster, less precise float parser would be really useful for those 99% of use cases.


#7

Rust tries to choose full correctness by default, which can be opted out of if you don’t need those guarantees.

Maybe one of the faster, less precise algorithms could be offered as f__::from_str_approx. For now, use the crates ecosystem to help prove that the API is useful.


#8

Actually, I just found that the easiest way to get around this is, instead of writing

f32::from_str(s)

to use the following:

f32::from_str(if s.len() > 16 { &s[..16] } else { s })

And this will fall onto the fast path, skipping the bignum, since an f32 can’t carry more than 16 decimal digits (plus at most one ‘.’, and assuming no leading zeros).
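A rough sketch of the same workaround wrapped in a helper (parse_f32_truncated is a made-up name; it assumes plain ASCII decimal strings with no exponent part, since chopping off a tail like e30 would give a badly wrong value):

use std::num::ParseFloatError;
use std::str::FromStr;

// Keep at most the first 16 bytes before parsing, so the parser stays on the
// fast path. Only safe for ASCII decimals such as "-12.3456789012345678".
fn parse_f32_truncated(s: &str) -> Result<f32, ParseFloatError> {
    let head = if s.len() > 16 { &s[..16] } else { s };
    f32::from_str(head)
}

fn main() {
    assert_eq!(parse_f32_truncated("3.14159").unwrap(), 3.14159_f32);
    println!("{}", parse_f32_truncated("1.000000178813934326171875").unwrap());
}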


I personally use the latter code already; it turns out to be even faster than atof (148s vs 18s vs 8s).
And I just realized that atof uses bignum as well (GMP for glibc), but it’s very interesting how glibc still performs so well.


#9

This is overly simplistic. For example, the following two numbers represent different f32 values but differ only after 26 digits:

fn main() {
    println!("{}", "1.000000178813934326171874".parse::<f32>().unwrap());
    println!("{}", "1.000000178813934326171875".parse::<f32>().unwrap());
}

#10

No, it’s not. https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=5971d8b2880ed02695f31964f6469fb4

fn main() {
    let a = "1.000000178813934326171874".parse::<f32>().unwrap();
    let b = "1.000000178813934326171875".parse::<f32>().unwrap();
    
    let ab = unsafe{std::mem::transmute::<_,&u32>(&a)};
    let bb = unsafe{std::mem::transmute::<_,&u32>(&b)};
    
    println!("{:x}", ab);
    println!("{:x}", bb);
    
    println!("{:.16}", a);
    println!("{:.16}", b);
}

It’s true that a and b are different float values, but in fact neither of them is exactly the number from the string.

The trick is that IEEE 754 f32 only keeps 23 fraction bits, which is a little bit less than 16 decimal digits. That means that if two decimal numbers differ only after the 16th digit, their IEEE 754 f32 forms can differ in at most one bit. And this is a limit of the floating-point format itself, rather than of any parsing algorithm.

When you run my code, you will see that the two numbers differ only in the least significant bit:

3f800001
3f800002

The difference is caused by the following: f32::from_str actually tries to minimize the error (which means choosing the nearest IEEE 754 number).

But my function just truncates the input, so it actually returns the largest IEEE 754 number that is smaller than the number you want.

As I said, this is less precise for sure, but it’s too expensive to spend 10x the CPU time just to determine the least significant bit, because at most you have an error of 2**(e-23), while Rust’s parser brings the error range down to 2**(e-24).
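To put a number on that bound: near 1.0 one f32 ULP is 2**-23, which is exactly f32::EPSILON, so truncation is off by at most one ULP and correct rounding by at most half of one:

fn main() {
    // Near 1.0, one f32 ULP is 2^-23, which is exactly f32::EPSILON.
    let ulp = f32::EPSILON;
    println!("one ULP at 1.0    = {:e}", ulp);       // ~1.1920929e-7
    println!("truncation error <= {:e}", ulp);       // at most one ULP
    println!("rounding error   <= {:e}", ulp / 2.0); // at most half a ULP
}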

But obviously the LSB doesn’t matter, at least in my case.

Also, just think about the nature of a text file: it’s highly likely that the program that produced the text file also did some rounding when it dumped the float. So it makes little sense to really care about the LSB, because the data is already in a less precise format.


#11

And what if the source comes not from a serialization of f32 but from a more precise source?

Your routine is not correct for certain strings (where “correct” means the closest float to the textual decimal number). The default needs to be correct. A cheaper approximate algorithm can be made available, but the default should be perfectly correct for all cases.

Lossy is sometimes the correct choice. For you it obviously is. No one is arguing that point.


#12

I actually made a mistake: in fact, 16 digits is sufficient even for f64, because log10(2**53) < 16.
For f32, only 8 digits are sufficient.

But the point is that we need at most ceil(log10(1 << f)) decimal digits to determine all the bits besides the LSB.
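A quick sanity check of those bounds, taking f to be the full significand width (24 bits for f32, 53 for f64):

fn main() {
    // Decimal digits needed for a b-bit significand: ceil(log10(2^b)) = ceil(b * log10(2)).
    let digits = |bits: u32| (bits as f64 * 2f64.log10()).ceil();
    println!("f32 (24-bit significand): {} digits", digits(24)); // 8
    println!("f64 (53-bit significand): {} digits", digits(53)); // 16
}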

And it seems the approximation works pretty well. I am not saying libcore should “fix” it, but I do really think this is a performance pitfall people should be aware of, along with a reasonable workaround.


#13

Note that the point of comparison in the OP, atof, is also implemented with bignums to be correctly rounded (at least if we’re talking about glibc) so sacrificing correct rounding is evidently not necessary for making up that order of magnitude of performance difference. The current implementation in Rust isn’t that fast (sorry about that), and it faces some challenges glibc doesn’t (libcore has no heap allocation so it can’t use existing bignum libraries), but it’s something to keep in mind when framing this as an “accuracy vs performance” trade off.


#14

Sure, that is why this surprises me so much, since atof has never been a performance bottleneck in my experience. Also, just curious: when can we get an alloca equivalent in Rust? That sounds really interesting.


#15

The current implementation is what it is because that’s what I managed to finish before I moved on to other projects :sweat_smile: and no one touched it since.

https://rust-lang.github.io/rfcs/1909-unsized-rvalues.html but to be clear, that would be of little to no use for f32::from_str since there’s no way to handle the stack overflow and thus no reliable way to reject overly long inputs or redirect them in a different code path. It has to make do with a fixed/statically bounded amount of stack space.


#16

Hmm, that’s a safety requirement C doesn’t have. Actually, I know GCC is really bad at determining the bound on stack size :joy:


#17

If you get to choose your own serialization format, consider a fast binary format like CBOR. https://github.com/pyfisch/cbor
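For example, a minimal sketch with the serde_cbor crate from that repo (assuming serde with its derive feature is also in Cargo.toml):

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Row {
    values: Vec<f32>,
}

fn main() -> Result<(), serde_cbor::Error> {
    let row = Row { values: vec![1.5, -0.25, 3.0e-7] };

    // CBOR stores floats in binary form, so reading them back involves no
    // decimal parsing at all.
    let bytes = serde_cbor::to_vec(&row)?;
    let decoded: Row = serde_cbor::from_slice(&bytes)?;

    assert_eq!(row, decoded);
    Ok(())
}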