Efficient f32 -> ascii -> f32 conversion?

  1. The conversion is not required to be f: f32 -> format!("{:?}", f);. In fact, I'm asking for alternative conversions.

  2. Problem: I'm (manually) [1] serializing a number of Rust structs to json/string, storing it in a DB, then, later deserializing it from json/string back to Rust.

  3. Many of these structs contain lots of f32's, and I want an efficient way to store this.

  4. A f32 is only 4 bytes. Even taking the ascii requirement, it shouldn't take more than 8 bytes. However, due to fractional part, format!("{:?}", f32) can easily take more than 8 chars.

  5. Is there some standard f32 -> ascii -> f32 conversion? It's fine the ascii rep is not human readable.

[1] Why aren't you just using serde? These structs might add/remove/delete fields over time. I need a gracefully handle this. Thus, the manually rolled serialization / deserialization.

If you are only storing them for retrieval and don't need to interpret them otherwise, you could just store them as 8 hex digits, or any denser invertible encoding that is acceptable to JSON.

4 Likes

The recently discovered Ryū algorithm converts floats to strings producing the minimum number of digits required to recover the exact same bits of the original float when converted back from string to float. That is, it produces the most space efficient ASCII representation of the float values.

Luckily Ryū is available in Rust: https://github.com/dtolnay/ryu

5 Likes

Use https://doc.rust-lang.org/std/primitive.f32.html#method.to_be_bytes, then encode the bytes to whatever ascii repr you like. (If nightly-only is a problem, use byteorder for the u8 conversion.) 8 hex digits is not very space efficient, but easily understandable and fast to convert. With base64, 6 bytes would be enough.

Not in this case, since the ASCII bytes don't have to be readable as decimal number notation.

1 Like

Hmm... perhaps I misunderstand the requirement.

So the idea is to take binary values in structs, the most space efficient representation, and store them as JSON, one of the least space efficient ways to store anything. But still worry about a little overhead from the float representation?

All sounds a bit odd to me.

2 Likes

Valid criticism. Maybe I am doing something wrong. Here is my full problem statement:

  1. I am storing a "wiki" of Rust structs. The key & value are both Rust structs.

  2. I am storing this data in a table in Aurora Serverless (My SQL).

  3. The API runs through AWS lambda functions running Rust (which are returning JSON. Binary data would be base64 encoded.)

  4. Many of these structs contain lots of f32's.

  5. Now, why the $^%& am I manually serializing these Rust structs to JSON?

  6. It turns out, over time, I may add/delete fields to these Rust structs. In order to have things degrade gracefully, I am writing my own serialization / de serialization routines. And if I'm writing my own routines, using json as a data format seems easier to debug than some type of "binnary blob" format.

  7. I don't care that the float's are "unreadable", since as long as the field names are correct, it's probably working.

  8. As for why I care about f32 efficiency. According to https://en.wikipedia.org/wiki/Single-precision_floating-point_format , the exponent has 8 bits, which means it can go as low as -127.

Now, consider 2^{-127}. If we dump this out to string, it's either (1) going to be inaccurate due to truncation or (2) going to take 129 chars. (two for "0.", and 127 for the actual decimal places).

Let me try an boil that requirement down further:

  1. You want to use an SQL database. Note the "structured" in SQL.

  2. You want to to store unstructured data in it. Or at least data whose structure is ill defined and flexible.
    I gather from: "I may add/delete fields to these Rust structs"

  3. To that end you want to store those structures as JSON. That is not unreasonable. I'm using JSON in postgres for similar reasons.

So far so good. If I'm correct here. You have to have decided to take the space hit of using JSON in return for the flexibility to cater for future changes in data content.

If all that is clear I'm not sure what the problem is with floats. Because:

a) Perhaps in your application you know that numbers are never going to be bigger than some number of digits and are never going to need more than so many digits after the decimal point for accuracy. You can just chop off a lot of digits after the decimal point. A system I work on uses floats but only in the range 0.00 to 1000.00, I only need ASCII 7 bytes for my 4 byte floats. Of course if that is so why store as floats? Why not use fixed point and not store the "."? Then I would have to only store strings from "0" to "1000000".

b) I think the answer to your question re: 2^{-127} is right there in the question. One would not store that as "129 chars. (two for "0.", and 127 for the actual decimal places)." Why would you not store it as the string "2^{-127}" or the common decimal exponential notation. That is what floats are all about after all.

1 Like

It's not that I want to use Aurora Serverless SQL. My data is not relational.

It's that I don't want to deal with the eventual consistency / read-write provisioning of DynamoDB.

All I really need is a KV store.

I might be reading too much into this. It sounds like you might be wondering about breaking the struct apart, making the fields into columns of a table. This wouldn't quite work, as (1) there are structs + enums, (2) the structs are recursive via Box, (3) for a tree containing 1000 nodes, I want to extract it via 1 read, not 1000 recursive lookups, one per node of the tree.

==> But otherwise, I think we have the same problem in mind.

I have this weird situation, where I'm rendering vector graphics via WebGL. It's this weird situation where most points snap to a "grid" where the (x,y) are all multiples of 2^{-20}, but I'm not storing it as a u32 because "derived" / "intermediate" points may not snap to the grid.

I was referring to what format!("{:?}", 2^{-127}) would do. The core of the question is: format!("{:?}", f32) does not do what I want. What are alternatives? I also wouldn't want to store it as "2^{-127}" -- extra parsing work.

At this point, I think the simplest way is f32 -> bit conversion into u32 -> 4 u8's -> 8 hex's.

Thanks everyone for insights!

OK. I was going to suggest taking those 32 bits and stretching them out to six bytes, 0 to 127 each. A la ASCII.

Not sure about the extra parsing work of exponential notation. You are already parsing JSON, so that's hardly anymore work.

I think you're talking about runtime cost of parsing.

I had "cost of writing code" in mind -- parsing JSON is free as serde gives me a serde::Value. Parsing 2^{-127} is only a few regexes; but annoying to maintain (and also to special case writing out a writer that takes mantissa & exponent separately).

I'm not sure how we got into "2^{-127}". I was thinking of storing your JSON using normal decimal scientific notation:

{
    "x": 2323E14
}

No special parser required.

The issue is that xEy is decimal notation, and numbers of the form 2^{-n} have lots of decimal digits. i.e. 0.5, 0.25, 0.125, 0.625, 0.03125, etc ... so if you expand out "2^{-127}" in xEy notation, x will have lots of digits.

As for why we got the -127, f32 has 8 bits for the exponent, going from -127 to 128.

I get more confused with what you are actually trying to do.

Why the need for binary notation?
What is the number range you want to work over?
What is the precision you require over that range?

Bear in mind that you are using floats. You only have 26 bits of mantissa. Your precision gets less as your numbers get bigger.

You don't need the precise decimal form to uniquely represent a 32-bit float -- just enough that any variance is lost to rounding the mantissa when you convert back.

1 Like

@ZiCog: My problem is already solved. f32 -> u32 -> 8 hex's. I'm explaining why format!("{:?}", f32) doesn't work for serialization since there are f32's that have very long xEy notation representations.

If you don't write out a decimal that is not the precise decimal form -- parsing back the f32 is going to be messy.

zeroexcuses

My problem is already solved. f32 -> u32 -> 8 hex's. I'm explaining why format!("{:?}", f32) doesn't work for serialization since there are f32's that have very long xEy notation representations.

The thing is that an f32 only has 32 bits. So it only requires a few decimal digits to represent any of those 4 billion possible numbers in such a way that you get the same binary bits back in your f32 when you parse the digits back again.

There is never a need to have dozens of digits in the ASCII decimal representation. Traditionally though that does happen with formatting routines. They are needlessly redundant.

For example the number: 1.23e40 may get printed as 12300000000000000000000000000000000000000
or the number 1.23e-40
as 0.000000000000000000000000000000000000000123

I guess that is what you mean by very long representations produced by format!("{:?}", f32). Personally I would argue that is a bug in format!. It is claiming more accuracy than it actually has.

This is why I suggested the ryu formatting library which will produce those short representations above.
https://docs.rs/ryu/1.0.2/ryu/

Having said that I suspect that in your case your 8 hex digit solution will work fine.

Ryu is for normal people who want a human readable format.

Eight hex digits is a human-readable (but not semantically meaningful) format. Radix-64 encoding, IMO, is not, which is why I only suggested it as an alternative class of encodings

but did not propose it.

I'm going to classify that as not human readable. At least not normal humans. Others may disagree.

This number is 123 * 2^{-42} * 5^{-42}, which, due to the 5^{-42} can not be represented by a f32, a f64, or even a fNNN, for arbitrarily large NNN.

In practice, numbers that f32/f64 do represent tend to have lots of non zero digits: For example:

f32 can represent 2^{-127}, which looks like:

format(1./(2**127), ".128f")
'0.00000000000000000000000000000000000000587747175411143753984368268611122838909332778386043760754375853139208629727363586425781250'

even with the best ryu, it's still alot longer than 8 hex.

f64 can represent 2^{-1023}, which gives:

0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011125369292536006915451163586662020321096079902311659152766637084436022174069590979271415795062555102820336698655179055025762170807767300544280061926888594105653889967660011652398050737212918180359607825234712518671041876254033253083290794743602455899842958198242503179543850591524373998904438768749747257902258025254576999282912354093225567689679024960579905428830259962166760571761950743978498047956444458014963207555317331566968317387932565146858810236628158907428321754360614143188210224234057038069557385314008449266220550120807237108092835830752700771425423583764509515806613894483648536865616670434944915875339194234630463869889864293298274705456845477030682337843511993391576453404923086054623126983642578125'

again, ryu unlikely to help here.

The problem is that numbers of the form K * 2{-N}, where K is odd and N is large tends to have lots of non zero digits after the decimal point.

The point is that the standard formatting lies to you. No really.

A float 32 only has 32 bits of information, 4 bytes. So I think you will agree there is something wrong with printing 128 decimal digits, 60 odd of which are non zero. That is claiming a lot more accuracy than is actually there. It's certainly printing a lot more information than was present in the original 4 bytes.

Now, it turns out that for any f32 there is a minimal number of decimal digits (other than all those zeros) that can be used to represent that number. Such that when you parse those decimal digits back to the f32 you get exactly the same bit pattern.

Standard format does not give you that minimal number of digits, it gives you that plus a lot of noise.

The whole point of ryu is to format to that minimal representation.

So taking your example. 1./(2**127) comes out as:

0.00000000000000000000000000000000000000587747175411143753984368268611122838909332778386043760754375853139208629727363586425781250

Using standard format. But only:

5.877472e-39

Using ryu.

Ok that is 12 bytes rather than the original 4 of the f2. But it is not much more than your 8 hex bytes and a lot less than the 128 characters standard format gives.

To reiterate, you do not lose information using the ryu representation.

2 Likes