[Serde-JSON] Numeric precision in the JSON format

Moderator note: split from Saving a complex struct to disk - #32

JSON is especially bad because its number format is underspecified. You can't store NaN or Inf, but you also can't reliably store integers that exceed the precision of a double.

JSON underspecified?

The JSON definition says:

It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999.

Which defines everything about numbers. But the JSON spec also describes the number format in detail, which, as you say, excludes NaN and Inf. We could debate whether that is a good or bad thing; it seems reasonable to me.

What is underspecified here?

Am I missing something? JSON doesn't define the semantics of its numeric types, just the strings that are valid JSON. Or is this specified somewhere else? The problem is what strings like "1e1000" or "9007199254740993" should mean. What about "9007199254740993.0"?
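For a concrete sense of what those strings do when forced through a double, here is a small sketch using only Rust's standard-library `f64` parsing, which rounds to the nearest representable value much like a JSON decoder targeting binary64 would:

```rust
fn main() {
    // 2^53 = 9007199254740992 is the largest n such that every integer
    // up to n is exactly representable in an f64.
    let exact: f64 = "9007199254740992".parse().unwrap();
    let rounded: f64 = "9007199254740993".parse().unwrap();
    // 2^53 + 1 has no exact f64 representation; it rounds back to 2^53.
    assert_eq!(exact, rounded);

    // An out-of-range literal doesn't fail to parse; it saturates to infinity.
    let huge: f64 = "1e1000".parse().unwrap();
    assert!(huge.is_infinite());
}
```

So a decoder that maps all JSON numbers onto `f64` silently conflates `9007199254740992` and `9007199254740993`, which is exactly the ambiguity being asked about.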

1 Like

JSON specifies the textual format of numbers unambiguously, but leaves the limits of range and precision entirely implementation-defined: RFC 8259 - The JavaScript Object Notation (JSON) Data Interchange Format

It's perfectly valid for a JSON encoder to output numbers that are outside the range/precision of f64 (such as integers bigger than 2^53) even though many JSON decoders will not be able to accurately represent them, and it's perfectly valid for a JSON decoder to decode to a format that doesn't use floating point at all and just drop the fractional part of numbers. All they have to do is approximate the number as best they can.
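To illustrate that decoder freedom: the very same digit string survives exactly when decoded into a 64-bit integer, but can only be approximated when decoded into binary64. A minimal std-only Rust sketch:

```rust
fn main() {
    let text = "9007199254740993"; // 2^53 + 1, a perfectly valid JSON number

    // A decoder backed by u64 preserves the value exactly.
    let as_u64: u64 = text.parse().unwrap();
    assert_eq!(as_u64, 9_007_199_254_740_993);

    // A decoder backed by f64 approximates it as best it can.
    let as_f64: f64 = text.parse().unwrap();
    assert_eq!(as_f64, 9_007_199_254_740_992.0); // rounded to the nearest f64
}
```

Both decoders are conforming; they just approximate the number within their own representation's limits.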

2 Likes

That is all true and may be significant depending on what one is doing. However I don't understand:

It's perfectly valid for a JSON encoder to output numbers that are outside the range/precision of f64 (such as integers bigger than 2^53)

The spec says: "It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999." Which implies RFC 8259. Which I take as implying that an encoder that spits out things outside of that range is buggy. And likewise a decoder that accepts numbers outside of that range is buggy.

The spec says: "It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999." Which implies RFC 8259. Which I take as implying that an encoder that spits out things outside of that range is buggy. And likewise a decoder that accepts numbers outside of that range is buggy.

JavaScript also allows number literals to be outside the range that its numeric type can represent, and just represents them as best it can (by rounding them, or saturating to positive/negative infinity). So, a JSON file that contains 1e1000 is still valid JavaScript.

(also until ES2019 changed JavaScript to match, JSON was not actually a subset of JavaScript due to different handling of a few Unicode whitespace characters)

Fair enough, though I think serde_json fails that test!

That's an incorrect deduction. "Based on" doesn't mean "it must have exactly the same semantics in every implementation as JavaScript". "Based on" means "originates from" or "similar to" or "extends".

It's also plainly false that RFC 8259 prohibits numbers outside the range/precision of f64. Quoting from Section 6, "Numbers":

This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available. Note that when such software is used, numbers that are integers and are in the range [-(2**53)+1, (2**53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.

So the RFC recommends that implementations don't emit such extreme values for compatibility, but:

  1. it doesn't prohibit emitting larger or more finely-resolved numbers, and
  2. it doesn't say anything about not being allowed to parse them.

So no, serde_json is not buggy, and nor is any other JSON library that accepts and/or emits arbitrary-precision numbers.

Agreed.
...but leaves the limits of range and precision entirely implementation-defined: RFC 8259 - The JavaScript Object Notation (JSON) Data Interchange Format
Hmm...Where did any RFC enter the picture?

The JSON spec refers to: "JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999": https://www.ecma-international.org/wp-content/uploads/ECMA-262_3rd_edition_december_1999.pdf

Which unambiguously specifies use of IEEE 754:

  4.3.20 Number Type: The type Number is a set of values representing numbers. In ECMAScript, the set of values represents the double-precision 64-bit format IEEE 754 values including the special “Not-a-Number” (NaN) values, positive infinity, and negative infinity.

But as you pointed out, JSON excludes NaNs and infinities, which sounds reasonable to me.

I conclude then that numbers in JSON are fully specified.

Of course JSON numbers may not be suitable for whatever one is doing.

Personally I love the idea of a properly typed protocol. If one is dealing with u8's, i64's, etc., one should be able to define that in the schema. Enter things like protobuf. But a few issues:

  1. Even given those numbers and other types, we are still far away from a rigorous protocol spec. For example, what if I have some number type that can only take values between -10 and 100? The same problem exists with types in most typed languages.

  2. All that faffing around with schemas is a pain. In the extreme, the proto spec can be lost, and then one ends up reverse engineering a binary protocol to make anything interoperable. This happened to me on a project at the old Nokia Networks back in the day.

  3. Debugging is much harder.
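As an illustration of point 1, a range-restricted number type can be sketched as a validating newtype. This is a hypothetical example (the `Bounded` name and its API are illustrative, not from any real schema language); the constraint is enforced at construction time rather than in the type system itself:

```rust
/// A number constrained to the range -10..=100, checked on construction.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounded(i32);

impl Bounded {
    fn new(v: i32) -> Result<Self, String> {
        if (-10..=100).contains(&v) {
            Ok(Bounded(v))
        } else {
            Err(format!("{v} is outside the allowed range -10..=100"))
        }
    }
}

fn main() {
    assert!(Bounded::new(42).is_ok());   // in range: accepted
    assert!(Bounded::new(1000).is_err()); // out of range: rejected
}
```

The point is that such invariants live in runtime checks, not in the schema's type vocabulary, which is exactly the gap described above.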

The JSON spec, ECMA-404, referred to by RFC 8259 and linked at https://www.json.org/json-en.html says

JSON is agnostic about the semantics of numbers. In any programming language, there can be a variety of number types of various capacities and complements, fixed or floating, binary or decimal. That can make interchange between different programming languages difficult. JSON instead offers only the representation of numbers that humans use: a sequence of digits. All programming languages know how to make sense of digit sequences even if they disagree on internal representations. That is enough to allow interchange.

and additionally

JSON was first presented to the world at the JSON.org website in 2001. A definition of the JSON syntax was subsequently published as IETF RFC 4627 in July 2006. ECMA-262, Fifth Edition (2009) included a normative specification of the JSON grammar. This specification, ECMA-404, replaces those earlier definitions of the JSON syntax. Concurrently, the IETF published RFC 7158/7159 and in 2017 RFC 8259 as updates to RFC 4627. The JSON syntax specified by this specification and by RFC 8259 are intended to be identical.

1 Like

Hmm... Here life gets confusing.

The JSON standard page JSON says ECMA-404 The JSON Data Interchange Standard. at the very top. Then in the first paragraph it says: "It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999.".

The former contains that woolly definition of numbers. The latter specifically says IEEE 754.

We have a contradiction.

Personally I would naturally adopt the latter, as the whole idea of JSON is to be the JavaScript Object Notation. So any number that JavaScript can't handle is obviously wrong.

We don't. You are misinterpreting the standard.

2 Likes

JSON stands for JavaScript Object Notation, emphasis on notation. Representing numbers that JavaScript can't handle is perfectly acceptable.

That interpretation requires ECMA-404 and ECMA-262 editions 4 and 5 to be invalid for some reason, in deference to the informal json.org website, which is not itself a specification. I think the more reasonable interpretation is that they are not using "based on" in the sense that you're assuming. It's based on ECMA-262 the same way the English language is based on Middle English, and "That slepen al the night with open yë," is not valid English.

The answer to this is provided at the very top of the page linked by @Heliozoa:

Although the newer JSON syntax specs are absolutely meant to be (somewhat) compatible with ECMA-262, the semantics of JSON were never formally defined. Even now, developers are free to interpret them however they like.

3 Likes

On this topic, I have written a crate that deals with a few shortcomings of serde-json, including numeric precision: GitHub - 01mf02/hifijson: High-fidelity JSON lexer and parser