Sort order for JSON


#1

I am writing a proposal to define sort order for JSON values. The
initial draft can be found here.

There are particular challenges I am facing.

  1. Floating point don’t have total ordering.
  2. Although Integer and Floating point are different types, even the combined set of integers and floating-point values have partial ordering.
  3. String order, in the context of unicode collation.

Feedbacks and comments much appreciated.

A working implementation of this proposal is available in jsondata package.

Thanks,


#2

Also, UTF8-encoding introduces other complexities.

Comparing the bytes of UTF-8 works out the same as comparing codepoints, conveniently. So I’d suggest defining the ordering that way.

Note that length of the array shall not be considered by the comparator logic.

As a nitpick, that statement is a bit weird considering the "[] sort before [null]", where there isn’t a “counterpart” against which to compare.

floating point numbers

You should explicitly mention negative zeros, and whether you’re making a total order or not.

Minbound & Maxbound

I’m not sure you need to include these in JSON; you can have them be external, like Bound::Unbounded.


#3

Comparing the bytes of UTF-8 works out the same as comparing codepoints, conveniently. So I’d suggest defining the ordering that way.

I think this is true most of the times, but there seem to be corner cases described by,

Many people expect the characters in their language to be in the "correct" order in the Unicode code charts. Because collation varies by language and not just by script, it is not possible to arrange the encoding for characters so that simple binary string comparison produces the desired collation order for all languages

a bit weird considering the "[] sort before [null]", where there isn’t a “counterpart” against which to compare.

Good point. I will add this detail to the proposal:

  • Length of the array shall not be considered by the comparator logic.
  • And empty array shall sort before non empty arrays.

You should explicitly mention negative zeros, and whether you’re making a total order or not.

Yes the idea is to make a total ordering for the combined set of Integer and Float.

  • f64 values that are <= -2^127 will sort before all i128 integers.
  • f64 values that are >= 2^127-1 will sort after all i128 integers.
  • NaN, Not a Number values shall sort after all i128 integers.
  • -Infinity shall sort before all numbers.
  • +Infinity shall sort after all numbers.
  • NaN shall sort after +Infinity.

Additionally I will amend proposal with

  • f64 Negative ZERO shall short before f64 Positive ZERO.

I’m not sure you need to include these in JSON; you can have them be external, like Bound::Unbounded.

Excellent. To begin with, unboundedness are never part of any sorted set
of values. They just come from query. Will amend the proposal and code.


#4

Capturing it in git-issue: https://github.com/bnclabs/jsondata/issues/9


#5

I think that just changes the uncertain case to [null]-vs-[null,null]. I’d expect a rule like “if the comparison runs out of items, the side the ran out of items sorts before the side that still has more”.


#6

Well phrased. This is exactly the implementation in jsondata does. Thanks.


#7

The IEEE-754 spec includes an official total order for floats.


Notice that, since JSON is a text format, it cannot distinguish between different NaN payloads or different representations of the same subnormal value, and so much of the complexity of that spec can be eliminated.

…then again, JSON doesn’t even have NaN or Infinity, so that same argument could be taken even further…


Edit: I incorrectly thought denormals had degenerate representations. I don’t know what that page is even talking about with having multiple representations for the same finite value.


#8

Thanks for pointing me to total-ordering for floats.

Still I don’t understand how same floating point value can have different representation in IEEE-754 format.
The total ordering section has this to say-

different representations of the same floating point number are ordered by their exponent multiplied by the sign bit.

If we are talking about the possibility of different representation in JSON text format for the same value, may be we can simply state that - “parse floating point string to 64-bit IEEE-754 format and proceed to comparison”.

As far as representing -Infinity, +Infinity, -NaN, +NaN in JSON format, we can always extend the specification, like JSON5 which is backward compatible with JSON, if it make sense to have them in a document database.

And finally we can arrive at:

  • -Infinity shall sort before all numbers.
  • -qNaN negative quite-NaN shall sort after -Infinity.
  • -sNaN positive signalling-NaN shall sort after -qNaN.
  • normal and sub-normal shall sort after -sNaN.
  • +sNaN signalling-NaN shall sort after normal numbers.
  • +qNaN quite-NaN shall sort after +sNaN.
  • +Infinity shall sort after all numbers.
  • -0.0 and +0.0 shall be treated as equal.
  • Different representation of floating point numbers are usually
    normalised in IEEE-754 format.
  • Ordering between sNaN and qNaN in the same class being
    based on the integer payload, multiplied by the sign bit.

Ordering between Integer and floating-point:

  • Floating point value shall be converted to integer and compared.
  • Floating point values that are <= -2^127 will sort before all i128 integers.
  • Floating point values that are >= 2^127-1 will sort after all i128 integers.

Are the above rules enough for total-order for all numbers ? Please share your thoughts.