Forward-compatible design for locale


#1

I’ve resumed work on locale and would like to turn it into something practical, and I could use some help with the design.

The problem is that a locale library should have an application-wide default locale, and that can easily break if multiple versions of the library get compiled in because different libraries depend on different versions of it. So it needs to maintain very good compatibility to avoid that situation.

A localization library can be roughly separated into three layers:

  • Locale database: Most things specific to a locale can be encoded as simple values, most of them strings, with an occasional number or list of numbers or strings: digit grouping, decimal point symbol, symbols for digits, names of months, names of days of the week, currency symbol etc. Plus, of course, the database of messages.
  • Primitive functions: formatting and parsing of numbers, dates and times, and money amounts, plus collation, string transformations and transliterations.
  • Message formatting and message parsing.
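To make the first two layers concrete, here is a minimal sketch, assuming a toy in-memory database. The `Value` enum, the string keys such as `thousands_sep` and `grouping`, and the `format_grouped` function are all hypothetical names chosen for illustration, not an existing API.

```rust
use std::collections::HashMap;

/// Hypothetical value type: a database entry is a string, a small number,
/// or a list of either.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Num(u8),
    StrList(Vec<String>),
    NumList(Vec<u8>),
}

/// Layer 1: a toy locale database keyed by plain strings (assumed key names).
struct Database {
    entries: HashMap<&'static str, Value>,
}

impl Database {
    fn get(&self, key: &str) -> Option<&Value> {
        self.entries.get(key)
    }
}

/// Sample data for an English-like locale.
fn sample_db() -> Database {
    let mut entries = HashMap::new();
    entries.insert("decimal_point", Value::Str(".".into()));
    entries.insert("thousands_sep", Value::Str(",".into()));
    entries.insert("grouping", Value::Num(3));
    entries.insert(
        "month_abbr",
        Value::StrList(vec!["Jan".into(), "Feb".into(), "Mar".into()]),
    );
    Database { entries }
}

/// Layer 2: a primitive function that formats an unsigned integer using the
/// separator and group size read from the database.
fn format_grouped(db: &Database, n: u64) -> String {
    // Fall back to no separator and groups of three if entries are missing.
    let sep = match db.get("thousands_sep") {
        Some(Value::Str(s)) => s.as_str(),
        _ => "",
    };
    let size = match db.get("grouping") {
        Some(Value::Num(g)) if *g > 0 => *g as usize,
        _ => 3,
    };
    let digits = n.to_string();
    let mut out = String::new();
    for (i, ch) in digits.chars().enumerate() {
        // Insert a separator whenever a whole number of groups remains.
        let remaining = digits.len() - i;
        if i > 0 && remaining % size == 0 {
            out.push_str(sep);
        }
        out.push(ch);
    }
    out
}
```

With this sample data, `format_grouped(&sample_db(), 1234567)` yields `1,234,567`. Real locales can use uneven group sizes, which a single number cannot express; that is one reason the value type also allows lists of numbers.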

So I am wondering whether, and where, this can be split into separate crates to minimize the API surface that must maintain strict compatibility. I see these options:

  1. Above the database.

    • This is similar to the C approach. POSIX has the nl_langinfo function that returns various values (as char *, and you have to know whether they are strings, arrays of strings, 8-bit numbers or arrays of 8-bit numbers), and then the various functions like printf, strftime or strcoll use that data.
    • The database can have a really simple API: a single get function, taking either:
      • an enum with specifications like DecimalPoint (= "."), Digit(5) (= "٥" in Arabic), Month(Gregorian, 2, Full) (= "March"; 0-based counting), and a generic Message with domain, context and original,
      • or even just the triple domain, context, original, with suitable constants provided for the special items like decimal point, digits or month names.
    • However, it basically fixes the algorithms built on top of it. Collation, rule-based number formatting (3151 → “three thousand one hundred fifty-one”), transliteration and such can be implemented using data the database returns as strings, but the algorithm itself can then no longer change. It also precludes using the implementations in the C library on systems where they are provided.
  2. Above the basic algorithms.

    • This is similar to the C++ std::locale with its facets.
    • This allows utilizing the algorithms from the standard library where available.
    • On the other hand, it has a larger surface for compatibility. We can add methods, but if, for example, we initially have month names supporting just the Gregorian calendar, adding month names for other calendars will require a new method, and the old method will have to be kept around indefinitely for compatibility.
  3. Between the interface and implementation of the basic algorithms.

    • Now that I have written the above two down, I realize that I can have the locale library define the interfaces to the facets, with as small a surface as possible, but not implement them. Instead, it will call into another library that provides the implementation. The facet definitions still have to remain backward compatible, but compatibility of the implementing crate can be broken, and there can be alternate implementations, because the libraries that use localization won’t depend on it directly. Only the application will, if it requires some specific support.
    • The message formatting can be a separate layer on top of this in either case.
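Option 3 could look roughly like the sketch below. It is only an illustration under assumptions: the `NumberFacet` trait, `set_default`, `current`, `ArabicBackend` and `format_u32` are hypothetical names, and everything is shown in one file although the trait and the backend would live in separate crates. The application-wide default uses `std::sync::OnceLock` from the standard library.

```rust
use std::sync::OnceLock;

// --- Interface crate: the small, stable surface that libraries depend on ---

/// Hypothetical facet interface; only this trait needs strict compatibility.
pub trait NumberFacet: Send + Sync {
    fn decimal_point(&self) -> &str;
    /// Symbol for a single digit (the value is taken modulo 10).
    fn digit(&self, value: u8) -> &str;
}

/// Locale-independent fallback used until the application installs a backend.
pub struct Invariant;

impl NumberFacet for Invariant {
    fn decimal_point(&self) -> &str {
        "."
    }
    fn digit(&self, value: u8) -> &str {
        const DIGITS: [&str; 10] = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"];
        DIGITS[usize::from(value % 10)]
    }
}

static INVARIANT: Invariant = Invariant;

/// Application-wide default, set at most once.
static DEFAULT: OnceLock<Box<dyn NumberFacet>> = OnceLock::new();

/// Called by the application; returns the facet back if one is already set.
pub fn set_default(facet: Box<dyn NumberFacet>) -> Result<(), Box<dyn NumberFacet>> {
    DEFAULT.set(facet)
}

/// Called by libraries; never names a concrete backend.
pub fn current() -> &'static dyn NumberFacet {
    match DEFAULT.get() {
        Some(facet) => &**facet,
        None => &INVARIANT,
    }
}

// --- Implementation crate: free to break compatibility or be replaced ---

/// Illustrative backend using Arabic-Indic digits.
pub struct ArabicBackend;

impl NumberFacet for ArabicBackend {
    fn decimal_point(&self) -> &str {
        "٫"
    }
    fn digit(&self, value: u8) -> &str {
        const DIGITS: [&str; 10] = ["٠", "١", "٢", "٣", "٤", "٥", "٦", "٧", "٨", "٩"];
        DIGITS[usize::from(value % 10)]
    }
}

// --- A library that uses localization: depends on the interface only ---

pub fn format_u32(facet: &dyn NumberFacet, n: u32) -> String {
    // Map each ASCII digit of the decimal representation to the locale's symbol.
    n.to_string().bytes().map(|b| facet.digit(b - b'0')).collect()
}
```

Here only the application would call `set_default(Box::new(ArabicBackend))`; libraries call `current()` and see just the trait. The compatibility burden stays on the trait definition, which is the trade-off described in option 3.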

Writing things down clearly does help in seeing which solution is better. I am still interested in whether anybody can see some other gotchas, though.