A Linux user reports that my program does not recognize the value of his LANG environment variable. He has
LANG="fr_FR@euro"
I'm using the cross-platform crates sys-locale and oxilangtag to handle locales. The goal is to get the language "fr" out, for which I have translations. But oxilangtag neither recognizes nor removes the trailing @euro, and rejects this locale.
The oxilangtag docs say it follows RFC 5646, which does not support "@" suffixes. "@euro" was a pre-UTF-8 hack to switch the character set to ISO/IEC 8859-15, a modification of Latin-1 that added the € symbol in the upper code page. (Remember upper code pages?) So "@euro" is now deprecated. I think.
But apparently some versions of Linux still set it.
Searching for @euro locales turns up decades-old references.
I don’t know that Microsoft is referring to the correct POSIX specification. In any case, the POSIX specifications are more or less the same as the “Open Group Base Specifications”, so you can read the definition of how the LANG and related environment variables should be interpreted at https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap08.html. In this case, the format is
[language[_territory][.codeset][@modifier]]
although this is tagged with “XSI”, so it’s actually part of UNIX and not specified by POSIX.
Technically this spec doesn’t allow @modifier on the LANG environment variable. However, sys-locale doesn’t remove it from the LC_CTYPE environment variable either, so I think it’s the crate which is not behaving correctly.
As far as I can tell, you should also be using LC_MESSAGES, not LC_CTYPE, to determine the language for translations.
Everyone on Linux should be using UTF-8 locales these days; that is really the correct fix here. Not using UTF-8 locales is likely to break a bunch of other things.
If the user is using some old program that can't handle that, a tip could be to just set the legacy encoding for those programs. For files you can convert using the command line program iconv.
Even if the Rust crate is updated to handle this properly, there will potentially be issues, since Rust requires UTF-8 for its strings. Also, if you just want translations you should be looking at LC_MESSAGES, not LANG. POSIX supports mixing locales, and this is something people use: I use English text and Swedish numeric/date formatting, for example. The basic idea is that:
LC_ALL should be used if set
If not set use the appropriate LC_MESSAGES/LC_NUMERIC/LC_TIME/...
And if that is not set, fall back to LANG.
Many programs get this wrong and don't handle mixed locales properly.
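The resolution order above can be sketched in Rust roughly as follows. This is only an illustration, not what sys-locale does: resolve_category and its lookup parameter are names made up here, and the lookup hook exists only so the logic can be exercised without touching the real process environment. It treats a variable that is set but empty as unset, which is how I read section 8.2 of the Open Group spec.

```rust
use std::env;

/// Resolve the locale string for one category (for example "LC_MESSAGES"),
/// trying LC_ALL first, then the category itself, then LANG.
/// `lookup` is a hypothetical hook that returns the value of a variable.
fn resolve_category<F>(category: &str, lookup: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    for name in ["LC_ALL", category, "LANG"] {
        if let Some(value) = lookup(name) {
            // A variable that is set to the empty string counts as unset.
            if !value.is_empty() {
                return Some(value);
            }
        }
    }
    None
}

/// Convenience wrapper that reads the real environment.
fn messages_locale_from_env() -> Option<String> {
    resolve_category("LC_MESSAGES", |name| env::var(name).ok())
}
```

Calling resolve_category once per category ("LC_MESSAGES", "LC_NUMERIC", "LC_TIME", ...) is what lets a mixed setup such as English messages with Swedish date formatting come out right.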
Having read a bit more, I believe @euro also selects a locale where currency_symbol is €, rather than Fr, δρ., ¤ or something else. This is orthogonal to whether the locale is fr_FR.ISO-8859-1, fr_FR.ISO-8859-15 or fr_FR.UTF-8; and also meaningless for similar locales like fr_CA, where the currency symbol is always $ but which could use ISO-8859-1 or UTF-8.
I guess the user could still be on a system that uses Fr as its currency symbol for just fr_FR; or maybe they’ve had it set that way for 20 years and it doesn’t need to be that way any more. It’s still a valid value, anyway.
Is that something the sys-locale crate should be doing? I'm trying to stay with standard cross-platform crates here.
The problem is finding something that's definitely the right answer, so that sys-locale can safely do it and programs can rely on that. Sources differ on what the right answer is.
It looks like it is common for LC_ALL and LANGUAGE to be present but set to an empty string. LANGUAGE can be a list of languages, reflecting the user's preferences. LC_ALL=C is a thing, and apparently means to sort text in byte order, ignoring language collation sequences. There's a lot of lore around this, and it's not consistent.
I'm not actually familiar with LANGUAGE as opposed to LANG. The definitive source of truth should be POSIX, though I know glibc has more LC_* variables than POSIX mandates. For example, the list of categories on the POSIX Locale page is shorter than the list I get from running locale on my Arch Linux system. That page seems to be about defining locales.
See Environment Variables (section 8.2 specifically a bit down the page) for resolution order (and many other things).
And here is where I find the first mention of LANGUAGE: Locale Categories (The GNU C Library) (at the bottom of the page). It is apparently a GNU gettext extension. So make of that what you will.
I was not able to quickly find anything on musl, *BSD, etc., which are also POSIX but not glibc.
Addendum: LC_ALL=C (or C.UTF-8) means to use POSIX ordering/language/etc., and to force it. This can be useful in shell scripts where you don't want the user's locale to mess up output that you are parsing in the script, or a sorting order you are relying on. The C locale in particular sorts upper case before lower case (for ASCII, not sure how it works in general).
To get something like a BCP-47 language tag, sys-locale should be stripping off both the .codeset and the @modifier, as described in the Open Group spec, and replacing the _ with - if it’s there. That will cover most cases.
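As a minimal sketch of that normalization, using only the standard library: the function name posix_to_language_tag is made up for illustration and is not part of sys-locale. Returning None for the "C" and "POSIX" locales is my own assumption, on the grounds that they don't name a language.

```rust
/// Turn a POSIX locale string such as "fr_FR.ISO-8859-15@euro" into a
/// BCP 47-style tag such as "fr-FR". Returns None for empty input and for
/// the "C" / "POSIX" locales (an assumption of this sketch).
fn posix_to_language_tag(locale: &str) -> Option<String> {
    // Drop "@modifier" first, then ".codeset".
    let without_modifier = locale.split('@').next().unwrap_or("");
    let base = without_modifier.split('.').next().unwrap_or("");
    if base.is_empty() || base == "C" || base == "POSIX" {
        return None;
    }
    // language_TERRITORY becomes language-TERRITORY.
    Some(base.replace('_', "-"))
}
```

One known limitation: a modifier that carries a script, as in sr_RS@latin, is simply thrown away here, so this covers "most cases" rather than all of them. The result is also not validated, so it would still need to go through a parser such as oxilangtag.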
Assuming that sys-locale is intending to get the language tag to use for translations rather than case-folding, it should be looking at LC_MESSAGES instead of LC_CTYPE. To implement get_locales(), it could potentially use the GNU gettext environment variable LANGUAGE, but that seems like it might be a bit surprising to users.
Looking at the API of sys-locale it seems a bit lacking/misdesigned. There is no support for mixed locales (e.g. English messages, but Swedish calendar / date / time formatting). This is fairly commonly done in countries like Sweden, with high average English proficiency and a relatively small number of users speaking the language leading to subpar translations (plus googling for errors in English is way more likely to be useful).
I use this, and I want to get 24h time, Swedish alphabetical sorting, but still use English messages. So I would consider the current API of sys-locale to be unworkable since it simplifies too much.
For what it’s worth, sys-locale doesn’t necessarily do the right thing on Windows, either. My Windows installation is set to English (New Zealand) in most of the places it can be set, but sys-locale returns "en-GB". To get the language tag for my Windows display language, it should use GetUserDefaultLocaleName, which returns "en-NZ".
(Before I looked into these APIs I thought the Windows situation might somehow be better than Linux.)
I have a vague memory that on Windows you can also configure date formatting / first day of week separately from other language settings. So there is that too.
If a locale library doesn’t do the right thing on one platform, it’s unlikely to do it on others either. Either the author hasn’t even realized that locales have several orthogonal aspects, or has just ignored that to keep things simple.
For the time being, I'm taking whatever sys-locale gives me, removing @ and any following characters, then feeding that into oxilangtag.
This is adequate for all the languages for which I have translations.
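Roughly, the workaround looks like the sketch below. To keep it standard-library only, the oxilangtag parsing step is left out and replaced by a plain split on the first separator. SUPPORTED is a placeholder list rather than my real set of translations, and the input is assumed to be something like "fr-FR@euro" or "fr_FR@euro".

```rust
/// Placeholder list of languages that have translations (hypothetical).
const SUPPORTED: [&str; 3] = ["en", "fr", "de"];

/// Remove "@" and everything after it.
fn strip_modifier(tag: &str) -> &str {
    tag.split('@').next().unwrap_or(tag)
}

/// Pick a supported translation from the primary language subtag, if any.
fn pick_translation(tag: &str) -> Option<&'static str> {
    let cleaned = strip_modifier(tag);
    // The primary language is everything before the first '-' or '_'.
    let primary = cleaned
        .split(|c: char| c == '-' || c == '_')
        .next()
        .unwrap_or("")
        .to_ascii_lowercase();
    SUPPORTED
        .iter()
        .copied()
        .find(|lang| *lang == primary.as_str())
}
```

Matching on the primary language only is a deliberate simplification: it ignores the region, so fr-FR and fr-CA both land on the same French translation.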
What about other parts of formatting such as decimal comma vs decimal period? Or thousands separator (if any)? For example, Swedish uses space as a thousands separator, and comma as the decimal separator.