The internationalization of Rust itself

What conclusion do you draw from this story?

One may conclude that PHP doesn't provide good internationalization of its programming interface, while still using random locales to name internal stuff that blows up in the face of the end user. Well, actually, even when it sticks to English, PHP has such a large set of completely WTF random inconsistencies in its API that it would be hard to make anything worse on purpose.

I just started a wiki research project on the topic. Any help is welcome. Even adding some new references would be very cool.

To answer @matthieum's question, I would say "all of the above, and possibly more, but each topic might be considered as an independent target".

  • Having the ability to translate keywords is nice, but it would probably be completely useless if that were really the only thing translatable. Nice to have, especially as it would allow more homogeneous interfaces, but really not the most crucial helper.
    • even if you go with allowing keyword translation, there are still questions about which primitives can be bound to which tokens, and how you deal with lexical collisions. Namespacing a chunk of code to a single locale is easy.
    • some languages do allow rather extensive translexicalisation, but not of completely everything; for example C has #define, which makes it easy to translexicalise many things, but not #define itself
  • Unicode identifiers are a must-have. The only discussion here should be which set of characters is explicitly not allowed. For example, last time I checked, Coq let non-breaking spaces be used as part of identifiers. Plus there are many tricky Unicode things, same-looking glyphs and all, but the open issue on this topic seems to be on the right track on these points.
  • The main part is generally the API. So what matters is the ability to translate identifiers without integration or performance penalty. Facilities should therefore target an alias feature, rather than a "wrapping everything in functions" ugly hack.
  • In the "possibly more", you might include allowing more freedom in syntactic and inflectional input. As far as I know, Perl has been ahead of anything else on this front. But I personally wouldn't advocate strongly for this kind of facility, especially for a language like
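To make the alias point from the list above a bit more concrete: Rust already offers cost-free renaming through `use ... as ...`, which could serve as a model for such a facility. The following is only a minimal sketch; the French names (`TableDeHachage`, `Chaine`, `compter_mots`) are made up for illustration and do not belong to any existing localisation.

```rust
// Hypothetical French aliases for standard library items.
// `use ... as ...` only introduces another name for the same item,
// so no wrapper function and no runtime cost is involved.
use std::collections::HashMap as TableDeHachage;
use std::string::String as Chaine;

/// Count how often each word occurs, using only the aliased type names.
fn compter_mots(texte: &str) -> TableDeHachage<Chaine, usize> {
    let mut compte = TableDeHachage::new();
    for mot in texte.split_whitespace() {
        *compte.entry(Chaine::from(mot)).or_insert(0) += 1;
    }
    compte
}

fn main() {
    let compte = compter_mots("a b a");
    println!("{:?}", compte.get("a"));
}
```

Note the limit of this approach: method names such as `entry` or `or_insert` stay in English, because `use` can rename items but not the methods attached to them.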

Then you have the question of how you migrate from one locale to another. An option like cargo fmt has already been suggested, and this seems to be just what Koro is doing with its equivalent of gofmt. With such a feature it seems trivial to transform/copy/paste code snippets to whichever locale is used by a targeted forum/Q&A website, as this was mentioned as a concern.

As I stated in the post you quoted, the point I was trying to demonstrate wasn't that
"I speak English and I find it helpful that online help is in English too."
but rather that
"Programmer A wants to find help for a problem they're facing."
"Programmer B has already provided an internet resource which helps A with their problem."
"A and B speak different native languages."
"That doesn't matter because the programming language and error messages use one spoken language, meaning that A can search for what they need and B can write what they know and Google can connect the dots."

The common language happens to be English, yes that is fortunate for me as a native speaker but what I find more valuable is that since the error codes and problem ideas and concepts I am searching for all use one spoken language, I can find the resources I need with a search engine.

Take this forum for example, we are able to communicate here because we are using a common language to communicate our ideas. If people just opened localised topics here it would be cutting anyone who is not a native speaker out of the conversation.

But let's take this away from looking at one person or one language, spoken or code.
Many languages share common keywords which do the same thing:

  • let in Rust, JS and Haskell all declare variables.
  • map in Rust, JS and Haskell all map elements of a collection of items from one value/type to another.
  • fold in Rust and Haskell all reduce a collection of elements into a single value.
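For readers who have not met these names yet, here is roughly what they look like on the Rust side. This is just a small sketch; the function name `double_then_sum` is made up for the example.

```rust
/// Double every element, then sum the doubled values.
fn double_then_sum(values: &[i32]) -> (Vec<i32>, i32) {
    // `let` declares a variable; `map` turns each element into a new value.
    let doubled: Vec<i32> = values.iter().map(|x| x * 2).collect();
    // `fold` reduces the whole collection to a single value.
    let total = doubled.iter().fold(0, |acc, x| acc + x);
    (doubled, total)
}

fn main() {
    let (doubled, total) = double_then_sum(&[1, 2, 3]);
    println!("{:?} {}", doubled, total);
}
```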

The point I'm trying to make here is that each language having not only a single spoken "root language", if you will, but a common spoken "root language" means that learning the keywords and type names in one language better positions you to quickly learn the keywords and type names in another language.

I'm not saying that internationalising keywords and type names would not make entering Rust or any other language as a non-English speaker easier, it would.
What I am saying is that internationalising keywords and type names would reduce any language's ability to do what it is designed to do, communicate an idea between a human and a computer or two humans, by muddying the waters.

common keywords such as let and map everyone understands. But what about more complex stuff like function names?

Only if A and B both speak English, in which case internationalization won't prevent that from happening. And if neither A nor B speaks English, they will still be able to help each other if they have a common language, for example Bengali.

So it opens the way for other language communities to flourish where otherwise possibly nothing would come out. Indeed, the alternative for A and B is not going through an English channel to reach the solution, but most likely not finding the existing solution because it is expressed in a language they don't know.

And with internationalization, as a bonus, you might have C joining the party, who speaks Bengali like A and B but also knows English (or another language). When D, who doesn't speak Bengali, comes to ask a question that has not been answered in English (or elsewhere), C might find the resources published by A and B and then translate them for D. The alternative scenario, with a lack of internationalization, is that no solution was ever produced by A and B, so neither C nor D can benefit from it.

Take this forum for example, we are able to communicate here because we are using a common language to communicate our ideas. If people just opened localised topics here it would be cutting anyone who is not a native speaker out of the conversation.

No, it would only be cutting anyone who is not a speaker of the language used out of the conversation, while letting those who don't speak English still have possibly fruitful exchanges. Other forums exclusively use other languages to communicate ideas through exchanges that wouldn't happen otherwise. One might guess that this includes some forums about Rust. :slight_smile:

internationalising keywords and type names would reduce any language's ability to do what it is designed to do, communicate an idea between a human and a computer or two humans, by muddying the waters.

How? How would it be different from having internationalization for other software interfaces? The availability of localized versions changes nothing about the fact that a person can use the English localization and English-related resources.

For information, Stack Overflow has a few internationalized versions:

So my take away is that language-specific communities already exist to some extent -- there are probably communities elsewhere as well that I didn't find with a quick search.

Enabling these users who don't speak English to write Rust programs in their native languages would be beneficial to everybody if it's easy to convert such programs back to English. The video about Koro shows how korofmt can turn a Bengali "Hello World" program into an English version automatically.

Given that Rust has great support for Unicode, a first step could be to try to do to rustc what the Koro project did to the Go compiler. If it's coupled with good back-and-forth translation capabilities, then such a tool would be much less of a hack than simply forking the language would be.
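To give an idea of what back-and-forth translation could look like at its very simplest, here is a word-level sketch. It is not how korofmt works (I have not looked at its internals); the function names and the French keyword table are assumptions made up for this example.

```rust
use std::collections::HashMap;

/// Hypothetical keyword table, for illustration only.
fn french_table() -> HashMap<&'static str, &'static str> {
    [("fn", "fonction"), ("let", "soit")].iter().copied().collect()
}

/// Swap keys and values so the same routine can translate back.
fn invert<'a>(table: &HashMap<&'a str, &'a str>) -> HashMap<&'a str, &'a str> {
    table.iter().map(|(k, v)| (*v, *k)).collect()
}

/// Replace every whole word found in `table` with its translation.
/// Words are maximal runs of alphanumeric characters or `_`.
/// This naive sketch also rewrites words inside string literals and
/// comments, which a tool built on the compiler's lexer would skip.
fn translate_words(src: &str, table: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    let mut word = String::new();
    for c in src.chars() {
        if c.is_alphanumeric() || c == '_' {
            word.push(c);
        } else {
            flush(&mut word, &mut out, table);
            out.push(c);
        }
    }
    flush(&mut word, &mut out, table);
    out
}

/// Append the pending word (translated if the table knows it) and clear it.
fn flush(word: &mut String, out: &mut String, table: &HashMap<&str, &str>) {
    if !word.is_empty() {
        out.push_str(table.get(word.as_str()).copied().unwrap_or(word.as_str()));
        word.clear();
    }
}

fn main() {
    let table = french_table();
    let french = translate_words("fn main() { let x = 1; }", &table);
    println!("{}", french);
    println!("{}", translate_words(&french, &invert(&table)));
}
```

The round trip only holds as long as the program does not already use a translated word such as `soit` as an ordinary identifier, which is one of the lexical collisions a real tool would have to detect.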

@psychoslave, one question I'd like you to cover is where do you want the internationalization to stop. What's the line you're not crossing with programming language localization?

For example, in Russian the standard decimal separator is comma, not period. Do you expect the "Rust in Russian" to use comma as a decimal separator? What will happen to vec![0.3], vec![0,3] and vec![0;3], all of which are legal Rust expressions with different values? If any of these do change with the language switch, are there any other punctuation changes? What about the semicolon as a statement terminator, or comma in a struct field list?
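For reference, this is how the three expressions evaluate in today's Rust. A minimal sketch; the function name `three_vecs` is only illustrative, and `vec![0, 3]` is the same expression as `vec![0,3]` with conventional spacing.

```rust
/// Build the three vectors from the question so they can be compared.
fn three_vecs() -> (Vec<f64>, Vec<i32>, Vec<i32>) {
    let one_float = vec![0.3]; // a single floating-point element
    let two_ints = vec![0, 3]; // two integer elements: 0 and 3
    let repeated = vec![0; 3]; // the element 0 repeated three times
    (one_float, two_ints, repeated)
}

fn main() {
    let (a, b, c) = three_vecs();
    println!("{:?} {:?} {:?}", a, b, c);
}
```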

Next, what about characters that are not available in the national standard keyboard layouts? For example, one reason I'd not consider using Russian in Rust, even just for identifiers and keywords, is that the standard keyboard layout for Russian does not include [] and {} (and probably <>, unless you have a special key for them) at all. Oh, and also &, ', #, @ and $. And, IIRC, at least one of /, \ and |.
Do you expect the "Rust in Russian" to be compatible with standard Rust and require constant layout switching, or to invent some replacements? The number of replacements will be a problem, however: Russian has more letters and fewer special characters in the overall layout.

Finally, what about global changes?
Should "Rust in Arabic" be written right-to-left? Does this mean that { and } are used the other way round?
Should "Rust in Mongolian script", if it is ever desired, use vertical lines as the Mongolian script itself does?

P.S. Also, I'd like to explicitly mention that, in my opinion, the "international Rust" label shall always mean "Rust in English" unless some very radical changes happen in the world. You've mentioned Esperanto in this thread and I really hope you don't mean splitting the international Rust community into "international in English" and "international in Esperanto".

It seems to me that there are five real issues here:

  1. Internationalization of Rust's defined keywords, attribute names, etc.
  2. Internationalization of identifiers
  3. Internationalization of error messages
  4. Internationalization of comments and other documentation
  5. Internationalization of numbers, dates, times, etc., both in compiler input and output and within the runtime

Others in this thread have made the argument that 1) would make open-source programming much more difficult when learning to program, when interacting with others, and when attempting to reuse the work of others. I personally agree with all those concerns. It would also complicate the use of macros, which usually define symbols in the macro-programmer's language. The hundreds of macros within the compiler that define what most programmers consider to be Rust's built-in aspects would be particularly problematic.

2) should be a fairly straightforward change to the compiler. In this the Go language shows the way. See The Go Programming Language Specification and the following "Letters and digits" and "Identifiers" sections.

3) is a significant challenge, due to the number of different error messages in the compiler, the language-dependent order of the variable arguments that appear in the error messages, and the language-specific aspects such as gender, number, inflection, etc. that some languages require. The current nightly build contains 405 distinct format!( ) macros in 103 files that appear to report errors, most of which would have to be generalized to invoke an internationalization module that would in turn be able to load templates for any supported alternative language. Such translation also must address any required reordering of the variable content in each message (e.g., the three parameters of an English-language message might need to occur in a different order in the second language).

It should be pointed out that the biggest challenge is not in the initial translation, but in the maintenance across compiler changes. A significant pool of language translators would need to be retained to address new or changed error messages for each stable Rust build, with that effort appearing first in the many nightly builds.

4) Presumably comments can already be written in a language of the programmer's choice. Such use is probably sufficient for non-Markdown comments. For Markdown comments, a means of invoking replacement strings or comment blocks found in a language-specific internationalization module (or section) might be appropriate. As is always the case, the burden here would fall on those individuals providing the translation to a given target language.

5a) The decimal point '.' in non-integer numbers is somewhat problematic, since most of the world uses a comma ',' to separate integral and fractional parts of fixed-point and floating-point numbers. Those of us who read international standards are used to seeing both forms of separator. Both period and comma can be used as a fraction separator in numbers provided that the separator character is surrounded by digits in all such uses, and that use of either character elsewhere in the syntax is required to always avoid separating an integer and an adjacent digit (e.g., by inclusion of whitespace). Because a macro using one separator can be invoked by someone using the other separator, it seems probable that both should be treated equivalently.
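The "surrounded by digits" rule can be sketched as a small text pass. This is only an illustration under simplifying assumptions, not a proposal for the actual lexer; the function name `normalize_decimal_commas` is hypothetical.

```rust
/// Rewrite a comma that sits directly between two ASCII digits as a period.
/// Any other comma (for example one followed by whitespace) is left alone.
fn normalize_decimal_commas(src: &str) -> String {
    let chars: Vec<char> = src.chars().collect();
    let mut out = String::with_capacity(src.len());
    for (i, &c) in chars.iter().enumerate() {
        let between_digits = c == ','
            && i > 0
            && i + 1 < chars.len()
            && chars[i - 1].is_ascii_digit()
            && chars[i + 1].is_ascii_digit();
        out.push(if between_digits { '.' } else { c });
    }
    out
}

fn main() {
    println!("{}", normalize_decimal_commas("vec![0,3]"));
    println!("{}", normalize_decimal_commas("vec![0, 3]"));
}
```

The sketch also shows the cost of the rule: `vec![0,3]`, which is a two-element vector today, would become a one-element vector, so existing code would need the whitespace added. A character-level pass like this one also cannot tell that the digit in `a1,2` belongs to an identifier; a real lexer would have to.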

5b) Many languages have non-Arabic characters to represent the digits of numbers, sometimes in a radix other than radix 10. Such support does not seem essential for the internationalization of Rust. Restricting Rust to use only the Arabic numeral characters 0 to 9 avoids the often-encountered problem in other languages of the same character being used in numbers and in other words. For example the Chinese numeral ‘一’ (Pinyin yī), meaning one, occurs as the initial character in many Hanzi words, as well as in numbers, making disambiguation of numbers from identifiers potentially difficult.

As a point of reference, although the Go language permits use of non-Arabic digits in identifiers, it does not permit them in numbers.
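Rust's standard library already draws the relevant distinctions at the character level, which the following sketch prints out. The helper name `classify` is made up for the example; the characters are written as escapes to avoid rendering problems.

```rust
/// Report how the standard library classifies a character:
/// (is an ASCII digit, is numeric per Unicode, is alphabetic per Unicode).
fn classify(c: char) -> (bool, bool, bool) {
    (c.is_ascii_digit(), c.is_numeric(), c.is_alphabetic())
}

fn main() {
    // '7', ARABIC-INDIC DIGIT THREE (U+0663), and the Chinese numeral
    // for one (U+4E00).
    for c in ['7', '\u{0663}', '\u{4E00}'] {
        println!("U+{:04X} -> {:?}", c as u32, classify(c));
    }
}
```

The Arabic-Indic digit counts as numeric but not as an ASCII digit, while the Chinese numeral is classified as alphabetic rather than numeric, which is exactly the kind of ambiguity described in 5b).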

5c) Other internationalization requirements are similar to those in various operating systems and generally should use APIs of the host or target system in determining their behavior.

Hi thank you for your interest,

What’s the line you’re not crossing with programming language localization?

This question is obviously both too broad in topic (the whole set of potentially localisable material) and too narrow in focus (my personal opinion on how extensive a locale should be), so it doesn't lend itself to an extensive answer. It should just be noted that internationalisation is one topic, and localisation is another. You can give a lot of freedom through flexible internationalisation facilities, and let each localisation team decide to what extent they want to take on the burden of localisation.

For example, in Russian the standard decimal separator is comma, not period. Do you expect the “Rust in Russian” to use comma as a decimal separator?

It might be expected that the internationalisation facilities should provide enough flexibility to do so, but that the effective localisation decision should be taken by their respective maintainers.

What will happen to vec![0.3], vec![0,3] and vec![0;3], all of which are legal Rust expressions with different values?

Either the maintainers should decide to localise all possibly conflicting notations, or none of them. So people who are creating the locales should be given some hints, and possibly some tools to check this.

If any of these do change with the language switch, are there any other punctuation changes?

It might be expected that all these signs are only considered as tokens by the compiler; that is, whether you use , (a comma) or сейчас (just a dummy example, admittedly) to articulate expressions doesn't matter to the compiler.

For example, one reason I’d not consider using Russian in Rust, even just for identifiers and keywords, is that the standard keyboard layout for Russian does not include [] and {} (and probably <>, unless you have a special key for them) at all. Oh, and also &, ', #, @ and $. And, IIRC, at least one of /, \ and |.

Actually, with the proper facilities it should be possible to create a localisation that uses keywords which only include characters directly accessible on the standard keyboard layout for Russian.

Do you expect the “Rust in Russian” to be compatible to standard Rust and require constant layout switching, or to invent some replacements?

Of course it should be compatible, and no, it shouldn't require switching keyboard layouts (if that's what is meant here); the locale should match local customs.

Should “Rust in Arabic” be written right-to-left? Does this mean that { and } are used the other way round?

That one is easy: at least in Unicode, brackets are supposed to follow the flow they are in. Read them as "open/close curly bracket", not "left/right bracket".

Should “Rust in Mongolian script”, if it is ever desired, use vertical lines as the Mongolian script itself does?

Why not?

You’ve mentioned Esperanto in this thread and I really hope you don’t mean splitting the international Rust community into “international in English” and “international in Esperanto”.

No, of course, my evil master plan is that every single person will have to learn Esperanto or perish, mouhahaha. :wink:

Interesting Topic!

I see a large dissonance between the abstract "it would be respectful of each culture if there was a programming language in that culture's native spoken tongue", and the many replies providing concrete, technical and/or amount-of-work arguments; even with lots of first-hand experiences.

I agree in general that cultural diversity is something worth fighting for; however, I do not think that transliterating a programming language will help that goal.

If our goal is to let programmer A communicate with the computer; any spoken language will do; and I would argue that it is up to the individual programmers to write their own programming languages in whatever human language they like (like this one in Arabic).

However, programming, like virtually any other undertaking these days, is a very social activity. Discussing with peers, both on-line and off-line; reading documentation written by others, writing my own comments to explain things for maintainers five years from now, etc.
All of this is human-to-human communication, via the medium of either written prose, or source files.

For all forms of human communications, agreeing on a common standard is imperative.
There are no credible automatic solutions for translation, and I doubt they will come in the next decade.
Automatic translators WILL lose information, or even mistranslate to provide actively wrong instructions.
Meaning: there is no replacement for a skilled translator.
Meaning that it either takes extra resources, or happens at the expense of something else.
There is already (too) much to do with a single-language Rust, so I am not in favour of adding a Dutch-Rust.

To share my own experience: As a Dutchman :netherlands: working and programming in Germany :de: both I and my Native-German colleagues use English in our shared codebases; The Germans started with English codebases even when they were still a German-only team.
Since then, our team has had Russians :ru: , Dutchies :netherlands:, Portugalians :portugal: and Germans :de: working on it; and we work closely with, and program for, departments employing North- and South-Indians :india:, Pakistani :pakistan: , Brits :uk:, Frenchpeople :fr:, USAmericans :us:, Canadians :canada: and Koreans :kr:
All of this at the German Cancer Research Center.

In my experience, there is hardly any field where the argument "but it will only be used by X-speaking people" applies, aside maybe from teaching. (And even when teaching, I'd argue for preparing your pupils for what "the rest of the world" does).

I'm all for i18n in the compiler, and non-English documentation, but to me, doing s/keyword/trefwoord/ replacements in source files is madness. (You'd only end up with a horrible mix of translated keywords and non-translated (or even worse, badly translated) identifiers.)
Of course, there's also all the fun of (negative) connotations and their differences between languages. (e.g., with apologies to our Mongolian Rustaceans, English "Mongoloid", "of Mongolian descent", and the Dutch "Mongoloïde", (outdated) psychiatric term to signify "person with IQ below 60", used as an insult these days)
Don't forget: "naming things" is one of the biggest problems in IT... I don't feel that adding cultural baggage on top of my O(log(n)) vs O(n^2) deliberations is a productive change.

English might not be the best language for everyone, but having any single language beats having a מִגְדַּל בָּבֶל (see what I did there? :wink: )

(I deleted my previous post because it only rehashed things others have already said, and was not constructive because it was written before I counted to ten)

Thank you @TomP for your detailed analysis.

I would put comments and other documentation from 4. into two very distinct categories. External documentation doesn't require any specific facility in the rust compilation pipeline after all.

For 5, numbers are literal tokens, while dates and other related things are built objects. They sit at two very distinct levels: dates and so on are at the library level, and so should be discussed separately.

would make open-source programming much more difficult when learning to program, when interacting with others

This was already replied, so unless n

It would also complicate the use of macros, which usually define symbols in the macro-programmer’s language.

How would it be different from the rest of the translexicalised code? The only tricky problem might be with run-time instructions relying on identifiers, for example in an eval call of a scripting language. But this is probably not a concern for a compiled language such as Rust. Are there some tricky cases which have not been raised so far?

For 3 – translation of error messages – this is a really classical internationalisation and localisation problem, as far as I understand. The biggest burden of the work is on the translators' side, sure, but there is really no big technical challenge in implementing internationalisation, is there?

Restricting Rust to use only the Arabic numeral characters 0 to 9 avoids the often-encountered problem in other languages of the same character being used in numbers and in other words. For example the Chinese numeral ‘一’ (Pinyin yī), meaning one, occurs as the initial character in many Hanzi words, as well as in numbers, making disambiguation of numbers from identifiers potentially difficult.

Once again, it's fine to leave flexibility to the people creating localisations. To avoid the possible ambiguity with Chinese numerals, the corresponding locales might decide to use some prefix. Or not to localise them. Having a dedicated LC_NUMERIC environment variable for this part of localisation is already usual. So you could actually let users choose to write most of the code in Chinese ideograms while keeping so-called Arabic characters for digits.
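As a rough illustration of keeping the numeric part of a locale separate, a tool could consult LC_NUMERIC on its own. This is a minimal sketch: the function name `decimal_separator` and its tiny language table are assumptions for the example, not real locale data and not an existing rustc feature.

```rust
use std::env;

/// Pick a decimal separator from a locale name such as "ru_RU.UTF-8".
/// The table is a tiny illustrative sample, not real locale data.
fn decimal_separator(locale: &str) -> char {
    // Keep only the language part before `_` or `.`.
    let language = locale
        .split(|c: char| c == '_' || c == '.')
        .next()
        .unwrap_or("");
    match language {
        "ru" | "fr" | "de" => ',',
        _ => '.',
    }
}

fn main() {
    // LC_NUMERIC may be unset; fall back to the "C" locale in that case.
    let locale = env::var("LC_NUMERIC").unwrap_or_else(|_| String::from("C"));
    println!("decimal separator: {}", decimal_separator(&locale));
}
```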

Presumably you meant "any shared common language" rather than "any single language", which are two very different things. If it wasn't intended as some subtle joke, then one should wonder what it illustrates about the efficiency of English as a lingua franca.

I’m all for i18n in the compiler, and non-English documentation, but to me, doing s/keyword/trefwoord/ replacements in source files is madness. (You’d only end up with a horrible mix of translated keywords and non-translated (or even worse, badly translated) identifiers.)

Were it a mere dummy token substitution, of course it would be, but for that a sed script would indeed be enough and integrating internationalisation facilities would be far less useful. But it is precisely because the compiler's job is to make a more accurate analysis of the provided text that it is interesting to put such a feature there, so you can make a more reasonable translation according to context.

Don’t forget: “naming things” is one of the biggest problems in IT… I don’t feel that adding cultural baggage on top of my O(log(n)) vs O(n^2) deliberations is a productive change.

How would adding English and its cultural baggage be any different from doing so with any other language? And if naming identifiers is so important, people should obviously start working in a language they are well accustomed to. If English is not part of the cultural baggage of the team, then it should surely consider at least starting with another common language. Of course that doesn't mean English shouldn't be considered, even if the current team has a better knowledge of another common language besides English.

@psychoslave I think this thread is currently at the point where it would be good to make a new one with a concrete proposal of what you hope to add to Rust. Right now there are enough very different possibilities that I'm not sure there's useful input to be had.

Also, it's still at the stage where there is no compelling reason to
integrate any of this into rustc itself (instead of doing this in a fork
like Koro)

You really should do that first, explore the space, and come back with a
concrete proposal

For item 2) in my prior listing, extending Rust to Unicode identifiers, as in Go, would be a significant task, in part because Unicode identifiers have three cases: upper case, lower case and title case. Thus Rust's hygiene rules would need to be extended to cover mixes of three cases rather than just two.
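The third case can be seen directly with the standard library's character predicates. A small sketch, with `case_flags` being a name made up for this example:

```rust
/// Return (is_uppercase, is_lowercase) as the standard library sees `c`.
fn case_flags(c: char) -> (bool, bool) {
    (c.is_uppercase(), c.is_lowercase())
}

fn main() {
    // U+01C4, U+01C5 and U+01C6 are the upper case, title case and
    // lower case forms of the same Latin digraph.
    for c in ['\u{01C4}', '\u{01C5}', '\u{01C6}'] {
        println!("U+{:04X}: {:?}", c as u32, case_flags(c));
    }
}
```

The title case form reports neither upper nor lower case, so any rule phrased purely in terms of those two predicates leaves it unclassified.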

For item 3) in my prior listing, rendering compiler messages in alternate target languages, potentially with argument reordering, I've worked out a general translation/substitution approach. I've also done a more extensive search of the Rust nightly tree for messages that would require translation. My rough count (before eliminating the few duplicates) is that there are about 7000 such messages that would need translation to each target language.

Any concrete proposal to actually implement either of these changes needs to address the size of the effort, which will be significant.

@scottmcm totally agree, it seems unlikely that any new important point will be raised by any of the participants so far. I'll look into launching a new thread oriented toward a proposal for a single topic identified by @TomP among 1, 2 and 5a (numbers).

Update to a prior post, many months later:

  1. For item 2), although I didn't know it at the time of writing the above, nightly Rust already supported Unicode identifiers. The issue of hygiene for title case identifiers – are they considered upper-case? are they equivalent to upper-case-shifted analogues? – does not appear to have been considered.

  2. For item 3, argument reordering in compiler messages is already supported by format!, println! and related macros and quasi-macros. A revised analysis of the rustc crate on 2018.05.06, using the regex pattern \w!\("(?s)(\\\"|.)*?" found 10151 non-test message strings, which after sort/merge and elimination of strings without English text reduced to about 5700 unique message strings that probably require translation.
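To illustrate the reordering point: positional indices in `format!` let arguments appear in any order, but the format string must be a literal known at compile time, so templates loaded from a translation catalogue would still need some run-time substitution. The sketch below shows both; `expected_found`, `render` and the message texts are hypothetical and not taken from rustc.

```rust
/// Compile-time reordering: positional indices let the arguments appear
/// in any order inside the literal format string.
fn expected_found(expected: &str, found: &str) -> (String, String) {
    let original = format!("expected {0}, found {1}", expected, found);
    // Stand-in for a translation whose grammar wants the reverse order.
    let reordered = format!("found {1} where {0} was expected", expected, found);
    (original, reordered)
}

/// Run-time substitution for templates loaded from a message catalogue:
/// replaces `{0}`, `{1}`, ... with the matching entry of `args`.
/// Naive: an argument that itself contains `{1}` would be substituted again.
fn render(template: &str, args: &[&str]) -> String {
    let mut out = template.to_string();
    for (i, arg) in args.iter().enumerate() {
        out = out.replace(&format!("{{{}}}", i), arg);
    }
    out
}

fn main() {
    let (original, reordered) = expected_found("u32", "String");
    println!("{}", original);
    println!("{}", reordered);
    println!("{}", render("found {1} where {0} was expected", &["u32", "String"]));
}
```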

Hi @TomP, thank you for your feedback. Unfortunately I didn't have much time to dedicate to Rust, let alone Rust and internationalization, in the last months. The coming months seem more favourable for this, although I can't promise anything.

I'm still very interested in this topic, and in seeing how I might help with it. Let me know if you have any new, more up-to-date suggestions. :slight_smile: