The internationalization of Rust itself


#41

I’m not a programmer, but I know this concept from working with Microsoft Excel … the keywords for a formula are in the “Office”/local language.

For example, English: vlookup() and German: sverweis()

This behaviour is really annoying. Depending on the environment you are working in, you have to remember twice the number of keywords.

But in my opinion this is a matter of how it is implemented, and Excel is probably an example of how NOT to do it. They even change keyboard shortcuts depending on the language. This is good for locals but it does not help internationalization :slight_smile:
There should be one common (global) language which always works, plus additional support for local languages/keywords.


#43

Just my personal opinion, but I feel this somewhat goes against the purpose of a programming language.
Learning a programming language isn’t about learning English, it’s about learning that language, and the purpose of that language is twofold:

  1. For a human to have a “conversation” with a computer and provide it with instructions.
  2. To provide a common language and syntax for two humans to discuss a problem.

I feel that internationalising or translating Rust names and keywords takes away from that second point, because discussing a problem with international programmers requires a common language in which we can communicate our ideas.


#44

Once again, internationalization doesn’t prevent that. :slight_smile: It just allows people to use an interface that may be more convenient for them, without requiring English as a communication tool even where another spoken language would be more relevant for the stakeholders involved.

Most of a sane compiler deals with the abstract syntax tree (AST), bytecode and other lower-level stuff that the end user almost never sees. The biggest problem you might encounter from a technical point of view is probably reflective code that relies on hardcoded identifiers, but I’m not sure such a thing is even possible in Rust, and you can always expose facilities that do the same thing without relying on a hardcoded string.


#45

Even if the stakeholders involved are the only ones who will ever see the code, how many projects exist where the programmers never look up documentation or online assistance for the problems they are solving? Localisation here harms the ability of a programmer to draw on the wealth of knowledge of global programmers.

I am a native English speaker, and I’m sure that brings some bias with it, but there are so many times that I have gotten assistance by reading threads on StackOverflow with replies and solutions from international programmers and I would not have been able to get the same value from their replies if they weren’t written in English. The same is true for two international programmers with different native languages: being able to write code in one spoken language gives them a common means of communicating.


#46

Ok … time for a small story :stuck_out_tongue: When you start off with PHP you often get this error:

Unexpected T_PAAMAYIM_NEKUDOTAYIM

You spend 10-15 minutes wondering what it is (you can see that it is a syntax error, but as a beginner there is a chance you may not realise it), and then after some Google searching you realise that it means “double colon” in Hebrew.

I know this is a small issue. But then there could be similar issues that consume a lot of dev time.


#47

First of all, I’d like to reiterate @H2CO3’s point. As a French speaker, I started learning programming at age 10 and didn’t speak a single word of English then. if, else, print, … were just magic incantations and I used them as such without any issue. It was only a year later that I started learning English, and I was startled to learn that if was actually an English word!


So, first of all I’d like to understand exactly what we are speaking of:

  • translating the keywords,
  • allowing the use of full Unicode identifiers (to allow non-English identifiers),
  • translating the existing identifiers (types, methods, …), what’s the point of non-English keywords if the std is in English,
  • all of the above?

Secondly, I think it would be nice to check what other languages/frameworks have done in this regard, and what their experience has been like:

  • Excel uses different function names in different languages: how does it work, what do people think, etc…
  • Perl was mentioned, what’s the feedback?

It seems wasteful to just forge ahead without learning from the past; let’s learn from others’ mistakes to avoid repeating them!


#48

I think quite the opposite should be expected. Actually this assertion holds no better than claiming that localizing any piece of software harms the ability to “draw on the wealth of knowledge of global users”. It is just as valid in both cases. The thing is that good internationalization should in fact lower the barrier, because your code base relies on a software stack that was designed for it rather than being deeply tied to a specific spoken language.

I have gotten assistance by reading threads on StackOverflow with replies and solutions from international programmers and I would not have been able to get the same value from their replies if they weren’t written in English.

So it’s good for all English speakers that this kind of platform exists. How is that incompatible with having question-and-answer websites in other spoken languages? Those who can speak English won’t suddenly be unable to consult the very same website, while those who can’t might still have some chance of getting help in a spoken language they know. Or should the people who didn’t have the chance to learn English be told that only those who had this privilege will receive help?


#49

What conclusion do you draw from this story?

One may conclude that PHP doesn’t provide good internationalization of its programming interface, while still using random locales to name internal things that blow up in the face of the end user. Well, actually, even when it sticks to English, PHP’s API is such a large set of completely WTF random inconsistencies that it would be hard to make anything worse on purpose.


#50

I just started a wiki research project on the topic. Any help is welcome. Even adding some new references would be very cool.

To answer @matthieum’s question, I would say “all of the above, and possibly more, but each topic might be considered as an independent target”.

  • Having the ability to translate keywords is nice, but it would probably be completely useless if that were really the only translatable thing. Nice to have, especially as it would allow more homogeneous interfaces, but really not the most crucial helper.
    • even if you go with allowing keyword translation, there are still questions about which primitives can be bound to which tokens, and how you deal with lexical collisions. Namespacing a chunk of code to a single locale is easy.
    • some languages do allow rather extensive translexicalization, but not of completely everything; for example C has #define, which makes it easy to translexicalize many things, but not #define itself
  • Unicode identifiers are a must-have. The only discussion here should be which set of characters is explicitly not allowed. For example, last time I checked, Coq allowed non-breaking spaces as part of an identifier. Plus there are many tricky Unicode things, same-looking glyphs and all, but the open issue on this topic seems to be on the right track on these points.
  • The main part is generally the API. So there should be the ability to translate identifiers without an integration or performance penalty. Facilities should therefore target an alias feature, rather than a “wrapping everything in functions” ugly hack.
  • In the “possibly more”, you might include allowing more freedom in syntactic and inflectional input. As far as I know, Perl has been ahead of anything else on this front. But I personally wouldn’t strongly advocate this kind of facility, especially for a language like
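On the alias point in the list above: Rust already lets you give an existing item a second name without wrapping it, via `use … as` and `type`. The sketch below is only an illustration of that mechanism, not a proposal; the French names (TableDeHachage, echanger, Chaine) and the helper functions are invented for this example and exist in no library.

```rust
// Hypothetical French aliases for std items; these names are made up.
use std::collections::HashMap as TableDeHachage;
use std::mem::swap as echanger;

// A type alias is a second name for the same type, not a wrapper.
type Chaine = String;

/// Swaps two values through the aliased function.
fn swapped(mut a: i32, mut b: i32) -> (i32, i32) {
    echanger(&mut a, &mut b);
    (a, b)
}

/// Builds a map through the aliases, then reads it through the std names.
fn age_of_ada() -> Option<u32> {
    let mut ages: TableDeHachage<Chaine, u32> = TableDeHachage::new();
    ages.insert(Chaine::from("Ada"), 36);
    // The alias names the very same type, so the std name accepts it directly.
    let same: std::collections::HashMap<String, u32> = ages;
    same.get("Ada").copied()
}

fn main() {
    println!("{:?} {:?}", swapped(1, 2), age_of_ada());
}
```

This covers types and free functions only. Methods, keywords and macro names cannot be renamed this way, which is where dedicated facilities would be needed.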

Then you have the question of how you migrate from one locale to another. An option like cargo fmt has already been suggested, and this seems to be just what Koro is doing with its gofmt equivalent. With such a feature it seems trivial to transform or copy and paste code snippets into whichever locale is used by a targeted forum or Q&A website, which was mentioned as a concern.


#51

As I stated in the post you quoted, the point I was trying to demonstrate wasn’t that
“I speak English and I find it helpful that online help is in English too.”
but rather that
“Programmer A wants to find help for a problem they’re facing.”
“Programmer B has already provided an internet resource which helps A with their problem.”
“A and B speak different native languages.”
“That doesn’t matter because the programming language and error messages use one spoken language, meaning that A can search for what they need, B can write what they know, and Google can connect the dots.”

The common language happens to be English, yes that is fortunate for me as a native speaker but what I find more valuable is that since the error codes and problem ideas and concepts I am searching for all use one spoken language, I can find the resources I need with a search engine.

Take this forum for example, we are able to communicate here because we are using a common language to communicate our ideas. If people just opened localised topics here it would be cutting anyone who is not a native speaker out of the conversation.

But let’s take this away from looking at one person or one language, spoken or code.
Many languages share common keywords which do the same thing:

  • let in Rust, JS and Haskell all declare variables.
  • map in Rust, JS and Haskell all map elements of a collection of items from one value/type to another.
  • fold in Rust and Haskell both reduce a collection of elements into a single value.
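For readers who have not met these names, here is a minimal Rust illustration of the three items above. The function names doubled and total are made up for this sketch.

```rust
/// `map` turns each element into a new value.
fn doubled(xs: &[i32]) -> Vec<i32> {
    xs.iter().map(|x| x * 2).collect()
}

/// `fold` reduces the whole collection to a single value.
fn total(xs: &[i32]) -> i32 {
    xs.iter().fold(0, |acc, x| acc + x)
}

fn main() {
    // `let` declares a variable.
    let xs = [1, 2, 3];
    println!("{:?} {}", doubled(&xs), total(&xs));
}
```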

The point I’m trying to make here is that when each language has not just a single spoken “root language”, if you will, but a common one, learning the keywords and type names in one language better positions you to quickly learn the keywords and type names in another.

I’m not saying that internationalising keywords and type names would not make entering Rust or any other language as a non-English speaker easier, it would.
What I am saying is that internationalising keywords and type names would reduce any language’s ability to do what it is designed to do, communicate an idea between a human and a computer or two humans, by muddying the waters.


#52

Common keywords such as let and map are understood by everyone. But what about more complex stuff like function names?


#53

Only if A and B both speak English, in which case internationalization won’t prevent that from happening. And if neither A nor B speaks English, they will still be able to help each other if they have a common language, for example Bengali.

So it lets other language communities flourish where otherwise nothing might have emerged. Indeed, the alternative for A and B is not reaching the solution through an English channel, but most likely never finding an existing solution expressed in a language they don’t know.

And with internationalization, as a bonus, you might have C joining the party, who speaks Bengali like A and B but also knows English (or another language). When D, who doesn’t speak Bengali, comes to ask a question that has not been answered in English (or any other language D knows), C might find the resources published by A and B and translate them for D. The alternative scenario, with a lack of internationalization, is that no solution was ever produced by A and B, so neither C nor D can benefit from it.

Take this forum for example, we are able to communicate here because we are using a common language to communicate our ideas. If people just opened localised topics here it would be cutting anyone who is not a native speaker out of the conversation.

No, it would only be cutting out of the conversation anyone who does not speak the language used, while letting those who don’t speak English still have possibly fruitful exchanges. Other forums use other languages exclusively to communicate ideas, through exchanges that wouldn’t happen otherwise. One might guess that this includes some forums about Rust. :slight_smile:

internationalising keywords and type names would reduce any language’s ability to do what it is designed to do, communicate an idea between a human and a computer or two humans, by muddying the waters.

How? How would it be different from having internationalization for other software interfaces? The availability of localized versions changes nothing about the fact that a person can use the English localization and English-language resources.


#54

For information, Stack Overflow has a few internationalized versions:

So my takeaway is that language-specific communities already exist to some extent, and there are probably communities elsewhere as well that I didn’t find with a quick search.

Enabling users who don’t speak English to write Rust programs in their native languages would be beneficial to everybody if it’s easy to convert such programs back to English. The video about Koro shows how korofmt can turn a Bengali “Hello World” program into an English version automatically.

Given that Rust has great support for Unicode, a first step could be to try to do to rustc what the Koro project did to the Go compiler. If it’s coupled with good back-and-forth translation capabilities, then such a tool would be much less of a hack than simply forking the language would be.
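To give a feel for what back-and-forth translation could involve, here is a toy sketch. It is not how korofmt or rustc works; the French keyword table and the translate function are invented for illustration. It swaps whole words only and copies string literals untouched, which is already more than a plain search-and-replace does.

```rust
// Hypothetical English/French keyword pairs; not taken from any real tool.
const TABLE: &[(&str, &str)] = &[
    ("fn", "fonction"),
    ("let", "soit"),
    ("if", "si"),
    ("else", "sinon"),
];

/// Translates keywords word by word. `to_local` picks the direction.
/// Limitations of this sketch: comments, char literals and raw strings
/// are not recognised, and an identifier that happens to equal a
/// translated keyword would be rewritten too.
fn translate(src: &str, to_local: bool) -> String {
    let mut out = String::new();
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '"' {
            // Copy a string literal verbatim, honouring backslash escapes.
            out.push(c);
            while let Some(d) = chars.next() {
                out.push(d);
                if d == '\\' {
                    if let Some(e) = chars.next() {
                        out.push(e);
                    }
                } else if d == '"' {
                    break;
                }
            }
        } else if c.is_alphabetic() || c == '_' {
            // Collect a whole word so that `letter` is never seen as `let`.
            let mut word = String::from(c);
            while let Some(&d) = chars.peek() {
                if d.is_alphanumeric() || d == '_' {
                    word.push(d);
                    chars.next();
                } else {
                    break;
                }
            }
            let mapped = TABLE.iter().find_map(|&(en, local)| {
                let (from, to) = if to_local { (en, local) } else { (local, en) };
                if word.as_str() == from { Some(to) } else { None }
            });
            match mapped {
                Some(replacement) => out.push_str(replacement),
                None => out.push_str(&word),
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    let english = "fn main() { let s = \"let it be\"; }";
    let french = translate(english, true);
    println!("{}", french);
    println!("{}", translate(&french, false));
}
```

A real tool would work on the compiler’s own token stream or syntax tree rather than on characters, so that it could also tell keywords from identifiers by context.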


#55

@psychoslave, one question I’d like you to cover is where do you want the internationalization to stop. What’s the line you’re not crossing with programming language localization?

For example, in Russian the standard decimal separator is comma, not period. Do you expect the “Rust in Russian” to use comma as a decimal separator? What will happen to vec![0.3], vec![0,3] and vec![0;3], all of which are legal Rust expressions with different values? If any of these do change with the language switch, are there any other punctuation changes? What about the semicolon as a statement terminator, or comma in a struct field list?
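To make the three forms concrete, here is a small sketch of what each one means in Rust today. The helper name lengths is invented for illustration.

```rust
/// Returns the lengths of the three vectors discussed above.
fn lengths() -> (usize, usize, usize) {
    let a: Vec<f64> = vec![0.3]; // one element: the float 0.3
    let b: Vec<i32> = vec![0, 3]; // two elements: 0 and 3
    let c: Vec<i32> = vec![0; 3]; // three elements, all 0
    (a.len(), b.len(), c.len())
}

fn main() {
    println!("{:?}", lengths()); // prints (1, 2, 3)
}
```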

Next, what about characters that are not available in the national standard keyboard layouts? For example, one reason I’d not consider using Russian in Rust, even just for identifiers and keywords, is that the standard keyboard layout for Russian does not include [] and {} (and probably <>, unless you have a special key for them) at all. Oh, and also &, ', #, @ and $. And, IIRC, at least one of /, \ and |.
Do you expect the “Rust in Russian” to be compatible with standard Rust and require constant layout switching, or to invent some replacements? The number of replacements will be a problem, however: Russian has more letters and fewer special characters in the overall layout.

Finally, what about global changes?
Should “Rust in Arabic” be written right-to-left? Does this mean that { and } are used the other way round?
Should “Rust in Mongolian script”, if it is ever desired, use vertical lines as the Mongolian script itself does?

P.S. Also, I’d like to explicitly mention that, in my opinion, the “international Rust” label shall always mean “Rust in English” unless some very radical changes happen in the world. You’ve mentioned Esperanto in this thread and I really hope you don’t mean splitting the international Rust community into “international in English” and “international in Esperanto”.


#56

It seems to me that there are five real issues here:

  1. Internationalization of Rust’s defined keywords, attribute names, etc.
  2. Internationalization of identifiers
  3. Internationalization of error messages
  4. Internationalization of comments and other documentation
  5. Internationalization of numbers, dates, times, etc., both in compiler input and output and within the runtime

Others in this thread have made the argument that 1) would make open-source programming much more difficult when learning to program, when interacting with others, and when attempting to reuse the work of others. I personally agree with all those concerns. It would also complicate the use of macros, which usually define symbols in the macro-programmer’s language. The hundreds of macros within the compiler that define what most programmers consider to be Rust’s built-in aspects would be particularly problematic.

  2. should be a fairly straightforward change to the compiler. In this the Go language shows the way. See https://golang.org/ref/spec#Characters and the following “Letters and digits” and “Identifiers” sections.

  3. is a significant challenge, due to the number of different error messages in the compiler, the language-dependent order of the variable arguments that appear in the error messages, and language-specific aspects such as gender, number, inflection, etc. that some languages require. The current nightly build contains 405 distinct format!() macros in 103 files that appear to report errors, most of which would have to be generalized to invoke an internationalization module that would in turn be able to load templates for any supported alternative language. Such translation must also address any required reordering of the variable content in each message (e.g., the three parameters of an English-language message might need to occur in a different order in the second language).

    It should be pointed out that the biggest challenge is not the initial translation but the maintenance across compiler changes. A significant pool of language translators would need to be retained to address new or changed error messages for each stable Rust build, with that effort appearing first in the many nightly builds.
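The reordering problem is one reason a plain format!() call does not translate well: its template must be a literal known at compile time. One common way around it is to use named placeholders in templates loaded at run time, so that each translation can place the arguments wherever its grammar needs them. The sketch below assumes that approach; the render function and both message templates are invented for illustration and are not real rustc diagnostics.

```rust
use std::collections::HashMap;

/// Fills `{name}` placeholders in `template` from `args`.
/// Placeholders with no matching argument are left as they are.
fn render(template: &str, args: &HashMap<&str, String>) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(open) = rest.find('{') {
        out.push_str(&rest[..open]);
        match rest[open..].find('}') {
            Some(close) => {
                // `close` is relative to `open`, so the key sits between them.
                let key = &rest[open + 1..open + close];
                match args.get(key) {
                    Some(value) => out.push_str(value),
                    None => out.push_str(&rest[open..=open + close]),
                }
                rest = &rest[open + close + 1..];
            }
            None => {
                // Unclosed brace: copy the remainder unchanged.
                out.push_str(&rest[open..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let mut args = HashMap::new();
    args.insert("expected", "u32".to_string());
    args.insert("found", "bool".to_string());

    // Hypothetical templates: the second one puts the arguments in the opposite order.
    println!("{}", render("expected {expected}, found {found}", &args));
    println!("{}", render("{found} trouvé, {expected} attendu", &args));
}
```

Named placeholders handle ordering only. Gender, number and inflection need richer template rules, which this sketch does not attempt.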

  4. Presumably comments can already be written in a language of the programmer’s choice. Such use is probably sufficient for non-Markdown comments. For Markdown comments, a means of invoking replacement strings or comment blocks found in a language-specific internationalization module (or section) might be appropriate. As is always the case, the burden here would fall on those individuals providing the translation to a given target language.

5a) The decimal point ‘.’ in non-integer numbers is somewhat problematic, since most of the world uses a comma ‘,’ to separate integral and fractional parts of fixed-point and floating-point numbers. Those of us who read international standards are used to seeing both forms of separator. Both period and comma can be used as a fraction separator in numbers provided that the separator character is surrounded by digits in all such uses, and that use of either character elsewhere in the syntax is required to always avoid separating an integer and an adjacent digit (e.g., by inclusion of whitespace). Because a macro using one separator can be invoked by someone using the other separator, it seems probable that both should be treated equivalently.

5b) Many languages have non-Arabic characters to represent the digits of numbers, sometimes in a radix other than 10. Such support does not seem essential for the internationalization of Rust. Restricting Rust to use only the Arabic numeral characters 0 to 9 avoids the often-encountered problem in other languages of the same character being used in numbers and in other words. For example the Chinese numeral ‘一’ (Pinyin yī), meaning one, occurs as the initial character in many Hanzi words, as well as in numbers, making disambiguation of numbers from identifiers potentially difficult.

As a point of reference, although the Go language permits use of non-Arabic digits in identifiers, it does not permit them in numbers.
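Rust’s standard library draws a similar line at run time, which the following sketch illustrates. It only shows std behaviour for parsing strings and says nothing about how the compiler lexes numeric literals. The helper name parses_as_number is made up, and U+0663 is ARABIC-INDIC DIGIT THREE, written as an escape to keep the source ASCII.

```rust
/// True if std can parse `s` as a u32.
fn parses_as_number(s: &str) -> bool {
    s.parse::<u32>().is_ok()
}

fn main() {
    let arabic_indic_three = '\u{0663}';

    // Unicode classifies the character as numeric...
    assert!(arabic_indic_three.is_numeric());
    // ...but it is not an ASCII digit, and to_digit only knows 0-9 and letters.
    assert!(!arabic_indic_three.is_ascii_digit());
    assert_eq!(arabic_indic_three.to_digit(10), None);

    // Integer parsing accepts ASCII digits only.
    assert!(parses_as_number("3"));
    assert!(!parses_as_number("\u{0663}"));
    println!("all checks passed");
}
```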

5c) Other internationalization requirements are similar to those in various operating systems and generally should use APIs of the host or target system in determining their behavior.


#57

Hi, thank you for your interest.

What’s the line you’re not crossing with programming language localization?

This is obviously a question with a topic too broad (the whole set of potentially localisable material) and a focus too narrow (my personal opinion on how extensive a locale should be), so it doesn’t lend itself to an extensive answer. It should just be noted that internationalisation is one topic, and localisation is another. You can give a lot of freedom through flexible internationalisation facilities, and let each localisation team decide to what extent they wish to take on the burden of localisation.

For example, in Russian the standard decimal separator is comma, not period. Do you expect the “Rust in Russian” to use comma as a decimal separator?

It might be expected that the internationalisation facilities provide enough flexibility to do so, but that the actual localisation decision is taken by the respective maintainers.

What will happen to vec![0.3], vec![0,3] and vec![0;3], all of which are legal Rust expressions with different values?

Either the maintainers should decide to localise all possibly conflicting notations, or none of them. So the people writing the locales should be given some hints, and possibly some tools to check for such conflicts.

If any of these do change with the language switch, are there any other punctuation changes?

It might be expected that all those signs are only considered as tokens by the compiler; that is, whether you use , (a comma) or сейчас (just a dummy example, admittedly) to articulate expressions doesn’t matter to the compiler.

For example, one reason I’d not consider using Russian in Rust, even just for identifiers and keywords, is that the standard keyboard layout for Russian does not include [] and {} (and probably <>, unless you have a special key for them) at all. Oh, and also &, ', #, @ and $. And, IIRC, at least one of /, \ and |.

Actually, with the proper facilities it should be possible to create a localisation that uses keywords which only include characters directly accessible on the standard Russian keyboard layout.

Do you expect the “Rust in Russian” to be compatible with standard Rust and require constant layout switching, or to invent some replacements?

Of course it should be compatible, and no, it shouldn’t require switching keyboard layouts (if that’s what is meant here); the locale should match local customs.

Should “Rust in Arabic” be written right-to-left? Does this mean that { and } are used the other way round?

That one is easy: at least in Unicode, brackets are supposed to follow the flow they are in. Read them as “open/close curly bracket”, not “left/right bracket”.

Should “Rust in Mongolian script”, if it is ever desired, use vertical lines as the Mongolian script itself does?

Why not?

You’ve mentioned Esperanto in this thread and I really hope you don’t mean splitting the international Rust community into “international in English” and “international in Esperanto”.

No, of course, my evil master plan is that every single person will have to learn Esperanto or perish, mouhahaha. :wink:


#59

Interesting Topic!

I see a large dissonance between the abstract “it would be respectful of each culture if there was a programming language in that culture’s native spoken tongue”, and the many replies providing concrete, technical and/or amount-of-work arguments; even with lots of first-hand experiences.

I agree in general that cultural diversity is something worth fighting for; however, I do not think that transliterating a programming language will help that goal.

If our goal is to let programmer A communicate with the computer; any spoken language will do; and I would argue that it is up to the individual programmers to write their own programming languages in whatever human language they like (like this one in Arabic).

However, programming, like virtually any other undertaking these days, is a very social activity. Discussing with peers, both on-line and off-line; reading documentation written by others, writing my own comments to explain things for maintainers five years from now, etc…
All of this is human-to-human communication, via the medium of either written prose, or source files.

For all forms of human communications, agreeing on a common standard is imperative.
There are no credible automatic solutions for translation, and I doubt they will come in the next decade.
Automatic translators WILL lose information, or even mistranslate to provide actively wrong instructions.
Meaning: there is no replacement for a skilled translator.
Meaning that it either takes extra resources, or happens at the expense of something else.
There is already (too) much to do with a single-language Rust, so I am not in favour of adding a Dutch-Rust…

To share my own experience: as a Dutchman :netherlands: working and programming in Germany :de:, both I and my native German colleagues use English in our shared codebases. The Germans started with English codebases even when they were still a German-only team.
Since then, our team has had Russians :ru:, Dutchies :netherlands:, Portuguese :portugal: and Germans :de: working on it, and we work closely with, and program for, departments employing North and South Indians :india:, Pakistanis :pakistan:, Brits :uk:, French people :fr:, US Americans :us:, Canadians :canada: and Koreans :kr:.
All of this at the German Cancer Research Center.

In my experience, there is hardly any field where the argument “but it will only be used by X-speaking people” applies, aside maybe from teaching. (And even when teaching, I’d argue for preparing your pupils for what “the rest of the world” does).

I’m all for i18n in the compiler, and non-English documentation, but to me, doing s/keyword/trefwoord/ replacements in source files is madness. (You’d only end up with a horrible mix of translated keywords and non-translated (or even worse, badly translated) identifiers.)
Of course, there’s also all the fun of (negative) connotations and their differences between languages. (e.g., with apologies to our Mongolian Rustaceans, English “Mongoloid”, “of Mongolian descent”, and the Dutch “Mongoloïde”, an (outdated) psychiatric term to signify “person with IQ below 60”, used as an insult these days)
Don’t forget: “naming things” is one of the biggest problems in IT… I don’t feel that adding cultural baggage on top of my O(log(n)) vs O(n^2) deliberations is a productive change.

English might not be the best language for everyone, but having any single language beats having a מִגְדַּל בָּבֶל (see what I did there? :wink: )

(I deleted my previous post because it only rehashed things others have already said, and was not constructive because it was written before I counted to ten)


#60

Thank you @TomP for your detailed analysis.

I would put comments and other documentation from 4. into two very distinct categories. External documentation doesn’t require any specific facility in the Rust compilation pipeline, after all.

For 5, numbers are literal tokens, while dates and other related things are constructed objects. They sit at two very distinct levels: dates and so on are at the library level, and so should be discussed separately.

would make open-source programming much more difficult when learning to program, when interacting with others

This was already replied, so unless n

It would also complicate the use of macros, which usually define symbols in the macro-programmer’s language.

How would it be different from the rest of the translexicalised code? The only tricky problem might be with run-time instructions relying on identifiers, for example in an eval call in a scripting language. But this is probably not a concern for a compiled language such as Rust. Or are there some tricky cases which have not been raised so far?

For 3, translation of error messages, these are really classical internationalisation and localisation problems, as far as I understand. The biggest burden of the work is on the translators’ side, sure, but there is really no big technical challenge in implementing internationalisation, is there?

Restricting Rust to use only the Arabic numeral characters 0 to 9 avoids the often-encountered problem in other languages of the same character being used in numbers and in other words. For example the Chinese numeral ‘一’ (Pinyin yī), meaning one, occurs as the initial character in many Hanzi words, as well as in numbers, making disambiguation of numbers from identifiers potentially difficult.

Once again, it’s fine to leave flexibility to the people creating localisations. To avoid the possible ambiguity with Chinese numerals, the corresponding locales might decide to use some prefix, or not to localise them. Having a dedicated LC_NUMERIC environment variable for this part of localisation is already usual. So you could actually let users choose to write most of the code in Chinese characters while keeping so-called Arabic characters for digits.


#61

Presumably you meant “any shared common language” rather than “any single language”, which are two very different things. If it wasn’t intended as some subtle joke, then one should wonder what it illustrates about the efficiency of English as a lingua franca.

I’m all for i18n in the compiler, and non-English documentation, but to me, doing s/keyword/trefwoord/ replacements in source files is madness. (You’d only end up with a horrible mix of translated keywords and non-translated (or even worse, badly translated) identifiers.)

If it were mere dummy token substitution, of course it would be; but for that a sed script would indeed be enough, and integrating internationalisation facilities would be far less useful. It is precisely because the compiler’s job is to make a more accurate analysis of the provided text that it is interesting to put such a feature there, so that translation can take context into account.

Don’t forget: “naming things” is one of the biggest problems in IT… I don’t feel that adding cultural baggage on top of my O(log(n)) vs O(n^2) deliberations is a productive change.

How would adding English and its cultural baggage be any different from doing so with any other language? And if naming identifiers is so important, people should obviously start to work in a language they are well accustomed to. If English is not part of the cultural baggage of the team, then it should surely consider at least starting with another common language. Of course that doesn’t mean English shouldn’t be considered, even if the current team has a better knowledge of another common language.


#62

@psychoslave I think this thread is currently at the point where it would be good to make a new one with a concrete proposal of what you hope to add to Rust. Right now there are enough very different possibilities that I’m not sure there’s useful input to be had.