Possible encoding issues with Cargo-generated files


#1

The Rust compiler consumes UTF-8 encoded source files, but Cargo generates ASCII source files. Is that a good idea?
Is the .toml file intended to be ASCII encoded? If so, the decision looks very inconvenient and unfounded.
Personally, I spent some irritating minutes working around the compiler’s and Cargo’s issues…


#2

All valid ASCII is also valid UTF-8. What kind of issues did you face?
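
To make the subset claim concrete, here is a minimal Rust sketch. The helper name `ascii_is_valid_utf8` is made up for illustration; it relies only on the standard library's `is_ascii` and `str::from_utf8`.

```rust
use std::str;

/// Returns true when every byte is 7-bit ASCII and the buffer also
/// passes UTF-8 validation. For ASCII input both checks always agree.
fn ascii_is_valid_utf8(bytes: &[u8]) -> bool {
    bytes.is_ascii() && str::from_utf8(bytes).is_ok()
}

fn main() {
    let ascii: &[u8] = b"Hello World";

    // ASCII bytes are all below 0x80, and UTF-8 encodes those code
    // points as the very same single byte.
    println!("ASCII is valid UTF-8: {}", ascii_is_valid_utf8(ascii));

    // The UTF-8 encoding of an ASCII-only string is byte-for-byte
    // identical to the ASCII text.
    println!("same bytes: {}", "Hello World".as_bytes() == ascii);
}
```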


#3

Only if English is native to your OS. For instance, if I replace “Hello World” with “Привет мир!” (in Russian), then I get a compiler error (the source is not UTF-8 encoded)…
and what is MUCH worse, when I feed Cargo a UTF-8 encoded, CORRECTLY formed toml (with no visible
non-English chars), I see a confusing message (Failed to parse the manifest…).


#4

ASCII is always a strict subset of UTF-8, no matter what the native language of the OS is. The problem could be the OS’s locale is making the default encoding something other than UTF-8.

As an example, the following works fine:

fn main() {
    println!("Привет мир!");
}

Could you be more specific about what’s happening? E.g. post any source code and Cargo.toml’s that you’re having trouble with.

Also, what do you mean by “generated”? Do you have a build script? (The only file that Cargo generates by default is Cargo.lock.)

and what is MUCH worse, when I feed Cargo a UTF-8 encoded, CORRECTLY formed toml (with no visible
non-English chars), I see a confusing message (Failed to parse the manifest…).

This sounds like it may not be correctly encoded as UTF-8 (or may have some other problem).


#5
  1. It compiles fine, but ONLY if I convert it to UTF-8. THE PROBLEM is that I HAVE TO do it EXPLICITLY (if I don’t, I just
    get the compiler error, because the default OS code page is 1251). It is confusing, because visually there is no difference between the UTF-8 and CP-1251 versions.
  2. Maybe… but I use Far’s built-in editor for the conversions, and (as a user) I see no reason to distrust it, but I DO have some reasons to distrust some, hmm… new stuff (like Cargo and Rust).
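
A small sketch of why the two versions look identical in an editor yet behave differently for the compiler: the same word is stored as different bytes. The byte values below are the standard Windows-1251 and UTF-8 encodings of “Привет”; the constant and function names are illustrative only.

```rust
use std::str;

// "Привет" encoded as UTF-8: two bytes per Cyrillic letter.
const UTF8_BYTES: [u8; 12] = [
    0xD0, 0x9F, 0xD1, 0x80, 0xD0, 0xB8, 0xD0, 0xB2, 0xD0, 0xB5, 0xD1, 0x82,
];

// The same word encoded as Windows-1251: one byte per letter.
const CP1251_BYTES: [u8; 6] = [0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2];

/// Thin wrapper around the standard library's UTF-8 validation.
fn is_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    // An editor renders both buffers as the same text, but only the
    // first one passes UTF-8 validation.
    println!("UTF-8 bytes valid:  {}", is_utf8(&UTF8_BYTES));
    println!("CP1251 bytes valid: {}", is_utf8(&CP1251_BYTES));
}
```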

So maybe it will be MUCH less confusing if Cargo consumes and generates everything in UTF-8, since it is native to
the Rust language…

I also forgot to say that I ran into these issues just by following the “Getting Started” instructions from the official site (with the minor exception that I use Far’s built-in editor for editing), and in my opinion that is not good.


#6

Ok, having a different default codepage does sound like it would cause this problem if editors use that for writing files by default.

However, I’m still confused by what you mean by cargo generating files. The only file that Cargo itself creates is Cargo.lock, and this is, AFAIK, pure 7-bit ASCII. I believe that 7-bit ASCII is a subset of Windows-1251 so it seems surprising that this would cause problems.

It would help us help you if you could be more specific (e.g. publish a repository that demonstrates the problem on your computer).


#7

Just following the “Getting Started” instructions from the official site… nothing more… :)
Mind you, I do not treat these issues as errors… it looks more like an unfounded, unexpected, unwanted… “feature” of Cargo to me (personally, I worked around EVERY issue, but I spent time on it… and I plan to waste no more time on it).


#8

I just went through the instructions, using “こんにちは” as the name of a Cargo package… and it worked 100% fine. You’re going to have to be more specific about what your problem is; as far as I can tell, Cargo does generate UTF-8. Knowing your platform, which specific instructions you’re following, and how you’re doing things differently would help.


#9

@dizer One thing to keep in mind when asking such questions is to specify your environment clearly. Based on a few of your responses, it sounds like you are running Windows with a Cyrillic locale and are using an editor built into the Far file manager (something that I had never heard of before). It would be easier to answer the question if you specified that up front, and posted the exact command you ran and the error message that you received.

One thing to check is whether your editor is generating a UTF-8 BOM. This is a common problem with Windows-based editors writing out UTF-8. The encoding that is called “Unicode” on Windows is actually UTF-16, and UTF-16 has an ambiguous endianness. There is a character known as the Byte Order Mark, or BOM, with value U+FEFF, which can be used to distinguish endianness if it appears as the first character in a file. So many editors on Windows will write out text with a BOM.

UTF-8, however, does not have an ambiguous endianness; it is an octet based encoding, with a defined ordering. The same character, U+FEFF, can be encoded in UTF-8 as EF BB BF, but because there is no need to specify the endianness for UTF-8, that character is vestigial, and many parsers that parse UTF-8 text don’t recognize it.
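
For reference, a minimal sketch showing that encoding U+FEFF as UTF-8 yields those three bytes, plus a check for them at the start of a buffer. `has_utf8_bom` and the sample manifest contents are hypothetical, not anything taken from Cargo.

```rust
/// The UTF-8 encoding of U+FEFF, as described above.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Returns true if the buffer begins with a UTF-8 encoded BOM.
fn has_utf8_bom(bytes: &[u8]) -> bool {
    bytes.starts_with(&UTF8_BOM)
}

fn main() {
    // Let the standard library encode U+FEFF and print the resulting bytes.
    let mut buf = [0u8; 4];
    let encoded = '\u{FEFF}'.encode_utf8(&mut buf);
    println!("U+FEFF as UTF-8: {:02X?}", encoded.as_bytes());

    // A made-up manifest as an editor might save it with a BOM in front.
    let manifest: &[u8] = b"\xEF\xBB\xBF[package]\nname = \"demo\"\n";
    println!("starts with BOM: {}", has_utf8_bom(manifest));
}
```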

I don’t have Windows with a Cyrillic locale or the Far file manager available to test out, but I tried adding a BOM to the beginning of Cargo.toml, and I get the following error:

$ cargo run
failed to parse manifest at `/Users/lambda/tmp/rust-playground/Cargo.toml`

Caused by:
  could not parse input as TOML
Cargo.toml:1:1 expected a key but found an empty string

Is that the error that you’re getting? If so, I would guess that the problem is the BOM. If not, could you post what Cargo.toml you’re using, the command you’re running, and the error you are getting?

If this is the problem, you should try to see if there is a way to get your editor to output UTF-8 without a BOM. If you can’t, I’d recommend trying another editor, like Sublime Text, Atom, Visual Studio Code, Emacs, or VIM (the latter two only if you are up for learning them, as they can have a bit of a learning curve), all of which I believe are capable of outputting UTF-8 without a BOM.


#10

That’s a good point: I guess cargo and rustc should probably handle (i.e. ignore) the BOM, given it’s not specified as invalid in the standard, AFAICT.
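
A rough sketch of what ignoring the BOM could look like before text is handed to a parser. This is an assumption about one possible approach, not a description of how cargo or rustc are actually implemented; `strip_utf8_bom` is a made-up helper.

```rust
/// Returns the input with a single leading U+FEFF removed, if present.
/// Text without a BOM is returned unchanged.
fn strip_utf8_bom(text: &str) -> &str {
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}

fn main() {
    // Hypothetical manifest text that an editor saved with a BOM.
    let with_bom = "\u{FEFF}[package]\nname = \"demo\"\n";
    let cleaned = strip_utf8_bom(with_bom);

    // After stripping, the first character is the `[` a TOML parser expects.
    println!("starts with [package]: {}", cleaned.starts_with("[package]"));
}
```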


#11

Took a look around the source of Far, and it looks like it does produce a UTF-8 BOM by default but has an option somewhere to disable it (there’s some help text that also indicates that this is the case); so I would recommend looking around in the options to figure out how to disable writing out UTF-8 with a BOM.


#12

@DanielKeep - NO, Cargo.toml is generated in the ANSI CP (in my case, win1251), so if you add some extra chars
to the authors section (like authors = ["проба"]), you get the error (file is not UTF-8 encoded).


#13

@dizer You are being a bit unclear. When you say “generated”, do you mean by your editor or by Cargo? The only thing that Cargo generates is Cargo.lock; Cargo.toml is just input.

Cargo expects Cargo.toml to be encoded in UTF-8. As I explained, it expects UTF-8 without a Byte Order Mark. Try saving your file in UTF-8 without a BOM, and it should work correctly, even with Cyrillic text in it.


#14

No. The problem is NOT in Far Manager, the conversion, or even the presence of a BOM. The problem comes from some inconsistencies (and it is not an error in the usual sense) in Cargo’s output: if everything were produced in UTF-8, there would be no problems at all.


#15

Well, Cargo.toml was generated by Cargo itself with the “cargo new … --bin” command line, and it is in the ANSI CP.


#16

(Incidentally, you can select part of someone’s post and click “quote reply”; that should notify the person you’re responding to directly.)

Again, you need to provide more details. My own testing shows that cargo new definitely creates a UTF-8-encoded Cargo.toml.

It almost sounds like you’re generating a Cargo.toml, editing it to add some non-ASCII text (authors = ["проба"]), saving it, and then it doesn’t work.

If that is what’s happening, then it’s your editor’s fault; whatever the input is, it’s not saving valid UTF-8. In that case, you’ll just have to use a different editor; supporting non-UTF-8 8-bit encodings would be a terrible idea.
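
If you want to confirm whether a saved manifest is valid UTF-8, something along these lines can point at the first offending byte. It is only a sketch: `first_invalid_utf8_offset` is a hypothetical helper, and the sample bytes assume the editor wrote “проба” as Windows-1251.

```rust
use std::str;

/// Returns the byte offset of the first invalid UTF-8 sequence,
/// or None if the whole buffer is valid UTF-8.
fn first_invalid_utf8_offset(bytes: &[u8]) -> Option<usize> {
    str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}

fn main() {
    // `authors = ["проба"]` with the Cyrillic word stored as Windows-1251
    // bytes (assumed editor behaviour, one byte per letter).
    let saved_as_cp1251: &[u8] = b"authors = [\"\xEF\xF0\xEE\xE1\xE0\"]\n";

    match first_invalid_utf8_offset(saved_as_cp1251) {
        Some(offset) => println!("not UTF-8: first bad byte at offset {}", offset),
        None => println!("valid UTF-8"),
    }
}
```

In practice the bytes would come from `std::fs::read` on the real Cargo.toml rather than from a literal.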


#17

1. Thanks.
2. No… I checked it once more… It is ANSI (1251), sorry… I use the default Rust 1.1 installation, running under Windows 8.1 64-bit.
3. If so… then I wonder if you can explain why main.rs is displayed as UTF-8, but Cargo.toml as ANSI encoded, in my “wrong” editor… yet if I convert it into UTF-8 with my wrong editor, everything works just fine.
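
Converting to UTF-8 in the editor amounts to reinterpreting each Windows-1251 byte as a character and re-encoding it. Below is a deliberately partial sketch, assuming only ASCII plus the basic Russian letters are present. `decode_cp1251_subset` is a made-up name, and a real converter would need the full code page table.

```rust
/// Decodes a subset of Windows-1251: ASCII, the letters А..я
/// (bytes 0xC0..=0xFF), Ё (0xA8) and ё (0xB8).
/// Returns None if any other byte is encountered.
fn decode_cp1251_subset(bytes: &[u8]) -> Option<String> {
    bytes
        .iter()
        .map(|&b| match b {
            // ASCII maps to the same code points.
            0x00..=0x7F => Some(b as char),
            0xA8 => Some('Ё'),
            0xB8 => Some('ё'),
            // 0xC0..=0xFF are А..я, contiguous from U+0410 in Unicode.
            0xC0..=0xFF => char::from_u32(0x0410 + u32::from(b - 0xC0)),
            // Everything else is outside this partial table.
            _ => None,
        })
        .collect()
}

fn main() {
    // "проба" as Windows-1251 bytes.
    let cp1251 = [0xEF, 0xF0, 0xEE, 0xE1, 0xE0];
    if let Some(text) = decode_cp1251_subset(&cp1251) {
        // A Rust String is always UTF-8, so these are the converted bytes.
        println!("{} -> {:02X?}", text, text.as_bytes());
    }
}
```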


#18

Can you tell me if the file starts out with any Cyrillic characters in it after you run cargo new ... --bin, or if you are adding them when you edit it?

If it doesn’t contain such Cyrillic characters, then Cargo is likely writing it out in ASCII, which both CP1251 and UTF-8 are supersets of. In that case, if your editor interprets it as CP1251, then it is likely a problem where your editor defaults to using CP1251 when inserting into a document of otherwise just plain ASCII.

The other possibility is that Cargo is actually writing it out with the Cyrillic characters encoded in CP1251. In that case, it would likely be because it is getting them from the command line or from your environment encoded in CP1251, and not properly decoding them. I would be surprised if this were the case, as I would expect it to fail UTF-8 validation, but it’s possible that it could happen. If this is what is happening, then it is a bug in Cargo.

But to help figure out which it is, could you please paste in the exact command you are running and the exact output you see? Without that, it is hard to tell what is going on; I don’t have a Windows system in a Cyrillic locale with Far installed to test against (and most other people in the thread probably don’t either), so if you post your exact results it will be a lot easier to tell what’s happening.


#19

Previously, I tested that on Linux. I just tested it on Windows 7 64-bit.

F:\Programming\Rust\sandbox\cargo-test>chcp
Active code page: 850

F:\Programming\Rust\sandbox\cargo-test>cargo new tëstØ --bin

F:\Programming\Rust\sandbox\cargo-test>cd tëstØ

F:\Programming\Rust\sandbox\cargo-test\tëstØ>dir
 Volume in drive F is Stuff
 Volume Serial Number is 446C-4027

 Directory of F:\Programming\Rust\sandbox\cargo-test\tëstØ

03/07/2015  03:46 PM    <DIR>          .
03/07/2015  03:46 PM    <DIR>          ..
03/07/2015  03:46 PM                 7 .gitignore
03/07/2015  03:46 PM                95 Cargo.toml
03/07/2015  03:46 PM    <DIR>          src
               2 File(s)            102 bytes
               3 Dir(s)  46,149,648,384 bytes free

F:\Programming\Rust\sandbox\cargo-test\tëstØ>type Cargo.toml
[package]
name = "tëstØ"
version = "0.1.0"
authors = ["Daniel Keep <daniel.keep@gmail.com>"]

This alone proves that the manifest is UTF-8: the command was entered in codepage 850, which was translated by Windows into Unicode, which Cargo then translated into UTF-8.

Here is what happens when you build the above:

F:\Programming\Rust\sandbox\cargo-test\tëstØ>cargo build
   Compiling tëstØ v0.1.0 (file:///F:/Programming/Rust/sandbox/cargo-test/t%C3%ABst%C3%98)
warning: crate `tëstØ` should have a snake case name such as `tëst_ø`, #[warn(non_snake_case)] on by default

I then edited the manifest in Sublime Text 3 to add “проба” as an author. Saved, re-ran:

F:\Programming\Rust\sandbox\cargo-test\tëstØ>cargo clean

F:\Programming\Rust\sandbox\cargo-test\tëstØ>cargo build
   Compiling tëstØ v0.1.0 (file:///F:/Programming/Rust/sandbox/cargo-test/t%C3%ABst%C3%98)
warning: crate `tëstØ` should have a snake case name such as `tëst_ø`, #[warn(non_snake_case)] on by default

F:\Programming\Rust\sandbox\cargo-test\tëstØ>

It works fine.
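
The `file:///` URL in the build output above contains percent-escapes; decoding them shows how the non-ASCII name is represented as UTF-8 bytes there. A minimal sketch with a hand-rolled decoder (`percent_decode_utf8` is hypothetical and ignores edge cases a real URL library would handle):

```rust
/// Decodes %XX escapes into raw bytes, then checks that the result is
/// valid UTF-8. Returns None on a malformed escape or invalid UTF-8.
fn percent_decode_utf8(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            // The two characters after '%' must be hex digits.
            let hex = input.get(i + 1..i + 3)?;
            out.push(u8::from_str_radix(hex, 16).ok()?);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out).ok()
}

fn main() {
    // Last path segment from the `Compiling` line in the transcript.
    let segment = "t%C3%ABst%C3%98";
    println!("{:?}", percent_decode_utf8(segment));
}
```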

Edit: Having just seen your reply, I did it again with codepage 1251:

F:\Programming\Rust\sandbox\cargo-test>chcp
Active code page: 850

F:\Programming\Rust\sandbox\cargo-test>chcp 1251
Active code page: 1251

F:\Programming\Rust\sandbox\cargo-test>cargo new testпроба --bin

F:\Programming\Rust\sandbox\cargo-test>cd testпроба

F:\Programming\Rust\sandbox\cargo-test\testпроба>type Cargo.toml
[package]
name = "testРїСЂРѕР±Р°"
version = "0.1.0"
authors = ["Daniel Keep <daniel.keep@gmail.com>"]

F:\Programming\Rust\sandbox\cargo-test\testпроба>cargo build
   Compiling testпроба v0.1.0 (file:///F:/Programming/Rust/sandbox/cargo-test/test%D0%BF%D1%80%D0%BE%D0%B1%D0%B0)
error: linking with `gcc` failed: exit code: 1
note: "gcc" "-Wl,--enable-long-section-names" "-fno-use-linker-plugin" "-Wl,--nxcompat" "-Wl,--large-address-aware" "-sh
ared-libgcc" "-L" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib" "F:\Programming\Rust\sandbox\cargo-test\testпро
ба\target\debug\testпроба.o" "-o" "F:\Programming\Rust\sandbox\cargo-test\testпроба\target\debug\testпроба.exe" "-Wl,--g
c-sections" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\libstd-74fa456f.rlib" "F:\Programs\Rust\bin\rustlib\i6
86-pc-windows-gnu\lib\libcollections-74fa456f.rlib" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\librustc_unico
de-74fa456f.rlib" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\librand-74fa456f.rlib" "F:\Programs\Rust\bin\rus
tlib\i686-pc-windows-gnu\lib\liballoc-74fa456f.rlib" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\liblibc-74fa4
56f.rlib" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\libcore-74fa456f.rlib" "-L" "F:\Programming\Rust\sandbox
\cargo-test\testпроба\target\debug" "-L" "F:\Programming\Rust\sandbox\cargo-test\testпроба\target\debug\deps" "-L" "F:\P
rograms\Rust\bin\rustlib\i686-pc-windows-gnu\lib" "-L" "F:\Programming\Rust\sandbox\cargo-test\testпроба\.rust\bin\i686-
pc-windows-gnu" "-L" "F:\Programming\Rust\sandbox\cargo-test\testпроба\bin\i686-pc-windows-gnu" "-L" "F:\Programming\Rus
t\sandbox\.rust\bin\i686-pc-windows-gnu" "-Wl,-Bstatic" "-Wl,-Bdynamic" "-l" "ws2_32" "-l" "userenv" "-l" "advapi32" "-l
" "compiler-rt"
note: gcc.exe: error: F:\Programming\Rust\sandbox\cargo-test\test?????\target\debug\test?????.o: Invalid argument

error: aborting due to previous error
Could not compile `testпроба`.

To learn more, run the command again with --verbose.

This isn’t a problem with the manifest or source; it looks like GCC itself doesn’t support those characters in a path. That’s unfortunate, but it’s GCC’s problem, not Cargo’s or Rust’s.

Again, I added “проба” as an author and rebuilt:

F:\Programming\Rust\sandbox\cargo-test\testпроба>cargo build
   Compiling testпроба v0.1.0 (file:///F:/Programming/Rust/sandbox/cargo-test/test%D0%BF%D1%80%D0%BE%D0%B1%D0%B0)
error: linking with `gcc` failed: exit code: 1
note: "gcc" "-Wl,--enable-long-section-names" "-fno-use-linker-plugin" "-Wl,--nxcompat" "-Wl,--large-address-aware" "-sh
ared-libgcc" "-L" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib" "F:\Programming\Rust\sandbox\cargo-test\testпро
ба\target\debug\testпроба.o" "-o" "F:\Programming\Rust\sandbox\cargo-test\testпроба\target\debug\testпроба.exe" "-Wl,--g
c-sections" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\libstd-74fa456f.rlib" "F:\Programs\Rust\bin\rustlib\i6
86-pc-windows-gnu\lib\libcollections-74fa456f.rlib" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\librustc_unico
de-74fa456f.rlib" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\librand-74fa456f.rlib" "F:\Programs\Rust\bin\rus
tlib\i686-pc-windows-gnu\lib\liballoc-74fa456f.rlib" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\liblibc-74fa4
56f.rlib" "F:\Programs\Rust\bin\rustlib\i686-pc-windows-gnu\lib\libcore-74fa456f.rlib" "-L" "F:\Programming\Rust\sandbox
\cargo-test\testпроба\target\debug" "-L" "F:\Programming\Rust\sandbox\cargo-test\testпроба\target\debug\deps" "-L" "F:\P
rograms\Rust\bin\rustlib\i686-pc-windows-gnu\lib" "-L" "F:\Programming\Rust\sandbox\cargo-test\testпроба\.rust\bin\i686-
pc-windows-gnu" "-L" "F:\Programming\Rust\sandbox\cargo-test\testпроба\bin\i686-pc-windows-gnu" "-L" "F:\Programming\Rus
t\sandbox\.rust\bin\i686-pc-windows-gnu" "-Wl,-Bstatic" "-Wl,-Bdynamic" "-l" "ws2_32" "-l" "userenv" "-l" "advapi32" "-l
" "compiler-rt"
note: gcc.exe: error: F:\Programming\Rust\sandbox\cargo-test\test?????\target\debug\test?????.o: Invalid argument

error: aborting due to previous error
Could not compile `testпроба`.

To learn more, run the command again with --verbose.

Again, not Rust’s problem.

It looks more and more like your editor is to blame. I can’t reproduce whatever your problem is.


#20

Yes, exactly… I add some Cyrillic chars AFTER generation.

Yet after a bit of pondering… I must agree with huon… the problem is BOM related… the editor has no reason to treat the file as UTF-8 encoded if it is in plain English and there is no BOM (and in this case choosing the default system CP is preferable). So it is not rustc’s, Cargo’s, or the editor’s problem.