To add to your `String`/`Path` problem: hardly anybody uses `OsStr(ing)` for file names and `args_os()` for path arguments. Even `clap` hardcoded a `String` assumption, and you can panic Cargo and almost every Rust tool by providing a non-Unicode path.
Though I feel a maintainer is entitled to say "I don't support non-UTF-8 paths, please don't use weird byte [sequences] in your filenames."
Last I checked, Cargo catches this (and exits, but catches this). It used to panic. `clap` can handle it now... if you ask it to. The default is still to panic. So there has been some improvement... but not nearly as much as I wish. I doubt it will happen so long as `OsString` has such severely limited capabilities, though.
You have to carry it around as an `OsStr`/`OsString` forever, yeah. And you can't `write!` it; if you don't mind printing a mangled version, you can temporarily make it a `&Path` so you can `.display()` it (why doesn't `OsStr` have this directly?). `bstr` can help. There are some other libraries that will print escaped versions instead of mangled ones.
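For illustration, a minimal sketch of the `.display()` workaround mentioned above (assuming a Rust version where `OsStr` itself has no `Display` impl):

```rust
use std::ffi::OsStr;
use std::path::Path;

fn main() {
    // A perfectly printable name comes through unchanged.
    let ok = OsStr::new("café.txt");
    assert_eq!(Path::new(ok).display().to_string(), "café.txt");

    // On Unix we can build an OsStr that is not valid UTF-8;
    // .display() mangles it by substituting U+FFFD.
    #[cfg(unix)]
    {
        use std::os::unix::ffi::OsStrExt;
        let bad = OsStr::from_bytes(b"caf\xe9.txt"); // Latin-1 'é'
        assert_eq!(Path::new(bad).display().to_string(), "caf\u{fffd}.txt");
    }
    println!("ok");
}
```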
I wouldn't call them weird; EUC is still going strong in East Asia.

But given the limitations, I don't disagree with that stance either, necessarily, if it makes sense for the application. Arguably it makes sense for Cargo, say, since module names are also UTF-8 in Rust. But it does irk me when I see it in a "here's my oxidized `cat`, you'll never need the original" or whatever.
Tbh using `as_bytes` on Unix and `to_string_lossy().as_bytes()` on Windows will work well for almost all applications. The bytes could be used via `bstr` for consistency.
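A sketch of that pattern, with a hypothetical `os_str_bytes` helper (the `cfg` split and the name are mine, not from any library):

```rust
use std::ffi::OsStr;

// Hypothetical helper: exact bytes on Unix, lossy UTF-8 bytes on Windows.
#[cfg(unix)]
fn os_str_bytes(s: &OsStr) -> Vec<u8> {
    use std::os::unix::ffi::OsStrExt;
    s.as_bytes().to_vec() // lossless: Unix paths are just bytes
}

#[cfg(windows)]
fn os_str_bytes(s: &OsStr) -> Vec<u8> {
    // Unpaired surrogates get replaced with U+FFFD (the corner case).
    s.to_string_lossy().as_bytes().to_vec()
}

fn main() {
    // On both platforms, well-formed names come through unchanged.
    assert_eq!(os_str_bytes(OsStr::new("hello.txt")), b"hello.txt".to_vec());
    println!("ok");
}
```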
Adding to this, I don't think I have ever purposely, either with code or manually, generated a filename that had a non-ASCII character in it. IMHO, even UTF-8 filenames seem like asking for trouble.
Is English your primary language?
FWIW English is not my primary language (Hebrew is), and I haven't really used non-ASCII filenames since, IDK, 10th grade (for various reasons).
Not sure it represents anything significant; I expect it's more common to use non-English filenames with non developers.
Edit: just to clarify, modern systems default Hebrew to UTF-8, so it's not much of an issue.

A few years back you could have found some weird extended-ASCII encodings; some old websites still serve those, but they look completely broken in a modern web browser unless you manually tweak the encoding.
Even ignoring the valid/invalid UTF-8 issue, I think there is another, more important difference between `str`, `OsStr`, and `Path`. Namely that one is for text, another is for passing strings to the OS, and the last is for filenames. Even though they are all (effectively) newtypes around each other, each of them has different semantics with domain-specific methods, and just by choosing one over the other you are documenting what the value represents.

It's so annoying when you're writing code in another language and have to ask yourself, "does this string contain the contents of my file, or just its name?"
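A toy sketch of how picking the type documents intent (std only; the conversions themselves are cheap):

```rust
use std::ffi::OsStr;
use std::path::Path;

fn main() {
    let s: &str = "notes.txt";      // text: guaranteed valid UTF-8
    let os: &OsStr = OsStr::new(s); // OS boundary: arbitrary platform string
    let p: &Path = Path::new(os);   // filename: path-specific semantics

    // Each layer has domain-specific methods the others lack:
    assert_eq!(p.extension(), Some(OsStr::new("txt")));
    // Going back down to &str is fallible, since an OsStr need not be UTF-8:
    assert_eq!(os.to_str(), Some("notes.txt"));
    println!("ok");
}
```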
English is 2nd but current primary language. Did you start with K&R C or xterm/urxvt? I wonder if it's an era thing, where in certain time periods, outside of APL, languages did not even support Unicode variable names, and terminals may or may not have had easy Unicode input, which would have made non-ASCII file names annoying to work with.
It should be `.encode_wide().collect()`, as on Windows the OS-specific string actually is an array of `u16`s which may not be a valid UTF-16LE sequence. If you're OK with the lossy encoding, why do you preserve bytes on Unix?
On most Unixes, UTF-8 is only a convention. As noted above, other encodings are still in widespread use for some languages.
On Windows, UTF-16 is the one true encoding. All other encodings are converted to it. Indeed, if you set the code page to UTF-8 it will do lossy conversion that's equivalent to Rust's lossy string conversion.
On my personal systems / files / whatever, I generally strive to stick to an even stricter subset... roughly, nothing that needs escaping, and almost always lowercase. (Almost all my file management happens on the command line. Not to mention my sloppy run-once scripts.)

However, that's not really the point when it comes to utilities. No matter where they came from (non-ASCII-language download, buggy script, fs corruption, malicious actor...), these files sometimes exist. Utilities have to be able to work with them to be considered complete.
As for applications, if you're not going to support it, exit with a nice error message instead of panicking. Triggering a crash when someone runs `app $'\xff'` is just not a good look IMO. (Still better than being buggy, of course.)
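One way to do that with std only (a sketch; `require_utf8` is a made-up helper name):

```rust
use std::ffi::OsStr;

// Validate an argument up front instead of unwrapping it later.
fn require_utf8(arg: &OsStr) -> Result<&str, String> {
    arg.to_str()
        .ok_or_else(|| format!("error: {:?} is not valid UTF-8 (unsupported)", arg))
}

fn main() {
    assert!(require_utf8(OsStr::new("fine.txt")).is_ok());
    #[cfg(unix)]
    {
        use std::os::unix::ffi::OsStrExt;
        // 0xFF can never appear in valid UTF-8.
        assert!(require_utf8(OsStr::from_bytes(b"\xff")).is_err());
    }
    // In a real app: eprintln!() the Err and std::process::exit(1),
    // rather than letting env::args() panic for you.
    println!("ok");
}
```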
The argument here is along practical lines:

> The reason why using byte strings for this is potentially superior than the standard library’s approach is that a lot of Rust code is already lossily converting file paths to Rust’s Unicode strings, which are required to be valid UTF-8, and thus contain latent bugs on Unix where paths with invalid UTF-8 are not terribly uncommon. If you instead use byte strings, then you’re guaranteed to write correct code for Unix, at the cost of getting a corner case wrong on Windows.
Conceivably there could be a "`wstr`" library to work on `u16` strings, but I didn't see one from a quick scan of crates.io.
I have gone out of my way to try to Do It Right(tm) with regards to filenames, environment variables and such. But all attempts at being consistent crumble as soon as one needs to store the name/string outside of the process.
I found myself being shocked and annoyed at rusqlite for not supporting `PathBuf`, but then it occurred to me... of course it doesn't -- what format should it store them in? And how should it handle them on a "foreign" platform? You end up needing to tag the data in some manner to indicate what platform it originated from, and then having to add support for foreign encodings in your code.
I’m certainly sympathetic to that, but even technical users might not have the luxury of fully Unicode file names. I’ve still got files on my PC that have been copied from old to new machine since the mid 90s, across multiple OSes and whatever character encodings were in use at the time. Luckily I’m a native UK English speaker, so it’s only very occasionally a problem (a weird-looking name in `ls` or Nautilus), but the idea of programs panicking on encountering such files doesn’t fill me with confidence, especially as they might panic part way through a difficult-to-resume operation or potentially lose data.
Though it makes me wonder if there’s a decent file name encoding fixer-upperer for Linux out there.
String encodings strike again!