Path, OsStr, and supporting non-UTF-8 paths/inputs

To add to your String/Path problem — hardly anybody uses OsStr(ing) for file names and args_os() for path arguments. Even clap hardcoded a String assumption, and you can panic Cargo and almost every Rust tool by providing a non-Unicode path.
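For illustration, a minimal sketch of taking path arguments without the String assumption. The helper name `path_args` is made up for this example; in a real program its input would typically be `std::env::args_os().skip(1)`.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Turns raw OS arguments into paths without requiring valid UTF-8.
/// `path_args` is a hypothetical helper, not part of any library.
fn path_args<I>(args: I) -> Vec<PathBuf>
where
    I: IntoIterator<Item = OsString>,
{
    // PathBuf::from(OsString) is lossless: no decoding happens here.
    args.into_iter().map(PathBuf::from).collect()
}
```

By contrast, `std::env::args()` is documented to panic during iteration if any argument is not valid Unicode, which is one way the panics mentioned above come about.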


Though I feel a maintainer is entitled to say "I don't support non UTF-8 paths, please don't use weird bytes [groups] in your filenames."


Last I checked, Cargo catches this (and exits, but catches this). It used to panic. clap can handle it now... if you ask it to. Default is still to panic. So there has been some improvement... but not nearly so much as I wish. Doubt it will happen so long as OsString has severely limited capabilities though.

You have to carry it around as an OsStr/OsString forever, yeah. And you have to write it out as raw bytes yourself, or if you don't mind printing a mangled version, temporarily make it a &Path so you can .display() it (why doesn't OsStr have this directly?). bstr can help. There are also other libraries that will print escaped versions instead of mangled versions.
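A rough sketch of the two printing options using only the standard library. The function names `mangled` and `escaped` are invented for this example, and using the `Debug` output as the "escaped" form is an assumption of the sketch, not what any particular library does.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Lossy rendering via `Path::display()`: invalid sequences are
/// replaced with U+FFFD, so the result may not round-trip.
fn mangled(name: &OsStr) -> String {
    Path::new(name).display().to_string()
}

/// Escaped rendering via the `Debug` impl: the output is quoted and
/// invalid sequences are shown as escapes rather than replaced.
fn escaped(name: &OsStr) -> String {
    format!("{:?}", name)
}
```

The mangled form is fine for messages meant for humans; neither form should be fed back into a filesystem call, since only the original OsStr is guaranteed to name the same file.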

I wouldn't call them weird; EUC is still going strong in SE Asia.

But given the limitations, I don't disagree with that stance either, necessarily, if it makes sense for the application. Arguably it makes sense for Cargo, say, since module names are also UTF-8 in Rust. But it does irk me when I see it in a "here's my oxidized cat, you'll never need the original" or whatever.


Tbh using as_bytes on Unix and to_string_lossy().as_bytes() on Windows will work well for almost all applications. The bytes could be used via bstr for consistency.
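As a sketch of that approach (the name `os_bytes` is hypothetical, and the non-Unix branch is assumed to be acceptable for every platform that is not Unix, not just Windows):

```rust
use std::borrow::Cow;
use std::ffi::OsStr;

/// Unix: borrow the raw bytes, nothing is lost.
#[cfg(unix)]
fn os_bytes(s: &OsStr) -> Cow<'_, [u8]> {
    use std::os::unix::ffi::OsStrExt;
    Cow::Borrowed(s.as_bytes())
}

/// Elsewhere (e.g. Windows): lossy conversion to UTF-8 bytes.
/// Unpaired surrogates become U+FFFD, so this does not round-trip.
#[cfg(not(unix))]
fn os_bytes(s: &OsStr) -> Cow<'_, [u8]> {
    Cow::Owned(s.to_string_lossy().into_owned().into_bytes())
}
```

The returned bytes can then be handed to byte-string helpers such as those in bstr, so the rest of the code sees one type on every platform.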

Adding to this, I don't think I have ever purposely, either with code or manually, generated a filename that had a non-ASCII character in it. IMHO, even UTF-8 filenames seem like asking for trouble.

Is English your primary language?


FWIW English is not my primary language (Hebrew is), and I haven't really used non-ASCII filenames since, IDK, 10th grade (for various reasons).

Not sure it represents anything significant; I expect it's more common to use non-English filenames with non developers.

Edit: just to clarify, modern systems default to UTF-* for Hebrew, so it's not much of an issue.
A few years back you could have found some weird extended ASCII encodings; some old websites still serve those, but they look completely broken in a modern web browser unless you manually tweak the encoding.

Even ignoring the valid/invalid UTF-8 issue I think there is another, more important difference between str, OsStr, and Path.

Namely that one is for text, another is for passing strings to the OS, and the last is for filenames. Even though they are all (effectively) newtypes around each other, each of them has different semantics with domain-specific methods, and just by choosing one over the other you are documenting what the value represents.

It's so annoying when you are writing code in another language and have to ask yourself, "does this string contain the contents of my file, or just its name?"
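A small sketch of how the three layers relate, using only standard library calls. The function names `layers` and `back_to_text` are invented for this example.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Going "down" from text to OS string to path is free and infallible:
/// each step is a borrowed view of the same data.
fn layers(text: &str) -> (&OsStr, &Path) {
    let os: &OsStr = OsStr::new(text);
    (os, Path::new(os))
}

/// Going back "up" to text can fail, because a path is not required
/// to be valid Unicode.
fn back_to_text(path: &Path) -> Option<&str> {
    path.to_str()
}
```

A signature like `fn load(path: &Path)` versus `fn parse(contents: &str)` then answers the "name or contents?" question at the type level.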


English is my 2nd but current primary language. Did you start with K&R C or xterm/urxvt? I wonder if it's an era thing where in certain time periods, outside of APL, languages did not even support Unicode variable names, and terminals may or may not have had easy Unicode input, which would have made non-ASCII file names annoying to work with.

It should be .encode_wide().collect(), as on Windows the OS-specific string actually is an array of u16s which may not be a valid UTF-16LE sequence. If you're OK with the lossy encoding, why do you preserve bytes on Unix?
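To make the "may not be valid UTF-16" point concrete, here is a hedged sketch. `decode_strict`, `decode_lossy`, and `wide_units` are names made up for this example; only `wide_units` is Windows-specific, and it is the piece that actually calls `encode_wide`.

```rust
/// Strict decoding: returns None if `units` contains an unpaired
/// surrogate, which Windows file names are allowed to contain.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: unpaired surrogates are replaced with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Windows only: the lossless view of an OS string as u16 units.
#[cfg(windows)]
fn wide_units(s: &std::ffi::OsStr) -> Vec<u16> {
    use std::os::windows::ffi::OsStrExt;
    s.encode_wide().collect()
}
```

Keeping the `Vec<u16>` around is what preserves the name exactly; either decoding function is only appropriate for display.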


On most Unixes UTF-8 is only a convention. As noted above, other encodings are still in widespread use for some languages.

On Windows, UTF-16 is the one true encoding. All other encodings are converted to it. Indeed, if you set the code page to UTF-8 it will do lossy conversion that's equivalent to Rust's lossy string conversion.


On my personal systems / files / whatever, I generally strive to stick to an even stricter subset... roughly, nothing that needs escaped, and almost always lowercase. (Almost all my file management happens on the command line. Not to mention my sloppy run-once scripts.)

However, that's not really the point when it comes to utilities. No matter where they came from (non-ASCII-language download, buggy script, fs corruption, malicious actor...), these files sometimes exist. Utilities have to be able to work with them to be considered complete.

As for applications, if you're not going to support it, exit with a nice error message instead of panicking. Triggering a crash when someone runs app $'\xff' is just not a good look IMO. (Still better than being buggy of course.)
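One way to do that, as a minimal sketch: accept the argument as an OsString and convert explicitly, so the failure becomes an ordinary error value. The helper name `utf8_arg` and the message wording are assumptions of this example.

```rust
use std::ffi::OsString;

/// Converts one argument to a String, or returns a readable error
/// message instead of panicking. `utf8_arg` is a hypothetical helper.
fn utf8_arg(arg: OsString) -> Result<String, String> {
    arg.into_string()
        // On failure, `into_string` hands the original OsString back,
        // so it can be shown (escaped) in the message.
        .map_err(|bad| format!("error: argument {:?} is not valid UTF-8", bad))
}
```

A caller would typically loop over `std::env::args_os().skip(1)`, print the message with `eprintln!`, and call `std::process::exit` with a nonzero code on the first error.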

The argument here is along practical lines:

The reason why using byte strings for this is potentially superior than the standard library’s approach is that a lot of Rust code is already lossily converting file paths to Rust’s Unicode strings, which are required to be valid UTF-8, and thus contain latent bugs on Unix where paths with invalid UTF-8 are not terribly uncommon. If you instead use byte strings, then you’re guaranteed to write correct code for Unix, at the cost of getting a corner case wrong on Windows.

Conceivably there could be a "wstr" library to work on u16 strings, but I didn't see one from a quick scan of crates.io.


I have gone out of my way to try to Do It Right(tm) with regard to filenames, environment variables and such. But all attempts at being consistent crumble as soon as one needs to store the name/string outside of the process.

I found myself being shocked and annoyed at rusqlite for not supporting PathBuf, but then it occurred to me ... of course it doesn't. What format should it store them in? And how should it handle them on a "foreign" platform? You end up needing to tag the data in some manner to indicate what platform it originated from, and then having to add support for foreign encodings in your code.
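A rough sketch of what such tagging could look like. Everything here (the `StoredPath` type, its variants, and its methods) is hypothetical and only illustrates the idea; how the tag and payload are actually written to a database is left out.

```rust
use std::ffi::OsStr;

/// A path captured together with the platform encoding it came from.
#[derive(Debug, Clone, PartialEq, Eq)]
enum StoredPath {
    UnixBytes(Vec<u8>),    // raw bytes, as on Unix
    WindowsWide(Vec<u16>), // u16 units, possibly ill-formed UTF-16
    Utf8(String),          // fallback for other platforms (lossy)
}

impl StoredPath {
    #[cfg(unix)]
    fn capture(p: &OsStr) -> Self {
        use std::os::unix::ffi::OsStrExt;
        StoredPath::UnixBytes(p.as_bytes().to_vec())
    }

    #[cfg(windows)]
    fn capture(p: &OsStr) -> Self {
        use std::os::windows::ffi::OsStrExt;
        StoredPath::WindowsWide(p.encode_wide().collect())
    }

    #[cfg(not(any(unix, windows)))]
    fn capture(p: &OsStr) -> Self {
        StoredPath::Utf8(p.to_string_lossy().into_owned())
    }

    /// A best-effort readable form that works on any platform,
    /// including for values that originated on a "foreign" one.
    fn to_lossy_string(&self) -> String {
        match self {
            StoredPath::UnixBytes(b) => String::from_utf8_lossy(b).into_owned(),
            StoredPath::WindowsWide(w) => String::from_utf16_lossy(w),
            StoredPath::Utf8(s) => s.clone(),
        }
    }
}
```

Turning a foreign value back into something the local filesystem can open is the hard part, and this sketch deliberately does not attempt it.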


I’m certainly sympathetic to that, but even technical users might not have the luxury of fully Unicode file names. I’ve still got files on my PC that have been copied from old to new machine since the mid 90s, across multiple OSes and whatever character encodings were in use at the time. Luckily I’m a native UK English speaker so it’s only very occasionally a problem (a weird-looking name in ls or Nautilus), but the idea of programs panicking on encountering such files doesn’t fill me with confidence—especially as they might panic part way through a difficult to resume operation or potentially lose data.

Though it makes me wonder if there’s a decent file name encoding fixer-upperer for Linux out there.


String encodings strike again!