Disappointed with Path

Then treat my comment as an important caveat to your advice that it may result in significantly fewer users. I'm not kidding or exaggerating when I say that ripgrep would effectively not exist if regexes or globs required valid UTF-8 to work. It's not just trying to maximize the total number of users. It's about getting any users at all in the first place.

My obvious implication here is that if that's the natural conclusion of your advice, then maybe it's not good advice to be giving generally.

I was responding specifically to your point about requiring UTF-8 for everything. Namely that "the entire world has migrated to UTF-8" (not true) and "seeing any other encoding is archaic and anachronistic at best" (not true) and "downright lazy of the developer at worst" (again, not true).

Making regexes and globs work on &[u8] is a necessary but not sufficient criterion for making them work on OsStr. And there absolutely is interpretation required, because you need to design your algorithms to be UTF-8 aware by convention. That doesn't just happen for free.
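To make the "UTF-8 aware by convention" point concrete, here is a minimal std-only sketch (the byte literals are invented, and the naive `find` stands in for a real regex/glob engine): searching raw bytes needs no UTF-8 validity at all, which is exactly what a `&str`-only API rules out.

```rust
// Searching works fine on arbitrary bytes; any Unicode-aware semantics
// (e.g. matching on codepoint boundaries) must be layered on top by
// convention rather than enforced by the type.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    // 0xFF is never valid in UTF-8, so a String/&str API could not
    // accept this haystack at all.
    let haystack = b"foo\xFFbar";
    assert!(std::str::from_utf8(haystack).is_err());
    assert_eq!(find(haystack, b"bar"), Some(4));
    println!("ok");
}
```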

3 Likes

Tough luck, then. I simply don't believe that "I want a custom find() impl on paths, I want it to support IBM mainframes, and working on bytes isn't good enough for me" is a problem worth solving. At least it should not be the problem of the makers of a modern programming language/library in 2020. Such specialized needs would already warrant writing one's own path type.

Not even Rust's designers/implementors can anticipate everything.

Hang on here. You said it wasn't unfixable, but it is if the internal representation isn't exposed. That's a different thing entirely from saying, "yes, it is unfixable and that is by design, so tough cookies."

You also conveniently picked the most easily dismissed example. :wink: It's much harder to say, "too bad, you can't run globs on paths without either sacrificing the edge cases that OsStr was designed to fix or paying some additional cost."

You're also coming across as really combative here. Pointing out the flaws in a design does not mean I am opposed to the design itself! The purpose of pointing out flaws is to understand them and see whether we can find a way to fix them. (As the irlo thread I linked to above was trying to do.)

7 Likes

I didn't claim that at all. I said that practically the entire world has moved on, with the understanding that the places that haven't done so are woefully outdated for whatever reason.
Now there's something to be said for accommodating certain use cases and users, but I only do that if it fits with my original goals. If there's a conflict then it's not even a contest, my greater goal wins flat out. It's why I wrote the crate in the first place, mind you. And usually that conflict comes in the form of limited resources, mostly time.

How about some concrete counterexamples where a piece of software has not moved on to UTF-8, yet is not archaic, outdated and anachronistic? Everything I have seen in my professional career since roughly the mid-oughties (whether I was working on it or not) uses UTF-8. Even Windows supports it (though internally it's stuck on UTF-16 for historic reasons, which doesn't make it any less outdated and anachronistic).

Again, may I have some concrete counterexamples?

1 Like

Seems like splitting hairs to me. The Java and Windows ecosystems aren't "woefully outdated." They aren't legacy and are actively developed today. But if you're going to claim that those things are woefully outdated, then I suppose we just have a difference of opinion. If so, your opinion seems pretty extreme to me and would result in a lot of problems in practice for a lot of users, particularly folks in the Windows ecosystem. Ironically, if everyone had that view, then the design of OsStr might be very different today (and WTF-8 may never have been created).

3 Likes

Okay, but note that one of the costs is that serializing a Path or OsStr field on Windows is always inefficient today. Even on valid Unicode paths there is no way to avoid validating the whole string and/or allocating and decoding into a wide string. (Serde only does the latter, which is more expensive in the best case but cheaper in the worst case.)
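That validation cost is visible in std's own API: converting an OsString into a String is fallible precisely because the type carries no UTF-8 guarantee, so even valid contents must be scanned. A minimal sketch (the invalid byte is invented, and the Unix-only constructor stands in for the analogous invalid-UTF-16 case on Windows):

```rust
#[cfg(unix)]
fn main() {
    use std::ffi::OsString;
    use std::os::unix::ffi::OsStringExt;

    // Even an all-ASCII OsString must be validated before it can
    // become a String, because OsString itself promises nothing.
    let ok = OsString::from("all-ascii");
    assert!(ok.into_string().is_ok());

    // And the validation can fail: 0xFF is never valid UTF-8.
    let bad = OsString::from_vec(vec![b'f', 0xFF, b'o']);
    assert!(bad.into_string().is_err());
    println!("ok");
}

#[cfg(not(unix))]
fn main() {
    // No byte-level constructor exists on Windows; the internal
    // WTF-8 representation is not exposed.
}
```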

Is this really a niche use case? To me, it feels like a footgun that you can't use the standard library types if you want to store a path in any type that gets serialized, without paying a performance penalty on one of our tier-1 platforms. (Not just "IBM mainframes.")

I seriously think that some fundamental design rethinking here could get libstd to a much nicer place, with few downsides, and that collecting pain points is helpful and should not be dismissed as griping or nitpicking.

4 Likes

Yup, this was so annoying to deal with correctly and efficiently that ripgrep just gives up. If ripgrep encounters a path that is invalid UTF-16 on Windows, then when it writes that path to JSON (or to a terminal), it will lossily convert it to UTF-8 and write that instead.
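A std-only sketch of that lossy fallback (the path bytes are invented, and the Unix-only constructor stands in for an invalid-UTF-16 path on Windows):

```rust
#[cfg(unix)]
fn main() {
    use std::ffi::OsString;
    use std::os::unix::ffi::OsStringExt;
    use std::path::PathBuf;

    // A path containing the byte 0xFF, which is not valid UTF-8.
    let path = PathBuf::from(OsString::from_vec(b"foo\xFFbar".to_vec()));

    // to_string_lossy() replaces each invalid sequence with U+FFFD,
    // so the result is printable/serializable, but the original bytes
    // cannot be recovered from it.
    assert_eq!(path.to_string_lossy(), "foo\u{FFFD}bar");
    println!("{}", path.to_string_lossy());
}

#[cfg(not(unix))]
fn main() {}
```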

4 Likes

Well, I was corrected. Or at least trying to refute your claim would take disproportionate effort on my part (no better proof of something being possible than implementing it), so I'll take "it is unfixable" for granted. However, it doesn't really change my overall view on whether it should be "fixed".

Again, this is something that I'd have to see and try for myself to be able to argue about. While I understand that there may be unsupported operations on Path, they don't seem important enough to warrant a several-pages-long rant with strong words like "disappointed". In addition, I disagree with OP's parallel assertion that error handling shouldn't be used for catching invalid conversions, because that contradicts the entire design philosophy of the language.

In order to interoperate with the outside world, they need to be able to convert to/from UTF-8 anyway. I've seen this happen at one of the companies I've worked for.
In other words: the internal interpretation matters for performance, but supporting UTF-8 in 2020 at all is not optional.

And yes, UTF-16 is outdated by virtue of the rest of the world moving on. Which is pretty much the definition of outdated, practically speaking. An enclave keeping their internal string representations stable in UTF-16 for reasons of backwards compatibility (which is understandable in and of itself) doesn't change that in the slightest in my opinion.

FTR: just because we have a difference of opinion doesn't make mine extreme. I consider what you said a very polarizing statement.

Those outdated places do not exist in some kind of parallel universe. Yes, 99.999% of the files on my computers have UTF-8 names (most of them have ASCII names). But I have lots of files, so of the file systems that are relevant to me, probably at least half contain at least one file with a non-UTF-8 name.

File names are complicated.

And I think this is the main dividing point between Rust and some other programming languages: Rust makes it as simple as possible to deal with a complicated world. Some other languages lie and say the world is simple, and for learning examples that works wonderfully. But then the program gets exposed to the real world, and the end-user code gets filled with complex workarounds.

5 Likes

The reason I effectively say "let the past die" is that infinite backwards compat gets you the mess that is Windows: nowadays, every update seems to break some critical part.

So you have to draw the line somewhere.

My experience report is that I've spent at least several man-weeks of labor specifically because of this design and specifically because the internal WTF-8 representation is not exposed. (And many more than that because of the UTF-8 guarantee on strings.) Maybe you think the bugs I'm fixing for end users aren't worth fixing, or the performance I'm chasing for the benefit of end users isn't worth chasing, but I do. So I naturally have a lot of opinions on the current design, its limitations and possible solutions.

To clarify, I don't necessarily agree with every word the OP has written, and I wouldn't call what I'm feeling "disappointment" either. On the other hand, I can very much empathize with the OP.

I don't think the Windows and Java ecosystems can be accurately described as an "enclave." But obviously we have a difference of opinion here. So I'll just stop here and let my comments serve as important counter-points to your claims.

I don't necessarily disagree, but it's immaterial to my points. Ultimately, what I'm saying is, "if I followed your advice, then {insert popular piece of software} would effectively not exist." It's fine if you don't accept that as a valid argument, but I do think others might, because people care about what you can effectively build with your tools.

9 Likes

You didn't react to the most important thing I said: that in order to interoperate with the outside world, both Windows and Java need to support UTF-8 anyway regardless of their internal representation.
So at the end of the day this isn't about what's possible, but "merely" about how fast it can get done.
Which makes their internal representation irrelevant unless one is writing performance-critical code. In which case, why is that person using Java and/or Windows again?

I'll even go one further and tell you about the mentality that businesses using either of the two generally have: "we'd rather throw 2x hardware under it than change the code base", implying that most of them don't care all that much about performance, which in turn explains why those environments are stuck there: there's no real impetus for change, so the risk/reward ratio is way off.

Because it doesn't seem worth it. It's a tedious discussion even in the best of circumstances, and this is most definitely not the best of circumstances. On top of that, the outcome of that discussion doesn't impact my most important point: ripgrep doesn't exist if I followed your advice.

6 Likes

Sure, ripgrep wouldn't exist. I have no issue accepting that. It's just not how most resource-constrained developers look at development, because the resource limitation doesn't give you much of a choice. So while it's great that you have more or less infinite resources to spend on such problems, it is a mistake to think that holds for everybody.

As for the tediousness: that tends to happen when a participant in a conversation doesn't immediately accept (without any critical thought) what the other person has said. The solution is rather simple: stop trying to convince me of something that, at least at this point in time, I won't be convinced by.

I don't have anywhere close to "more or less infinite resources." And I certainly don't think that others do. I have no idea what this has to do with anything in this discussion, other than you just dismissing my own experience as atypical. If you want to do that, then just say you believe my situation is atypical and let other people make up their minds.

11 Likes

I never thought it was literally infinite, if only because infinity doesn't manifest itself in the physical world. Who's splitting hairs now?
The point was that you have enough resources, for some definition of enough.

So yeah, that most decidedly makes your experience atypical, unless most developers are much, much richer than I think they are. Which is not impossible, but definitely unlikely.

I've never done anything other than that, and claiming this despite the ample body of evidence on this site is just downright insulting. It could also easily be (mis)construed as targeted character assassination.

Say what you want, but in this discussion I've at least been cordial to you despite our difference of opinion.

This discussion got lost in the weeds, and I apologize for my part in that. I'm just going to summarize my high level point to the original thing you said, and then I'm done. While changing your mind on the utility of non-UTF-8 primitives would be nice, my main purpose for posting this is to help other folks understand why it's important, even if they don't agree with it or it isn't applicable to their use cases. It's a subtle topic, and I think it's important to add some counter-balance to my interpretation of some of the things you've said.

Many (most?) Rust crates that do useful work with strings, do so with Rust strings (UTF-8)

I don't understand why this is an issue? Pretty much the entire world has migrated to UTF-8 at this point, so seeing any other encoding is archaic and anachronistic at best, and downright lazy of the developer at worst.

AIUI, you're asking why it's an issue at all if a crate only works with the String/&str data types. That is, presumably you would be okay with, say, regex or globset or aho-corasick exclusively working on String/&str data types. If I've understood this part of your comment correctly, then it seems to me like you think that's a perfectly acceptable state of the world. To me, that doesn't just imply that I personally wouldn't have written ripgrep, but rather, that nobody could write ripgrep (or a similarly fast tool) without also rolling their own regex engine that works on &[u8]. So in my view, the fact that regex supports searching on &[u8] is actually reducing the amount of resources one needs to build a grep-like tool (or anything really that needs to search data without any defined encoding, which is quite common). It has nothing to do with how much free time I personally have. It's a universally useful feature of the API, and it was born from someone else requesting it. From this, it should be clear why the absence of this sort of API is an issue: it prevents folks from running regexes on data that may not be valid UTF-8 without some kind of check or conversion step first, and this in turn inhibits a wide variety of use cases. ripgrep is one of them, but there are others.
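To make that "check or conversion step" concrete, here is a minimal std-only sketch (the data is invented, and `str::contains` stands in for a real regex API): with a `&str`-only interface, non-UTF-8 input forces either an up-front validity check or an allocating, data-altering lossy conversion.

```rust
fn main() {
    let data: &[u8] = b"log line \xFF more";

    // Option 1: a validity check, which simply fails here...
    assert!(std::str::from_utf8(data).is_err());

    // Option 2: ...or a lossy conversion, which allocates and
    // replaces the invalid byte with U+FFFD, changing the data.
    let converted = String::from_utf8_lossy(data);
    assert!(converted.contains("log line"));
    assert_ne!(converted.as_bytes(), data);
    println!("ok");
}
```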

From the issue I linked, you can even see that I didn't always hold this position:

When building regex , I never even considered the possibility of running a regex on something that wasn't Unicode.

That is to say, I didn't form my position by reason alone. I formed it through experience with real use cases and end user feedback. I had to be shown why APIs that only work on String/&str were a problem. And specifically, this is of course rooted not in the fact that String/&str are "string types," but rather, that they are string types that must be valid UTF-8. If they didn't have the UTF-8 requirement, then regex would not have a bytes sub-module, for example. (Some languages have string types with no UTF-8 validity requirement.)

So my criticism here is not about one's approach to UTF-8 in specific scenarios where rejecting non-UTF-8 is deemed as acceptable (that's fine, they not only exist but are common). Instead, I'm trying to explain why providing APIs that aren't restricted to String/&str is important. And especially so for crates provided for ecosystem use, which is what people run into, as in the OP's case. That is, I'm making an argument that people publishing libraries that operate on strings would benefit from considering whether it makes sense to offer non-String/&str APIs.

I note that I am discussing &[u8] here while the OP is talking about Path/OsStr. The connection may not be immediately obvious to everyone, but basically, in order to provide an API on OsStr, you need to at least provide an API that works on &[u8]. If you do, then at that point, the problem is just about translating a Path/OsStr to an &[u8]. On Unix, this is free and cheap. On Windows, this is difficult and costly to do non-lossily. The underlying connection here is UTF-8 validity. None of Path, OsStr or [u8] have any kind of UTF-8 guarantee, but String/&str does. So if you can provide APIs that work on Path and OsStr, then by necessity, it is trivial to provide an API on &[u8] because you've already implemented it by virtue of providing Path and OsStr APIs.
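On Unix, that free conversion is literally just an accessor in std (a minimal sketch; Unix-only because no equivalent accessor exists on Windows, where the internal WTF-8 representation is not exposed):

```rust
#[cfg(unix)]
fn main() {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;

    // On Unix an OsStr is arbitrary bytes under the hood, so both
    // directions of the conversion are zero-cost views, not copies.
    let os: &OsStr = OsStr::new("some/dir");
    let bytes: &[u8] = os.as_bytes();
    assert_eq!(bytes, b"some/dir");
    assert_eq!(OsStr::from_bytes(bytes), os);
    println!("ok");
}

#[cfg(not(unix))]
fn main() {
    // On Windows you must instead go through to_str() (fallible
    // validation) or encode_wide() (re-encoding to UTF-16 units).
}
```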

13 Likes

Non-Unicode paths on Windows are exceedingly rare, to the point that I've yet to hear of it happening in the real world (though please tell me if you have!), aside from files created by malware (and even that is rare). Even if files are created using the user's code page, the name will be converted to UTF-16 before being passed to the kernel.

Finding non-Unicode paths on Linux is more common because the UTF-8 convention is relatively recent.


The situation with OsStr/OsString (and Path/PathBuf) on Windows is the most frustrating to me. I'm slowly coming round to the idea that it just doesn't fit the platform at all. It really wants Windows to be unix-like, but it's not, so there are a lot of hacks to work around that. Don't get me wrong, some of those "hacks" are brilliant efforts to try to make it all work, but at the end of the day I'm not sure it's possible to make Win32 comfortably fit in a *nix-shaped hole.

ripgrep has never handled non-UTF-16 paths on Windows fully correctly, and I've never received a bug report for it. It is possible that it has occurred, but just hasn't been noticed.