This discussion got lost in the weeds, and I apologize for my part in that. I'm just going to summarize my high-level response to your original comment, and then I'm done. While changing your mind on the utility of non-UTF-8 primitives would be nice, my main purpose in posting this is to help other folks understand why it's important, even if they don't agree with it or it isn't applicable to their use cases. It's a subtle topic, and I think it's important to add some counterbalance to some of the things you've said, at least as I've interpreted them.
> > Many (most?) Rust crates that do useful work with strings, do so with Rust strings (UTF-8)
>
> I don't understand why this is an issue? Pretty much the entire world has migrated to UTF-8 at this point, so seeing any other encoding is archaic and anachronistic at best, and downright lazy of the developer at worst.
AIUI, you're asking why it's an issue at all if a crate only works with the `String`/`&str` data types. That is, presumably you would be okay with, say, `regex` or `globset` or `aho-corasick` exclusively working on `String`/`&str` data types. If I've understood this part of your comment correctly, then it seems to me like you think that's a perfectly acceptable state of the world. To me, that doesn't just imply that I personally wouldn't have written ripgrep, but rather, that nobody could write ripgrep (or a similarly fast tool) without also rolling their own regex engine that works on `&[u8]`. So in my view, the fact that `regex` supports searching on `&[u8]` is actually reducing the amount of resources one needs to build a grep-like tool (or anything really that needs to search data without any defined encoding, which is quite common). It has nothing to do with how much free time I personally have. It's a universally useful feature of the API, and it was born from someone else requesting it. From this, it should be clear why the absence of this sort of API is an issue: it prevents folks from running regexes on data that may not be valid UTF-8 without some kind of check or conversion step first, and this in turn inhibits a wide variety of use cases. ripgrep is one of them, but there are others.
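To make the "check or conversion step" concrete, here's a minimal sketch using only the standard library. `find_bytes` is a hypothetical helper I made up for illustration (it is not part of `regex` or any other crate); it stands in for any search routine that accepts `&[u8]`.

```rust
/// Hypothetical helper: find the first occurrence of `needle` in `haystack`,
/// treating both as raw bytes with no encoding assumption.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    // The byte 0xFF can never appear in valid UTF-8.
    let data: &[u8] = b"foo\xFFbar";

    // Option 1: validate first. A `&str`-only API rejects this data outright.
    assert!(std::str::from_utf8(data).is_err());

    // Option 2: convert lossily. This allocates and changes the bytes:
    // the 1-byte 0xFF becomes the 3-byte U+FFFD, so match offsets shift.
    let lossy = String::from_utf8_lossy(data);
    assert_eq!(lossy.find("bar"), Some(6));

    // Byte-oriented search: no check, no copy, offsets match the input.
    assert_eq!(find_bytes(data, b"bar"), Some(4));
}
```

The offsets are the part that tends to bite: once you've converted lossily, a match position no longer maps back to the original data.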
From the issue I linked, you can even see that I didn't always hold this position:
> When building `regex`, I never even considered the possibility of running a regex on something that wasn't Unicode.
That is to say, I didn't form my position by reason alone. I formed it through experience with real use cases and end user feedback. I had to be shown why APIs that only work on `String`/`&str` were a problem. And specifically, this is of course rooted not in the fact that `String`/`&str` are "string types," but rather, that they are string types that must be valid UTF-8. If they didn't have the UTF-8 requirement, then `regex` would not have a `bytes` sub-module, for example. (Some languages have string types with no UTF-8 validity requirement.)
So my criticism here is not about one's approach to UTF-8 in specific scenarios where rejecting non-UTF-8 is deemed acceptable (that's fine, those scenarios not only exist but are common). Instead, I'm trying to explain why providing APIs that aren't restricted to `String`/`&str` is important. And especially so for crates provided for ecosystem use, which is what people run into, as in the OP's case. That is, I'm making an argument that people publishing libraries that operate on strings would benefit from considering whether it makes sense to offer non-`String`/`&str` APIs.
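One way a library could do this, sketched with only the standard library: put the real implementation on `&[u8]` and layer a thin `&str` convenience on top. The names `count_bytes` and `count_str` are hypothetical and chosen purely for illustration; this is a sketch of the shape of such an API, not how any particular crate does it.

```rust
/// Core implementation: counts non-overlapping occurrences of `needle`
/// in arbitrary bytes. (Hypothetical name, for illustration only.)
pub fn count_bytes(haystack: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() {
        return 0;
    }
    let mut count = 0;
    let mut i = 0;
    while i + needle.len() <= haystack.len() {
        if &haystack[i..i + needle.len()] == needle {
            count += 1;
            i += needle.len();
        } else {
            i += 1;
        }
    }
    count
}

/// Convenience wrapper for callers that already have valid UTF-8.
/// `str::as_bytes` only reinterprets the existing buffer, so the
/// wrapper adds no copy and no validation.
pub fn count_str(haystack: &str, needle: &str) -> usize {
    count_bytes(haystack.as_bytes(), needle.as_bytes())
}

fn main() {
    assert_eq!(count_str("abcabc", "abc"), 2);
    // The same routine handles input that is not valid UTF-8.
    assert_eq!(count_bytes(b"ab\xFFab\xFE", b"ab"), 2);
}
```

Going in this direction is cheap because every `&str` is already a valid `&[u8]`. Going the other way (a `&str`-only core with a bytes wrapper) forces a validity check or a lossy conversion on every call.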
I note that I am discussing `&[u8]` here while the OP is talking about `Path`/`OsStr`. The connection may not be immediately obvious to everyone, but basically, in order to provide an API on `OsStr`, you need to at least provide an API that works on `&[u8]`. If you do, then at that point, the problem is just about translating a `Path`/`OsStr` to a `&[u8]`. On Unix, this is free. On Windows, this is difficult and costly to do non-lossily. The underlying connection here is UTF-8 validity. None of `Path`, `OsStr` or `[u8]` have any kind of UTF-8 guarantee, but `String`/`&str` do. So if you can provide APIs that work on `Path` and `OsStr`, then by necessity, it is trivial to provide an API on `&[u8]`, because you've already implemented it by virtue of providing the `Path` and `OsStr` APIs.
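Here's a small sketch of the Unix side of that translation, again using only the standard library. `has_suffix_bytes` and `has_suffix_os` are hypothetical names for illustration; the conversion itself uses the `std::os::unix::ffi::OsStrExt` trait, which only exists on Unix targets. I've deliberately left the Windows side as a stub, since that's the hard part.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Hypothetical byte-oriented API: does `haystack` end with `suffix`?
fn has_suffix_bytes(haystack: &[u8], suffix: &[u8]) -> bool {
    haystack.ends_with(suffix)
}

/// On Unix, an `OsStr` is arbitrary bytes, so borrowing them as `&[u8]`
/// involves no copy and no validation.
#[cfg(unix)]
fn has_suffix_os(name: &OsStr, suffix: &[u8]) -> bool {
    use std::os::unix::ffi::OsStrExt;
    has_suffix_bytes(name.as_bytes(), suffix)
}

#[cfg(unix)]
fn main() {
    use std::os::unix::ffi::OsStrExt;

    // A file name containing 0xFF: legal on Unix, but not valid UTF-8.
    let name = OsStr::from_bytes(b"report\xFF.txt");
    let path = Path::new(name);

    // A `&str`-only API has nothing to work with here...
    assert!(path.to_str().is_none());
    // ...while the byte-oriented API handles it directly.
    assert!(has_suffix_os(path.as_os_str(), b".txt"));
}

#[cfg(not(unix))]
fn main() {
    // There is no equivalent free borrow as `&[u8]` here. This is the
    // "difficult and costly" case, and this sketch does not cover it.
}
```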