Disappointed with Path

Quick background: I began seriously studying Rust about 9 months ago. Though I've spent many years using Windows, the past 3+ years I've worked exclusively in Linux environments. So my point-of-view is certainly "penguin-oriented."

Reason for my Headaches Today:

Learning these truths can give newcomers (like myself) a lot of headaches. There are many crucial learning curves in Rust. You must think about memory. Learn about the different string types. Ownership. Lifetimes. Traits. Macros. And plenty more.

Until now, even though I struggled, I ultimately triumphed and felt happier about Rust. Ownership was a tough climb. But it makes perfect sense and once it "clicked", things got easier. Lifetime syntax can still trip me up. But I do understand the point, and when they're needed. Using them has gotten easier, and I'm confident it will continue to.

Yet then there's Path. I believe I understand the challenges with Path. I realize the different operating system implementations are the root cause of these challenges (not Rust). I understand that Path is not UTF-8. Not every value of Path can cleanly convert to UTF-8 (and therefore Rust strings).

Right now, I'm mostly concerned about operating system file paths.

  1. Most real-world values of a file Path will cleanly convert to UTF-8. Are there exceptions? Yes. But these are rare edge cases.
  2. Many (most?) Rust crates that do useful work with strings, do so with Rust strings (UTF-8)
  3. The documentation discourages us from using unwrap()
  4. However, manipulating or working with Path silently encourages newbie Rustaceans to make liberal use of unwrap():
use std::path::Path;
use regex::Regex;

const PATTERN1: &str = "(brian)";
const PATTERN2: &str = "(john)";

fn contains_pattern(bar: &str, pattern: &str) -> bool {
    let iso_regex = Regex::new(pattern);
    iso_regex.unwrap().is_match(bar)
}

fn main() {
  let not_useful_path = Path::new("/home/brian");
  
  // Regex cannot use Path, so we're forced to unsafely convert to string.
  let useful_string = not_useful_path.to_str().unwrap();
  
  println!("Result 1: {:?}", contains_pattern(useful_string, PATTERN1));
  println!("Result 2: {:?}", contains_pattern(useful_string, PATTERN2));
}

Many, many excellent and useful crates were written to use Strings. Not Path. Not OsStr. This leaves the developer with uncomfortable choices:

Option 1: Avoid using the Path type altogether. Always use string types.

Pros: You may now utilize various libraries and crates, and stop wasting time converting Paths to strings (which we know isn't 100% safe anyway).
Cons: You will miss out on useful Path functions like join() and parent(). You'll have to find libraries that duplicate these functions for String. Or build your own!
Also, your pseudo String-as-Path types will not be compatible with OsStr. This may later result in unexpected errors and failures.

Option 2: Use Path when needed, and unsafely convert to String.
Pros: 99% of the time, the conversion to UTF-8 will be okay.
Cons: As developers, we worry about stability and correctness. We worry about those Paths that might not convert to UTF-8. Also, your code is going to be filled with clunky to_str().unwrap().to_string() statements all over the place.

You will miss out on useful AsRef conversions too!

// This code will not compile
use std::path::Path;

pub fn foo<'a, S>(parm_path: S) -> ()
    where S: AsRef<&'a str> {
    println!("My path as string = {:?}", parm_path.as_ref());
    ()
}

fn main() {
  let p: &Path = Path::new("/home/brian/some_file.txt");
  foo(p);
}

"the trait std::convert::AsRef<&str> is not implemented for std::path::Path"

...
Option 3: Use Path when you need to. Whenever you need crates (ex. Regex) and the crate requires working with String, re-write the crate to use Path or OsStr instead.
Pros: Appears to be the only "correct" solution Rust is offering us.
Cons: Wildly impractical to rewrite every crate you will need.

Today, I'm doing Option 2. I don't like it. I don't like writing these unwrap() statements all the time, or doing error handling for hypothetical edge cases that I cannot easily reproduce. Yet I have no better ideas. If my program ever encounters such an odd filename, I certainly would hope/expect it panics! My project was never designed to work with such things.

I'd really like an Option 4. But I don't know what that is.

I understand Rust is supposed to be safe, not choke on unusual values, and be very stable. I'm all for that.

But Rust is also a systems language. And the manipulation of file paths and names? Well, that seems like a core part of systems programming (particularly in Linux, where "everything is a file"). Yet Rust is weirdly awkward and cumbersome when it comes to working with file Paths. You can do a couple useful things, but you quickly encounter a requirement for Strings.

For the first time with Rust, I've learned "why" things didn't work the way I expect...but I've walked away feeling very unsatisfied.

2 Likes

Hello

I understand that you feel uncomfortable about Path being something other than String. I, however, feel like this was the right decision. As a person whose primary language is not English, my file system contains file names which are often not UTF-8 (mostly broken encoding because the files predate wide adoption of UTF-8).

I still regularly meet programs that break in weird ways because they assumed that file names have valid encoding or that they are UTF-8. Things like double-clicking on a file in a file browser, the correct application starting, only to error out on „File not found“ and similar. I believe the very visible distinction and warning (what you work with is not a String) may make Rust programs more resilient to this problem. It feels awkward because the real world is not simple, and if you cut corners you will get annoying, hard-to-replicate bugs. You really should not pretend Paths are Strings, because they are not.

Note that many crates can also operate on byte strings (e.g. &[u8]) ‒ for example Regex has a whole module for such matching, and you can convert to strings in a lossy way.

And there's .display() for when you want to print it.
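
To make the last two points concrete, here is a minimal sketch using only the standard library. The helper names `lossy_contains` and `describe` are made up for illustration; `to_string_lossy()` and `display()` are the actual `Path` methods. Any invalid sequence is replaced with U+FFFD, so neither helper can panic.

```rust
use std::path::Path;

/// Substring check on the lossy conversion of a path.
/// Invalid sequences become U+FFFD, so this never panics.
fn lossy_contains(path: &Path, needle: &str) -> bool {
    path.to_string_lossy().contains(needle)
}

/// Builds a log or error message; `display()` is meant for exactly this.
fn describe(path: &Path) -> String {
    format!("checking {}", path.display())
}
```

Calling `lossy_contains(Path::new("/home/brian"), "brian")` should return true without any unwrap() involved. The trade-off is that the lossy string is for matching and printing only; it should not be used to reopen the file.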

43 Likes

Citation needed. Where? Certainly official resources don't say that "you can use unwrap liberally when converting between different string types".

If your conversion is fallible (which Path -> &str and the like are), you have to handle the error. That you choose to "handle" the error by throwing up your hands and crashing is not the fault of the library, it is clearly and unequivocally your fault.

Again – no, that's not a necessity. That's your choice, and it is a bad choice. You really should be using Rust's graceful error handling mechanisms instead (e.g. propagation using ? and Try).
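
As a rough sketch of what that propagation could look like: the error type `NonUtf8Path` and the helpers `path_as_str` and `mentions` are hypothetical names invented for this example, not anything from std. The conversion happens in one place, and callers pass the failure along with `?`.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Hypothetical error type for this sketch: remembers the offending path.
#[derive(Debug)]
struct NonUtf8Path(PathBuf);

impl fmt::Display for NonUtf8Path {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "path is not valid UTF-8: {}", self.0.display())
    }
}

impl std::error::Error for NonUtf8Path {}

/// Converts once, turning the `None` case into an error value.
fn path_as_str(path: &Path) -> Result<&str, NonUtf8Path> {
    path.to_str().ok_or_else(|| NonUtf8Path(path.to_path_buf()))
}

/// A caller that needs a `&str` propagates the failure with `?`.
fn mentions(path: &Path, needle: &str) -> Result<bool, NonUtf8Path> {
    let s = path_as_str(path)?;
    Ok(s.contains(needle))
}
```

Whoever sits at the top of the call chain then decides whether to print the message, skip the file, or exit.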

9 Likes
  • Most real-world values of a file Path will cleanly convert to UTF-8.

I have 800k files in my home directory. In any sufficiently large dataset exceptions exist 100% of the time, so they are the norm you have to deal with. "It's rare" is not an incantation that magically makes problems go away. Rust is there to help you and save you from debugging problems with something failing somewhere else, when it's not obvious that the problem has anything to do with filenames.

  • Many (most?) Rust crates that do useful work with strings, do so with Rust strings (UTF-8)

It's difficult to do something useful reliably with a sequence of bytes.

  • The documentation discourages us from using unwrap()
  • However, manipulating or working with Path silently encourages newbie Rustaceans to make liberal use of unwrap():

Well, this is a general problem with examples. You should convert this into some kind of Error, but that's usually out of scope for a quick demonstration of how something works.

The right solution is to ensure valid UTF-8 filenames, which is the responsibility of the operating system, because in an increasing number of places filenames are processed in the context of something that needs to be UTF-8 anyway. For this reason I tend to require files to be valid UTF-8 and reject working with those that aren't. It's easier overall to fix it at the entry point than to figure out what to do with them later.

8 Likes

TL;DR

Such file paths, when input by users or discovered from directory searches and so on, are inputs to a program. As such they should be checked for validity as much as all other inputs to programs.

It's amazing how frequent such "rare edge cases" can be, especially when there are millions of people using billions of files originating from God knows what operating systems. It is far better that things are checked properly if one wants robust code, as Rust users do.

It's not clear to me how that is so. Admittedly, a lot of example code for almost anything in Rust omits error checking and just uses unwrap(). But that is traditional in example code in all kinds of languages, where error checking is left out for the sake of clarity.

Perhaps users of Rust programs would rather have a meaningful error message/dialog and not have the program they are using just crash. Depends what one is creating I guess.

My view of a systems programming language is that it can be used to create the operating system that implements and maintains such file paths. Not just consume them like any old application program.

I presume you don't mean "unsafe" there as in the Rust concept of "unsafe" and the unsafe keyword?

12 Likes

UTF-8 Paths

What you may be looking for, is an intersection of Path and String: if you are the one creating the paths, it makes sense that your paths be UTF-8 (Example: Path::new("some_dir").join("sub_dir")).

With some wrapper type, you could write:

/// `Path` with only UTF-8 components
struct UPath ...
impl Into<Path> for UPath { ... }
impl TryFrom<Path> for UPath { ... } // fails if non-UTF8

// Usages:
// 1 - hard-coded path:
let path = UPath::new("some_dir").join("sub_dir");
let path_str: &str = path.as_ref();

// 2 - user-provided path:
let path: UPath =
    if let Ok(it) = ::std::env::temp_dir().try_into() { it } else {
        // correctly handle the error case ... once!
    }
;
// the rest of your API can operate on a Path that has been checked as being valid UTF-8
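
For anyone who wants to try the idea, a compilable sketch of such a wrapper might look like the following. Everything here is an assumption made for illustration: `UPath` is not a std type, it stores a `String` internally, and the conversions are implemented for `PathBuf` and `AsRef<Path>` because `Path` itself is unsized.

```rust
use std::convert::TryFrom;
use std::path::{Path, PathBuf};

/// Hypothetical wrapper: a path that was checked once to be valid UTF-8.
#[derive(Debug, Clone, PartialEq)]
struct UPath(String);

impl UPath {
    /// A `&str` is always valid UTF-8, so this cannot fail.
    fn new(s: &str) -> Self {
        UPath(s.to_owned())
    }

    /// Appends a UTF-8 component using the normal `Path::join` rules.
    fn join(&self, component: &str) -> Self {
        let joined = Path::new(&self.0).join(component);
        // Lossless here: both inputs are valid UTF-8, so nothing is replaced.
        UPath(joined.to_string_lossy().into_owned())
    }

    /// Infallible access to the string form.
    fn as_str(&self) -> &str {
        &self.0
    }
}

impl AsRef<Path> for UPath {
    fn as_ref(&self) -> &Path {
        Path::new(&self.0)
    }
}

impl TryFrom<PathBuf> for UPath {
    /// On failure the original path is handed back unchanged.
    type Error = PathBuf;

    fn try_from(path: PathBuf) -> Result<Self, Self::Error> {
        path.into_os_string()
            .into_string()
            .map(UPath)
            .map_err(PathBuf::from)
    }
}
```

With this, `UPath::try_from(std::env::temp_dir())` is the single place where a non-UTF-8 path has to be handled, and `as_str()` can be handed to any string-based crate afterwards.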

Paths as &[u8]

Another thing to keep in mind: Paths may not necessarily be UTF-8, but on Unix at least, they can be seen as slices of bytes (after doing the infallible .as_os_str()). Granted, there aren't as many convenience functions on &[u8] as there are on &str, but I think that's just a matter of pulling in a dependency that does provide these.
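
As a small illustration of the byte-slice view, here is a Unix-only sketch with a hand-rolled substring search. The function name `path_contains_bytes` is made up for this example; `OsStrExt::as_bytes` is the real std extension trait method. A crate that offers string-like methods on byte slices would replace the manual `windows` loop.

```rust
// Unix-only: this extension trait does not exist on Windows.
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Byte-level substring search on a path; works for non-UTF-8 names too.
fn path_contains_bytes(path: &Path, needle: &[u8]) -> bool {
    let haystack = path.as_os_str().as_bytes();
    // `windows(0)` would panic, so treat an empty needle as always found.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}
```

No conversion to `&str` happens at any point, so there is nothing to unwrap.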

6 Likes

Paths aren't strings. Many languages have made that mistake, and either suffer problems with edge cases (e.g. Python) or had to walk back on it (e.g. Cocoa did a major migration from NSString paths to NSURL).

It's true that Rust doesn't give paths first-class support. If you're supporting only Unix in your program, you can use .as_bytes() on paths. Then lots of byte-oriented manipulation is supported. If you also need Windows support, it's doable:

Don't use unwrap()! It will make your program blow up on systems that don't use UTF-8 locale, or even on systems which do use UTF-8, but happen to have a stray bad file somewhere.

You can use e.g.

path.to_str().map_or(false, |s| s.contains("pattern"))

which will skip the path (return false negative) instead of completely blowing up the entire program.

On Unix you can also do:

String::from_utf8_lossy(path.as_os_str().as_bytes()).contains("pattern")
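
Since these edge cases are said to be hard to reproduce, here is a Unix-only sketch that builds such a path on purpose and runs both approaches against it. The names `non_utf8_path`, `strict_contains` and `lossy_bytes_contains` are invented for the example; the byte 0xFF is used because it can never occur in valid UTF-8.

```rust
// Unix-only sketch: `OsStrExt` gives access to the raw bytes.
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
use std::path::{Path, PathBuf};

/// Builds a path whose file name contains the invalid byte 0xFF.
fn non_utf8_path() -> PathBuf {
    Path::new("/tmp").join(OsStr::from_bytes(b"pattern_\xFF.txt"))
}

/// Skips paths that are not UTF-8 (may give a false negative).
fn strict_contains(path: &Path, needle: &str) -> bool {
    path.to_str().map_or(false, |s| s.contains(needle))
}

/// Replaces invalid bytes with U+FFFD and searches the result.
fn lossy_bytes_contains(path: &Path, needle: &str) -> bool {
    String::from_utf8_lossy(path.as_os_str().as_bytes()).contains(needle)
}
```

For the path returned by `non_utf8_path()`, `to_str()` yields `None`, so the strict version reports false while the lossy version still finds "pattern". Neither panics.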
9 Likes

^ I think the entire post could be summed up with just these lines.

Rust will ask you "what should I do if an error occurs here?" for any possibility of an error (in other words, things like Result and Option types). Things like .expect(), .unwrap(), and ? are your answers: "exit but leave a message", "just exit", "pass it along". Rust won't let you leave without an answer. Now if you think in those cases it's right to panic and the situation is irrecoverable, you are 200% allowed to use unwrap.
If you want to skip even those, and don't want Rust to tell you about even the possibility of problems, and have it crash without you telling it to, that's just not something Rust was made to do.

I think you're completely fine with all the unwraps everywhere if you think crashing is justified. I'd recommend at least skipping faulty paths if that's possible.

11 Likes

I think it's a very valid complaint, and I've been expressing it in one form or another for years now. As others have mentioned, there are significant trade offs at play here, but path handling in Rust is undeniably annoying. I feel it a bit more than most perhaps, because allowing paths to be used with various string-like primitives (regexes and globs, principally) is a difficult platform dependent chore with hidden costs, depending on the platform (i.e., Windows).

One possible way to ameliorate this would be to expose the internal WTF-8 representation of OsStr. But this also has significant downsides. There's more discussion on that here: https://internals.rust-lang.org/t/osstr-wtf8-as-bytes-and-to-string-unchecked/12694

9 Likes

This is another example of a relatively common class of dissatisfaction that gets expressed from time to time: a library (especially a language's standard library) needs to provide facilities for all contingencies.

Application programmers, however, are often working in restricted domains where the edge cases so carefully handled by the library are extraneous distractions. Other examples are:

  • A program only operates on ASCII but has to deal with all vagaries of UTF-8
  • A program will be single-threaded, but the library requires Send or Sync so that it can work properly in a threaded context

I would welcome a way to specify this sort of program-wide property so that it's obvious in a single place what the program's environmental assumptions are, and so that the compiler can enforce those assumptions correctly without them being flagged at every use.

1 Like

These are by design, in order to avoid the antipattern known as stringy typed programming.

This isn't inherent to Rust. Any language that takes the differences between OSes seriously would likely arrive at the same destination, because file paths don't necessarily have to be valid UTF-8, and &str and String by definition are valid UTF-8.

I'll give you that, there is definitely a learning curve to Rust. But in retrospect all that is is education that a new Rustacean has missed up until that point that they need in order to produce correct code. In other words: skipping these parts brings you right back to the (many, many) flaws of the likes of C/C++ in terms of various forms of safety: ownership, thread safety, memory safety etc.

I have never seen a type system do probabilistic reasoning. In order to be sound, it has to treat any input value of a given type the same as any other value of the same type. So even though usually it will be UTF-8, and only occasionally not, that fact is 100% irrelevant.

I don't understand why this is an issue? Pretty much the entire world has migrated to UTF-8 at this point, so seeing any other encoding is archaic and anachronistic at best, and downright lazy of the developer at worst.

I don't see the issue with that either?

There is an alternative: properly handle the errors, which is why unwrap usage is so discouraged.

Option 4: treat the Option<&str> value you get from the Path to &str conversion like the monad it is, and use .map() on it to transform the contained value.
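
A minimal sketch of what that can look like in practice (the function name `shouted_file_name` is just an illustration): the `Option` is threaded through `and_then` and `map`, and the non-UTF-8 case simply falls out as `None`.

```rust
use std::path::Path;

/// Uppercases the final path component.
/// Returns `None` if there is no file name or it is not valid UTF-8.
fn shouted_file_name(path: &Path) -> Option<String> {
    path.file_name()
        .and_then(|name| name.to_str())
        .map(|s| s.to_uppercase())
}
```

The caller can finish with `unwrap_or_default()`, `if let Some(..)`, or `ok_or(..)?`, whichever suits the program.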

4 Likes

I think calling "panic" a crash is both sending the wrong message as well as incorrect. To me, and I strongly believe this is the correct interpretation, "Panic" is equivalent to, "You, the programmer (driver), decided not to care about logically (correctly) implementing (driving) your program (car), but don't worry, I've got your back and will do my best to clean up the stack and other resources (watch out for obstacles) and stop the program (pull over to the side of the road) before something bad happens (you kill children in the crosswalk)." This is not a "crash"; this is a safe action to prevent a catastrophe because the programmer (driver) is being lazy and not thinking about safety. On the other hand, "Option/Result/etc" (dashboard lights and indicators), are the equivalent of, "I am notifying you that this thing you asked me to do can't be done for a reason of either some malfunction or external environmental reasons or because the inputs you've given aren't compatible. You need to decide what to do, but if you want, and you tell me to, I can go ahead and panic (pull to the side of the road)". Neither of these is a crash. A crash is, "Hmm...you've given me senseless inputs or there is something unexpected (there is a child in the road and you are drunk), I'll just do some random and possibly dangerous thing (run over the child or steer into the abutment), and maybe allow someone else to take control of the program and do something you didn't want to allow (let your passenger in the back seat start steering and braking and driving where they want instead of where you want including steering the car into the child or into the abutment)." In any case, in a "crash" all bets are off. Anything can happen up to and including killing you and others.

TL;DR: Panic is not a "crash". Panic is not "unsafe". Panic is a safe action to prevent a crash that the programmer (driver) has made no attempt to properly avoid (or that is entirely unavoidable otherwise due to some failure: the engine blew up or the transmission fell out).

13 Likes

Honestly, I wasn't trying to be all that specific. I used crash as an umbrella term for 'the program exiting', regardless of whether it's a graceful panic or something else. But I'll go fix it anyway since it's better to be specific.

4 Likes

While people above have pointed out the benefits to having Path and OsStr be separate types from str, I think it's important to acknowledge that the disadvantages are real. It's currently annoying or even impossible to do many operations with a Path that are easy with a regular string.

For example, Path and OsStr don't have any equivalent to str::find, or String::replace, or many other handy string methods. And implementing them in an external crate is not simple, because they don't expose their internal representation (as @BurntSushi mentioned above). This also leads to inefficiencies when using them with crates like serde.

@BurntSushi's own bstr crate shows an alternate way to handle possibly-non-Unicode strings, which has fewer of these downsides. I think it would be great if we could find ways to make the standard library Path/OsStr types gain similar functionality without losing their resilience. I know some Rust contributors are actively thinking about this, and I think it's an area where suggestions should be welcome.

16 Likes

They could just be implemented, though – so this is at best a (rightful) complaint against a somewhat incomplete corner of the stdlib. It does not show anything inherently flawed or unfixable in the language or the ecosystem.

4 Likes

For that example, sure. But as I mentioned above, you can't do anything outside of std without either additional costs for conversions or exposing OsStr's internal representation. You might want to implement your own substring search algorithm for example. Or you might want to implement a regex or glob engine to run on file paths. Or any one of a number of other things that are almost certainly never going to be in std. So from that perspective, it absolutely could be an unfixable aspect of the current design. (I say "could" because it depends on whether we ever expose the internal WTF-8 representation. If we don't, then this sort of thing is indeed unfixable by design.)

3 Likes

If I took this attitude with ripgrep, I promise you that I'd have a lot fewer users. And it almost certainly would not have been integrated into VS Code. That regexes and globs can work on &[u8] is absolutely critical to the mere existence of a tool like ripgrep.

17 Likes

I don't disagree with anything you said there.

However, I don't think any user of a program would make the distinction. The program was doing what they wanted, then suddenly it was not, with no understandable explanation. They might be angry at you that the crappy program you gave them "crashed".

Heck, I'm not sure I would make the distinction for myself when one of my server processes dies in the night. I would have a hard time explaining to the boss that "no it did not crash". And I am my own boss :)

7 Likes

Yeah, some of them at least, and this is one of the first steps toward improvement that I'd hope to see. This old abandoned RFC has a good starting point. Maybe I can find time to resurrect it...

On the other hand, efficient serialization is in that category of "broken by (current) design."

3 Likes

If getting the most number of users possible is an active goal of yours when developing crates, more power to you.
But when I develop most of my crates it's because I need them for something bigger I'm working on, not because of some as-yet-unknown adoption rate that may or may not materialize. And then in the extension of that, if I ever get a PR (or a GH issue where I'm willing and able, as well as have the time required to do the work) the functionality can be grown.

As an aside, working on a byte slice vs working on types like OsStr, is a totally different thing. Supporting a byte array is much much easier because there's no gnarly interpretation necessary. It's just a bunch of bytes.

1 Like