Encoding PathBuf containing path with invalid utf-8 characters using serde

Hi,

I have a struct with a PathBuf as one of its fields:

struct A {
    pub first: bool,
    pub path: PathBuf,
}

The path can contain invalid UTF-8, because Linux allows files and folders to be created with names that are not valid UTF-8.
I am using rmp_serde::to_vec_named() to serialize the struct into a Vec<u8>.
For a path containing invalid UTF-8, however, it fails with Error: SerdeEncodeMspack(Syntax("path contains invalid UTF-8 characters")).

Is there any way to encode a struct whose path contains invalid UTF-8 without skipping the field?
Thanks in advance.

You can convert it to an OsStr(ing) and serialize that, because apparently Serde provides an infallible Serialize impl for OsStr(ing) but a fallible one for Path(Buf).

In my case it is not possible to change the data type. Is there any other solution to this problem?

You don't have to change the data type stored in the field. It's sufficient to use #[serde(with = …)] and pass a custom function that calls .as_os_str() on the PathBuf then serializes the returned OsStr.

Using .as_os_str() means that you can't deserialize on Windows a path that was serialized on Unix, and vice versa. OsStr is encoded as if it were enum OsStr { Unix(Vec<u8>), Windows(Vec<u16>) }, and deserializing the wrong variant for the current platform returns an error. Path, on the other hand, is serialized as a UTF-8 string, which can be deserialized on all platforms. It also means that OsStr is not human-readable even in human-readable formats like JSON, while Path is.
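
To make the platform dependence concrete, here is a minimal std-only sketch that mirrors the shape described above. The names EncodedOsStr, encode, and decode are hypothetical and are not part of Serde; this only illustrates why a value produced on one platform cannot be turned back into an OsString on the other.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical mirror of the shape described above; not a Serde type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EncodedOsStr {
    Unix(Vec<u8>),
    Windows(Vec<u16>),
}

/// Capture the raw bytes of an OS string (Unix build).
#[cfg(unix)]
pub fn encode(s: &OsStr) -> EncodedOsStr {
    use std::os::unix::ffi::OsStrExt;
    EncodedOsStr::Unix(s.as_bytes().to_vec())
}

/// Capture the raw 16-bit units of an OS string (Windows build).
#[cfg(windows)]
pub fn encode(s: &OsStr) -> EncodedOsStr {
    use std::os::windows::ffi::OsStrExt;
    EncodedOsStr::Windows(s.encode_wide().collect())
}

/// Rebuild an OS string; None if the variant belongs to the other platform.
#[cfg(unix)]
pub fn decode(e: EncodedOsStr) -> Option<OsString> {
    use std::os::unix::ffi::OsStringExt;
    match e {
        EncodedOsStr::Unix(bytes) => Some(OsString::from_vec(bytes)),
        EncodedOsStr::Windows(_) => None,
    }
}

/// Rebuild an OS string; None if the variant belongs to the other platform.
#[cfg(windows)]
pub fn decode(e: EncodedOsStr) -> Option<OsString> {
    use std::os::windows::ffi::OsStringExt;
    match e {
        EncodedOsStr::Windows(units) => Some(OsString::from_wide(&units)),
        EncodedOsStr::Unix(_) => None,
    }
}
```

On a single platform, decode(encode(s)) gives back the original string unchanged, including any bytes or units that are not valid Unicode. Only the cross-platform case has no answer.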

Yes, but this means that non-UTF-8 paths simply can't be (de)serialized.

The idea to take a mostly-but-not-quite-UTF8 path on Unix, serialize it to some common representation, and deserialize that representation to a mostly-but-not-quite-UTF16 path on Windows, losslessly, and get a meaningful result, is at best quite dubious.

So, what should I do? We build our library on both Linux and Windows. Is there any approach that is compatible with both?

You can declare upfront that you support only paths that are valid UTF-8 and use a wrapper library like camino.

That's not an option, as files can be created with invalid UTF-8 names on Linux.

Yes, I believe that's possible on all major platforms, not only Linux.

But you can make a blanket statement about what your library/application supports. If you're making open source software, you are not obligated to support every use case out there.

It's perfectly reasonable to respond to users complaining about that with "please change your process so you don't end up with invalid UTF-8 in paths; I don't and won't support those".

The readme of camino has lots of justification and counter examples.

I read the readme of camino. But it is important for our library to support file names with invalid UTF-8, because we are working on an existing file system that cannot be changed. Is there no way to resolve this problem?

I don't think there is (a typesafe way) apart from serializing as a tagged union, as mentioned earlier.

You could write your own enum for serialization that has a Portable(String) variant for paths that are actually UTF-8 and a fallback variant for non-portable, OS-specific paths.
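
A minimal std-only sketch of that idea follows. The names WirePath, to_wire, and from_wire are hypothetical, and the choice of OsString as the fallback payload is an assumption; a real version would also need Serialize/Deserialize for the enum, which is left out here.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Hypothetical serialization-side representation of a path.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum WirePath {
    /// The path is valid UTF-8, so it is readable on any platform.
    Portable(String),
    /// The path is not valid Unicode; only meaningful on the platform
    /// that produced it.
    Native(OsString),
}

/// Pick the portable variant whenever the path is valid UTF-8.
pub fn to_wire(path: &Path) -> WirePath {
    match path.to_str() {
        Some(s) => WirePath::Portable(s.to_owned()),
        None => WirePath::Native(path.as_os_str().to_owned()),
    }
}

/// Turn either variant back into a PathBuf.
pub fn from_wire(wire: WirePath) -> PathBuf {
    match wire {
        WirePath::Portable(s) => PathBuf::from(s),
        WirePath::Native(os) => PathBuf::from(os),
    }
}
```

With this split, ordinary paths stay human-readable in formats like JSON, and only the unusual ones fall back to the OS-specific form.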

Remember it's fine to use OsStr if you won't serialize on one platform and deserialize on another. This should be a pretty safe bet if you're taking paths from the machine itself, as they are meaningless on other platforms!

If you do want to represent both exactly across platforms, though, you will need to implement Serialize/Deserialize on your own type and figure out what, for example, the Unix path b"\xff" means on Windows. Certainly it's not "\xff", as that's U+00FF, or ÿ, which is b"\xc3\xbf" in UTF-8.
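
The byte example above can be checked with the standard library alone. The helper names bytes_as_text and char_as_utf8 are made up for this illustration.

```rust
/// Try to read raw Unix path bytes as UTF-8 text.
/// Returns None when the bytes are not valid UTF-8.
pub fn bytes_as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// The UTF-8 encoding of a single character.
pub fn char_as_utf8(c: char) -> Vec<u8> {
    c.to_string().into_bytes()
}
```

Here bytes_as_text(b"\xff") is None, because a lone 0xFF byte is never valid UTF-8, while char_as_utf8('\u{ff}') yields the two bytes 0xC3 0xBF. The single byte and the character U+00FF are therefore different things, which is why there is no obvious cross-platform meaning for the former.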

So, for my use case, serialization and deserialization happen on the same platform, but that platform could be either Linux or Windows.
Could you show an example of handling this for both Windows and Linux? Do I have to write different serialization and deserialization methods for each platform?

Use the OsStr example above, as described by @H2CO3

See Field attributes · Serde

Written before you clarified you don't need paths between platforms, so you can ignore, but since I wrote it...


If you're converting paths between platforms for something humans look at, you're already assuming a common encoding -- presumably unicode. Otherwise the paths would already not make sense when sent cross-platform, to a human (excepting any common subset such as ASCII). If you have non-unicode paths, they're probably in some other encoding, EUC-KR or Shift JIS or something. If you know the encoding, you can perhaps still do something sensible. However dealing with encoding translations is generally quite the quagmire.

It's also possible such non-unicode paths don't represent anything in any standard encoding, and are effectively or literally just some sequence of bytes / words. [1] This is even more problematic as there's no standard mapping to unicode (or another common denominator); at least I'm unaware of one. [2] So you would probably make up some custom encoding that covers/converts between all possible paths across the set of platforms (the dubious approach mentioned by @trentj). Probably non-unicode paths will still look like trash on a separate platform (to a human).


  1. And/or you don't care about the human factor and are just wishing for something 1-to-1.

  2. And if there was a common approach, presumably serde or another popular crate would be supplying it.

But if I use OsStr, isn't that going to be platform-specific? In that case I would have to write different modules for each platform. Is there any way to generalize it? Example code:

use serde::{Deserialize, Serialize};
use std::path::PathBuf;

#[derive(Serialize, Deserialize)]
struct Demo {
    #[serde(with = "path_handling")]
    path: PathBuf,
}

mod path_handling {
    use super::*;
    use serde::de::Deserializer;
    use serde::ser::Serializer;
    use std::ffi::OsStr;
    // Unix-only extension trait: provides as_bytes() and from_bytes().
    use std::os::unix::ffi::OsStrExt;

    pub fn serialize<S>(p: &PathBuf, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        serializer.serialize_bytes(p.as_os_str().as_bytes())
    }
    pub fn deserialize<'de, D>(deserializer: D) -> Result<PathBuf, D::Error>
    where
        D: Deserializer<'de>,
    {
        let data = <&[u8]>::deserialize(deserializer)?;
        Ok(OsStr::from_bytes(data).into())
    }
}

This code will only work on Unix systems, right?

Serde implements Serialize/Deserialize for OsString; you can see the Serialize impl on docs.rs.

You shouldn't need to handle them differently for different platforms.
