I have a struct with a PathBuf as one of its fields:
struct A {
    pub first: bool,
    pub path: PathBuf,
}
The path can contain invalid UTF-8, since Linux allows creating files and folders whose names are not valid UTF-8.
I am using rmp_serde::to_vec_named() to get a Vec<u8> from the struct.
But when the path contains invalid UTF-8 characters, it fails with the error SerdeEncodeMspack(Syntax("path contains invalid UTF-8 characters")).
Is there any way to encode a struct containing invalid UTF-8 characters without skipping the field?
Thanks in advance
You don't have to change the data type stored in the field. It's sufficient to use #[serde(with = …)] and pass a custom function that calls .as_os_str() on the PathBuf and then serializes the returned OsStr.
Using .as_os_str() means that you can't deserialize on Windows a path serialized on Unix, and vice versa. OsStr is encoded as if it were enum OsStr { Unix(Vec<u8>), Windows(Vec<u16>) }, and deserializing the wrong variant for the current platform will return an error. Path, on the other hand, is serialized as a UTF-8 string, which can be deserialized on all platforms. It also means that OsStr is not human-readable even in human-readable formats like JSON, while Path is.
The idea of taking a mostly-but-not-quite-UTF-8 path on Unix, serializing it to some common representation, and losslessly deserializing that representation into a mostly-but-not-quite-UTF-16 path on Windows, and getting a meaningful result, is at best quite dubious.
Yes, I believe that's possible on all major platforms, not only Linux.
But you can make a blanket statement about what your library/application supports. If you're making open source software, you are not obligated to support every use case out there.
It's perfectly reasonable to get back to users complaining about that with "please change your process so you don't end up with invalid UTF-8 characters; I don't and won't support those".
The readme of camino has plenty of justification and counterexamples.
I read the readme of camino, but it is important for our library to support invalid-UTF-8 file names, as we are working with an existing file system that cannot be changed. Is there no way to resolve this problem?
You could write your own enum for serialization that has a Portable(String) variant for paths that are actually UTF-8 and a fallback variant for non-portable, OS-specific paths.
Remember it's fine to use OsStr if you won't serialize on one platform and deserialize on another. This should be a pretty safe bet if you're taking paths from the machine itself, as they are meaningless on other platforms!
If you do want to exactly represent both across platforms though, you will need to implement De/Serialize on your own type and figure out what, for example, the Unix path b"\xff" means on Windows. Certainly it's not "\xff", as that's U+00FF, or ÿ, which is b"\xc3\xbf" in UTF-8.
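To make that encoding point concrete, a quick self-contained check (plain std, nothing beyond the bytes already mentioned):

```rust
fn main() {
    // U+00FF ("ÿ") is two bytes in UTF-8, so a Rust String can never
    // hold the single raw byte 0xff that the Unix path b"\xff" contains.
    let y = "\u{ff}";
    assert_eq!(y.as_bytes(), b"\xc3\xbf");
    assert_eq!(y.len(), 2);
    // The raw byte 0xff on its own is not valid UTF-8 at all:
    assert!(std::str::from_utf8(b"\xff").is_err());
}
```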
So, for my use case, serialization and deserialization happen on the same platform, but that platform could be either Linux or Windows.
Could you maybe show an example of handling this for both Windows and Linux? Do I have to write different serialization and deserialization methods per platform?
Written before you clarified that you don't need paths to move between platforms, so you can ignore this, but since I wrote it...
If you're converting paths between platforms for something humans look at, you're already assuming a common encoding -- presumably Unicode. Otherwise the paths would already not make sense when sent cross-platform to a human (excepting any common subset such as ASCII). If you have non-Unicode paths, they're probably in some other encoding, EUC-KR or Shift JIS or something. If you know the encoding, you can perhaps still do something sensible. However, dealing with encoding translations is generally quite the quagmire.
It's also possible such non-Unicode paths don't represent anything in any standard encoding, and are effectively or literally just some sequence of bytes / words. [1] This is even more problematic, as there's no standard mapping to Unicode (or another common denominator); at least I'm unaware of one. [2] So you would probably make up some custom encoding that covers/converts between all possible paths across the set of platforms (the dubious approach mentioned by @trentj). Probably non-Unicode paths will still look like trash on another platform (to a human).
[1] And/or you don't care about the human factor and are just wishing for something 1-to-1.
[2] And if there were a common approach, presumably serde or another popular crate would be supplying it.
But if I use OsStr, isn't that going to be platform-specific? In that case I would have to write different modules per platform. Is there any way to generalize it? Example code:
use serde::{Deserialize, Serialize};
use std::path::PathBuf;

#[derive(Serialize, Deserialize)]
struct Demo {
    #[serde(with = "path_handling")]
    path: PathBuf,
}

mod path_handling {
    use super::*;
    use serde::de::Deserializer;
    use serde::ser::Serializer;
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;

    pub fn serialize<S>(p: &PathBuf, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        serializer.serialize_bytes(p.as_os_str().as_bytes())
    }

    pub fn deserialize<'de, D>(deserializer: D) -> Result<PathBuf, D::Error>
    where
        D: Deserializer<'de>,
    {
        let data = <&[u8]>::deserialize(deserializer)?;
        Ok(OsStr::from_bytes(data).into())
    }
}