How to safely store a Path/OsString in a SQLite database

I am writing an application that needs to store multiple paths (and other things, which are not relevant) in a SQLite database. The application will be cross-platform, which is why I am trying to save each path as a string in the OS encoding, without having to convert it to another encoding and risk losing or corrupting data. I am using sqlx, and all paths are Rust PathBufs.

What I tried:

  • At first I bluntly converted the path to a string and ignored all errors. But this loses data for any path that is not valid Unicode, so it is a bad idea.
  • After that I tried to save the OsString directly via SQL. This did not work, because sqlx cannot save an OsString, and I could not implement the necessary traits.

My last idea is to dump the raw OsString bytes and then read those bytes back into an OsString every time I need the path. But I don't really know how to start. Since it is not really possible to get the raw bytes of an OsString for implementation reasons, I googled and looked at some libraries (OsStr Bytes and bstr), but I don't know whether those are the right choice, as I lack experience with OsStrings.

How can I safely store those paths in my SQLite database? Is the last idea the right direction?

You will be forced to have two different implementations: one for Windows, and one for non-Windows (Unix).

For Unix, Rust allows you to get the raw bytes, and you can store them as a BLOB.

For Windows, you will need to store paths as potentially ill-formed UTF-16/UCS-2, just like Windows does. Or you can choose not to support Windows paths with invalid surrogates and convert the rest to UTF-8, which allows you to reuse the byte-oriented BLOB approach used for Unix.
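
To illustrate the second option, here is a minimal sketch of the "UTF-8 only" route, assuming you are willing to reject paths that are not valid Unicode. The function names `path_to_utf8_bytes` and `utf8_bytes_to_path` are made up for this example and are not part of any library.

```rust
use std::path::{Path, PathBuf};

/// Returns the path as UTF-8 bytes, or `None` if the path is not valid
/// Unicode (for example, a Windows path containing an unpaired surrogate).
pub fn path_to_utf8_bytes(path: &Path) -> Option<Vec<u8>> {
    path.to_str().map(|s| s.as_bytes().to_vec())
}

/// Rebuilds a path from bytes produced by `path_to_utf8_bytes`.
/// Returns `None` if the stored bytes are not valid UTF-8.
pub fn utf8_bytes_to_path(bytes: &[u8]) -> Option<PathBuf> {
    std::str::from_utf8(bytes).ok().map(PathBuf::from)
}
```

The caller has to decide what to do with a `None` (skip the path, report an error, and so on), which is the price of this simpler storage format.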

On Windows, you can convert the path to a Vec<u16>, and on other platforms you can convert it to a Vec<u8>. Storing the resulting vector will do. I don't think there's any standard lossless way to convert the Vec<u16> from Windows into a Vec<u8>, but you could come up with your own.

Both of those ideas sound great. If I convert the Windows path to UTF-8, would I always have to convert the path back in order to use it? And do I need to configure something in the SQL database, or can I directly save both Vec<u16> and Vec<u8> via sqlx?

Windows paths are not guaranteed to be valid Unicode, so you cannot always convert them to UTF-8.
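
As a small platform-independent illustration of why that is: a sequence of 16-bit units with an unpaired surrogate is not valid UTF-16, so a strict conversion fails and a lossy one throws information away. The helper names below are hypothetical; only `String::from_utf16` and `String::from_utf16_lossy` come from the standard library.

```rust
/// Decodes UTF-16 code units strictly; fails on unpaired surrogates.
pub fn wide_to_string_strict(wide: &[u16]) -> Option<String> {
    String::from_utf16(wide).ok()
}

/// Decodes lossily: every unpaired surrogate becomes U+FFFD, so the
/// original code units cannot be recovered afterwards.
pub fn wide_to_string_lossy(wide: &[u16]) -> String {
    String::from_utf16_lossy(wide)
}
```

For the units `[0x0041, 0xD800, 0x0042]`, the strict version returns `None` and the lossy one replaces the middle unit, so neither can be used for a lossless round trip.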

I would just convert the raw Vec<u16> to bytes using a specific endianness (likely little-endian), and store it simply as a BLOB. For instance:

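use std::borrow::Cow;
use std::ffi::OsStr;

#[cfg(unix)]
pub fn os_str_to_bytes(string: &OsStr) -> Cow<'_, [u8]> {
    use std::os::unix::ffi::OsStrExt;
    // On Unix an OsStr is already a byte sequence, so no copy is needed.
    Cow::Borrowed(string.as_bytes())
}

#[cfg(windows)]
pub fn os_str_to_bytes(string: &OsStr) -> Cow<'_, [u8]> {
    use std::os::windows::ffi::OsStrExt;
    // Each 16-bit unit becomes two bytes in little-endian order.
    let bytes: Vec<u8> = string.encode_wide().flat_map(u16::to_le_bytes).collect();
    Cow::Owned(bytes)
}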
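
The byte mapping itself does not depend on Windows, so it can be sketched and checked on any platform. The following is only an illustration with made-up helper names (`wide_to_le_bytes`, `le_bytes_to_wide`), working on plain `u16` slices instead of an OsStr:

```rust
/// Serializes 16-bit units as little-endian bytes (two bytes per unit).
pub fn wide_to_le_bytes(wide: &[u16]) -> Vec<u8> {
    wide.iter().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Inverse of `wide_to_le_bytes`. Returns `None` for an odd byte count,
/// which cannot have come from a sequence of 16-bit units.
pub fn le_bytes_to_wide(bytes: &[u8]) -> Option<Vec<u16>> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect(),
    )
}
```

Because every `u16` value maps to exactly two bytes and back, unpaired surrogates survive the round trip unchanged, which is what makes this lossless.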

Oh, that looks perfect. I wrote something similar, but yours is better, I think. Thank you all for the many quick replies :))

I made some errors in the code above; check the edit for a fix.

That is also true for Linux.
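
A quick Unix-only sketch of that point: file names on Linux are arbitrary byte sequences (apart from `/` and NUL), so an OsString can hold bytes that are not UTF-8. The function name `non_utf8_name` is just for illustration.

```rust
use std::ffi::OsString;

/// Builds a file name that is legal on Linux but is not valid UTF-8.
#[cfg(unix)]
pub fn non_utf8_name() -> OsString {
    use std::os::unix::ffi::OsStringExt;
    // 0xFF never appears in valid UTF-8, yet it is an allowed filename byte.
    OsString::from_vec(vec![b'f', b'o', 0xFF])
}
```

Calling `to_str()` on the result returns `None`, and `to_string_lossy()` replaces the offending byte, so converting such a name to a String is not lossless on Linux either.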

I think I have now implemented it correctly. It works, but can one of you confirm that this is the right way? I changed the database table to accept a BLOB. Now every time I save my data to or read it from the database, I convert it using these methods.
Is this the right way? Or should I remove the Windows "byte mapping" and just create separate methods for saving the bytes: one for Windows, which saves/reads a Vec<u16>, and one for Linux, which saves/reads a Vec<u8>?

Do you maybe also have tips on how to write tests for these methods, to check whether they actually convert correctly?

#[cfg(unix)]
pub fn os_str_to_bytes(string: &OsStr) -> Cow<'_, [u8]> {
    use std::os::unix::ffi::OsStrExt;
    Cow::Borrowed(string.as_bytes())
}

#[cfg(windows)]
pub fn os_str_to_bytes(string: &OsStr) -> Cow<'_, [u8]> {
    use std::os::windows::ffi::OsStrExt;
    let bytes = string.encode_wide().flat_map(u16::to_le_bytes).collect();
    Cow::Owned(bytes)
}

#[cfg(unix)]
pub fn bytes_to_os_str(bytes: &[u8]) -> OsString {
    use std::os::unix::ffi::OsStrExt;
    OsString::from_vec(bytes)
}

#[cfg(windows)]
pub fn bytes_to_os_str(bytes: &[u8]) -> OsString {
    use std::os::windows::ffi::OsStringExt;
    let wide: Vec<u16> = bytes
        .chunks_exact(2)
        .into_iter()
        .map(|a| u16::from_le_bytes([a[0], a[1]]))
        .collect();
    OsString::from_wide(wide.as_slice())
}

Your back-conversion uses from_ne_bytes(), which will not match to_le_bytes() on big-endian platforms. Also, when converting back from a blob on Unix, the code won't compile, because from_vec() takes a vector, not a slice (it takes ownership so as to avoid unnecessary clones).

Is from_le_bytes the better option?

Since you use to_le_bytes, you should also use from_le_bytes.
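
Putting the corrections from this thread together, a version of the two functions could look roughly like this. It is a sketch rather than a reviewed implementation: the Unix variant copies the slice into a Vec before calling from_vec(), the Windows variant uses from_le_bytes() to match to_le_bytes() and silently ignores a trailing odd byte, and `round_trips` is a hypothetical helper added only to make testing easier.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};

#[cfg(unix)]
pub fn os_str_to_bytes(string: &OsStr) -> Cow<'_, [u8]> {
    use std::os::unix::ffi::OsStrExt;
    Cow::Borrowed(string.as_bytes())
}

#[cfg(unix)]
pub fn bytes_to_os_str(bytes: &[u8]) -> OsString {
    use std::os::unix::ffi::OsStringExt;
    // from_vec() takes ownership of a Vec<u8>, so the slice is copied first.
    OsString::from_vec(bytes.to_vec())
}

#[cfg(windows)]
pub fn os_str_to_bytes(string: &OsStr) -> Cow<'_, [u8]> {
    use std::os::windows::ffi::OsStrExt;
    let bytes: Vec<u8> = string.encode_wide().flat_map(u16::to_le_bytes).collect();
    Cow::Owned(bytes)
}

#[cfg(windows)]
pub fn bytes_to_os_str(bytes: &[u8]) -> OsString {
    use std::os::windows::ffi::OsStringExt;
    // Must mirror to_le_bytes() above; a trailing odd byte is dropped.
    let wide: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    OsString::from_wide(&wide)
}

/// Hypothetical test helper: true if encoding and decoding gives back
/// exactly the same OsStr.
pub fn round_trips(original: &OsStr) -> bool {
    bytes_to_os_str(&os_str_to_bytes(original)).as_os_str() == original
}
```

For tests, one approach is to assert `round_trips` for an empty string, an ASCII path, a path with non-ASCII characters, and (behind the matching `cfg`) a platform-specific invalid-Unicode value such as a 0xFF byte on Unix or a lone 0xD800 unit on Windows. Note that blobs written on Windows and on Unix use different encodings, so a database file is not portable between the two.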
