Safe, panic-free way to convert &[u8] to &str

Hi,

is there a safe and panic-free way to:

  • given a &[u8] slice on input
  • find its longest prefix that is a valid utf-8 string
  • return that prefix as &str
  • if that prefix is shorter than the input slice, also return Utf8Error or something similar indicating the error

...?

I am imagining a signature like

fn foo(slice: &[u8]) -> (&str, Result<(), Utf8Error>)

or

fn foo(slice: &[u8]) -> Result<&str, (&str, Utf8Error)>

The best I have found so far is std::str::from_utf8, but in case of error it returns only the length of the valid prefix (via Utf8Error::valid_up_to), so I need to either

  • call from_utf8_unchecked introducing unsafe (and potential UB if I happen to make a bug there), or
  • call from_utf8(...).unwrap() introducing panic (that would not happen unless I make a bug, but a panicking code path would still be present)

I mean ... it's not a big deal; I will just use one of the above if I don't find anything better. But I have the impression that APIs like this (i.e. returning partial success safely, without the need to re-check) are the preferred way of doing things in Rust, so I suspect I am missing something obvious.

The (very) recently stabilized [u8]::utf8_chunks should do what you want here.
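A minimal sketch of how that could look, assuming a toolchain where `[u8]::utf8_chunks` is stable. The helper name `valid_prefix` and the `bool` flag are made up for illustration; they are not part of the standard library.

```rust
/// Hypothetical helper (not a std API): returns the longest valid UTF-8
/// prefix of `bytes`, plus a flag saying whether the whole slice was valid.
fn valid_prefix(bytes: &[u8]) -> (&str, bool) {
    match bytes.utf8_chunks().next() {
        // An empty slice yields no chunks at all.
        None => ("", true),
        Some(chunk) => {
            // The first chunk's valid part is the longest valid prefix.
            let prefix = chunk.valid();
            (prefix, prefix.len() == bytes.len())
        }
    }
}
```

There is no `unsafe` and no `unwrap` here: the `&str` comes straight out of the iterator, so the "already validated" fact lives in the type.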

Example.

I don't really understand your question. There are only three ways to do this

  • Either do it unchecked and risk UB
  • unwrap and risk a panic
  • or handle potential errors.

You already listed the first two above, and the third is just the same as the second except that you do actual error handling instead of calling unwrap(). How else would you like this to be solved?

Ahh, are you talking about how to retrieve the already confirmed checked part that Utf8Error.valid_up_to() indicates?
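For reference, that checked part can be recovered without `unsafe` and without a panicking path by validating the prefix a second time and falling back to an empty string. This is only a sketch: `checked_prefix` is a hypothetical name, and the fallback branch should be unreachable if `valid_up_to()` behaves as documented.

```rust
use std::str::from_utf8;

/// Hypothetical helper: longest valid UTF-8 prefix, obtained by re-checking
/// the first `valid_up_to()` bytes instead of trusting the integer.
fn checked_prefix(bytes: &[u8]) -> &str {
    match from_utf8(bytes) {
        Ok(s) => s,
        Err(err) => bytes
            // `get` returns `None` instead of panicking on a bad range.
            .get(..err.valid_up_to())
            // Second validation pass over the prefix.
            .and_then(|prefix| from_utf8(prefix).ok())
            // Not expected to happen; it replaces the panic with "".
            .unwrap_or(""),
    }
}
```

The cost is that the prefix is validated twice, which is exactly the re-check the original question wanted to avoid.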

Thanks, I did not know about this one, I will look into the details.

Yes, that's what I was talking about.

What I do not like about it is that it is just an integer and the fact that the prefix has been checked and confirmed valid is not encoded in the type system. While I understand that the Rust type system may not be able to express everything and some unsafe is sometimes necessary, here it seems easy.

Of course I could just do:

use core::str::{from_utf8, from_utf8_unchecked, Utf8Error};

// note: here Utf8Error contains some redundant info, but I wanted to keep the example small
fn my_from_utf8(slice: &[u8]) -> (&str, Result<(), Utf8Error>) {
    match from_utf8(slice) {
        Ok(output) => (output, Ok(())),
        Err(err) => {
            // SAFETY: docs say `slice[0..err.valid_up_to()]` was checked and is valid
            let prefix = unsafe { from_utf8_unchecked(&slice[0..err.valid_up_to()]) };
            (prefix, Err(err))
        }
    }
}

but I would expect something like that in standard library, so I was asking whether I missed it or it is really not there. (I have yet to check [u8]::utf8_chunks suggested by @BurntSushi)

Ah ok. Sorry that I misunderstood.

I presume here the major reason is that the error can't contain a &str of the valid prefix without being bound to the lifetime of the argument slice. And errors should not contain borrows because they're supposed to be passable back down the call stack.

But in general, yeah, Rust's valid-prefix-parsing story isn't super good. Often when writing ad-hoc parsers it would be nice to have something like

fn parse_prefix<'a, T: FromStrPrefix>(input: &'a str) -> Result<(T, &'a str), T::Error>

Yes, it looks like Utf8Error should store an actual reference to the valid(ated) prefix and then provide a method for extracting it.

Unfortunately, there's no way to do this now, because that would attach a lifetime to the error type, breaking backwards compatibility.

At this point all we could do is introduce a different method with a different error type that does contain a reference to the prefix.
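To make that idea concrete, here is a rough sketch of what such a separate function and error type might look like if written in user code. `PrefixError` and `from_utf8_with_prefix` are invented names, not existing or proposed std items, and the sketch assumes `[u8]::utf8_chunks` is available.

```rust
use std::str::{from_utf8, Utf8Error};

/// Hypothetical error type: borrows the validated prefix from the input,
/// so it carries the lifetime that `Utf8Error` itself cannot have.
#[derive(Debug)]
struct PrefixError<'a> {
    valid_prefix: &'a str,
    error: Utf8Error,
}

/// Hypothetical function: like `from_utf8`, but the error also hands back
/// the valid prefix as a `&str`.
fn from_utf8_with_prefix(bytes: &[u8]) -> Result<&str, PrefixError<'_>> {
    match from_utf8(bytes) {
        Ok(s) => Ok(s),
        Err(error) => {
            // The first chunk's valid part is the longest valid prefix;
            // an empty slice cannot reach this branch, but "" covers it anyway.
            let valid_prefix = bytes.utf8_chunks().next().map_or("", |chunk| chunk.valid());
            Err(PrefixError { valid_prefix, error })
        }
    }
}
```

The lifetime on `PrefixError<'a>` is the trade-off mentioned above: the error cannot outlive the input buffer.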

OK, so I have checked [u8]::utf8_chunks and it is also not ideal:

  • It does not distinguish between an invalid byte and the slice ending in the middle of a character
  • It has the "annoying behavior" of returning None from the first next() call on an empty slice

Of course both of these can be worked around, but unless something else turns up, I am concluding that my "dream function" does not exist in the standard library, so I will just build a solution I like.
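One possible shape for such a workaround, as a sketch only: take the prefix from `utf8_chunks` (falling back to `""` for the empty slice) and use `Utf8Error::error_len()`, which is `None` when the input ended unexpectedly, to tell truncation apart from an invalid byte. The `Stop` enum and `prefix_and_stop` are names made up for this example.

```rust
use std::str::from_utf8;

/// Hypothetical classification of why validation stopped.
#[derive(Debug, PartialEq)]
enum Stop {
    /// The whole slice was valid UTF-8.
    End,
    /// An invalid byte sequence follows the prefix.
    Invalid,
    /// The slice ends in the middle of a multi-byte character.
    Truncated,
}

fn prefix_and_stop(bytes: &[u8]) -> (&str, Stop) {
    // An empty slice yields no chunks, so fall back to "".
    let prefix = bytes.utf8_chunks().next().map_or("", |chunk| chunk.valid());
    let stop = match from_utf8(bytes) {
        Ok(_) => Stop::End,
        // `error_len()` is `None` when the input ended unexpectedly.
        Err(err) if err.error_len().is_none() => Stop::Truncated,
        Err(_) => Stop::Invalid,
    };
    (prefix, stop)
}
```

This walks the valid prefix twice (once per API), so it trades some speed for having no `unsafe` and no panicking path.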

Thank you all for your patience.
