Why is utf8_width::get_width_assume_valid declared as unsafe?

I just came across this crate, why is this function declared "unsafe" ?

I don't see a reason, but perhaps I am confused or missing something.

I don’t know either. It doesn’t seem to be able to produce UB, at least with its current implementation. In principle, such a function could add something like unreachable_unchecked calls, in hope that that can help the optimizer for subsequent handling of that byte… as it stands right now, it seems like violating the assumption that the byte is value would only really result in an incorrect, but UB-free, result.

3 Likes

Usually unsafe implies a risk of unsoundness, but I've seen some crates that colloquially use unsafe to indicate that a function doesn't protect you from logic bugs.

I'd personally discourage this use of unsafe, but it wouldn't be the first time I've seen it in the wild.

4 Likes

In this case, I think it's there to encourage you in the direction of get_width, which returns 0 for invalid first byte.

And as historical context, the 1993 version of UTF-8 predated the decision to define Unicode as codepoints U+0000 to U+10FFFF; it was able to encoder up to U+7FFFFFFF using 1 to 6 bytes, whereas today's UTF-8 uses 1 to 4 bytes. If you're trying to decode something that's definitely using UTF-8, but you don't know if it has invalid codepoints encoded in it, you could get into a mess with get_width_assume_valid, since it'll break on every 5 or 6 byte sequence.

You should never encounter "real" UTF-8 with 5 or 6 byte sequences, but I've encountered systems that did because they used the old definition of planes E0 to FF and groups 60 to 7F as private use, which involve 5 and 6 byte sequences. This all went away when Unicode said that to ensure that UTF-16 could be used for all characters, groups 1 through 7F were removed (so a single group only), and we'd only use planes 0 through 10 (17 planes).

The documentation of the function doesn't define what happens when you provide an invalid byte. As far as the documentation says, providing an invalid byte might launch nukes. Therefore, given that documentation, it is correct that the function is marked unsafe.

This gives the implementer freedom. The current implementation happens to return 4 for invalid bytes, but the documentation doesn't say it, and this could change with the next version, and it wouldn't be a breaking change.

Given that a safe version of this function exists, it's somewhat reasonable.

They could mark the function safe, but then they would have to specify what happens for invalid characters (e.g. "4" or "panic" or "an arbitrary usize" or "any of the above").

This is a little controversial -- some people will say that the ability for a function to do more-less anything it wants doesn't qualify as undefined behavior because unsafe is only about memory safety (so the nukes might launch, but the memory will be fine). I disagree that this makes sense -- I think the concept of "memory safety" is not even well defined in such a context. This is well explained in this talk.

4 Likes

Thanks for putting the alternative point of view. I think unsafe IS about memory safety ( in ways that are precisely defined here What Unsafe Can Do - The Rustonomicon ). So I think it would be reasonable for a lint to give a warning "unsafe not used". But I think I can now see why it was marked unsafe though - one function is like putting black and yellow tape around a hazard, it is saying "watch out!". So it is not exactly wrong, but I think it is arguably "abusing" the unsafe system, in a potentially confusing way, and someone can reasonably argue it is "poor style".

It's really hard to argue against that in a world where I/O safety is a thing.

But it would be nice if crate included section where it explains what additional safety invariants users of that crate unsafe interface are supposed to uphold.

This page says something along the lines: "Don't do the following things in your code. If you do any of the following things, Rust will no longer guarantee memory safety. If you don't do any of these things, Rust will guarantee memory safety". But what this "memory safety" actually means is a whole different story and that is not specified on that page. It's complicated.

I understand what you mean about limiting unsafe to memory safety, that makes it easier to understand. But it seems that in Rust it is also used when an invariant is assumed (by calling an unsafe method) rather than being explicitly enforced by that method. Memory safety might be violated when the invariant is later relied on, but is not actually true.

One example in the Rust std lib is converting a utf8 u8 slice to a str slice. The memory representation is identical, but the str slice has an invariant (that the utf8 sequences are valid.) Calling from_utf8_unchecked does not enforce the invariant, and is therefore unsafe, unlike from_utf8, which does check the invariant.

When using the unsafe from_utf8_unchecked, memory safety could later be compromised by indexing out of bounds when using the resulting str slice and assuming the invariant is true, meaning that certain bounds checking can be omitted. For example, the length of a utf8 sequence could be calculated from the first byte, and then used to access the following bytes, up to the length, without performing bounds checking.

The same logic could be applied to use of get_width_assume_valid, if you call it and later assume the width is correct and can be used to calculate byte indexes that can be used without a bounds check.

That's my understanding of this issue anyway. If true, it would be nice if this were more clearly spelled out.

1 Like

It's just weird it me that it implements that the way it does. If it's going to be unsafe, rather than just undocumented with a longer name, it should call unreachable_unchecked() in the disallowed path. Because otherwise what's the point?

I agree that unreachable_unchecked() is better. Perhaps the author didn't realize.

The code looks as if it was first written / imagined with unreachable_unchecked() and then manually optimized to avoid a branch. So the same optimization that the compiler would probably do anyway.

It's better to leave it to the compiler because it can potentially optimize even more if the function is inlined.

But note that it's the fact that the function is (properly) marked unsafe that now allows implementing this optimization without breaking users.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.