Whose job is it to validate UTF-8 strings?

I am writing some code to read some bytes from a file that I'm going to build a struct from, with String fields. For example:

struct Simple {
    hello: String,
}

impl Simple {
    fn new(hello: &str) -> Self {
        Self { hello: String::from(hello) }
    }

    fn hello(&self) -> &str {
        &self.hello
    }
}

The question is whose job is it to make sure the strings represent valid UTF-8? This is a real check I need to perform. I assume when reading the file, I will have a &[u8] and can use String::from_utf8, which returns a Result, which is good.
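For reference, here is a minimal sketch of the two standard-library checks that apply to raw bytes. The helper names `borrow_as_str` and `into_string` are made up for illustration. Note that `String::from_utf8` takes an owned `Vec<u8>`, while `std::str::from_utf8` is the one that accepts a `&[u8]`.

```rust
use std::str::Utf8Error;
use std::string::FromUtf8Error;

/// Borrowing check: validates the bytes in place and returns a view into them.
/// (Hypothetical helper name; it only wraps `std::str::from_utf8`.)
fn borrow_as_str(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Owning check: consumes the buffer and reuses its allocation on success.
/// (Hypothetical helper name; it only wraps `String::from_utf8`.)
fn into_string(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}
```

Either way the failure shows up as an `Err`, so the check happens exactly once, at the point where bytes become text.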

If that was all there was to it, I would be good, and Simple::new wouldn't need to do anything extra. But, I will also allow people to manually construct a Simple passing in a string, probably a &str. The question is should Simple::new also check for a valid string, given a &str, and how would I do that? Or should I have a safe and unsafe version of new?

Also, is there any benefit to returning &String in the Simple::hello() method? Does that provide any sort of guarantee that the string is valid UTF-8 over just returning a &str?

Finally, and off-topic, is there a concept of taking multiple &str's, building a single, owned, read-only String from them all concatenated and then returning a tuple of &strs into their offsets into the one, big String? I doubt it provides much of a benefit over allocating separate strings, but if it's normal and easy, I'd appreciate guidance.

It is undefined behavior to construct a str or String with non-UTF8 contents.

Yeah. I want to detect it rather than letting it go through. I saw String::from(&str) just bypasses checks, so I want something like String::from_utf8 that takes a &str rather than a &[u8]. Mostly I am wondering where that check should go from an interface design perspective.

I also realize there's str::as_bytes. :smile:

There isn't a need to check since both str and String are guaranteed to contain valid utf-8.
Why do you think you need to re-validate the string? Since there's no safe way to construct a string with invalid utf-8, I don't see any reason to re-validate it.

The point is that if you have a &str that isn't UTF8,[1] you've already hit UB (some other code already failed at their job). There's no guaranteed way to recover from UB. You're beyond the event horizon. Therefore...

If they pass a &str, you don't need a check.

If they pass a &[u8], use the same method you use for file data.
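A rough sketch of how that split could look on the struct from the question. The `from_bytes` constructor is a hypothetical addition and not something the original code had.

```rust
use std::str::Utf8Error;

struct Simple {
    hello: String,
}

impl Simple {
    /// A `&str` is already guaranteed to be valid UTF-8, so this cannot fail.
    fn new(hello: &str) -> Self {
        Self { hello: String::from(hello) }
    }

    /// Hypothetical constructor for raw bytes (file data or caller-supplied).
    /// This is the only place where validation is needed.
    fn from_bytes(bytes: &[u8]) -> Result<Self, Utf8Error> {
        let hello = std::str::from_utf8(bytes)?;
        Ok(Self::new(hello))
    }

    fn hello(&self) -> &str {
        &self.hello
    }
}
```

With this shape there is no need for a separate unsafe constructor: callers holding a `&str` use `new`, and anyone holding bytes has to go through the fallible path.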

There's no benefit and increased indirection, so just return a &str.

You can store each substring as an offset and length, but I wouldn't go this route unless I had a good reason to do so.
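If you did want to try it, one possible shape is below. `Packed` and its methods are invented names for this sketch, and it stores byte ranges rather than returning `&str`s directly, since a struct cannot easily hold references into its own `String`.

```rust
use std::ops::Range;

/// Hypothetical container: one owned buffer plus the byte range of each piece.
struct Packed {
    buffer: String,
    spans: Vec<Range<usize>>,
}

impl Packed {
    /// Concatenates all parts and records where each one starts and ends.
    fn new(parts: &[&str]) -> Self {
        let mut buffer = String::new();
        let mut spans = Vec::with_capacity(parts.len());
        for part in parts {
            let start = buffer.len();
            buffer.push_str(part);
            spans.push(start..buffer.len());
        }
        Self { buffer, spans }
    }

    /// Returns the piece at `index`, or `None` if the index is out of range.
    fn get(&self, index: usize) -> Option<&str> {
        let span = self.spans.get(index)?;
        // `str::get` re-checks the boundaries instead of panicking.
        self.buffer.get(span.clone())
    }
}
```

Because every range is recorded at the moment a whole `&str` is appended, the ranges always land on character boundaries. The trade-off is extra bookkeeping for one allocation instead of several.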


  1. and you're not in the unsafe block that created it

I guess I am just confused then. Good???

I guess what threw me off is that in the Rust book they talk about indexing into a string throwing exceptions: Storing UTF-8 Encoded Text with Strings - The Rust Programming Language. I think what they're saying (correct me if I'm wrong) is that the original string being sliced is guaranteed to be valid UTF-8, but the slice indexes are u8 offsets, not necessarily character offsets, so if the specified range doesn't capture full characters (or whatever you call them), it panics.

Sounds like I just want to take &str and return &str. That simplifies a lot. Thanks for clearing that misunderstanding up!

It panics because it would be UB to expose the "partial characters" in a &str -- because that would not be valid UTF8!

I.e. the panicking is enforcing the UTF8 invariant for &str, not violating it.
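To make the byte-offset point concrete, here is a small sketch that lists which offsets are legal slice points. The function name `boundaries` is made up; `str::is_char_boundary` is the standard-library check.

```rust
/// Lists every byte offset at which `text` can be sliced without panicking.
/// (Hypothetical helper built on `str::is_char_boundary`.)
fn boundaries(text: &str) -> Vec<usize> {
    (0..=text.len())
        .filter(|&i| text.is_char_boundary(i))
        .collect()
}
```

For a string like "aéb", where "é" takes two bytes, offset 2 is missing from the result, so slicing at that offset with the bracket syntax would panic rather than hand out half a character.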

I.e. the panicking is enforcing the UTF8 invariant for &str, not violating it.

I would say it doesn't matter (for an end user) if the panic is enforcing rather than violating the invariant -- for an end user a panic is always just bad, right? (even if it's better than a segfault or an exploitable buffer overflow or so). I mean, as a developer, it's nice that we can sort of rely on panics to happen if they need to happen, but we do want to avoid them... :slight_smile:

Coming back to the original question:

is there a concept of taking multiple &str's, building a single, owned, read-only String from them all concatenated and then returning a tuple of &strs into their offsets into the one,

@jehugaleahsa It seems to me that this is something you don't want to do, since you'd be taking over management of UTF-8 validity, which normally is done for us, as others pointed out. If you start tracking those offsets, a little off-by-one bug could easily make almost all of those byte strings invalid UTF-8 strings (and those bugs can be very hard to find).
But I wonder why you considered this idea? (Are you perhaps dealing with binary files or files that are encoded in old non-unicode codecs and want to figure out at load time if you need to convert them?)

I disagree that the "UB or panicking bug" distinction doesn't matter to the end user. In addition to exploits (already something end users care about), a program that hits UB can continue to run and do anything, including appearing to be functional while trashing your project or whatever.

Preventing UB (and even unsoundness) is a cornerstone of Rust's goals and culture.

Of course - I agree with that. But I thought that the issue really was - "How can I avoid errors when dealing with string slices". The panic was only mentioned, I thought, as one possible undesirable consequence of bugs. If you'd tell an end user: "Rust is so marvelous. Be glad our Rust code panicked! We protected you from being hacked by a zero-day!" then would they feel happy?

You are missing the point. UB is beyond any discussion on error handling. It's not about surfacing the errors to the user... or not doing so. Once you hit UB, there's nothing you can reliably expect from your program.

If you're curious, I am working with SAS XPORT files, which is this old semi-open/proprietary binary format. It's one of those 7-bit ASCII formats where people force UTF-8, etc., but it's easy for them to accidentally set the length of variables in bytes, not characters, or copy-paste a smart quote or long dash from Word/Excel and trash the file. We try to be smart and fall back on Latin-1 if UTF-8 doesn't work on a character-by-character basis, unless an encoding is specified explicitly.
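As a rough illustration of that kind of fallback in Rust (not the actual Java logic, which isn't shown here), the sketch below decodes as much valid UTF-8 as it can and treats each offending byte as Latin-1. The function name is invented, and restarting validation after every bad byte is an assumption made for simplicity, not for speed.

```rust
/// Hypothetical decoder: valid UTF-8 is kept as-is, and any byte that is not
/// part of a valid sequence is interpreted as a single Latin-1 character.
fn decode_with_latin1_fallback(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    let mut rest = bytes;
    loop {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Everything before `valid_up_to` is known-good UTF-8.
                let (valid, invalid) = rest.split_at(err.valid_up_to());
                out.push_str(std::str::from_utf8(valid).expect("validated prefix"));
                // Latin-1 bytes map one-to-one onto the first 256 code points.
                out.push(char::from(invalid[0]));
                rest = &invalid[1..];
            }
        }
    }
}
```

The result is always a valid `String`, so everything downstream of the decoder can stop worrying about encodings. A file with many bad bytes gets re-validated repeatedly, so a real implementation would probably want something smarter.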

I have an existing implementation in Java and am trying to learn Rust by converting the code. I've read the Rust book 2+ times now, plus the O'Reilly book, and it just doesn't stick because I am not using it. I also want to see if it's significantly faster without all the allocations.

While parsing the binary file, I plan on building up structs for the different sections of the file. The plan is for each section that's parsed to return a Result<T, _> to handle bad formats and bubble that up. In Java, it's just exceptions, and who knows if we're handling everything.

I had to implement an IBM hexadecimal floating point class: https://en.wikipedia.org/wiki/IBM_hexadecimal_floating-point, found a crate I tried to contribute to, and ended up writing my own for now. That was a really fun exercise, but also kind of painful. :smile:

If I can get the reader working, I'll spend some time building a writer, too. I'm curious how the parsing will go as far as I/O is concerned - it will involve pushing and pulling from buffers, so always ends up being a pain. I might make it open source if it's decent and I feel motivated. Putting stuff out there is a good learning exercise, too, like doing docs, setting up CI/CD, publishing to repositories, etc.

If that was what the OP meant, it wasn't clear to me. But in that case, no one has actually answered that part of the question:

If panicking on str indexing would be a bug, and you're not certain your indexing will never panic, the solution is to use get and handle your error conditions appropriately.
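A short sketch of what that can look like. Both helper names are hypothetical; `str::get` and `str::is_char_boundary` are the real standard-library methods.

```rust
/// Returns the first `max_bytes` bytes of `text`, or `None` if the cut would
/// land inside a character or past the end. (Hypothetical helper.)
fn prefix(text: &str, max_bytes: usize) -> Option<&str> {
    text.get(..max_bytes)
}

/// Alternative policy: back off to the nearest earlier character boundary
/// instead of reporting a failure. (Hypothetical helper.)
fn prefix_lossy(text: &str, max_bytes: usize) -> &str {
    let mut end = max_bytes.min(text.len());
    // Offset 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}
```

Which policy is right depends on the data: `prefix` lets the caller decide what a bad offset means, while `prefix_lossy` quietly drops the partial character.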

If you agree with my last post, I really don't see what you're getting at here, or why you said the distinction didn't matter in the first place. I'm not saying unanticipated panics make the end user happy. I'm saying it's better than UB -- including, yes, better for the end user as well as the developer, whether they acknowledge or understand that or not.

I think we don't have a real disagreement - My point was just very trivial: a panic may be better than UB, but that doesn't mean that a panic should not be avoided if possible.

Wow - So, I can imagine this could become a very useful tool - perhaps also for processing older files in Asian languages. So, basically you have a real need to be super defensive against user input, plus a need to try to have some kind of self-correcting parser...

Yes, but what does this have to do at all with your question?

  • str is always valid UTF-8. So is String. Neither conversion will panic (except if you run out of memory when allocating a large String, which is not something you can do anything about, except instruct the user to buy more silicon).
  • slicing a string necessarily involves a runtime check, because not every byte index is a valid UTF-8 code point boundary. You can use non-panicking methods such as get() instead of the bracket syntax if you want to handle potential errors yourself instead of panicking.
  • panicking is still better than UB, which means that panicking in the presence of a violated invariant is better than not panicking and causing UB.

Which of the three points above do you disagree with? Why? Why are we even discussing such trivial things?

How could I disagree with any of those points? The first two are facts. The third does involve a value judgement - a hacker for instance might not necessarily agree with it as far as other people's code is concerned - but I do.
As to the first two facts - those are not given knowledge for beginners - and that's what set this thread in motion, before we got sort of side-tracked about relative merits of panicking and not panicking.

The rule that str and String are guaranteed valid UTF-8 is an example of type-driven design:

https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/

By enforcing validity on the type itself, your business logic can assume that its inputs are valid, without having to validate them over and over again.
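The same idea carries over to your own types. As a sketch, assuming a format whose names must be 7-bit ASCII, a newtype can do the check once at construction. `AsciiText` and `pad_to_width` are hypothetical names for illustration.

```rust
/// Hypothetical newtype: a string that is known to be 7-bit ASCII.
struct AsciiText(String);

impl AsciiText {
    /// The only way to build an `AsciiText`; the check happens here, once.
    fn parse(text: &str) -> Option<Self> {
        if text.is_ascii() {
            Some(Self(text.to_owned()))
        } else {
            None
        }
    }

    fn as_str(&self) -> &str {
        &self.0
    }
}

/// Pads on the right with spaces up to `width` (does not truncate).
/// Because the input is ASCII, the character count that `format!` pads by
/// equals the byte count, so no re-validation is needed here.
fn pad_to_width(name: &AsciiText, width: usize) -> String {
    format!("{:<width$}", name.as_str(), width = width)
}
```

Code that takes an `&AsciiText` never has to ask whether the contents are ASCII, in the same way that code taking a `&str` never has to ask whether it is UTF-8.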