Error handling and UTF-8 validation

Rust's compiler lints like "warning: variable does not need to be mutable" and "warning: value assigned to `<variable_name>` is never read" lead to constructs like:

    let vsttrpt: Vec<char> = String::from_utf8(text.to_vec()).unwrap().chars().collect();

as in this Prototype:

Unfortunately, the use of this so-called "language sugar", the ? operator and unwrap(), detracts from the robustness of the Rust application.
In an imperfect real-world use case you might actually find invalid UTF-8 in the text (perhaps one stray, ill-formed byte).
Like trying to work with some internet content like in

$ wget -S -O - "http://www.lanzarote.com/de/ausfluge" 2>&1 |target/release/text-sanitizer -d

which then leads to unexpected panics in the Rust runtime:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: Some(1) }', src/libcore/result.rs:997:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
   1: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:70
             at src/libstd/sys_common/backtrace.rs:58
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:200
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:215
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:478
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:385
   6: rust_begin_unwind
             at src/libstd/panicking.rs:312
   7: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
   8: core::result::unwrap_failed
   9: sanitizer_lib::sanitizer::sanitize_u8
  10: text_sanitizer::main
  11: std::rt::lang_start::{{closure}}
  12: std::panicking::try::do_call
             at src/libstd/rt.rs:49
             at src/libstd/panicking.rs:297
  13: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:87
  14: std::rt::lang_start_internal
             at src/libstd/panicking.rs:276
             at src/libstd/panic.rs:388
             at src/libstd/rt.rs:48
  15: main
  16: __libc_start_main
  17: <unknown>

Imagine a Rust web spider that would crash on every wrongly encoded web page.
(Ironically this tool was meant to sanitize text to make parsing it easier.)

So on this side, the overuse of Result objects in the Rust libraries makes working with the language complicated, slow and tiresome.

1 Like

There are a lot of methods you could have used instead of unwrapping the result from String::from_utf8() if you want to avoid a panic on invalid UTF-8. Some options off the top of my head:

  • Use String::from_utf8_lossy()
  • Change the function signature to return Result<String, Box<dyn Error>> and use the ? operator to safely unwrap or bubble the error to the caller.
  • Stick with Vec<u8> and handle the decoding yourself, e.g. with the encoding_rs crate.
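For illustration, the first two options might look like this minimal sketch (the `sanitize` function name is made up for the example):

```rust
use std::error::Error;

// Hypothetical helper name; bubbles the UTF-8 error up instead of panicking.
fn sanitize(text: &[u8]) -> Result<Vec<char>, Box<dyn Error>> {
    // `?` converts the FromUtf8Error into a Box<dyn Error> for the caller.
    let s = String::from_utf8(text.to_vec())?;
    Ok(s.chars().collect())
}

fn main() {
    // Valid input succeeds; invalid input becomes an Err instead of a panic.
    assert!(sanitize(b"abc").is_ok());
    assert!(sanitize(&[0xf0, 0x9f]).is_err());

    // The lossy variant never fails: invalid bytes become U+FFFD.
    let lossy = String::from_utf8_lossy(&[b'a', 0xf0, b'b']);
    assert_eq!(lossy, "a\u{fffd}b");
}
```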

I also noticed some other suspicious things elsewhere in that example, which while off-topic does hint at deeper problems with an understanding of the language:

//Free the STDIN Handle
ostdin = None;

This isn't needed because:

  1. This doesn't actually free anything. The std::io::Stdin type fits into a single pointer-sized slot on the stack. It is backed by an Arc smart pointer providing shared ownership.
  2. The example doesn't read the value of ostdin after this point. So normally the compiler will automatically drop it here. At worst, you are wasting a CPU cycle to write a 0 over the pointer in the stack slot. And another CPU cycle to maintain the unused Option on the stack. Not to mention the 8-bytes of stack space required to store it. All of this gets optimized away without this unused write.
  3. Even if you tried to use it later, you would be unable because the Stdin pointer has already been moved out of the Option by the match expression.
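For what it's worth, here is a minimal illustration of point 2: explicit early drops are rarely needed, but when one actually matters (a Mutex guard, unlike Stdin, has an observable release), `drop()` is the idiomatic way:

```rust
use std::sync::Mutex;

fn main() {
    let m = Mutex::new(0);
    let guard = m.lock().unwrap();
    // Overwriting with None is unnecessary; if an early release really
    // matters (it does for a lock guard), call drop() explicitly:
    drop(guard);
    // Without the drop above, this second lock() would deadlock or panic.
    *m.lock().unwrap() += 1;
    assert_eq!(*m.lock().unwrap(), 1);
}
```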
7 Likes

Thank you for your insights on std::io::Stdin. Actually I could not extract that information from the official documentation at:

I was also able to figure out how to use Cow<'a, str> in this use case, so I changed it to:

      let vsttrpt: Vec<char> = String::from_utf8_lossy(text).to_mut().chars().collect();

Still, the text-processing utility is unable to process the web page:

$  wget -S -O - "http://www.lanzarote.com/de/ausfluge" 2>&1 |target/debug/text-sanitizer
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: Some(1) }', sanitizer-lib/src/sanitizer.rs:85:28

This points to the very heart of the text-processing logic at:

                let spec = str::from_utf8(&text[icstrt.unwrap()..icend.unwrap()]).unwrap();

I think it's not specifically noted because there are no Rust structs which need manual deallocation, and all values are properly dropped when they go out of scope. It's more a fact of the language than of any particular structure, such as stdin.

I haven't looked at all of the logic, but have you tried replacing that str::from_utf8 with String::from_utf8_lossy? It seems like the website just isn't UTF8, so anywhere you're converting its contents from bytes to a string needs to account for that. Or converting the bytes to a string before that loop using the same function?

1 Like

I think the real problem with Rust's error handling concepts can be illustrated by the example taken from the official documentation of std::string::String::from_utf8 at:

When the input is not correct, the whole Rust application crashes because of a single wrong byte:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [240, 159, 119, 150], error: Utf8Error { valid_up_to: 0, error_len: Some(2) } }', src/main.rs:6:25

So when you need to know what the wrong random input was and where it came from, a developer used to working with exceptions tries to undo the unwrap() call to get access to the original data:

But this attempt fails, causing some headaches:

   Compiling playground v0.0.1 (/playground)
error[E0382]: borrow of moved value: `vsparkle_heart`
 --> src/main.rs:8:49
  |
3 |     let vsparkle_heart = vec![240, 159, 119, 150];
  |         -------------- move occurs because `vsparkle_heart` has type `std::vec::Vec<u8>`, which does not implement the `Copy` trait
...
6 |     let ssparkle_heart = match String::from_utf8(vsparkle_heart) {
  |                                                  -------------- value moved here
7 |         Ok(s) => s,
8 |         Err(_) => String::from(format!("{:x?}", &vsparkle_heart)),
  |                                                 ^^^^^^^^^^^^^^^ value borrowed here after move

In match String::from_utf8(vsparkle_heart) { Err(_) => ... }, where has vsparkle_heart been moved to if the conversion failed?
And how is this error recoverable if the original data is lost?

After struggling with it for some days without any solution, I think I got the right hint from other developers in the forum.
The answer is:
If you still need the original data, or are not sure that your data is 100% valid, don't use String::from_utf8() to check it.
It gives you an error, but the original data is lost.
Use str::from_utf8() instead:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 1.15s
     Running `target/debug/playground`
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `"💖"`,
 right: `"[f0, 9f, 77, 96]"`', src/main.rs:18:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Standard Output

[f0, 9f, 77, 96]
  1. So the real problem with Rust's error handling is that nothing actually works like you expect it to.
  2. The long way around these limitations results in lower performance in comparison with other compiled languages.

The original author of this question was looking for the performance loss in the assembly code produced by the compiler.
But the truth is that applications become slow when they do slow things.

In this case I need to double-check all the data to get a reliable Rust application.

1 Like

The data there isn't lost, it is in the error value, which you are throwing away without using. In the error case String::from_utf8() returns a FromUtf8Error, which provides methods to get the original data (either as a borrowed slice, or by consuming the error to get an owned value). It also provides a means to get a Utf8Error, which gives more information about where the error occurred, so you can add a custom replacement character or just use the valid prefix of the string (or some more elaborate handling). The common case of just replacing invalid bytes with the Unicode replacement character is more conveniently provided for by String::from_utf8_lossy().
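A minimal sketch of that, using the invalid bytes from the earlier playground example:

```rust
fn main() {
    // The invalid bytes from the earlier playground example.
    let bytes = vec![240, 159, 119, 150];
    match String::from_utf8(bytes) {
        Ok(s) => println!("valid: {}", s),
        Err(e) => {
            // The original data is still available, borrowed...
            assert_eq!(e.as_bytes(), &[240, 159, 119, 150]);
            // ...together with where the error occurred:
            assert_eq!(e.utf8_error().valid_up_to(), 0);
            assert_eq!(e.utf8_error().error_len(), Some(2));
            // ...or owned again, by consuming the error:
            let original: Vec<u8> = e.into_bytes();
            assert_eq!(original, vec![240, 159, 119, 150]);
        }
    }
}
```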

8 Likes

Please, stop accusing the language of being full of design errors while it is obvious that you have not studied it enough to understand its idioms. I'm sorry to say this, but in this case it's exactly your code that is full of unnecessary epicycles, and you should learn more about Rust best practices before complaining that there is something wrong with the language itself.

You don't have to crash the program. You don't have to use unwrap() on errors. You don't have to ignore the value inside an Err variant. You don't have to convert every slice to an owned Vec. In fact you shouldn't do any of this if you want to write good-quality code.

Your sanitizer code is easy to rewrite without crashing on invalid UTF-8, without any spurious need for error handling, and without any need for "double checking" all the data. Here it is, I just tried it with the same command line that you provided and it worked.

Please, have some respect towards the designers, compiler writers, and experienced users of the language, and don't go on asserting that everyone else is at fault, except you.

12 Likes

Thank you.

Your code has many interesting hints that I was unable to find even after searching the official documentation up and down.
I did not dare to use

        let c = char::from(c);

since the Documentation states:

The char type represents a single character. More specifically, since 'character' isn't a well-defined concept in Unicode, char is a 'Unicode scalar value', which is similar to, but not the same as, a 'Unicode code point'.

Which meant in my mind u8 != char

Actually, according to the example of

the expected result of the application should be:

#![allow(unused)]
fn main() {
println!("{}", '❤'.escape_unicode());
}

\u{2764}

but the actual Result is:

$ echo "💖"|target/release/text-sanitizer_v2
\u{f0}\u{9f}\u{92}\u{96}

And not as required:

$ echo "💖"|./text-sanitizer.pl
<3

The application was required to recognize the related bytes and translate them to a clean ASCII code.

So your code cannot crash, since it does not even try to recognize the UTF-8 code point, but it also cannot produce the required result.

Also, the hint about the FromUtf8Error was very interesting, but it will complicate this simple task even more.

Once more, nothing in Rust is like a developer would expect it to be.

Of course char is not a u8. As the documentation correctly states, char is any Unicode scalar value, which means that its possible values are a strict superset of the possible values of a u8; hence, all u8s can be converted to a char, but not vice versa. The from method is a conversion, not an identity operation.
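A minimal sketch of that asymmetry, for the record (char::from(u8) is a Latin-1 style widening, not UTF-8 decoding):

```rust
fn main() {
    // Every u8 converts infallibly to char (it maps to U+0000..=U+00FF);
    // this is not UTF-8 decoding.
    assert_eq!(char::from(119u8), 'w');
    assert_eq!(char::from(0xf0u8), 'ð');

    // The reverse direction is fallible, because not every u32 is a
    // Unicode scalar value:
    assert_eq!(char::from_u32(0x2764), Some('❤'));
    assert_eq!(char::from_u32(0xD800), None); // surrogate, not a char
}
```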

This does not make sense. If your input is not valid UTF-8, you can't expect to operate on UTF-8 code units. You have to decide between two possibilities:

  • Case 1: you require the input to be valid UTF-8, and you simply want to escape all non-ASCII code points using an ASCII-friendly representation. In this case, supplying non-UTF-8 text to the application is an error which must be handled.
  • Case 2: you don't require that the input be valid UTF-8. In this case, however, you can't expect any programming language to guess what you want to do. You can still escape all non-ASCII bytes, without any assumptions on the encoding – this is what the code I posted does.
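A minimal sketch of case 2, assuming nothing about the encoding (the `escape_non_ascii` name and the `\x..` output format are illustrative choices, not the posted code):

```rust
// Hypothetical helper: escape every non-ASCII byte without assuming any
// encoding; the "\x.." output format is an illustrative choice.
fn escape_non_ascii(input: &[u8]) -> String {
    let mut out = String::new();
    for &b in input {
        if b.is_ascii() {
            out.push(char::from(b));
        } else {
            out.push_str(&format!("\\x{:02x}", b));
        }
    }
    out
}

fn main() {
    // The four bytes from the playground example: a lone ASCII 'w'
    // surrounded by bytes that form no valid UTF-8 sequence.
    assert_eq!(escape_non_ascii(&[240, 159, 119, 150]), "\\xf0\\x9fw\\x96");
}
```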

Since in your original code, you were only differentiating between ASCII and non-ASCII, and you complained about having to provide UTF-8, I assumed case 2. But if your input is neither of these, what is it, then? What do you want to do when you have valid UTF-8 followed by some non-UTF-8 text in the same string? What should happen?

Your expectations arise out of an insufficient understanding of Unicode, and/or an insufficiently detailed and accurate explanation of your requirements. Rust doesn't have anything to do with this fact.

2 Likes

I don't understand.
When I run your main()... example above it does indeed produce the expected result you have there, namely "\u{2764}"

Also, I have no idea what that output means. If that is what's required, what is it?

Emphasis in the quote by me.

Could you elaborate on that a bit? I don't follow. In your first playground example you have "vec![240, 159, 119, 150];", which would be a UTF-8 sparkle heart except that it has a 119 where it should have a 146.

So it's invalid as any kind of Unicode character.

What about "clean ASCII" ? If we convert in manually we get:

240 not ASCII
159 not ASCII
119 = "w"
150 not ASCII

So it's a clean ASCII "w" with some junk around it. I guess we could escape them somehow, maybe C-style: "\xf0\x9fw\x96"

What is the actual requirement here?

1 Like

It crashes because you took the easy way out by using .unwrap() rather than properly handling the error when it occurs.
So no offense but that's an issue with the code as you wrote it, not with Rust in general.

This is important enough to bear repeating: .unwrap() is not the same as properly handling an error. It is in a sense the opposite, what you are doing is telling the compiler to abort the program if the Result/Option is not an Ok/Some variant. So in effect you are taking responsibility for the Ok/Some-ness of that value. And then you feed it illegal data.

That's easy: clone the input string beforehand, and you can try again as much as you want. Or just reuse the bytes in the error you're throwing away, which obviates the need for cloning. The reason String::from_utf8 consumes the byte Vec is that String, as a type, by definition owns its data. So if it accepted a byte slice it would have to do an internal clone. Therefore this is more useful: if you need to clone the data, it will be explicit in your code, and thus there is no hidden/unexpected performance bottleneck.
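A minimal sketch of that fix to the earlier playground example, reusing the bytes carried inside the error instead of borrowing the moved-out variable:

```rust
fn main() {
    let vsparkle_heart = vec![240, 159, 119, 150];
    // The Vec moves into from_utf8, but on failure it comes back inside
    // the FromUtf8Error, so neither a clone nor a second read is needed:
    let ssparkle_heart = match String::from_utf8(vsparkle_heart) {
        Ok(s) => s,
        Err(e) => format!("{:x?}", e.into_bytes()),
    };
    assert_eq!(ssparkle_heart, "[f0, 9f, 77, 96]");
}
```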

That is an extremely subjective assertion to make. They behave precisely as I'd expect, and that's because I have a different exposure to these concepts (in this case, monadic error handling) than you have.
Therefore I have to conclude that it is not an error with Rust. The real solution for the problem you're having is understanding why Rust code is the way that it is.

Also not true. The "long way around" as you call it is merely the proper and correct way to do things. That other programming languages let you write faulty code and accept that, does not make that code correct. Instead you will get unpredictable runtime behavior. This is a flaw of earlier programming languages and so by proxy a flaw in your education that is not your fault at all.
By contrast, if it compiles in Rust, it is at the very least safe along the following dimensions: types, resources (RAM and other resources alike, unlike GC collected languages), freedom from data races, and fearless and relatively easy concurrency.

3 Likes

From the hint about the FromUtf8Error I developed a little prototype which demonstrates the first part of the desired behaviour: recognizing valid UTF-8 characters and flagging invalid bytes:

it converts the sequence

    let vsparkle_heart = vec![240, 159, 146, 150, 119, 250, 240, 159, 146, 150];

to:

\u{1f496}w(?fa)\u{1f496}

Still I was unable to find a way to get the Unicode code point \u{2764}.

But I already notice that the code is again getting just as bloated as the first version was.

The sequence vsparkle_heart does not contain the code point U+2764 HEAVY BLACK HEART (❤). That would be the byte sequence 226, 157, 164. You have the sequence 240, 159, 146, 150, which corresponds to an entirely different code point, U+1F496 SPARKLING HEART (💖). You have written no code that translates one to the other, so of course you won't see \u{2764}.
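This can be verified directly:

```rust
fn main() {
    // U+2764 HEAVY BLACK HEART encodes to three UTF-8 bytes:
    assert_eq!("❤".as_bytes(), &[226, 157, 164]);
    // U+1F496 SPARKLING HEART encodes to four:
    assert_eq!("💖".as_bytes(), &[240, 159, 146, 150]);
    // escape_unicode() prints the code point, not the encoded bytes:
    assert_eq!('❤'.escape_unicode().to_string(), "\\u{2764}");
    assert_eq!('💖'.escape_unicode().to_string(), "\\u{1f496}");
}
```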

2 Likes

I'm getting more confused about this.

In your original playground example the vec contains: 240, 159, 119, 150.

What to make of it? I see three possibilities:

  1. If we expect it to be valid UTF-8, then the whole thing is in error and there is nothing useful that can be made of it. The fact that it has one byte in the middle that is a valid ASCII "w" is of no meaning.

  2. If we expect it to be "clean ASCII", we might be right in saying it is a "w" surrounded by some erroneous bytes.

  3. If we don't know whether it is UTF-8 or ASCII, we have a problem. We don't know if it was ASCII with three bytes in error, or UTF-8 with one byte in error.

What is the requirement here?

2 Likes

Thank you very much for this hint.
I was already dizzy from all these experiments.

I just noticed you were replying to me there. As a Rust noobie myself I feel I should comment:

I do feel your pain on finding that Rust and its libs don't do what we expect from all the experience we have with our old favorite, comfortable languages. In my case, C/C++ and JavaScript. However, Rust is not C/C++ or JavaScript or Python or Java or whatever, in the same way that none of those are any of the others either. It's something else: a language which explores and employs concepts that no other language I know does.

This can be unsettling, whether it's the type checking, the aliasing rules, the error handling... but when one takes the trouble to find out why things are as they are in Rust, they start to make a lot of sense.

One of the first things I learned when starting out with Rust was from a presentation where the presenter told his students "never use unwrap" and moved on. This stood out in my mind and I remember it well because we see "unwrap()" all over the documentation and examples. "Why did he say that?" I thought.

Well, of course I also use unwrap() all over the place when knocking up new code. But because that statement made such an impression on me, I looked into it more and understand what sins I'm committing. There is almost no unwrap() in my production code.

I suspect that had you had that important piece of advice about unwrap() pointed out none of this discussion would have occurred.

3 Likes

My initial argument was that the ? operator and the unwrap() function are not language sugar but rather risky constructs.
The same goes for my first version of the application. Obviously the tests that I made were biased, only providing valid UTF-8 text, so everything went fine.
But then using it in the wild on the web page shown above made it crash.
Now the experiment demonstrating the correct processing of the FromUtf8Error shows why developers are easily tempted to go the straight and easy way with the unwrap() function.

About the requirements of the sanitizer application:

  1. It must always give an output if it receives an input. An empty output would be a failure indication too, so just exiting on some parsing difficulty is not an option.
  2. It should then preprocess the input to make further analysis easier, according to a replacement map, indicating where there are unrecognized characters within the output, as in:
$ echo "● identificador de la función inválido pero ejecución éxitosa|ð" | ./text-sanitizer.pl
* identificador de la funci(?195|179)n inv(?195|161)lido pero ejecuci(?195|179)n (?195|169)xitosa|(?195|176)

$ echo "● identificador de la función inválido pero ejecución éxitosa|ð" | ./text-sanitizer.pl es
* identificador de la funcion invalido pero ejecucion exitosa|(?195|176)
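The second requirement could be sketched as follows for valid-UTF-8 input. The `transliterate` name and the map entries are made up for the illustration, not the real replacement table:

```rust
use std::collections::HashMap;

// Hypothetical sketch of requirement 2: map known code points through a
// replacement table, pass ASCII through, and flag everything else as
// (?byte|byte). The map entries here are illustrative.
fn transliterate(s: &str, map: &HashMap<char, &str>) -> String {
    let mut out = String::new();
    for c in s.chars() {
        match map.get(&c) {
            Some(r) => out.push_str(r),
            None if c.is_ascii() => out.push(c),
            None => {
                // Emit the character's UTF-8 bytes in decimal, as the
                // Perl reference output does.
                let mut buf = [0u8; 4];
                let parts: Vec<String> = c
                    .encode_utf8(&mut buf)
                    .as_bytes()
                    .iter()
                    .map(|b| b.to_string())
                    .collect();
                out.push_str(&format!("(?{})", parts.join("|")));
            }
        }
    }
    out
}

fn main() {
    let map = HashMap::from([('●', "*"), ('ó', "o"), ('á', "a")]);
    assert_eq!(transliterate("● inválido", &map), "* invalido");
    assert_eq!(transliterate("ð", &HashMap::new()), "(?195|176)");
}
```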

Thank you for your most valuable information about the UTF8Error.
Now that I actually made the effort to work with it at:

I finally understand what the Crash Report wanted to tell me:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: Some(1) }', src/libcore/result.rs:997:5

That there was 1 invalid byte, as in error_len: Some(1), at position valid_up_to: 0, which could not be parsed.
This is actually useful information to harden the code.
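As a sketch of how valid_up_to() and error_len() can harden the code, a minimal recovery loop might look like this. The function name `decode_lossy_verbose` and the `(?hex)` marker format are illustrative:

```rust
use std::str;

// Hypothetical sketch: decode as much valid UTF-8 as possible, emit the
// invalid bytes reported by valid_up_to()/error_len() as (?hex) markers,
// and continue with the remainder.
fn decode_lossy_verbose(mut rest: &[u8]) -> Vec<String> {
    let mut parts = Vec::new();
    while !rest.is_empty() {
        match str::from_utf8(rest) {
            Ok(s) => {
                parts.push(s.to_string());
                break;
            }
            Err(e) => {
                let valid = e.valid_up_to();
                if valid > 0 {
                    // This unwrap cannot fail: the prefix is known valid.
                    parts.push(str::from_utf8(&rest[..valid]).unwrap().to_string());
                }
                // error_len() is None for a truncated sequence at the end.
                let skip = e.error_len().unwrap_or(rest.len() - valid);
                for b in &rest[valid..valid + skip] {
                    parts.push(format!("(?{:x})", b));
                }
                rest = &rest[valid + skip..];
            }
        }
    }
    parts
}

fn main() {
    let v = vec![240, 159, 146, 150, 119, 250, 240, 159, 146, 150];
    assert_eq!(decode_lossy_verbose(&v), vec!["💖w", "(?fa)", "💖"]);
}
```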

1 Like

Moderator note: In general, if you are having trouble with something, please start by asking questions like "what is the best way to do this?" instead of declaring that other people's code is broken/slow/unusable. Thanks!

8 Likes

Finally I managed to apply the concept of processing the Utf8Error to the Rust application under consideration.
It would look something like this.

Fully recovering and still giving a usable output.
When parsing:

    let vsparkle_heart = vec![
        226, 157, 164, 240, 159, 146, 150, 119, 250, 248, 240, 159, 146, 150, 247, 190,
    ];

it would render an output which can be further processed:

uni res: '["❤💖w", "(?fa)", "(?f8)", "💖", "(?f7)", "(?be)"]'
["2764", "1f496", "77", "fa", "f8", "1f496", "f7", "be"]

Obviously, the full application would not parse the "w" character on its own but would split the input string at that point.

So in the text-sanitizer Application under consideration I replaced this single line:

                let spec = str::from_utf8(&text[icstrt.unwrap()..icend.unwrap()]).unwrap();

with all of the preceding code,
as you can find at:

Now the application is able to parse any given byte sequence into Unicode code points or hexadecimal codes:

$ echo $(perl -e 'my @arrchrs = (226, 157, 164, 240, 159, 146, 150, 119, 250, 248, 240, 159, 146, 150, 247, 190); print pack "U*", @arrchrs; print "\n";'; date +"%s.%N") |target/release/text-sanitizer ; date +"%s.%N"
sequence 0 (cnt: '7', strt: '0', end: '7'): '[e2, 9d, a4, f0, 9f, 92, 96]' - parsing ...
sequence (cnt: '7', strt: '0', end: '7'): '[e2, 9d, a4, f0, 9f, 92, 96]' - parsing ...
utf8 ok: '❤💖'
uni res: '["❤💖"]'
sequence 0 (cnt: '8', strt: '0', end: '8'): '[fa, f8, f0, 9f, 92, 96, f7, be]' - parsing ...
sequence (cnt: '8', strt: '0', end: '8'): '[fa, f8, f0, 9f, 92, 96, f7, be]' - parsing ...
utf8 Err: 'Utf8Error { valid_up_to: 0, error_len: Some(1) }'
vld ps: '0'
vld idx: '0'
ivld chrs cnt: '1'
ivld chrs: 'fa'
sequence (cnt: '7', strt: '1', end: '8'): '[f8, f0, 9f, 92, 96, f7, be]' - parsing ...
utf8 Err: 'Utf8Error { valid_up_to: 0, error_len: Some(1) }'
vld ps: '0'
vld idx: '1'
ivld chrs cnt: '1'
ivld chrs: 'f8'
sequence (cnt: '6', strt: '2', end: '8'): '[f0, 9f, 92, 96, f7, be]' - parsing ...
utf8 Err: 'Utf8Error { valid_up_to: 4, error_len: Some(1) }'
vld ps: '4'
vld idx: '6'
utf8 recovered: '[f0, 9f, 92, 96]'
ivld chrs cnt: '1'
ivld chrs: 'f7'
sequence (cnt: '1', strt: '7', end: '8'): '[be]' - parsing ...
utf8 Err: 'Utf8Error { valid_up_to: 0, error_len: Some(1) }'
vld ps: '0'
vld idx: '7'
ivld chrs cnt: '1'
ivld chrs: 'be'
uni res: '["(?fa)", "(?f8)", "💖", "(?f7)", "(?be)"]'
<3<3w(?fa)(?f8)<3(?f7)(?be) 1589548267.406629508
1589548267.408792824

You can clearly notice in the error-recovery implementation what I stated earlier about the "long way round".
Also, I found that the Rust compiler has improved from version 1.34 one year ago to version 1.43:
the application startup time was 5ms with version 1.34 and has now improved to 2.2ms.
Obviously the Rust community pushes its project strongly forward.

Now when it comes to the web page which was the testing input:
the application suggested by the community parses the 27114 bytes in 6.8ms.

$ echo $(wget -S -O - "http://www.lanzarote.com/de/ausfluge" 2>&1 ; date +"%s.%N") |target/release/text-sanitizer_v2 ; date +"%s.%N"
268K=0,1s 2020-05-15 14:52:26 (268 KB/s) - auf die Standardausgabe geschrieben [/27114] 1589550746.543372521
1589550746.550155990
$ echo "scale=3; 46.550155990-46.543372521"|bc -l
.006783469

On the other hand, the application which implements the UTF-8 parsing with its Utf8Error handling parses the web page in 19.8ms:

$ echo $(wget -S -O - "http://www.lanzarote.com/de/ausfluge" 2>&1 ; date +"%s.%N") |target/release/text-sanitizer ; date +"%s.%N"
412K=0,06s 2020-05-15 14:50:47 (412 KB/s) - auf die Standardausgabe geschrieben [/27114] 1589550647.351933194
1589550647.371714692
$ echo "scale=3; 47.371714692-47.351933194"|bc -l
.019781498

And that is the point which I tried to emphasize: the correct "long way around" comes at a cost.
In comparison, the "text-sanitizer" application implemented in another language parses the web page in 13.2ms:

$ echo $(wget -S -O - "http://www.lanzarote.com/de/ausfluge" 2>&1 ; date +"%s.%N") |./text-sanitizer.run -i es de  ; date +"%s.%N"
259K=0,1s 2020-05-15 15:30:14 (259 KB/s) - auf die Standardausgabe geschrieben [/27114] 1589553014.154008297
1589553014.167189308
$ echo "scale=3; 14.167189308-14.154008297"|bc -l
.013181011

So the performance difference is not only to be found at the assembly-code level, but lies also in the application logic.

1 Like