Fast ASCII and UTF-8 byte slice validation in Rust


#1

A couple of people have been working on fast validation of whether byte slices are valid ASCII or UTF-8 (and, if not, where the last valid character ends, etc.).

Let’s discuss this here so we don’t clutter other issues.

@killercup @hsivonen


#2

So I started working on vectorized UTF-8 validation, but got a bit hooked by how hard it was to beat the scalar ASCII validation code in core::str::from_utf8. After a while I managed to beat it by ~2x on AVX machines while returning errors as informative as those emitted by str::from_utf8, but it wasn’t easy at all! It felt like str::from_utf8 was already blazing fast, at least for validating ASCII strings.
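For anyone who wants a concrete picture of the baseline being discussed, here is a minimal sketch using only the standard library. `describe` shows the error information that `str::from_utf8` reports (`valid_up_to` and `error_len`), and `ascii_valid_up_to` is a hypothetical scalar helper that tests 8 bytes at a time for a set high bit. Both function names are made up for illustration, and neither is taken from the is_utf8 crate or from the actual implementation in core.

```rust
use std::str;

/// Hypothetical helper: length of the longest ASCII prefix of `bytes`.
/// Checks 8 bytes per step by testing the high bit of every byte at once.
fn ascii_valid_up_to(bytes: &[u8]) -> usize {
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;
    let mut offset = 0;
    for chunk in bytes.chunks_exact(8) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        // A non-zero result means at least one byte in this chunk is >= 0x80.
        if u64::from_ne_bytes(buf) & HIGH_BITS != 0 {
            break;
        }
        offset += 8;
    }
    // Finish byte by byte: this covers the tail and pinpoints the exact
    // position inside a chunk that contained a non-ASCII byte.
    offset + bytes[offset..].iter().take_while(|b| b.is_ascii()).count()
}

/// Hypothetical helper: summarize what `str::from_utf8` reports for `bytes`.
fn describe(bytes: &[u8]) -> String {
    match str::from_utf8(bytes) {
        Ok(_) => "valid".to_string(),
        Err(e) => format!(
            "valid_up_to={}, error_len={:?}",
            e.valid_up_to(),
            e.error_len()
        ),
    }
}
```

An `error_len` of `None` means the input ended in the middle of a multi-byte sequence, while `Some(n)` means an invalid sequence of `n` bytes was found. A vectorized validator that wants to be a drop-in replacement would need to reproduce both values.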

The code is here: https://github.com/gnzlbg/is_utf8 . Feel free to do whatever you want with it. Setting up a baseline, validation test cases for ASCII and UTF-8, the error messages of from_utf8, etc. was the most time-consuming part, and, at least for benchmarking, many cases are still not covered…

It would be cool if we could have a single crate that already has interesting benchmarks and a full validation suite so that we can experiment with newer things easily.

For example, I don’t really have a benchmark suite for short strings, but this blog post has some references on that: https://woboq.com/blog/utf-8-processing-using-simd.html


#3

To restate what I said in “Jetscii now works with (future) stable Rust 1.27.0”:

I put a bunch of useful benchmark inputs in https://github.com/killercup/simd-utf8-check and wrote some macros to make running arbitrary functions (e.g. ones that validate or parse bytes as UTF-8!) against them super easy. Feel free to add your own 🙂
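To give a rough idea of the shape such a macro could take, here is a minimal sketch. It is not the macro from the repository: `compare_validators!`, `sample_inputs`, `std_validate`, and `ascii_only` are all hypothetical names, and instead of timing anything it only checks each candidate against `std::str::from_utf8` on every input and collects the disagreements.

```rust
/// Hypothetical named inputs; a real suite would load larger files.
fn sample_inputs() -> Vec<(&'static str, &'static [u8])> {
    vec![
        ("ascii", &b"hello world"[..]),
        ("two_byte", &b"h\xC3\xA9llo"[..]), // "héllo" encoded as UTF-8
        ("invalid", &b"abc\xFF"[..]),
    ]
}

/// Candidate that simply wraps the standard library.
fn std_validate(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Deliberately too strict candidate: accepts ASCII only.
fn ascii_only(bytes: &[u8]) -> bool {
    bytes.is_ascii()
}

/// Runs every listed validator on every input and returns a description of
/// each case where the result differs from `std::str::from_utf8`.
macro_rules! compare_validators {
    ($inputs:expr, $($validator:expr),+ $(,)?) => {{
        let mut mismatches: Vec<String> = Vec::new();
        for &(name, bytes) in $inputs.iter() {
            let expected = std::str::from_utf8(bytes).is_ok();
            $(
                if ($validator)(bytes) != expected {
                    mismatches.push(format!("{} on {}", stringify!($validator), name));
                }
            )+
        }
        mismatches
    }};
}
```

Calling `compare_validators!(sample_inputs(), std_validate, ascii_only)` returns a `Vec<String>` with one entry per disagreement. The same pattern extends to benchmarking by replacing the comparison with a timed loop.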


#4

I switched to using criterion and added @gnzlbg’s crate. You can find the rendered benchmark results here.