Fast ASCII and UTF-8 byte slice validation in Rust

A couple of people have been working on fast routines for checking whether a byte slice is valid ASCII or UTF-8 (and, if it is not, where the last valid character ends, etc.).
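For reference, here is a minimal sketch of the information the standard library already reports, which is roughly what any faster routine would have to match. The `Check` enum and `check_utf8` function are names made up for this illustration; only `std::str::from_utf8`, `Utf8Error::valid_up_to` and `Utf8Error::error_len` come from std.

```rust
use std::str;

/// Hypothetical result type for this sketch, mirroring what `Utf8Error` reports.
#[derive(Debug, PartialEq)]
pub enum Check {
    /// The whole slice is valid UTF-8.
    Valid,
    /// `bytes[..valid_up_to]` is valid; an invalid sequence of `bad_len` bytes follows.
    Invalid { valid_up_to: usize, bad_len: usize },
    /// `bytes[..valid_up_to]` is valid; the input ends in the middle of a character.
    Truncated { valid_up_to: usize },
}

/// Validates `bytes` with the scalar std implementation and unpacks the error.
pub fn check_utf8(bytes: &[u8]) -> Check {
    match str::from_utf8(bytes) {
        Ok(_) => Check::Valid,
        Err(e) => match e.error_len() {
            // `Some(n)`: an invalid sequence of `n` bytes was found.
            Some(n) => Check::Invalid { valid_up_to: e.valid_up_to(), bad_len: n },
            // `None`: the input ended unexpectedly; more bytes might complete it.
            None => Check::Truncated { valid_up_to: e.valid_up_to() },
        },
    }
}
```

For plain ASCII checks, `bytes.is_ascii()` is the std baseline; it only answers yes or no and does not report a position.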

Let's discuss this here so we don't clutter other issues.

@killercup @hsivonen

So I started working on vectorized UTF-8 validation, but got a bit hooked by how hard it was to beat the scalar ASCII validation code in core::str::from_utf8. After a while I managed to beat it by ~2x on AVX machines while returning errors as informative as those emitted by str::from_utf8, but it wasn't easy at all! str::from_utf8 already felt blazing fast, at least for validating ASCII strings.
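To give an idea of why a scalar ASCII check can be hard to beat, here is a rough sketch of the word-at-a-time trick: test eight bytes per step by masking the high bit of each byte in a `u64`. This is an illustration only, not the actual code in core and not the vectorized implementation discussed here; `first_non_ascii` is a hypothetical name.

```rust
/// Returns the index of the first non-ASCII byte, or `None` if every byte is ASCII.
pub fn first_non_ascii(bytes: &[u8]) -> Option<usize> {
    const CHUNK: usize = 8;
    // A byte is non-ASCII exactly when its high bit is set.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

    let mut offset = 0;
    for chunk in bytes.chunks_exact(CHUNK) {
        let mut buf = [0u8; CHUNK];
        buf.copy_from_slice(chunk);
        let word = u64::from_le_bytes(buf);
        if word & HIGH_BITS != 0 {
            // Some byte in this chunk is non-ASCII; locate it below.
            break;
        }
        offset += CHUNK;
    }

    // Byte-by-byte scan of the offending chunk, or of the short tail.
    bytes[offset..]
        .iter()
        .position(|&b| b >= 0x80)
        .map(|i| offset + i)
}
```

The fast loop has one load, one AND and one branch per eight bytes, so a SIMD version mostly wins by widening that to 16 or 32 bytes per step.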

The code is here: . Feel free to do whatever you want with it. Setting up a baseline, validation test cases for ASCII and UTF-8, the error messages of from_utf8, etc. was the most time-consuming part, and at least for benchmarking, many cases are still not covered...

It would be cool if we could have a single crate that already has interesting benchmarks and a full validation suite so that we can experiment with newer things easily.

For example, I don't really have a benchmark suite for short strings, but this blog post has some references about that: UTF-8 processing using SIMD (SSE4)

To restate what I said in Jetscii now works with (future) stable Rust 1.27.0 - #11 by killercup :

I put a bunch of useful benchmark inputs in and wrote some macros to make running arbitrary functions (e.g. ones that validate or parse bytes as UTF-8!) against them super easy. Feel free to add your own :)
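As a rough idea of the shape such a macro could take, here is a self-contained sketch. It is not the macro from the repository: `time_validators!`, `std_utf8` and `std_ascii` are hypothetical names, and it uses a single `std::time::Instant` measurement instead of a real benchmark harness, so the timings are only indicative.

```rust
/// Hypothetical helper: runs every listed validator against every named input
/// and collects `(function name, input name, result, elapsed time)` tuples.
macro_rules! time_validators {
    ($inputs:expr, $($f:path),+ $(,)?) => {{
        let mut results = Vec::new();
        for (name, bytes) in $inputs.iter() {
            $(
                let start = std::time::Instant::now();
                let ok = $f(*bytes);
                results.push((stringify!($f), *name, ok, start.elapsed()));
            )+
        }
        results
    }};
}

/// Baseline validators from std; any `fn(&[u8]) -> bool` can be plugged in.
fn std_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn std_ascii(bytes: &[u8]) -> bool {
    bytes.is_ascii()
}
```

Usage would look something like `time_validators!(inputs, std_utf8, std_ascii)`, where `inputs` is an array or slice of `(&str, &[u8])` pairs.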

I switched to using criterion and added @gnzlbg's crate. You can find the rendered benchmark results here.