A couple of people have been working on implementing fast functionality for validating whether byte slices are valid ASCII or UTF-8 (and if not, what was the last valid character, etc.).
Let's discuss this here to not trash other issues.
So I started working on vectorized UTF-8 validation, but got a bit hooked by how hard it was to beat the scalar ASCII validation code in core::str::from_utf8. After a while I managed to beat it by ~2x on AVX machines while returning errors as informative as those emitted by str::from_utf8, but it wasn't easy at all! str::from_utf8 already felt blazing fast, at least for validating ASCII strings.
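To make "errors as informative as str::from_utf8" concrete, here is a minimal sketch of a scalar, word-at-a-time ASCII check that reports where validation stopped, next to the error information the standard library exposes. The name validate_ascii is made up for this example; it is not the code from is_utf8, and it is not what core does internally.

```rust
use std::str;

/// Hypothetical scalar baseline: Ok(()) if every byte is ASCII,
/// otherwise Err(index of the first non-ASCII byte).
fn validate_ascii(bytes: &[u8]) -> Result<(), usize> {
    // A byte is non-ASCII exactly when its high bit is set.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

    let mut offset = 0;
    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        // Test eight bytes at once by loading them into one u64.
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        if u64::from_le_bytes(buf) & HIGH_BITS != 0 {
            // Slow path: find the exact offending byte inside this chunk.
            let pos = chunk.iter().position(|b| !b.is_ascii()).unwrap();
            return Err(offset + pos);
        }
        offset += 8;
    }
    // Fewer than eight bytes are left; check them one by one.
    match chunks.remainder().iter().position(|b| !b.is_ascii()) {
        Some(pos) => Err(offset + pos),
        None => Ok(()),
    }
}

fn main() {
    // An invalid byte in the middle of the input.
    let bad = b"abc\xFFdef";
    let err = str::from_utf8(bad).unwrap_err();
    assert_eq!(err.valid_up_to(), 3); // bytes 0..3 are valid UTF-8
    assert_eq!(err.error_len(), Some(1)); // one invalid byte to skip
    assert_eq!(validate_ascii(bad), Err(3));

    // A multi-byte sequence cut off at the end of the input.
    let truncated = b"abc\xE2\x82";
    let err = str::from_utf8(truncated).unwrap_err();
    assert_eq!(err.valid_up_to(), 3);
    assert_eq!(err.error_len(), None); // None means "unexpected end of input"
}
```

A vectorized validator that wants to be a drop-in replacement has to reproduce both valid_up_to and error_len, which is a good part of what makes the fast path hard to beat.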
The code is here: https://github.com/gnzlbg/is_utf8 . Feel free to do whatever you want with it. Setting up a baseline, validation test cases for ASCII and UTF-8, the error messages of from_utf8, etc. was the most time-consuming part, and at least for benchmarking, many cases are still not covered...
It would be cool if we could have a single crate that already has interesting benchmarks and a full validation suite so that we can experiment with newer things easily.
For example, I don't really have a benchmark suite for short strings, but this blog post has some references about that: UTF-8 processing using SIMD (SSE4)
To restate what I said in Jetscii now works with (future) stable Rust 1.27.0 - #11 by killercup :
I put a bunch of useful benchmark inputs in https://github.com/killercup/simd-utf8-check and wrote some macros to make running arbitrary functions (e.g. ones that validate or parse bytes as UTF-8!) against them super easy. Feel free to add your own.
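For anyone who hasn't looked at the repo, the general idea can be sketched roughly like this. This is not the actual macro from simd-utf8-check; check_validators, std_validator and ascii_only are names made up for illustration. It only checks that each candidate agrees with std::str::from_utf8 on every input and does no timing.

```rust
/// Hypothetical macro: runs every validator on every input and collects
/// (validator name, input index) pairs where the result differs from std.
macro_rules! check_validators {
    ($inputs:expr, $($validator:expr),+ $(,)?) => {{
        let mut mismatches: Vec<(&'static str, usize)> = Vec::new();
        for (index, input) in $inputs.iter().enumerate() {
            let input: &[u8] = input;
            // std::str::from_utf8 serves as the reference answer.
            let expected = std::str::from_utf8(input).is_ok();
            $(
                if ($validator)(input) != expected {
                    mismatches.push((stringify!($validator), index));
                }
            )+
        }
        mismatches
    }};
}

// Two stand-in validators; a real SIMD implementation would slot in here.
fn std_validator(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

// Deliberately wrong for non-ASCII UTF-8, to show a reported mismatch.
fn ascii_only(bytes: &[u8]) -> bool {
    bytes.is_ascii()
}

fn main() {
    let inputs: [&[u8]; 3] = [b"hello", "héllo".as_bytes(), b"\xFF\xFE"];
    let mismatches = check_validators!(inputs, std_validator, ascii_only);
    for (name, index) in &mismatches {
        println!("{} disagrees with std on input #{}", name, index);
    }
}
```

The same shape extends to benchmarking by swapping the comparison for a timed call per validator and input.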
I switched to using criterion and added @gnzlbg's crate. You can find the rendered benchmark results here.