It turns out that rayon calls empty_datastructure (and, consequently, combine_datastructures) many more times than there are threads in the thread pool. As a result the process slows down enormously, because it spends most of its time creating and combining instances of the expensive data structure.
Is there some way of persuading rayon to create only one accumulator per thread?
Or is rayon simply not the right tool for this kind of problem?
Rayon tries to be "adaptive" in its job splitting, in case the workload would not be balanced by a perfect per-thread split, but that does make it more eager about splitting up the work. I haven't figured out a good way to let it keep using the same accumulator when the second part of a split doesn't get stolen by a new thread, but that would be ideal...
I have rayon#857 trying to dial back the splitting a little, but that could use some real-world benchmarking to build confidence in it. You can also try with_min_len to put your own lower bound on how small it's allowed to split.
I'm sure that lots more performance could be squeezed out, but simply using with_min_len(data.len() / num_threads) (where num_threads is currently picked by the user) already improves performance enough that the bottleneck of the whole program has moved elsewhere.