How big are the patches? If the work to be done is simple enough, or there isn't enough of it, even rayon's small per-task overhead can make things slower overall.
Edit: You might also be interested in *Eliminate False Sharing* (Dr. Dobb's), but I don't think you're hitting that, if my reading of your indexing is correct.
But if your solution is taking 4ms overall, then I think it makes sense that introducing threading slows it down. Any threading, even highly tuned stuff like rayon, has overhead. If you aren't doing enough computation, that overhead won't be worth it.
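To make that concrete, here's a rough std-only sketch (raw `std::thread` rather than rayon, and much heavier per-task overhead than a rayon pool, but the same principle) showing that "parallelising" a tiny amount of work loses to just doing it sequentially:

```rust
use std::thread;
use std::time::Instant;

// Sequential version: just sum the slice.
fn sum_seq(data: &[u64]) -> u64 {
    data.iter().sum()
}

// "Parallel" version: spawn a thread per half. For inputs this
// small, the spawn/join cost dwarfs the actual summing.
fn sum_threaded(data: &[u64]) -> u64 {
    let (a, b) = data.split_at(data.len() / 2);
    let (a, b) = (a.to_vec(), b.to_vec());
    let h1 = thread::spawn(move || a.iter().sum::<u64>());
    let h2 = thread::spawn(move || b.iter().sum::<u64>());
    h1.join().unwrap() + h2.join().unwrap()
}

fn main() {
    let data: Vec<u64> = (0..1_000).collect();

    let t = Instant::now();
    let seq = sum_seq(&data);
    let seq_time = t.elapsed();

    let t = Instant::now();
    let par = sum_threaded(&data);
    let par_time = t.elapsed();

    assert_eq!(seq, par);
    // On most machines the threaded version loses badly here.
    println!("seq: {seq_time:?}, threaded: {par_time:?}");
}
```

Rayon's work-stealing pool is far cheaper than spawning OS threads like this, but the overhead is still nonzero, and at the microsecond scale it still shows up.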
4 milliseconds is an extremely short time, especially considering that you're doing IO within your benchmark.
Splitting the work up into y_end - y different tasks is only worthwhile if each of those tasks takes at least a few milliseconds to complete, as I understand it.
If you want to pursue it more though, could you possibly include your input in the gist? I haven't done Advent of Code and thus don't have my own input to test with.
I think the benchmark could probably be made more accurate if we do the file reading outside of the criterion `bench_function` call.
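Something along these lines, sketched with std timing rather than criterion so it stands alone (`solve` is a hypothetical stand-in for the actual puzzle logic; with criterion the same idea applies, i.e. read the file before `bench_function` and only call `solve` inside the closure):

```rust
use std::fs;
use std::time::Instant;

// Hypothetical stand-in for the real puzzle logic.
fn solve(input: &str) -> usize {
    input.lines().count()
}

fn main() {
    // Write a small input file so this example is self-contained.
    let path = std::env::temp_dir().join("aoc_input.txt");
    fs::write(&path, "1\n2\n3\n").unwrap();

    // The IO happens once, outside the timed region, so the
    // measurement reflects only the computation.
    let input = fs::read_to_string(&path).unwrap();

    let t = Instant::now();
    let answer = solve(&input);
    println!("answer = {answer}, took {:?}", t.elapsed());
}
```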
I also suspect the `(x << 1) + 1` work inside the inner loop can be auto-vectorised, which would make the sequential version even faster and further exacerbate the relative threading overhead. Given the contiguous layout, the compiler might even be able to optimise across the outer loop too in the original case.
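To illustrate what I mean (with a hypothetical `count_odd_columns` standing in for your actual inner loop, since I'm guessing at what it does): a plain counted loop over a contiguous slice with a simple strided index like this is exactly the shape LLVM is good at turning into SIMD, so each row's work may be even cheaper than it looks.

```rust
// Indexes every other element via (x << 1) + 1. Written as a
// straightforward loop over a contiguous slice, this is a good
// candidate for auto-vectorisation.
fn count_odd_columns(row: &[u8], width: usize) -> usize {
    let mut count = 0;
    for x in 0..width {
        if row[(x << 1) + 1] == b'#' {
            count += 1;
        }
    }
    count
}

fn main() {
    // Odd indices are 1 and 3: '#' and '#'.
    println!("{}", count_odd_columns(b".#.#", 2));
}
```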
Regardless, unless `x .. x_end` is reasonably large, so that each per-row task does a meaningful amount of work, setup overhead is going to dominate.