It seems to me that the unit of work you assign to each thread is very small, which might result in too much thread-management overhead. I'd try splitting the vec into something like 24 mutable slices and have each thread work on one of them.
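A minimal sketch of that idea, using only the standard library so it stands on its own. The helper name `for_each_in_chunks` and the per-element closure are made up for illustration, and 24 is just the number from above. With rayon, `par_chunks_mut` gives you a similar coarse split.

```rust
use std::thread;

/// Hypothetical helper: apply `f` to every element of `data` in place,
/// using at most `parts` threads, one contiguous mutable slice per thread.
fn for_each_in_chunks<T: Send>(data: &mut [T], parts: usize, f: impl Fn(&mut T) + Sync) {
    if data.is_empty() {
        return;
    }
    let parts = parts.max(1);
    // Ceiling division, so we end up with at most `parts` chunks.
    // `data` is non-empty here, so `chunk_len` is at least 1.
    let chunk_len = (data.len() + parts - 1) / parts;
    let f = &f; // share the closure by reference across threads
    thread::scope(|s| {
        for chunk in data.chunks_mut(chunk_len) {
            s.spawn(move || {
                // Each thread runs an ordinary loop over its own slice.
                for x in chunk {
                    f(x);
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}

fn main() {
    let mut v: Vec<u64> = (0..1_000).collect();
    for_each_in_chunks(&mut v, 24, |x| *x = *x * 2 + 1);
    assert_eq!(v[0], 1);
    assert_eq!(v[999], 1999);
}
```

The point is that threads are spawned once per slice, not once per element, so the per-item cost is just the loop body.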
The principle of rayon is that each work unit can be subdivided into smaller units, and if any threads are idle, they'll grab some part (half, I'd imagine) of what a busy thread is currently doing. That's cheaper than queueing every single iterator item for a random thread to pick up, but it's more expensive than a loop that isn't prepared to give up part of what it's doing. The plain loop can compile down to the arithmetic, incrementing a counter, and testing whether the counter has reached the constant 1_000_000_000. The parallel version has to, at a minimum, perform synchronization operations to check whether another thread has grabbed the second half of its work.
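To make the "subdivide into halves" idea concrete, here is a rough standard-library sketch. It is not how rayon is implemented: rayon uses a thread pool and only hands work off when another thread is idle, whereas this toy always splits and spawns an OS thread for the second half. The function `sum_squares`, its `min_len` cutoff, and the sum-of-squares workload are all assumptions for illustration. `rayon::join` is the real building block that plays the role of the spawn/join pair here.

```rust
use std::thread;

/// Toy divide-and-conquer sum of i*i over lo..hi (wrapping on overflow).
/// Ranges no longer than `min_len` are handled by a plain loop.
fn sum_squares(lo: u64, hi: u64, min_len: u64) -> u64 {
    let len = hi.saturating_sub(lo);
    if len <= min_len.max(1) {
        // Plain loop: nothing here needs to coordinate with other threads.
        let mut acc = 0u64;
        for i in lo..hi {
            acc = acc.wrapping_add(i.wrapping_mul(i));
        }
        return acc;
    }
    // Split the range in half; the right half goes to another thread.
    let mid = lo + len / 2;
    thread::scope(|s| {
        let right = s.spawn(move || sum_squares(mid, hi, min_len));
        let left = sum_squares(lo, mid, min_len);
        // Joining is the synchronization cost the plain loop never pays.
        left.wrapping_add(right.join().unwrap())
    })
}

fn main() {
    let n = 1u64 << 22;
    // Reference result from an ordinary sequential loop.
    let mut expected = 0u64;
    for i in 0..n {
        expected = expected.wrapping_add(i.wrapping_mul(i));
    }
    assert_eq!(sum_squares(0, n, 1 << 20), expected);
}
```

A larger `min_len` means fewer splits and less coordination, which is the same trade-off as handing each thread one big slice up front.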