Deinterleaving memory, fast (or memcpy with stride)

I have a contiguous buffer of bytes like the following:

[x1 y1 x2 y2 x3 y3 ...]

and I need to split them up into separate buffers

[x1 x2 x3 ...] and [y1 y2 y3 ...]

where the two output buffers are not next to each other.

Is there any way to make this fast, perhaps using SIMD (although there is no actual computation here, just shuffling) or a DMA engine (something like a strided memcpy)?

EDIT (for the old question): I should say that I actually want to do the opposite - I want to de-interleave.

EDIT2: I am re-writing the question to make it clearer

Just out of curiosity, how well do the extensions from the itertools crate optimise? You may find that it's fast enough for your needs.
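As a baseline to measure against, here is a minimal sketch that uses only the standard library (`chunks_exact` and `zip`) rather than itertools. The function name `deinterleave` and the choice to write into caller-provided output slices are assumptions made for illustration. Whether the compiler auto-vectorises this loop depends on the target and optimisation level, so it is worth checking the generated assembly or benchmarking before relying on it.

```rust
/// De-interleave `src` = [x1, y1, x2, y2, ...] into `xs` and `ys`.
/// Hypothetical helper: panics if the lengths do not match up.
fn deinterleave(src: &[u8], xs: &mut [u8], ys: &mut [u8]) {
    assert_eq!(xs.len(), ys.len());
    assert_eq!(src.len(), 2 * xs.len());

    // Walk the input two bytes at a time and scatter each pair
    // into the two (non-adjacent) output buffers.
    for ((pair, x), y) in src.chunks_exact(2).zip(xs.iter_mut()).zip(ys.iter_mut()) {
        *x = pair[0];
        *y = pair[1];
    }
}
```

Checking the lengths once up front and iterating with `zip` avoids per-element indexing into the outputs, which gives the optimiser a reasonable chance of vectorising the loop.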

Looking at this SO answer, it seems that in general there is no hardware feature you can use to memcpy strided data into an unstrided buffer. Having said that, NEON (the SIMD instruction set extension for ARM) seems to provide for this on ARM: its structure-load instructions (such as VLD2) de-interleave the data as they load it.
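For the ARM case, a rough sketch of what this could look like from Rust, assuming a 64-bit ARM (aarch64) target and the `vld2q_u8` / `vst1q_u8` intrinsics from `std::arch::aarch64`. The function name `deinterleave_neon` is hypothetical, and on any other architecture the code simply falls through to a plain scalar loop. I have not benchmarked this, so treat it as a starting point rather than a tuned implementation.

```rust
/// De-interleave `src` = [x1, y1, x2, y2, ...] into `xs` and `ys`.
/// Hypothetical helper: uses NEON on aarch64, scalar code elsewhere.
fn deinterleave_neon(src: &[u8], xs: &mut [u8], ys: &mut [u8]) {
    assert_eq!(xs.len(), ys.len());
    assert_eq!(src.len(), 2 * xs.len());
    let n = xs.len();
    let mut i = 0;

    #[cfg(target_arch = "aarch64")]
    {
        use std::arch::aarch64::{vld2q_u8, vst1q_u8};
        // Each iteration loads 32 interleaved bytes and stores 16 x's and 16 y's.
        while i + 16 <= n {
            // SAFETY: i + 16 <= n, so src[2*i .. 2*i + 32], xs[i .. i + 16]
            // and ys[i .. i + 16] are all in bounds (lengths asserted above).
            unsafe {
                let v = vld2q_u8(src.as_ptr().add(2 * i)); // de-interleaving load
                vst1q_u8(xs.as_mut_ptr().add(i), v.0);
                vst1q_u8(ys.as_mut_ptr().add(i), v.1);
            }
            i += 16;
        }
    }

    // Scalar tail (and the whole job on non-aarch64 targets).
    while i < n {
        xs[i] = src[2 * i];
        ys[i] = src[2 * i + 1];
        i += 1;
    }
}
```

The de-interleaving happens in the load itself: `vld2q_u8` returns two vectors, one holding the even-indexed bytes and one holding the odd-indexed bytes, so no separate shuffle step is needed.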