Rand uses Fisher-Yates shuffle, and that's probably the most efficient way of doing it.
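For reference, here is a minimal sketch of a Fisher-Yates shuffle. It is not the rand crate's actual code: the `XorShift64` type and the `fisher_yates` function are made up for illustration, and the tiny generator is only there so the example has no dependencies.

```rust
/// Hypothetical stand-in RNG (xorshift64) so the sketch needs no crates.
/// This is NOT the generator the rand crate uses. The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in 0..bound. Uses a plain modulo, so it has a slight bias;
    /// acceptable for a sketch, not for anything statistical.
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Fisher-Yates: walk from the back and swap each slot with a randomly
/// chosen slot at or before it. One swap per element, O(n) overall.
fn fisher_yates<T>(items: &mut [T], rng: &mut XorShift64) {
    for i in (1..items.len()).rev() {
        let j = rng.below(i + 1); // j is anywhere in 0..=i, hence the cache misses
        items.swap(i, j);
    }
}
```

Something like `fisher_yates(&mut v, &mut XorShift64(0x9E3779B97F4A7C15))` would shuffle `v` in place. The random `j` on every step is exactly the access pattern that defeats the cache on large slices.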
Note that such proper shuffling needs to pick elements from random places in the slice, and that is by definition unpredictable, so it'll cause tons of cache misses.
If you don't need true randomness, then randomizing in chunks closer to cache sizes might give slightly faster results.
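A rough sketch of what chunked randomizing could look like, assuming it is acceptable that an element never leaves its own chunk. The function name `shuffle_in_chunks` and the `below` callback (which should return a random index in 0..n) are hypothetical; pick `chunk_len` so that one chunk fits comfortably in whatever cache level you are targeting.

```rust
/// Shuffle each fixed-size chunk independently with Fisher-Yates.
/// `below(n)` is a caller-supplied source of random indices in 0..n.
/// Elements stay inside their chunk, so this is NOT a uniform shuffle
/// of the whole slice.
fn shuffle_in_chunks<T>(
    items: &mut [T],
    chunk_len: usize,
    mut below: impl FnMut(usize) -> usize,
) {
    assert!(chunk_len > 0, "chunk_len must be non-zero");
    for chunk in items.chunks_mut(chunk_len) {
        // All swaps touch only this chunk, so accesses stay local.
        for i in (1..chunk.len()).rev() {
            let j = below(i + 1);
            chunk.swap(i, j);
        }
    }
}
```

Whether this is actually faster depends on the element size and the machine, so it is worth measuring before committing to it.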
Rand also has a few pseudorandom algorithms to choose from, but here you're probably more bottlenecked on memory than CPU.
Making things cache-line-sized won't make it any faster.
If your input is a sequence of integers that you can describe as a function f(x), then you could randomize the function without moving anything in memory, e.g. f(x ^ y).
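To make the f(x ^ y) idea concrete, here is a small sketch. It assumes the number of elements is a power of two and the key y is smaller than that, because only then is x -> x ^ y guaranteed to map 0..n onto itself. The name `xor_scrambled` is made up for this example.

```rust
/// Yield f(0), f(1), ..., f(n-1) in a scrambled order by XOR-ing each
/// index with a fixed key. Nothing is moved or stored in memory.
/// Assumption: n is a power of two and key < n, so i ^ key stays in 0..n
/// and every index is hit exactly once.
fn xor_scrambled<F, T>(n: u64, key: u64, f: F) -> impl Iterator<Item = T>
where
    F: Fn(u64) -> T,
{
    assert!(n.is_power_of_two() && key < n);
    (0..n).map(move |i| f(i ^ key))
}
```

For example, `xor_scrambled(16, 0b1011, |x| x * 3)` yields the same sixteen values as `(0..16).map(|x| x * 3)`, just in a different order. A single XOR is a very weak scramble (neighbouring indices stay close together), so treat it as a cheap reordering rather than anything random-looking.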
I am not sure if we are measuring "slowdown" in the same way. I am measuring "slowdown" in terms of:
time_for_shuffle / time_for_memcopy
So I agree with you that if I am shuffling a bunch of u32, then I am wasting a lot of memory bandwidth, since each access pulls in a whole cache line but only a single u32 from it is used.
However, by the above measurement, it seems that as sizeof(Elem) goes to infinity, the slowdown factor should approach 1.
I think you can use coprime generators of the group of integers modulo n. Start from a random seed and apply the generator n times, keeping the state between steps and pushing each value into a vector until it is full.
This website will give you primes greater than your table size. Enter the table size in the Number field and hit "search". The first four primes after 3e8 are 300000007, 300000031, 300000047, 300000089.
Any prime larger than your table size can serve as the group generator. Simply iterate and discard all values larger than the table size. Of course the values aren't truly random; they're completely predictable once the generating prime and starting value are known. But they may be adequate for your purpose.
Also, depending on how you're consuming this vector, if you return an iterator over that generator instead of the vector, you might never need to allocate it at all.
If it isn't apparent, you probably should use rand() to choose a starting value (modulo the table size) for each shuffle. If you don't want two runs to keep producing the same sequence once they happen to hit the same value, select a different generator prime for each run, working through the list in order.
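Here is a sketch of one way to read the generator suggestion, as an iterator so that no vector is allocated. It uses the additive version: keep adding a step that is coprime to the table size n and reduce modulo n. A prime larger than n is always coprime to n, which is why the primes listed above work. Because the reduction is modulo n itself, nothing has to be discarded in this variant. `StrideOrder` and `gcd` are names invented for this example.

```rust
/// Greatest common divisor, used only to check the coprimality assumption.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

/// Yields every index in 0..n exactly once, in a scrambled but fully
/// predictable order: start, start + step, start + 2*step, ... (mod n).
struct StrideOrder {
    n: u64,
    step: u64,
    current: u64,
    remaining: u64,
}

impl StrideOrder {
    /// `prime` is assumed to be coprime to `n` (any prime > n qualifies).
    /// `start` would normally come from a random number generator.
    fn new(n: u64, prime: u64, start: u64) -> Self {
        assert!(n > 0, "table must not be empty");
        assert_eq!(gcd(prime, n), 1, "step must be coprime to table size");
        StrideOrder { n, step: prime % n, current: start % n, remaining: n }
    }
}

impl Iterator for StrideOrder {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        if self.remaining == 0 {
            return None;
        }
        let out = self.current;
        // current and step are both < n, so the sum cannot overflow for
        // realistic table sizes.
        self.current = (self.current + self.step) % self.n;
        self.remaining -= 1;
        Some(out)
    }
}
```

With `StrideOrder::new(10, 13, 4)` the first indices are 4, 7, 0, 3, and all ten indices come out before the iterator ends. The constant stride makes the order very regular, so this only helps if "scrambled but predictable" is good enough.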