So I tested the std_read 1 byte aligned and 32 byte aligned versions on my desktop and laptop (commit), and I cannot reproduce the performance difference on my laptop.
Desktop (AMD Ryzen 9 5900X)
std_read 1 byte aligned time: [133.72 ms 134.03 ms 134.37 ms]
std_read 32 byte aligned time: [42.969 ms 43.137 ms 43.306 ms]
Laptop (AMD Ryzen 7 PRO 3700U)
std_read 1 byte aligned time: [101.60 ms 101.86 ms 102.16 ms]
std_read 32 byte aligned time: [101.98 ms 102.41 ms 102.88 ms]
The two systems are very similar OS-wise, and both use amd-ucode.
Interestingly, the 1 byte aligned version on the desktop is slower than on the laptop!
I've also tested it on a Raspberry Pi and an Intel VPS; neither shows a major difference:
Raspberry Pi 3 Model B Rev 1.2
std_read 1 byte aligned time: [1.6450 s 1.6722 s 1.7206 s]
std_read 32 byte aligned time: [1.6606 s 1.6759 s 1.6939 s]
VPS (Intel Xeon)
std_read 1 byte aligned time: [278.47 ms 291.78 ms 307.45 ms]
std_read 32 byte aligned time: [276.78 ms 286.73 ms 297.45 ms]
Summary
It seems that only the system with the AMD Ryzen 9 5900X is affected. Maybe others with a Ryzen 5000 series CPU could run the benchmark on their systems and see whether they get similar results?
The code is here: GitHub - ambiso/read_slow
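
For anyone who wants to see the idea without opening the repo, here is a rough sketch of how the two buffer placements could be set up. This is not the code from read_slow; it assumes that "1 byte aligned" means the buffer starts one byte past a 32-byte boundary, and the names `Aligned`, `buffer_at_offset` and `read_all` are made up for illustration. It reads from an in-memory slice rather than a file, so it only shows the alignment setup, not the actual benchmark.

```rust
use std::io::Read;
use std::time::Instant;

/// Backing storage whose start is guaranteed to be 32-byte aligned.
/// The extra 32 bytes leave room to shift the buffer start by an offset.
#[repr(C, align(32))]
struct Aligned([u8; 4096 + 32]);

/// Hypothetical helper: a `len`-byte slice starting `offset` bytes
/// past the 32-byte boundary (offset 0 = aligned, offset 1 = misaligned).
fn buffer_at_offset(storage: &mut Aligned, offset: usize, len: usize) -> &mut [u8] {
    &mut storage.0[offset..offset + len]
}

/// Hypothetical helper: read `src` to the end through `buf`,
/// returning the total number of bytes read.
fn read_all<R: Read>(mut src: R, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut total = 0;
    loop {
        let n = src.read(buf)?;
        if n == 0 {
            return Ok(total);
        }
        total += n;
    }
}

fn main() -> std::io::Result<()> {
    // In-memory stand-in for the file the real benchmark reads.
    let data = vec![0u8; 1 << 20];
    let mut storage = Aligned([0u8; 4096 + 32]);

    for offset in [0usize, 1] {
        let buf = buffer_at_offset(&mut storage, offset, 4096);
        let addr_mod = buf.as_ptr() as usize % 32;
        let start = Instant::now();
        let total = read_all(&data[..], buf)?;
        println!(
            "offset {offset}: address % 32 = {addr_mod}, read {total} bytes in {:?}",
            start.elapsed()
        );
    }
    Ok(())
}
```

A single timed run like this is far too noisy to compare against the numbers above; it is only meant to show how the destination buffer's address differs between the two cases.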