SSE2 performance

I have been putting together some code that puts maximum load on a CPU using SSE2 instructions.
I was expecting a throughput close to 4x the CPU clock frequency in FLOPS, but I am only seeing around 154 MFLOPS. The code is in uglyoldbob/benchmark on GitHub (a benchmarking and load testing utility for code, including a criterion benchmark). It is based on Mysticial/Flops on GitHub ("How many FLOPS can you achieve?").
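To make the size of the gap concrete, here is a minimal Rust sketch of the arithmetic behind that expectation. It assumes the 4x figure comes from four single-precision lanes in a 128-bit SSE2 register and one such instruction completing per clock cycle. The function names are made up for illustration and are not part of the benchmark repository.

```rust
/// Peak FLOPS under the stated assumption: one 4-lane single-precision
/// SSE2 instruction completes per clock cycle.
fn expected_peak_flops(clock_hz: f64) -> f64 {
    // Assumption: 128-bit register / 32-bit f32 = 4 lanes.
    const LANES_PER_REGISTER: f64 = 4.0;
    clock_hz * LANES_PER_REGISTER
}

/// Fraction of that assumed peak reached by a measured rate.
fn fraction_of_peak(measured_flops: f64, clock_hz: f64) -> f64 {
    measured_flops / expected_peak_flops(clock_hz)
}
```

For example, with a hypothetical 3.0 GHz clock (a placeholder, not my actual CPU speed), `expected_peak_flops(3.0e9)` gives 12 GFLOPS, and `fraction_of_peak(154.0e6, 3.0e9)` comes out to roughly 0.013, so the measured rate would be on the order of 1% of what I expected.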