How to “zip” two slices efficiently

For some reason, the f32 version doesn't optimize the way the i32 version does.

Is this a "-ffast-math" type of problem? Perhaps LLVM doesn't want to vectorize because reordering the additions may change the floating-point result slightly.

Edit: It's a -ffast-math type of problem, documented here
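To illustrate why this happens, here is a minimal sketch; it is not the code behind the benchmarks below. The names `dot_naive`, `dot_lanes` and `LANES` are made up for this example. Floating-point addition is not associative, so a single-accumulator sum fixes the order of additions and the compiler may not reorder them without something like -ffast-math. Splitting the sum across several independent accumulators by hand makes the reordering explicit in the source. Whether LLVM then actually emits SIMD code depends on the target and compiler version, and I'm not claiming any particular speedup.

```rust
/// Straightforward dot product: one accumulator, additions happen
/// strictly in slice order.
fn dot_naive(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Dot product with LANES independent accumulators (hypothetical helper).
/// The additions are regrouped by hand, so the result can differ from
/// `dot_naive` in the last bits for general inputs.
fn dot_lanes(a: &[f32], b: &[f32]) -> f32 {
    const LANES: usize = 8; // assumed width, not tuned for any CPU

    // Trim both slices to the common length, as zip would.
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);

    let ca = a.chunks_exact(LANES);
    let cb = b.chunks_exact(LANES);
    // Leftover elements that do not fill a whole chunk.
    let (ra, rb) = (ca.remainder(), cb.remainder());

    let mut acc = [0.0f32; LANES];
    for (xa, xb) in ca.zip(cb) {
        for i in 0..LANES {
            acc[i] += xa[i] * xb[i];
        }
    }

    // Combine the partial sums, then handle the tail sequentially.
    let mut sum: f32 = acc.iter().sum();
    for (x, y) in ra.iter().zip(rb) {
        sum += x * y;
    }
    sum
}
```

For inputs where every intermediate value is exactly representable (small integers stored as f32, for example) the two functions agree exactly; in general they can differ by rounding, which is the change in result that the compiler is not allowed to introduce on its own.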

test zipdot_f32_checked_counted_loop   ... bench:       1,347 ns/iter (+/- 664)
test zipdot_f32_default_zip            ... bench:       1,392 ns/iter (+/- 13)
test zipdot_f32_unchecked_counted_loop ... bench:       1,343 ns/iter (+/- 371)
test zipdot_f32_zipslices              ... bench:       1,342 ns/iter (+/- 466)
test zipdot_f32_ziptrusted             ... bench:       1,342 ns/iter (+/- 387)
test zipdot_i32_checked_counted_loop   ... bench:         380 ns/iter (+/- 113)
test zipdot_i32_default_zip            ... bench:       1,401 ns/iter (+/- 27)
test zipdot_i32_unchecked_counted_loop ... bench:         308 ns/iter (+/- 154)
test zipdot_i32_zipslices              ... bench:         380 ns/iter (+/- 134)
test zipdot_i32_ziptrusted             ... bench:         301 ns/iter (+/- 148)