Get f64 values from __m256d


I am trying to find a way to get all 4 f64s out of one __m256d.

How can I do this? Even better, is there a way to accumulate all 4 values into 1 f64?

I am trying to get the sum of products of 2 vectors, i.e.:

for i in 0..1000 {
  result += vec1[i] * vec2[i];
}

I want to speed this up using something like this:

        let a = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
        let b = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
        let mut acc = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
        acc = _mm256_fmadd_pd(a, b, acc);

I need the values in acc eventually. How do I get them out?
Also, since I can't yet read those values, I can't tell whether acc is having its previous values overwritten, or whether it is doing the result += vec1[i] * vec2[i] part that I'm assuming it's doing.

I will likely have more questions about how to quickly dump f64s from a Vec into these without slowing it down too much.

You may want core::arch::x86::_mm256_store_pd.
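On the "accumulate all 4 values into 1 f64" part: besides storing the lanes to memory, they can also be added together in registers. Here is a minimal sketch of one common way to do it, assuming an x86_64 target. The names sum4 and sum4_avx are made up for this example, and the scalar fallback is there only so the function still works on CPUs without AVX.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Adds the four lanes of a 256-bit vector together in registers.
/// Hypothetical helper for illustration; requires AVX at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn sum4_avx(values: &[f64; 4]) -> f64 {
    unsafe {
        // Unaligned load: the array only has the normal 8-byte f64 alignment.
        let v = _mm256_loadu_pd(values.as_ptr());
        let lo = _mm256_castpd256_pd128(v); // lanes 0 and 1
        let hi = _mm256_extractf128_pd::<1>(v); // lanes 2 and 3
        let pair = _mm_add_pd(lo, hi); // [v0 + v2, v1 + v3]
        let upper = _mm_unpackhi_pd(pair, pair); // [v1 + v3, v1 + v3]
        // Add the two partial sums and read out the low lane.
        _mm_cvtsd_f64(_mm_add_sd(pair, upper))
    }
}

/// Safe wrapper: uses AVX when the CPU has it, plain addition otherwise.
fn sum4(values: [f64; 4]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just checked at runtime.
            return unsafe { sum4_avx(&values) };
        }
    }
    values.iter().sum()
}
```

In a real loop you would do this reduction once at the very end, on the accumulator, rather than on every iteration.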

Excellent, thanks!

I tried passing a *mut f64 to it, but the program crashed (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION).

This makes me think that the *mut f64 needs to point to a 256-bit memory location?

I'm not sure how to do this. I've used C and malloc before, but I keep finding all kinds of "techniques" that don't seem to apply to allocating space for 4 f64s.

Is there a std function for this or something?


I should have linked to the unaligned version, _mm256_storeu_pd. This will let you write to any [f64] slice with room for at least 4 elements, and does not require the pointer to have special alignment.
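To make that concrete, here is a small sketch that runs the snippet from the question and copies the result into a plain array with _mm256_storeu_pd. The function names fmadd_lanes and fmadd_lanes_checked are made up for this example. It also answers the other question: _mm256_fmadd_pd computes a * b + acc in each lane, so the old accumulator values are added to, not thrown away.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Runs the question's snippet and copies the four lanes into an array.
/// Hypothetical helper for illustration; requires AVX and FMA at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,fma")]
unsafe fn fmadd_lanes() -> [f64; 4] {
    unsafe {
        // Note: _mm256_set_pd takes the highest lane first, so in memory
        // order these vectors are [4.0, 3.0, 2.0, 1.0].
        let a = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
        let b = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
        let mut acc = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
        // Per lane: acc = a * b + acc.
        acc = _mm256_fmadd_pd(a, b, acc);

        // The unaligned store works with an ordinary array or slice.
        let mut out = [0.0f64; 4];
        _mm256_storeu_pd(out.as_mut_ptr(), acc);
        out
    }
}

/// Safe wrapper: returns None if the CPU lacks AVX or FMA.
fn fmadd_lanes_checked() -> Option<[f64; 4]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            // SAFETY: both required features were just checked at runtime.
            return Some(unsafe { fmadd_lanes() });
        }
    }
    None
}
```

On a CPU with AVX and FMA this should give [20.0, 12.0, 6.0, 2.0], that is 4*4+4, 3*3+3, 2*2+2 and 1*1+1. For a real dot product the accumulator should probably start at zero (_mm256_setzero_pd) instead of 1.0, 2.0, 3.0, 4.0.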

This tutorial has some example usage:

Ah, special alignment. I did see something about it needing 32-byte alignment, but beyond the address not starting at some arbitrary offset, I don't know what that means.
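For what it's worth, 32-byte alignment just means the address is a multiple of 32. An ordinary [f64; 4] is only guaranteed to sit at a multiple of 8, which is presumably why the aligned _mm256_store_pd crashed. If you do want the aligned store, one option is a wrapper type marked #[repr(align(32))]. A rough sketch, where Aligned4 and aligned_roundtrip are names invented for this example:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Four f64s forced onto a 32-byte boundary (hypothetical wrapper type).
#[repr(C, align(32))]
struct Aligned4([f64; 4]);

/// Loads four values and writes them back out with the aligned store.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn aligned_roundtrip_avx(input: &[f64; 4]) -> Aligned4 {
    unsafe {
        // The input has no special alignment, so use the unaligned load.
        let v = _mm256_loadu_pd(input.as_ptr());
        let mut out = Aligned4([0.0; 4]);
        // Fine here: the address of out.0 is a multiple of 32 by construction.
        _mm256_store_pd(out.0.as_mut_ptr(), v);
        out
    }
}

/// Safe wrapper: returns None if the CPU lacks AVX.
fn aligned_roundtrip(input: [f64; 4]) -> Option<Aligned4> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just checked at runtime.
            return Some(unsafe { aligned_roundtrip_avx(&input) });
        }
    }
    let _ = input;
    None
}
```

For data that lives in a Vec<f64>, the unaligned load and store are usually the simpler choice, since a Vec<f64> does not promise more than 8-byte alignment.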

I'll watch that tutorial, thank you so much!

Quick update: I have it working at ~2x the initial speed, thanks for linking that tutorial!
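For anyone finding this later, here is roughly what the whole sum of products can look like. This is only a sketch under a few assumptions, not necessarily what I ended up with: it targets x86_64, the names dot and dot_avx are made up, it uses unaligned loads straight from the slices, and it falls back to a plain loop when AVX or FMA is not available.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Sum of products, 4 lanes at a time. Requires AVX and FMA at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,fma")]
unsafe fn dot_avx(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len().min(y.len());
    let chunks = n / 4;
    unsafe {
        // Start from zero so nothing extra ends up in the total.
        let mut acc = _mm256_setzero_pd();
        for i in 0..chunks {
            // In bounds: i * 4 + 3 < n for every i < chunks.
            let a = _mm256_loadu_pd(x.as_ptr().add(i * 4));
            let b = _mm256_loadu_pd(y.as_ptr().add(i * 4));
            acc = _mm256_fmadd_pd(a, b, acc); // acc += a * b, per lane
        }
        // Copy the four partial sums out and add them up.
        let mut lanes = [0.0f64; 4];
        _mm256_storeu_pd(lanes.as_mut_ptr(), acc);
        let mut sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        // Handle the leftover 0 to 3 elements with scalar code.
        for i in chunks * 4..n {
            sum += x[i] * y[i];
        }
        sum
    }
}

/// Safe entry point: picks the AVX path when the CPU supports it.
pub fn dot(x: &[f64], y: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            // SAFETY: both required features were just checked at runtime.
            return unsafe { dot_avx(x, y) };
        }
    }
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}
```

Calling dot(&vec1, &vec2) works directly on Vec<f64> data, so nothing needs to be copied into vector registers ahead of time. Because the additions happen in a different order than in the scalar loop, the result can differ in the last few bits for inputs that are not exactly representable.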