Thermite SIMD: Melt your CPU – Early Announcement and Feedback


Thermite is a new SIMD library focused on portable SIMD acceleration of SoA (Structure of Arrays) algorithms, using consistent-length SIMD vectors for lockstep iteration and computation. Extensive research and work have gone into minimizing wasted CPU cycles and making the most of what your CPU can do.

I've been working on Thermite for a little over a month now, and with the AVX2 backend and vectorized math library almost fully implemented, I think now is a good time to announce the crate and ask for feedback. Pre-AVX2/WASM/ARM backends are a work in progress.

The latest documentation is at

What would you like to see in an ideal SIMD framework? What can be done better in Thermite? What would be required to use Thermite in your number-crunching applications?


Do you have some examples of the kind of things I could compute using this?

It's designed for SoA algorithms, where you aren't meant to care how many values are operated on at once, be it 1 or 4 or 16. It's kind of similar to an ECS (Entity-Component-System) in that you can zip along your data and apply changes, 1-to-1.
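To make the layout difference concrete, here is a minimal std-only sketch of AoS versus SoA. It does not use Thermite at all, and the names `ParticleAos`, `ParticlesSoa`, `from_aos` and `translate` are made up for illustration:

```rust
// Hypothetical types for illustration only; none of this is Thermite API.

/// Array of Structures: each particle's fields sit next to each other.
#[derive(Clone, Copy)]
pub struct ParticleAos {
    pub x: f32,
    pub y: f32,
}

/// Structure of Arrays: each field gets its own contiguous buffer.
pub struct ParticlesSoa {
    pub x: Vec<f32>,
    pub y: Vec<f32>,
}

impl ParticlesSoa {
    /// Split an AoS slice into one array per field.
    pub fn from_aos(particles: &[ParticleAos]) -> Self {
        ParticlesSoa {
            x: particles.iter().map(|p| p.x).collect(),
            y: particles.iter().map(|p| p.y).collect(),
        }
    }

    /// Apply the same operation to every element, 1-to-1 across the arrays.
    pub fn translate(&mut self, dx: f32, dy: f32) {
        for (x, y) in self.x.iter_mut().zip(self.y.iter_mut()) {
            *x += dx;
            *y += dy;
        }
    }
}
```

Because every `x` is adjacent in memory, a group of consecutive values can be loaded into one SIMD register without shuffling fields apart first, which is the property an SoA-oriented library relies on.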

For example:

use thermite::*;

// Positions stored as SoA: one buffer per component.
pub struct Position2D<S: Simd> {
    pub x: VectorBuffer<S, Vf32<S>>,
    pub y: VectorBuffer<S, Vf32<S>>,
}

// Velocities stored the same way, so they line up 1-to-1 with positions.
pub struct Velocity2D<S: Simd> {
    pub x: VectorBuffer<S, Vf32<S>>,
    pub y: VectorBuffer<S, Vf32<S>>,
}

pub struct System<S: Simd> {
    pub pos: Position2D<S>,
    pub vel: Velocity2D<S>,
}

impl<S: Simd> System<S> {
    pub fn update(&mut self, dt: f32) {
        // broadcast dt into every lane of a SIMD vector
        let dt = Vf32::<S>::splat(dt);

        // all four buffers must have the same length
        debug_assert_eq!(self.pos.x.len(), self.pos.y.len());
        debug_assert_eq!(self.vel.x.len(), self.vel.y.len());
        debug_assert_eq!(self.pos.x.len(), self.vel.y.len());

        // this is verbose, but I'm working on better iterator APIs
        let px = self.pos.x.as_mut_vector_slice();
        let py = self.pos.y.as_mut_vector_slice();
        let vx = self.vel.x.as_vector_slice();
        let vy = self.vel.y.as_vector_slice();

        for (((px, py), vx), vy) in px.iter_mut().zip(py).zip(vx).zip(vy) {
            // pos = dt * vel + pos, one whole vector per iteration
            *px = dt.mul_add(*vx, *px);
            *py = dt.mul_add(*vy, *py);
        }
    }
}
That applies the velocities to the positions in lockstep, however many values the instruction set handles at a time. AVX2, for example, computes 8 f32 values at once.
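As a rough picture of what "8 at once" means, here is a plain-Rust stand-in for the same position update that uses only the standard library. It is a sketch, not how Thermite is implemented: the `LANES` constant and the `integrate` function are hypothetical, and whether the inner loop actually becomes SIMD instructions is left to the compiler.

```rust
/// Hypothetical lane count: 8 `f32` values fit in one 256-bit AVX2 register.
const LANES: usize = 8;

/// Advance positions by `vel * dt`, handling `LANES` values per step.
/// Plain-Rust stand-in for illustration, not Thermite code.
pub fn integrate(pos: &mut [f32], vel: &[f32], dt: f32) {
    assert_eq!(pos.len(), vel.len());

    let mut pos_chunks = pos.chunks_exact_mut(LANES);
    let mut vel_chunks = vel.chunks_exact(LANES);

    // Full chunks: a fixed-length inner loop the compiler may auto-vectorize.
    for (p, v) in pos_chunks.by_ref().zip(vel_chunks.by_ref()) {
        for lane in 0..LANES {
            // dt * v + p, the same shape as `dt.mul_add(*vx, *px)` in the post
            p[lane] = dt.mul_add(v[lane], p[lane]);
        }
    }

    // Tail: leftover elements when the length is not a multiple of LANES.
    let p_tail = pos_chunks.into_remainder();
    let v_tail = vel_chunks.remainder();
    for (p, v) in p_tail.iter_mut().zip(v_tail) {
        *p = dt.mul_add(*v, *p);
    }
}
```

With a different `LANES` value (4 for a 128-bit register, 16 for a 512-bit one) the calling code stays the same, which is the sense in which the caller doesn't need to care how many values are processed per step.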

