Review Request: Can I improve my code?

As a private project for fun, I chose to implement a fully-connected neural network in Rust. To be precise, I'm implementing a multilayer perceptron, which is one of the most basic neural networks there is. My goal is to create the fastest possible feed-forward function for a given network (and the features of my computer). It's my first time using SIMD and I don't know a whole lot about it. I think I created a fairly fast function, but I would like to know whether there's anything I can improve performance-wise.

Aside from performance, making my code less dependent on my specific computer's features while keeping the same performance is my secondary goal. I don't mind using third-party packages for that purpose.

My third goal is to create more ergonomic Rust code. For example, the usage of Cell is more or less a band-aid right now, because Rust doesn't have a slice::windows_mut function.

Thanks in advance for any tips!

#![no_implicit_prelude]

use ::std::arch::x86_64::__m256;
use ::std::arch::x86_64::_mm256_add_ps;
use ::std::arch::x86_64::_mm256_mul_ps;
use ::std::arch::x86_64::_mm256_rcp_ps;
use ::std::arch::x86_64::_mm256_set1_ps;
use ::std::arch::x86_64::_mm256_setzero_ps;
use ::std::boxed::Box;
use ::std::cell::Cell;
use ::std::eprintln;
use ::std::iter::IntoIterator;
use ::std::iter::Iterator;
use ::std::time::Instant;
use ::std::vec;

fn main() {
    unsafe {
        let mut edges: Box<[Box<[EdgeVector]>]> = vec![
            vec![
                EdgeVector {
                    weight: _mm256_set1_ps(1.0)
                };
                4
            ]
            .into_boxed_slice(),
            vec![
                EdgeVector {
                    weight: _mm256_set1_ps(1.0)
                };
                2
            ]
            .into_boxed_slice(),
        ]
        .into_boxed_slice();

        let mut vertices: Box<[Box<[VertexVector]>]> = vec![
            vec![
                VertexVector {
                    value: Cell::new(_mm256_set1_ps(1.0))
                };
                2
            ]
            .into_boxed_slice(),
            vec![
                VertexVector {
                    value: Cell::new(_mm256_setzero_ps())
                };
                2
            ]
            .into_boxed_slice(),
            vec![
                VertexVector {
                    value: Cell::new(_mm256_setzero_ps())
                };
                1
            ]
            .into_boxed_slice(),
        ]
        .into_boxed_slice();

        let instant = Instant::now();
        feed_forward(&mut vertices, &mut edges);
        eprintln!("feed_forward: {} ns", instant.elapsed().as_nanos());
    }
}

unsafe fn feed_forward(vertices: &mut [Box<[VertexVector]>], edges: &mut [Box<[EdgeVector]>]) {
    for (layers, weight_matrix) in vertices.windows(2).zip(edges.into_iter()) {
        let layer = layers.get_unchecked(0);
        let next_layer = layers.get_unchecked(1);
        let next_layer_len = next_layer.len();
        let mut start = 0;

        for source_vertex_vec in layer.into_iter() {
            for (edge_vec, target_vertex_vec) in (&weight_matrix[start..start + next_layer_len])
                .into_iter()
                .zip(next_layer.into_iter())
            {
                target_vertex_vec.value.set(_mm256_add_ps(
                    target_vertex_vec.value.get(),
                    _mm256_mul_ps(source_vertex_vec.value.get(), edge_vec.weight),
                ));
            }

            start += next_layer_len;
        }

        for target_vertex_vec in next_layer.into_iter() {
            target_vertex_vec
                .value
                .set(_mm256_lgc_ps(target_vertex_vec.value.get()));
        }
    }
}

#[derive(Clone)]
struct EdgeVector {
    weight: __m256,
}

#[derive(Clone)]
struct VertexVector {
    value: Cell<__m256>,
}

/// Compute the logistic value of packed single-precision (32-bit) floating-point elements in a and store the results in dst.
///
/// <b>Operation</b>
/// ```
/// FOR j := 0 to 7
/// 	i := j*32
/// 	dst[i+31:i] := lgc(a[i+31:i])
/// ENDFOR
/// dst[MAX:256] := 0
/// ```
#[inline]
unsafe fn _mm256_lgc_ps(a: __m256) -> __m256 {
    _mm256_rcp_ps(_mm256_add_ps(
        _mm256_exp_ps(_mm256_mul_ps(a, _mm256_set1_ps(-1.0))),
        _mm256_set1_ps(1.0),
    ))
}

/// Compute the exponential value of e raised to the power of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.
///
/// <b>Operation</b>
/// ```
/// FOR j := 0 to 7
/// 	i := j*32
/// 	dst[i+31:i] := e^(a[i+31:i])
/// ENDFOR
/// dst[MAX:256] := 0
/// ```
///
/// <b>References</b>
/// [Intel® Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_exp_ps&expand=2273)
#[inline]
unsafe fn _mm256_exp_ps(a: __m256) -> __m256 {
    let mut scalar = F32x8 { simd: a }.scalar;

    for float in &mut scalar {
        *float = float.exp();
    }

    F32x8 { scalar }.simd
}

#[repr(C)]
union F32x8 {
    simd: __m256,
    scalar: [f32; 8],
}

(Playground)

1st Change

As it stands, performance-wise my function is already plenty fast. I just made a small mistake in my naive benchmark: I measured the time of a single execution. Apparently, when measuring code that runs for only nanoseconds or a few microseconds, the measurement itself adds noticeable overhead, due to the "latency" between calling Instant::now and Instant::elapsed. Changing

        let instant = Instant::now();
        feed_forward(&mut vertices, &mut edges);
        eprintln!("feed_forward: {} ns", instant.elapsed().as_nanos());

to

        let instant = Instant::now();

        for _ in 0..10000 {
            feed_forward(&mut vertices, &mut edges);
        }

        eprintln!("feed_forward: {} ns", instant.elapsed().as_nanos() / 10000);

showed a tremendous difference in execution time per function call. The function appears to be more than 10x faster than I previously thought, which is great! However, I also think that it is almost impossible to improve the performance of the current code snippet any further.
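The averaging loop above can be wrapped in a small helper. This is a minimal sketch, not the benchmark from the post: `average_nanos` and `work` are hypothetical names, and the use of `std::hint::black_box` is an assumption on my part, added to make it less likely that the optimizer discards the measured work.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical helper: runs `f` `iterations` times and returns the mean
/// wall-clock time per call in nanoseconds. The timer is read only twice,
/// so its overhead is spread over all iterations.
fn average_nanos<F: FnMut()>(iterations: u32, mut f: F) -> u128 {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        f();
    }
    start.elapsed().as_nanos() / u128::from(iterations)
}

/// Stand-in workload (an assumption; replace with the function under test).
fn work(values: &mut [f32]) {
    for v in values.iter_mut() {
        *v = *v * 1.0001 + 0.5;
    }
}

fn main() {
    let mut values = vec![1.0_f32; 64];
    let ns = average_nanos(10_000, || {
        // black_box hides the argument from the optimizer.
        work(black_box(&mut values));
    });
    eprintln!("work: {} ns per call", ns);
}
```

Integer division truncates, so very fast functions may report 0 ns; raising the iteration count or reporting the total time avoids that.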

2nd Change

The second change I made was ergonomic in nature. Reading and correctly interpreting code written by others (and, after enough time, even by oneself) is a common task when developing. Therefore, IMO, improving code readability should always be a goal when programming. After wrapping my head around what I'm actually doing in the loops, I realized that I could use the slice::chunks_exact method instead of doing essentially the same thing in a more cryptic way.

I changed the following code from

        for source_vertex_vec in layer.into_iter() {
            for (edge_vec, target_vertex_vec) in (&weight_matrix[start..start + next_layer_len])
                .into_iter()
                .zip(next_layer.into_iter())
            {
                target_vertex_vec.value.set(_mm256_add_ps(
                    target_vertex_vec.value.get(),
                    _mm256_mul_ps(source_vertex_vec.value.get(), edge_vec.weight),
                ));
            }

            start += next_layer_len;
        }

to

        for (source_vertex_vec, weight_row) in layer
            .into_iter()
            .zip(weight_matrix.chunks_exact(next_layer_len))
        {
            for (edge_vec, target_vertex_vec) in weight_row.into_iter().zip(next_layer.into_iter())
            {
                target_vertex_vec.value.set(_mm256_add_ps(
                    target_vertex_vec.value.get(),
                    _mm256_mul_ps(source_vertex_vec.value.get(), edge_vec.weight),
                ));
            }
        }

This is doing exactly the same thing without any loss of performance, but at the same time, it is both simpler in design and more expressive. In the outer loop, I iterate over each row of the matrix together with each corresponding element of the source vector. In the inner loop, I iterate over each column together with each corresponding element of the target vector.

Previously, I probably would've had to comment my intent. Now the intent seems clearer, and if I look at my code again in a week, I'm confident I'll spend much less time figuring out what I'm trying to do here.
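To check that the two loop shapes really compute the same thing, here is a minimal scalar sketch. The function names `accumulate_manual` and `accumulate_chunks` are hypothetical, and plain `f32` slices stand in for the SIMD types used in the post.

```rust
/// Manual offsets: row `i` of the weight matrix starts at `i * targets.len()`.
fn accumulate_manual(sources: &[f32], weights: &[f32], targets: &mut [f32]) {
    let width = targets.len();
    let mut start = 0;
    for source in sources {
        for (weight, target) in weights[start..start + width].iter().zip(targets.iter_mut()) {
            *target += source * weight;
        }
        start += width;
    }
}

/// Same accumulation, with `chunks_exact` yielding one matrix row per source.
fn accumulate_chunks(sources: &[f32], weights: &[f32], targets: &mut [f32]) {
    let width = targets.len();
    for (source, row) in sources.iter().zip(weights.chunks_exact(width)) {
        for (weight, target) in row.iter().zip(targets.iter_mut()) {
            *target += source * weight;
        }
    }
}

fn main() {
    let sources = [1.0, 2.0];
    // 2 rows (one per source) x 3 columns (one per target), row-major.
    let weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let mut a = [0.0; 3];
    let mut b = [0.0; 3];
    accumulate_manual(&sources, &weights, &mut a);
    accumulate_chunks(&sources, &weights, &mut b);
    assert_eq!(a, b);
    println!("{:?}", a);
}
```

The two versions differ in their edge cases: `chunks_exact` panics when the chunk size is 0 and silently ignores a trailing partial row, while the manual slicing panics when the weight slice is too short.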

3rd Change

One of the remaining annoyances in the code is the bare usage of the __m256 type and the _mm256_* functions everywhere. Calling them is unsafe and I'd like to convert the feed_forward function from unsafe to safe. The first step I took towards a safe version is to abstract away the SIMD intrinsics by enhancing the F32x8 union type through several implementations. Here's what I did:

#[repr(C)]
pub union F32x8 {
    vector: __m256,
    scalar: [f32; 8],
}

impl F32x8 {
    #[inline]
    pub fn lgc(self) -> Self {
        Self {
            vector: unsafe { _mm256_lgc_ps(self.vector) },
        }
    }
}

impl Clone for F32x8 {
    #[inline]
    fn clone(&self) -> Self {
        Self {
            vector: unsafe { self.vector },
        }
    }
}

impl Copy for F32x8 {}

impl Add for F32x8 {
    type Output = Self;

    #[inline]
    fn add(self, rhs: Self) -> Self::Output {
        Self {
            vector: unsafe { _mm256_add_ps(self.vector, rhs.vector) },
        }
    }
}

impl AddAssign for F32x8 {
    #[inline]
    fn add_assign(&mut self, rhs: Self) {
        self.vector = unsafe { _mm256_add_ps(self.vector, rhs.vector) };
    }
}

impl Mul for F32x8 {
    type Output = Self;

    #[inline]
    fn mul(self, rhs: Self) -> Self::Output {
        Self {
            vector: unsafe { _mm256_mul_ps(self.vector, rhs.vector) },
        }
    }
}

impl From<__m256> for F32x8 {
    fn from(from: __m256) -> Self {
        Self { vector: from }
    }
}

impl From<[f32; 8]> for F32x8 {
    fn from(from: [f32; 8]) -> Self {
        Self { scalar: from }
    }
}

impl From<f32> for F32x8 {
    fn from(from: f32) -> Self {
        Self {
            vector: unsafe { _mm256_set1_ps(from) },
        }
    }
}

trait Zero: Sized + Add {
    fn zero() -> Self;
}

impl Zero for F32x8 {
    fn zero() -> Self {
        Self {
            vector: unsafe { _mm256_setzero_ps() },
        }
    }
}

(Small note: I changed the field name from simd to vector due to personal taste and consistency with how I named other things)

The code is fairly self-explanatory. I only added impls I really need. If I need more later, then I can still add more impls for existing SIMD instructions. Anyway, I replaced every use of SIMD intrinsics with the F32x8 type and its methods.
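With the secondary goal in mind (less dependence on one machine's features), the same operator surface can also be built on a plain array. This is a different technique from the union above, not the author's code: `PortableF32x8` and `splat` are hypothetical names, and whether the loops compile to AVX instructions is left to the compiler and the target flags.

```rust
use std::ops::{Add, AddAssign, Mul};

/// Hypothetical array-backed stand-in for the union-based `F32x8`.
/// The 32-byte alignment matches `__m256`; no intrinsics or `unsafe` needed.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C, align(32))]
pub struct PortableF32x8([f32; 8]);

impl PortableF32x8 {
    /// Broadcasts one value to all eight lanes.
    pub fn splat(value: f32) -> Self {
        Self([value; 8])
    }

    /// Lane-wise logistic function.
    pub fn lgc(self) -> Self {
        Self(self.0.map(|x| 1.0 / (1.0 + (-x).exp())))
    }
}

impl Add for PortableF32x8 {
    type Output = Self;

    fn add(self, rhs: Self) -> Self {
        let mut lanes = self.0;
        for (lane, r) in lanes.iter_mut().zip(rhs.0.iter()) {
            *lane += *r;
        }
        Self(lanes)
    }
}

impl Mul for PortableF32x8 {
    type Output = Self;

    fn mul(self, rhs: Self) -> Self {
        let mut lanes = self.0;
        for (lane, r) in lanes.iter_mut().zip(rhs.0.iter()) {
            *lane *= *r;
        }
        Self(lanes)
    }
}

impl AddAssign for PortableF32x8 {
    fn add_assign(&mut self, rhs: Self) {
        *self = *self + rhs;
    }
}

fn main() {
    let mut acc = PortableF32x8::splat(1.0);
    acc += PortableF32x8::splat(2.0) * PortableF32x8::splat(3.0);
    assert_eq!(acc, PortableF32x8::splat(7.0));
    println!("{:?}", acc.lgc());
}
```

The trade-off is control: the intrinsics guarantee a specific instruction, while this version only gives the optimizer a shape that is easy to vectorize.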

4th Change

The other leftover annoyance is the use of Cell, which I only added because there's no slice::windows_mut method. At first, I decided to create my own windows_mut method, but then I noticed that I'd still have to use slice::get_unchecked to get the current and next layer without bounds checks. I decided to forgo the implementation of windows_mut and instead created pairs_mut, which works for exactly 2 elements and returns a tuple instead of a slice, eliminating the need for indexing operations completely. Sadly, this didn't work out as I wanted: a caller could keep the mutable references from one call to Iterator::next while calling it again, ending up with overlapping mutable references, which is undefined behavior.

In the end, I went with the simplest solution, which, while not fancy, did the job and enabled me to remove the use of Cell. I create a mutable iterator over the vertices. Then I call Iterator::next once to save the first layer. Then I zip the mutable iterator like I did with windows. At the end of each loop iteration, I overwrite the layer variable with next_layer and repeat until I've iterated through all vertices. This is both safe and involves no addition of functionality to the slice.

Here's the resulting code:

#![no_implicit_prelude]
#![cfg(target_arch = "x86_64")]
#![cfg(target_feature = "avx")]

use ::std::arch::x86_64::__m256;
use ::std::arch::x86_64::_mm256_add_ps;
use ::std::arch::x86_64::_mm256_mul_ps;
use ::std::arch::x86_64::_mm256_rcp_ps;
use ::std::arch::x86_64::_mm256_set1_ps;
use ::std::arch::x86_64::_mm256_setzero_ps;
use ::std::boxed::Box;
use ::std::clone::Clone;
use ::std::convert::From;
use ::std::eprintln;
use ::std::iter::Iterator;
use ::std::marker::Copy;
use ::std::marker::Sized;
use ::std::num::NonZeroU32;
use ::std::ops::Add;
use ::std::ops::AddAssign;
use ::std::ops::Mul;
use ::std::option::Option;
use ::std::time::Instant;
use ::std::vec;

fn main() {
    if !::std::is_x86_feature_detected!("avx") {
        ::std::process::abort();
    }

    let edges = vec![
        vec![
            EdgeVector {
                weight: F32x8::from(1.0),
                id: [Option::None; 8]
            };
            8
        ]
        .into_boxed_slice(),
        vec![
            EdgeVector {
                weight: F32x8::from(1.0),
                id: [Option::None; 8]
            };
            16
        ]
        .into_boxed_slice(),
    ]
    .into_boxed_slice();

    let mut vertices = vec![
        vec![
            VertexVector {
                value: F32x8::from(1.0),
            };
            8
        ]
        .into_boxed_slice(),
        vec![
            VertexVector {
                value: F32x8::zero(),
            };
            1
        ]
        .into_boxed_slice(),
        vec![
            VertexVector {
                value: F32x8::zero(),
            };
            16
        ]
        .into_boxed_slice(),
    ]
    .into_boxed_slice();

    let instant = Instant::now();

    for _ in 0..10000 {
        feed_forward(&mut vertices, &edges);
    }

    eprintln!("feed_forward: {} ns", instant.elapsed().as_nanos() / 10000);
}

fn feed_forward(vertices: &mut [Box<[VertexVector]>], edges: &[Box<[EdgeVector]>]) {
    let mut vertices_iter = vertices.iter_mut();

    if let Option::Some(mut layer) = vertices_iter.next() {
        for (next_layer, weight_matrix) in vertices_iter.zip(edges.iter()) {
            let next_layer_len = next_layer.len();

            for (source_vertex_vec, weight_row) in layer
                .iter_mut()
                .zip(weight_matrix.chunks_exact(next_layer_len))
            {
                for (edge_vec, target_vertex_vec) in weight_row.iter().zip(next_layer.iter_mut()) {
                    target_vertex_vec.value += source_vertex_vec.value * edge_vec.weight;
                }
            }

            for target_vertex_vec in next_layer.iter_mut() {
                target_vertex_vec.value = target_vertex_vec.value.lgc();
            }

            layer = next_layer;
        }
    }
}

#[derive(Clone)]
struct EdgeVector {
    weight: F32x8,
    id: [Option<NonZeroU32>; 8],
}

#[derive(Clone)]
struct VertexVector {
    value: F32x8,
}

/// Compute the logistic value of packed single-precision (32-bit) floating-point elements in a and store the results in dst.
///
/// <b>Operation</b>
/// ```
/// FOR j := 0 to 7
/// 	i := j*32
/// 	dst[i+31:i] := lgc(a[i+31:i])
/// ENDFOR
/// dst[MAX:256] := 0
/// ```
#[inline]
unsafe fn _mm256_lgc_ps(a: __m256) -> __m256 {
    _mm256_rcp_ps(_mm256_add_ps(
        _mm256_exp_ps(_mm256_mul_ps(a, _mm256_set1_ps(-1.0))),
        _mm256_set1_ps(1.0),
    ))
}

/// Compute the exponential value of e raised to the power of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.
///
/// <b>Operation</b>
/// ```
/// FOR j := 0 to 7
/// 	i := j*32
/// 	dst[i+31:i] := e^(a[i+31:i])
/// ENDFOR
/// dst[MAX:256] := 0
/// ```
///
/// <b>References</b>
/// [Intel® Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_exp_ps&expand=2273)
#[inline]
unsafe fn _mm256_exp_ps(a: __m256) -> __m256 {
    let mut scalar = F32x8 { vector: a }.scalar;

    for float in &mut scalar {
        *float = float.exp();
    }

    F32x8 { scalar }.vector
}

#[repr(C)]
pub union F32x8 {
    vector: __m256,
    scalar: [f32; 8],
}

impl F32x8 {
    #[inline]
    pub fn lgc(self) -> Self {
        Self {
            vector: unsafe { _mm256_lgc_ps(self.vector) },
        }
    }
}

impl Clone for F32x8 {
    #[inline]
    fn clone(&self) -> Self {
        Self {
            vector: unsafe { self.vector },
        }
    }
}

impl Copy for F32x8 {}

impl Add for F32x8 {
    type Output = Self;

    #[inline]
    fn add(self, rhs: Self) -> Self::Output {
        Self {
            vector: unsafe { _mm256_add_ps(self.vector, rhs.vector) },
        }
    }
}

impl AddAssign for F32x8 {
    #[inline]
    fn add_assign(&mut self, rhs: Self) {
        self.vector = unsafe { _mm256_add_ps(self.vector, rhs.vector) };
    }
}

impl Mul for F32x8 {
    type Output = Self;

    #[inline]
    fn mul(self, rhs: Self) -> Self::Output {
        Self {
            vector: unsafe { _mm256_mul_ps(self.vector, rhs.vector) },
        }
    }
}

impl From<__m256> for F32x8 {
    fn from(from: __m256) -> Self {
        Self { vector: from }
    }
}

impl From<[f32; 8]> for F32x8 {
    fn from(from: [f32; 8]) -> Self {
        Self { scalar: from }
    }
}

impl From<f32> for F32x8 {
    fn from(from: f32) -> Self {
        Self {
            vector: unsafe { _mm256_set1_ps(from) },
        }
    }
}

trait Zero: Sized + Add {
    fn zero() -> Self;
}

impl Zero for F32x8 {
    fn zero() -> Self {
        Self {
            vector: unsafe { _mm256_setzero_ps() },
        }
    }
}

(Playground)
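The iteration pattern used in feed_forward (take the first element, then walk the rest while carrying the previous one along) can be isolated into a helper that takes a closure. This is a sketch under my own naming: `for_each_pair_mut` is hypothetical and not part of the standard library. Because the closure only borrows the pair for the duration of one call, the overlapping-references problem of an external pairs_mut iterator does not arise.

```rust
/// Hypothetical helper: calls `f(previous, next)` for every pair of adjacent
/// elements, front to back, using only safe code and no indexing.
fn for_each_pair_mut<T, F>(items: &mut [T], mut f: F)
where
    F: FnMut(&mut T, &mut T),
{
    let mut iter = items.iter_mut();

    if let Some(mut previous) = iter.next() {
        for next in iter {
            // Reborrow both references so they are only lent to the closure.
            f(&mut *previous, &mut *next);
            // The element just written becomes the source of the next pair.
            previous = next;
        }
    }
}

fn main() {
    let mut values = [1, 2, 3, 4];
    // Each element receives the (already updated) value of its predecessor.
    for_each_pair_mut(&mut values, |previous, next| *next += *previous);
    println!("{:?}", values);
}
```

Slices with fewer than two elements simply produce no calls, which mirrors how the `if let` in feed_forward handles an empty network.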

5th Change

I noticed that I have a logic error in my matrix multiplication. As I am storing everything in vectors from the very beginning and only ever operate on vectors, I never did a proper matrix multiplication. For that purpose, I'd have to rotate the elements in a vector and multiply again, but that is not easily possible. The way I fixed my code was to take out the vector-specific code first and get a properly working example. I expected the resulting code to be slower, not only due to the lack of vector operations, but also due to the additional operations. Against all expectations, the scalar version is actually just as fast as the faulty vector implementation. In conclusion, I removed all vector code and simply let the compiler do what it wants, and it works brilliantly!

Here's the scalar implementation:

#![no_implicit_prelude]

use ::std::boxed::Box;
use ::std::eprintln;
use ::std::iter::Iterator;
use ::std::option::Option;
use ::std::time::Instant;
use ::std::vec;

fn main() {
    let edges = vec![
        vec![0.0; 64].into_boxed_slice(),
        vec![0.0; 128].into_boxed_slice(),
    ]
    .into_boxed_slice();

    let mut vertices = vec![
        vec![0.0; 64].into_boxed_slice(),
        vec![0.0; 1].into_boxed_slice(),
        vec![0.0; 128].into_boxed_slice(),
    ]
    .into_boxed_slice();

    let instant = Instant::now();

    for _ in 0..10000 {
        feed_forward(&mut vertices, &edges);
    }

    eprintln!("feed_forward: {} ns", instant.elapsed().as_nanos() / 10000);
}

pub fn feed_forward(vertices: &mut [Box<[f32]>], edges: &[Box<[f32]>]) {
    let mut vertices_iter = vertices.iter_mut();

    if let Option::Some(mut layer) = vertices_iter.next() {
        for (next_layer, weight_matrix) in vertices_iter.zip(edges.iter()) {
            let next_layer_len = next_layer.len();

            for (source_vertex, weight_row) in layer
                .iter_mut()
                .zip(weight_matrix.chunks_exact(next_layer_len))
            {
                for (edge_vec, target_vertex) in weight_row.iter().zip(next_layer.iter_mut()) {
                    *target_vertex += *source_vertex * edge_vec;
                }
            }

            for target_vertex_vec in next_layer.iter_mut() {
                *target_vertex_vec = lgc(*target_vertex_vec);
            }

            layer = next_layer;
        }
    }
}

#[inline]
fn lgc(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

(Playground)
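A small worked example helps to confirm the row-major layout the scalar loop relies on: the weight at index `i * next_layer_len + j` connects source `i` to target `j`. The sketch below isolates one layer step; `layer_step` is a hypothetical name. It also zeroes the targets before accumulating, which the loop above does not do. That reset is my assumption about the intended behaviour when the function is called repeatedly.

```rust
fn lgc(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// One layer step: targets[j] = lgc(sum over i of sources[i] * weights[i][j]),
/// with `weights` stored row-major, one row per source vertex.
fn layer_step(sources: &[f32], weights: &[f32], targets: &mut [f32]) {
    assert_eq!(weights.len(), sources.len() * targets.len());

    // Assumption: start from zero so repeated calls give the same result.
    targets.fill(0.0);

    if targets.is_empty() {
        return; // chunks_exact(0) would panic
    }

    for (source, row) in sources.iter().zip(weights.chunks_exact(targets.len())) {
        for (weight, target) in row.iter().zip(targets.iter_mut()) {
            *target += source * weight;
        }
    }

    for target in targets.iter_mut() {
        *target = lgc(*target);
    }
}

fn main() {
    let sources = [1.0, -1.0];
    // Row 0 belongs to source 0, row 1 to source 1.
    let weights = [2.0, 0.0, 2.0, 1.0];
    let mut targets = [0.0; 2];

    layer_step(&sources, &weights, &mut targets);

    // Pre-activations: 1*2 + (-1)*2 = 0 and 1*0 + (-1)*1 = -1.
    println!("{:?}", targets);
}
```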

I'd like to provide a couple of insights, which you might find useful.

What reason do you have for the #![no_implicit_prelude] annotation? If this is "just because", I would suggest removing it. Which of these looks more maintainable (or which is less typing/copy-pasta'ing)?

#![no_implicit_prelude]

use ::std::boxed::Box;
use ::std::eprintln;
use ::std::iter::Iterator;
use ::std::option::Option;
use ::std::time::Instant;
use ::std::vec;

... or ...

use std::time::Instant;

The rest of these are automagically imported via the prelude.


Have you looked into existing SIMD crates like ultraviolet? It provides a 4x4 matrix type using f32x4 SIMD types. For large vector/matrix types, you'll have to roll-your-own, but this code may provide a good starting point?

nalgebra provides matrix types with generic dimensionality, but relies on the compiler's vectorizer like your scalar implementation does.


Why boxed vectors? This looks like it could be defined in a structure-of-arrays:

struct Edges {
    a: [f32; 64],
    b: [f32; 128],
}

struct Vertices {
    a: [f32; 64],
    b: [f32; 1],
    c: [f32; 128],
}

fn main() {
    let edges = Edges {
        a: [0.0; 64],
        b: [0.0; 128],
    };

    let mut vertices = Vertices {
        a: [0.0; 64],
        b: [0.0; 1],
        c: [0.0; 128],
    };

    // ...
}

pub fn feed_forward(vertices: &mut Vertices, edges: &Edges) {
    // ...
}

You will have to change your outer iterator to struct field accesses, but the inner iterators shouldn't have to change at all.

This also puts both structs entirely on the stack (no heap allocations) which may improve performance in areas where you need to create and destroy these types often.

Bonus: the feed_forward function signature is greatly improved.


If you want to benchmark code without measuring the overhead of your test fixtures, try criterion: https://bheisler.github.io/criterion.rs/book/getting_started.html


This last iteration is a huge improvement over the initial code. :wink:

I'll definitely try to use criterion in future iterations of the implementation. Thanks for the hint!

What you couldn't have known is that the neural network is created at runtime by an external function call, i.e. the number of layers and the number of vertices in each layer may differ from my test case. I think bumpalo might be more appropriate for my scenario. While I don't see it having a boxed slice type, the vector type it has seems like it would offer what I want.

In the future, I'll most likely have a NeuralNetwork type with some checks to make sure it's correctly laid out to improve the function signature.
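One way such a type could look is sketched below. Everything here is hypothetical: the names `NeuralNetwork`, `LayoutError`, `from_parts`, and `parts_mut` are mine, and the checks assume the plain multilayer-perceptron layout from the 5th change (one edge matrix per pair of adjacent layers, with `source_len * target_len` weights). Empty layers are rejected because `chunks_exact(0)` would panic in feed_forward.

```rust
/// Hypothetical wrapper that validates the layout once, at construction.
#[derive(Debug)]
pub struct NeuralNetwork {
    vertices: Box<[Box<[f32]>]>,
    edges: Box<[Box<[f32]>]>,
}

#[derive(Debug, PartialEq)]
pub enum LayoutError {
    EmptyLayer { index: usize },
    EdgeMatrixCount { expected: usize, found: usize },
    EdgeMatrixSize { index: usize, expected: usize, found: usize },
}

impl NeuralNetwork {
    pub fn from_parts(vertices: Vec<Vec<f32>>, edges: Vec<Vec<f32>>) -> Result<Self, LayoutError> {
        if let Some(index) = vertices.iter().position(|layer| layer.is_empty()) {
            return Err(LayoutError::EmptyLayer { index });
        }

        // One edge matrix per pair of adjacent layers.
        let expected = vertices.len().saturating_sub(1);
        if edges.len() != expected {
            return Err(LayoutError::EdgeMatrixCount { expected, found: edges.len() });
        }

        for (index, (pair, matrix)) in vertices.windows(2).zip(edges.iter()).enumerate() {
            let expected = pair[0].len() * pair[1].len();
            if matrix.len() != expected {
                return Err(LayoutError::EdgeMatrixSize { index, expected, found: matrix.len() });
            }
        }

        Ok(Self {
            vertices: vertices.into_iter().map(Vec::into_boxed_slice).collect(),
            edges: edges.into_iter().map(Vec::into_boxed_slice).collect(),
        })
    }

    /// Borrows both parts at once, in the shape `feed_forward` expects.
    pub fn parts_mut(&mut self) -> (&mut [Box<[f32]>], &[Box<[f32]>]) {
        (&mut self.vertices, &self.edges)
    }
}

fn main() {
    let mut network = NeuralNetwork::from_parts(
        vec![vec![0.0; 64], vec![0.0; 1], vec![0.0; 128]],
        vec![vec![0.0; 64], vec![0.0; 128]],
    )
    .expect("layout should be valid");

    let (vertices, edges) = network.parts_mut();
    println!("{} layers, {} edge matrices", vertices.len(), edges.len());
}
```

With the invariants checked up front, a later feed_forward taking `&mut NeuralNetwork` could rely on them instead of re-validating on every call.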

I'll take a good look at both of them. Thank you for the links!

While I currently do not have definite proof, I expect fewer dependencies to be better for compilation speed if I only use a small subset of the std prelude. The other reason is that I just like to know about my dependencies.

P.S.: The feed-forward function has grown more complex since my last comment, because the pure multi-layer perceptron does not cover all my needs for a neural network. I'll likely incorporate criterion before I showcase my next iteration. Having a more robust benchmark tool will be needed, going forward.

The amount of stuff you import from libstd doesn't matter for compilation times. At most it would make the resolve phase of the compilation a bit faster, but that phase doesn't contribute to much of the compilation time anyway.

I have added criterion as a dev dependency to my package and made it a library instead of a binary package. I had only made it a binary because I needed to execute some benchmark/test code. I uploaded the source code to GitHub:

6th Change
I briefly mentioned that a pure multi-layer perceptron does not fulfill my needs. In addition to what I already have, I also need to be able to work with connections that skip layers. I solved this by introducing additional vertices per layer that carry all vertices of the previous layer forward, and by not applying the sigmoid function to them. I still use the same vertices and edges data structures as before. The only change I had to make was that layers now have additional vertices at the beginning of the slice. The branching logic has been kept to an absolute minimum for the sake of branch prediction.

It's obviously slower than what I had before, but that was to be expected, because more vertices per layer means a quadratic increase in calculations. There's nothing I can do about that unless I use a completely different approach, but that would require much more branching logic and is likely slower, because the CPU would be unable to predict the branches correctly and would lose a lot of time to mispredictions. The exception would be a huge but very sparsely connected NN. In that case, the increased complexity of the algorithm would be beneficial.

Here's the edited feed-forward function:

pub fn feed_forward(vertices: &mut [Box<[f32]>], edges: &[Box<[f32]>]) {
    let mut vertices_iter = vertices.iter_mut();

    if let Option::Some(mut layer) = vertices_iter.next() {
        for (next_layer, edge_matrix) in vertices_iter.zip(edges.iter()) {
            let (carry_target_vertices, target_vertices) = next_layer.split_at_mut(layer.len());

            for (source_vertex, target_vertex) in layer.iter().zip(carry_target_vertices) {
                *target_vertex = *source_vertex;
            }

            for (source_vertex, edge_row) in layer
                .iter()
                .zip(edge_matrix.chunks_exact(target_vertices.len()))
            {
                for (edge, target_vertex) in edge_row.iter().zip(target_vertices.iter_mut()) {
                    *target_vertex += *source_vertex * edge;
                }
            }

            for target_vertex in target_vertices.iter_mut() {
                *target_vertex = lgc(*target_vertex);
            }

            layer = next_layer;
        }
    }
}
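The slice lengths this function expects follow from the split_at_mut call: each layer holds the carried copy of the whole previous layer followed by its own vertices, and each edge matrix has one row per vertex of the previous layer. The sketch below computes those lengths; `carry_layout` is a hypothetical helper, not part of the package.

```rust
/// Hypothetical helper: given the number of "own" vertices per layer,
/// returns (vertex slice lengths, edge matrix lengths) for the carry layout.
fn carry_layout(own_sizes: &[usize]) -> (Vec<usize>, Vec<usize>) {
    let mut vertex_counts = Vec::with_capacity(own_sizes.len());
    let mut edge_counts = Vec::with_capacity(own_sizes.len().saturating_sub(1));
    let mut previous_total = 0;

    for (index, &own) in own_sizes.iter().enumerate() {
        if index > 0 {
            // One row per vertex of the previous layer (carried ones included),
            // one column per own vertex of this layer.
            edge_counts.push(previous_total * own);
        }

        // This layer stores the carried previous layer plus its own vertices.
        previous_total += own;
        vertex_counts.push(previous_total);
    }

    (vertex_counts, edge_counts)
}

fn main() {
    // 64 inputs, 1 hidden vertex, 128 outputs.
    let (vertex_counts, edge_counts) = carry_layout(&[64, 1, 128]);
    println!("vertices per layer: {:?}", vertex_counts);
    println!("weights per matrix: {:?}", edge_counts);
}
```

The growth is visible in the numbers: the second edge matrix has (64 + 1) * 128 = 8320 weights, compared with 1 * 128 = 128 in the plain perceptron.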

7th Change

This is a simple one. Here's the new benchmark:

use criterion::criterion_group;
use criterion::criterion_main;
use criterion::Criterion;
use mlp::feed_forward;

pub fn criterion_benchmark(c: &mut Criterion) {
    const NUM_INPUTS: usize = 64;
    const NUM_HIDDEN_0: usize = 1;
    const NUM_OUTPUTS: usize = 128;

    let mut vertices = vec![
        vec![0.0; NUM_INPUTS].into_boxed_slice(),
        vec![0.0; NUM_INPUTS + NUM_HIDDEN_0].into_boxed_slice(),
        vec![0.0; NUM_INPUTS + NUM_HIDDEN_0 + NUM_OUTPUTS].into_boxed_slice(),
    ]
    .into_boxed_slice();

    let edges = vec![
        vec![0.0; NUM_INPUTS * NUM_HIDDEN_0].into_boxed_slice(),
        vec![0.0; (NUM_INPUTS + NUM_HIDDEN_0) * NUM_OUTPUTS].into_boxed_slice(),
    ]
    .into_boxed_slice();

    c.bench_function("feed_forward #1", |b| {
        b.iter(|| feed_forward(&mut vertices, &edges))
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);