Announcing `numr`: A "Batteries-Included" Numerical Library for Rust (NumPy + GPU + Autograd)

Hi everyone,

I’ve started working on a new project called numr, and I wanted to share the vision and get early feedback.

The core idea is simple: What if NumPy were built today, in Rust, with the features we always wished it had built in?

We all love ndarray and the existing Rust ecosystem, but fragmentation is a real pain point. You often need separate crates for BLAS, LAPACK, sparse arrays, and especially GPU support. If you need gradients, you usually have to switch to a full-blown DL framework like burn or candle.

numr aims to be the foundational numerical layer that unifies these. It is designed to be backend-agnostic, differentiable, and extensible.

:rocket: What Makes numr Different?

1. "Same Code, Any Backend" Architecture
numr is built around a generic Tensor<R: Runtime> abstraction. You write your logic once, and it runs on:

  • CPU: AVX2/AVX-512/NEON accelerated
  • CUDA: native PTX kernels for NVIDIA
  • WebGPU: cross-platform support for AMD, Intel, and Apple Silicon

Unlike wrappers around cuBLAS or MKL, numr implements native kernels for operations, meaning no massive external C++ dependencies and full transparency down to the metal.
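
To make the pattern concrete, here is a toy sketch of what "same code, any backend" means in practice. The `Runtime` trait, `CpuRuntime` type, and method names below are illustrative only, not numr's actual API:

```rust
// Toy sketch of a backend-agnostic runtime trait; names are hypothetical.
trait Runtime {
    fn name(&self) -> &'static str;
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
}

struct CpuRuntime;

impl Runtime for CpuRuntime {
    fn name(&self) -> &'static str { "cpu" }
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        // Elementwise add on the CPU; a CUDA or WebGPU runtime would
        // implement the same trait with its own kernels.
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
}

// User code is written once, generic over the runtime.
fn add_vectors<R: Runtime>(rt: &R, a: &[f32], b: &[f32]) -> Vec<f32> {
    rt.add(a, b)
}

fn main() {
    let rt = CpuRuntime;
    let c = add_vectors(&rt, &[1.0, 2.0], &[3.0, 4.0]);
    assert_eq!(c, vec![4.0, 6.0]);
    println!("{}: {:?}", rt.name(), c);
}
```

The key design point is that the generic function never names a concrete backend, so swapping CPU for GPU is a type parameter change, not a rewrite.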

2. Built-in Autograd (Reverse & Forward Mode)
Differentiation isn't an afterthought. It supports:

  • Reverse-mode: For standard gradient descent/training.
  • Forward-mode: For efficient Jacobian-Vector Products (JVP), crucial for scientific computing tasks like stiff ODE solvers.
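
For readers unfamiliar with forward mode, the classic way to explain it is dual numbers: each value carries a tangent alongside it, and the tangent propagates by the chain rule, yielding a JVP in one pass. The sketch below is standalone illustrative code, not numr's autograd API:

```rust
// Minimal forward-mode AD via dual numbers (illustrative, not numr's API).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Dual {
    val: f64, // primal value
    tan: f64, // tangent (directional derivative)
}

impl Dual {
    fn new(val: f64, tan: f64) -> Self { Dual { val, tan } }
    fn mul(self, o: Dual) -> Dual {
        // Product rule: (uv)' = u'v + uv'
        Dual::new(self.val * o.val, self.tan * o.val + self.val * o.tan)
    }
    fn add(self, o: Dual) -> Dual {
        Dual::new(self.val + o.val, self.tan + o.tan)
    }
}

// f(x) = x*x + x, so f'(x) = 2x + 1
fn f(x: Dual) -> Dual {
    x.mul(x).add(x)
}

fn main() {
    // Seeding the tangent with 1.0 gives df/dx at x = 3.
    let y = f(Dual::new(3.0, 1.0));
    assert_eq!(y.val, 12.0); // 3*3 + 3
    assert_eq!(y.tan, 7.0);  // 2*3 + 1
    println!("f(3) = {}, f'(3) = {}", y.val, y.tan);
}
```

Reverse mode instead records the computation and propagates adjoints backwards, which is cheaper when you need the gradient of one scalar loss with respect to many inputs.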

3. Modern & Comprehensive Dtypes
Beyond standard f32/f64, numr has native support for:

  • f16 / bf16 (Half precision)
  • fp8 (FP8E4M3, FP8E5M2 for modern ML workloads)
  • Complex numbers (Complex64/128)
  • Sparse tensors (CSR, CSC, COO formats), integrated directly rather than as a separate crate
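
To show what the sparse formats involve, here is a minimal COO-to-CSR conversion (assuming entries are already sorted by row, which is typical for COO input). This is standalone code, not numr's sparse API:

```rust
// Convert COO triplets (row, col, value) to CSR (row_ptr, col_idx, values).
// Assumes the COO entries are sorted by row index.
fn coo_to_csr(
    n_rows: usize,
    rows: &[usize],
    cols: &[usize],
    vals: &[f64],
) -> (Vec<usize>, Vec<usize>, Vec<f64>) {
    // Count nonzeros per row, then prefix-sum into row pointers.
    let mut row_ptr = vec![0usize; n_rows + 1];
    for &r in rows {
        row_ptr[r + 1] += 1;
    }
    for i in 0..n_rows {
        row_ptr[i + 1] += row_ptr[i];
    }
    // With row-sorted input, columns and values carry over unchanged.
    (row_ptr, cols.to_vec(), vals.to_vec())
}

fn main() {
    // 3x3 matrix [[1,0,2],[0,0,0],[0,3,0]] in COO form.
    let (row_ptr, col_idx, values) =
        coo_to_csr(3, &[0, 0, 2], &[0, 2, 1], &[1.0, 2.0, 3.0]);
    assert_eq!(row_ptr, vec![0, 2, 2, 3]);
    assert_eq!(col_idx, vec![0, 2, 1]);
    assert_eq!(values, vec![1.0, 2.0, 3.0]);
}
```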

:hammer_and_wrench: The "SciPy" Layer: solvr

To prove the robustness of numr, I am simultaneously building solvr, a library for higher-level scientific computing (equivalent to SciPy). It currently implements algorithms for:

  • Optimization: BFGS (using tensor ops, fully GPU-accelerated), simple gradient descent.
  • Integration: Trapezoidal, Simpson's rule, and ODE solvers (RK45, Dop853).
  • Signal Processing: FFT, Convolution, STFT.

Because solvr is built on numr traits, all of these algorithms run seamlessly on CUDA or WebGPU without changing a single line of code.
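
As a taste of the integration layer, here is the composite trapezoidal rule in plain Rust over f64 (a standalone sketch of the algorithm solvr names, not solvr's tensor-based API):

```rust
// Composite trapezoidal rule: approximate the integral of f over [a, b]
// with n equal subintervals. Error shrinks as O(1/n^2) for smooth f.
fn trapezoid(f: impl Fn(f64) -> f64, a: f64, b: f64, n: usize) -> f64 {
    let h = (b - a) / n as f64;
    // Endpoints get weight 1/2, interior points weight 1.
    let mut sum = 0.5 * (f(a) + f(b));
    for i in 1..n {
        sum += f(a + i as f64 * h);
    }
    sum * h
}

fn main() {
    // Integral of x^2 on [0, 1] is exactly 1/3.
    let approx = trapezoid(|x| x * x, 0.0, 1.0, 1_000);
    assert!((approx - 1.0 / 3.0).abs() < 1e-5);
    println!("approx = {approx}");
}
```

In a tensor-based version, the same formula becomes a weighted reduction over a grid of samples, which is why it parallelizes naturally onto a GPU.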

:warning: Current Status

This is currently experimental (beta) software.

  • The architecture is stable.
  • Many kernels (Matmul, Unary, Binary, Reductions) are implemented for all backends.
  • However, performance tuning (vs. vendor libs) is ongoing, and the API is subject to change.

:link: Check it out

I’m looking for feedback on the API design and contributors who are interested in writing native kernels (WGSL/CUDA/Rust) or high-level scientific algorithms.

Repository:

Example usage:

use numr::prelude::*;

// Wrapped in main so the `?` operator compiles; the Box<dyn Error>
// return type assumes numr's errors implement std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define a device (CPU, Cuda, or Wgpu)
    let device = CudaRuntime::default_device()?;

    // Create tensors directly on the GPU
    let a = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;
    let b = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;

    // Operations use native GPU kernels
    let _c = a.matmul(&b)?;
    Ok(())
}

Thanks for reading!

4 Likes

Have you considered reusing (i.e. forking) previous art? I'm speaking of dfdx crate:

// https://github.com/chelsea0x3b/dfdx/blob/main/dfdx/examples/05-tensor-permute.rs

use dfdx::{
    shapes::{Axes3, Rank3},
    tensor::{AutoDevice, Tensor, ZerosTensor},
    tensor_ops::PermuteTo,
};

fn main() {
    let dev = AutoDevice::default();

    let a: Tensor<Rank3<3, 5, 7>, f32, _> = dev.zeros();

    // permuting is as easy as just expressing the desired shape
    // note that we are reversing the order of the axes here!
    let b = a.permute::<Rank3<7, 5, 3>, _>();

    // we can do any of the expected combinations!
    let _ = b.permute::<Rank3<5, 7, 3>, _>();

    // Just like broadcast/reduce there are times when
    // type inference is impossible because of ambiguities.
    // You can specify axes explicitly to get around this.
    let c: Tensor<Rank3<1, 1, 1>, f32, _> = dev.zeros();
    let _ = c.permute::<_, Axes3<1, 0, 2>>();
    // NOTE: fails with "Multiple impls satisfying..."
    // let _ = c.permute::<Rank3<1, 1, 1>, _>();
}

> Have you considered reusing (i.e. forking) previous art? I'm speaking of dfdx crate:

I have my own deep-learning library that uses numr; training and inference are already working. numr's core was actually extracted from it.

I've reviewed dfdx, burn, and candle. My DL library actually used candle at first, until I got stuck at a certain point. Then I tried switching to CubeCL, but I still couldn't achieve what I wanted. So I wrote my own DL library from scratch. While rewriting the math components and customized kernels, I had the idea to make a general library that can handle multiple dtypes and backends.

The real challenge is dealing with quantization and multiple precisions for current LLM use cases. As for the DL library, I'll publish it soon, after a full refactoring to use numr.

1 Like

Thanks for sharing the project announcement. Here are some thoughts.

The way I see it, the main reason why NumPy and SciPy came about as monolithic packages is that when these libraries were created (roughly 1995-2005) packaging and use of Python libraries that wrap external native code was complicated and inconvenient. Huge monolithic numeric libraries that had enough critical mass to be available everywhere worked around this problem.

The situation with cargo and Rust is quite the opposite: We benefit from a great packaging system. I would argue that a crate like numr should focus on the absolutely necessary core, and all the rest should be provided by additional crates. In this way, individual components can be composed and swapped out easily.

Even with multiple crates you could provide a coherent, yet more modular user experience. Looking at numr alone, it has grown from 0 to 192 k lines of code in three weeks. Will you (or others) be able to maintain it and keep it coherent once it has grown some more? And what if parts of it have to evolve in backwards-incompatible ways? All of numr shares a single version number.

Some thoughts on the design:

You seem to replicate the design of NumPy with

  • reference counting (Arc here) for data storage,
  • dynamic number of dimensions, shape, strides,
  • a dynamic data type.

Overall, this design is very dynamic: almost nothing is known about any given Tensor at compile time. Why is this a good fit for a language like Rust, whose strengths lie in an expressive static type system and great support for manual memory management? Why should users of numr bother with the rigidities of Rust, if all they get are very dynamic tensors with reference-counted storage, similar to what can be had with Python + NumPy? Julia actually seems better here, since it employs a real garbage collector.

I would argue that a “rusty” approach to numerical arrays is exemplified by the mdarray crate. True, it has a different focus (more low-level), but IMHO it fits the Rust language perfectly. The core mdarray crate is minimalist. It utilizes Rust’s type system to great effect: for example it is possible to express tensors where the number of dimensions is either known at compile time or dynamic. Similarly, if the number of dimensions is known at compile time, each entry of the shape can be either a compile-time constant or dynamic. This allows efficient numeric algorithms to be implemented directly by users of the library: the compiler can know that the last two dimensions are 4 and 4 (for example) and optimize accordingly. Usage with GPUs is also a possibility for the future.
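
The static-vs-dynamic shape distinction described above can be illustrated with plain const generics. This is a toy sketch of the idea, NOT mdarray's actual API:

```rust
// A matrix whose shape is fully known at compile time: the compiler can
// unroll and vectorize loops over R and C.
struct StaticMat<const R: usize, const C: usize> {
    data: [[f64; C]; R],
}

// A matrix whose shape is known only at runtime.
struct DynMat {
    rows: usize,
    cols: usize,
    data: Vec<f64>,
}

impl<const R: usize, const C: usize> StaticMat<R, C> {
    fn trace(&self) -> f64 {
        // R and C are compile-time constants here.
        let n = if R < C { R } else { C };
        (0..n).map(|i| self.data[i][i]).sum()
    }
}

impl DynMat {
    fn trace(&self) -> f64 {
        // Bounds are runtime values; less room for the optimizer.
        let n = self.rows.min(self.cols);
        (0..n).map(|i| self.data[i * self.cols + i]).sum()
    }
}

fn main() {
    let s = StaticMat::<2, 2> { data: [[1.0, 0.0], [0.0, 2.0]] };
    let d = DynMat { rows: 2, cols: 2, data: vec![1.0, 0.0, 0.0, 2.0] };
    assert_eq!(s.trace(), 3.0);
    assert_eq!(d.trace(), 3.0);
}
```

Libraries like mdarray go further and let each dimension independently be constant or dynamic, so you only pay the dynamism you actually need.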

There might be a lot of value in your approach as an alternative implementation of NumPy/SciPy in Rust (usable from both Python and Rust). But for that Python bindings and compatibility to existing Python code would be crucial.

Just my two cents...

2 Likes

Thanks for the detailed critique! You’ve touched on the exact debate I wrestled with: Systems Language Rigidity vs. Application Language Flexibility.

To give context: numr was extracted from a Deep Learning framework I built from scratch (after hitting walls with Candle and CubeCL regarding mixed-precision/quantization).

Here is why numr deliberately chooses the dynamic path:

1. Rigidity Belongs at the Application Level

I believe the 'Static vs. Dynamic' trade-off shouldn't be forced by the foundational library.

  • The Library (numr) is maintained as a dynamic, flexible runtime. It acts like a JIT compiler for math operations.
  • The Application (e.g., my ML Trainer) is where the rigidity happens. Since the application knows the specific inputs, model architecture, and file formats, it can enforce strict constraints and optimizations there.

If I lock down shapes at the low level (like mdarray), I lose the ability to write flexible tooling. By keeping numr dynamic, I allow the Application Developer to decide where to optimize for security or speed, and where to allow flexibility for experimentation.

2. Modern Science & AI Have Converged

You mentioned mdarray is 'Rusty' and great for low-level efficient algorithms. I agree—it's perfect for Systems programming (e.g., a game physics engine).

But modern scientific computing (Differentiable Physics, ODE parameter fitting) is now an Application domain similar to Deep Learning.

  • Scientists need to differentiate through simulations (Autograd).
  • Input sizes are often determined at runtime (sensor streams, variable batches).
  • numr supports this workflow natively: you can write a physics simulation in Rust and immediately get gradients, running on WebGPU or CUDA without changing code.

3. The 'Monolith' Solves Dependency Hell

Regarding the 192k lines: The fragmentation in the current Rust ecosystem is painful. Often, the 'Sparse Matrix' crate doesn't talk to the 'GPU' crate, which doesn't support the 'Autograd' crate.

numr (and the solvr layer on top) ensures that if you can represent a Tensor, you can mathematically optimize it on a GPU. This 'Batteries-Included' approach is what allows us to solve the 'Two-Language Problem' where users prototype in Python and rewrite in C++. With numr, the prototype is the production code.

There is definitely space for both. mdarray is the correct choice for strictly typed, CPU-bound kernels. numr is for when you need to solve the problem—whether that's training an LLM or fitting a differential equation—on a GPU, with gradients, without fighting the compiler at every step.

1 Like

Awfully similar to this yet-another-LLM-generated scientific library:

1 Like

I know about that, but we take different approaches. I started from my own use case, then extracted it into a shared crate. As you can see, the current state is more ML-heavy rather than aiming for full parity with NumPy.

Plus, I made sure a multi-backend runtime was possible, with multiple-dtype support first. SciRs seems like a library that just wants to translate/transfer everything to Rust.

Another difference: I build numr as a foundational building block, so that any higher-level library or application can extend numr, create new ops, or swap in its own optimized kernels.

Here is another one:

Interesting, although that does not look LLM-generated. Here is another one (no LLM involved, as far as I can tell), which I have actually used and is pretty good:

My impression is that its rocket-like development pace would have been impossible without heavy use of LLMs. (This is not necessarily a bad thing, especially not for an experiment.)

It just seems to be a general pattern that projects present themselves as the new standard numeric computing stack for Rust, while lacking in terms of contributing something new, being idiomatic (given the language choice), and allowing interoperability/modularity.


This discussion here wouldn’t be complete :wink: without me mentioning our own project:

The idea here is to build a “rusty” linear algebra layer on top of the IMHO very “rusty” mdarray crate.

I haven’t yet announced it here properly, because of ongoing API changes both in the underlying crate and in our linear algebra layer.

By the way, here is a discussion about improving interoperability between the different multi-dimensional array libraries:

2 Likes

The GPU support is really cool, especially without all the giant C++ library backends. I'm hoping to do something similar with my crate one day. It currently has 0 dependencies, and I'd like to keep it that way for as long as possible (for good reasons, such as the dependency hell you mentioned, but in large part to force myself to learn things instead of just borrowing from others), which seems challenging for GPUs. GitHub - mrbuche/conspire.rs: The Rust interface to conspire. I guess it also might be necessary to specify which device is being used, like in your example; I hadn't thought about that before.

I like where this discussion is going, comparing different approaches and discussing how to have rusty scientific foundational crates and actually make them interoperable.

The main problem with a "let's reimplement everything from scratch" approach is that we end up with several nearly identical projects that each have only one maintainer, little collaboration, and no clear path to a unified scientific ecosystem for Rust. If every crate wants to be "the Rust default for xxx", the actual outcome for the community is the exact opposite: the situation evolves from "n crates to unify them all" to "n+1 crates to unify them all".

Just my 2 cents..

2 Likes

I agree with the pattern's insight. However, like I said, I couldn't make it work with the current crates, so I needed to implement it for my own specific needs.

Then I extracted it, in case someone wants to benefit from it. I will continue to maintain it, as it is key to my own proprietary projects, so I don't think I'll abandon it. Abandoning it means abandoning my cash flow. That's terrible.

About aiming to be number one: I am not participating. In fact, if I could find better crates that handle everything I need, I would gladly use them; less hassle for me. But for now, nothing fits.

I don't mind people not using my crates. It's fine. I just want to share; who knows, someone might benefit from it. And correct, the more people use it, the faster I can catch bugs or enhance it. But right now, it works for me. That's all I need.

If anyone wants to jump onboard to collaborate, maintain it, or request features, I'll gladly welcome it.


Even if we disagree, the n+1 pattern is becoming the new standard, not just in Rust but in programming in general. Rather than waiting for a solution, it becomes: no solution? Just create it. That's the cold hard truth. Nevertheless, the gaps will close, just not in the ways we prefer.

4 Likes

numr 0.4.0 is out!

The mission of 0.4.0 is full coverage: it fills in every missing backend and dtype.

  • Full backend and dtype parity across CPU, CUDA, and WebGPU (260 tests, all green)
  • Benchmark suite with regression gates so nothing silently gets slower, plus comparative benchmarks (using fluxbench)
  • CI now enforces all three backends on every push
  • Added end-to-end examples and an architecture guide to make it easier to get started or contribute

If you've been waiting to try numr, this is a good starting point. Happy to answer questions.

0.4.0 proves numr works for a science/math library like solvr.
0.5.0 should not introduce many breaking changes; it will focus on further code improvements and on adding ops that are useful for ML crates.

< 0.5.0 - Good for advanced math/science crates
0.5.0+ - Better coverage for AI/ML libraries/crates

Full details are in issue #5.