Rust in data science issue

Hi friends, I'm now working in both software development and data engineering. Recently, I've been contemplating the use of Rust for data science tasks, similar to what we typically do with Python pandas/numpy.

I've come across several tutorials and blog posts that discuss various Rust modules that could be helpful. However, I still have some concerns about whether using Rust for data science is considered good practice, and whether there is a growing trend towards doing data work in Rust.

Could you please share your thoughts on these concerns?

Don't worry about that. Does it affect your results whether there's a "growing trend", after all?

If Rust fits your use case, then use it. For instance, if you want high-performance processing of well-formed data, or if you want to deploy a machine learning model in production without dependencies, Rust might be a good fit.

If, on the other hand, you think Rust won't be good for achieving your goal (e.g. you value convenience over reliability), then don't use it. Simple as that.

3 Likes

I can't speak for others. But if your task is data mining, visualization, analysis, etc., you're choosing an ecosystem and its tooling.

Rust's data science ecosystem is immature for now.
Python or R is more appropriate, but it all depends on your use cases.

1 Like

Thanks! Makes sense to me.

Yup. I think the only reason why I'd choose Rust is its performance...

Being a newbie to Rust at this point, I would imagine that Rust might be somewhat lacking, especially in its ecosystem of libraries.

If performance is your main motivation for looking beyond Python (at least until Mojo is released) then I would suggest taking a look at the underrated F# for Data Science. As F# has been around longer, I imagine it is more mature for this, although I've never used it myself for such purposes.

Also, F# overlaps with Rust to an extent: immutability by default, pattern matching, implicit return, no null, and some similarity in syntax.
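For anyone who hasn't seen those features on the Rust side, here is a minimal sketch using only the standard library. The names `describe` and `mean` are made up for illustration, and this is just one way to show the ideas, not a recommended data-handling pattern.

```rust
/// Pattern matching plus implicit return: the `match` is the last
/// expression in the function, so its value is what gets returned.
/// There is no null; a possibly absent value is an `Option`.
fn describe(reading: Option<f64>) -> &'static str {
    match reading {
        None => "missing",
        Some(v) if v < 0.0 => "negative",
        Some(_) => "non-negative",
    }
}

/// Mean of the values that are present; `None` if every value is missing.
fn mean(values: &[Option<f64>]) -> Option<f64> {
    // `flatten` skips the `None` entries and unwraps the `Some` ones.
    let present: Vec<f64> = values.iter().flatten().copied().collect();
    if present.is_empty() {
        None
    } else {
        Some(present.iter().sum::<f64>() / present.len() as f64)
    }
}

fn main() {
    let label = "temperature"; // immutable by default; reassigning would not compile
    let mut count = 0; // mutation has to be opted into with `mut`
    count += 1;

    let readings = [Some(1.0), None, Some(3.0)];
    println!("{label} (run {count}): mean = {:?}", mean(&readings));
    println!("first reading is {}", describe(readings[0]));
}
```

The compiler forces the `None` case to be handled in `describe`, which is the practical meaning of "non-null" here.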

I'm in almost the same shoes as you: I have been trying Rust for data massaging and visualisation for quite a while. Before this, I mostly used Julia and R for data manipulation, curve-fitting optimisation and visualisation (plotting). Nothing yet on machine learning. Here is what I have encountered so far.

  • The Rust language itself takes quite an effort to learn.
  • In terms of ecosystem, the common crates I have used are ndarray or nalgebra, polars, argmin and plotters. These are not mature yet (< ver 1.0), but usable for simple cases, which is most cases. However, what I find most difficult is finding working examples. So I have to read the documentation again and again, try the provided code examples, and test my own code with many println!() calls.
  • There are many crates that provide more or less similar functions, e.g. ndarray, nalgebra, polars. They are different and yet similar. None is dominant at the moment, with some being supported by some crates and some by others -> fragmentation. Choosing between them can be quite challenging.

I think I will continue to push myself into Rust. Along these hard paths, I re-learned some of the difficult algebra and calculus, e.g. Jacobians and Hessians.

3 Likes

I don't think Rust is going to fully take off as the data language, since using graphs in Rust is deliberately designed to be a miserable experience. But working with arrays is fine for the most part.

1 Like

No, the teams didn't sit down and brainstorm ways to make using graphs or other pointer-heavy data structures miserable. It's fallout from the approach that was taken to have synchronization-free mutation with memory safety and without data races and without garbage collection or some other runtime (i.e. removing aliasing for non-shared mutability types like &mut).
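To make that concrete, a common workaround is to keep the nodes in a `Vec` and let edges refer to them by index, so no node ever holds a reference to another node and the aliasing rules never come into play. Below is a rough sketch of that idea with the standard library only; the `Graph` type and its methods are hypothetical and are not the API of any particular crate.

```rust
use std::collections::VecDeque;

/// Hypothetical minimal undirected graph: nodes live in a Vec and
/// edges are stored as indices into it rather than as references.
struct Graph {
    labels: Vec<String>,
    adjacency: Vec<Vec<usize>>,
}

impl Graph {
    fn new() -> Self {
        Graph { labels: Vec::new(), adjacency: Vec::new() }
    }

    /// Adds a node and returns its index, which acts as its handle.
    fn add_node(&mut self, label: &str) -> usize {
        self.labels.push(label.to_string());
        self.adjacency.push(Vec::new());
        self.labels.len() - 1
    }

    /// Adds an undirected edge; cycles are fine because only indices are stored.
    fn add_edge(&mut self, a: usize, b: usize) {
        self.adjacency[a].push(b);
        self.adjacency[b].push(a);
    }

    /// Breadth-first traversal from `start`, returning nodes in visit order.
    fn reachable(&self, start: usize) -> Vec<usize> {
        let mut seen = vec![false; self.labels.len()];
        let mut queue = VecDeque::from([start]);
        let mut order = Vec::new();
        seen[start] = true;
        while let Some(node) = queue.pop_front() {
            order.push(node);
            for &next in &self.adjacency[node] {
                if !seen[next] {
                    seen[next] = true;
                    queue.push_back(next);
                }
            }
        }
        order
    }
}

fn main() {
    let mut g = Graph::new();
    let a = g.add_node("a");
    let b = g.add_node("b");
    let c = g.add_node("c");
    let _isolated = g.add_node("d");
    g.add_edge(a, b);
    g.add_edge(b, c);
    g.add_edge(c, a); // closes a cycle without any shared mutable references

    let names: Vec<&str> = g.reachable(a).iter().map(|&i| g.labels[i].as_str()).collect();
    println!("reachable from a: {names:?}");
}
```

The trade-off is that an index is only as valid as the `Vec` it points into, so removing nodes needs extra care that this sketch does not attempt.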

4 Likes

That's a weird take. "Mature" is not a synonym for "1.0". As of today, ndarray has 7 million downloads, 88 released versions, and it has been around for 7 years. It is used by 707 crates, of which the top 22 have more than 100'000 downloads. If something was released 7 years ago, has been under continuous development, and the better part of the ecosystem depends on it, then what's not mature about it?

They are totally different. ndarray is for representing arrays/tensors generically. It doesn't provide linear algebra routines. On the other hand, nalgebra is for doing linear algebra; its Matrix type only exists to have something to work on, but it is far less advanced in terms of array manipulation (e.g., it's strictly 1- or 2-dimensional and it only supports column-major order). Polars is a DataFrame implementation that's designed to work with potentially non-numeric and/or heterogeneous data (unlike the other two).
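In case "column-major order" is unfamiliar: it only describes how a 2-D matrix is flattened into one contiguous buffer. Here is a small sketch with the standard library only. The function names are made up, and it deliberately uses neither crate's API.

```rust
/// Offset of element (row, col) when each row is stored contiguously.
fn row_major_offset(row: usize, col: usize, ncols: usize) -> usize {
    row * ncols + col
}

/// Offset of element (row, col) when each column is stored contiguously.
fn col_major_offset(row: usize, col: usize, nrows: usize) -> usize {
    col * nrows + row
}

/// Re-lays out a row-major buffer as column-major (hypothetical helper).
fn to_col_major(data: &[f64], nrows: usize, ncols: usize) -> Vec<f64> {
    assert_eq!(data.len(), nrows * ncols, "buffer does not match the given shape");
    let mut out = vec![0.0; data.len()];
    for row in 0..nrows {
        for col in 0..ncols {
            out[col_major_offset(row, col, nrows)] = data[row_major_offset(row, col, ncols)];
        }
    }
    out
}

fn main() {
    // The 2x3 matrix [[1, 2, 3], [4, 5, 6]] stored row by row.
    let row_major = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let col_major = to_col_major(&row_major, 2, 3);
    println!("row-major:    {row_major:?}");
    println!("column-major: {col_major:?}");
}
```

Both buffers hold the same matrix. The layout mostly matters for which traversal order is cache-friendly and for passing data between libraries that assume different orders.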

This doesn't really cause any more fragmentation than languages people traditionally associate with data analytics contain. I didn't see you complain about Pandas' DataFrame and NumPy's array causing fragmentation, for example. Isn't that a plain double standard?

7 Likes

What's miserable about using petgraph? It seems fine.

1 Like