Rust for "data first" problems?

For many problems, "dumb algorithm + huge dataset" beats "brilliant algorithm + small dataset". (1)

For other problems (writing an OS, writing unix bin utils, etc.) data is irrelevant, to the point that there is often zero data. (2)

Informally, let us call problems of domain (1) "data first" problems and problems of domain (2) "code first" problems.

Many of our tools are tailored for solving "code first" problems -- version control systems, editors, integrated development environments, debuggers.

There are a few tools that can be used for solving "data first" problems -- ipython notebook?, julia repl?, matlab?, spark?, excel?, sql?. I don't know, I'm not sure.

In a "data first" problem, it is almost as if we are a detective "interrogating" the data, so we want fast iteration. It is not clear if Rust is the right language for this since a "REPL based langauge" is going to start working faster than a compiled language, but for sufficiently large datasets, processing the data dwarfs the startup time.

==========

Question: For those here working on "data first" problems, what has your experience with Rust been like? Are you currently using it, or have you switched to a more REPL-driven language? (Or maybe the choice of language does not even matter, as the most important thing is the choice of database?)

Aside:
(1) I know about Evcxr.
(2) I also understand that Rust has an FFI for interfacing with other languages.

Nature wrote an article about it:

https://www.nature.com/articles/d41586-020-03382-2

1 Like

Oddly enough, people who deal with a lot of data also spend a fortune on developing code in C++, or on inventing their own compiled languages like Go to deal with it. A REPL does not cut it, and neither do the performance or correctness guarantees of a language like Python.

The Julia language seems to be a bit of an outsider here: it handles like Python and it has a REPL for quick little experiments. But it is also much more fussy about types and it compiles on the fly to very fast native code.

But look at what is going on: all these interpreted languages are using code compiled from C, C++ and whatever else underneath -- things like GMP for big-integer maths and all kinds of other mathematical libraries. When you build Julia you need a Fortran compiler!

2 Likes

@Hyeonu: Thanks for sharing! The article mentions a Jupyter REPL (I assume evcxr-based?), nalgebra, geo-rust, and rust-bio.

I'm interested a bit more in the lower level details:

What is the workflow like? What backend/db is being used to store the dataset? How does data collection / data cleaning work? etc.

I know that a lot of people in finance use APL/J/K/Q.

What are examples of these "big data" projects in Go? Are any of these "inventing their own compiled languages" open source / public?

The Go language was created by Google. I have no inside information on what Google does with it. Presumably they support that development with the idea of tackling their own problems, which I can only imagine involve lots of data. That is what they do.

I can't think of another example. Perhaps that is the only one.

Anyway, your spectrum of "dumb algorithm + huge dataset" to "brilliant algorithm + small dataset" is too simplistic. It's not one-dimensional. What about:

small data - small program
small data - big program
big data - small program
big data - big program

That's before we think of other dimensions like correctness.

I'm defining a set

S = { P | P is a problem where "dumb algorithm + huge dataset" beats "clever algorithm + tiny dataset" }

The question is about how Rust performs on problems in set S and does not consider other sets.

The key here is being able to efficiently obtain the interesting subset of data for your problem. That boils down to choosing an appropriate indexing strategy, which is largely independent of whether the programming language is compiled or interpreted.

As far as I can tell, most general-purpose programming languages provide relatively primitive tools for this, when compared to tools like RDBMSs (which have their own problems). Rust’s type system shows some promise in being able to support more advanced data-manipulation tools, but they still need to be developed. (NB: Prototyping something like this is the topic of my MS research)
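To make the "indexing strategy" point concrete, here is a toy sketch (the record type and field names are invented for illustration) of the kind of primitive tooling a general-purpose language gives you out of the box: a secondary index built by hand over a Vec, so that "obtain the interesting subset" is a lookup rather than a full scan, independent of compiled vs. interpreted.

use std::collections::HashMap;

#[derive(Debug)]
struct Observation {
    station: String,
    year: u16,
    value: f64,
}

// Secondary index: station name -> row positions in the backing Vec.
fn index_by_station(data: &[Observation]) -> HashMap<&str, Vec<usize>> {
    let mut idx: HashMap<&str, Vec<usize>> = HashMap::new();
    for (i, obs) in data.iter().enumerate() {
        idx.entry(obs.station.as_str()).or_default().push(i);
    }
    idx
}

// "Give me the interesting subset" becomes an index lookup instead of a scan.
fn subset<'a>(
    data: &'a [Observation],
    idx: &HashMap<&str, Vec<usize>>,
    station: &str,
) -> Vec<&'a Observation> {
    idx.get(station)
        .map(|rows| rows.iter().map(|&i| &data[i]).collect())
        .unwrap_or_default()
}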

This is really insightful. I have been thinking that what is needed here is a database (not necessarily relational) with a language (not necessarily SQL) attached to it -- rather than the current model of "here is my Rust program; here are some FFIs for database XYZ (which runs in a separate process)".

I don't know how to "push" this idea further; I also don't know of any good examples of systems like this (besides mysql/postgres/sqlite + sql).
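The nearest existing thing I can think of is an embedded database that runs inside the Rust process rather than behind a socket. A rough sketch with the rusqlite crate (SQLite in-process, so no separate server; the exact API varies a bit between versions, so treat this as approximate):

use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    // The "database" lives inside this process; no separate server to talk to.
    let conn = Connection::open_in_memory()?;
    conn.execute("CREATE TABLE obs (station TEXT, value REAL)", params![])?;
    conn.execute(
        "INSERT INTO obs (station, value) VALUES (?1, ?2)",
        params!["upper_creek", 3.7],
    )?;

    // SQL as the query language, Rust as the host language.
    let mut stmt = conn.prepare("SELECT station, value FROM obs WHERE value > ?1")?;
    let rows = stmt.query_map(params![1.0], |row| {
        Ok((row.get::<_, String>(0)?, row.get::<_, f64>(1)?))
    })?;
    for row in rows {
        println!("{:?}", row?);
    }
    Ok(())
}

It is still "SQL bolted onto Rust" rather than one integrated language, but at least the data and the program share a process.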

I’ve been working along these lines for a little while now.

OK. Fair enough.

Can you refine the definition? How many bytes constitute a "huge" dataset? What does that dataset look like? I'm not sure how to quantify the cleverness or dumbness of an algorithm.

Anyway, I was thinking that the reason why languages like Python have been so popular among scientists and others doing analytics is not really down to their suitability for the job. Rather, it's down to the fact that those users want a simple language with which to get results quickly. They are not computer scientists; they don't want to invest a lot of time in learning to program and fighting with all kinds of build tools and such. Their skills and interests lie elsewhere.

Big data could be relative to the algorithm, not to other data.

Basically, for data exploration a dynamic language is a lot simpler for just getting going, but you might have to rewrite your first drafts for performance; many ultimately use C++, if only for the libraries. Today Rust is also a great option for doing this, though of course there is some critical infrastructure that can't simply be RIIR'ed.

If your question is: "Can I use Rust like Python with Pandas/SciPy/NumPy?", then the answer is no, it's not as convenient, although certianly doable to some extent.

If your question is: "Should I write this de-facto standard bioinformatics tool used by thousands of researchers in Rust that I would otherwise write in C++ and make horrible memory management errors because I have no idea how computers work?", then the answer is yes, go ahead.

6 Likes

Background: I am 3 years down a PhD in quantitative finance and have written my own version of a number of high-level numerical routines because existing Python implementations didn't fit my problem description. Things like a modified L-BFGS-B or LASSO-LARS. Dataset sizes are typically a couple of gigabytes, with millions of observations and between 2 and 30,000 variables.

I don't care about optimality of code directly; my primary concern is fast iterations to do lots of experiments. I have learned by trial and error that offloading heavy computations to Rust is often not worth it. Writing performant code using NumPy is relatively easy, in which case a C++ backend is doing the heavy lifting. Writing dedicated Rust code will realistically take at least a day to get working for the simple stuff, and the current Python-NumPy-Rust interface seems to incur a significant hidden type-conversion performance hit.
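For context, this is roughly what that Python-NumPy-Rust boundary looks like with the pyo3 and numpy (rust-numpy) crates. The module and function names are made up, and the exact signatures shift between crate versions, so treat it as a sketch rather than a recipe:

use numpy::PyReadonlyArray1;
use pyo3::prelude::*;

// Callable from Python as fastdot.dot(a, b) on two 1-D float64 NumPy arrays.
#[pyfunction]
fn dot(a: PyReadonlyArray1<f64>, b: PyReadonlyArray1<f64>) -> f64 {
    // Borrows the NumPy buffers as ndarray views; no copy into a Vec here,
    // though crossing the boundary still has per-call overhead.
    a.as_array().dot(&b.as_array())
}

#[pymodule]
fn fastdot(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(dot, m)?)?;
    Ok(())
}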

TL;DR: In my experiment-first setup Rust has proven to be too slow, mainly because a REPL with a good backend leaves little to be desired and allows much faster design/experiment iterations. If I were asked to deploy something to a not-too-dynamic production environment I would definitely port to Rust for safety and concurrency, though not necessarily for speed.

3 Likes

Great posts... nicely framed question.

I process gigabytes of data. Not tera... yet :). I've used SQL (all of them :)), Excel, Python (NumPy and friends), notebooks, etc. The analytics can be very heavy.

I agree Rust is not the means to experiment/explore. I hope that will change.

Where I have had success is in what I would coin "byte-level" work: small, domain (1) program types, as framed above. Rust can't be beat compared to Python here. I'm talking 10-50 fold differences.

A quick addendum:

I'm a big fan of compile-time type checking, which is by definition most valuable when dealing with a complex process or analysis. I consider the type-checker a "thought-partner" of sorts, helping to keep my thinking straight and consistent AND helping/promoting pattern finding (reusing types). Again, inherently valuable, and more so with added complexity.

This capacity is under-utilized in domain (1). That is inherently not a good thing, because I'm still "paying the price" one way or another (I'm not talking in terms of computation, but design, engineering and other "meta").

The hurdle for me

A brief case in point: I built a SIMD CSV parser and report generator using a recently published "super fast" algorithm (their words, but I won't refute it). The app/tool was a thin exe that wrapped/leveraged two libs/APIs that I built using the std lib: one for running the SIMD tape-generator and the other for running a multi-threaded analysis.

I had it up-and-running within 2-3 weeks with thoughtful and poignant/useful input from contributors in this forum.

That said, the hurdle came quickly thereafter. I chalked the two or so years of "on-again, off-again" experimentation with Rust up as inherently useful for improving as a software engineer (and "just for fun"), so I was willing to learn/do more. However, my past experience with refactoring was not going to cut it in Rust. Despite having unit tests and docs built along the way, I was quickly "losing sight"; "too many moving parts" is how I described it to myself. So the project was put aside, for now.

A counter point: I'm versed in and have experience with another strongly typed language. Getting to production-quality code there required two revisions. The first revision was to clean up/streamline the logic; a good part of the answer involved using some amount of type-level computation, and that's when the app started to "pop" (a good thing). The second revision was to switch GraphQL libs. The enhanced "captured clarity" following the first revision also enabled me to augment the functionality of the app during the second revision without introducing new complexity (i.e., no new forks, nor abstractions required).

The punch-line

I have not yet acquired the familiarity/skill required to be sufficiently "good at" overcoming the above hurdle in Rust. Rust has the added complexity of having to deal with three versions of a variable (owned, borrowed and exclusive/mut borrow). The borrow-checker is a unique feature indeed -- it makes me better (I keep saying to myself :)), but it takes time to build those muscles/intuition. Furthermore, Rust being neither FP nor OO, and yet both :), translated into more to think about. In my case, I failed to leverage that power to reduce complexity. If "software is the wrong way to express an abstraction", Rust may require that much more imagination.

Furthermore, while I was able to find productivity gains elsewhere by leveraging the type system "that much more", I haven't yet managed that in Rust.

The punch-line within the punch-line

Generics, and more generally abstract data types, are "tough enough" to reason about. In Rust, there's that and more to learn in order to productively navigate generics over lifetimes, constants and T, where T is one of owned, borrowed or exclusive/mut-borrowed.
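A tiny made-up example of what I mean -- one generic signature that has to be read as "any T that can be viewed as a borrowed str, whether the caller owns the data or not":

use std::collections::HashMap;

// One signature that accepts Vec<String>, Vec<&str>, slices of Cow<str>, etc. --
// i.e. it is generic over "owned vs. borrowed" string data.
fn word_counts<S: AsRef<str>>(lines: &[S]) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for line in lines {
        for word in line.as_ref().split_whitespace() {
            *counts.entry(word).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let owned: Vec<String> = vec!["a b a".to_string()];
    let borrowed: Vec<&str> = vec!["a b a"];
    assert_eq!(word_counts(&owned)["a"], 2);
    assert_eq!(word_counts(&borrowed)["b"], 1);
}

Getting comfortable writing (and reading) that kind of signature, and knowing when to reach for it, is exactly the skill-building I mean below.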

The path forward

The path to the level of familiarity with the tool ("generics") that increases productivity in a strongly typed language may be as follows:

  1. Design patterns: reduce complexity via "here is the answer 9 times out of 10" solutions to "x". Idioms and conventions are our friends! A complete inventory of these using generics is likely going to be useful.

  2. Type quantification: The concepts are well articulated in RFCs and elsewhere, but what seems to be missing is the synthesis: a linear, streamlined mental model for quantifying the abstract data type in Rust. That means:

    • Inventory of syntax with the same semantic
    • The types generated (or not) when implementing a trait (e.g., trait objects create a new type)
    • The types generated with lifetimes (variance yes, but context is likely the more challenging piece)
    • Likely includes a "top-ten" need to know for the async and multi-threaded contexts

So, two prongs from different angles: (1) "here is the typical answer", a pattern to match and apply elsewhere, and (2) "this is how type-matching works when using generics", which starts with knowing the range of types "admitted" by a given generic expression.

Perhaps another way to capture the essence here: a resource that starts with what abstract data types enable, married to the physics/truth of working with memory (a truth enforced by the type and borrow-checker).

Open invitation

My intended "brief addendum" turned into more ending with a plan-of-action. I really like building smart/useful things. Clarity of thought and expression is a critical step. Speed can matter. Rust helps me accordingly. If anyone is interested in further articulating what their path to taking their Rust skills to the next level might look like, please share. Perhaps by formalizing/articulating the right set of questions, we will be that much closer to productive answers.

=> at minimum, create options for using the domain (1) apps as a starting point for domain (2)?

1 Like

My thesis involves a "smart algorithm, small data" class of problem. When I ported my project from R to Rust, the performance gain was around 500-700%. Given that I study small mountain streams and have no formal computer science education, I felt a swell of pride that I was able to accomplish this in Rust.

Now I am applying for teaching positions at universities, including one position as an instructor for data analysis in the environmental science program. Their curriculum is entirely in R, based on Wickham's tidyverse, with which I am quite familiar. Personally, after programming in Rust I do not particularly want to go back to R, but professionally I could not recommend changing the class to focus on Rust as a tool. As @sebasv notes, Python-Numpy is a more apt contender because of advantages in usability and user-friendliness.

Many of the basic workflow tasks (especially during data exploration) that are streamlined in R and NumPy require extra steps in Rust. Users on this forum have pointed out that Rust makes fallible operations explicit, so an "automatic" operation like loading a CSV into a dataframe in R might require extra time on my end setting up some structs for serde and then checking to make sure the read didn't fail. There are many examples of Rust requiring "extra steps", if not a whole script, to replace a single line of code.
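For illustration, here is roughly what that looks like with the csv and serde crates (the column names are invented; consider it a sketch rather than a recipe), versus a one-line read.csv() in R:

use serde::Deserialize;

// R: df <- read.csv("sites.csv")
// Rust: spell out the schema and handle each possible failure.
#[derive(Debug, Deserialize)]
struct Site {
    site_id: String,
    elevation_m: f64,
    discharge_cms: f64,
}

fn load_sites(path: &str) -> Result<Vec<Site>, Box<dyn std::error::Error>> {
    let mut rdr = csv::Reader::from_path(path)?;
    let mut sites = Vec::new();
    for record in rdr.deserialize() {
        let site: Site = record?; // each row can fail to parse
        sites.push(site);
    }
    Ok(sites)
}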

The ecosystem of user-defined packages for R and Numpy exacerbates this trend, abstracting away implementation details for the convenience of the user. Learning Rust has taught me to value knowing when and where my program is doing something fallible, especially during debugging. But when someone else has done the debugging, and I am using their feature-rich package, the idea of reinventing the wheel in Rust has limited appeal.

Using R or NumPy is like driving around in a sports car. You just turn the wheel, press the pedals, and burn rubber. Rust (and other systems languages) are like getting a spaceship: you can go places and do things that you never dreamt of in a car. They are harder to pilot, but the possibilities seem unlimited! With the Rust ecosystem still in development, it feels like parts of your spaceship come in boxes labeled "some assembly required".

6 Likes

Scala? Smalltalk? OCaml? Curious what this language was.

One thing that I experienced while playing with https://pandas.pydata.org/ is that in Rust it is especially difficult to deal with noisy data that is not yet properly typed, while you are trying to clean up the spurious entries. For example, suppose we have a CSV with the header

student_id, first_name, last_name, hw1, hw2, hw3, midterm, hw4, hw5, final

so the eventual 'correct' data is something of the form

pub struct Student {
    student_id: usize,
    first_name: String,
    last_name: String,
    hw1: f32,
    // ...
}

but suppose the CSV file contains errors: some of the hw fields are empty, others are strings, etc.

In Pandas/Python/R, we can load the entire CSV file (even if it is not properly typed), then clean up / drop the bad entries, and end up with clean data.

In Rust, we can certainly have something like a Vec<Box<dyn Any>> for each row, but due to the type system, I've never found an easy way to deal with "improperly typed / noisy data".
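One possible workaround (a sketch assuming the csv and serde crates, with most columns elided) is a two-stage load: first deserialize every field as an Option<String> -- the moral equivalent of Pandas' untyped load -- then parse and drop the bad rows:

use serde::Deserialize;

// Stage 1: everything is an optional string, so any row can be loaded.
#[derive(Debug, Deserialize)]
struct RawStudent {
    student_id: Option<String>,
    first_name: Option<String>,
    last_name: Option<String>,
    hw1: Option<String>,
    // ... remaining columns elided
}

// Stage 2: the 'correct' shape from above.
#[derive(Debug)]
struct Student {
    student_id: usize,
    first_name: String,
    last_name: String,
    hw1: f32,
}

fn clean(raw: RawStudent) -> Option<Student> {
    Some(Student {
        student_id: raw.student_id?.trim().parse().ok()?,
        first_name: raw.first_name?,
        last_name: raw.last_name?,
        hw1: raw.hw1?.trim().parse().ok()?,
    })
}

fn load(path: &str) -> Result<Vec<Student>, Box<dyn std::error::Error>> {
    let mut rdr = csv::Reader::from_path(path)?;
    let students = rdr
        .deserialize::<RawStudent>()
        .filter_map(|row| row.ok()) // drop rows that don't even parse as strings
        .filter_map(clean)          // drop rows with missing/garbled fields
        .collect();
    Ok(students)
}

It works, but it is clearly more ceremony than df.dropna() and friends.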

1 Like

Haskell.

I feel your pain having to deal with R. It's truly distasteful. Great documentation, but that's all I can say for it. If memory serves, using closures "up the wazoo" helped dull the pain. At minimum, maybe you can move folks to NumPy and Jupyter -- much better. The published and validated algorithms in R can often be found in the Python universe.