Rust and Data Science // First impressions from an outsider


#1

To All Rustaceans,

NOTE: After my initial post there has been some focus on the second part of this post. My main interest is in the first part. The second part relates to certain personal preferences that I was happy to share and discuss in good faith. If you feel like replying or commenting, please do so on either or both points, but please give priority to the first part. Thanks. Special thanks to the first responders.

PART I:

Hello there Rustaceans!
It is not the best idea, but I have to start with an apology and a disclaimer:

  • Apology: This is likely to be a long note, so brace yourself and take a deep breath;
  • Disclaimer: At no point in this note do I have any contrary intent or criticism. I am over fifty, and age gives us some more freedom to express our opinions and ideas. Combine that with maturity, and the ideas and opinions are both driven by and aimed at improving ourselves, the environment (in a broad sense) and hopefully others. I have friends that might hate me for my propensity to offer my opinions, but deep down they say they like it, as I present a distinct point of view. It is neither right nor wrong, just different. Take it with a grain or a kilo of salt and make your own decisions. OK, enough. I trust you all get it.

To the point: I am a Data Scientist, and I am exploring Rust, so I would like to ask for some suggestions and opinions on matching Rust and Data Science. Some more context and background in bullet points:

  • I am a Data Scientist, so I am not a developer.
    • For me, a developer is someone that earns a living by working day in, day out in producing code to make computers work to enhance our lives.
    • A Data Scientist is someone that uses data to enhance our lives, developing code to some extent in various degrees of time and quality.
  • Data Science is not an IT discipline despite relying heavily on IT resources.
  • I have used punched cards at university back in the day, and I have learned and used about thirty different programming languages over the years, and I still do. I have learned data structures and algorithms (does “Algorithms + Data Structures = Programs” by Niklaus Wirth or “The Art of Computer Programming” by Donald Knuth ring any bells?), but I am no developer as per my definition above.
  • It is not about coding, but Data Science has been associated with languages such as R, Python and Java, in most cases with the actual mathematical code under the hood written in C or Fortran decades ago.
  • Python is just an interpreted scripting language that has nothing to do with Data Science, but it does offer a large number of libraries and frameworks related to Statistics and Machine Learning, and it feels “lighter” or perhaps less intimidating for the Data Scientist that has to code occasionally. Did I mention that interpreted means interactive but also means slow?
  • Python and R are suitable for the discovery phase because of their simplicity and interactivity, but they are not fast enough for implementing Data Science in a production context. An opportunity for Rust perhaps?

You might be asking whether there is a point, a question, or whether this is getting anywhere any time soon.
So, as an outsider, my central question is: given that Rust labels itself as a systems programming language, could it be used at the forefront (not just doing mathematics behind the scenes) of parts or the whole of the Data Science workflow, and be used by Analysts and Data Scientists as needed?

I am hoping for many answers that are likely to be as long as this note, and predicting that they will fall in three or four categories, but I don’t want to influence more than that.

You can take a break and consider writing back. I hope you do.

PART II:

This section is almost unrelated to the previous one, but I decided to put it all together as it touches on the “easiness”, “lightness” or “looks” of the language on an aesthetic level, which I believe can affect the psychological appeal or repulsion one might feel when deciding whether to learn or use a new language. I never felt attracted to Java as it just looks like C in a JVM. I know this is an utter simplification, but what about first impressions?

As an outsider, with just ten to fifteen hours of Rust presentations on YouTube and halfway through the Rust book, all during non-business hours, I would ask for your understanding, so let me start:

It was not meant as a joke, but I hope I get some smiles: a good number of keywords in Rust are three-letter abbreviations (mut, mod). From the presentations I have seen on the Internet, I can’t believe Rustaceans don’t want to have fun. They prefer to have fn instead. :open_mouth: :drum:

To get to the point: I don’t like abbreviations in programming languages, purely on an aesthetic level and as a matter of personal taste; others do.
I saw somewhere that Rust wants to be run everywhere, so it would help it to be explicit and “clear”. I can take syntactic sugar as long as the original syntax and meaning are also valid. Perhaps run everywhere doesn’t mean coded by everybody.
Here are pro arguments for short versions:

  • If you are a full-time programmer, you can save some minutes in a year.
    • Say you save 10 minutes in a year = 0.000019 of the year (0.0019%); surely one will lose more time compiling or debugging.
    • Granted that developers are lazy (allegedly/in general) but IDEs and Code Editors have means to expand code so that a keyboard combination can produce either fn or function all the same and as fast.
  • It is to make it familiar for C/C++ programmers
    • What is the coherence in having struct like in C but then having fn? Is there fn in C?
    • Developers are intelligent (allegedly/in general). Web developers deal with HTML, CSS, JS/PHP/etc. in the same source code! No need to spoon-feed them a familiar syntax.
    • If the goal is to make C/C++ developers feel more at home, then what about developers that code in Ruby/Python/PHP/Smalltalk/Basic/Fortran/SQL/awk? I can go on for several more lines here.
  • Let’s develop a new language that has all the bells and whistles of modern techniques, but for some perverted reason let’s make it look exactly like all the other languages before (tick the boxes):
    [ ] to respect the giants before us
    [ ] because we are not revolutionary/creative enough
    [ ] this is not important; get over and focus on the job
    [ ] other: _____________
  • The difference between an expression and a statement is a semi-colon at the end of the line. Seriously? This distinction sounds like those human languages that rely on phonetics, where people laugh at you if you make a mistake. Fine as a mother tongue but terrible for second-language speakers. By analogy, this is just as terrible for occasional programmers. As good as the error messages from the compiler can be, this can lead to hours of debugging, with the source code laughing in your face.
  • The idea that coding “low level” and “near the metal” requires a short syntax is a psychological misconception and a historical inheritance from when resources were limited and storage was expensive. The binary code has to be short, and that is the compiler’s problem. Let me put it this way: The machine has to do it fast, developers have to do it right.
  • Use the best of what is known and improve upon it. Sounds great, but if not done carefully it might result in a Frankenstein-like creature. With all the sewing, it might be hard to create a coherent and elegant masterpiece.
  • What about let mut? My suggestions would be let mutate or better yet variable (but not var).
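
For readers who have not met the semicolon rule mentioned in the list above, here is a minimal sketch of what it looks like in practice. The function names are made up for illustration. As far as I can tell, when a stray semicolon changes the type of a block the compiler rejects the program with a type mismatch, so in cases like this one the mistake shows up at compile time rather than during debugging.

```rust
// A block's value is its final expression, written without a trailing semicolon.
fn square(x: i32) -> i32 {
    x * x // no semicolon: this expression is the function's return value
}

// With a trailing semicolon the line becomes a statement and the block
// evaluates to `()`. The variant below is left commented out because it does
// not compile: rustc reports mismatched types (expected `i32`, found `()`).
//
// fn square_broken(x: i32) -> i32 {
//     x * x;
// }

fn main() {
    // `if` is an expression too, so it can sit on the right-hand side of `let`.
    let label = if square(3) > 5 { "big" } else { "small" };
    println!("{} {}", square(3), label);
}
```

The commented-out variant is only there to show where the semicolon would go; it is an illustration, not a claim about every situation in which a semicolon matters.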

I probably could think of more cases and examples, but I can feel the tension in the room building up. My brain is hardwired in a way that I can see a double space in a text without even reading the text, and I cannot help pointing it out. Some other people might understand or relate to that.

Unfortunately, I am not eloquent or articulate enough, so I might have fallen short in my arguments. I hope it was enough, though, to convey the message. I wish that there would be enough critical mass in the community or the core team to consider these and other related points, and if so, we could have a Rust 2.0 with some syntax or grammar changes.

I appreciate very much if you have read up to this point. What are your ideas and opinions?

Best Regards
Angelo Klin


#2

It’s not my decision to make, but I’m pretty sure we’re not going to release a Rust 2.0 just so we can write fun instead of fn, or let mutate instead of let mut.


#3

Hello @BurntSushi,

Thanks for your comment.

What I meant by Rust 2.0 is that by the time we reach that point, which might still be far away, there would be multiple changes and enhancements, some of which may be related to syntax and some not.

Sure, it doesn’t make any sense to have a new version only for the sake of a syntax change. Nevertheless, my vote would be for function.

Best regards


#4

My suggestion would be to spend a lot more time with Rust before pitching syntax changes of the sort you’re doing :slight_smile:. Otherwise, you’re bound to suggest superficial things like in this thread, and they’re essentially bikeshedding existing syntax.

I’ll be honest - based on the subject of your post, I thought you were going to talk about missing language features or libraries, or something along those lines. I suspect if you kick the tires some more, you’ll have that type of feedback, which I think will be a lot more useful to the Rust data science community than minor syntactical differences of opinion.


#5

Rust is very unlikely to ever be “the language” for data scientists. It may, however, be one of a few languages data scientists use, and it may be a language that makes it possible to incorporate ideas and tools from this domain and to create tools for data scientists. This determines the priorities for language designers: convenience for non-developers is not that important, while performance, available libraries, tooling and all the other aspects that matter to professional developers are crucial.

“Let’s develop a new language that has all the bells and whistles of modern techniques, but for some perverted reason let’s make it look exactly like all the other languages before”

You can’t do that. The bells and whistles come from breaking compatibility (on every level, including fundamental concepts). If you could do that, you’d have the bells and whistles added to those old languages, so there would be no need for new ones.

“The machine has to do it fast, developers have to do it right.”

First, there are many aspects of “right”, and various programming languages focus on different ones.
It can mean “right according to the end goal”, meaning easy to iterate over solutions to get there (and therefore “wrong” initially). It can mean “right” according to the standards of software development (it should not crash, and should consume a minimal amount of memory), or “right” according to some domain, meaning a need for libraries implementing domain concepts. Each of those “rights” may collide with the others, so there are trade-offs and you have to pick your battles.

(minor syntax nitpick)

Those are entirely irrelevant. People come to Rust for performance and features, and common reasons to stay or leave are available libraries, quality of the features, learning curve, speed of compilation and some others. Syntax details may affect those, but it is very unlikely that changes here will affect Rust’s popularity in any significant way (unless they stand out in a radically good or bad way, and I think they mostly don’t). A NumPy equivalent in the stdlib would bring in far more data scientists than “let mutate”.
Also, Rust has a fairly dense syntax that packs a lot of concepts into one line. Expand them all, and you quickly run out of screen space, making code unnecessarily harder to read.


#6

Hello @vitalyd,
Thanks for your message.

As I said I might not have been as articulate as I would or should have been.
To address your points: I am not pitching any changes; I am expressing my opinion and my preferences. It is up to the community to totally disregard or consider any ideas, or so I heard in the presentations. Perhaps syntax is superficial, but maybe Chinese, Russian or Spanish people might feel more comfortable if the syntax were in their native languages. Maybe my underlying points were coherence and acceptance. As I said, my brain feels more comfortable when those aspects are present. Again, just personal preference.

But that was the second part. The first part and its question were about whether Rust is or could be a contender to support Data Science processes. As you mentioned, perhaps some libraries could be created to that end. Some might say that it is not a good fit for some reason, and that is fine. Others might point out the opportunities and benefits and the specific parts where it would make sense, which is what I would hope to hear, so I can invest more time learning or creating some libraries myself.

I understand people can be very defensive when they are part of something. Trust me, I know. I have been in situations like that in my youth. These days I focus on accomplishing the goal, which is the very reason I am exploring Rust and asking the people who have more experience what they think. If it can help in achieving a better result, I am happy to embrace it.

Maybe the long text makes people focus on the last part, where I am sharing my view with no specific request, and miss the first part, where any comment would help and welcome a newcomer. If newcomers are not encouraged to speak, perhaps the community is not as open and embracing as advertised. I am certain that is not the case.

In any event, thanks for taking the time for reading and replying to the post.

Please feel free to return to the first part and share any thoughts you may have.

Have a good weekend.


#7

It seems to me Rust is particularly prone to first impression misunderstanding. “let mut” is a classic example. (To be brief, it’s “let (mut name)”, not “(let mut) name”, so no, it shouldn’t be “var name”.) Since this is annoying, it would be nice if it could be fixed, but I am not sure how and I don’t think fixing first impression misunderstanding should be high priority.

Until it is fixed, 1. to new people: your first impression is likely to be a misunderstanding, so take some more time. 2. to old people: disregard and ignore “first impression” posts.
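
To make the “let (mut name)” reading above a bit more concrete, here is a small sketch. The helper names `sum_to` and `bump` are invented for the example. It relies only on the fact that `let` takes a pattern and that `mut` marks an individual binding inside that pattern.

```rust
// Hypothetical helper: sums 1..=limit using a single `let` with a tuple pattern.
fn sum_to(limit: u32) -> u32 {
    // `total` and `next` are mutable bindings; `step` is not. One `let`,
    // three bindings, and `mut` is chosen per name.
    let (mut total, mut next, step) = (0, 1, 1);
    while next <= limit {
        total += next;
        next += step;
    }
    total
}

// The same per-binding `mut` appears in parameters, with no `let` in sight.
fn bump(mut n: u32) -> u32 {
    n += 1;
    n
}

fn main() {
    println!("{} {}", sum_to(4), bump(3));
}
```

Read this way, a single keyword such as `var` would have nothing to attach to in the tuple or parameter cases, which seems to be the point being made.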


#8

Hello @Fiedzia,

Thanks for your message and balanced analysis.

I understand and agree with the points you make. It is never easy, if not impossible, to encompass the whole spectrum of nuances in a complex subject.

Certainly both overall topics could be the source of long conversations, discussions and email trails. :smile:

Sure, speed and robustness are great qualities that would benefit Data Science projects, as they would many others.

I wonder if current members of the community in the sciences, or maintainers of these libraries, could be enticed to produce or port some of them. It does not happen overnight, but it can start anywhere.

Thanks again for your comments.

Best regards,
Angelo Klin


#9

I wasn’t being defensive - there’s nothing to be defensive about here :slight_smile:. Now that I understand you’re seeking others’ opinions on Rust’s suitability in the DS space, that’s cool and totally welcomed. For some reason, perhaps my fault, your original post seemed a bit heavier on the syntax aspects.

And you also have to look at the flip side - a person, barely familiar with Rust (and sounds like no real practical experience) comes in, posts a title alluding to DS, but then sort of focuses on minor syntax annoyances (or rather, differences in opinion).

Yes, that’s very possible. But ok, I think we’ve squared that away now.

It’s categorically not the case, at least not in this forum. I meant my initial reply with no disrespect. It’s just that syntax bikeshedding almost always ends up going off the rails, and those are very rarely useful conversations. So I was simply asking to have a more substantive topic to discuss, but again, that was before you clarified that you’re looking at DS feasibility in Rust.

Ok, with that out of the way, I think Rust mostly has as much of a chance in DS as C++. Rust will never be as “easy” as Python, so I think competing with the various Python libs on ergonomics or learnability is a non-starter. But using Rust as a lower-level building block seems completely viable. Or using it to build the distributed compute/storage systems used by frontend data science tools (in other langs); see this blog entry and the related datafusion-rs project. Or perhaps productionizing DS code that was initially prototyped in Python. But I’m not a data scientist (although I work with a bunch of them), so I don’t know how informed my opinion really is on that.

A few other projects you may be interested in taking a look at:

  1. https://github.com/frankmcsherry/timely-dataflow
  2. https://github.com/frankmcsherry/differential-dataflow
  3. https://github.com/weld-project/weld

You’ll note the first two are both by @frankmcsherry, and I’d actually be interested in his opinions on Rust and DS.


#10

Hello @sanxiyn,

Thanks for your comment and explanation.

As I said, those are just personal aesthetic preferences. I would not dare to venture into the ins and outs of the solution.

But you put it in very simple and concise terms.

I am glad you did not follow your own suggestion to disregard the message and took the time to educate me. It is appreciated.

As I put in another reply, I might have made a mistake with the long, dual-subject post, as people are focusing more on the second part for some reason.

If you have any comments on the first part, please share them.

Best regards


#11

Hello @vitalyd,
Thanks for your reply.

I just posted another reply acknowledging that it was my mistake to combine both topics in one post, as people are hanging on the last and less substantive topic, which, I get it, can be quite disturbing or disruptive. Maybe that is because it is the last thing you read, or because I was able to be more specific with examples.

Thanks for the links on DS and for including the other member. It would be nice to have as much feedback as possible.

I am still hopeful that Rust could be useful in most parts of the process. Starting in Python/R and then going to production using C/Rust looks good in theory, but more often than not companies just want to see the result, so once it is done the work stagnates and the translation to C/Rust never happens, unless the performance is horrible and the cost of reengineering is justifiable.

Have a good weekend.
PS: please have a look at further posts; hopefully the Data Science subject can get some traction.


#12

For data science, perhaps interoperability with Python will be useful?

You could keep most of the codebase in Python, and only replace the most problematic functions with Rust ones.


#13

Hello @kornel,

Thanks for your message and link.

I saw two presentations on YouTube that use two distinct approaches to combine Rust and Python. They are similar to using Cython or pure C libraries.

This would be analogous to having some hard crunching code in Rust, some wrapper to use Python syntax (import) and the actual Python code.

That is the thing I would hope to avoid. I am trying to find the path of least resistance (or complexity in this case).

Appreciated.


#14

(sorry for the lack of links, Discourse will only let me post 2 actual links)

Similar questions about Natural Language Processing and Machine Learning arise; there is even the website “Are We Learning Yet” (http ://www .arewelearningyet .com), as well as the (unfortunately in limbo) set of projects under AutumnAi (github). My answer here will echo what I’ve discussed with colleagues about such projects. And of course, the fields of NLP and ML are more like the tooling one gets after having the base for what is more traditionally known as “data science”, which I will just paint with a broad brush as applied statistics and data engineering (forgive me, I know there’s more to it, but this relates to my central point, plus I sense you’ll agree with me :slight_smile: ).

For data science, NLP, ML, one needs really high quality statistics and linear algebra libraries (often relying on painstaking work to wrap Fortran libs). These kinds of libraries do not come lightly, and require, as I bet you know, a highly specific background. They are not composed by one person. I don’t see this community in Rust. Two other communities are up-and-coming: Go and Julia.

Go has a few researchers from bio-informatics and similar fields, as well as free-time contributors to projects like gonum, pachyderm, the gorgonia deep learning library and gopherdata. Julia has what I would consider a high number of research and working scientists building libraries, and you can find an extraordinary set of mathematical libraries (really too many to reference, but JuliaStats and JuliaData are good places to see the community in action).

I can’t find the same kind of active projects in Rust; for example, search for DataFrames or Statistics on crates dot io: not very active. What strikes me is the amazing community organization around Rust’s working groups (see the RFC: Rust 2018 Roadmap), for example WebAssembly. The community is very well organized and clear in its focus, which is amazing… but it’s also very much not focused on data science.

Go has a small but dedicated set of people working on data science tooling. But, arguably, there are productivity constraints with ad-hoc data engineering tasks due to its compiled nature, and I’d imagine Rust would run into the same problems. Often one just needs a REPL to check out some hunch. It’s not a deal breaker, and I’ve done plenty of ad-hoc scripting in Go, so I’m not sure this is a legit mark against Rust. Julia (which is where the ‘ju’ in jupyter comes from) has great support for jupyter notebooks and ad-hoc, exploratory data analysis, though the startup times due to the JIT can be a minor annoyance for one-off analysis (compared to python or R).

I don’t know Rust as well (language and community), but I don’t see the same kind of work being done in Rust that is happening in Julia. Julia’s diverse math packages give it a serious edge when it comes to competing with the dominance of R or Python, but even here it is widely pointed out that Julia is nowhere near having the kind of statistics coverage that R has.

The population of people with the training, skill, and time to write mathematics software is very small. It almost guarantees clustering into highly condensed language communities. I see a lot of this clustering around Julia; I don’t see it in Rust. To sum up, outside of language design and ergonomics, it all boils down to community, and I don’t see the Rust community as one that prioritizes data science (similar to Go).


#15

Hello @jbowles,

Thanks for your contribution and perspective.

As I read, I was thinking about purpose and long-term vision.

Let’s see what others can add to it.

I haven’t checked Go in much detail, as I was not much interested. I haven’t checked Julia at all, but you provide a good summary. Thanks.

Let’s hope that we get the ball rolling at some point.

Best regards


#16

Computer scientists have blurred the lines between programming languages and tools used for mathematics. So much so, that people tend to forget the difference between a programming language and something like Mathematica.

I could flip your reasoning and say that Matlab or Mathematica needs to change their original ideas because I want to write a USB driver, or an operating system with them.


#17

For the record, I wish above all that Rust had this support… I’ve done a lot of data science in Go and to be honest it’s not the best tool. It is very wisely used in environments where other people are not data scientists and you need to simplify both maintenance and deployment. In fact, Go data science pushes hard on the idea of “productionizing data science” by using more statically typed languages that are easy to deploy and stable. But the lack of polymorphic types can be a frequent annoyance. Rust seems a perfect fit here.
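
As a rough illustration of the kind of polymorphism meant here, the sketch below writes one `mean` function that works for several numeric element types. The function name and the choice of the standard `Into<f64>` bound are my own assumptions for the example, not something taken from an existing data science crate.

```rust
// Hypothetical helper: arithmetic mean over any element type that converts
// losslessly into f64 (for example u8, i32, f32 or f64 itself).
fn mean<T: Copy + Into<f64>>(values: &[T]) -> Option<f64> {
    if values.is_empty() {
        return None; // an empty sample has no mean
    }
    // Convert each element to f64 before summing.
    let sum: f64 = values.iter().map(|&v| -> f64 { v.into() }).sum();
    Some(sum / values.len() as f64)
}

fn main() {
    let counts: [i32; 4] = [1, 2, 3, 4];
    let readings: [f32; 2] = [0.5, 1.5];
    println!("{:?} {:?}", mean(&counts), mean(&readings));
}
```

The bound is checked at compile time, so a type without a lossless conversion to f64 (i64, for instance) is rejected instead of being silently truncated. Whether that strictness is a help or a nuisance for exploratory work is a fair question.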

Julia is different from any language I’ve worked in. I wish it were statically typed. It is dynamically typed; you can add type declarations and get “type stability”, but you can still end up with runtime errors. To be honest, though I much prefer statically typed languages, it is hard to ignore the activity around Julia right now.


#18

Agreed. The community in Julia seems to lean much more towards having a mathematical programming language. I’m going back to school for a master’s program in computational mathematics, and for the last year I was reviewing which language I wanted to focus on: Haskell, C++… and I found Julia. I’m happy having Julia as my math language and nothing else. I’d be happy writing software projects in Rust and doing math and analysis on problems in Julia.

I need a compiled, statically typed language. Up till now it’s been Go. But I don’t like the idea that I can still have data races and other problems. I eventually hope to move entirely off of Go and use Rust for all my typical software development projects. It would be sweet to do my data engineering and machine learning stuff in Rust too (instead of Julia)… but I’d rather use 2 high quality tools focused on delivering the best of breed and state of the art in their respective domains than use 1 tool with half-baked support for my needs.


#19

As an honest question, why are you exploring Rust? I would expect that the ergonomics of python (or similar) with under-the-hood libraries in C for “decades” (maybe Rust for new stuff) would often be a good trade-off, not unlike targeted rust-compiled-to-WASM use in JS or speeding up bits of Ruby with Rust.


#20

caveat: not trying to start a flamewar :slight_smile:

I’ll hold off from mentioning various reasons not to use python, since there’s always a counter to these kinds of arguments… and plenty of digital ink to find hashing these points out.

Instead, I’ll push from 3 other args (even these are obviously veiled criticisms of python; lol):

  1. variety and options. Imagine being able to run a website in rust, with a backend api in rust, serving a sophisticated deep learning model from rust, trained in rust. Cool. Also, and I think this is not a contentious point: solving problems in languages from different paradigms can inspire and help with solutions across communities. For deep learning examples, PyTorch did it, Julia Flux is doing it, and Swift has good potential to do it (Haskell has done some neat things too).
  2. Python/C++ and R/C/C++ is a big context switch, one that lowers productivity or locks some people out (e.g., those without expertise in c++).
  3. using a statically typed compiled language carries the same benefits over to data science as it does for network, os and dbs. I’ve had many errors training a model that fit suspiciously well, only to find I was (re)using some global variable that should have been local (even done this in go). The burden is heavy when your end goal is to pay attention to the math and data AND you have to verify every single aspect of your program for type safety, etc…

There are some big-time disadvantages to data science in rust, too, don’t get me wrong. But 1-3 is why I’m interested in rust for data science.
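
Point 3 can be illustrated with a small sketch. It does not reproduce the global-variable bug itself; it shows a related idea, namely pushing a data mix-up into the type system so that the compiler checks it. All names here (`TrainSet`, `TestSet`, `Model`, `fit`, `mean_abs_error`) are hypothetical, and the “model” is a deliberately tiny stand-in.

```rust
// Hypothetical newtypes: the wrappers carry no extra data, they only make
// "training rows" and "held-out rows" different types.
struct TrainSet(Vec<(f64, f64)>);
struct TestSet(Vec<(f64, f64)>);

// A deliberately tiny "model": predict y = slope * x.
struct Model {
    slope: f64,
}

// Least squares through the origin. Sketch only: no guard against empty
// input or all-zero x values.
fn fit(train: &TrainSet) -> Model {
    let (mut xy, mut xx) = (0.0_f64, 0.0_f64);
    for &(x, y) in &train.0 {
        xy += x * y;
        xx += x * x;
    }
    Model { slope: xy / xx }
}

// Accepts only a TestSet; passing the TrainSet here is a compile-time type
// error instead of a suspiciously good score.
fn mean_abs_error(model: &Model, test: &TestSet) -> f64 {
    let total: f64 = test
        .0
        .iter()
        .map(|&(x, y)| (model.slope * x - y).abs())
        .sum();
    total / test.0.len() as f64
}

fn main() {
    let train = TrainSet(vec![(1.0, 2.0), (2.0, 4.0)]);
    let test = TestSet(vec![(3.0, 6.5)]);
    let model = fit(&train);
    println!("slope = {}, error = {}", model.slope, mean_abs_error(&model, &test));
}
```

Two related language defaults may also help with the stale-global kind of bug: bindings are immutable unless marked `mut`, and mutable globals (`static mut`) can only be touched inside `unsafe`, so state tends to be passed explicitly. That is an observation about the language, not a guarantee that such bugs disappear.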