Rust equivalent of Apache Spark?

Hi Rustaceans!

Having programmed in Scala for almost a year (I use it as part of the SMACK stack), I have helped a number of teams build Big Data analytics platforms and solutions.

I'm attracted to Rust's similarity to Scala (closures, iterators, and many other "first-class" constructs; OOP aside :slight_smile: ), so I am wondering whether an Apache Spark-like big data processing system could be implemented in Rust. If so, what pros and cons should we expect?

MOTIVATION:

  1. The JVM is resource-hungry and a bit boring (at least after 6 years).

  2. I believe Rust 1.17 is stable enough (the successful 1-million-thread benchmark and all) to be considered as an alternative now.

Keen on hearing. Thanks in advance :slight_smile:

4 Likes

I do not see any particular reason why you could not build almost anything in Rust. A good example that everything is possible is Redox, which is an actual running OS. On the other hand, keep in mind that Rust will never allow you as much flexibility as you have in Python, but there is a very good reason for that.

Probably the major challenge is human labor. Apache Spark has been out there for quite a while and, as far as I remember, is built on top of Hadoop or some other backend, which are usually even older. If someone is to build such a thing in Rust, you will need enough dedicated people who are well familiar with HPC and distributed computing. On the bright side, I think a lot of the people developing in Rust are very aware of performance in code. On the other hand, I haven't seen much data-science-oriented code written in Rust yet, and it would be hard to convince people that they need to learn Rust to use such a system, since its learning curve is relatively high compared to Python and Scala.

2 Likes

When it comes to data science, Rust is not a trivial choice at the moment. Even if we set aside the language's learning curve, the available libraries for common tasks in data science are scarce and underdeveloped. We've got some good people out there working on idiomatic libraries, for example for plotting data or for reading and writing HDF5 files, but even these take time. To be fair, although I don't like Python as a programming language, it is still my go-to technology for these things, including machine learning. Speaking of machine learning, "Are we learning yet?" shows that we're still at an early stage.

OK, back to HPC. As far as I know, there was Collenchyma, which provided an abstraction for computation over GPUs and other hardware. Support was eventually dropped and the hard fork Parenchyma was created. It's a pity, because Collenchyma was the computational base for leaf, the machine learning framework. Anyway, once Parenchyma becomes more stable, it should leave room for further frameworks focused on MapReduce and the like.

This doesn't mean that I don't support Rust as a language for scientists. I still hope to see Rust shine in these use cases, hopefully resulting in faster and safer solutions in academia and industry alike.

5 Likes

Thank you all for the enlightening info and replies. The "arewelearningyet" site has been really helpful in understanding the current state of things!

There is TiKV as an option. I think the beta was recently released.

https://github.com/pingcap/tikv

Yes @vitiral, TiKV is an interesting key-value store implementation in Rust as well.

I'm looking for an implementation at the analytics layer. I came across diesel.rs, which has an attractive ORM/query builder suitable for almost any serious DB application.

Is there an analytical engine to predict or detect patterns such as fraud? If not, I'd like to know if anyone can help me build one (a side project until it gets incubated by my company).

Interesting that no one has mentioned timely dataflow; there is a talk from the author as well. It is not as feature-complete as mainstream solutions, but it would be good to see more features built on top of it.

1 Like

There was some work by Mohammed Makhlouf (@msmakhlouf) and Mohammed Samir from Q-CERT on a project called Antimony.

See this talk:

Edit: Here is the video:
http://www.video.ethz.ch/events/2017/rust/26e53d2f-1121-4e4a-a64d-d0812e408d95.html

I know this is an old thread, but I have actually started an open source project to see how feasible it is to build something similar to Apache Spark in Rust. I use Scala+Spark in my day job and I've been learning Rust on and off over the past two years. I also have a background in building distributed systems.

Here's a link to the project. Contributors welcome!

https://github.com/andygrove/distributed-query-rs

5 Likes

Can you try to split this up into a SQL layer and an underlying layer, where you have functions similar to map, filter, group_by, ...?
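To make the question concrete, here's a purely hypothetical sketch of what I mean by the underlying layer: plain Rust iterator primitives that a SQL layer could be lowered onto (none of these names come from your project, they're just for illustration):

```rust
use std::collections::HashMap;

// Hypothetical row type; a real engine would more likely use columnar batches.
struct Row {
    region: String,
    amount: i64,
}

// The "underlying layer": map / filter / group_by over an iterator.
// A SQL layer could lower something like
//   SELECT region, SUM(amount) FROM t WHERE amount > 0 GROUP BY region
// onto exactly these primitives.
fn total_by_region(rows: Vec<Row>) -> HashMap<String, i64> {
    rows.into_iter()
        .filter(|r| r.amount > 0)                  // WHERE amount > 0
        .map(|r| (r.region, r.amount))             // SELECT region, amount
        .fold(HashMap::new(), |mut acc, (k, v)| {  // GROUP BY region, SUM(amount)
            *acc.entry(k).or_insert(0) += v;
            acc
        })
}
```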

Not that I want to stop you from trying things out, but if you want to give timely dataflow a spin you would be welcome. It isn't the same as Spark (intentionally), but it manages to be a fair bit more efficient (depending on the application, 0-2 orders of magnitude). Differential dataflow is the Spark/SQL-like layer on top, which has the additional virtue of being fully incrementalized (so, low-latency updates).
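If it helps to see what "giving timely a spin" looks like, here is roughly the smallest timely program, in the spirit of the examples in the repository (operator names and signatures may differ slightly between versions):

```rust
extern crate timely;

use timely::dataflow::operators::{ToStream, Map, Filter, Inspect};

fn main() {
    // Builds and runs a single-worker dataflow; `timely::execute_from_args`
    // would run the same graph across multiple workers or processes.
    timely::example(|scope| {
        (0..10u64).to_stream(scope)
            .map(|x| x * 2)
            .filter(|x| x % 3 == 0)
            .inspect(|x| println!("seen: {}", x));
    });
}
```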

I think you'll learn a bunch building a Spark-like system, but it would be great to get input on what sorts of things you are doing (or want to be doing) in Spark that timely doesn't do for you. Good luck!

2 Likes

My main priority is that I need a project to build from scratch in Rust where I can go at my own pace and try and understand each new concept thoroughly as I learn it. If the resulting project ends up being useful to me and/or others then that's a bonus.

It's not that Spark doesn't do what I need. It's a great tool, but I'd like something faster and more efficient. I'm working on geospatial projects in my day job right now and I've seen that Rust is typically 4-5x faster than Scala for doing some of the basic number crunching and that's without the overhead of Spark.

Spark is obviously optimized for large datasets (e.g. whole-stage code generation is expensive up front but then reduces the per-tuple execution cost), but I'm often dealing with tiny datasets (fewer than 100k rows), and the overhead of Spark makes it painful to use in this case, especially for interactive workloads.

I will definitely check out your project. It does look like you are solving the same problem. I will see if there is potential to use this in my day job. Unfortunately your project looks too advanced for me to contribute to given my current skills with Rust but hopefully that will change in the next few months.

1 Like

Cool. Let us know (e.g. on the timely Gitter) if you have questions. There is also someone local here at ETHZ working on PySpark -> LLVM, and if you just like that dev experience it could get you where you want without the distributed-systems overhead; we have an appointment to hash that out. But learning about distributed systems is great; I recommend building your own thing for as long as you are learning. :slight_smile:

So, this has motivated me to have a look at timely dataflow again, and now I think I understand it a bit better. From an API design point of view, it is definitely quite a bit different from Spark.

Timely dataflow basically uses a streaming computation model, where you work with a continuous flow of data, kind of like in ReactiveX. To this, it adds the ability to insert synchronization points in the data stream, in the form of monotonically increasing timestamps, and the ability for workers to query whether everyone has reached a given timestamp.
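To illustrate those two ingredients concretely, here is a small sketch in the style of timely's own examples; the probe is the part that lets a worker ask whether everyone has passed a given timestamp (I may be getting API details slightly wrong across versions, so treat this as an approximation):

```rust
extern crate timely;

use timely::dataflow::operators::{ToStream, Exchange, Inspect, Probe};
use timely::dataflow::ProbeHandle;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut probe = ProbeHandle::new();

        worker.dataflow::<u64, _, _>(|scope| {
            (0..100u64).to_stream(scope)
                .exchange(|x| *x)                        // shuffle records across workers
                .inspect(|x| println!("worker saw: {}", x))
                .probe_with(&mut probe);                 // track progress at this point
        });

        // Each worker can ask whether the whole computation has moved past a timestamp.
        while probe.less_than(&1) {
            worker.step();
        }
    }).unwrap();
}
```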

This is quite different from how Spark usually does things. While Spark has a streaming API, the most commonly used APIs (RDDs, DataFrames...) give the user a facade that looks like a global view of the data, and automatically insert synchronization points after each stage of the computation. In this sense, the Spark API is intrinsically less efficient (more implicit synchronization) and has higher latency (more processing barriers where early results await late results). But it is also easier to get used to, because tables are an easier abstraction to grasp than synchronized data streams.

In this sense, I would say that there is a place for a Spark-like high level API built on top of timely dataflow's lower-level building blocks. But maybe I'm misunderstanding things.

1 Like

Just to add a bit here: you are totally right about the relationship between timely and Spark, I think. One of the differences is that they try to do different things; timely dataflow is "lower level" than Spark, and so you can do more with it, which is both good and bad depending on what you need.

Just to offer up another point of comparison, I would say differential dataflow is more "like Spark" than is timely dataflow. In differential dataflow you define computations and then load up timestamped "changes" to the inputs. One timestamp you can use is the empty type (), meaning roughly "just load this, no more changes afterwards." You get a batch processor when you do this (though, one that pipelines execution to do in-place reduction of produced data, at the expense of keeping multiple operators live at the same time). Or you could load all your Spark data up at time 0, then chill out until you get an answer, then change the inputs if you want. It's meant to be the same (edit: at least as much) functionality, if substantially less ergonomic at the moment because that is hard and stuff. :slight_smile:
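If it helps, a tiny differential dataflow program for the "load everything at time 0, then stop changing it" pattern looks roughly like this (modeled on the crate's examples; type parameters and such may vary between versions):

```rust
extern crate timely;
extern crate differential_dataflow;

use differential_dataflow::input::Input;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Define the computation once, up front.
        let mut input = worker.dataflow::<u32, _, _>(|scope| {
            let (handle, numbers) = scope.new_collection::<u32, isize>();
            numbers
                .map(|x| (x % 2, x))
                .inspect(|x| println!("observed: {:?}", x));
            handle
        });

        // Spark-style batch use: load everything at time 0 ...
        for x in 0u32..10 {
            input.insert(x);
        }
        // ... then advance the time; any later inserts would be incremental updates.
        input.advance_to(1);
        input.flush();
    }).unwrap();
}
```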

There is work in progress on a differential dataflow "server" in which you can define computations that stash indexed collections and update streams, sort of like how Spark lets you stash RDDs in memory except these are changing RDDs, and then load up new computations that use them (via shared libraries you build). It kinda works at the moment, though the shared library side could really use some love (I'm not sure what the Rust story is for loading code once your binary is up and running; the FFI and allocator issues suggest this isn't really intended to work the way I'm using it; also, it's apparently UB to unload a shared library on OSX).
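For what it's worth, the usual route for loading code into a running Rust binary is dlopen-style loading, e.g. via the libloading crate; a rough sketch of that pattern (the symbol name and signature here are invented for illustration, not what the server actually uses):

```rust
// Hypothetical sketch using the `libloading` crate; the symbol name and
// signature are made up for illustration.
extern crate libloading;

use libloading::{Library, Symbol};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    unsafe {
        // Load a shared library that the user built against the running server.
        let lib = Library::new("./libmy_dataflow.so")?;

        // Look up an entry point exported with `#[no_mangle] pub extern "C" fn ...`.
        let build_dataflow: Symbol<unsafe extern "C" fn() -> i32> =
            lib.get(b"build_dataflow")?;
        println!("dataflow entry point returned: {}", build_dataflow());

        // Keeping `lib` alive for as long as any code or data from it is in use
        // is essential; unloading (dropping) it too early is where the UB lurks.
    }
    Ok(())
}
```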

2 Likes

There exists a newer implementation, formerly called "fast_spark" or "native_spark" and now known under the name Vega:

vega is a distributed computing framework inspired by Apache Spark.

See its documentation.