I am trying to build a distributed task engine. The idea:
suppose you need to perform a task for every leaf node you find in a given tree structure (e.g. a directory on NFS server)
so you write an async code that enumerates the root, creates corresponding sub-task for each subnode, waits for their completion (and maybe log some stats) -- in this way you will recursively traverse the tree and achieve the goal
for now let's ignore the need for depth-first traversal and "piecewise" enumeration (to avoid running out of memory if tree is too big)
next step -- add a layer in between sub-task creation and execution that will farm submitted tasks to a pool of worker processes (connected to "task coordinator" via network)
coordinator will receive sub-tasks, distribute them between workers and facilitate sending responses to their parent tasks (that wait for completion of sub-tasks)
All this quickly goes complicated when you consider possibility(propensity?) of a worker process to die (or lose connectivity) at random moment. Coordinator needs to track sub-tasks relationships, cancel sub-tasks of a dead task, resubmit dead tasks when their side effects (aka sub-tasks) are complete/canceled and so on.
Questions:
is there any existing solution/crate that does smth similar? I would not want to reinvent this (rather complex) wheel.
is there a better design? For example, design which can tolerate death of "Coordinator"...
Note that this problem also exists in local tasks; They can panic [1]. The thread JoinHandle from the standard library returns a Result when you join on it for this reason. Tokio's task JoinHandle does the same. There is no network involvement for this requirement.
Which leads to an intuitive observation that async tasks are fallible. Handling failures is up to the caller, and in some domains (such as workloads distributed to other hosts as you might do with Erlang) one prospective failure handler may retry by rescheduling the task (potentially with some sub-task granularity, if the system allows for it).
I don't personally know of any frameworks for Rust that are exactly designed around the Erlang principles of fault-tolerance [2], if that's what you're looking for. But the generalized term of art is Actor, and there are several actor frameworks available.
And at a higher level, they can timeout. But handling timeouts is very tricky! Risk retrying a timeout while the task is already in flight, perhaps with no way to actually cancel it. ↩︎
Hmm... It is not the same -- local task can fail only if your code has a bug. To avoid complications related to handling task failures it is perfectly acceptable to simply die -- sooner or later you'll fix all these bugs and your local tasks will be fine. In a distributed system task can die simply because someone stepped on the cable -- i.e. for reasons outside of your control. Therefore you are forced to modify your design to deal with this.
I spent couple of days watching presentations and reading up on actor model. I just can't figure out how to implement my requirements using actors -- in my case actors (tasks) running on same VM need to share some job-specific stuff (e.g. connection to NFS server), tasks enumerating a directory don't get to create a subtask (sub-actor) for each element immediately (we will run out memory) -- they are supposed create a bunch and wait until system signals availability of resources (i.e. total number of tasks is below some threshold)... and even in this case -- tasks that process "deeper" directories get to have a priority.
Also, apparently, general idea is that actors are long-lived. Creating and destroying an actor for every file out of a billion doesn't seem to fit conceptually...
So, yeah... I doubt actors will help me here -- it is more about managing a hierarchy of tasks + distribution.
It’s true that faults are less likely in a local context (e.g., a set of tasks doesn’t just normally go out of existence because you randomly lose half of your application at runtime). But I claim that bugs are likely. And if that is true and you need fault tolerance, then you need to plan for it.
Did you look at the thread in the footnote? It’s a discussion about actor runtimes inspired by Erlang, which is what you are looking for. The thing is that actor systems can be simplified to nothing more than tasks and message passing, or they can include supervisors or even be supervisors (in the case of Erlang) to restart the supervised tasks when they fail. My point is that just because you found some presentations doesn’t mean you’ve explored the entire problem space.
Yes, I've read those and much more. I guess you can have a "Coordinator" actor that will handle task dispatching and parent-child relationships between them... I need to think about this.