Rayon but for multiple computers!


Say I have 1,000 tasks that I would like computed in parallel, and I would like to distribute them across a network, because the network has more cores than my individual computer. This is easy enough (famous last words) to write. However, the tasks I send may spawn child tasks that also need to be distributed, and now I have the problem of distributed deadlocks :cry:

I'm sure that I am reinventing the wheel here, but at the same time I'm a little bewildered by the options, the jargon, and the generally unfamiliar territory I land in when I search for distributed computing and the like.

Has anyone had a similar need and can write me a little about how they solved it?

Can you tell us a bit more about the nature of your program? What is a task doing? What does the data you are dealing with look like? Are tasks all the same (i.e. Task A does the same thing as Task B, just with different data) or heterogeneous? Do you already have a cluster manager installed, or how do you spawn processes on each computer in your network?

Umm, okay, well in my first iteration of the problem a dispatcher is given one of these:

use serde::{de::DeserializeOwned, Serialize};
use std::fmt::Debug;

pub trait DispatchRequest: Serialize + Send + Sync + Clone {
    /// Route on the worker that handles this request.
    const ROUTE: DispatchRoute;
    type Response: DeserializeOwned + 'static + Clone + Send + Sync + Debug;
    fn cache(&self, v: Self::Response);
    fn cached(&self) -> Option<Self::Response>;
}

and its config.yaml lists the addresses of the various workers.

Each worker is just an HTTP server living at one of those addresses. It knows what to do when a connection is made at a certain route, and it replies over that HTTP connection with the response.

It also has a /status route where it returns its CPU load, and the dispatcher picks the server with the lowest CPU load and/or the highest number of cores.
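
For illustration, the selection rule above could be sketched roughly as follows. This is only a guess at one possible policy: the `WorkerStatus` struct, its field names, and the "lowest load first, more cores as tie-breaker" ordering are all assumptions, not the actual code behind the dispatcher.

```rust
/// Hypothetical shape of one worker's `/status` reply (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct WorkerStatus {
    pub addr: String,
    /// Fraction of CPU in use, from 0.0 (idle) to 1.0 (saturated).
    pub cpu_load: f64,
    pub cores: u32,
}

/// Pick the worker with the lowest load; break ties by preferring more cores.
/// Returns `None` when no worker reported a status.
pub fn pick_worker(statuses: &[WorkerStatus]) -> Option<&WorkerStatus> {
    statuses.iter().min_by(|a, b| {
        a.cpu_load
            .total_cmp(&b.cpu_load)
            // Reversed operands: a higher core count sorts first on equal load.
            .then_with(|| b.cores.cmp(&a.cores))
    })
}
```

A dispatcher would poll every worker's /status, collect the replies into a slice, and send the request to whatever `pick_worker` returns. Note that this policy does nothing about the case where every worker is busy.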

But say I design a task that itself wants to call into a Dispatcher: now I'm screwed, because if all servers are busy, I will deadlock myself.

Would switching to async be an option for you? It seems to me you deadlock because you call the dispatcher from a worker and then wait for the dispatcher to respond synchronously. If you don't wait for the response, your worker is free to handle other tasks.
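
To make the "don't wait for the response" idea concrete, here is a minimal in-process sketch that uses no async runtime at all. Everything in it is hypothetical (`SumTask`, `Step`, `run_step`, `dispatch`, and the summing workload are stand-ins for real tasks and HTTP calls). The point is only the shape: a worker that needs child tasks returns them to the dispatcher instead of blocking on them, so even a single worker slot cannot deadlock.

```rust
use std::collections::VecDeque;

/// Hypothetical task: sum the numbers in `start..end`.
#[derive(Debug, Clone, Copy)]
pub struct SumTask {
    pub start: u64,
    pub end: u64,
}

/// Ranges longer than this are split into child tasks instead of being summed.
const LEAF_SIZE: u64 = 4;

/// What a worker hands back after one step: a finished partial result, or
/// child tasks for the dispatcher to schedule. The worker never waits on
/// its children, so its slot is free again as soon as it returns.
pub enum Step {
    Done(u64),
    Spawn(Vec<SumTask>),
}

/// One unit of work on a worker (stands in for handling one HTTP request).
pub fn run_step(task: SumTask) -> Step {
    let len = task.end.saturating_sub(task.start);
    if len <= LEAF_SIZE {
        Step::Done((task.start..task.end).sum())
    } else {
        let mid = task.start + len / 2;
        Step::Spawn(vec![
            SumTask { start: task.start, end: mid },
            SumTask { start: mid, end: task.end },
        ])
    }
}

/// Dispatcher loop, simulated with a single worker slot and a local queue.
pub fn dispatch(root: SumTask) -> u64 {
    let mut queue = VecDeque::from([root]);
    let mut total = 0;
    while let Some(task) = queue.pop_front() {
        match run_step(task) {
            Step::Done(partial) => total += partial,
            Step::Spawn(children) => queue.extend(children),
        }
    }
    total
}
```

This only works so directly because the partial sums can be combined in any order. If a parent task needs its children's results to finish its own work, the dispatcher would additionally have to keep the parent's continuation somewhere and re-queue it once the children are done, which is essentially what an async runtime does for you.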

