I'm trying to build something similar to Rust's Crater and wondering how I should structure my application.
The rough workflow looks like this:
1. Set up the correct version of the toolchain
2. Query a registry for metadata and download links for all packages (possibly with some filtering, possibly including every version ever published)
3. Download each package
4. Run some sort of experiment on each package using the toolchain
5. Collate the results into a report
I'm expecting most of my time to be spent on the trivially parallelisable steps 3 and 4. Technically, the toolchain doesn't need to be ready until we start running experiments, and we can start downloading packages as soon as we get the first page of download links from the registry query.
In the past I've found that writing lots of async code with ad-hoc concurrency makes things hard to reason about: there tends to be a lot of method chaining, spaghetti code, and interleaving that makes it hard to track the ordering of things in your head.
Instead of diving into the code head first, I want to go into this with a bit of a plan because making a reliable Crater-like tool means I've got to deal with some tricky problems...
Runs should be resumable without needing to start from the beginning
Querying the registry or downloading packages may fail and require some retries
I'll want to add several caching layers (e.g. because individual experiments may take less than a second, but downloading the package from us-east-1 to my laptop in Australia will take ¾ of eternity)
If you've ever worked on something like this, what are some architectures or design tricks you would suggest? I'd also be keen to hear about things that didn't work for you and why, or problems I might want to take into consideration.
I can't say that I have written a lot of large-scale applications (of this sort or any other), but I am fairly confident that a good place to start, architecturally, is idempotent, or memoized, operations. Essentially all of your operations are "fetch some data" or "perform some computation on some data and record the outcome", so you can generally structure your high-level operations as:
Is this data stored? If so, return the stored data. If not, fetch/compute it, store it, then return it.
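As a rough sketch of that shape (the names and on-disk cache layout here are made up, and the `fetch` closure stands in for whatever actually talks to the registry or CDN):

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Return the cached bytes for `key` if present; otherwise compute them with
/// `fetch`, store them, and return them.
fn get_or_fetch(
    cache_dir: &Path,
    key: &str,
    fetch: impl FnOnce() -> io::Result<Vec<u8>>,
) -> io::Result<Vec<u8>> {
    let path: PathBuf = cache_dir.join(key);
    if path.exists() {
        // Already done on a previous run -- resumption comes for free.
        return fs::read(&path);
    }
    let bytes = fetch()?;
    fs::create_dir_all(cache_dir)?;
    // Write to a temporary file first so a crash mid-write doesn't leave a
    // half-written entry that later looks like a completed download.
    let tmp = cache_dir.join(format!("{key}.partial"));
    fs::write(&tmp, &bytes)?;
    fs::rename(&tmp, &path)?;
    Ok(bytes)
}

fn main() -> io::Result<()> {
    let bytes = get_or_fetch(Path::new("cache"), "example-pkg-1.0.0", || {
        // Placeholder for the real download.
        Ok(b"package contents".to_vec())
    })?;
    println!("{} bytes", bytes.len());
    Ok(())
}
```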
You may need an explicit dependency graph scheduling the computation, or you may be able to simply start async tasks that depend on other tasks. Either way, you will need deduplication: if some data is asked for twice, compute it only once, even if the second request arrives while the computation is in progress.
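One option for the dedup (a sketch, assuming tokio; `Memo` and its methods are made up) is a per-key `OnceCell`, so a second request that arrives mid-computation awaits the same cell instead of kicking the work off again:

```rust
use std::collections::HashMap;
use std::future::Future;
use std::sync::Arc;

use tokio::sync::{Mutex, OnceCell};

/// Deduplicating memo: concurrent requests for the same key share a single
/// in-flight computation instead of racing to compute it twice.
#[derive(Clone, Default)]
struct Memo {
    cells: Arc<Mutex<HashMap<String, Arc<OnceCell<String>>>>>,
}

impl Memo {
    async fn get_or_compute<F, Fut>(&self, key: &str, compute: F) -> String
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = String>,
    {
        // Find or create the cell for this key, then drop the map lock so the
        // computation itself doesn't serialise unrelated keys.
        let cell = {
            let mut cells = self.cells.lock().await;
            Arc::clone(cells.entry(key.to_owned()).or_default())
        };
        // Only one caller runs `compute`; everyone else awaits the same cell.
        cell.get_or_init(compute).await.clone()
    }
}
```

For fallible work, `OnceCell::get_or_try_init` is the likelier fit, so a failed attempt isn't cached as if it had succeeded.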
The important part is that by taking this overall perspective, the system automatically supports resumption and retry: checking for data from the cache is implicitly part of all operations, so you never need explicit state machines to track what needs to be done in the current circumstances. All the state is in your data store(s): either they have an entry for the data you need, or they don't.
This is the approach used in my automation framework (Peace):
For every operation in the process, instead of only having a "do work" function, you also have functions to:
a. read the current state (e.g. file download: read the MD5 sum of the file on disk (if it exists))
b. read the goal state (e.g. file download: read the ETag from a HEAD request's response headers from the server, assuming they use the file's MD5 sum as the ETag)
c. if the current / goal match, then the "do work" function doesn't have to be run
That way, when you re-run the operation, it skips the things that are already done. (memoization, idempotence)
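In code, the shape is roughly this (a sketch of the idea only, not Peace's actual API; the trait and method names are made up):

```rust
/// One unit of work, described by its state rather than by "what to do".
trait Operation {
    type State;
    type Error;

    /// a. Read the current state (e.g. MD5 sum of the file on disk, if any).
    fn current_state(&self) -> Result<Option<Self::State>, Self::Error>;
    /// b. Read the goal state (e.g. ETag from a HEAD request).
    fn goal_state(&self) -> Result<Self::State, Self::Error>;
    /// Are two states "the same" for the purpose of skipping the work?
    fn state_matches(current: &Self::State, goal: &Self::State) -> bool;
    /// The "do work" function.
    fn apply(&self) -> Result<Self::State, Self::Error>;
}

/// c. Generic driver: skip `apply` when current already matches goal.
fn ensure<Op: Operation>(op: &Op) -> Result<Op::State, Op::Error> {
    let goal = op.goal_state()?;
    if let Some(current) = op.current_state()? {
        if Op::state_matches(&current, &goal) {
            return Ok(current); // already done on a previous run
        }
    }
    op.apply()
}
```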
This means having a State type per operation. I don't use PartialEq, because sometimes there are "physical / generated values" that you don't care about when comparing states.
For example, an operation for "make sure this package version is published" may have a goal State of "this name, version is published", and a current state of "this name, version is published, and its publish ID is random-generated-thing", which are still "equal" for the purposes of current state matching goal state (and so the work can be skipped).
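Concretely, the comparison can just ignore the generated fields rather than deriving `PartialEq` (again a made-up type, not one of Peace's):

```rust
/// State of "this package version is published".
#[derive(Clone, Debug)]
struct PublishState {
    name: String,
    version: String,
    /// Generated by the server on publish: present in the current state,
    /// absent from the goal state, and irrelevant to the comparison.
    publish_id: Option<String>,
}

impl PublishState {
    /// "Equal" for skip-the-work purposes: only the logical fields count.
    fn matches(&self, goal: &PublishState) -> bool {
        self.name == goal.name && self.version == goal.version
    }
}
```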
Chain all of the operations in a graph, and have a filter for which operations to execute (see the sketch after this list).
This should enable:
a. concurrent execution.
b. subset-of-graph execution -- though that's not possible when operation A produces data that operation B consumes. (rough workflow step 4)
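Here's a sketch of the graph + filter idea, using the petgraph crate purely for illustration (Peace has its own graph types):

```rust
use petgraph::algo::toposort;
use petgraph::graph::DiGraph;

fn main() {
    // Nodes are operations; an edge A -> B means B depends on A's output.
    let mut graph = DiGraph::<&str, ()>::new();
    let toolchain = graph.add_node("install toolchain");
    let registry = graph.add_node("query registry");
    let download = graph.add_node("download packages");
    let experiment = graph.add_node("run experiments");
    let report = graph.add_node("collate report");

    graph.add_edge(registry, download, ());
    graph.add_edge(toolchain, experiment, ());
    graph.add_edge(download, experiment, ());
    graph.add_edge(experiment, report, ());

    // The "filter" is just a predicate over nodes; here we pretend the
    // toolchain is already set up and skip that operation.
    let skip = [toolchain];
    for node in toposort(&graph, None).expect("operation graph has a cycle") {
        if skip.contains(&node) {
            continue;
        }
        println!("run: {}", graph[node]);
    }
}
```

This serial loop only shows ordering and filtering; for concurrent execution you'd spawn every node whose dependencies have completed rather than walking the topological order one at a time.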
Separate "progress output" from "outcome output".
Progress output is most useful in realtime (the user sees what's happening, elapsed time, ETA to completion).
Outcome output is useful to see what happened. This is where you'd collate the results into a report. (rough workflow step 5)
Depending on what you need, it may be useful for the report to capture not only the outcome values, but also the "control flow" taken by each operation (recorded as you go). As in: "20 packages need to be downloaded, these 10 were already done, I'm downloading these other 10".
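A lightweight way to keep the two apart is two separate types going to two separate sinks (a sketch; the names and channel choice are arbitrary):

```rust
use std::sync::mpsc::Sender;
use std::time::{Duration, Instant};

/// Realtime progress: rendered immediately (progress bars, ETA), then discarded.
enum ProgressEvent {
    Started { package: String },
    Finished { package: String, elapsed: Duration },
}

/// Outcome: collected and collated into the report at the end, including the
/// "control flow" that was taken (downloaded vs. already cached).
enum Outcome {
    SkippedCached { package: String },
    ExperimentRun { package: String, passed: bool },
}

/// A worker reports progress as it goes, and returns its outcome for the report.
fn run_experiment(package: &str, progress: &Sender<ProgressEvent>) -> Outcome {
    let start = Instant::now();
    let _ = progress.send(ProgressEvent::Started { package: package.to_owned() });
    let passed = true; // placeholder for the real experiment
    let _ = progress.send(ProgressEvent::Finished {
        package: package.to_owned(),
        elapsed: start.elapsed(),
    });
    Outcome::ExperimentRun { package: package.to_owned(), passed }
}
```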
Add cancel safety, with "nice cancellation points".
e.g. when SIGINT is received, finish the current operation (or a subset of it), and still update the local state cache with the work that has been done (or make sure current-state discovery can detect it), so that the next time the work is run, it knows where to pick up.
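A simple way to get those "nice cancellation points" is to only check for cancellation between operations, never in the middle of one (a sketch, assuming tokio with the signal feature plus tokio-util's `CancellationToken`):

```rust
use tokio_util::sync::CancellationToken;

async fn run_all(ops: Vec<String>, cancel: CancellationToken) {
    for op in ops {
        if cancel.is_cancelled() {
            // Nice cancellation point: nothing is half-done here, and the
            // state cache already reflects everything completed so far.
            println!("interrupted; run again to resume");
            return;
        }
        // The current operation is always allowed to finish.
        run_one(&op).await;
    }
}

async fn run_one(op: &str) {
    println!("running {op}");
}

#[tokio::main]
async fn main() {
    let cancel = CancellationToken::new();

    // Trip the token on SIGINT / Ctrl-C.
    let c = cancel.clone();
    tokio::spawn(async move {
        let _ = tokio::signal::ctrl_c().await;
        c.cancel();
    });

    let ops = vec!["download pkg".to_string(), "run experiment".to_string()];
    run_all(ops, cancel).await;
}
```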
Fault tolerance -- collect errors for each operation, and only fail later.
If you have concurrent operations and one fails, save the error but allow the other in-progress operations to finish, then fail the full execution. That way you don't get weird halfway-states.
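A sketch of that (again assuming tokio; `run_experiment` is a placeholder): spawn everything into a `JoinSet`, collect errors as tasks complete, and only return `Err` once everything has finished.

```rust
use tokio::task::JoinSet;

/// Run every operation to completion, collecting errors along the way, and
/// only fail once everything has finished.
async fn run_all_collecting_errors(packages: Vec<String>) -> Result<(), Vec<String>> {
    let mut set = JoinSet::new();
    for pkg in packages {
        set.spawn(async move {
            run_experiment(&pkg).await.map_err(|e| format!("{pkg}: {e}"))
        });
    }

    let mut errors = Vec::new();
    while let Some(joined) = set.join_next().await {
        match joined {
            Ok(Ok(())) => {}
            Ok(Err(e)) => errors.push(e),
            Err(join_error) => errors.push(join_error.to_string()),
        }
    }
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}

async fn run_experiment(_pkg: &str) -> Result<(), std::io::Error> {
    Ok(()) // placeholder for the real experiment
}

#[tokio::main]
async fn main() {
    match run_all_collecting_errors(vec!["a".into(), "b".into()]).await {
        Ok(()) => println!("all operations succeeded"),
        Err(errors) => eprintln!("{} operations failed: {errors:?}", errors.len()),
    }
}
```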
You could try using my automation framework, which already has a lot of that set up[1], with a few caveats:
It's a generic framework, so it may be more complex than you need, but arguably a lot of what you require is already built.
The APIs aren't stable -- I'm in the middle of supporting both a CLI frontend and a web server frontend.
The graph node filtering and outcome output aren't implemented yet (but they're definitely in the pipeline).