Tips for writing a distributed productivity app in Rust

I've always wanted to learn distributed systems, and now I have a toy project that can benefit from this. My question is very general in the sense that it asks for all kinds of ideas, algorithms to use, available crates, etc.

I want to write an app that is mostly able to work without the central server (think a mobile app that is functional offline), although it does use one to sync, for backup, etc. (I am not writing a P2P network). Users (or several devices of the same user) may collaborate, but consistency is a very relaxed requirement: conflicts are rare (you don't need to edit your data from 2 devices simultaneously) and easily resolved, and errors (as in inconsistent data) are acceptable to some extent (except, of course, for a tiny amount of data, which in the extreme case can be read-only in offline mode).

I do not have a requirement for a "hard" offline mode (e.g. days); my main goal is speed. The client needs to have the data locally, both to offload the server and to increase analytical speed (it will need to scan large parts of the data per operation), and for the purpose of sharding: one user does not need the data of other users. The central (backup) storage can be slow (no need to support analytical queries). Also, the central server does not need to be able to read user data (privacy).

To make things more complicated, the total amount of data does not fit in a single web request: probably many megabytes, maybe gigabytes. So it's not acceptable to always sync everything; syncing has to be granular (and the data naturally provides this property).

I haven't done anything like this. Neither in Rust nor in any other language.
I would prefer to use ready libraries if available, or ready databases/services.
So all kinds of thoughts are very welcome!


Something you might want to do is use Firebase to store data. It's a NoSQL database that can automatically deal with times when you may be offline, and I believe it caches a lot of the data to avoid unnecessary trips to the database.

You didn't explicitly say how this app was going to run. Is it something that runs in the browser, a native mobile app, a GUI program on a desktop, etc.? That'll be the #1 deciding factor in how you structure your app and the libraries you can use.

I prefer not to have to deal with cloud services. I want to write the synchronisation on my own.
The app will run as a native app. Mostly on mobile. My idea is to write the core in Rust and the UI in something like React Native/Flutter (haven't decided yet).

Ooh, this is a very fun space to play with. Apologies if you are looking for more concrete things; what follows is mostly a brain dump:

You're probably going to end up designing your data interactions as operations over a certain state: you don't send your hash table, you send "Set(Key, Value)", "Delete(Key)", etc.
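
To make that concrete, here's a minimal sketch of an operation log in Rust. The `Op` type and `apply` function are illustrative names, not from any crate; the point is just that replaying the same log on any peer reproduces the same state.

```rust
use std::collections::HashMap;

// Hypothetical operation type; the variants mirror the Set/Delete example above.
#[derive(Debug, Clone)]
enum Op {
    Set { key: String, value: String },
    Delete { key: String },
}

// Applying a log of operations in order reproduces the state on any peer.
fn apply(state: &mut HashMap<String, String>, op: &Op) {
    match op {
        Op::Set { key, value } => {
            state.insert(key.clone(), value.clone());
        }
        Op::Delete { key } => {
            state.remove(key);
        }
    }
}
```

Instead of shipping the whole `HashMap`, peers exchange `Op` values and fold them into their local state.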

One example would be how git works: it's essentially an ordered list of patches (which are themselves bundles of add/remove-line operations), and you sync state simply by knowing your position in that list:

  1. Don't have a position yet? Start from the beginning (git clone)
  2. Have a position? Ask the server to give you all changes since your position (git pull)
  3. Did you make changes and want them on the server? Send the position you were at plus all the operations you made (git push)
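
The three steps above can be sketched as operations against a shared append-only log. This is a toy model, not how git actually stores data; `Log`, `clone_all`, `pull`, and `push` are hypothetical names.

```rust
// Minimal sketch of position-based sync against an append-only log.
struct Log {
    entries: Vec<String>, // each entry stands in for one patch/operation
}

impl Log {
    // "git clone": no position yet, so take everything from the beginning.
    fn clone_all(&self) -> (usize, Vec<String>) {
        (self.entries.len(), self.entries.clone())
    }

    // "git pull": everything after the client's known position.
    fn pull(&self, position: usize) -> (usize, Vec<String>) {
        (self.entries.len(), self.entries[position..].to_vec())
    }

    // "git push": accept new entries only if the client was up to date;
    // otherwise report the server's position so the client can pull first.
    fn push(&mut self, position: usize, new: Vec<String>) -> Result<usize, usize> {
        if position == self.entries.len() {
            self.entries.extend(new);
            Ok(self.entries.len())
        } else {
            Err(self.entries.len())
        }
    }
}
```

A stale `push` fails with the server's current position, which is exactly the "pull before you push" behavior git gives you.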

This pattern appears everywhere in software, at many different layers. It's how many databases do (bits of) replication; Kafka interactions are essentially via offsets (a position) into a file that arbitrary data gets appended to; file synchronization looks at byte length (again, a position) and sends the delta; blockchains too.

Things get more complicated once you introduce interactions from different devices. Say, can multiple devices modify the same thing on the server side? A user with multiple devices, for example?

That's because you end up with the curse of all reliable data: shared mutable state. In Rust we have all the code in our hands (so we know the "truth") and it's already difficult; offline-first, data-heavy apps only know about what you're trying to do right now, and the server is nothing special, just another peer you're syncing with. So now you have to deal with merge conflicts :)

Git solves this problem by saying it's always your problem: you merge it, you solve the conflicts, and you advance. Dropbox tries to solve some conflicts, but often fails and leaves garbage behind for you to deal with. Highly distributed things expect some sort of quorum, either directly (via active voting) or through some fancier mechanism (like proof of work).

Mobile apps are less general-purpose than the applications I've been talking about so far, so you might be able to avoid, or at least drastically minimize, conflicts with merge heuristics and rely on one of the coolest (IMO) pieces of tech in "recent" years: Conflict-Free Replicated Data Types (CRDTs).
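
For a feel of why CRDTs are appealing here, consider a last-writer-wins register, one of the simplest CRDTs. This is a hand-rolled sketch (a real app would likely reach for a crate); the key property is that `merge` is commutative, associative, and idempotent, so replicas converge no matter in which order they exchange state.

```rust
// A last-writer-wins (LWW) register. The (logical timestamp, replica id)
// pair gives a total order on writes and breaks ties deterministically.
#[derive(Debug, Clone, PartialEq)]
struct LwwRegister<T> {
    value: T,
    stamp: (u64, u32),
}

impl<T: Clone> LwwRegister<T> {
    fn set(&mut self, value: T, timestamp: u64, replica: u32) {
        let stamp = (timestamp, replica);
        if stamp > self.stamp {
            self.value = value;
            self.stamp = stamp;
        }
    }

    // Merging keeps whichever write has the greater stamp. Merging the same
    // state twice, or in either direction, always yields the same result.
    fn merge(&mut self, other: &LwwRegister<T>) {
        if other.stamp > self.stamp {
            self.value = other.value.clone();
            self.stamp = other.stamp;
        }
    }
}
```

Two devices can both write while offline, sync in any order, and still end up agreeing on the value.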

I'd read up on how a piece of software you like does synchronization; there's a lot of knowledge that can be reused regardless of the scope. I got into this several years ago via queries like "how does rsync/kafka/git work".

Some more pointers: literature about Merkle trees will help with design; of course, the already-mentioned CRDTs; there's also this Replicated Object Notation thing that I've been meaning to take a look at that seems super interesting, but I have no idea about its current status.


Yes, I was just reading about CRDTs and started playing with the crate:
Although it is probably massive overkill for my use case, it will work, and it will probably be quite easy to write the code that does the conflict resolution.

Now I have to figure out how to do the networking... basically, how to guarantee that every change reaches every device in the cluster eventually.

Does the git model satisfy the requirements above?

I am not sure yet. It's definitely not a good idea to store the whole dataset as a single git repo, because it is large and changes quite fast; I don't want every sync operation to involve downloading all of it. It might be better to handle every logical entity as its own git repo with history. The problem there is that this would multiply the storage requirements several times, as the entities themselves are not much larger than a sha-256 hash. Also, the "manifest" of entities would be as big as the data itself. I will need something in between.

As for "you fix it and advance it" for conflict resolution, it's probably also not a good idea, as the freshness of the data is more important than its consistency (the data is statistical in nature). The various peers will arrive at the same entity values eventually on their own. The point of sync is to save redundant work, which is resource-consuming.

I think for your case what's interesting in the git model is the history concept, but you can do a lot better because: (1) You only care about the history for sync, git wants to preserve the whole thing always; And (2) you have full control of the domain model whereas git can't make any assumptions about what it's storing.

Since you mention that peers will eventually converge to the "truth" even in the face of inconsistent state (what a nice property to work with!), sounds like you can make use of a logical clock for versioning.

Instead of sending an operation like Set(Key, Value), you send Set(Key, OldVersionNumber, Value) and the server only actually applies the change if its local version number is exactly OldVersionNumber.
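
A sketch of that version check in Rust (all names here are illustrative): the server stores a version number next to each value, and a write only goes through when the client's `old_version` matches it.

```rust
use std::collections::HashMap;

// Server-side store: key -> (version, value).
struct Store {
    data: HashMap<String, (u64, String)>,
}

enum SetResult {
    Applied { new_version: u64 },
    // The client was out of date; it must refresh to this version and retry.
    Conflict { current_version: u64 },
}

impl Store {
    fn set(&mut self, key: &str, old_version: u64, value: String) -> SetResult {
        // Unseen keys start at version 0.
        let current = self.data.get(key).map(|(v, _)| *v).unwrap_or(0);
        if current == old_version {
            self.data.insert(key.to_string(), (current + 1, value));
            SetResult::Applied { new_version: current + 1 }
        } else {
            SetResult::Conflict { current_version: current }
        }
    }
}
```

A rejected write is the server telling the client "pull first": the same shape as the stale-push case in git.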

And here's where you can do a lot better than git: Since you don't really care about reading the history, the changeset that you send/receive doesn't need to be the history at all, you can coalesce them using Key,Version (you don't need to send all Set operations over the same key, just the most recent one).
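
Coalescing can be as simple as a pass that keeps only the newest operation per key. A sketch with hypothetical names:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct VersionedSet {
    key: String,
    version: u64,
    value: String,
}

// Collapse a history into at most one entry per key (the highest version),
// so an out-of-date client downloads one change per key instead of the
// full sequence of intermediate writes.
fn coalesce(history: &[VersionedSet]) -> Vec<VersionedSet> {
    let mut latest: HashMap<&str, &VersionedSet> = HashMap::new();
    for op in history {
        match latest.get(op.key.as_str()) {
            Some(seen) if seen.version >= op.version => {}
            _ => {
                latest.insert(op.key.as_str(), op);
            }
        }
    }
    latest.into_values().cloned().collect()
}
```

If a key was written a thousand times since the client last synced, the changeset it receives still contains a single entry for it.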

Now your history is simply something to help out-of-date clients get back on track quickly, which means you can trim/prune/compact/shrink it (how many words do we have for this?): instead of your lower bound being "the first commit to the repository", your lower bound is "the least up-to-date peer" (and likely an X-days-in-the-past cut-off point).


Yes, something like this will work. Now I need an efficient way to figure out what is out of sync for a particular client. Maybe just a simple timestamp managed by the server: it will send all the changes since the last check where this particular client was not the updater... and the client only has to verify "I am up to date until timestamp X". This turned out to be quite an easy task :). Thanks for the help!

Beware of using timestamps based on clocks from different machines. They don't all advance at the same rate, and they can sometimes jump backwards. Use an event counter as @errado recommends.
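
A sketch of that counter-based approach (illustrative names, not a library API): the server stamps every change with the next value of a monotonic sequence, and a client only has to remember the highest number it has seen.

```rust
// Server-side change tracking with a monotonic event counter instead of
// wall-clock time, so it never jumps backwards or skews between machines.
struct ChangeLog {
    next_seq: u64,
    changes: Vec<(u64, String)>, // (sequence number, change payload)
}

impl ChangeLog {
    fn record(&mut self, change: String) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.changes.push((seq, change));
        seq
    }

    // "Give me everything after the last sequence number I saw."
    // A brand-new client passes 0 and receives the full log.
    fn since(&self, last_seen: u64) -> Vec<&(u64, String)> {
        self.changes.iter().filter(|(s, _)| *s > last_seen).collect()
    }
}
```

The client's sync state collapses to a single `u64`, with none of the clock-skew problems of timestamps.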
