Database concurrency

I have been considering different approaches to database concurrency for the database I am implementing in Rust. Most traditional databases I know about seem to favour allowing multiple writers to run at the same time, with fine-grained locking and deadlock detection - which seems complicated and slow. I have always been a bit uneasy about this idea; it seems to me that it's simpler, more robust and faster to only have one writer at a time ( and maybe even more Rusty :slight_smile: ). For read-only transactions I plan to simply have a "virtual" copy of the database pages as they were at the time the transaction starts (so if there is a concurrent writer, the writer makes a page copy for any readers before writing), so readers never block at all.

These are my thoughts. Am I crazy? Am I missing something?

[ To make this more concrete, I expect a typical record insertion or update to take around 50 microseconds, which allows for up to 20,000 updates per second. In the applications I have in mind, you would not see anything like that level of activity - more like a few updates per minute! ]
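To sketch what I mean in Rust ( an illustrative toy with made-up names like Db and write_page, not my actual code, and it copies the whole page map on every write purely to keep the example short ): readers clone an Arc to the current snapshot and never block, while the single writer publishes a new snapshot that shares all unchanged pages.

use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

// Toy model: the "database" is a map of page number -> page bytes.
type Page = Arc<Vec<u8>>;
type Snapshot = Arc<BTreeMap<u64, Page>>;

struct Db {
    // Protects the pointer to the current snapshot.
    current: Mutex<Snapshot>,
}

impl Db {
    // A read-only transaction never blocks: it just clones the current Arc
    // and keeps using that snapshot for as long as it likes.
    fn begin_read(&self) -> Snapshot {
        self.current.lock().unwrap().clone()
    }

    // The single writer publishes a new snapshot; unchanged pages are shared,
    // only the written page gets new storage.
    fn write_page(&self, page_no: u64, bytes: Vec<u8>) {
        let mut guard = self.current.lock().unwrap();
        let mut map = (**guard).clone(); // clones the map of Arcs, not the page data
        map.insert(page_no, Arc::new(bytes));
        *guard = Arc::new(map);
    }
}

fn main() {
    let db = Db { current: Mutex::new(Arc::new(BTreeMap::new())) };
    let snap = db.begin_read(); // snapshot taken before the write
    db.write_page(1, vec![42]);
    assert!(snap.get(&1).is_none()); // the old reader still sees the old state
    assert_eq!(db.begin_read()[&1][0], 42); // a new reader sees the update
}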

1 Like

I think most databases will not write out changes immediately but queue them. In PostgreSQL, this is handled by the background writer. Not sure if that helps in making decisions.

What is best might strongly depend on the kind of database that you construct. In SQL databases, you often have transactions that are relatively long-lasting, which can make sense because of the transaction isolation they offer. Such transactions will either need to hold some sort of lock(s) (and thus be subject to potential deadlocks) or be subject to serialization anomalies, which may cause serialization failures (rollbacks).

This might, however, not matter for a simple key value store, for example.

1 Like

I don't think it makes much sense to have long-lasting write transactions. If you view a database as a "store of facts", how can it make sense for a large number of facts to suddenly all arrive at once? And if that happens, they can be added one-by-one or in batches ( of limited size ). It does make sense to have long-lasting read-only transactions, where a large store of data is being analysed or searched in some way.

[ This is my line of thinking, at any rate ]

It can make sense to have a longer-lasting write transaction, e.g. if you need to check whether it's okay to transfer money from one account to the other (which may only be allowed if certain preconditions are met, and those can be changed by other concurrent transactions). But not every application needs this, I think.

This idea sounds very similar to noria, a database that was based on @jonhoo's PhD thesis.

In particular, he came up with a concurrency primitive where you have two "views" of your data, one that is accessed by readers and one that is accessed by the writer. The idea is that you avoid blocking readers by doing all your modifications on that second view; when you are ready to commit those modifications, the two views are swapped, and the writer waits until all readers have transitioned over to the other view before it can start making modifications again. This primitive is available on crates.io as the left-right crate.

I'd recommend watching Jon's stream on the topic. He does a much better job of explaining the idea than I ever could, and it might give you inspiration for your own database.
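To give a rough feel for the mechanism, here is a toy sketch of the "two views" idea ( this is not the left-right crate's real API - it leaves out the operation log and the clever epoch tracking, and just polls an Arc's reference count instead ): readers pin the published view, and the writer waits for old readers to let go before their view can be reused.

use std::sync::{Arc, Mutex};
use std::{thread, time::Duration};

// Toy version of the "two views" idea (NOT the real left-right API).
struct TwoViews<T> {
    published: Mutex<Arc<T>>, // the lock is only held to clone or swap an Arc
}

impl<T> TwoViews<T> {
    fn new(value: T) -> Self {
        Self { published: Mutex::new(Arc::new(value)) }
    }

    // Readers never wait for the writer's work: they just pin the current view.
    fn enter(&self) -> Arc<T> {
        self.published.lock().unwrap().clone()
    }

    // The single writer publishes a new view, then waits until every reader of
    // the old view has dropped its pin before that memory could be reused.
    fn publish(&self, next: T) {
        let old = {
            let mut guard = self.published.lock().unwrap();
            std::mem::replace(&mut *guard, Arc::new(next))
        };
        // The real crate tracks readers with per-reader epoch counters;
        // polling the strong count is a crude stand-in.
        while Arc::strong_count(&old) > 1 {
            thread::sleep(Duration::from_micros(10));
        }
        // `old` is now exclusively ours and could become the next write view.
    }
}

fn main() {
    let views = Arc::new(TwoViews::new(vec![1, 2, 3]));
    let v2 = Arc::clone(&views);
    let reader = thread::spawn(move || {
        let view = v2.enter();                   // pin whichever view is current
        thread::sleep(Duration::from_millis(5)); // a "long" read: the data cannot change under us
        view.len()
    });
    views.publish(vec![1, 2, 3, 4]); // may wait here for the reader above to finish
    assert_eq!(views.enter().len(), 4); // new readers always see the published view
    println!("the long-running reader saw {} items", reader.join().unwrap());
}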

4 Likes

That sounds like a short ( perhaps not extremely short ) transaction, unless I am misunderstanding you. I mean it is all going to be over in a very small fraction of a second, maybe a few milliseconds at most, right? By "long-lasting" I mean something that could take seconds or minutes (say).

[ I am not counting the time to write the changes to long-term backing storage; the way I am thinking about the problem, the next transaction can begin long before that has happened. If a client wants to be sure the transaction has been saved permanently, it can wait, but there is no need to hold up other transactions during that wait. ]

I do remember looking at that. My idea is a little different: I would have potentially more than two views, with each reader getting a view based on the time it starts. I actually sketched out an implementation of the cache a few weeks ago but haven't integrated it yet. It's here.

It has functions such as "Get the value of the specified page for the specified (registered) time."
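Roughly, the shape of it is something like this ( a simplified, made-up sketch, not the actual code in the repo; it ignores locking and how readers register their start times ): each page keeps its old versions keyed by the time at which they became current, and a reader asks for the version that was current at its start time.

use std::collections::BTreeMap;
use std::sync::Arc;

type Time = u64;
type PageData = Arc<Vec<u8>>;

// Made-up sketch: for each page, keep old versions keyed by the time at which
// each version became current.
#[derive(Default)]
struct VersionedPages {
    versions: BTreeMap<u64, BTreeMap<Time, PageData>>, // page number -> (time -> data)
}

impl VersionedPages {
    // The writer records the contents of `page` as of `time`.
    fn put(&mut self, page: u64, time: Time, data: Vec<u8>) {
        self.versions.entry(page).or_default().insert(time, Arc::new(data));
    }

    // A reader registered at `time` gets the latest version at or before it.
    fn get(&self, page: u64, time: Time) -> Option<PageData> {
        self.versions
            .get(&page)?
            .range(..=time)
            .next_back()
            .map(|(_, data)| data.clone())
    }

    // Once no registered reader is older than `oldest`, stale versions can go
    // (the newest version at or before `oldest` is still visible, so keep it).
    fn trim(&mut self, oldest: Time) {
        for history in self.versions.values_mut() {
            let keep = history.range(..=oldest).next_back().map(|(&t, _)| t);
            if let Some(keep) = keep {
                let newer = history.split_off(&keep);
                *history = newer;
            }
        }
    }
}

fn main() {
    let mut store = VersionedPages::default();
    store.put(7, 10, vec![1]);
    store.put(7, 20, vec![2]);
    assert_eq!(store.get(7, 15).unwrap()[0], 1); // a reader that registered at time 15
    assert_eq!(store.get(7, 25).unwrap()[0], 2); // a reader that registered at time 25
    store.trim(25); // the version from time 10 can now be freed
    assert_eq!(store.get(7, 25).unwrap()[0], 2);
}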

1 Like

Usually: yes, quite short. But there may still be computation time needed, and if you have a global mutex, even a few milliseconds may be a very long time if you have millions of write transactions per second.

Well, in some cases I/O may be involved, e.g. if you consider ATMs (cash machines). (This is just an example; in real life it would likely not be implemented through SQL transactions without further measures - see also my note on two-phase commits and the cited warning below.)

Or if there is a transfer of money between several banks which do not share the same server, then an application which holds a write transaction in the open state might need to perform I/O before it can commit (or roll back, e.g. if the destination bank denies the transfer).

In those cases you might even want something like a two-phase commit, which usually requires some custom utility/cleanup tasks (to avoid unreleased locks and such). See also the warning in the PostgreSQL documentation linked above:

Caution

It is unwise to leave transactions in the prepared state for a long time. This will interfere with the ability of VACUUM to reclaim storage, and in extreme cases could cause the database to shut down to prevent transaction ID wraparound […]. Keep in mind also that the transaction continues to hold whatever locks it held. The intended usage of the feature is that a prepared transaction will normally be committed or rolled back as soon as an external transaction manager has verified that other databases are also prepared to commit.

Two-phase commits are an example where there will usually be I/O with other systems, so we can indeed talk about seconds or minutes here in regard to how long these write transactions might be held open.

I'm not sure about the "slow" part (MVCC is usually touted as being faster when there are many writers!), but it is certainly pretty complicated. If you would like to go for an older but still solid technique, you could go with WAL. Here is a description of WAL at the official SQLite site.

Ouch!

As an old project manager of mine said, "Any solution whose correct operation depends on timing sucks". Which is what I was trying to explain to him at the time anyway. Never mind what that problem was.

2 Likes

I should perhaps have explained what I meant by "slow", which is that when you use locking primitives extensively (fine-grained locks), I think you likely suffer a performance hit in the common case where there is actually only one writer. I don't really know, but for example, how expensive is cloning an Arc and then dropping it? Is it worth avoiding? What is the cost of a Mutex lock? What is the cost of deadlock detection?
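A crude way to get a ballpark figure ( not a proper benchmark - something like Criterion would do this much more carefully, and a single-threaded loop only shows the uncontended best case ) might be:

use std::hint::black_box;
use std::sync::{Arc, Mutex};
use std::time::Instant;

// Very rough timing of Arc clone+drop and Mutex lock/unlock.
// No warm-up, no statistics, and no contention (contention costs far more).
fn main() {
    const N: u32 = 10_000_000;

    let arc = Arc::new(vec![0u8; 4096]);
    let start = Instant::now();
    for _ in 0..N {
        black_box(arc.clone()); // clone + drop: two atomic ref-count operations
    }
    println!("Arc clone+drop: ~{:?} per op", start.elapsed() / N);

    let mutex = Mutex::new(0u64);
    let start = Instant::now();
    for _ in 0..N {
        let mut guard = mutex.lock().unwrap(); // lock (uncontended)
        *guard += 1;
        black_box(*guard);
        // unlock happens when `guard` is dropped here
    }
    println!("Mutex lock/unlock: ~{:?} per op", start.elapsed() / N);
}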

I guess I should try and make the case for the traditional locking/concurrent approach. If some data is not in main memory, then you may have to wait for it to be loaded, and in the meantime ( with the locking approach ) you have the chance to get on with another transaction, if it doesn't conflict ( due to locking ). My counter-argument would be that this is too remote a possibility in practice: you expect to have a copy of the "hot" data ready to hand in main memory for a write transaction (in an amortised sense). Exactly why this is so would perhaps be hard to explain, but I think it's true for the use cases I come across.

I have benchmarked this specifically in the context of SQLite, because I wrote my own safe wrapper and was trying to avoid some extra mutex locking, but I was not entirely confident the code would be sound without it. So I wrote a Criterion benchmark, and the performance difference for a simple, single-row SELECT from a local file with and without a Mutex lock/unlock was negligible. So I'm guessing the Arc (which uses atomics) must be even less significant, compared to even a short file read, not to mention more complicated queries.

1 Like

In the code I have, cloning an Arc would be a very frequent operation ( if the data structure were shared among multiple threads - which it is not in my current design ). However, I think this is a distraction from the bigger picture. It's not really about such micro-optimisation; it's about having predictable performance, read queries that never get deadlocked, a complete absence of any kind of deadlocks, and a simple system that just works. I am not explaining myself very well.

I have had bad experiences with deadlocks myself, sometimes of a trivial nature ( as in I didn't even know the server could fail with a deadlock, and there was no recovery in place ). But I think the underlying complexities go deeper than that. I also think it's all to no avail, as database updates are ( in the systems I make ) relatively rare events, and actually having two concurrent writers would be a very rare event, so there is no point in doing complex locking to try and optimise the situation. I think it's better just to do one update at a time.

In other words: suppose for 1 in 10,000 transactions a client gets a response that is 2 msec faster (due to not having to wait); the performance gain (from the traditional fine-grained locking) is absolutely insignificant and unnoticeable. Any gains would probably be lost due to the increased complexity ( increased code size, memory use, etc. ). Latency is likely to be worse when complexity increases. Again, I am not explaining myself too well.

Here's a passage from the Rust Book:

" Here’s the idea in a slogan from the Go language documentation: β€œDo not communicate by sharing memory; instead, share memory by communicating.”"

I think it could be extended to saying "Don't have multiple threads operating on (writing) the same data", something along those lines. It mostly isn't ( I think ) a good way to go.

1 Like

Well, I have finally managed to implement my idea after thinking about it for months ( not much testing done yet ).

Read-only transactions are never blocked, and never block any other transaction, however long they take. They operate on a "virtual" read-only copy of the database file, taken at the time the read-only transaction starts.

On the other hand, write transactions block each other (but not read-only transactions), and are processed one-by-one. In practice, processing a write transaction should be very efficient, as there is a dedicated thread which will have a cached copy of all the write-active parts of the database ( that is, the parts that are being updated ).

I think it makes good sense.
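In outline, the shape of it is something like this ( a highly simplified sketch with invented names, not the actual rustdb code ): write requests go down a channel to one dedicated writer thread, which owns the mutable state outright, so no locks and no deadlocks.

use std::sync::mpsc;
use std::thread;

// Invented names, greatly simplified: the real thing works on pages, not strings.
enum WriteRequest {
    Put { key: String, value: String, done: mpsc::Sender<()> },
}

fn main() {
    let (tx, rx) = mpsc::channel::<WriteRequest>();

    // The single dedicated writer thread. It owns the mutable state,
    // so it needs no locks and can keep the "write-active" data cached.
    let writer = thread::spawn(move || {
        let mut state = std::collections::BTreeMap::new();
        for req in rx {
            match req {
                WriteRequest::Put { key, value, done } => {
                    state.insert(key, value);
                    // ...the real code would also save pages, update the
                    // read-only snapshot, and queue the change for backing storage...
                    let _ = done.send(()); // tell the client the write is applied
                }
            }
        }
        state.len()
    });

    // Client threads just send requests; writes are applied strictly one at a
    // time in arrival order, so there is nothing to deadlock on.
    let (done_tx, done_rx) = mpsc::channel();
    tx.send(WriteRequest::Put {
        key: "k".into(),
        value: "v".into(),
        done: done_tx,
    })
    .unwrap();
    done_rx.recv().unwrap(); // wait until the write has been applied

    drop(tx); // closing the channel lets the writer thread finish
    println!("writer applied {} key(s)", writer.join().unwrap());
}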

I am thinking of publishing the virtual read-only file copy code as a stand-alone crate, as I think it is quite generally useful. It's currently here, with a module that does the "time logic" here. I think I need a different word from "cache". Maybe "stash" or "archive" or something. It stores OLD copies of pages from the file (in memory). As the read-only transactions complete, the stashed pages are freed.

Feedback, questions, comments always welcome!

Well, yesterday I realised the approach I had taken caused a lot of extra copying of bytes around, so I have now re-written it so that the "virtual" read-only copy is taken at the "logical page" level rather than the lower "file page" level. My code now uses Arc<Vec<u8>> to represent the data for a logical page, and this can be shared with any number of readers, and also the single writer. When a logical page needs to be updated, I am using Arc::make_mut. The new module for logical page sharing is here. It's actually simpler this way.
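The nice property of Arc::make_mut is that it only copies when it has to: if no reader currently holds the page, the writer updates it in place, and a copy is made only when the page is shared. A tiny illustration of just that std behaviour ( not the actual module ):

use std::sync::Arc;

fn main() {
    // A logical page shared as Arc<Vec<u8>>.
    let mut page: Arc<Vec<u8>> = Arc::new(vec![0u8; 4]);

    // No readers hold the page, so make_mut mutates it in place: no copy.
    Arc::make_mut(&mut page)[0] = 1;

    // A read-only transaction pins the current version of the page.
    let readers_copy = Arc::clone(&page);

    // Now the page is shared, so make_mut clones the data first (copy-on-write):
    // the writer gets its own Vec, and the reader's version is left untouched.
    Arc::make_mut(&mut page)[0] = 2;

    assert_eq!(readers_copy[0], 1); // the reader still sees the old page
    assert_eq!(page[0], 2);         // the writer sees the new page
    assert!(!Arc::ptr_eq(&page, &readers_copy)); // now two different allocations
}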

Out of curiosity, how are you building your project? It looks like your Cargo.toml isn't committed to the repo, or you might be invoking rustc directly.

Locally with cargo. I don't have GitHub integrated into my development; I just upload source files manually occasionally, using GitHub as somewhere to publish code ( and for its diff capabilities ), but I don't bother with the toml files. I know it's probably not the best way to work, something I just haven't got around to. I am a bit vague on the relationship between crates.io and GitHub. I mean, you publish in both places? I suppose GitHub has pull requests and that sort of thing for working co-operatively? That's not something I have ever used or understood yet. GitHub is a bit of a mystery to me.

Edit: I just published at https://crates.io/crates/rustdb

and an Axum example here: https://crates.io/crates/rustdb-axum-example

1 Like

The way GitHub tends to be used is that you'll have your entire project within a self-contained git repository, so all you need to do when getting started on a new computer is:

$ git clone https://github.com/Michael-F-Bryan/include_dir
$ cd include_dir
$ cargo build

So don't necessarily think of it as a place for "publishing". It's more like a single place to manage your project's evolution over time and facilitate collaboration. A single repository might contain multiple crates.

On the other hand, crates.io is a package registry where you'll upload "official" releases for a particular version of your crate.

To explain why most software companies or open-source projects use some form of version control system (git) and hosting service (GitHub), it might be helpful to look at an example.

At a previous company, they'd had the same single software developer for about 18 years, and his system was to keep all source code on a network drive. Then whenever he made a new release, he would zip up the entire folder and put it in a separate "archive" folder.

This workflow was perfectly fine when it was just him, but became a complete clusterf*ck when I joined the company because we would constantly step on each other's toes.

Imagine you are hacking away in main.rs and I hit save after making some changes in database.rs, so now the project is broken because two people have added unfinished features to it. We either need to coordinate who makes changes and when, or (what happened in practice) one person never works on that project. There was also no way to see what changes had been made between versions, or to roll back, other than opening the backup archives and comparing the code by eye.

It was painful. 10/10 would not recommend.

Git fixes this problem by making all changes atomic ("commits") and letting you work in your own version of the project ("branches") until you are ready to merge it into the "master" copy.

GitHub builds on git by providing auxiliary tools for managing your workflow. It gives you a central place to store your repositories, you get a built-in issue tracker, you can browse code without needing to download it locally, you can create "pull requests" when merging someone's work in to make reviewing changes easier (using the diff view), and you can write comments that point at specific lines. You can also wire things up so it'll automatically build and test your project when you push code, immediately alerting you when you break the project.

Tags and Releases are just an easy way to say "when I say version 1.2.3, I'm talking about the project at exactly this point". You can use them to correlate a particular release on crates.io with the rest of the project.

TL;DR: Like Rust, Git and GitHub have a certain learning curve, but once you get past that it's a massive quality-of-life improvement. Especially when there is more than one person involved.

3 Likes

You don't use a Cargo.toml? How are you building your project?

It is customary to publish Open Source Software such that it can be built by whoever downloads it. That means including whatever build scripts, project files, makefiles, etc. are required to build it. In the case of Rust that typically means the Cargo.toml.

2 Likes