Recommendation for cache-type database

I asked this two years ago, but ended up using a directory of files. Now I actually need a database with indices to store mostly image blobs, but some other things as well.

Back then, the best choice seemed to be sled. Two years later, that's been abandoned. Last update 3 years ago. Never made it out of beta.

There's Limbo, an attempt to rewrite SQLite in Rust. It's not done yet.

There's always SQLite, but I'd like to stay in Rust and not deal with the problems of a C subsystem.

Suggestions?

Do you know of any problems specifically with the rusqlite crate? There is a bundled feature that avoids linking issues, so there shouldn't be any build problems.

  • bundled uses a bundled version of SQLite. This is a good option for cases where linking to SQLite is complicated, such as Windows.

The crate is very popular (heavily used) and SQLite itself is extremely well tested and reliable.
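
If it helps, here's a rough sketch of the blob-store use case with rusqlite's `bundled` feature (untested; the table layout, file names, and the exact version are just assumptions for illustration). The dependency line would be something like `rusqlite = { version = "0.31", features = ["bundled"] }`, adjusted to whatever is current:

```rust
use rusqlite::{params, Connection, Result};

fn main() -> Result<()> {
    // Open (or create) the cache database file.
    let conn = Connection::open("cache.db")?;

    // One table for the image blobs, keyed (and indexed) by name.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images (
             name TEXT PRIMARY KEY,
             data BLOB NOT NULL
         )",
        [],
    )?;

    // Store a blob.
    let bytes = std::fs::read("example.png").unwrap_or_default();
    conn.execute(
        "INSERT OR REPLACE INTO images (name, data) VALUES (?1, ?2)",
        params!["example.png", bytes],
    )?;

    // Read it back.
    let data: Vec<u8> = conn.query_row(
        "SELECT data FROM images WHERE name = ?1",
        params!["example.png"],
        |row| row.get(0),
    )?;
    println!("read {} bytes", data.len());
    Ok(())
}
```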

You're misrepresenting its development. There have been around 25 alpha releases of 1.0.0 since then, the most recent being published 4 months ago: sled versions. It isn't abandoned. They are trying to stabilize.

Most of the pure Rust alternatives that I know of are KV stores, FWIW.

2 Likes

For large files, the plain on-disk option is IMO still valid.

I don't know the details, but if it is a cache it must be fast. Any database will read the data from disk and then send it over IPC, usually a TCP connection, which is slow and resource-consuming.

1 Like

One would normally assume so. However, I have recently seen it claimed that if one wants to store a large number of not-so-big files, keeping them in an SQLite database can be faster. It has no networking/IPC overhead, and presumably looking things up in there is faster than finding files in a file system directory. Sorry I can't link to any evidence of this.

In some of my projects I save the images in a DB and use a cache on top of it; if there is a lot of traffic, the performance difference is quite noticeable.

I speak from my personal experience, but the DB is going to read the data from the disk and then send it. Won't it be faster to read it directly from the disk than to add additional steps, no matter how fast they are?

An important issue is to avoid directories with thousands of files; it is better to make subdirectories. You can do that by hashing the file name and using its first characters as the subdirectory name; if there are many files, you can use several levels.

The same hash and subdirectory levels serve as an index: you go directly to the file you need.
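
A minimal sketch of that layout, in case it's useful (the hash choice and directory depth are just assumptions; note that `DefaultHasher` isn't guaranteed to be stable across Rust versions, so a real cache would want a fixed hash such as SHA-256 or FNV):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Map a file name to a sharded path like "cache/ab/cd/name.png",
/// using the first hex characters of a hash as subdirectory names.
fn sharded_path(root: &str, file_name: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    file_name.hash(&mut hasher);
    let hex = format!("{:016x}", hasher.finish());

    // Two levels, each with at most 256 subdirectories.
    PathBuf::from(root)
        .join(&hex[0..2])
        .join(&hex[2..4])
        .join(file_name)
}

fn main() {
    let path = sharded_path("cache", "example.png");
    // Make sure the subdirectories exist before writing the file.
    std::fs::create_dir_all(path.parent().unwrap()).unwrap();
    println!("{}", path.display());
}
```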

I know very little about databases, but I'm guessing that a DB might well map a lot of its data into memory and then have fast indices to look things up, which might be quicker than looking things up in a file system by file name. It also skips some user-space to kernel-space transitions and thread rescheduling, I guess.

Also, SQLite has no "send" overhead; it's all linked into your program's memory space.

At the end of the day one has to try these things for oneself, on one's own data sets, to make any performance comparisons, I guess.

Reading directly from disk is twice as fast or more than reading from any database, and the consumption of resources such as CPU and memory is less than half.

I already said I don't know the details, but the whole point of a cache is to speed things up....

Here's a link from the SQLite website that seems to show that, for at least some workflows, SQLite can be faster: 35% Faster Than The Filesystem

1 Like

Huh. Compare

https://crates.io/crates/sled/

which says the current version is v0.34.7, 3 years ago, and

https://crates.io/crates/sled/versions

which says the latest version is 1.0.0-alpha.134

The forum system can't read those links, but a browser can.

Is crates.io messed up?

Edit: It looks like it's been three years since the last release, but with 124 alpha versions. What's going on?

1 Like

An important issue is to avoid directories with thousands of files...

The "test1.dir" directory created above puts all the blobs into a single folder. It was conjectured that some operating systems would perform poorly when a single directory contains 100,000 objects.

... No.

No, it's not messed up. The semver suffix changes default visibility and dependency resolution. These are pre-releases; you have to opt in explicitly if you want to depend on them.
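
For what it's worth, a quick illustration of the opt-in (my understanding of Cargo's behavior, using the alpha version mentioned above): a pre-release is only resolved if the version requirement itself names a pre-release.

```toml
[dependencies]
# A plain `sled = "0.34"` (or `sled = "1"`) will never resolve to the alphas;
# the pre-release tag has to appear in the requirement itself.
sled = "1.0.0-alpha.134"
```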

I get that. But three years of pre-releases? Other users are puzzled, too.

1 Like

Exactly. What if the database, like SQLite, IS the cache?

Whichever way, some people at least have measured that, for their workload on their platform, SQLite performed better than relying on the file system. We should not make categorical statements about such things without actually doing the measurements.

At this point we are not helping to make a decision.

I speak from my personal experience, and I do not doubt that you have obtained other results. Particular cases, particular results.

I believe it is common to split a cache on the filesystem into a few layers to avoid overly large directories. E.g. if every resource is a hash, have a separate directory for each of the first two hex digits of the hash. Tools like sccache (for build caching) do this.

I don't know the exact reason some file systems have issues with very large directories (and it probably varies from fs to fs), but I have heard, for example, that each directory has a separate mutex, which can matter if many threads are writing files in the same directory.

One of the reasons is that directory listings are loaded into memory when they are read, and they are also sorted and checked. I learned this many years ago when I left a server hung because of a problem of this type.

I've been looking for one too, and concluded that for now SQLite is still the best option (although I also want access from multiple processes, so sled is not an option for me).

SQLite is not bad for a C dependency. As long as you stick to the core functionality and don't require plugins/extras or too-new features, it's available everywhere and easy to build from source.

4 Likes

I also recently looked into this, and SQLite is the way to go.

3 Likes

Right. I get that they are "pre-releases". But 124 of them, with no real release in three years? That's too strange to use.