I asked this two years ago, but ended up using a directory of files. Now I actually need a database with indices to store mostly image blobs, but some other things as well.
Back then, the best choice seemed to be sled. Two years later, that's been abandoned. Last update 3 years ago. Never made it out of beta.
There's Limbo, an attempt to rewrite SQLite in Rust. It's not done yet.
There's always SQLite, but I'd like to stay in Rust and not deal with the problems of a C subsystem.
Do you know of any problems specifically with the rusqlite crate? There is a bundled feature that avoids linking issues, so there shouldn't be any build problems.
bundled uses a bundled version of SQLite. This is a good option for cases where linking to SQLite is complicated, such as Windows.
The crate is very popular (heavily used) and SQLite itself is extremely well tested and reliable.
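For what it's worth, here is a minimal sketch of the bundled setup and a blob round-trip. The database path, table name, and version number are just examples, not anything prescribed by rusqlite:

```rust
// Cargo.toml:
// [dependencies]
// rusqlite = { version = "0.31", features = ["bundled"] }

use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    // Example path; the bundled feature means no system SQLite is needed.
    let conn = Connection::open("images.db")?;

    conn.execute(
        "CREATE TABLE IF NOT EXISTS images (
            name TEXT PRIMARY KEY,
            data BLOB NOT NULL
        )",
        [],
    )?;

    // Pretend these bytes came from disk or the network.
    let bytes: Vec<u8> = vec![0xFF, 0xD8, 0xFF, 0xE0];

    conn.execute(
        "INSERT OR REPLACE INTO images (name, data) VALUES (?1, ?2)",
        params!["cat.jpg", bytes],
    )?;

    // Read it back; the primary-key index makes the lookup cheap.
    let data: Vec<u8> = conn.query_row(
        "SELECT data FROM images WHERE name = ?1",
        params!["cat.jpg"],
        |row| row.get(0),
    )?;
    println!("read {} bytes", data.len());
    Ok(())
}
```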
You're misrepresenting its development. There have been around 25 alpha releases of 1.0.0 since then, the most recent being published 4 months ago: sled versions. It isn't abandoned. They are trying to stabilize.
Most of the pure Rust alternatives that I know of are KV stores, FWIW.
For large files, the plain on-disk option is IMO still valid.
I don't know the details, but if it is a cache it must be fast. Any database will read the data from disk and then send it over IPC, usually a TCP connection, which is slow and resource-consuming.
One would normally assume so. However, I have recently seen it claimed that if one wants to store a large number of not-so-big files, keeping them in an SQLite database can be faster. It has no networking/IPC overhead, and presumably looking things up in there is faster than finding files in a file system directory. Sorry, I can't link to any evidence of this.
In some of my projects I save the images in a DB and use a cache on top of it; when there is a lot of traffic, the difference in performance is quite noticeable.
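Not my exact code, but the shape of it is roughly this minimal sketch. `ImageStore` and `load_from_db` are made-up names standing in for the real DB query:

```rust
use std::collections::HashMap;

/// Hypothetical wrapper: an in-memory cache in front of whatever DB you use.
struct ImageStore {
    cache: HashMap<String, Vec<u8>>,
}

impl ImageStore {
    fn get(&mut self, name: &str) -> Vec<u8> {
        if let Some(bytes) = self.cache.get(name) {
            return bytes.clone(); // hot path: served from memory, no DB access
        }
        let bytes = load_from_db(name); // cold path: one DB read, then cached
        self.cache.insert(name.to_string(), bytes.clone());
        bytes
    }
}

// Stand-in for the real query (e.g. a SELECT on a BLOB column).
fn load_from_db(name: &str) -> Vec<u8> {
    format!("pretend these are the bytes of {name}").into_bytes()
}

fn main() {
    let mut store = ImageStore { cache: HashMap::new() };
    let first = store.get("cat.jpg");  // miss: hits the "DB", fills the cache
    let second = store.get("cat.jpg"); // hit: no DB access at all
    assert_eq!(first, second);
}
```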
I speak only from my personal experience, but the DB is going to read the data from the disk and then send it. Won't it be faster to read it directly from the disk than to add extra steps, no matter how fast they are?
An important thing is to avoid directories with thousands of files; it is better to make subdirectories. You can do that by hashing the file name and using its first characters as the subdirectory name, and if there are many files you can use several levels.
The same hash and the subdirectory levels serve as an index, so you go directly to the file you need.
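A rough sketch of that idea (the hasher and the two-level split are arbitrary choices for illustration; a real cache would more likely use a stable content hash such as SHA-256):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Build a sharded path like `cache/ab/cd/<name>` from the file name.
/// Note: DefaultHasher is not guaranteed stable across Rust versions,
/// so a persistent cache would want a fixed hash function instead.
fn sharded_path(root: &str, name: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    name.hash(&mut hasher);
    let hex = format!("{:016x}", hasher.finish());

    let mut path = PathBuf::from(root);
    path.push(&hex[0..2]); // first level: up to 256 subdirectories
    path.push(&hex[2..4]); // second level: 256 more under each
    path.push(name);
    path
}

fn main() {
    println!("{}", sharded_path("cache", "cat.jpg").display());
}
```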
I know very little about databases, but I'm guessing that a DB might well map a lot of its data into memory and then have fast indices to look things up, which might be quicker than looking things up in a file system by file name. It also skips some user-space to kernel-space transitions and thread rescheduling, I guess.
Also, SQLite has no "send" overhead; it's all linked into your program's memory space.
At the end of the day one has to try these things for oneself, on one's own data sets, to make any performance comparisons.
I guess
An important issue is to avoid directories with thousands of files...
The "test1.dir" directory created above puts all the blobs into a single folder. It was conjectured that some operating systems would perform poorly when a single directory contains 100,000 objects.
No, it's not messed up. The semver pre-release suffix changes default visibility and dependency resolution: these are pre-releases, and you have to opt in explicitly if you want to depend on them.
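Concretely, Cargo's resolver never picks a pre-release for a normal version requirement, so a Cargo.toml would look something like this (the version numbers are only examples; check crates.io for the current ones):

```toml
[dependencies]
# A normal requirement never matches a pre-release; since there is no
# stable 1.x yet, `sled = "1"` simply fails to resolve instead of
# silently pulling in an alpha. The stable line you get by default:
sled = "0.34"

# To try the rewrite you must name a pre-release explicitly, e.g.:
# sled = "1.0.0-alpha.124"   # example number only; check crates.io
```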
Exactly. What if the database, like SQLite, IS the cache?
Whichever way, some people at least have measured that, for their workload on their platform, SQLite performed better than relying on the file system. We should not make categorical statements about such things without actually doing the measurements.
I believe it is common to split a cache on the filesystem into a few layers to avoid overly large directories. E.g. if every resource is named by a hash, have a separate directory for each of the first two hex digits of the hash. Tools like sccache do this for build caching.
I don't know the exact reason some file systems have issues with very large directories (and it probably varies from fs to fs), but I have heard, for example, that each directory has a separate mutex, which can matter if many threads are writing files in the same directory.
One of the reasons is that directory entries are loaded into memory when they are read; they are also sorted and various checks are performed. I learned this many years ago when I left a server hung because of a problem of this kind.
I've been looking for one too, and concluded that for now SQLite is still the best option (although I also want access from multiple processes, so sled is not an option for me).
SQLite is not bad for a C dependency. As long as you stick to core functionality and don't require plugins/extras or very new features, it's available everywhere and easy to build from source.