x86_64 Linux, crash-safe log

  1. We are on x86_64 linux.

  2. Consider the following API:

pub struct Record(Vec<u8>);

pub struct SafeFileWriter { /* ... */ }
pub struct SafeFileReader { /* ... */ }

impl SafeFileWriter {
    pub fn new(s: &str) -> SafeFileWriter { /* ... */ }

    pub fn write(&mut self, r: &Record) { /* ... */ }
}

impl SafeFileReader {
    pub fn new(s: &str) -> SafeFileReader { /* ... */ }

    pub fn read_all(&mut self) -> Vec<Record> { /* ... */ }
}
  3. We want the following guarantee. For any records r_0, r_1, ..., r_k, r_{k+1} and any crash style C:

  • the calls write(r_0), write(r_1), ..., write(r_k) return

  • during the call write(r_{k+1}), the system crashes due to C

  • then, when we later read the file, we want it to return either [r_0, r_1, ..., r_k] or [r_0, r_1, ..., r_k, r_{k+1}]


By writing out a record r: Vec<u8> as (r.len(), checksum(r), r), we can easily detect partially written records (and ignore them). That is not my concern.
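For concreteness, that framing could be sketched like this (the checksum here is a simple placeholder, not a real CRC, and the function names are mine):

```rust
use std::convert::TryInto;

// Placeholder checksum; a real log would use a proper CRC
// (e.g. the `crc32fast` crate).
fn checksum(data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

/// Append one record to `out` as: 8-byte little-endian length,
/// 4-byte checksum, then the payload bytes.
fn encode_record(out: &mut Vec<u8>, payload: &[u8]) {
    out.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out.extend_from_slice(payload);
}

/// Try to decode one record at the start of `buf`. Returns the payload and
/// the number of bytes consumed, or None if the record is truncated or
/// corrupted — the "detect partially written records and ignore them" case.
fn decode_record(buf: &[u8]) -> Option<(&[u8], usize)> {
    if buf.len() < 12 {
        return None;
    }
    let len = u64::from_le_bytes(buf[0..8].try_into().unwrap()) as usize;
    let sum = u32::from_le_bytes(buf[8..12].try_into().unwrap());
    let rest = &buf[12..];
    if rest.len() < len {
        return None; // truncated tail (partial write)
    }
    let payload = &rest[..len];
    if checksum(payload) != sum {
        return None; // corrupted tail
    }
    Some((payload, 12 + len))
}
```

A reader then decodes records in a loop and stops at the first record that fails to decode, treating everything after it as a torn tail.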

My concern is as follows: when record r_{k+1} is being written and the machine crashes, what guarantees that the data already written for r_0, ..., r_k is not corrupted?

We can assume that our program is the only program editing the file. My main concern here is what promises the kernel / filesystem makes to us regarding writing records to files.

Since you are asking about (apparently) arbitrary failure modes, the easy (but unreasonable) answer would be "nothing". If "any crash style" includes "a nuclear weapon explodes nearby", then anything might be corrupted.

However, if we stay within reasonable limits of "crashing" (e.g. assuming hardware functions correctly), then I think this would be a fairly basic expectation. Writing to one part of a file shouldn't corrupt another part of it.

Furthermore, if you are willing to sacrifice some writing speed by flushing the file after having written each record successfully, then – as far as I understand – it should be guaranteed that a crash occurring during write(r_{k+1}) would leave the system in a state whereby at most r_{k+1} may be corrupted, and all the previous k records are guaranteed to have been written correctly.
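A minimal sketch of "flush after each record", assuming the record is already framed as bytes: on Linux, `File::sync_data` maps to fdatasync(2), which does not return until the kernel has pushed the data to the device. The path and function name here are illustrative:

```rust
use std::fs::OpenOptions;
use std::io::Write;

/// Append one framed record and force it to stable storage before
/// reporting success. Only after sync_data returns may the caller
/// assume the record would survive a crash.
fn append_record(path: &str, framed: &[u8]) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    f.write_all(framed)?; // may land only in the page cache...
    f.sync_data()?;       // ...until fdatasync pushes it to the device
    Ok(())
}
```

The cost is one device round trip per record, which is exactly the "sacrifice some writing speed" trade-off mentioned above.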

Unfortunately, OSes are known to contain bugs related to file flushing, so if you are interested in a more specific scenario, you could search for something like "x64 linux write syscall flush bug" to see if there are recent, relevant, and not yet fixed bugs.

Is this actually true? Suppose r_k ends at an offset that is not a multiple of 4 KiB. Then, when the write of r_{k+1} starts, it may touch the last page that r_k wrote to; is it possible, via a non-hardware-destruction crash, for that page to get corrupted?

Also, the metadata for the log file itself is stored on disk somewhere, right? So writing r_{k+1} requires updating that metadata, right? Can that, even during a non-hardware failure, get corrupted in a way that nukes the entire file?

I think you envision an entirely unrealistic picture of how modern OSes work. Note that with an HDD you only have maybe a hundred opportunities per second to store something on disk.

And OSes had to adapt to that. Thus when you call write, the data is not sent to disk. Far from it. It's sent to a kernel buffer, then it's mixed with other data; requests to write something to the HDD are put in a queue, reordered and merged, some may be postponed, and so on.

When you call fsync you just add a barrier to that queue.

Anything is possible, but that would be a bug in the kernel, too. In most cases the kernel assumes it can update metadata independently from file content. But you can always use truncate to reserve the space, and fsync is supposed to be a barrier for metadata updates, too.
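The "use truncate to reserve the space" idea could look like this: grow the file once up front with ftruncate (exposed in Rust as `File::set_len`), so subsequent appends don't change the file size on every record. Names and sizes here are illustrative:

```rust
use std::fs::OpenOptions;

/// Reserve `bytes` of (zero-filled) space in the file up front, so later
/// record writes don't need a size-changing metadata update each time.
fn preallocate(path: &str, bytes: u64) -> std::io::Result<()> {
    let f = OpenOptions::new().create(true).write(true).open(path)?;
    f.set_len(bytes)?; // ftruncate(2): one metadata update, extends with zeros
    f.sync_all()?;     // persist the new size (metadata) before relying on it
    Ok(())
}
```

Note a design consequence for the length+checksum framing: with preallocation the reader will hit zero bytes past the last real record, so it must treat a record that fails to decode (e.g. a zero length with a mismatched checksum region) as end-of-log.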

This being said, 99% of the time it's better to use one of two choices: either use a full-blown database and get ACID semantics for free, or, alternatively, just create a file, write everything into it, then do the close/fsync/rename dance.
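The close/fsync/rename dance, sketched under the assumption that both paths live on the same filesystem (the function and paths are illustrative): write the full new contents to a temporary file, fsync it, then rename(2) it over the final name. The rename is atomic, so a reader sees either the old contents or the new ones, never a mix.

```rust
use std::fs::{self, File};
use std::io::Write;

fn replace_file(final_path: &str, tmp_path: &str, data: &[u8]) -> std::io::Result<()> {
    let mut tmp = File::create(tmp_path)?;
    tmp.write_all(data)?;
    tmp.sync_all()?;                   // data + metadata of the temp file
    fs::rename(tmp_path, final_path)?; // atomic swap on the same filesystem
    // Durability of the rename itself: fsync the containing directory.
    if let Some(dir) = std::path::Path::new(final_path).parent() {
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```

The directory fsync at the end is the often-forgotten step: without it, a crash right after the rename can still roll the directory entry back to the old file.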


I do not think we are discussing the same issues at all.

For example: consider an SSD write-erase cycle. In my very limited understanding of SSDs, if we try to write to a location that already has data, the SSD has to do an erase, not only of the location we want to write to, but of the entire block it is part of, and then rewrite that entire block.

What if there is a power outage after the block erase but before the block rewrite? Is data lost there?

Now, there are a number of 'solutions' to this problem:

  1. We don't care and pretend this never happens.

  2. The kernel assumes that SSDs may have this problem, and works around it.

  3. SSDs don't have this problem, because internally they have a journal or WAL of sorts, where they write out the new block first, then do some remapping.

I'm in the 1% of the time where I need more durability than "write to a file and assume it is fine", but more performance than a SQL database. Example: something like Bitcask is close to the reliability/performance tradeoff I am looking for.

Phrased another way, imagine a world where you have an adversary that can pull the power plug on the computer at any time.

In such an adversarial world, what guarantees do the Linux kernel's filesystem calls / Rust std lib filesystem calls offer, if any at all? In such a world, how do we write the 'logger' described in the original question above? What performance losses do we suffer when coding in a 'defensive' way?
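For reference, here is a hedged sketch of what the SafeFileWriter from the original question could look like under the "adversary can pull the plug" model, using `&[u8]` in place of Record for brevity: each write appends a framed record (length, placeholder checksum, payload) and calls fdatasync before returning. This is an illustration of the defensive pattern, not a vetted implementation.

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;

pub struct SafeFileWriter {
    file: File,
}

impl SafeFileWriter {
    pub fn new(path: &str) -> std::io::Result<SafeFileWriter> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(SafeFileWriter { file })
    }

    /// Returns only after the record is (as far as the kernel promises)
    /// on stable storage. Throughput is therefore bounded by the device's
    /// sync latency — the performance price of the durability guarantee.
    pub fn write(&mut self, r: &[u8]) -> std::io::Result<()> {
        // Placeholder checksum; a real implementation would use a CRC.
        let sum = r
            .iter()
            .fold(0u32, |a, &b| a.wrapping_mul(31).wrapping_add(b as u32));
        self.file.write_all(&(r.len() as u64).to_le_bytes())?;
        self.file.write_all(&sum.to_le_bytes())?;
        self.file.write_all(r)?;
        self.file.sync_data()?; // fdatasync: the crash-safety barrier
        Ok(())
    }
}
```

On the guarantee side: a crash between write_all and sync_data can leave a torn record at the tail, which the checksum lets the reader detect and discard; what no userspace code can fix is a device that acknowledges flushes it hasn't performed, which is the SSD-caching caveat discussed below.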

Not even remotely close. You cannot write into a location where data already exists.

SSDs have a flash translation layer, which is needed to create the illusion that you can actually overwrite data. But that's not how it works at all: the data is copied to another pre-erased place, and then a mapping table is updated to create the illusion that you have changed one sector.

Practically speaking, an SSD is a mini-computer with its own CPU, RAM and OS.

Depends on what kind of SSD you have. An SSD operates on the same principle as a filesystem in an OS, and it supports barriers, too.

The biggest problem is not loss of data between when a block is erased and replaced; that's not an issue. SSDs do caching, and many cheap ones will report that data is written to flash while it's still in RAM.

This is very important for getting good numbers in benchmarks, and people look at those when buying SSDs.

How can the kernel work around it? It doesn't even know what, when, and how anything is actually written to flash!

Certain enterprise-grade ones offer such guarantees, yes. Most consumer ones just hope for the best. They try to minimize loss, but if you power them off frequently under load they will start losing data and, eventually, will destroy their internal data structures and become undetectable. Sometimes you may get a refund.

I haven't recommended that. Creating a new file, storing the data, closing and flushing, then renaming is a good way to guarantee consistency.

But if you are really sure you want to open that can of worms, then that article would be a good start; then you may want to see if you can afford an enterprise-grade SSD with appropriate promises, and so on.

Believe me: there is an insane amount of complexity in this topic. That's why it's usually a good idea to leave it to the people who make databases.

But if you need it then you need it, I guess.

Read the appropriate article. And appropriate links there.

This question is not related to Rust at all.

After a certain number of such cycles (if they happen while the SSD is in active use), a cheap SSD will stop responding to host requests and all data will be lost.

That's the only thing that can be guaranteed. If you need something more, prepare to deal with the wonderful world of enterprise-grade SSDs, enterprise-grade SD cards, enterprise-grade USB sticks, and so on.

You'll want to look at your filesystem options too. I'm not an expert, so don't take this as gospel, but anecdotally ZFS has a good reputation for example, with data and metadata checksumming, copy-on-write blocks, and the like. You probably also want redundancy, some sort of RAID or other mirroring arrangement.

More generally, the line between software and hardware is vague, and there's also a gargantuan number of possible software arrangements and layers that fall under "x86_64 linux". In addition to the kernel and which filesystem you use, there are also say, raid card drivers, on chip firmware, etc etc. This is reflected in the thread already, which started with "ignore hardware errors" and quickly moved on to "how do (some) SSDs work internally". And no matter how many qualifiers get stacked up, the answer will never be "0 chance of errors", it will just be "acceptably low chance".

(And yeah, none of this really has anything to do with the Rust language.)


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.