When does the size of a file change?

I think I probably know the answer ("undefined") but I wanted to check my understanding.

The question is: if I have a file and I start writing past the end of it, so it is growing, and then the machine crashes (let's say the power fails or the virtual server is brutally terminated) before the file is flushed or closed, what will the size of the file be when the computer restarts?

My guess is that it cannot be depended on. The reason I am thinking about the issue is that I am implementing an optimisation of how a database file is updated. Up to now, I first save all the updates (write operations) into a temporary file, make sure that is saved, and then apply the updates to the main file. In the event of a failure, the temporary file is used to apply all the updates (provided the temporary file reached its save point).
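
For concreteness, here is a minimal Rust sketch of that scheme; the `Update` type, record layout, and function names are illustrative, not the actual implementation:

```rust
use std::fs::{File, OpenOptions};
use std::io::{Result, Seek, SeekFrom, Write};

/// One pending write operation: bytes to be placed at an offset in the main file.
struct Update {
    offset: u64,
    data: Vec<u8>,
}

/// Stage the updates in a temporary file and make that durable before
/// any of them touch the main file.
fn commit(main_path: &str, temp_path: &str, updates: &[Update]) -> Result<()> {
    // 1. Write every update (offset, length, bytes) to the temporary file.
    let mut temp = File::create(temp_path)?;
    for u in updates {
        temp.write_all(&u.offset.to_le_bytes())?;
        temp.write_all(&(u.data.len() as u64).to_le_bytes())?;
        temp.write_all(&u.data)?;
    }
    // 2. fsync the temporary file: only now has the "save point" been reached.
    temp.sync_all()?;

    // 3. Apply the updates to the main file. If the machine crashes during
    //    this step, recovery replays the temporary file instead.
    let mut main = OpenOptions::new().write(true).open(main_path)?;
    for u in updates {
        main.seek(SeekFrom::Start(u.offset))?;
        main.write_all(&u.data)?;
    }
    main.sync_all()?;
    Ok(())
}
```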

Then this morning I thought of this optimisation: updates to the main file beyond its current end can be written directly, without going into the temporary file. This is "safe" because they are not overwriting anything.

However, this could potentially result in the main file's size changing before the temporary file is finalised. So... I think I mustn't rely on the file system for the size of the main file; instead I need to store the size somewhere in the file itself.
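
A minimal sketch of that idea, assuming the logical size is kept as an 8-byte little-endian value in a fixed header at offset 0 (the layout is illustrative); recovery then trusts this header rather than the size reported by the filesystem:

```rust
use std::fs::File;
use std::io::{Read, Result, Seek, SeekFrom, Write};

/// Read the logical size recorded in the file's own header.
fn read_logical_size(file: &mut File) -> Result<u64> {
    let mut buf = [0u8; 8];
    file.seek(SeekFrom::Start(0))?;
    file.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

/// Record a new logical size in the header; this only becomes durable
/// through the normal commit protocol (temporary file + fsync).
fn write_logical_size(file: &mut File, size: u64) -> Result<()> {
    file.seek(SeekFrom::Start(0))?;
    file.write_all(&size.to_le_bytes())?;
    Ok(())
}
```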

Right. Only an fsync (File::sync_all) is guaranteed to make the new size persistent, but the size can also become persistent at any time before that if you're appending. There are very few guarantees -- write order is not guaranteed either, which is why checksums are normally used.
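
As an illustration, a durable append might look like the sketch below: each record is framed with a length and a checksum (a simple FNV-1a hash written inline here, purely for the example) so that torn or reordered writes can be detected at recovery, and nothing is assumed durable until sync_all returns.

```rust
use std::fs::OpenOptions;
use std::io::{Result, Write};

/// 64-bit FNV-1a hash, used here as a cheap checksum.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Append one record (length, payload, checksum) and fsync it.
fn append_record(path: &str, payload: &[u8]) -> Result<()> {
    let mut file = OpenOptions::new().append(true).create(true).open(path)?;
    file.write_all(&(payload.len() as u64).to_le_bytes())?;
    file.write_all(payload)?;
    file.write_all(&fnv1a(payload).to_le_bytes())?;
    // Only after sync_all returns is the record guaranteed to be on disk;
    // before that, any prefix of it (or none of it) may survive a crash.
    file.sync_all()?;
    Ok(())
}
```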

Some specific filesystems have more guarantees, but that doesn't help us in the general case.


Thanks. Incidentally I checked how SQLite works, and it does it a different way: it saves the "old" pages that are about to be overwritten in a "rollback" journal file. So they do get my "optimisation" for free (obviously data beyond the end of the file doesn't need to be saved, as it doesn't exist!). However I feel writing the new data to the journal file is probably a better solution, as the SQLite method requires the old data to be read, which is potentially an extra cost.
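
A rough sketch of that rollback-journal idea (the page size and journal record layout are made up for the example): the old contents of a page are read and appended to the journal before the page is overwritten, and pages beyond the current end of the file are skipped.

```rust
use std::fs::File;
use std::io::{Read, Result, Seek, SeekFrom, Write};

const PAGE_SIZE: usize = 4096;

/// Save a page's current contents to the journal before it is overwritten.
fn journal_old_page(main: &mut File, journal: &mut File, page_no: u64) -> Result<()> {
    let offset = page_no * PAGE_SIZE as u64;
    // Pages beyond the current end of the file have no old contents to save.
    if offset >= main.metadata()?.len() {
        return Ok(());
    }
    let mut old = [0u8; PAGE_SIZE];
    main.seek(SeekFrom::Start(offset))?;
    main.read_exact(&mut old)?; // the extra read this approach requires
    journal.write_all(&page_no.to_le_bytes())?;
    journal.write_all(&old)?;
    Ok(())
}
```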

Yes, many databases log a page the first time it is modified after a checkpoint -- Postgres does this, and I understand SQLite followed the Postgres design.

Some other databases use a no-overwrite approach: they write the changes for a page to a new page, allocating the new page and copying the old one the first time it is dirtied after a checkpoint. The old pages are not reclaimed or reused until after the next checkpoint.

A drawback of journalling the new data is that you log the page for every change in the checkpoint cycle, rather than just once per checkpoint cycle.

I am assuming here that, like most databases, you also have a transaction log (write-ahead log) that is applied during recovery to the last checkpointed files.
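
For illustration, recovery with such a log might look like the sketch below, reusing the length/payload/checksum framing from the earlier append example; it reads records in order and stops at the first incomplete or corrupt one, which marks where the crash happened.

```rust
use std::fs::File;
use std::io::{Read, Result};

/// 64-bit FNV-1a hash, matching the checksum used when appending.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn read_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}

/// Return the payloads of all complete, correctly checksummed records.
fn read_log(path: &str) -> Result<Vec<Vec<u8>>> {
    let mut log = Vec::new();
    File::open(path)?.read_to_end(&mut log)?;
    let mut records = Vec::new();
    let mut pos = 0;
    while pos + 8 <= log.len() {
        let len = read_u64(&log[pos..]) as usize;
        if len > log.len() {
            break; // implausible length: treat as a torn record
        }
        let end = pos + 8 + len + 8;
        if end > log.len() {
            break; // truncated record: the crash happened mid-write
        }
        let payload = &log[pos + 8..pos + 8 + len];
        if fnv1a(payload) != read_u64(&log[pos + 8 + len..]) {
            break; // checksum mismatch: torn or reordered write
        }
        records.push(payload.to_vec());
        pos = end;
    }
    Ok(records)
}
```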


Something I have been wondering about: it seems like traditional rotating hard disks are pretty much dead (as far as I know) and everything is now non-volatile memory (NVM) / solid-state drives (SSDs), which probably has some implications for the implementation of a database.

All IO is much faster with SSDs than with spinning disks; however, SSDs still have much faster sequential IO than random IO. So there is still an advantage to designing to maximize sequential IO, although it is not as critical with SSDs as it is with spinning disk.

Spinning disks aren't dead, because they cost so much less than SSDs per GiB. If you have strict read-latency requirements, then you either need SSDs or enough RAM for a page cache large enough to hold the working set. But sometimes you have large amounts of data and no strict read-latency requirements, and then you can save money with spinning disk.

Even with plenty of RAM for the page cache and good read latency, write latency may be poor with spinning disk due to transaction commits. But you can choose to put your transaction log on SSD and the data on spinning disk.
