Cost of File open/close

I have a server where, every so many requests, the state of the server is written to a file.

Currently, every time I write to the file the File is opened and closed.

The File could be opened at server start, stored in a struct that stays in scope for the life of the server (the File never closes), and then be written to without having to open/close it each time.
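
Roughly what I have in mind, as a sketch (names are hypothetical, and serializing the state is elided):

```rust
use std::fs::{File, OpenOptions};
use std::io::{Seek, SeekFrom, Write};

struct Server {
    state_file: File, // opened once at startup, never closed
}

impl Server {
    fn new(path: &str) -> std::io::Result<Self> {
        let state_file = OpenOptions::new().create(true).write(true).open(path)?;
        Ok(Server { state_file })
    }

    fn write_state(&mut self, state: &[u8]) -> std::io::Result<()> {
        // Overwrite the previous snapshot in place; no open/close per write.
        self.state_file.set_len(0)?;
        self.state_file.seek(SeekFrom::Start(0))?;
        self.state_file.write_all(state)
    }
}
```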

I think this would improve the performance of the writes, but I'm not sure exactly what the performance cost of the open/closes I'd be eliminating is. Does anyone have insight into this?

Also, is keeping a File open too long dangerous and/or not worth the savings?

The cost is relative, which is why the usual advice is to profile before optimizing, unless you have prior experience with something similar and know in advance that it needs optimizing.

Opening a file is more expensive than a single read or write, so files are often kept open when IO is frequent. The main drawback of keeping files open is that the OS limits the number of files a process may have open at once, so it is a mistake to keep an unbounded number of files open. Each open file also occupies some memory, but the amount is small enough that it is not usually an important factor.

Note that closing a file does not flush its contents to storage. If you rely on the data being durable for crash recovery, you have to call File::sync_all explicitly; otherwise, the OS flushes the file's contents to storage lazily.
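
For example, a minimal sketch (the checkpoint helper is my own name for it):

```rust
use std::fs::File;
use std::io::Write;

// Hypothetical helper: write a state snapshot and make it durable.
fn checkpoint(file: &mut File, state: &[u8]) -> std::io::Result<()> {
    file.write_all(state)?;
    // Push file data and metadata all the way to the storage device;
    // closing or dropping the File alone does not guarantee this.
    file.sync_all()
}
```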

PS. I assume you're not accessing the file from more than one process.

I've heard a story that the original motivation for inventing “file descriptors” in Unix (the thing that Rust's File is an interface to) was because it was too expensive to look up a filename and check permissions on every read or write. So, the costs were, back then, significant enough to warrant introducing a new abstraction to the kernel. Performance characteristics have changed a lot — large caches exist and SSDs no longer have the cost of seeking between directory data and file data — but expectations have risen too.

There is no general risk to keeping a file open, but you should consider these factors:

  • If another program (“the user”) moves the file while your program has it open, then your program will continue to write to it in the new location, rather than creating a new file in the old location, which might be desirable or undesirable.

  • Think about what you want to happen if your program crashes (or the computer loses power) in the middle of writing new data to the file. How do you arrange things so that you're not left with a half-written file and zero good copies of the data?

    This is, in general, a very hard problem — database software engineers spend a lot of effort on ensuring data written to disk will always be recoverable. However, there are some cheap mostly-good-enough solutions, such as the “atomic write” trick: create a new file with the new data, and then once it is fully written, rename it to the existing name. This way, the old data is not deleted until the new data is ready. This is robust against program crashes, but not necessarily against power loss. If you use this technique, then you are necessarily not keeping the file open.
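
A minimal sketch of that trick (the helper name and the .tmp suffix are my own choices):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Hypothetical helper: replace the file at `path` atomically.
fn atomic_write(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(data)?;
    // Flush the new data to the device first, so a power loss can't
    // leave a torn temporary file behind the rename.
    file.sync_all()?;
    // Within one filesystem, rename is atomic: readers see either the
    // old file or the complete new one, never a half-written mix.
    fs::rename(&tmp, path)
}
```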

Thank you, this definitely answers my primary question/concern. I think the IO is frequent enough that it warrants keeping the file open. All the other information you gave was also super helpful!

Thank you too, this answers the second part of my question. I also appreciate the background on file descriptors! 🙂

You've both given me a lot to think about. I've written a DHCP server (crate toe-beans), and the leases file is the one I'm reasoning about. It's tricky because I'm not ready for a full database just yet, the file varies in write frequency, and it would ideally be synced and atomically written (though that's not strictly required, since DHCP is fairly fault tolerant).

Anywhooo, I have enough info to make some improvements. Thanks again!

What would being "ready for a full database" be?

"Use SQLite" is a very common recommendation. It means you don't have to think about any of these file IO problems.

The main reason is that I was originally replacing a dnsmasq instance, and dnsmasq uses a file instead of a database, so it was fairly trivial to match that behavior with serde. I'm also trying to be fairly conservative about dependencies.

I'm worried that a database would change how the leasing is handled, so I'm waiting for the leasing logic to settle down a bit first.

I totally agree that a SQLite/Turso/etc. database would work better for most use cases, and I'll evaluate how it fits this one.

Filesystems are great at appending, so "open in append, write to it, close" is perfectly fine for any amount of output that you as a human would ever be willing to look at, even if you do the open-close for every line.
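
For instance (a hypothetical helper, with error handling kept minimal):

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Open in append mode, write one line, close when `file` is dropped.
fn append_line(path: &str, line: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{line}")
}
```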

That said, this is also a great way to accidentally write a Schlemiel the Painter algorithm if you're on a machine with a virus scanner. Virus scanners love to re-scan the entire file when you close it, turning your works-great algorithm into something that mysteriously gets slower and slower over time as the output file gets bigger.
