Is constantly opening the same file a better approach?

Hey there.
I am working with some code that depends on decoding and reading a file. After startup, the selected file should not change, so right now my code just saves the file handle in a struct, which means the program keeps the file open the whole time it is running. The problem is that the struct managing the file handle needs to be shared across multiple functions, which forces me to wrap the file handle in Rc + RefCell and then expose an immutable interface like read(&self) -> u64. I don't like where this is leading, so I realized I can avoid all this altogether if I instead share the file path and open + read the file where needed. Is this a better approach? Is constantly opening the same file a significant performance hit?

Just multiple functions or also multiple threads? If you're not using threads, I don't see the need for Rc/RefCell; just make the struct mutable so you can call mutating read methods on the File within.
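Something like this, I mean (a sketch; the method name and buffer handling are just for illustration):

use std::fs::File;
use std::io::{Read, Result};

struct FileManager {
    handle: File,
}

impl FileManager {
    // With no sharing involved, a plain &mut self method is enough.
    fn read_some(&mut self, buf: &mut [u8]) -> Result<usize> {
        self.handle.read(buf)
    }
}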

That's because I also have some structs that must use the file manager, so in practice I have pub struct FileManager { handle: File } and then pub struct CustomStruct { file_manager: Rc<FileManager> }. Now this is a problem, because read needs a mutable reference, so I wrap the file handle in a RefCell and then:

impl FileManager {
    pub fn read(&self) {
        // the binding needs `mut` so the File inside the RefMut can be borrowed mutably
        let mut handle = self.handle.borrow_mut();
        // calls handle.read and does some checks
    }
}
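So the overall shape is roughly:

use std::cell::RefCell;
use std::fs::File;
use std::rc::Rc;

pub struct FileManager {
    // RefCell so read() can take &self yet still get a mutable borrow of the File
    handle: RefCell<File>,
}

pub struct CustomStruct {
    file_manager: Rc<FileManager>,
}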

So are you using threads or not? Your answer doesn't mention threads, but then why are you using Rc?

Read and friends are implemented on &File as well as File, and various inherent methods take &self too. (From a Rust language POV, File contains interior mutability; more technically it's because a File just wraps some OS resource identifier which is never itself modified.)


As per the above, you can get a &File out of your struct and then pass a &mut &File. On the other hand, multiple users of the same File can interfere with each other by changing the shared current position within the file. So you still have a problem if reads could happen concurrently.

If they don't, you are perhaps fine, either due to how your code is structured, or by using the Seek trait to reset your position when sensible.
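For example, something like this compiles (helper name made up; note that reading this way still advances the file's single shared position):

use std::fs::File;
use std::io::{Read, Result};

// Read is implemented for &File, so a shared reference is enough;
// the binding just has to be mutable so read_exact can take &mut &File.
fn read_header(file: &File) -> Result<[u8; 16]> {
    let mut buf = [0u8; 16];
    let mut handle = file; // handle: &File
    handle.read_exact(&mut buf)?;
    Ok(buf)
}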


I am not using threads, but the FileManager is acting as a singleton to control access to the file. So in order to share the FileManager, I need Rc.

Given that the program relies on the file's contents not changing anyway, would it be practical to read the file in full, once, on startup, and store the resulting structures instead of storing a handle to the file they came from?

Opening a file generally involves a bunch of work on the OS side - walking the directory tree to check permissions, for example, or network calls if the file is on a remote filesystem. I would generally want to avoid repeatedly opening the file, if it were me, though I might measure the actual impact to vet that preference before making a decision.
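If you do go the read-it-all-up-front route, that part at least is a one-liner (the path here is a placeholder):

use std::fs;
use std::io;

// Read the entire file into memory once at startup; parsing into real
// structures would then work off this Vec.
fn load_asset_file(path: &str) -> io::Result<Vec<u8>> {
    fs::read(path)
}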


The file is 1 MiB+; I have no clue whether it is a good idea to keep all of that in memory as some Vec.

How much memory do you expect your target devices to have available?

For a phone or laptop, or for anything more capable, I don't think I'd stop to consider the impact of holding that in memory. Even allowing a very generous factor of ten increase in size in turning the raw data into a meaningful data structure, that's far too little memory to be concerned about when the device has gibibytes of memory available.

For something smaller - watch applications, embedded systems that run in memory-constrained environments, or even some of the smaller system-in-a-box devices - then leaving that data on disk until you need it may well be worth the effort. That's where I'd start spending time measuring the impact of repeated open calls and repeatedly parsing the data vs. the impact of loading the data up front.


On a personal computer, with the file left open, the contents will be cached in memory anyway. Even if you close the file, unless there is memory pressure, the contents will still be cached in memory.

In other words, the three major operating systems really prefer to avoid touching non-volatile storage to the greatest extent possible.


1 MB would have been a big deal in… what, 1990? Today it's nothing. (Unless you are developing for something really, really small, like a µC with a few KB of RAM. But you would know if you were doing that.)


If what you actually want is to track changes to the file, ensure synchronization, etc. instead of just "assuming it won't change", then what you want is a database. (There are plenty of file-based databases to choose from nowadays, SQLite and Sled being two excellent options, for example.)


Is there a reason you don't just store references to the FileManager instead?

The whole codebase is too long to post here, but I also have an LRU caching system to avoid the OS round trip. Now, the thing with caching systems, at least in my limited experience, is that they work horribly with references. I ended up solving this problem with Rc, and I find it really suits the ownership model when working with a cache.
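To give the rough idea (simplified to a plain map, LRU eviction left out):

use std::collections::HashMap;
use std::rc::Rc;

struct AssetCache {
    entries: HashMap<u64, Rc<Vec<u8>>>,
}

impl AssetCache {
    // Callers get an Rc clone, so they co-own the cached bytes instead of
    // borrowing from the cache and tying their lifetime to it.
    fn get_or_load(&mut self, id: u64, load: impl FnOnce() -> Vec<u8>) -> Rc<Vec<u8>> {
        Rc::clone(self.entries.entry(id).or_insert_with(|| Rc::new(load())))
    }
}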

No tracking; the whole file is actually used as a kind of asset bank where I can find images and audio data. The file can potentially become quite large, say 100 MiB at most. Would it be fine to store it as some array or Vec even in that scenario? I ask this as someone with little technical knowledge of data storage and performance-related issues.

I am targeting modern computers, so 2GiB RAM at the very least.

If the primary purpose of the application is directly supported by that file, then that's just the cost of doing business.

What are you doing with the file contents? What format is the file?

If you need an independent seek position for each operation then use separate File instances. If the seek position should be shared or doesn't matter because you seek before each read or use read_at then you can use &File which can be shared.

You can also try_clone a file instead of opening it again and again. This is a relatively cheap operation (in this case it will also share a seek position).

Opening the same file multiple times doesn't consume all that many system resources: primarily a file descriptor in the current process, which on Unixes is limited by default (though the limit can often be raised on demand).
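For instance (the path is just a placeholder):

use std::fs::File;
use std::io::{Read, Result, Seek, SeekFrom};

fn main() -> Result<()> {
    let first = File::open("assets.bin")?;
    // Cheap: duplicates the existing handle instead of re-opening the path.
    let mut second = first.try_clone()?;

    // Both handles share one seek position: seeking through `second`
    // also moves the position that `first` sees.
    second.seek(SeekFrom::Start(128))?;
    let mut buf = [0u8; 4];
    second.read_exact(&mut buf)?;
    Ok(())
}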


Interestingly, the reason the slow OS here (Windows) is slow is mostly that it synchronously scans a file for viruses when a handle that was used for writing is closed. Opening a file for reading is still not nearly as fast as on Linux due to a bunch of small differences, both historical and in design, but for the most part it's not a significant concern for general usage.

That said, there's no reason to make this more expensive than it needs to be, and there's a decently common design for implementing asset package readers (a rough sketch in code follows the list):

  • open the file once
  • read whatever header/directory metadata needed to verify the file and skip reading the same stuff for every asset
  • wrap these in an API that uses read_at / seek_read on the unix/windows FileExts to expose a Read + Seek impl (possibly just with a Cursor over reading the whole thing as a vec to start with)
  • directly wrap the API in Rc/Arc as it now has no mutable methods
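Here is one way that last part can look, using the unix FileExt (names like AssetReader are made up; on Windows you'd use seek_read from std::os::windows::fs::FileExt instead):

use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::os::unix::fs::FileExt;
use std::rc::Rc;

// One shared File, one of these per asset; each reader keeps its own
// position and never touches the File's seek cursor (read_at is pread).
pub struct AssetReader {
    file: Rc<File>,
    start: u64, // asset offset within the package, taken from the directory
    len: u64,   // asset length, taken from the directory
    pos: u64,   // this reader's private cursor
}

impl Read for AssetReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let remaining = self.len.saturating_sub(self.pos);
        let want = buf.len().min(remaining as usize);
        let n = self.file.read_at(&mut buf[..want], self.start + self.pos)?;
        self.pos += n as u64;
        Ok(n)
    }
}

impl Seek for AssetReader {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        let new = match pos {
            SeekFrom::Start(offset) => offset as i64,
            SeekFrom::End(offset) => self.len as i64 + offset,
            SeekFrom::Current(offset) => self.pos as i64 + offset,
        };
        if new < 0 {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "seek before start"));
        }
        self.pos = new as u64;
        Ok(self.pos)
    }
}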

When needed, the contents of the file are parsed into specific structs and then processed by the video or audio system.

That's kind of what I am already doing: open the file at startup and save the handle in the FileManager struct, then wrap the handle in a RefCell in order to expose an immutable version of the read function. Where needed, pass the Rc<FileManager> around to get some image/audio data. I wanted to eliminate the use of RefCell to more easily (and safely) share the file handle, and it does seem to be possible. Others pointed out I could also just read the whole file into some Vec/array and copy portions of it where needed.