Reliably working with NFS shares (I'm on Linux)

Hello everyone!

I'd like to ask your advice about working with files on remote NFS shares. Compared to local file access, I see several problems:

  • The O_APPEND flag does not work correctly on NFS -- the append is not atomic, so concurrent writers can race and overwrite each other's data. This can be solved either through advisory locking (which must be used consistently by all clients) or by avoiding append mode on NFS files entirely: don't open NFS files with O_APPEND, don't set this flag on NFS file descriptors later, and check that inherited/received file descriptors don't both refer to an NFS file and have O_APPEND set (see the O_APPEND check sketch after this list)
  • BSD-style locking (through flock()) doesn't work correctly on NFS -- it is emulated with whole-file POSIX locks. On Linux, BSD-style locks are a separate file locking facility that does not interact with POSIX record locks or open file description locks (which means two processes can hold an exclusive open file description lock and an exclusive BSD-style lock on the same file at the same time), so if both local and remote processes set BSD-style locks on the same file, they will not be synchronized with each other. The solution is to avoid BSD-style locks and use open file description locks instead, which avoid the problems of both POSIX record locks and BSD-style locks: they interact with POSIX record locks, they support byte-range locking, and they are owned by an open file description and only auto-released when that open file description is closed (see the OFD lock sketch after this list)
  • NFS does not support remote inotify notifications: registering an NFS target with an inotify instance will succeed, but the watch will never generate any events. For event loops built around file descriptors, one possible workaround is to watch both an inotify instance and a timerfd instance and trigger the necessary actions on either a filesystem event or a timer expiration (a poll()-based sketch follows this list). The timer can be disabled, or set to a longer interval, after the first inotify notification arrives, since that means the filesystem does support inotify
  • The most important issue for me is that the usual filesystem interfaces are blocking, and with network access they may block for quite a long time. Worse, a thread can enter "uninterruptible sleep", in which it won't wake up even to handle a signal (the whole process can still die if the main thread returns from main()). One solution is asynchronous input/output: if the security policy allows it, a process can use io_uring, and otherwise it can fall back to the kernel asynchronous I/O facility through wrappers around libaio. However, asynchronous input/output does not cover all operations -- submitting an I/O request requires an open file descriptor, and opening one can also block. My proposals are the following:
    • Linux applications that are sensitive to blocking file I/O can implement an "NFS-aware" mode, probably as an optional feature
    • Caching file metadata and directory listings, as well as keeping file descriptors open, can avoid blocking operations in some cases
    • Asynchronous input/output can help NFS-aware applications to avoid being blocked
    • Operations that are not supported by AIO (such as opening a file descriptor or requesting metadata) can be delegated to worker threads
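
For the O_APPEND point, here is a minimal sketch of the check, assuming Linux and the libc crate (the NFS magic number is taken from linux/magic.h; error handling is kept minimal):

    use std::io;
    use std::os::unix::io::RawFd;

    const NFS_SUPER_MAGIC: i64 = 0x6969; // from <linux/magic.h>

    /// Returns true if `fd` refers to a file on NFS *and* has O_APPEND set.
    fn is_nfs_append_fd(fd: RawFd) -> io::Result<bool> {
        // Which filesystem does this descriptor live on?
        let mut st: libc::statfs = unsafe { std::mem::zeroed() };
        if unsafe { libc::fstatfs(fd, &mut st) } == -1 {
            return Err(io::Error::last_os_error());
        }
        // Was the file opened with (or later switched to) append mode?
        let flags = unsafe { libc::fcntl(fd, libc::F_GETFL) };
        if flags == -1 {
            return Err(io::Error::last_os_error());
        }
        Ok(st.f_type as i64 == NFS_SUPER_MAGIC && flags & libc::O_APPEND != 0)
    }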

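The OFD lock sketch referenced in the locking item, again assuming Linux and the libc crate; it locks the whole file and blocks until the lock is granted:

    use std::io;
    use std::os::unix::io::AsRawFd;

    /// Takes an exclusive whole-file open file description (OFD) lock.
    fn lock_whole_file_exclusive(file: &std::fs::File) -> io::Result<()> {
        let mut lk: libc::flock = unsafe { std::mem::zeroed() };
        lk.l_type = libc::F_WRLCK as libc::c_short;
        lk.l_whence = libc::SEEK_SET as libc::c_short;
        lk.l_start = 0;
        lk.l_len = 0; // 0 means "to the end of the file", i.e. the whole file
        lk.l_pid = 0; // must be 0 for OFD locks
        // F_OFD_SETLKW blocks until the lock is acquired; the lock is owned by
        // the open file description and released when that description is closed.
        if unsafe {
            libc::fcntl(file.as_raw_fd(), libc::F_OFD_SETLKW, &lk as *const libc::flock)
        } == -1 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }
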
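And the poll()-based fallback for the inotify item: watch an inotify instance and a timerfd at the same time and react to whichever fires first. This is only a sketch using the libc crate; the path and the 5-second interval are examples, and the descriptors are leaked for brevity:

    use std::ffi::CString;
    use std::io;

    fn wait_for_change_or_timeout(path: &str) -> io::Result<()> {
        unsafe {
            // inotify watch: registration succeeds even on NFS, but may never fire there.
            let ifd = libc::inotify_init1(libc::IN_NONBLOCK | libc::IN_CLOEXEC);
            if ifd == -1 { return Err(io::Error::last_os_error()); }
            let cpath = CString::new(path).unwrap();
            if libc::inotify_add_watch(ifd, cpath.as_ptr(), libc::IN_MODIFY) == -1 {
                return Err(io::Error::last_os_error());
            }

            // Fallback timer: fire every 5 seconds.
            let tfd = libc::timerfd_create(libc::CLOCK_MONOTONIC,
                                           libc::TFD_NONBLOCK | libc::TFD_CLOEXEC);
            if tfd == -1 { return Err(io::Error::last_os_error()); }
            let spec = libc::itimerspec {
                it_interval: libc::timespec { tv_sec: 5, tv_nsec: 0 },
                it_value: libc::timespec { tv_sec: 5, tv_nsec: 0 },
            };
            if libc::timerfd_settime(tfd, 0, &spec, std::ptr::null_mut()) == -1 {
                return Err(io::Error::last_os_error());
            }

            // Block until either a filesystem event or a timer tick arrives.
            let mut fds = [
                libc::pollfd { fd: ifd, events: libc::POLLIN, revents: 0 },
                libc::pollfd { fd: tfd, events: libc::POLLIN, revents: 0 },
            ];
            if libc::poll(fds.as_mut_ptr(), 2, -1) == -1 {
                return Err(io::Error::last_os_error());
            }
            if fds[0].revents & libc::POLLIN != 0 {
                // Real inotify event: inotify works here, so the timer could now
                // be disabled or slowed down with another timerfd_settime() call.
                println!("filesystem event");
            }
            if fds[1].revents & libc::POLLIN != 0 {
                // Timer tick: re-stat the file / re-read the directory here.
                println!("timer tick");
            }
            Ok(())
        }
    }
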
What do you think about it?

1 Like

What exactly are you suggesting that Rust itself does as opposed to programs that use Rust?

I'm not suggesting anything for Rust itself; I mean programming NFS operations so that they don't block for a long time (perhaps "safely" wasn't the right word -- the problem is avoiding threads getting blocked forever). However, Rust has features useful for asynchronous I/O, such as async functions and the core::future::Future trait, which simplify non-blocking operation (instead of hand-coding "if not ready, do something else", the asynchronous runtime does it automatically). Unfortunately, right now tokio::fs::File only runs blocking operations on a thread pool (which, I believe, should only be a last resort, when there are no other options). The better option is to use the operating system's async I/O facilities, but that can mean an app has to use different crates on different operating systems. It would be great to have an abstraction over asynchronous input/output that uses the native OS-provided facility. So I'm more interested in an asynchronous input/output implementation; NFS is just an example where it can be necessary (and where the "blocking operations on a worker thread" approach can leave worker threads blocked for a long time).
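
To illustrate, here is a minimal sketch of the async-style interface I mean, using tokio::fs (which, as said above, currently just runs the blocking syscalls on a thread pool); the path is only an example:

    use tokio::io::AsyncReadExt;

    async fn read_remote_file() -> std::io::Result<Vec<u8>> {
        let mut file = tokio::fs::File::open("/mnt/nfs/data.bin").await?;
        let mut buf = Vec::new();
        // The task is suspended at each .await instead of blocking the thread
        // that runs the async executor.
        file.read_to_end(&mut buf).await?;
        Ok(buf)
    }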

1 Like

There are no other options if you're using RHEL 8 or older versions, whose extended support may run until 2033. Basic, meaningful async file IO support landed in Linux kernel 5.1 (released in 2019) with the io_uring interface.

2 Likes

As far as I know, even if io_uring is present, access to it can be restricted by a system security policy (for instance, modern versions of AppArmor can restrict io_uring access). But there is also another asynchronous input/output facility in the Linux kernel, accessed through the io_setup(), io_submit(), etc. system calls. These system calls have no glibc wrappers, but libaio provides wrappers for them. I have managed to use it through the libaio-futures crate from crates.io. Other operating systems may provide different AIO facilities (or nothing like that at all). Providing cross-platform AIO support may require choosing the right facility for each target platform and building general abstractions over all of them (which can be difficult due to platform differences), so running ordinary I/O on a worker thread seems to be the most generic and widely available, but least powerful, solution.

Linux kernel AIO doesn't provide true async file IO. It's basically useless with files that aren't opened with the O_DIRECT flag, and even with that flag it doesn't guarantee non-blocking IO. It has too many hidden restrictions to be a general solution.

1 Like

Why should that block libraries from using better things when they are available?

A cross platform abstraction over io-uring and similar concepts on Windows and OS X (to the extent that they exist) would be great.

You could have emulated fallbacks to thread pools for unsupported platforms (maybe some BSD or other has no support) and legacy platforms like RHEL 8. Even if the performance suffers a bit on those platforms: who cares? The performance wasn't good before either, and if you choose to use a legacy OS, that's on you.

It just needs someone to write such a library (not an easy undertaking). Or, more realistically, several someones: you need platform-specific expertise. I do think it would be a worthy undertaking.

Sure, and that's what's happening -- or at least what's intended to happen -- with tokio. The tokio-uring project is a testbed for such an effort, and it needs/deserves more attention and feedback. Why don't you try it on your system this weekend?

1 Like

This makes me wonder. Is there a reasonable and easy way to check for io_uring support at runtime (both whether the kernel supports it and whether it is blocked by seccomp)?
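
The best idea I have so far is to simply try creating a small ring and look at the error -- a sketch assuming the io-uring crate; the exact errno mapping is my guess, and a seccomp filter that kills the process instead of returning an error would obviously defeat this probe:

    /// Probe io_uring support by trying to create a tiny ring.
    fn io_uring_available() -> bool {
        match io_uring::IoUring::new(4) {
            Ok(_ring) => true, // ring created (and dropped): io_uring is usable
            Err(e) => {
                // Typically ENOSYS if the kernel is too old or io_uring is compiled
                // out, and EPERM/EACCES if it is blocked by seccomp or a security policy.
                eprintln!("io_uring unavailable: {e}");
                false
            }
        }
    }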

Have you tried setting different mount options?
mount -t nfs -o soft,retrans=1,timeo=30 [...] should convert timeouts to EIO after 3 seconds.

Even when io_uring is supported, it is incompatible with the current interface tokio uses for IO. io_uring is a completion-based interface, while tokio uses a readiness-based interface. The difference is that with a readiness-based interface you poll whether data is ready to be read or whether the output buffer has room to write into, so you only need to lend a reference to your buffer to the OS at the instant you are actually reading/writing, not while waiting to be able to read/write. With a completion-based interface, however, you need to transfer ownership of your buffer to the OS until the OS reports back that it is done reading/writing. This means, for example, that with Rust async (which allows cancellation of futures) you can't read/write things on the stack the way you can with readiness-based interfaces. Instead you are forced to copy into a heap allocation that isn't touched or deallocated until the OS reports that it is done.
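
For illustration, this is roughly what the ownership-transfer style looks like in practice, sketched with tokio-uring's owned-buffer API (the path is made up): the buffer is moved into the call and handed back together with the result instead of being borrowed.

    fn main() {
        tokio_uring::start(async {
            let file = tokio_uring::fs::File::open("/mnt/nfs/data.bin").await.unwrap();
            let buf = vec![0u8; 4096]; // heap buffer whose ownership is given away
            // read_at() consumes the buffer and returns it alongside the result,
            // so it stays alive even if the future is dropped mid-operation.
            let (res, _buf) = file.read_at(buf, 0).await;
            println!("read {} bytes", res.unwrap());
        });
    }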

I've concluded that async file IO of any kind (including io_uring) can't help with this yet, not until all that is more mature, stabilized, and better supported by Rust crates as well as Linux.

For now I suggest using Tokio with its thread pool for blocking IO, or a separate thread pool (e.g., https://docs.rs/threadpool/) for doing file IO if you don't need Tokio for other reasons. That way when the IO blocks, at least other concurrent tasks/threads are not blocked. You may need a large thread pool, but that's currently the price of doing lots of file IO.
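
With Tokio that boils down to something like this (a sketch; the path is made up and the join-error handling is simplified):

    async fn read_config() -> std::io::Result<String> {
        // The closure runs on Tokio's blocking thread pool; the async task
        // just awaits its completion without tying up a runtime worker.
        tokio::task::spawn_blocking(|| std::fs::read_to_string("/mnt/nfs/app.conf"))
            .await
            .expect("blocking file IO task failed to join")
    }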

Tokio could still use io-uring to make things like stat, sendfile, and process.wait async.
And it has its own async BufReader impl, which owns a buffer. So maybe that AsyncRead impl could temporarily transfer ownership to the kernel? Ah, I guess that'd need min_specialization.

1 Like

Asynchronous file I/O always has to use a completion-based interface, because ordinary files have no clear "readiness event" like the "read-readiness" event (there is unprocessed data) or the "write-readiness" event (there is free space in the write buffer) that sockets (or pipes) have.

It's the same with the "blocking workers" thread-pool method the tokio::fs module uses -- for instance, write operations on tokio::fs::File take a copy of the data.