Using threads or async - faster file reading?

Hello!

I have a case in which I fill a vector with about 200 filenames. For each file, I load it, do some manipulation on it, and then export a new file. These 200 files are in no way connected, i.e. they can be processed independently of each other. My question is then:

Should I use threads or async (tokio, for example)?

I think using something like tokio would be smartest, but is there something I am not considering? If it is indeed tokio, does anyone have a small code snippet that does this, or a suggestion for how I should go about learning it?

Kind regards

Async can only make network IO faster. It provides no advantage for file IO. Use a thread pool.
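To make the suggestion concrete, here is a minimal sketch using only the standard library. It is not a full thread pool: it splits the file list into chunks and hands each chunk to one scoped worker thread, which is a simple stand-in for one. The names process_file and process_all, the ".out" output suffix, and the uppercase transformation are all hypothetical placeholders for the real manipulation.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;

/// Hypothetical per-file job: read the input, transform it, write `<input>.out`.
fn process_file(input: &Path) -> io::Result<PathBuf> {
    let text = fs::read_to_string(input)?;
    // Placeholder for the real manipulation.
    let result = text.to_uppercase();

    // Assumed naming scheme: append ".out" to the input file name.
    let mut name = input.as_os_str().to_owned();
    name.push(".out");
    let output = PathBuf::from(name);

    fs::write(&output, result)?;
    Ok(output)
}

/// Splits `files` into roughly equal chunks and processes each chunk on its
/// own scoped thread. Results come back in the same order as `files`.
fn process_all(files: &[PathBuf], workers: usize) -> Vec<io::Result<PathBuf>> {
    let workers = workers.max(1);
    // Ceiling division, with a minimum of 1 so `chunks` never panics.
    let chunk_size = ((files.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn one worker per chunk; scoped threads may borrow `files`.
        let handles: Vec<_> = files
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|p| process_file(p)).collect::<Vec<_>>()
                })
            })
            .collect();

        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Calling process_all(&files, 8) would process the list on up to eight threads. Static chunking is only a good fit when the files take roughly the same time each; a work-stealing pool balances uneven workloads better.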


True asynchronous filesystem operations aren't available on many platforms, but I'll focus on Linux. The POSIX AIO interface in glibc (aio_read and friends), nominally an API for "asynchronous filesystem operations," is implemented with a pool of threads in user space. (libaio, which wraps the kernel's native AIO syscalls, is a separate thing and is only reliably asynchronous for files opened with O_DIRECT.) Tokio and async-std, IIRC, also hand file operations to a thread pool. Linux has had issues supporting true async filesystem I/O because asynchronous operations aren't always supported internally, depending on the filesystem implementation and other kernel details.

However, withoutboats' new library ringbahn uses a newer kernel API, io_uring, which does seem to support true asynchronous filesystem operations, though it works differently from most other async systems. Most async systems work like this:

  1. I ask the kernel to read/write a socket.
  2. The kernel returns me an ID.
  3. I poll that ID until the kernel says that the operation is ready.
  4. If it's a read operation, I can grab that info from the kernel.
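The steps above can be sketched with a non-blocking Unix socket from the standard library. This is only an illustration under some assumptions: the "ID" is the file descriptor inside UnixStream, a sleep-and-retry loop stands in for waiting in epoll, and the names read_when_ready and demo are hypothetical. It is Unix-only.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;
use std::time::Duration;

/// Readiness-style read: keep asking until the kernel says data is available.
fn read_when_ready(stream: &mut UnixStream, buf: &mut [u8]) -> io::Result<usize> {
    // In non-blocking mode, `read` returns WouldBlock instead of waiting.
    stream.set_nonblocking(true)?;
    loop {
        match stream.read(buf) {
            // Ready: the data has been copied into our buffer.
            Ok(n) => return Ok(n),
            // Not ready yet. A real event loop would park in epoll here
            // instead of sleeping and retrying.
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
                thread::sleep(Duration::from_millis(1));
            }
            Err(e) => return Err(e),
        }
    }
}

/// Demo: one end of a socket pair writes after a short delay while the
/// other end polls for readiness.
fn demo() -> io::Result<String> {
    let (mut reader, mut writer) = UnixStream::pair()?;

    let handle = thread::spawn(move || -> io::Result<()> {
        thread::sleep(Duration::from_millis(20));
        writer.write_all(b"ping")
    });

    let mut buf = [0u8; 16];
    let n = read_when_ready(&mut reader, &mut buf)?;
    handle.join().expect("writer thread panicked")?;

    Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
}
```

The point of the sketch is that the kernel only reports readiness here; the actual copy into the buffer still happens when read is called.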

io_uring works differently in that the kernel writes data directly into a buffer you hand it. This has proved somewhat difficult to work with safely in Rust, but ringbahn does some cool Rust things to make it safe. io_uring first appeared in Linux kernel 5.1, which is fairly new, and later 5.x releases added more operations. It should generally provide true asynchronous IO for filesystem objects in addition to sockets and the like.

In summary, it really depends on which OS and which async system you're using, but at least on Linux, epoll doesn't support asynchronous filesystem operations, so threads are used under the hood. Tokio and async-std use actual OS threads to do filesystem reads/writes.

TL;DR whether you use an async executor or not, you're probably going to end up using OS threads under the hood for filesystem access, unless you use something based on io_uring.


It is true that there are some new experimental APIs that only work on new Linux kernels and would make file IO more usable in the async world. However, as there are no libraries in serious use that provide this kind of functionality, I do not think they will help the OP of this thread.


Thanks to both of you, @naftulikay and @alice! It was very nice to get a good explanation of what async actually means with regard to file processing.

I will try to see if I can get it working via threads.

Kind regards

@AhmedSalih3d you can probably use rayon's par_iter to get a fairly good result with minimal work.


Thanks! It worked very neatly.

Kind regards

