Ripuniq: fast async uniq

https://crates.io/crates/ripuniq

I made this a bit ago, mostly out of frustration that there was no way for me to deduplicate a file while preserving the order, and in the process somehow made it 5x faster than sort -u

it's mostly just a thin wrapper around a DashMap, with some manual tuning of parameters to improve performance.

anyone have any ideas on how i could make this faster/better?

Don't use async I/O for "speed". Async I/O is slow when your problem is not I/O bound.

Try this code (compile in release mode and run it locally). For a set of 4096 unique lines, each of 128 bytes, total count = 521204 lines), this runs in 0.06s, while your ripuniq runs in ~0.2s on my machine.

1 Like

you are changing the hash implementation, so that's not a fair comparison.

specifically, you are going from sip1-3 to ahash, which (according to the benchmarks from the ahash crate) should result in a ~12x speedup, less than what you observed.

ok, turns out that even with a fair comparison, blocking io is faster than tokio. tokio seems to be doing a lot more than is required to implement cooperatively scheduled coroutines, namely doing a bunch of calls to get the current time for some reason? i'm not convinced the problem is async io itself, but nonetheless i have removed it, as well as making several other optimizations.

2 Likes

With the default hasher of HashMap, it's still 0.08s vs 0.2s on my machine.

My implementation also causes less memory pressure by not keeping all lines, only the unique ones. Your impl leaks all lines, so its memory usage is O(input), while mine is O(unique). I didn't measure whether that helps with speed.

The slowness of async I/O was adequately explained a few days ago by tokio's author herself.

Just to be clear, what I explained in that thread is that Tokio is slow at file I/O specifically. Tokio is fast at network I/O — that's what it's designed for.

Regardless, you are of course correct that Tokio isn't the right choice for a program that primarily reads files and does CPU-bound work.

I may be the maintainer who does most of the maintenance today, but I'm not the original author of Tokio.

10 Likes

While ripgrep showed that memory maps aren't always the fastest way to process files, I wonder if they'd work great here. (So long as we ignore that it's super unclear whether they can even be used soundly, and just hope for the best.)

Because if you're just leaking every line to store it anyway, maybe you could make that the kernel's problem instead by just using memchr to split a mem-mapped &[u8] that represents the whole file. Then there's "no" file reading and "no" allocations.

How would that work with piping through stdin?

You could be generic over a Reader and Writer ? Or just duplicate some code for the special case of a mmap.

Edit: Yeah the MmapFileReader in fmmap — Rust filesystem library // Lib.rs implements Read and BufRead.

- Some(file) => filter(File::open(file)?, out),
+ Some(file) => {
+     let file = fmmap::MmapFile::open(file)?;
+     filter(file.reader(0)?, out)
+ },
1 Like

This could be fixed with io-uring support on Linux. I believe Windows now also have something similar.

I have seen io-uring used to great effect in C++ for things like plocate. Yet Rust support is lackluster at best (there is glomio I guess, but that is thread per core, which is rarely the right choice, and it also seems network focused). Tokio-uring exists but doesn't seem to be very active nor promoted (for some reason).

To me it seems like the premiere async framework should be at the forefront here on improving performance for async using io_uring.

(Also, while we are at it, it would be nice if tokio played better with other runtimes, code written for smol works ok under other runtimes, this is not true for code written for tokio.)

Minor comments:

  • Glommio is not network biased, it support file IO (including direct IO) fully.
  • Tokio_uring is also thread-per-core.

By thread-per-core I simply mean each executor runs on a single thread. Glommio additionally allows pinning an executor to a CPU; tokio-uring does not.

The fact that glommio and tokio-uring are both thread-per-core is a consequence of how io_uring is designed. It is difficult to use io_uring for network IO in a work-stealing runtime without giving up some performance.

I do want to see at least file IO to use io_uring in Tokio soon, though.

6 Likes

Have you see the approach in uring_fs? It is just a rough experiment, but I wondered if it may give you ideas for tokio because it already works with tokio and other runtimes. The basic idea is that a separate thread owns the uring and notifies the waker in the async threads; however a mutex is needed to do this, and I don't know if that would cause problems. I'm not even sure it's a viable approach, just thought it was an interesting variation.

I know more or less how to implement it. I just don't have the time.

4 Likes

Doh, I completely didn't think about that :person_facepalming:

Might still be an interesting experiment to see how it perfs in comparison, but yeah, I guess it's not going to be that helpful.

Definitely the other thing that comes to mind would be using a crypto-hash instead of needing to store the lines at all. Probably a waste of CPU most of the time, but an interesting time-space tradeoff.

2 Likes

How do you plan to produce the output summary if you irreversibly destroy the input?

Well if it's just uniq (not uniq -c), then something like

for line in line_reader {
    if table.insert(crypto_hash(line)) {
        println!("{}", line);
    }
}
1 Like

That was also what I did in my playground link a bit higher in the thread. I did it mostly by reflex because I've had memory issues in the past. You could replace the hasher by a better one. But on the playground I was playing it safe.