Waiting for detached threads to complete?

I actually have working code for this problem, but would love to hear any suggestions for a cleaner approach, or if there is a crate that does this, or if it might be good to split it off into a different crate. And now for the problem (and my solution)...

I have a long-running daemon (intended to run for months to years), which spawns threads. These threads must complete (unless power goes down or something), so the main thread must outlive them. In some cases, however, the main thread wants to exit (specifically, when the daemon is restarted due to an upgrade), so it needs to wait for the spawned threads.

As I see it, I have two options: either store the JoinHandles and join all the threads, or use something like a semaphore to count them. The former seems awkward, since I never want to wait on the threads until I'm exiting, and if I only join when I exit then JoinHandles for long-finished threads accumulate, which amounts to a resource leak. The latter is made somewhat more challenging by Semaphore having been removed from the standard library, so I need to use Condvar, which as far as I can tell is the right tool for this.

I thought about just creating a semaphore (which in effect is what I did), but instead decided to create a Threads struct that spawns the threads, wrapping each spawned function in a closure that decrements the thread count when it finishes. Then I implemented Drop for the Threads struct so that it waits until the count reaches zero. This seems like a pretty reusable bit of code (possibly needing better names), so I'm curious what y'all think of it.

This is my first time using Condvar, so it seems possible I've gotten something wrong. It's also possible there is a conceptual problem here. And I wonder if there is a common pattern (or a crate implementing such a pattern) for this sort of use case that I'd be better off using instead? It seems odd to write such low-level and (to me) fiddly code for something as simple as waiting for the job threads to complete!

Here is the code in question:

https://github.com/droundy/roundqueue/blob/master/src/longthreads.rs
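For anyone who doesn't want to click through, here is roughly the shape of it as a self-contained sketch (a Mutex<usize> counter plus a Condvar; the wrapper closure decrements the count, and Drop waits for it to reach zero). This is a reconstruction from my description above rather than the file itself, and the real code may handle details like panicking jobs differently:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

pub struct Threads {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl Threads {
    pub fn new() -> Threads {
        Threads {
            inner: Arc::new((Mutex::new(0), Condvar::new())),
        }
    }

    pub fn spawn<F>(&self, f: F)
    where
        F: FnOnce() + Send + 'static,
    {
        // Count the thread before it starts, so Drop can't miss it.
        *self.inner.0.lock().unwrap() += 1;
        let inner = self.inner.clone();
        thread::spawn(move || {
            f();
            // Decrement and wake the waiter (if any) once the job is done.
            // Note: a panicking `f` would leave the count elevated; the real
            // code may guard against that.
            let (lock, cvar) = &*inner;
            *lock.lock().unwrap() -= 1;
            cvar.notify_all();
        });
    }
}

impl Drop for Threads {
    // Block until every spawned thread has finished.
    fn drop(&mut self) {
        let (lock, cvar) = &*self.inner;
        let mut count = lock.lock().unwrap();
        while *count > 0 {
            count = cvar.wait(count).unwrap();
        }
    }
}

fn main() {
    let threads = Threads::new();
    for _ in 0..4 {
        threads.spawn(|| {
            // Stand-in for "wait for a child process, then move a file".
            thread::sleep(std::time::Duration::from_millis(10));
        });
    }
    // `threads` is dropped here, so main blocks until all four jobs finish.
}
```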

Maybe I misunderstood your problem, but isn't it exactly what's described in the Rust Book Version 2?

https://doc.rust-lang.org/book/second-edition/ch20-06-graceful-shutdown-and-cleanup.html

That chapter builds a thread pool that keeps a JoinHandle for each worker and can wait for all the threads to terminate. If you want to do this for only one thread, it should be even simpler.
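In its simplest form that boils down to keeping the JoinHandles around and joining them at exit, something like this untested sketch (the names are mine, and unlike the book's pool it doesn't reuse worker threads):

```rust
use std::thread::{self, JoinHandle};

struct Pool {
    handles: Vec<JoinHandle<()>>,
}

impl Pool {
    fn new() -> Pool {
        Pool { handles: Vec::new() }
    }

    fn spawn<F: FnOnce() + Send + 'static>(&mut self, f: F) {
        self.handles.push(thread::spawn(f));
    }

    // Wait for every spawned thread; call this when the daemon wants to exit.
    // Handles for long-finished threads do pile up here until the final join.
    fn join_all(self) {
        for handle in self.handles {
            let _ = handle.join(); // a real version might log join errors
        }
    }
}

fn main() {
    let mut pool = Pool::new();
    for _ in 0..4 {
        pool.spawn(|| { /* job body goes here */ });
    }
    pool.join_all();
}
```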

That does look similar, but does require that your pool have enough threads for the maximum number of possible jobs. For my purposes, a thread pool seems like significant overkill, since each job is slow enough that spawning a thread is negligible, but there could be very many jobs at times.

Are these threads I/O-bound? Don't you want to limit the concurrency of the jobs?

I also think a thread pool would make this easier for you. You've essentially built an ad hoc one, albeit without explicit handles to the threads, just a counter.

The threads each just wait for a single child process to complete, then move a file and exit. The number of processes is indeed limited, but I don't see what a thread pool would gain me. It would introduce more code, communication, and synchronization between threads, etc., when all I need to know is how many are left.

What would the advantage of a thread pool be?

Maybe nothing for your use case. But it would give you a handle to every live thread so you can wait for their completion. It would also let you reuse threads rather than spawning new ones each time, and let you limit parallelism so you don't run too many processes at once. It would also give you completion info, e.g. do you care about errors if a file fails to move?

If none of the above is of interest, then indeed a simple countdown latch is sufficient.
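To touch on the "is there a crate" part of your original question: if a dependency is acceptable, crossbeam-utils has a WaitGroup that behaves like such a latch. This is my suggestion, untested against your daemon's needs, but usage looks roughly like:

```rust
use crossbeam_utils::sync::WaitGroup;
use std::thread;

fn main() {
    let wg = WaitGroup::new();
    for _ in 0..4 {
        // Each spawned thread gets its own clone of the wait group.
        let wg = wg.clone();
        thread::spawn(move || {
            // ... do the job (wait on the child process, move the file) ...
            drop(wg); // signal completion by dropping the clone
        });
    }
    wg.wait(); // blocks until every clone has been dropped
}
```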

In the case of errors, I don't know what I could do besides log them to stderr and ignore them. The cost of spawning a thread should be utterly negligible compared to all the other work being done. The daemon itself is a job-queuing system working across multiple hosts and users (in a shared-home-directory situation), so coordinating how many jobs are running is its entire job. And we really don't want the queuing system to decide to run a job only to have it stall waiting for an available thread, though that is unlikely if the pool has as many threads as there are logical CPUs.