Hi, what's the approach for exiting an app that runs multiple threads in an actor model, i.e. each thread lives forever? I am working on an app that runs in a Docker container, and it hangs when I try to exit it after a fatal error that requires the app to abort.
I tried a few approaches:
1. call process::exit(1) on the main thread - hangs, process marked defunct in ps
2. call panic! on the main thread - hangs
3. send all the actor threads a STOP message, which causes them to break from the endless loop - hangs; for some reason all the actors seem to exit (I print the last line), but one of them, not always the same one, keeps showing in ps
4. same as (3), but the actor thread calls either panic! or process::exit(1) when it gets the STOP message - hangs
The first and urgent question is: how am I supposed to stop this thing?
The second question, which is more fundamental: how do I stop a multithreaded app in Rust? It may be that the 3rd party libs I use, or will use, include threads that do not behave nicely and will hang my app on exit.
To be clear, the reason for the abort is abnormal: the app needs to exit no matter what, the state of the system is of lower priority, and it shouldn't happen often.
For threads that you have access to, you need to JoinHandle::join() them or else they'll be defunct. The following is a good question though:
For misbehaving threads that fail to exit, you may need to employ platform-specific techniques, like sending signals.
std explicitly detaches the thread on Drop. Maybe there should be an option to not do that, and let the threads die when the main thread exits. But then that still wouldn't help with 3rd party code that's doing its own thing with threads.
Calling pthread_detach just marks the thread to clean up immediately when it's done, without waiting for a pthread_join. The thread is still part of the process, and should still be terminated when the main thread exits. (But they could be zombies until the parent reaps them, as @jonh suggests.)
The detached attribute merely determines the behavior of the system
when the thread terminates; it does not prevent the thread from being
terminated if the process terminates using exit(3) (or equivalently,
if the main thread returns).
Yeah, that's almost certainly it because Docker doesn't run a pid1 by default. Web search for "docker pid1 zombie". If you're using the Docker CLI directly you can use --init - however, that's not exposed by e.g. Kubernetes. Best practice is currently to use e.g. dumb-init.
But anyways at this point the question has nothing to do with Rust - std::process::exit() will on Linux end up calling the exit_group() syscall which will terminate all threads correctly. The problem is your Docker setup.
AFAIK threads on their own are unlikely to stop a process from exiting. Ideally, though, you should wait for them to finish, using a timeout you trust if you need an extra guarantee, and going multiprocess if you need more safety.
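To make the "wait with a timeout" idea concrete, here is a minimal sketch using only std. JoinHandle::join has no timeout, so the sketch assumes each actor reports on a second channel when it leaves its loop, and the main thread waits on that with recv_timeout. Msg, spawn_actor and stop_with_timeout are made-up names for illustration, not taken from the app in question.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;
use std::time::Duration;

// Hypothetical message type; the real app's messages are not shown in the thread.
enum Msg {
    Work(u32),
    Stop,
}

/// Spawns one actor. Returns the sender for its mailbox and a receiver that
/// yields the sum of the processed work once the actor has left its loop.
fn spawn_actor() -> (Sender<Msg>, Receiver<u32>) {
    let (tx, rx) = mpsc::channel::<Msg>();
    let (done_tx, done_rx) = mpsc::channel::<u32>();
    thread::spawn(move || {
        let mut sum = 0;
        // The loop ends on Stop, or when every sender has been dropped.
        while let Ok(msg) = rx.recv() {
            match msg {
                Msg::Work(n) => sum += n,
                Msg::Stop => break,
            }
        }
        // Ignore the error: the main thread may have stopped waiting.
        let _ = done_tx.send(sum);
    });
    (tx, done_rx)
}

/// Asks the actor to stop and waits at most `timeout` for its confirmation.
/// Returns None on timeout, or if the actor died without reporting.
fn stop_with_timeout(tx: &Sender<Msg>, done: &Receiver<u32>, timeout: Duration) -> Option<u32> {
    let _ = tx.send(Msg::Stop);
    done.recv_timeout(timeout).ok()
}

fn main() {
    let (tx, done) = spawn_actor();
    tx.send(Msg::Work(2)).unwrap();
    tx.send(Msg::Work(3)).unwrap();
    match stop_with_timeout(&tx, &done, Duration::from_secs(2)) {
        Some(sum) => println!("actor stopped cleanly, sum = {sum}"),
        None => {
            eprintln!("actor did not stop in time, exiting anyway");
            std::process::exit(1);
        }
    }
}
```

The timeout only bounds how long the main thread waits; it does not force a stuck thread to stop, which is why the fallback branch exits the process instead of joining.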
It's more a case of what exit does that this doesn't. That depends on the underlying platform, e.g. a libc::atexit handler that hangs.
I have been able to reproduce this issue by running the app on the host without the Docker engine, i.e. with cargo run.
I am using approach (4) and also added handle.join() on each Actor handle before calling process::exit(1) on the main thread. I have also added process::abort().
The main process still hangs (running on the host, without Docker), and now I can't see any leftover threads... just the main process stuck, with no way to kill it; even kill -9 doesn't work on it.
This means:
it is not related to Docker (the pid 1 issue)
it's not about crazy threads, but some issue with the parent not being able to exit. Is it correct to say that?
I've dealt with a similar headache in the past, and the solution was to redesign how the actors performed certain tasks, and to add a "kill" command to them. Since I then needed a few, I ended up with KillAbort, Kill (for in-simulation work), and KillUnwind (the original wasn't in Rust, I wanted different branches of behaviour for each so I could ensure certain logic would never fail in the unwind).
Regardless of whether you're using real threads or coroutines of some kind, look into:
set_hook: the hook is process-wide but runs on the thread that panicked, so you can log which thread tanked, and the task if you want;
catch_unwind: I've done very little with this, but it's a starting point if you don't want to set a hook.
Personally I'd redesign part of it (if you can) to enable killing on fault, and to try to eliminate any panic in general. If you can't (third party code or whatever other reason), that's fine, there's always the hook/catch option.
If I were tied down I'd use set_hook or catch_unwind on my own threads, even the main thread if I were particularly worried, at least until I narrowed down why it was happening.
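For reference, a rough sketch of how the two could be combined, assuming the default panic = "unwind" setting (catch_unwind catches nothing under panic = "abort"). run_guarded and the thread name "usb-actor" are hypothetical names used only for this illustration.

```rust
use std::panic;
use std::thread;

/// Runs one task and converts a panic into an Err carrying its message.
fn run_guarded<F>(task: F) -> Result<u32, String>
where
    F: FnOnce() -> u32 + panic::UnwindSafe,
{
    panic::catch_unwind(task).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}

fn main() {
    // The hook is global to the process; it is called on the panicking
    // thread, so thread::current() identifies which thread failed.
    panic::set_hook(Box::new(|info| {
        let t = thread::current();
        eprintln!("panic on thread {:?}: {info}", t.name());
    }));

    let handle = thread::Builder::new()
        .name("usb-actor".to_string())
        .spawn(|| run_guarded(|| panic!("device unplugged")))
        .expect("failed to spawn thread");

    // The panic was caught inside the thread, so join() itself succeeds.
    match handle.join().expect("thread panicked outside the guard") {
        Ok(value) => println!("task finished with {value}"),
        Err(msg) => println!("task failed: {msg}"),
    }
}
```

The hook gives you the "which thread, and why" log line, while catch_unwind lets the thread decide what to do next instead of just dying.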
btw, to shed some light on the context, the abnormal event the app is experiencing is related to a physical USB device being unplugged from the host computer.
Are you experiencing write locks? In htop that would be the "D" process status. I've seen that quite a lot lately when some process tries to access a USB thumb drive which does not respond, e.g. because it overheats.
I agree with @Fiedzia, if SIGKILL can't kill it, nothing Rust (or any other programming language) can do will help.
I was surprised that this was actually possible (SIGKILL is supposed to be unignorable), and I learned about the "uninterruptible sleep" state, which is the "D" in ps that @YBRyn asks about.
It means that your program has gotten stuck waiting for kernel I/O to finish, which fits nicely with unplugging a USB device.
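If you want to check for that state from code (say, from a separate watchdog process) rather than by eye in ps or htop, here is a small Linux-only sketch that reads the state field from /proc/<pid>/stat. parse_state and is_uninterruptible are hypothetical helper names; the only assumption about the file is the documented "pid (comm) state ..." layout.

```rust
use std::fs;
use std::io;

/// Extracts the state character from the contents of /proc/<pid>/stat.
/// The command name sits in parentheses and may itself contain spaces or
/// parentheses, so the state is taken from after the last ')'.
fn parse_state(stat: &str) -> Option<char> {
    let after_comm = &stat[stat.rfind(')')? + 1..];
    after_comm.split_whitespace().next()?.chars().next()
}

/// Returns true if the given process is in uninterruptible sleep ("D").
fn is_uninterruptible(pid: u32) -> io::Result<bool> {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat"))?;
    Ok(parse_state(&stat) == Some('D'))
}

fn main() {
    // Check the pid given as the first argument, or this process as a demo.
    let pid = std::env::args()
        .nth(1)
        .and_then(|arg| arg.parse().ok())
        .unwrap_or_else(std::process::id);
    match is_uninterruptible(pid) {
        Ok(true) => println!("pid {pid} is in uninterruptible sleep (D)"),
        Ok(false) => println!("pid {pid} is not in state D"),
        Err(e) => eprintln!("could not read /proc/{pid}/stat: {e}"),
    }
}
```

This only tells you that the process is stuck in the kernel; it does not get it unstuck.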