I'm trying to design an actor library, and I'm running into a bit of an issue about when things need to be (or not be) dropped.
Going through all the pieces:
- The actor has a message queue, the sending half of which is exposed to clients by giving them Arc handles. For ergonomic reasons, the handle has to implement CoerceUnsized, so AFAIK that means it must be exactly Arc; it can't be a custom type.
- The receiving half of the queue is owned by a separate task, which is dedicated to running the message loop.
- Messages are processed by user-provided code, and I want to let that code get a copy of the queue's sending half, because, e.g., from there you can feed a stream of messages into your own queue.
I have tried providing this by keeping an extra copy of the Arc as a task local, so that a static function can be called from the user's code to retrieve it. (Passing it as a parameter to the user's function would be an option, but I'd prefer not to, because most uses won't need it - and I'm not sure that would solve this problem either.)
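To make that concrete, here's roughly what the setup looks like, simplified down to tokio's mpsc and task_local! (the real handle is an Arc; Msg, SENDER, current_sender and run_actor are just placeholder names):

```rust
use tokio::sync::mpsc;

// Placeholder message type.
enum Msg {
    Ping,
}

tokio::task_local! {
    // Extra copy of the send half, visible to handler code running on the actor task.
    static SENDER: mpsc::Sender<Msg>;
}

// Called from the user's handler code to get a copy of the queue's sending half.
fn current_sender() -> mpsc::Sender<Msg> {
    SENDER.with(|s| s.clone())
}

async fn run_actor(tx: mpsc::Sender<Msg>, mut rx: mpsc::Receiver<Msg>) {
    SENDER
        .scope(tx, async move {
            // Note: the copy held in SENDER keeps the channel open, which is
            // exactly the problem described below.
            while let Some(msg) = rx.recv().await {
                // user-provided handler runs here and may call current_sender()
                let _ = msg;
            }
            // channel closed: actor state is dropped when this future finishes
        })
        .await;
}
```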
Unfortunately, this is where things start to go wrong.
In principle, once everyone is done sending messages and has dropped their Arcs, the message loop will drain the remaining values and then break, because the receivers will return Closed, and exiting the loop drops the actor.
But if I do this naively, with the task-local value being a strong Arc, the loop stays open forever, because the receivers never close even once all the "external" handles have been dropped.
I tried working around this by disabling the receivers when only "internal" Arcs were outstanding, but since the guard conditions in a select! are only evaluated when the select! is entered (i.e. after something is received, either a value or Closed), not when an Arc is dropped, I couldn't get it to work.
If I use a weak reference instead, there's a race condition: the sending half can already have been dropped by the time the message loop picks up the next value, so if the handler for that value wants a copy of the sender, the Weak can no longer be upgraded.
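Sketching the Weak variant with the same placeholder names (the client handles are modeled as Arc<mpsc::Sender<Msg>> here so there's something for the Weak to point at), the race sits in the upgrade:

```rust
use std::sync::{Arc, Weak};
use tokio::sync::mpsc;

enum Msg {
    Ping,
}

tokio::task_local! {
    // Weak copy, so the task-local alone no longer keeps the channel open.
    static SENDER: Weak<mpsc::Sender<Msg>>;
}

fn current_sender() -> Option<Arc<mpsc::Sender<Msg>>> {
    // Race: the last strong Arc can be dropped between a message being sent
    // and its handler running, so this upgrade can fail even though the
    // message currently being handled was sent successfully.
    SENDER.with(|weak| weak.upgrade())
}
```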
You can add a timer/timeout to the select! that periodically checks the Arc strong count. If it's 1, you know your task-local copy is the only one left. It can then be dropped (e.g. by moving it out of an Option) to close the channel and trigger your clean-up phase.
It isn't a great compromise, because there is still an ambiguity in the reason for strong_count == 1. Is it because the last client has dropped its send-half, or because no clients have ever connected? An extra boolean that is set when the first client acquires a copy of the send-half can resolve the ambiguity, and then you just have the actor task waking periodically to check whether all of the clients have gone.
If you need to handle the case where clients come and go arbitrarily and you don't want to shut the actor down prematurely, you will need to bake some shutdown logic into the "protocol", where a task can explicitly shut down the actor by sending it a shutdown message.
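A minimal sketch of the periodic check (using tokio's mpsc and interval; the handle is modeled as an Arc around the send half, all names are placeholders, the internal copy is shown as a plain local rather than a task local, and the "has a client ever connected" flag mentioned above is left out):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::mpsc;

enum Msg {
    Ping,
}

async fn run_actor(handle: Arc<mpsc::Sender<Msg>>, mut rx: mpsc::Receiver<Msg>) {
    // Keep the internal copy in an Option so it can be dropped on demand.
    let mut internal = Some(handle);
    let mut check = tokio::time::interval(Duration::from_millis(100));

    loop {
        tokio::select! {
            maybe_msg = rx.recv() => {
                match maybe_msg {
                    Some(msg) => { /* user handler runs here */ let _ = msg; }
                    None => break, // every sender, including the internal one, is gone
                }
            }
            _ = check.tick() => {
                if let Some(h) = &internal {
                    // Only the internal copy is left: drop it so the channel
                    // closes and the loop can drain and exit.
                    if Arc::strong_count(h) == 1 {
                        internal = None;
                    }
                }
            }
        }
    }
    // clean-up phase runs here
}
```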
The process for spawning a new actor hands out an initial Arc, so I don't think that ambiguity exists in practice?
Similarly, clients can come and go arbitrarily, but only by cloning that original handle - there's no way to "conjure" one from a static registry or anything. Between those two facts, I think that implies that once the strong count reaches 1, clients can never increase it again.
I found a solution that seems to work well, but it still has a wrinkle.
Guarding each of the recv calls on either more than the one local handle being live, or on the queue being non-empty, means the select! falls through to the else case once neither condition holds... except this still leaves a window where the select! can be entered and the last external handle disappears afterwards. To fix that, there's an explicit sleep. I don't particularly like this, because of the overhead of the timeout firing continuously while the actor is alive even when there are no messages, but I suppose I can tune it later.
The sleep is also guarded on the handle count, so that if the select! is entered when there can be no more incoming messages, the loop exits immediately rather than waiting for the timeout to fire again.
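Concretely, the loop now looks roughly like this (again with tokio's mpsc standing in for my real types and placeholder names throughout; it also assumes the receiver can report whether it's empty, which recent tokio versions expose as Receiver::is_empty):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::mpsc;

enum Msg {
    Ping,
}

async fn run_actor(internal: Arc<mpsc::Sender<Msg>>, mut rx: mpsc::Receiver<Msg>) {
    loop {
        // Re-evaluated each time the select! is entered: are any client
        // handles still alive, and is anything still queued?
        let clients_live = Arc::strong_count(&internal) > 1;
        let queue_nonempty = !rx.is_empty();

        tokio::select! {
            maybe_msg = rx.recv(), if clients_live || queue_nonempty => {
                if let Some(msg) = maybe_msg {
                    // user-provided handler runs here
                    let _ = msg;
                }
            }
            // Periodic wake-up so the guards get re-checked even if the last
            // client handle drops while we're parked inside this select!.
            _ = tokio::time::sleep(Duration::from_millis(100)), if clients_live => {}
            // Neither branch enabled: no clients left and nothing queued, so
            // exit immediately instead of waiting for another timeout.
            else => break,
        }
    }
    // dropping `internal` and the receiver here drops the actor
}
```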