Task silently stops listening - how to investigate?

I have a server application that can be configured to listen for incoming client connections. It supports both plain tcp and tls (using tokio-rustls).

The issue I'm running into: while the plain TCP interface seems to work without issues (though I can't be certain of that), the TLS listener works fine for a while -- on average about a day, handling maybe 10-40 connections -- but then simply stops accepting incoming connections. The client gets an immediate error 61 ("connection refused" on macOS, though I think this is a pretty universal unixy error code) when trying to connect. Here's why it confuses me: the listener is just a loop that accepts an incoming connection and immediately spawns a new task (using the multithreaded tokio runtime) to handle it. I've seen this sort of behavior when I accidentally left an unwrap() in the listener task, but I don't see any panics, and I've searched through the code for any left-over unwrap()/expect().

In fact, I don't see any errors at all(!), and I'm not ignoring any Results. That's what's most confusing -- it behaves as if the listener task had panicked, but there's no trace of that having happened.

Given the lack of any kind of warning/error, I'm not sure how to approach this. I'd like to check whether the listener task is still alive when the problem occurs. I could add some kind of heartbeat and a watchdog, but are there any tools that can help debug these kinds of issues without a bunch of temporary code changes? Is this something tokio-console can be used to investigate?

Hard to tell without looking at the code. Can you share some relevant snippets?

Check on the OS side whether the socket is still listening: netstat -tnl.
Handle::dump might be useful (note it's behind tokio_unstable), but I expect you'll need to add logging to the listener task.

Is the rest of your code still alive? Are you sure you have not hit a blocking call someplace that has held up your Tokio runtime?

Have you heard about dial9? I think it might help with your problem.

If you don't see any errors, maybe the traffic is being 'intercepted' outside your application? What is error 61 -- is it 'connection refused'? Maybe try curl -vvv against your application and look at the responses.

This story has taken two turns.

First, I was wrong about all Results being checked. I was referring to the application code; it never occurred to me that an external crate would ignore a Result, so I didn't check for that -- but that turned out to be the case. Catching the error and logging it yields: received corrupt message of type InvalidContentType.

It looks like it's receiving a broken request. While searching for an explanation I found some posts suggesting it could mean a client is connecting without enabling TLS, but TLS is hard-coded in the client's configuration file, so I don't think that's the case. Why does it happen roughly daily, but not all the time? It seems more likely to happen after idle periods: if I actively use the server it doesn't appear to break, but it most often happens on the first request in the morning, or after I haven't used the application for a few hours.

The second turn: I shut down all applications that use our PostgreSQL server for maintenance. We've been running that PostgreSQL server for 15+ years and have never had any issues with it -- until now. A few hours later, when I tried connecting to it, it had stopped responding to client connections (the processes were still running). After restarting the postgres service, it started accepting connections again.

I don't know whether the client is (sometimes) sending broken requests or whether there's some problem on the server machine. The fact that PostgreSQL suddenly stopped accepting client connections in a vaguely similar way makes me wonder if the issue is with the server itself.

This sucks, but at least I can see the error now -- it's not just silently failing, as I thought it was from the beginning.