I have a server application that can be configured to listen for incoming client connections. It supports both plain TCP and TLS (using tokio-rustls).
The issue I'm running into is that while the plain TCP interface seems to work without issues (though I can't be certain of that), the TLS listener works fine for a while -- on average about a day, handling maybe 10-40 connections -- and then simply stops accepting incoming connections: the client gets an immediate error 61 (macOS, but I think this is a pretty universal unixy error code) when trying to connect. Here's why it confuses me: the listener is just a loop that accepts an incoming connection and immediately spawns a new task (on the multithreaded tokio runtime) to handle the connection. I've seen this sort of behavior when I accidentally left an unwrap() in the listener task, but I don't see any panics, and I have searched through the code for any left-over unwrap()/expect().
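For context, the listener is roughly this shape (a simplified sketch with hypothetical names, not the actual code):

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::{TcpListener, TcpStream};

// Stand-in for the real per-connection logic (hypothetical).
async fn handle_connection(mut stream: TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 1024];
    let n = stream.read(&mut buf).await?;
    stream.write_all(&buf[..n]).await
}

// Simplified shape of the listener task: accept, spawn, repeat.
async fn run_listener(listener: TcpListener) -> std::io::Result<()> {
    loop {
        let (stream, peer) = listener.accept().await?;
        // Hand each connection off immediately so the accept loop never blocks.
        tokio::spawn(async move {
            if let Err(e) = handle_connection(stream).await {
                eprintln!("connection from {peer} failed: {e}");
            }
        });
    }
}
```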
In fact, I don't see any errors at all(!), and I'm not ignoring any Results. That's what's most confusing -- it behaves as if the listener task had panicked, but there's no trace that any such thing happened.
Due to the lack of any kind of warning/error, I'm not sure how I should approach this. I would like to check whether the listener task is still alive when these problems occur. I could add some kind of heartbeat and a watchdog (see the sketch below), but are there any tools that can help debug these kinds of issues without needing a bunch of temporary code changes? Is this something tokio-console can be used to investigate?
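For reference, the watchdog I had in mind would be something like this (just a sketch, reusing the hypothetical run_listener from above): keep the listener's JoinHandle around and log how the task ends, so a panic or an early return would at least leave a trace.

```rust
use tokio::net::TcpListener;

// Sketch: watch the listener task's JoinHandle so a panic or early
// return shows up in the logs instead of vanishing silently.
async fn spawn_listener_with_watchdog(listener: TcpListener) {
    let handle = tokio::spawn(run_listener(listener));
    tokio::spawn(async move {
        match handle.await {
            Ok(Ok(())) => eprintln!("listener exited cleanly (unexpected)"),
            Ok(Err(e)) => eprintln!("listener exited with I/O error: {e}"),
            Err(e) if e.is_panic() => eprintln!("listener task panicked: {e}"),
            Err(e) => eprintln!("listener task was cancelled: {e}"),
        }
    });
}
```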
If you don't see any errors, maybe the traffic is being 'intercepted' outside your application? What is error 61 -- is it 'connection refused'? Maybe try curl -vvv to connect to your application and see the responses.
First, I was wrong about all Results being checked. I was referring to the application code; it never occurred to me that an external crate would ignore a Result, so I didn't check for that -- but it turned out to be the case. Catching the error and logging it yields the error: received corrupt message of type InvalidContentType.
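For the record, surfacing the error looks roughly like this (simplified; in reality the handshake happens inside the external crate, but this is the shape of it):

```rust
use tokio::net::TcpListener;
use tokio_rustls::TlsAcceptor;

// Simplified shape of the fix: perform the TLS handshake per connection
// and log a failed handshake instead of dropping the Result on the floor.
// This is what finally surfaced the "corrupt message" error above.
async fn run_tls_listener(listener: TcpListener, acceptor: TlsAcceptor) -> std::io::Result<()> {
    loop {
        let (stream, peer) = listener.accept().await?;
        let acceptor = acceptor.clone();
        tokio::spawn(async move {
            match acceptor.accept(stream).await {
                Ok(_tls_stream) => {
                    // ...hand the established TLS stream to the normal handler...
                }
                Err(e) => eprintln!("TLS handshake with {peer} failed: {e}"),
            }
        });
    }
}
```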
It looks like it is receiving a broken request. While searching around for an explanation I found some posts suggesting that it can mean a client is connecting without enabling TLS, but TLS is hard-coded in the client's configuration file, so I don't think that is the case. Why does it happen roughly daily, but not all the time? It seems more likely to happen after the server has been idle for a while: if I actively use the server it doesn't appear to break, but it most often happens on the first request in the morning, or after I haven't used the application in a few hours.
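One thing I can do is at least test the 'client without TLS' theory by provoking the error on purpose. Something like this (hypothetical address/port) should, if I understand the error correctly, make the server log the same handshake failure, since the first byte of a plain request is not a valid TLS record type:

```rust
use tokio::{io::AsyncWriteExt, net::TcpStream};

// Deliberately speak plain TCP to the TLS port. The first byte of "GET"
// is not a valid TLS record content type, so the server side should
// reject the handshake with a "corrupt message" error.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut stream = TcpStream::connect("127.0.0.1:8443").await?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n").await?;
    Ok(())
}
```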
The second twist is that I shut down all the applications that use our PostgreSQL server for some maintenance. We've been running that PostgreSQL server for 15+ years and have never had any issues with it, until now. A few hours later, when I tried connecting to it, it had stopped responding to client connections (the processes were still running). After restarting the postgres service, it started accepting connections again.
I don't know whether the client is (sometimes) sending broken requests or whether there's some problem with the server. The fact that PostgreSQL suddenly stopped accepting client connections in a vaguely similar way makes me wonder if there's an issue with the server itself.
This sucks, but at least I can see the error now -- it's not just failing silently, as I thought it was at the beginning.