Linux/Mac SIG TERM and ssh2

TL;DR I catch SIG TERM with the signal_hook crate but my ongoing ssh connection fails on the signal and the program exits before my signal handler has a chance to shut down cleanly.

I have a program that is talking to some equipment using SSH from my Mac or a Linux box. Turns out the ssh on the equipment has a problem. If I repeated ssh into it from my program it will eventually fail. When it fails I can connect and log in but having done so it immediately closes the connection. At which point the only way to recover is to power cycle the equipment. This is a major problem as the equipment will be located at a remote inaccessible location.

Now, using ssh to that same equipment from the command line and then exiting cleanly does not have this problem.

So I thought it would help to send a SIG TERM to my program and have it shut down the session cleanly. Basically it should send ESC to stop the program running at the other end and then send "exit\n". At which point the connection closes.

Now I have the problem that although I can catch the SIG TERM with signal_hook and cause a clean shut down, that never happens. Rather communication fails with "Error waiting on socket" and the program immediately exits.

I catch the SIG TERM with this:

fn ctrl_channel() -> Receiver<()> {
    const SIGNALS: &[c_int] = &[
        SIGTERM, SIGQUIT, SIGINT, SIGTSTP, SIGWINCH, SIGHUP, SIGCHLD, SIGCONT,
    ];

    let mut sigs = match Signals::new(SIGNALS) {
        Ok(sigs) => sigs,
        Err(e) => {
            error!("creating Signals: {e}");
            std::process::exit(0);
        }
    };
    let (sender, receiver) = bounded(100);

    std::thread::spawn(move || loop {
        if let Some(signal) = sigs.into_iter().next() {
            error!("Got signal {}", signal);
            if let Err(e) = sender.send(()) {
                error!("Signal handler thread failed {}", e);
                std::process::exit(0);
            }
        }
    });
    receiver
}

Then in my programs main loop I check the other end of that channel with

if !ctrl_c_events.is_empty()

I'm reading from the ssh2 channel like so:

channel.read_exact(&mut buf)?

So why do I fail with a socket error rather than? Or why does a SIG TERM even cause a socket error?

You've mixed backticks with "forward ticks"(?) - not sure what character they are exactly. Are you able to fix the formatting for us? :slightly_smiling_face:

Oops. Sorry. Fixed it.

1 Like

My first guess is that your socket call is returning EINTR in response to the process-wide SIGTERM signal, and nothing is handling that for you. Unless something calls sigprocmask or pthread_sigmask, it’s normal behaviour for (at least) one system call to fail to make way for the signal handler to run, and it’s the application’s responsibility to retry.

I think a few Rust standard library calls will retry on EINTR, but not all.

Can you get an error code and stack trace for the “Error waiting on socket” error?

1 Like

BINGO. You got it. I should have remembered I've suffered from this problem back in my old C days. That will teach me not to just hide errors under the carpet of ? and throw them out to the void from main().

Unfortunately fixing the issue is a lot tougher in Rust than C. No simple testing of a return value. I ended up creating a ReadByte trait and implementing it for the ssh Channel. This checks the channel read for interruption by the signal. That took all morning to figure out how to get hold of the actual channel error kind and fighting with Anyhow to return that channel error.

Am I over thinking this?

trait ReadByte {
    fn read_byte(&mut self) -> Result<u8, Error>;
}

impl ReadByte for Channel {
    fn read_byte(&mut self) -> Result<u8, Error> {
        let mut buf = [0u8];
        match self.read_exact(&mut buf) {
            Ok(_) => Ok(buf[0]),
            Err(e) => {
                let error_kind = e.kind();
                match error_kind {
                    ErrorKind::TimedOut => {
                        error!("SSH got Timeout");
                        Ok(0u8)
                    }
                    _ => {
                        error!("SSH got error: {}", e);
                        Err(e.into())
                    }
                }
            }
        }
    }
}

This returns a null byte when the signal happens. Which is OK for me as there are null bytes coming down the stream through ssh which I deal with any way.

I use it like so:

            loop {
                let byte = channel.read_byte()?;
                // Do stuff with received byte.
                ...

                if !ctrl_c_events.is_empty() {
                    // Do stuff to shutdown cleanly
                    ...
                }
            }

Oddly the channel says it got TimeOut when the signal happens.

Thanks.

You should be able to get the error kind out of the std::io::Error (assuming whatever library you are using doesn't hide this from you). Interrupted is one of the supported error kinds.

Yeah, but how? It took me some hours to get this far!

I found I can get a raw operating system error with the ´.raw_os_error()´ method on the ssh error. That gets me a u32. But that sounds like going too low level.

The way I've always done it is to use .kind() on the error (assuming it's an std::io::Error or similar), which returns an enum I can match on. I can thus write something like:

match err.kind() {
    std::io::ErrorKind::Interrupted => todo!(),
    _=> todo!(),
}

to work out which error I'm facing and handle it in a portable fashion.

Edit: This may not be helpful - it looks from your code like you've already been looking at err.kind()

1 Like

After futzing around I have this:

impl ReadByte for Channel {
    fn read_byte(&mut self) -> Result<u8, Error> {
        let mut buf = [0u8];
        match self.read_exact(&mut buf) {
            Ok(_) => Ok(buf[0]),
            Err(e) => match e.kind() {
                ErrorKind::TimedOut => {
                    error!("SSH got Timeout: {e}");
                    Ok(0u8)
                }
                _ => {
                    error!("SSH got error: {e}");
                    Err(e.into())
                }
            },
        }
    }
}

Which is the same as before but skipping the redundant ´error_kind´ variable. Unfortunately that causes ´rust fmt´ to make a mess of the formatting. Oh well.

What bugs me is that it outputs the following when getting the signal:

ERROR rust_ssh_test] SSH got Timeout: Error waiting on socket

I just don't think getting a shutdown signal is an error on the socket. Neither is it a time out. But there it is mixed up with the other errors. But what can we do, that is the posix way.

The good news is that this has been cycling all day for a script that runs it and terminates it repeatedly and the application at the other end did not die. That only leaves network failures to cause the same issue. But, hey, that is not my fault anymore ?

Thanks all.

1 Like

Further problems with SSH2 and signals.

I created a "chaos monkey" program that runs the program described above as a child process and then sends it a SIGTERM after a random delay between about 1 and 10 seconds.

That original program seems to now successfully find all places where the ssh2 conversations are interrupted by the signal, at which point it successfully exists cleanly from that conversation.

The new problem I am seeing on some rare occasions are errors happening as my threads terminate and get joined in the main thread. That's like once per 5,000 runs!

For example the assertion failure seen here:

Sending SIGTERM to rust_ssh_test...
Waiting for rust_ssh_test to terminate...
[2024-08-14T07:24:39Z WARN  rust_ssh_test] read_status_line got Signal.
[2024-08-14T07:24:39Z WARN  rust_ssh_test] read_status_line got Signal.
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sending Esc 2 ...
[2024-08-14T07:24:39Z WARN  rust_ssh_test] read_status_line got Signal.
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sending Esc 2 ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sending Esc 2 ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sending 'exit' 2 ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sending 'exit' 2 ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sending 'exit' 2 ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sent 'exit' 2 ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sleeping 1 sec ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sent 'exit' 2 ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sleeping 1 sec ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sent 'exit' 2 ...
[2024-08-14T07:24:39Z INFO  rust_ssh_test] read_status_line sleeping 1 sec ...
[2024-08-14T07:24:40Z INFO  rust_ssh_test] read_status_line waking and returning ...
[2024-08-14T07:24:40Z INFO  rust_ssh_test] read_status_line waking and returning ...
[2024-08-14T07:24:40Z INFO  rust_ssh_test] read_status_line waking and returning ...
[2024-08-14T07:24:40Z INFO  rust_ssh_test] Tasks ended with:
        t1: Err(Signal recieved)
        t2: Err(Signal recieved)
        t3: Err(Signal recieved)

Assertion failed: (session), function _libssh2_channel_free, file channel.c, line 2736.

As all my threads have ended with Err(Signal recieved) that shows they al shutdown cleanly as expected.

Or here we see malloc throwing an error when it's told to free something that was not allocated:

Waiting for rust_ssh_test to terminate...
[2024-08-14T07:32:39Z WARN  rust_ssh_test] read_status_line got Signal.
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sending Esc 2 ...
[2024-08-14T07:32:39Z WARN  rust_ssh_test] read_status_line got Signal.
[2024-08-14T07:32:39Z WARN  rust_ssh_test] read_status_line got Signal.
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sending Esc 2 ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sending Esc 2 ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sending 'exit' 2 ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sending 'exit' 2 ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sending 'exit' 2 ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sent 'exit' 2 ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sleeping 1 sec ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sent 'exit' 2 ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sleeping 1 sec ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sent 'exit' 2 ...
[2024-08-14T07:32:39Z INFO  rust_ssh_test] read_status_line sleeping 1 sec ...
[2024-08-14T07:32:40Z INFO  rust_ssh_test] read_status_line waking and returning ...
[2024-08-14T07:32:40Z INFO  rust_ssh_test] read_status_line waking and returning ...
[2024-08-14T07:32:40Z INFO  rust_ssh_test] read_status_line waking and returning ...
rust_ssh_test(93700,0x16c13f000) malloc: *** error for object 0x6000030e75c0: pointer being freed was not allocated
rust_ssh_test(93700,0x16c13f000) malloc: *** set a breakpoint in malloc_error_break to debug

I tried looking Ito this with valgrind to no avail. I guess it is time to report a bug somewhere....

I woke up to some new and different errors today:

94172568:[2024-08-16T21:08:02Z ERROR rust_ssh_test] read_banner got error: transport  read
94172649-malloc_consolidate(): invalid chunk size

This is occurring after my three threads running three ssh channels over a single session have shut down joining the main thread.

Occurred three times in 6000 runs of my chaos_monkey. On a Jetson Nano running Ubuntu this time.

1 Like

Some small progress. Seems that ssh2 is potentially not thread safe. I found an issue raised on it: Concurrent channel usage not thread safe? · Issue #322 · alexcrichton/ssh2-rs · GitHub

1 Like

If you're OK with using async and Tokio for this, and you've got a simple reproducer, you could see if it's practical to port your reproducer to russh; that way, you have a version where only the cryptographic primitives use FFI, and if that doesn't reproduce the fault, you know it's something in ssh2 that's causing issues.

Sounds like an idea, I have nothing against going async.

Ultimately my problem is that the thing at the server end, an embedded system, gets confused if the ssh connection disappears while chatting to its application, eventually causing it to permanently refuse logins if it happens too many times. I cannot fix that embedded system so that inherent unreliability will not go away. I can only try to ensure it happens as little as possible by being sure to cleanly exit the embedded application and closing shells when my program is terminated. Just now that happens very rarely so I think I will live with it. When the thing is in real use it is intended to run forever anyway.

Time pressure dictates I don't get diverted by async or other alternative solutions. Hopefully that ssh2 issue will get resolved before I'm done :slight_smile:

1 Like

So I took a look at russh and decided it was too complicated to deal with. But I found async_ssh2_tokio which looks much easier to use and wraps russh.

I can connect to my remote unit with:

    let mut config = async_ssh2_tokio::Config::default();

    let client = Client::connect_with_config(
        (ssh_address, 2222),
        user_name,
        auth_method,
        ServerCheckMethod::NoCheck,
        config,
    )
    .await?;

I'm using connect_with_config()because I need to use some odd algorithms to make this ssh connection. From the command line I would do it like:

ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 -oHostKeyAlgorithms=+ssh-dss  me@server -p 2222

Problem is I have no idea how to create a config with those algorithms in it.

Any ideas?

Completely untested, but something like:

let mut config = async_ssh2_tokio::Config::default();
let mut kex: Vec<_> = config.preferred.kex.to_owned();
kex.push(russh::kex::DH_G1_SHA1);

should add the diffie-hellman-group1-sha1 config to the list. It looks like russh has no support for the legacy ssh-dss host keys, which rules it out if you need that.

Thanks. Sadly there is no such thing as config.kex. No ssh-dss means no go anyway.

And re-reading this, I realise that it should have been config.preferred.kex, not config.kex.

I noticed that already and tried it. But:

expected struct `Vec<_>`
                found enum `Cow<'_, [russh::kex::Name]>`

:frowning:

There's supposed to be conversions between Cow<'_, [T]> and Vec<T> - I'd have to spend a bit more time working out what I've done wrong (I suspect a missing into_owned() or similar). But as you need DSA host keys (which are legacy because they're cryptographically risky), I'm not intending to dig deeper.