Segfaulting, don't know where!

nv0skar · August 27, 2023, 7:20pm

Hi,
It has been well over a month since I first had this issue, but I can't figure out how to fix it because I can't even locate the root of the problem.
When running tests::server() at /server/src/lib.rs I get this:

$ Nearssage/server ❯ cargo test -- --nocapture
warning: `nearssage_client` (lib) generated 1 warning
    Finished test [unoptimized + debuginfo] target(s) in 18.69s
     Running unittests src/lib.rs (/Users/nv0skar/Projects/Nearssage/target/debug/deps/nearssage_server-696d14a9cec0d2b1)

running 1 test
error: test failed, to rerun pass `--lib`

Caused by:
  process didn't exit successfully: `/Users/nv0skar/Projects/Nearssage/target/debug/deps/nearssage_server-696d14a9cec0d2b1 --nocapture` (signal: 11, SIGSEGV: invalid memory reference)

I really appreciate your support!

claytonwramsey · August 27, 2023, 7:25pm

It's hard to debug all your code, so I'll just give some tips on how I debug segfaults.

First of all, it looks like there's no unsafe code in the source you published. If you have any dependencies that rely a lot on unsafe code, then try auditing them.

Next, try running your test with valgrind. Valgrind is a memory debugger which can do a lot. For running a test binary, you can run the following:

cargo test --no-run # will print out a path to a test binary. record it
valgrind ./target/debug/deps/my_test_binary # substitute path to the test binary here

This will track the sources of illegal memory accesses.

tbfleming · August 27, 2023, 7:36pm

Maybe this is it?

$ cargo audit
    Fetching advisory database from `https://github.com/RustSec/advisory-db.git`
      Loaded 562 security advisories (from /home/todd/.cargo/advisory-db)
    Updating crates.io index
    Scanning Cargo.lock for vulnerabilities (264 crate dependencies)
Crate:     time
Version:   0.1.45
Title:     Potential segfault in the time crate
Date:      2020-11-18
ID:        RUSTSEC-2020-0071
URL:       https://rustsec.org/advisories/RUSTSEC-2020-0071
Severity:  6.2 (medium)
Solution:  Upgrade to >=0.2.23
Dependency tree:
time 0.1.45
└── chrono 0.4.26
    ├── nearssage_storage 0.0.1
    │   └── nearssage_server 0.0.1
    ├── nearssage_schema 0.0.1
    │   ├── nearssage_storage 0.0.1
    │   └── nearssage_protocol 0.0.1
    │       ├── nearssage_server 0.0.1
    │       └── nearssage_client 0.0.1
    │           └── nearssage_server 0.0.1
    └── nearssage_commons 0.0.1
        ├── nearssage_storage 0.0.1
        ├── nearssage_server 0.0.1
        ├── nearssage_schema 0.0.1
        ├── nearssage_protocol 0.0.1
        └── nearssage_client 0.0.1

error: 1 vulnerability found!

Vorpal · August 27, 2023, 7:53pm

The best tool for segfaults tend to be debuggers. On Linux you can use rust-gdb (which is a wrapper for gdb). I would run the test binary under rust-gdb:

$ rust-gdb path/to/program
(gdb) run

Once it hits the error use the command "bt" (back trace) to print a trace. There are additional commands to examine local variables etc (see docs).

Now, it is possible the back trace is missing/useless (just question marks for example). If so you need to ensure you use a debug build and also download debug symbols for any libraries you use (such as glibc). The details on how to do that vary between Linux distros, but typically it involves either installing special -dbg packages or enabling something called debuginfod. Refer to your distro documentation for details.

Presumably something similar is also possible on Mac or Windows, but I know very little about these.

claytonwramsey · August 27, 2023, 8:00pm

cargo audit is a tool for looking up known security vulnerabilities. In this case, I mean audit in the plain-English sense - scan through your dependencies and see if they have unsafe code. If you have a lot of dependencies, though, this approach doesn't scale well.

tbfleming · August 27, 2023, 8:02pm

This wasn't a reply to your comment about auditing unsafe. Instead this shows an out-of-date dependency with a known possible segfault.

claytonwramsey · August 27, 2023, 8:07pm

ah, ok! then that could definitely be the cause. Sorry for misreading that.

nv0skar · August 27, 2023, 10:18pm

Thanks for all your quick replies!
Just tested a few things you suggested:

Ran $ cargo audit, vulnerability not present when disabling chrono's default features. Still getting segfault, so that's not the issue.
I've already been messing with the rust-lldb debugger, doesn't give a lot of insight, maybe because is multithreaded and I might be overlooking something...
I'll try with Valgrind but that will take some time because I'm am on macOS (arm64) and there's no support for that yet.

user16251 · August 27, 2023, 10:42pm

Isn't there unsafe code in object.rs (link)? It looks pretty fishy; it's getting around the borrow checker to pass a slice to a spawned task.

tbfleming · August 27, 2023, 11:14pm

Very fishy. Tracing lifetimes...

pub trait ReadableTable {
    ...
    fn iter(&self) -> Result<Range<K, V>> ...
}

pub struct Range<'a, K: RedbKey + 'static, V: RedbValue + 'static> ...

Applying lifetime elision rules:

fn iter<'a>(&'a self) -> Result<Range<'a, K, V>> ...

'a is the lifetime of the borrow needed by table.iter(). It ends when the iterator goes out of scope at the end of the loop. The smuggled references can clearly be accessed past that point.

Definitely UB, so it's a candidate for the segfault.

tbfleming · August 27, 2023, 11:43pm

search is an async function. Async functions can be canceled via select! or a large number of other ways. Consider:

search is canceled (its future is dropped)
This drops the JoinSet. This cancels all the tasks. The documentation says "When the JoinSet is dropped, all tasks in the JoinSet are immediately aborted." Question: if one of the tasks is currently executing in another thread, does JoinSet's drop handler wait for it to hit an await boundary? This would pause the current thread, which is bad. Pausing a thread causes a known issue in tokio's io. The doc is unclear, so let's assume it doesn't pause the current thread.
This drops table while one of the threads is executing. Boom.

nv0skar · August 28, 2023, 7:52am

I see what you mean, however that can't be the problem because:

tests::server() doesn't yet implement the database transactions, so that code is never called.
Even if it was implemented, I haven't caught any undefined behavior when tests::object() is ran, which does call that function.

The reason of the code being like that.
Regarding to the use of unsafe, I used a similar approach to async_scoped, also in Tokio's docs regarding JoinSet.

When the JoinSet is dropped, all tasks in the JoinSet are immediately aborted.

So there shouldn't be a risk of the function being cancelled and the spawned threads accessing table because these threads would have been cancelled too.

SkiFire13 · August 28, 2023, 8:51am

The problem is that in this loop id and value are dropped at the end of the loop iteration (and they implement Drop, so this is very relevant), but you're keeping references to their data in the JoinSet, which runs even after the next loop iteration has started. I'm not sure what would be the proper fix for this. Transmuting the lifetime of id and value themselves and move them into the task might work, but it won't solve the JoinSet problem. Effectively what should be needed here is to wrap everything in a self-referential struct (actually, multiple, one for each iterator item) to remove the lifetimes, but that's likely a lot of work.

There are a couple of misconceptions here:

async_scoped usage is inherently unsafe, so your code is too. In particular it requires the scope to always be dropped, but defining it in an async function doesn't guarantee that (the future could be leaked after being polled). Realistically this won't happen, so it should not cause your issue, but it is unsound anyway and maybe a new version of tokio may break your code (and that will be for sure very fun to debug, right?)
immediately aborted doesn't mean it won't execute after the drop call has returned. The task could be executing while it gets aborted, in which case it will actually stop only once it yields to the executor, which may happen after the drop call has finished Rust Playground
you are not spawning/cancelling threads, but tasks. Tasks execute on threads, but there can be many more tasks than threads running. This is the whole point of async, being able to have many (tens of thousands) of tasks without also having the same number of threads (which would waste a lot of resources).

nv0skar · August 28, 2023, 9:28am

I think you misunderstood me: I'm not comparing my code to async_scoped to say that my code isn't unsafe, I know it's unsafe, but unsafe code isn't the only way to achieve memory issues in Rust, that's the whole point of this thread, I'm pretty certain that code isn't causing the issue, it may cause another set of issues when updating as you said, but I know when I'm updating, and it's as easy as removing it to test if that's the problem, but as I said, that code isn't even being called.
Regarding to the abortion of the task, I'll convert that function from async to a normal one.
It's weird that id and value are dropped and still can get accessed, maybe because the future is executed before dropping the values, IDK, I'll look into that more in depth.
Also, I used the wrong term: referring to tasks not threads.

zirconium-n · August 28, 2023, 9:45am

Well, it is the only way. (sans compiler bug)

How could you be sure? Causing issue does not means "segfault right here".

SkiFire13 · August 28, 2023, 9:48am

I meant the id and value declared at line 108. Then you create two new variables id and value that technically borrow from the two at line 108, but you make the compiler forget about this by using unsafe. Finally, you use the new id and value in the task while the first two id and value have already been dropped.

user16251 · August 28, 2023, 10:58am

Can you please share a few things?

The backtrace from the debugger?
Disassembly from the crash site?
The contents of the register file?

In gdb you just have to run bt, disassemble, and info registers when you get SIGSEGV.

tbfleming · August 28, 2023, 11:01am

Fixing this may save you segfaults or other nasty bugs caused by UB in the future.

The trigger condition is dropping the future at just the wrong time. It may manifest as a Heisenbug, except if no code paths currently drop the future until completion, in which case it would be easy to accidentally hit it with future code changes. (It's easy to accidentally drop a future before completion).

From that issue:

Extending lifetimes like that is pretty much always undefined behavior

It has a very serious warning:

The only completely (without any additional assumptions) safe API is the scope_and_block function, which blocks the current thread until all spawned futures complete.

"Immediately" may not be as strong as you'd like. See my discussion about this same quote. Keep in mind that Rust doesn't allow one thread to kill another.

tbfleming · August 28, 2023, 11:11am

This will help. It will still be UB, which makes it sensitive to changes in the implementation of redb and to rustc. Make sure you handle panics; there's some tricky cases I didn't cover there.

tbfleming · August 28, 2023, 11:35am

There's another possibility: stack overflow. Do you have any recursion? Any large arrays?

If you're on Linux, you can check this by bumping the stack limit. Requires root:

ulimit -s unlimited

Topic		Replies	Views
How to debug a randomly segfaulting Rust program? help	9	826	November 6, 2024
Safe code got SIGSEGV, using Serde help	16	2006	January 12, 2023
Rust guarantees no segfaults with only safe code but it segfaults (stack overflow)	33	21452	July 3, 2022
Trouble identifying cause of segfault help	9	3206	January 12, 2023
Library tests pass, but sometimes segfault afterwards? help	7	1812	January 12, 2023

Segfaulting, don't know where!

Related topics