Hi,
It has been well over a month since I first had this issue, but I can't figure out how to fix it because I can't even locate the root of the problem.
When running tests::server() at /server/src/lib.rs I get this:
$ Nearssage/server ❯ cargo test -- --nocapture
warning: `nearssage_client` (lib) generated 1 warning
Finished test [unoptimized + debuginfo] target(s) in 18.69s
Running unittests src/lib.rs (/Users/nv0skar/Projects/Nearssage/target/debug/deps/nearssage_server-696d14a9cec0d2b1)
running 1 test
error: test failed, to rerun pass `--lib`
Caused by:
process didn't exit successfully: `/Users/nv0skar/Projects/Nearssage/target/debug/deps/nearssage_server-696d14a9cec0d2b1 --nocapture` (signal: 11, SIGSEGV: invalid memory reference)
It's hard to debug all your code, so I'll just give some tips on how I debug segfaults.
First of all, it looks like there's no unsafe code in the source you published. If you have any dependencies that rely a lot on unsafe code, then try auditing them.
Next, try running your test with valgrind. Valgrind is a memory debugger which can do a lot. For running a test binary, you can run the following:
cargo test --no-run # will print out a path to a test binary. record it
valgrind ./target/debug/deps/my_test_binary # substitute path to the test binary here
This will track the sources of illegal memory accesses.
The best tool for segfaults tend to be debuggers. On Linux you can use rust-gdb (which is a wrapper for gdb). I would run the test binary under rust-gdb:
$ rust-gdb path/to/program
(gdb) run
Once it hits the error use the command "bt" (back trace) to print a trace. There are additional commands to examine local variables etc (see docs).
Now, it is possible the back trace is missing/useless (just question marks for example). If so you need to ensure you use a debug build and also download debug symbols for any libraries you use (such as glibc). The details on how to do that vary between Linux distros, but typically it involves either installing special -dbg packages or enabling something called debuginfod. Refer to your distro documentation for details.
Presumably something similar is also possible on Mac or Windows, but I know very little about these.
cargo audit is a tool for looking up known security vulnerabilities. In this case, I mean audit in the plain-English sense - scan through your dependencies and see if they have unsafe code. If you have a lot of dependencies, though, this approach doesn't scale well.
Thanks for all your quick replies!
Just tested a few things you suggested:
Ran $ cargo audit, vulnerability not present when disabling chrono's default features. Still getting segfault, so that's not the issue.
I've already been messing with the rust-lldb debugger, doesn't give a lot of insight, maybe because is multithreaded and I might be overlooking something...
I'll try with Valgrind but that will take some time because I'm am on macOS (arm64) and there's no support for that yet.
'a is the lifetime of the borrow needed by table.iter(). It ends when the iterator goes out of scope at the end of the loop. The smuggled references can clearly be accessed past that point.
Definitely UB, so it's a candidate for the segfault.
search is an async function. Async functions can be canceled via select! or a large number of other ways. Consider:
search is canceled (its future is dropped)
This drops the JoinSet. This cancels all the tasks. The documentation says "When the JoinSet is dropped, all tasks in the JoinSet are immediately aborted." Question: if one of the tasks is currently executing in another thread, does JoinSet's drop handler wait for it to hit an await boundary? This would pause the current thread, which is bad. Pausing a thread causes a known issue in tokio's io. The doc is unclear, so let's assume it doesn't pause the current thread.
This drops table while one of the threads is executing. Boom.
I see what you mean, however that can't be the problem because:
tests::server() doesn't yet implement the database transactions, so that code is never called.
Even if it was implemented, I haven't caught any undefined behavior when tests::object() is ran, which does call that function.
The reason of the code being like that.
Regarding to the use of unsafe, I used a similar approach to async_scoped, also in Tokio's docs regarding JoinSet.
When the JoinSet is dropped, all tasks in the JoinSet are immediately aborted.
So there shouldn't be a risk of the function being cancelled and the spawned threads accessing table because these threads would have been cancelled too.
The problem is that in this loopid and value are dropped at the end of the loop iteration (and they implement Drop, so this is very relevant), but you're keeping references to their data in the JoinSet, which runs even after the next loop iteration has started. I'm not sure what would be the proper fix for this. Transmuting the lifetime of id and value themselves and move them into the task might work, but it won't solve the JoinSet problem. Effectively what should be needed here is to wrap everything in a self-referential struct (actually, multiple, one for each iterator item) to remove the lifetimes, but that's likely a lot of work.
There are a couple of misconceptions here:
async_scoped usage is inherently unsafe, so your code is too. In particular it requires the scope to always be dropped, but defining it in an async function doesn't guarantee that (the future could be leaked after being polled). Realistically this won't happen, so it should not cause your issue, but it is unsound anyway and maybe a new version of tokio may break your code (and that will be for sure very fun to debug, right?)
immediately aborted doesn't mean it won't execute after the drop call has returned. The task could be executing while it gets aborted, in which case it will actually stop only once it yields to the executor, which may happen after the drop call has finished Rust Playground
you are not spawning/cancelling threads, but tasks. Tasks execute on threads, but there can be many more tasks than threads running. This is the whole point of async, being able to have many (tens of thousands) of tasks without also having the same number of threads (which would waste a lot of resources).
I think you misunderstood me: I'm not comparing my code to async_scoped to say that my code isn't unsafe, I know it's unsafe, but unsafe code isn't the only way to achieve memory issues in Rust, that's the whole point of this thread, I'm pretty certain that code isn't causing the issue, it may cause another set of issues when updating as you said, but I know when I'm updating, and it's as easy as removing it to test if that's the problem, but as I said, that code isn't even being called.
Regarding to the abortion of the task, I'll convert that function from async to a normal one.
It's weird that id and value are dropped and still can get accessed, maybe because the future is executed before dropping the values, IDK, I'll look into that more in depth.
Also, I used the wrong term: referring to tasks not threads.
I meant the id and value declared at line 108. Then you create two new variables id and value that technically borrow from the two at line 108, but you make the compiler forget about this by using unsafe. Finally, you use the new id and value in the task while the first two id and value have already been dropped.
Fixing this may save you segfaults or other nasty bugs caused by UB in the future.
The trigger condition is dropping the future at just the wrong time. It may manifest as a Heisenbug, except if no code paths currently drop the future until completion, in which case it would be easy to accidentally hit it with future code changes. (It's easy to accidentally drop a future before completion).
From that issue:
Extending lifetimes like that is pretty much always undefined behavior
It has a very serious warning:
The only completely (without any additional assumptions) safe API is the scope_and_block function, which blocks the current thread until all spawned futures complete.
"Immediately" may not be as strong as you'd like. See my discussion about this same quote. Keep in mind that Rust doesn't allow one thread to kill another.
This will help. It will still be UB, which makes it sensitive to changes in the implementation of redb and to rustc. Make sure you handle panics; there's some tricky cases I didn't cover there.