Random failure of C->Rust interface (invalid memory reference)

Dear all,
I was in need of using the SuperLU Linear equation system solver in Rust. Since I only found and abandoned and defective crate for bindings (superlu-sys) I forked it and wrote a top level wrapper (sprs-superlu) to interface with sprs and ndarray.
Both can be found on Github:

Both repos contain a test for behavior against the same singular matrix (test_singular in superlu-sys and test_solver_singular_matrix in sprs-superlu). These (and all other) tests pass in general.

However, when I run cargo test for sprs-superlu there is a 40% chance of test_solver_singular_matrix in sprs-superlu failing with "invalid memory reference" at the line ffi::dgssv(...) in the solve_super_lu function. The test always woks when run standalone. When running RUST_TEST_THREADS=1 cargo test to get rid of potential thread safety issues, the failure chance RAISES to about 80%. It levels down to 40% as I add more threads.

I am deeply puzzled and would be grateful for any Idea on what is happening.
All the best
Henrik

PS: I did not add the code to avoid bloating the thread, since I do not know where to search for the issue.

1 Like

Well, it's clear that you've hit Undefined Behaviour. However, what is the specific error that caused it is hard to identify. Maybe you've passed some uninitialized memory to a routine which reads it. Maybe you've violated one of the many soundness rules of Rust (e.g. you can never create uninitialized memory representing a fully initialized type, even if you don't access it, so e.g. let x: u8 = MaybeUninit::uninit().assume_init(); is instant UB; or perhaps someone somewhere creates a trap representation of a Rust enum, i.e. not corresponding to a valid enum variant). Maybe you have misused the library API.

I took a look at your wrapper code for a while, and I couldn't find anything suspicious, so it's possibly not an UB in your Rust wrapper. I can't guarantee it, but I didn't notice anything strange either. Unfortunately, due to your heavy use of FFI running Miri likely won't help.

You can run your test executable with Valgrind. The executable itself can be found in target/debug/deps subfolder of your sprs-superlu crate. It will have a name like sprs_superlu-2315a0c4e10d1371. The value of hash will, of course, differ, and you need to make sure that you're actually using the latest build of your test suite. I suggest running cargo clean && cargo test --no-run, and then finding the relevant file (its path is written at the end of the cargo test output). On my system the invocation is simply

valgrind target/debug/deps/sprs_superlu-2315a0c4e10d1371

Valgrind finds an out of bounds read in the compiled code. The error report, with irrelevant details stripped, looks like

==16833== Invalid read of size 4
==16833==    at 0x1E792C: dpivotL (dpivotL.c:136)
==16833==    by 0x1E2717: dgstrf (dgstrf.c:342)
==16833==    by 0x1DC5E5: dgssv (dgssv.c:225)
==16833==    by 0x16A14E: sprs_superlu::solve_super_lu (lib.rs:76)
...
==16833==  Address 0x59aa940 is 0 bytes after a block of size 0 alloc'd
==16833==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16833==    by 0x1E8650: dexpand (dmemory.c:507)
==16833==    by 0x1E8CB5: dLUMemInit (dmemory.c:244)
==16833==    by 0x1E2245: dgstrf (dgstrf.c:262)
==16833==    by 0x1DC5E5: dgssv (dgssv.c:225)
==16833==    by 0x16A14E: sprs_superlu::solve_super_lu (lib.rs:76)
...

You can see that the invalid read happens in dpivotL.c:136, because the relevant buffer was allocated with size 0. The allocation happens in dgstrf.c:262. Getting the line numbers requires building the library with debug info enabled, which required adding -g3 -gdwarf to CFLAGS in the make invocation.

Unfortunately, this is as much help as I'm able to give you. I tried studying the superlu source to understand the reason for this OOB read, but it's way, way more convoluted than I can stomach in an evening. I have no idea how it's related to the parameters you've called it with. Reading the documentation on the functions doesn't indicate anything wrong on your side, or any possibility for critical errors.

It's possible that the invalid reallocation itself happens due to some UB on Rust side. I can't rule out that possibility. You can try building the project with the sanitizers, if you're feeling adventurous. Using them on the Rust side requires recompiling the stdlib, I don't know how easy this process is at this moment. You also need to compile your C sources with the sanitizers.

Compiling the superlu sources with clang and -Wall -Wextra also gives a number of disturbing potentially uninitialized memory access problems:

sgssv.c:194:9: warning: variable 'AA' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
    if (A->Stype == SLU_NC)

slangs.c:103:14: warning: variable 'value' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
  } else if (strncmp(norm, "I", 1) == 0) {

smemory.c:446:11: warning: enumeration values 'LLVL' and 'ULVL' not handled in switch [-Wswitch]
  switch (mem_type) {

smach.c:86:14: warning: variable 'rmach' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
  } else if (strncmp(cmach, "O", 1) == 0) {

dmach.c:86:14: warning: variable 'rmach' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
  } else if (strncmp(cmach, "O", 1) == 0) {

I don't know whether one of them could be relevant to your issue.


By the way, I recommend you to compile superlu sources with CMake rather than plain make. It is easier to configure for more advanced setups, like out-of-source builds, debug/release profiles and parallel compilation. You can also use the cmake crate to drive the process, rather than manually fiddling with subprocesses. It will make sure to integrate with Cargo in the best possible way.

Also, consider using std::mem::zeroed() instead of MaybeUninit::zeroed().assume_init(). Besides being shorter and more to the point, it also includes a special compile-time check, which will throw a compile error if you try to zero-initialize a type which doesn't have all-zeroes as a valid representation.

4 Likes

Thank you very much for your extensive effort in helping me. With your suggestions on where to look I could narrow the problem down to my test case being an all zero matrix. Other singular matrices work. In This case the arrays used for creating the SuperMatrix struct have zero length which together with them originating from rust is an issue. I will fix the issue with handling the special case differently. However, I will also incorporate your other suggestions in the coming days such as using std::mem::zeroed().
I will also change the way the C source is build as it currently relies on ancient version of SuperLU. Furthermore I will probably swap it for the multi threaded version SuperLU_mt. For doing so your suggestion of using the cmake crate will probably come in handy.
Thanks a lot!

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.