Corrupted enums

I have an enum in my program. One variant holds a string, the other 8 of them are empty. I am getting all sorts of issues around the same routine. In some cases, I try to clone or even debug-print it, and the program segfaults. In others, my debug statements print a variant that is never constructed in the program. I have hooked it up to lldb, and it is always in the same spot, with the same variants doing this, and the segfault is definitely in the derived clone or debug impls. There is no unsafe code in this part of my program, it's simply constructing a Vec-recursive data structure with Option<this enum> in the leaf nodes, and then walking it to produce another structure.

I have discovered that when compiled with an old rust nightly (nightly-2020-04-30-x86_64-apple-darwin - rustc 1.45.0-nightly (fa51f810e 2020-04-29)), it does not have this issue. But somewhere along the line, before nightly-2020-09-11, this started happening.

What's the right process for this? My code is open source, but it takes a while to build, and I don't really have the time to bisect the nightlies since April to figure out when this was introduced. I do have a minimal test input to my program that reliably segfaults. Should I file a rust-lang/rust issue? Will someone there have the CI time to pin down a nightly?

(Edit: I should add that it only happens in release mode.)

To clarify, this does mean however that you are using unsafe code in “other” places, right? Which means that there could very well be undefined behavior in your program? Is it a lot of unsafe code? Otherwise someone could try to audit the unsafe parts (e.g. you request some code review here) and find out if they are sound. Is the unsafe code for performance reasons or necessary? If it’s the former, you could try to get rid of it entirely. If there are segfaults without any unsafe code at all, you should file a bug report on rustc. If you are fairly confident that all your unsafe code is sound you should also file a bug report; people there could e.g. help with the bisection.

Perhaps as a first step, you could already link this open source project (and perhaps the branch/commit) in this thread.

Also some relevant questions: How long is "takes a while to build"? How big is the project? Is it easily possible to trim the project down to something significantly smaller that still produces the segfaults / erroneous behavior?

This is almost certainly due to a bad unsafe block, either in your code or a dependency.

These are good questions. The unsafes are in code that is not executed at all for the minimal test input. I checked that already. There are many more unsafes in dependencies, but this section of code really is dependency-free except for petgraph, which is the second structure being created by walking. I only discovered this when I tried to replace the string with the smartstring crate and ran my test suite in release mode to see if it was faster. For a while I thought it was smartstring's fault, but no, it turns out smartstring wasn't necessary to trigger it. Seriously, I will link my code below, but it goes like this:

// RefIR is the recursive structure; EdgeData is the enum in question
return (RefIR::Edge(Some(EdgeData::LocatorLabel)), GroupVars::Plain);
// step out to calling function:
let (ir, _) = call_that_function(...);
eprintln!("{:?}", ir);

And THAT segfaults in the Debug impl for EdgeData. There are literally zero lines of code in between. It won't segfault if you construct and print it immediately. It's a wonder I haven't pulled all my hair out already.

The project is citeproc-rs, and I will get you a branch name just as soon as I clean up the two hundred different things I've tried to fix it with.

Oh, and probably about 5 minutes build time, about 20kloc. Trimming would be annoying to do, but possible I think.

Is it possible to strip any code not used by this test case, to begin with?

If that’s the case, perhaps one could remove those unsafe sections entirely (e.g. replacing them with panic!()s) to at least determine that it’s either because of dependencies or because of rustc. If you can pin down a single dependency that can be used to get a segfault, you should open an issue on that dependency (no matter whether it is because of unsoundness of them, because of a transitive dependency or because of rustc, they’d be interested).

5 minutes build time is not too bad since bisecting can be automated (so you’d easily finish overnight). The main problem is that it is more likely due to some unsound unsafe code and not due to bugs in rustc.

That is irrelevant. Once you've unleashed the hyper-optimizing LLVM backend to trash your program by failing to meet its strict input requirements everywhere in the code—usually with regard to not ever having coexisting &mut references or references to uninitialized memory—you should have no expectation of correct behavior of the program.

The fact that this happens only in release mode is evidence that the problem is due to LLVM's release from its normal "do as I wrote" constraints due to violation of its non-negotiable input requirements.

Ok, but when I say "I checked" I mean "I know they are in code that isn't executed, and I commented them out just to be sure." I'm aware this is how UB is exploited, and in fact I have been reading the LLVM IR to hunt for clues, but I think cargo llvm-ir is chopping off half-lines in places so I can't get anything from that.

In that case you should definitely file a bug report against cargo llvm-ir, independent of whatever turns out to be the cause of the segfault.

Wait. I beg to differ, AFAIK undefined behavior is only relevant if it is actually executed.

To be more precise, undefined behaviour is only relevant if it would eventually be executed. It can cause havoc before the actual invocation of UB, as long as that invocation is guaranteed.

Good point. Which means that you cannot reliably check whether an unsafe code block, which you suspect might contain undefined behavior, gets executed by inserting some print statement in front of it. You can however do the check by inserting something that panics in front of the unsafe block.
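For illustration, here is a minimal sketch of that panic-guard check. The names `suspect_routine`, `safe_routine` and `hits_guard` are hypothetical and not taken from citeproc-rs. It assumes the default `panic = "unwind"` setting, since `catch_unwind` cannot observe a panic when panics abort:

```rust
use std::panic;

/// Hypothetical stand-in for a function whose `unsafe` block is under suspicion.
/// The temporary guard panics before the unsafe block can run, so if the
/// program panics here we know the test input reaches this code.
#[allow(unreachable_code)]
fn suspect_routine(values: &[u32]) -> u32 {
    panic!("guard: suspect unsafe block reached");
    // Never executed while the guard is in place.
    unsafe { *values.get_unchecked(0) }
}

/// Hypothetical stand-in for a code path that contains no guard.
fn safe_routine(values: &[u32]) -> u32 {
    values.iter().sum()
}

/// Runs `f` and reports whether it hit a panic guard.
/// Only meaningful with unwinding panics (the default profile setting).
fn hits_guard<F>(f: F) -> bool
where
    F: FnOnce() -> u32 + panic::UnwindSafe,
{
    panic::catch_unwind(f).is_err()
}
```

Calling `hits_guard(|| suspect_routine(&[1, 2, 3]))` returns `true`, while `hits_guard(|| safe_routine(&[1, 2, 3]))` returns `false`. In a real project you would more likely just run the failing test input and see whether the process panics with the guard message.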

Unless LLVM optimizes out the whole branch containing UB as "unreachable due to UB".

No, that would be an incorrect optimization (unless there's a second unsafe block floating around that causes UB before the panic). Code that panics does not have undefined behavior.

Ah. In other words, since everything after the unconditional panic is unreachable, LLVM will not consider this code's behavior when optimizing. I see, thanks.

The Rust compiler will throw an error when it encounters unreachable code. The unreachable code after the panic will have to be commented out, i.e. the potentially UB code will never reach the optimizer, anyway.

It is only a warning unless you explicitly turned that warning into an error.

I've had such a problem when I created an enum with an invalid value (e.g. enum had 8 options, part of the program did mem::transmute(9)).

Optimizations assume that enums only have valid values, and replace match on them with lookup tables, so invalid values jump into random crashy places.
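As a sketch of how to avoid that particular failure mode, a raw integer can be converted with a checked `TryFrom` impl instead of `mem::transmute`, so that an out-of-range value becomes an `Err` rather than an invalid enum. The `Direction` enum below is a made-up example, not a type from this thread:

```rust
use std::convert::TryFrom;

/// Hypothetical fieldless enum with an explicit integer representation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum Direction {
    North = 0,
    East = 1,
    South = 2,
    West = 3,
}

impl TryFrom<u8> for Direction {
    /// On failure, hand back the rejected raw value.
    type Error = u8;

    fn try_from(raw: u8) -> Result<Self, Self::Error> {
        // Every valid discriminant is listed explicitly; anything else is
        // rejected instead of being reinterpreted as a `Direction`.
        match raw {
            0 => Ok(Direction::North),
            1 => Ok(Direction::East),
            2 => Ok(Direction::South),
            3 => Ok(Direction::West),
            other => Err(other),
        }
    }
}
```

With this in place, `Direction::try_from(2)` yields `Ok(Direction::South)` and `Direction::try_from(9)` yields `Err(9)`, so the optimizer's assumption that the enum only holds valid values is never violated.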

If you're not doing anything shady with this enum, then it may just be a symptom of memory corruption elsewhere that happens to hit the enum. Check your dependencies with cargo-crev, or cargo-audit and cargo-geiger. Run the program under Valgrind.


If you believe it's a rustc bug, then you can bisect with:

cargo bisect-rustc --start 2020-04-30 --end 2020-09-11 -- test --release