No issues on Linux, segfault on OS X (but only when building with `--release`)

jjpe · May 9, 2018, 3:18pm

I have 2 crates written for work, one containing a parser and the other an interpreter.
Both crates have unit tests, and the interpreter has integration tests as well.
The interpreter also has a binary defined which essentially acts as a CLI-based REPL.
When I run any of the tests on Linux, all is well and it completes fine.

The same tests, however, segfault on OS X when I build the crate with `cargo build --release`.
When I build the binary in release mode and start using the repl, the binary also segfaults.

Here's the odd part: the parser crate uses no unsafe code at all, nor types such as Cell and RefCell.

Now, I've traced the issue to this function:

    pub fn read_regex(&mut self, regex: &Regex) -> LexResult<Token> {
        self.ensure_has_input()?;
        let (m_start, m_end) = match regex.find(self.remaining_source_code()) {
            Some(regex_match) if regex_match.start() == 0 =>
                (regex_match.start(), regex_match.end()),
            _ => return Err(LexErr::ExpectedRegexMatch {
                position: self.position()
            }),
        };
        let start = self.position() + m_start;
        let end   = self.position() + m_end;
        self.position = end;
        Ok(Token::new(start..end))
    }

Specifically, the segfault seems to happen during the regex.find(self.remaining_source_code()) part of the code, and I've verified that self.remaining_source_code() does not segfault. Thus deduction would point to the regex.find() as the origin of the segfaults. However, the regex crate is likely one of the most-used crates in the Rust ecosystem. So while a bug is definitely not impossible, it's also not particularly likely.

There is a possible alternative, which is a bug in rustc.
However, I'm not sure how to find out if it's in the regex crate or rustc at this time, or perhaps a 3rd alternative I haven't even considered.

Does anyone here have any ideas?

HadrienG · May 9, 2018, 3:56pm

I would like to point out that your investigation does not yet rule out the possibility of a bug in self.remaining_source_code(). This function may have appeared to succeed, but returned an invalid string which regex will choke upon.

Segfaults only tell you the point in your program where a certain type of invalid operation (e.g. invalid memory access) occurred. They do not tell you which part of the code made the mistake which led this invalid operation to occur.

EDIT: Also, if you enable debug information with something like RUSTFLAGS="-g", a debugger might be able to tell you where in regex.find you are segfaulting, which could provide additional diagnostic information. However, the output of debuggers may be difficult to interpret in optimized builds. Does the crash also occur in debug builds?

jjpe · May 9, 2018, 4:10pm

The input string can be any valid String i.e. it is valid UTF-8, as far as the String ctors are concerned.

Aside from that, I did some println-style debugging.

That essentially changes the method to:

    pub fn read_regex(&mut self, regex: &Regex) -> LexResult<Token> {
        self.ensure_has_input()?;
        println!("self.remaining_source_code(): {}", self.remaining_source_code());
        let (m_start, m_end) = match regex.find(self.remaining_source_code()) {
            Some(regex_match) if regex_match.start() == 0 => {
                println!("MATCH");
                (regex_match.start(),  regex_match.end())
            },
            _ => {
                println!("NO MATCH");
                return Err(LexErr::ExpectedRegexMatch {
                    position: self.position()
                })
            },
        };
        let start = self.position() + m_start;
        let end   = self.position() + m_end;
        self.position = end;
        Ok(Token::new(start..end))
    }

I then provide a valid input: "foo" (i.e. a string literal).
The output then becomes:

`self.remaining_source_code(): "foo"`
[1]        77417 segmentation fault    cargo run --release

Note that neither MATCH nor NO MATCH is printed out, indicating that the program has segfaulted before then.
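
One caveat with print-style tracing in general is that buffered output can be lost when a process dies abruptly, which would make the crash look earlier than it really is. A minimal sketch, assuming a hypothetical `trace` helper (not part of the code above), that writes each marker and flushes it immediately:

```rust
use std::io::{self, Write};

/// Writes `msg` plus a newline to `out` and flushes right away, so the
/// marker is handed to the OS before the next statement runs.
/// `trace` is a hypothetical helper name used only for this sketch.
fn trace<W: Write>(out: &mut W, msg: &str) -> io::Result<()> {
    writeln!(out, "{}", msg)?;
    out.flush()
}

fn main() {
    let mut err = io::stderr();
    trace(&mut err, "before suspect call").expect("write to stderr failed");
    // ... the suspect call (here: regex.find(...)) would go in this spot ...
    trace(&mut err, "after suspect call").expect("write to stderr failed");
}
```

If the first marker appears and the second does not, the crash happened between the two calls.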

HadrienG · May 9, 2018, 4:24pm

Thanks, that does unambiguously rule out remaining_source_code() as a possible culprit, as far as I can tell.

Does the crash still occur in non-release builds?
EDIT: Ah, sorry, saw in the title that it doesn't.

Next question: can you rebuild with debug info (e.g. with RUSTFLAGS="-g"), run the program in GDB, and ask for a backtrace at the point of segfault?

roflcopter · May 9, 2018, 4:35pm

Can you provide the source code of your application, or is it closed source?
If yes, can you compile your application with clang sanitizers? That would definitely help to detect where exactly invalid memory is read.

Edit: Apparently you can only do the sanitizer thing on Linux, but can you try it anyway, in case the program only happens to run by chance on Linux?

jjpe · May 9, 2018, 4:45pm

I'm attempting to do so now, but I'm running into a wall: lldb is borked, to update it I need to update Xcode (oh joy), which in turn requires an OS update...
And when I tried to install gdb, brew basically told me the same thing for once.

At the very least it'll be a few hours in order to get all that done, as OS X is not particularly speedy when doing major upgrades in my experience.
That's assuming it all goes well of course, and I'm not particularly trusting of Apple's competency regarding OS X stability, especially the last few years.

I'll post an update when I have it all sorted out.

@roflcopter The license is proprietary, but additional context wouldn't help anyway since it's all 100% safe code. Segfaults should be impossible.
Besides that though, I don't see how clang's sanitizers can help with rust code? Aren't those kinds of analyses rather PL-specific?

roflcopter · May 9, 2018, 4:52pm

vitalyd · May 9, 2018, 4:55pm

In addition to the other suggestions, can you try a few different versions of regex (are you on the latest?) and/or rustc itself?

jjpe · May 9, 2018, 4:56pm

I tried regex 1.0 (what I'm currently using) and 0.2.
I also tried regex 0.1 but that is API-incompatible so I can't use it.

At the very least, using either 1.0 or 0.2 the issue occurs.

As for rustc, I tried a few different versions:

most recent nightly: issue occurs
nightly-2018-05-03: issue occurs
beta: issue occurs
stable: Can't compile: impl trait isn't stable for another week and a half or so.

vitalyd · May 9, 2018, 4:59pm

Ok cool. How about rustc versions?

jjpe · May 9, 2018, 5:00pm

You just beat me by half a second. See my last post, I edited it.

vitalyd · May 9, 2018, 5:02pm

There was an LLVM upgrade recently IIRC - would be cool to try a much older (say a couple of months old) compiler.

HadrienG · May 9, 2018, 5:08pm

That reminds me: stack overflows are one common source of segfaults in safe code, as the part of the rust runtime which detects them and translates them into aborts is not yet bullet-proof.

A debugger backtrace would tell you immediately if that is the issue, but if that is not available, another way is to adjust your OS' stack size limit and check if it affects program behaviour (not sure how that is done on OSX).
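
As an alternative to changing the OS limit, the stack-overflow hypothesis can also be probed from inside the program by running the suspect code on a thread with an explicitly sized stack. A minimal sketch, assuming a hypothetical `run_with_stack` helper and a stand-in recursive workload (`depth`) in place of the real lexer call:

```rust
use std::thread;

/// Runs `work` on a new thread whose stack is `stack_bytes` large and
/// returns its result. `run_with_stack` is a hypothetical helper name.
fn run_with_stack<T, F>(stack_bytes: usize, work: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(work)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

/// Stand-in workload: plain recursion that consumes stack on each call.
fn depth(n: u64) -> u64 {
    if n == 0 { 0 } else { 1 + depth(n - 1) }
}

fn main() {
    // 64 MiB is an arbitrary, generous value chosen for the experiment.
    let reached = run_with_stack(64 * 1024 * 1024, || depth(10_000));
    println!("recursion depth reached: {}", reached);
}
```

If the crash disappears with a much larger stack and returns with a small one, that points towards a stack overflow rather than a miscompilation.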

Beyond that, well, since a segfault is a thin abstraction of a CPU fault, it can mean many other things: invalid instructions (e.g. use of a vector instruction set which is inappropriate for the host machine), memory safety bug in some other code that you are indirectly using...

jjpe · May 9, 2018, 6:16pm

Good idea! In fact, I think this may have been a critical insight in understanding the issue.

I went back to rustc 1.25.0-nightly (616b66dca 2018-02-02) and added the various now-stablish feature attributes back in.

Then I compiled again, and bingo, my REPL works as it should.

I think that from this we can conclude it's a rustc issue.
But I don't have enough information to even submit a bug report. What exactly is going wrong here? Why do newer rustc versions cause segfaults at all on OS X? And why does it manifest only then?

In order to gain more information, I'd like to do something like a bisect install for the range of rustc versions between 2018-02-02 and now. Is anything like that available in rustup?

ExpHP · May 9, 2018, 6:27pm

There is a utility called rust-bisect that automates some of the process.

Honestly, I myself just do it manually and call rustup override set nightly-YYYY-MM-DD && cargo run with different dates. Unless your project takes forever to build, it's not that tedious, and I'm there to catch anything unexpected (like a "failure" for the wrong reason).
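
The manual procedure amounts to a binary search over an ordered list of toolchains. A minimal sketch of that search, assuming every toolchain before the regression is good and every one after it is bad; `first_bad` and the toolchain labels are hypothetical, and in practice the predicate would be a human (or script) running `rustup override set <toolchain> && cargo run` and reporting whether it crashed:

```rust
/// Returns the index of the first "bad" entry in `versions`, assuming all
/// entries before it are good and all entries from it onward are bad.
/// Returns `None` when no entry is bad.
fn first_bad<T>(versions: &[T], mut is_good: impl FnMut(&T) -> bool) -> Option<usize> {
    // Invariant: the first bad index, if any, lies in lo..=hi.
    let (mut lo, mut hi) = (0, versions.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if is_good(&versions[mid]) {
            lo = mid + 1; // regression is strictly after `mid`
        } else {
            hi = mid; // `mid` is bad, but an earlier entry might be too
        }
    }
    if lo < versions.len() { Some(lo) } else { None }
}

fn main() {
    // Placeholder labels with simulated outcomes (true = no segfault).
    let candidates = [
        ("nightly-a", true),
        ("nightly-b", true),
        ("nightly-c", false),
        ("nightly-d", false),
    ];
    match first_bad(&candidates, |c| c.1) {
        Some(i) => println!("first bad toolchain: {}", candidates[i].0),
        None => println!("no bad toolchain in range"),
    }
}
```

Each step halves the remaining range, so even a few months of nightlies need only a handful of builds.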

jjpe · May 9, 2018, 9:32pm

Ok so the bisection results are in.

The Last Known Good nightly was rustc 1.26.0-nightly (392645394 2018-03-15).
Something in the next nightly changes the behavior to cause the segfault.

EDIT:

I have opened a bug about this on GitHub.

uberjay · May 9, 2018, 9:49pm

Well, that was eerie -- seeing that date (2018-03-15) reminded me of when I tracked down what was ultimately not a bug in rustc, but a bug in the winit crate. Perhaps it's also related to the ! type?

https://github.com/rust-lang/rust/issues/49275
https://github.com/tomaka/winit/commit/559681b0ed35100ad8418e09e4f35873caa33d7b

jjpe · May 9, 2018, 10:14pm

There are a few differences that I can see:

1. The crate in the issue by palango uses dyld, a library that provides a neat way to load a dynamic library. My crate has no need for this feature at the moment.
2. It also manages to produce a backtrace, something my binary doesn't do when it segfaults.
3. My crate doesn't use !, at least not directly.
4. My crate doesn't use any unsafe code directly.
5. The Cargo.lock of my crate hasn't changed during e.g. the bisection I performed earlier.

FTR, other than point 5 I do not know which of these differences are significant and which are not.

Given the list above, if this issue turns out not to be a rustc bug, then the segfault would very likely originate in a (transitive) dependency of my crate. And in that case, I would look at the regex crate again (see one of my earlier posts).

uberjay · May 9, 2018, 10:21pm

Yeah, it sounds unlikely to be related, but the regression timing seemed like too much of a coincidence to not mention!

daschl · May 15, 2018, 4:36am

@jjpe as niko mentioned in https://github.com/rust-lang/rust/issues/50586 setting lto = true worked for me as a workaround, you might want to try that as well?
