Invalid opcode / segmentation fault

Hi,
Can anyone help me debug this, please?

kernel: [108919.606652] traps: bin-name [12704] trap invalid opcode ip:561a96a06db9 sp:7f6246893688 error:0 in bin-name[561a96986000+226000]

On top of that, I sometimes get segmentation fault, but I have only binary for the bug above.
It's possible that I have damaged RAM as recently I'm getting some random freezes.

Thanks,
Lukasz

I think this is impossible without further context, e.g. some code (preferably not the complete code, but a reduced version of it; see MCVE on Stack Overflow).

Hi,
It's not reproducible as it happens randomly and actually quite rarely. I was hoping someone could point me at how to get the exact asm and/or line of code for this ip address.

You can use the "usual" tools such as addr2line. If you build your binary with debug symbols (the default in debug builds), it will also print the line in the source code.

What is correct usage? I get:

lw@lukasz-home ~> addr2line -e bin-with-invalid-opcode
561a96986000
??:0
0x561a96986000
??:0
561a96a06db9
??:0
0x561a96a06db9
??:0
0x226000
??:0
226000
??:0

Ok, I made some progress:
So 561a96a06db9 - 561a96986000 gives 80DB9. addr2line:

addr2line -e binary
80DB9
/rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/hash/sip.rs:327
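The arithmetic above can be reproduced straight in the shell, for anyone wanting to script it (the binary name is a placeholder; the `-f`/`-C` flags are standard addr2line options for function names and demangling):

```shell
# File offset = faulting ip minus the module base address, both taken
# from the kernel trap line "in bin-name[561a96986000+226000]".
ip=$((16#561a96a06db9))
base=$((16#561a96986000))
printf '%x\n' $((ip - base))   # prints 80db9

# Pass the offset (not the raw ip) to addr2line; -f prints the enclosing
# function and -C demangles Rust symbols:
# addr2line -e ./bin-name -f -C 0x80db9
```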

Is it even possible?
Edit: I use 1.37.0; this leads to:

let b: u64 = ((self.length as u64 & 0xff) << 56) | self.tail;

Should I file an issue on GitHub?

Yeah, I use HashMap heavily.

Interestingly, using the same technique on an earlier segmentation fault leads me to /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/slice/mod.rs:5353

5341 impl<A> SlicePartialEq<A> for [A]
5342     where A: PartialEq<A> + BytewiseEquality
5343 {
5344     fn equal(&self, other: &[A]) -> bool {
5345         if self.len() != other.len() {
5346             return false;
5347         }
5348         if self.as_ptr() == other.as_ptr() {
5349             return true;
5350         }
5351         unsafe {
5352             let size = mem::size_of_val(self);
5353             memcmp(self.as_ptr() as *const u8,
5354                    other.as_ptr() as *const u8, size) == 0
5355         }
5356     }
5357 }

Edit: shouldn't there be a check for the size of `other`? As it stands, `other` could be smaller, resulting in a segfault?

I think it is always wise to seek the cause in your own code first. Time to review the unsafe code blocks :slight_smile:.

Re that code, the length check at the top guards against your concern: once `self.len() == other.len()` holds, `size_of_val(self)` equals the byte length of both slices, so `memcmp` never reads past the end of `other`.

The thing is, I don't have any unsafe in my code.

Do the errors occur on other computers too? Have you run memory diagnostics?

It's a long-running process and I've had multiple crashes on another computer (different processor, code compiled with target-cpu=native for each machine). This is the first time I've started investigating it, so I don't know what caused the other crashes.
I've run memory check on the main machine and it seems fine.

Can you share the code? It would simplify a lot of things.

Not yet, but it will be open source. The thing is, I'm not sure what to look for; I use only safe code.

An invalid opcode sounds like the processor is either jumping to a random location and trying to execute garbage, or the compiler generated instructions that aren't valid for the processor.

You mention you use target-cpu=native when compiling, so it may even be a bug in rustc or LLVM. For example, it may emit platform-specific instructions (SIMD, etc.) that aren't actually available.

If possible, you may want to run the program under GDB (or any other debugger). If the program hits an illegal instruction GDB should halt things and let you look at the assembly or backtrace. You mention this being a long-running process, so it may be a case of leaving it to run overnight or periodically checking it while doing other things.
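A minimal sketch of that workflow, assuming the process is already running; the PID is the one from the kernel trap line and purely illustrative:

```shell
# Attach GDB to the live process; GDB stops the inferior on
# SIGILL/SIGSEGV by default, preserving the faulting state.
# gdb -p 12704
#
# Then let it keep running and wait for the trap:
#   (gdb) continue
#   ...
#   (gdb) x/8i $pc-16     # disassemble around the faulting instruction
#   (gdb) bt              # backtrace; source lines thanks to debug = true
#   (gdb) info registers
```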

rust-lang/rust#38218 may also be relevant here. In particular, this comment is promising:

@EFanZh The issue is that Rust will generate code for your CPU family, which is the Haswell family.
In theory, AVX is architecturally guaranteed on Haswell, but some CPUs like your Pentium lack it.
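One quick way to check what the hardware actually advertises (Linux-specific paths; the features grepped for are just the common AVX-era suspects, not an exhaustive list):

```shell
# What CPU is this, and does it expose the vector extensions that
# target-cpu=native may have assumed at build time?
grep -m1 '^model name' /proc/cpuinfo
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' \
    | grep -x -e avx -e avx2 -e fma || echo 'none of avx/avx2/fma'
```

If a flag is missing here but the binary was built on a machine that had it, a SIGILL on the first such instruction is exactly what you'd expect.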

Any crates / ffi or special in Cargo.toml?

lib crate deps:

rand = "0.7"
scoped_threadpool = "0.1.*"
no-panic = "0.1.10"
rand_xorshift = "0.2.0"
num_cpus = "1.0"

app deps:

rand = "0.7"
serde = { version = "1.0.91", features = ["derive"] }
serde_json = "1.0.39"
dirs = "2.0.2"
fs_extra = "1.1.0"
human-panic = "1.0.1"
rand_xorshift = "0.2.0"

global Cargo.toml:

[profile.release]
overflow-checks = true
debug = true
lto = false

If you're segfaulting on linux, you should be able to get a core dump for the process. Then just load it into gdb and get a stack trace. Here's a little tutorial if you're not familiar with the workflow:
https://jvns.ca/blog/2018/04/28/debugging-a-segfault-on-linux/
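On Ubuntu-derived distros like Mint the usual blockers are the soft core-size limit and apport owning core_pattern; a quick sketch (paths are standard, the binary name is a placeholder):

```shell
# 1. Lift the per-process core size limit for this shell.
ulimit -c unlimited

# 2. See where the kernel sends cores; on Mint/Ubuntu this is often a
#    pipe to apport, which discards cores from non-packaged binaries.
cat /proc/sys/kernel/core_pattern

# 3. Write cores to a plain file instead (root required):
# echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

# 4. After the next crash, open the dump with the matching binary:
# gdb ./bin-name /tmp/core.bin-name.<pid>
```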

I'm getting things configured to get core dumps, but Mint stubbornly refuses. In the meantime I got another crash at 0x76298, which looks like this:

0x000000000007628d <+253>: lea 0x10(%rsp),%rdi
0x0000000000076292 <+258>: mov %r14,%rsi
0x0000000000076295 <+261>: callq *0x3da2fd(%rip) # 0x450598
0x000000000007629b <+267>: jmp 0x762aa <genotick::exec::node::Node::execute+282>
0x000000000007629d <+269>: lea 0x10(%rsp),%rdi
0x00000000000762a2 <+274>: mov %r14,%rsi

In Rust code it's my own code, called by my other code, which in turn also calls my code (all safe). Seems a bit weird to me.