Access violation on i686-pc-windows-msvc

Hi,

I'm currently working on a bridge between Node.js and Java. Everything is working fine so far, except on 32-bit Windows, where I'm getting weird access violations (writing location 0x000000).

The best example would be this commit, which, if you look at it, doesn't add a whole lot of code. The only Rust file that changes is src/node/java.rs, but those 3 changed lines cause the program to crash on startup, which can be seen in the associated GitHub Actions runs.

I've tried using the debugger to find out what is wrong, but it only shows some assembly. As I don't have debug symbols for that code, this brings me to the conclusion that the issue is not in my code but somewhere else. And as I'm not very good at reading assembly, I don't know what to do about this.

Maybe someone in here knows how to solve this.

While I can find a failed test I cannot find one that failed with an access violation.

That would be a NULL pointer. Or possibly an uninitialized pointer. Does any of your code work on raw pointers?

The return code may actually be the one for an access violation, but as the tests are running through npm, the return code may just be 1. Or something else is breaking, but when running the code on my local machine, I get an access violation.
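To tell the two cases apart, one option is to run the failing binary directly (not through npm) and compare the raw exit code with the Windows status code for an access violation, 0xC0000005. A minimal sketch of that check; the helper name `is_access_violation` is made up for illustration and is not part of the project:

```rust
/// NTSTATUS value that Windows reports when a process dies from an access violation.
const STATUS_ACCESS_VIOLATION: u32 = 0xC000_0005;

/// Hypothetical helper: returns true if a raw process exit code
/// (as returned by `std::process::ExitStatus::code()`) is an access violation.
fn is_access_violation(exit_code: Option<i32>) -> bool {
    // The exit code is exposed as a signed i32, so reinterpret the bits as u32
    // before comparing with the NTSTATUS constant.
    exit_code.map(|code| code as u32) == Some(STATUS_ACCESS_VIOLATION)
}
```

The value to pass in would come from something like `Command::new(...).status()?.code()`. A plain exit code of 1 from npm would not match, which fits the behaviour described above.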

Yes, with quite a few, actually. As I wrote a wrapper around the JNI apis, I had to work with a lot of raw pointers. But, here's the thing: On every other system except windows 32-bit I've never encountered an access violation even once. Better yet, in the commit I've referenced only a Vec containing possibly some strings gets added and this causes the access violation. But when removing these 3 lines, everything works flawlessly. I've also encountered an access violation while adding other things, but this one was the easiest to show that clearly, something is wrong here (duh).

And it gets even better: If I add a println! in the method where I've added the 3 lines for the Vec, the access violation is magically gone. It's truly wild.

So I don't think a null pointer not being handled by me correctly is the issue, as it just works™ on any other system/configuration.

So in summary, the access violation here is caused by me adding a vector to another one:

loaded_jars.extend(cp.clone());
// or
loaded_jars.append(&mut cp.clone());
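For reference, here is a small standalone sketch showing that both variants (plus `extend_from_slice`, which skips the temporary clone of the whole Vec) are ordinary safe operations that produce the same result. The function name and the test data are made up for illustration. In isolation none of these can write through a null pointer, which suggests the line itself is only where the problem becomes visible:

```rust
/// Hypothetical demo: appends the contents of `cp` to three empty vectors,
/// once per variant, and returns all three results for comparison.
fn merge_three_ways(cp: &[String]) -> (Vec<String>, Vec<String>, Vec<String>) {
    // Variant 1: extend from an owned copy.
    let mut a: Vec<String> = Vec::new();
    a.extend(cp.to_vec());

    // Variant 2: append drains a temporary copy into the target.
    let mut b: Vec<String> = Vec::new();
    b.append(&mut cp.to_vec());

    // Variant 3: clone each element straight out of the slice.
    let mut c: Vec<String> = Vec::new();
    c.extend_from_slice(cp);

    (a, b, c)
}
```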

But again, many other things have caused it too. Another great example would be this:

Something like this caused an access violation somewhere

fn not_ok(dirs: Vec<String>, ignore_unreadable: bool) -> napi::Result<Vec<String>> {
    let mut res = Vec::<String>::new();

    for dir in dirs {
        let glob_res = glob(dir.as_str())
            .map_napi_err()?
            .into_iter()
            .map(|f| f.map_napi_err())
            .collect::<napi::Result<Vec<_>>>();

        match glob_res {
            Ok(f) => {
                for file in f {
                    if file.is_file() {
                        res.push(file.to_str().unwrap().to_string());
                    }
                }
            },
            Err(e) => {
                if !ignore_unreadable {
                    Err(e)?;
                }
            }
        };
    }

    Ok(res)
}

But changing it to this worked fine:

fn this_is_fine(dirs: Vec<String>, ignore_unreadable: bool) -> napi::Result<Vec<String>> {
    dirs.into_iter()
        .map(|f| glob(f.as_str()).map_napi_err())
        .collect::<napi::Result<Vec<_>>>()?
        .into_iter()
        .flat_map(|f| f)
        .map(|f| f.map_napi_err())
        .filter_map(|f| match f {
            Ok(f) => Some(
                f.to_str()
                    .ok_or("Failed to convert path to string".into_napi_err())
                    .map(|f| f.to_string()),
            ),
            Err(e) => {
                if ignore_unreadable {
                    None
                } else {
                    Some(Err(e))
                }
            }
        })
        .collect()
}

Both methods should do exactly the same thing: iterate over a list of glob patterns and find all files matching those patterns. But one causes a kind of untraceable access violation and the other doesn't.
And I'm not even working with raw pointers at this point. The rewrite is the only thing I've changed.

Which is an indication that the heap is corrupt.

Based on the evidence I'm going with uninitialized pointer.

This is almost certainly you screwing up. You have a heisenbug, and when it decides to show up it's showing up when your program tries to dereference a null pointer. This is textbook undefined behavior. Most likely, you forgot to do a null check on a pointer returned via ffi, and the end result is your program tries to dereference a null pointer somewhere, and your code may or may not crash, depending on how the compiler feels like compiling your code.

Because you are getting spooky action at a distance, it may be worth using an address sanitizer to see if you can track down the source of the bug.

There's also a chance that you're doing everything right and either Java or Node is returning a null pointer somewhere they shouldn't, although that's not particularly likely.
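The null check mentioned above can be sketched roughly like this. It is only an illustration: `string_from_ffi` is a hypothetical helper, not part of the JNI or of your wrapper, and it assumes the foreign side hands back a NUL-terminated C string or null:

```rust
use std::ffi::{c_char, CStr};
use std::ptr::NonNull;

/// Hypothetical helper: converts a C string pointer received over FFI into an
/// owned `String`, returning an error (not dereferencing) when it is null.
///
/// # Safety
/// If `ptr` is non-null it must point to a valid NUL-terminated string.
unsafe fn string_from_ffi(ptr: *const c_char) -> Result<String, &'static str> {
    // NonNull::new returns None for a null pointer, so the check cannot be skipped.
    let ptr = NonNull::new(ptr as *mut c_char).ok_or("FFI call returned a null pointer")?;

    // SAFETY: the pointer is non-null, and the caller guarantees it is a valid C string.
    let c_str = unsafe { CStr::from_ptr(ptr.as_ptr()) };
    Ok(c_str.to_string_lossy().into_owned())
}
```

Funnelling every pointer that crosses the FFI boundary through a helper of this kind turns a missing check into an `Err` you can trace, not a crash somewhere unrelated.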

Ok, great, thank you both for your input. I'll try using an address sanitizer to find the bug.

But there's one thing I still don't quite get: Why is this only happening on windows 32-bit? Is it just the compiler handling the code differently on different architectures/operating systems? Did I just get lucky on all of those other architectures and operating systems (linux x86_64, linux aarch64, darwin x86_64 and windows x86_64)?

I guess it is how you look at it. If the violation you get on i686-windows leads you to find the heisenbug then the good luck is that you had the violation message. You may have released this on the other systems and had it bite you back very hard later.

A different optimizer orders the machine instructions differently. A different instruction set means the code path is different. The operating system interacts with your program differently, leaving different artifacts on the stack.

Yup. That's the rub with undefined behaviour. Sometimes it's well buried.

Ok, I don't know if anyone noticed, but I'm quite new to the "address sanitizing" business and I don't quite know how to use them.

I've tried using the AddressSanitizer as described:

$ export RUSTFLAGS=-Zsanitizer=address RUSTDOCFLAGS=-Zsanitizer=address
$ npm run build -- -- --target x86_64-unknown-linux-gnu --cargo-flags="-Zbuild-std"
# And running the tests using
$ LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) npm run testOnly

But after running the program I only get memory leaks and all of them are false positives outside my code (no rust file is ever mentioned in the output). Is the AddressSanitizer not the right one for me?

So in the next step I tried using the MemorySanitizer:

$ export \
  RUSTFLAGS='-Zsanitizer=memory -Zsanitizer-memory-track-origins' \
  RUSTDOCFLAGS='-Zsanitizer=memory -Zsanitizer-memory-track-origins'
$ npm run build -- -- --target x86_64-unknown-linux-gnu --cargo-flags="-Zbuild-std"
$ npm run testOnly

But I'll always get the following error:

java.linux-x64-gnu.node: undefined symbol: __msan_va_arg_overflow_size_tls

As I needed to set the path to some library for the AddressSanitizer, I searched for something similar, but as msan is linked statically, the fix for that seemingly isn't that easy. So something may be breaking the re-export of msan symbols, but I don't know what it is or how to fix it. Any suggestions?
Or is my approach completely wrong?

I've never used any of that so I'm going to be little to no help.

Ok, so I've moved my whole code to a new crate, as napi-rs breaks the re-export of msan symbols, but now I'm getting false positives in my jdk, so imma rebuild the jdk with msan...yay

That also didn't work. My guess is that the MemorySanitizer may be the correct one but I never got it working.

I've read that valgrind may also be able to do such things, and it actually works, but it didn't report anything useful. I've tried the memcheck tool but that didn't give me anything.

So does anyone know how to use valgrind in order to find this bug?