Wrong method is called by vtable of trait object. Compiler bug?

Dear all,
I need the help of seasoned Rustaceans, because it seems that I have stumbled upon a compiler bug or some very obscure problem that I can't debug myself.

I have a very weird issue when working with trait objects: wrong methods are called through the vtable and the program crashes with memory corruption. The biggest problem is that I can't isolate the bug. It only happens inside the rather complex code base of the molar crate, and I can't provide a small self-contained example.

In essence, I have traits like this:

pub trait TopologyWrite: RandomAtomProvider + RandomBondProvider {}
pub trait StateWrite: RandomPosProvider + BoxProvider + TimeProvider {}
pub trait TopologyStateWrite: TopologyWrite + StateWrite {...}

Then I have a method that takes a StateWrite trait object like this:

fn write_state(&mut self, data: &dyn StateWrite) -> Result<(), super::FileFormatError> {
    ...
    let b = data.get_box();
    ...
}

This should call the get_box() method from the BoxProvider trait, which is a bound of StateWrite. However, when I step over it in the debugger, a completely different method, num_bonds(), is called instead, which is from the RandomBondProvider trait (a bound of TopologyWrite)!
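For reference, the expected dispatch can be sketched as follows. This is a minimal sketch with simplified stand-ins: only the trait and method names come from the post, while the return types and the `Dummy` type are assumptions made for illustration. No upcast is involved here, because `&Dummy` is converted directly to each trait object.

```rust
// Simplified stand-ins for the molar traits. The names come from the post;
// the string return types are an assumption, chosen to make dispatch visible.
trait RandomBondProvider {
    fn num_bonds(&self) -> &'static str;
}

trait BoxProvider {
    fn get_box(&self) -> &'static str;
}

trait TopologyWrite: RandomBondProvider {}
trait StateWrite: BoxProvider {}

// Hypothetical concrete type implementing both sides of the hierarchy.
struct Dummy;

impl RandomBondProvider for Dummy {
    fn num_bonds(&self) -> &'static str {
        "RandomBondProvider::num_bonds"
    }
}

impl BoxProvider for Dummy {
    fn get_box(&self) -> &'static str {
        "BoxProvider::get_box"
    }
}

impl TopologyWrite for Dummy {}
impl StateWrite for Dummy {}

// Mirrors the shape of `write_state`: dynamic dispatch through `&dyn StateWrite`.
fn write_state(data: &dyn StateWrite) -> &'static str {
    data.get_box()
}

// Same idea for the topology side of the hierarchy.
fn write_topology(data: &dyn TopologyWrite) -> &'static str {
    data.num_bonds()
}

fn main() {
    // `&Dummy` is unsized directly to each trait object type.
    println!("{}", write_state(&Dummy));
    println!("{}", write_topology(&Dummy));
}
```

Each call is expected to reach the method of the trait it was declared in; the symptom described above is that `get_box()` ends up in `num_bonds()` instead.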

After some experimentation I found that, whatever I do, the compiler calls the N-th method of TopologyWrite (in order of declaration) when I ask for the N-th method of StateWrite! So it literally uses the wrong vtable, or the wrong offset in a vtable.

This looks like a compiler bug because I don't see anything particularly cursed in the code (although it uses some unsafe things under the hood).

Can someone familiar with vtable magic help me trace this either to a compiler bug or to my own mistake?

In order to reproduce:

  1. git clone -b dyn https://github.com/yesint/molar.git
  2. cargo test --package molar --lib -- core::selection::tests::test_write_to_file --exact --show-output --nocapture
  3. See the error "signal: 11, SIGSEGV: invalid memory reference", because the wrong method is called and memory is corrupted.

Any help will be deeply appreciated!

It's a lot more likely to be your own unsafe that's wrong than a compiler bug. :slight_smile:

how much unsafe code is in the code base? the unsafe keyword is the language feature that helps isolate memory safety bugs.

a compiler bug is not impossible, but your unsafe code is the most likely culprit. audit every line of unsafe code carefully.

I did. To my (limited) knowledge, I'm not doing anything that could corrupt the vtables like this. I never touch them in any way; all the unsafe code is in different, unrelated places.

Sure, but I have zero idea what can corrupt the vtables like this. I never touch anything related to vtables or trait object conversion.

that's why unsound code is bad: its effects are unpredictable, surprising, and can be non-local.

as it is usually said: when your program contains UB, literally ANYTHING could happen.

From your description, it sounds like one trait object has been transmuted or otherwise interpreted as another trait object.

It has not. There is no unsafe mangling of types anywhere in the code.

As has already been said, incorrect unsafe code can have effects entirely unrelated to what it was supposed to be doing. You should not assume that just because the purpose of your unsafe code is unrelated to vtables that it isn't having an effect on them.

If possible, you should test your unsafe code with Miri. If you run the test case where the wrong vtable is being used, under Miri, then either it misbehaves without Miri reporting undefined behavior (in which case it might be a compiler bug), or Miri will point right to the erroneous operation. (However, if your test program involves non-Rust code, this will not be feasible, because Miri can only check Rust.)

In any case, you should review your unsafe code, not from the perspective of “could this unsafe code cause the problem I am having”, but rather “is this unsafe code sound? is there any way whatsoever it could perform any UB?”. If you do not approach it this skeptically, you will miss problems.

Then, if you still have no leads, it is time to start working on bug minimization. Make a local git branch of molar's code. Copy your test program into the repo if it isn't already. Then, start removing code that is not relevant to reproducing the bug. You must give up on the idea that the library should correctly perform its duties; you don't care if the program produces the right answer. You only care whether it exhibits the vtable misbehavior. Remove entire modules. Stub out functions. Simplify code until you can delete more of it. Cut and cut and cut. Regularly git commit so that if you go too far and the problem vanishes, you can roll back. Keep doing this and you will find the problem — whether it is a problem in your unsafe code or in the compiler.

Unfortunately, I do not use Cargo, so I can't test your case, but I can recommend trying it with another version of the compiler or on another platform. For example, I have rustc built from git, stable rustc from rustup, and nightly rustc from rustup, on Windows, Linux, and Mac, and on x64, Arm64, Arm32, and Apple silicon processors, to make sure that my app behaves consistently.

If I had to guess, this sounds to me like maybe something somewhere might be transmuting between different fat pointer types instead of properly upcasting.

Specifically, given you have

pub trait TopologyStateWrite: TopologyWrite + StateWrite {...}

it might be an improper conversion of dyn TopologyStateWrite to dyn StateWrite, given that AFAIR in the current compiler implementation, vtables tend to start with the methods of the first supertrait. [So the Nth method vtable entry of TopologyWrite would probably also be the Nth method vtable entry of TopologyStateWrite.]
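To make "fat pointer" concrete: a reference to a trait object carries a data pointer and a vtable pointer, which is why reinterpreting one trait object type as another can pair the data with the wrong vtable. The sketch below only inspects sizes and the data address. The `Greeter` trait and `English` type are hypothetical, and the two-word size is what current compilers do rather than a documented guarantee.

```rust
use std::mem::size_of;

// Hypothetical trait and type, used only to have something to point at.
trait Greeter {
    fn greet(&self) -> &'static str;
}

struct English {
    id: u32,
}

impl Greeter for English {
    fn greet(&self) -> &'static str {
        "hello"
    }
}

/// Size of an ordinary (thin) reference, in bytes.
fn thin_ref_size() -> usize {
    size_of::<&English>()
}

/// Size of a trait-object (fat) reference, in bytes.
fn fat_ref_size() -> usize {
    size_of::<&dyn Greeter>()
}

/// Data-pointer half of a fat reference; the cast discards the vtable pointer.
fn data_address(obj: &dyn Greeter) -> *const () {
    obj as *const _ as *const ()
}

fn main() {
    let e = English { id: 7 };
    let obj: &dyn Greeter = &e;

    println!("thin: {} bytes, fat: {} bytes", thin_ref_size(), fat_ref_size());
    println!(
        "same data address: {}",
        data_address(obj) == &e as *const English as *const ()
    );
    println!("{} from #{}", obj.greet(), e.id);
}
```

The data half is the same address as the original value; only the vtable half decides which function a method call reaches.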

As someone who has personally found compiler bugs related to trait objects (so the bugs I could find have been fixed[1]), I'd tend to believe incorrect unsafe code is more likely in this case. Edit: My belief was wrong, see the following replies :sweat_smile:

I'll try to take a glance at the code base and see if I can spot something obvious.


Edit: So far, I did find an example of a proper compiler-supported upcast:

    fn write(&mut self, data: &dyn super::TopologyStateWrite) -> Result<(), FileFormatError> {
        self.write_topology(data as &dyn TopologyWrite)?;
        self.write_state(data as &dyn StateWrite)?;
        Ok(())
    }

So if there was, for some reason, something seriously bugged in the compiler, and the data as &dyn StateWrite here somehow didn't do what it's supposed to do (which is dereferencing the TopologyStateWrite vtable to read the pointer to the StateWrite vtable out of it, and constructing a new fat pointer from the data address part of the original &dyn TopologyStateWrite and the vtable pointer just read), then this could be a place of interest.
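For comparison, a common pattern from before built-in upcasting was to spell the conversion out as a trait method, so that each concrete type produces the supertrait object itself instead of the compiler deriving it from the vtable. A minimal sketch follows; the helper names `as_topology_write` and `as_state_write`, the `Dummy` type, and the simplified method signatures are all hypothetical and not taken from molar. Whether this pattern sidesteps the problem in this thread has not been tested here.

```rust
// Simplified stand-ins; the method signatures are assumptions for illustration.
trait TopologyWrite {
    fn num_bonds(&self) -> usize;
}

trait StateWrite {
    fn get_box(&self) -> &'static str;
}

trait TopologyStateWrite: TopologyWrite + StateWrite {
    // Hypothetical helpers: each implementor returns itself as the supertrait
    // object, so the conversion starts from the concrete type.
    fn as_topology_write(&self) -> &dyn TopologyWrite;
    fn as_state_write(&self) -> &dyn StateWrite;
}

// Hypothetical concrete type.
struct Dummy;

impl TopologyWrite for Dummy {
    fn num_bonds(&self) -> usize {
        3
    }
}

impl StateWrite for Dummy {
    fn get_box(&self) -> &'static str {
        "box"
    }
}

impl TopologyStateWrite for Dummy {
    fn as_topology_write(&self) -> &dyn TopologyWrite {
        self
    }

    fn as_state_write(&self) -> &dyn StateWrite {
        self
    }
}

// Same shape as the `write` method above, with the `as` casts replaced by
// the explicit helper calls.
fn write(data: &dyn TopologyStateWrite) -> (usize, &'static str) {
    let topology = data.as_topology_write();
    let state = data.as_state_write();
    (topology.num_bonds(), state.get_box())
}

fn main() {
    let (bonds, pbox) = write(&Dummy);
    println!("bonds = {bonds}, box = {pbox}");
}
```

The cost is boilerplate in every implementor, which is what the built-in `as &dyn StateWrite` coercion is meant to remove.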


  1. The search link in this sentence currently still contains one open issue, but that's merely a lifetime-related one, so it's irrelevant to the issue you're seeing ↩︎

Playground

pub trait RandomAtomProvider {
    fn foo(&self) { unreachable!() }
}

pub trait BoxProvider {
    fn foo(&self) { unreachable!() }
}

pub trait TimeProvider {
    fn foo(&self) { unreachable!() }
    fn get_time(&self) { println!("TimeProvider::get_time was called correctly"); }
}

pub trait TopologyWrite: RandomAtomProvider {}
pub trait StateWrite: BoxProvider + TimeProvider {}
pub trait TopologyStateWrite: TopologyWrite + StateWrite {}

/* ... */

fn main() {
    TimeProvider::get_time(&() as &dyn TopologyStateWrite as &dyn StateWrite);
}

Exited with signal 11 (SIGSEGV): segmentation violation

Ouch – okay, that does look more like a compiler bug now. And nightly doesn't do it, though… Edit: Nor does beta.

I think it’s


Edit: I see. Looking at the discussion, it sounds like it made it into a beta backport (which is why beta does not show the issue in the playground), so the fix will be stable with Rust 1.90, which gets released 4 days from now.

on stable, a release build triggers a rustc panic, but a debug build does not:

assertion failed: &vtable_entries[..vtable_entries_b.len()] == vtable_entries_b

EDIT:

it's the same assertion failure reported by miri.