Dear all,
I need the help of seasoned Rustaceans because it seems that I stumbled upon a compiler bug or some very obscure problem that I can't debug myself.
I have a very weird issue when working with trait objects: the wrong methods are called through the vtable and the program crashes with memory corruption. The biggest problem is that I can't isolate the bug. It only happens inside the rather complex code base of the molar crate, and it's impossible to provide a small self-contained example.
Then I have a method that takes a StateWrite trait object, like this:
fn write_state(&mut self, data: &dyn StateWrite) -> Result<(), super::FileFormatError> {
    ...
    let b = data.get_box();
    ...
}
This should call the method get_box() from the BoxProvider trait, which is a bound of StateWrite. However, when I step over it in the debugger, a completely different method, num_bonds(), is called instead, which is from the RandomBondProvider trait (a bound of TopologyWrite)!
After some experimentation I found that, whatever I do, the compiler calls the N-th method of TopologyWrite (in the order of declaration) when I ask for the N-th method of StateWrite! So it literally uses the wrong vtable, or the wrong offset in a vtable, or something similar.
This looks like a compiler bug because I don't see anything that cursed in the code (although it uses some unsafe things under the hood).
Can someone familiar with vtable magic help me trace this either to a compiler bug or to my own mistake?
I did. According to my (limited) knowledge I'm not doing anything that can corrupt the vtables like this. I never mess with them in any way, and all the unsafe code is in different, unrelated places.
As has already been said, incorrect unsafe code can have effects entirely unrelated to what it was supposed to be doing. You should not assume that just because the purpose of your unsafe code is unrelated to vtables that it isn't having an effect on them.
If possible, you should test your unsafe code with Miri. If you run the test case where the wrong vtable is being used, under Miri, then either it misbehaves without Miri reporting undefined behavior (in which case it might be a compiler bug), or Miri will point right to the erroneous operation. (However, if your test program involves non-Rust code, this will not be feasible, because Miri can only check Rust.)
In any case, you should review your unsafe code, not from the perspective of “could this unsafe code cause the problem I am having”, but rather “is this unsafe code sound? is there any way whatsoever it could perform any UB?”. If you do not approach it this skeptically, you will miss problems.
Then, if you still have no leads, it is time to start working on bug minimization. Make a local git branch of molar's code. Copy your test program into the repo if it isn't already. Then, start removing code that is not relevant to reproducing the bug. You must give up on the idea that the library should correctly perform its duties; you don't care if the program produces the right answer. You only care whether it exhibits the vtable misbehavior. Remove entire modules. Stub out functions. Simplify code until you can delete more of it. Cut and cut and cut. Regularly git commit so that if you go too far and the problem vanishes, you can roll back. Keep doing this and you will find the problem — whether it is a problem in your unsafe code or in the compiler.
Unfortunately, I do not use Cargo, so I can't test your case, but I recommend trying it with another version of the compiler or on another platform. For example, I keep rustc built from git, stable rustc from rustup, and nightly rustc from rustup on Windows, Linux, and Mac, across x64, Arm64, Arm32, and Apple silicon processors, to make sure that my app behaves consistently.
If I had to guess, this sounds to me like maybe something somewhere might be transmuting between different fat pointer types instead of properly upcasting.
It might be an improper conversion of dyn TopologyStateWrite to dyn StateWrite, given that AFAIR in the current compiler implementation, vtables tend to start with the methods of the first supertrait. [So the Nth method vtable entry of TopologyWrite would probably also be the Nth method vtable entry of TopologyStateWrite.]
As someone myself who has found compiler bugs related to trait objects (and hence the bugs I could find were fixed[1]), I'd tend to believe incorrect unsafe code is more likely in this case.
Edit: My belief was wrong, see the following replies.
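To make the difference between transmuting and properly converting concrete, here is a minimal sketch of a sound conversion. The trait and type definitions are hypothetical stand-ins that only borrow names from this thread; molar's real traits are not shown here and surely differ. The helper methods as_topology and as_state are also assumptions made for illustration. Each one coerces from the concrete type, so the compiler chooses the vtable and no fat pointer is ever reinterpreted.

```rust
// Hypothetical stand-ins for the traits discussed in this thread.
// The real definitions in molar may look quite different.
trait TopologyWrite {
    fn num_bonds(&self) -> usize;
}

trait StateWrite {
    fn get_box(&self) -> Option<[f32; 3]>;
}

trait TopologyStateWrite: TopologyWrite + StateWrite {
    // Explicit, safe conversions (assumed helpers, not molar's API).
    fn as_topology(&self) -> &dyn TopologyWrite;
    fn as_state(&self) -> &dyn StateWrite;
}

// Hypothetical concrete type used only to exercise the traits.
struct Dummy {
    bonds: usize,
    cell: [f32; 3],
}

impl TopologyWrite for Dummy {
    fn num_bonds(&self) -> usize {
        self.bonds
    }
}

impl StateWrite for Dummy {
    fn get_box(&self) -> Option<[f32; 3]> {
        Some(self.cell)
    }
}

impl TopologyStateWrite for Dummy {
    // Unsizing coercion from the concrete type: the compiler fills in
    // the correct vtable pointer for each target trait.
    fn as_topology(&self) -> &dyn TopologyWrite {
        self
    }
    fn as_state(&self) -> &dyn StateWrite {
        self
    }
}

// Calls one method through each converted trait object.
fn describe(data: &dyn TopologyStateWrite) -> (usize, Option<[f32; 3]>) {
    (data.as_topology().num_bonds(), data.as_state().get_box())
}

fn main() {
    let d = Dummy { bonds: 7, cell: [1.0, 2.0, 3.0] };

    // Each call dispatches to the method of the trait that was asked for.
    assert_eq!(describe(&d), (7, Some([1.0, 2.0, 3.0])));

    // The converted trait object still points at the same data; only the
    // vtable half of the fat pointer differs between the two trait objects.
    let combined: &dyn TopologyStateWrite = &d;
    let state: &dyn StateWrite = combined.as_state();
    let state_addr = state as *const dyn StateWrite as *const ();
    let dummy_addr = &d as *const Dummy as *const ();
    assert!(std::ptr::eq(state_addr, dummy_addr));

    println!("dispatch and data address both look correct");
}
```

A check like this does not prove anything about the real code base, but a harness of this shape, where every method returns a distinguishable value, is one way to detect a "wrong N-th method" dispatch without stepping through a debugger.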
I'll try to take a glance at the code base and see if I can spot something obvious.
Edit: So, thus far I did find an example of a proper, compiler-supported upcast:
fn write(&mut self, data: &dyn super::TopologyStateWrite) -> Result<(), FileFormatError> {
    self.write_topology(data as &dyn TopologyWrite)?;
    self.write_state(data as &dyn StateWrite)?;
    Ok(())
}
So if for some reason something were seriously bugged in the compiler, and the data as &dyn StateWrite here somehow didn't do what it's supposed to do, then this could be a place of interest. What it is supposed to do is dereference the TopologyStateWrite vtable to read the pointer to the StateWrite vtable out of it, and then construct a new fat pointer from the address part of the original &dyn TopologyStateWrite and the vtable pointer just read.
Edit: I see. Looking at the discussion, it sounds like it made it into a beta backport (which is why beta does not show the issue in the playground), so the fix will be stable with Rust 1.90, which gets released 4 days from now.