What doesn't Miri catch?

I am reading here:

All that said, be aware that Miri does not catch every violation of the Rust specification in your program, not least because there is no such specification. Miri uses its own approximation of what is and is not Undefined Behavior in Rust. To the best of our knowledge, all Undefined Behavior that has the potential to affect a program's correctness is being detected by Miri (modulo bugs), but you should consult the Reference for the official definition of Undefined Behavior,

I am having some trouble understanding this, as to me it almost seems to contradict itself. I think it would be clearer if it said Miri does in principle detect all UB, but it may have bugs, or there may be gray areas where it isn't entirely clear was is UB and what isn't. Also Miri (especially I think without Tree borrows enabled) might reject valid programs. But anyway, is there an example of a program with clear UB that Miri doesn't catch?

3 Likes

One interesting case is that of layouts. You might write code that makes assumptions about layout that are not guaranteed. If the layout happens to match what you assumed, then miri will accept it. For instance, casting &Fieldtype to &StructType works for

struct StructType {
    field: FieldType,
}

even though StructType does not have a #[repr(transparent)] annotation. This is because although StructType could have a layout with padding at the beginning, this does not happen in practice.

7 Likes

Another is that of language vs library UB. Miri only catches language UB, so this will pass miri:

let mut v: Vec<u8> = Vec::with_capacity(10);
unsafe { v.set_len(10) };

even though its documentation says:

Safety

  • new_len must be less than or equal to capacity().
  • The elements at old_len..new_len must be initialized.

The call to set_len clearly violates the second safety requirement and is therefore UB. But miri only catches language UB, and this is only library UB.

(Of course, once you actually read the uninitialized bytes, that turns it into language UB.)

4 Likes

(member of T-opsem but this is not normative)

Miri does catch all instances of clear language UB; if a case were clear and Miri didn't catch it yet, we'd update Miri to catch it, as it's clearly UB (and as a corollary, to be clearly UB, it is relatively straightforward to detect).

The closest to a counterexample would likely be latent library UB, where the library contract declares something to be UB but no immediate detectable language UB occurs yet.

This statement is in significant part hedging. As it says, we believe Miri accurately diagnoses any language UB in an execution (at least with strict provenance (i.e. absence of int2ptr casts) and assuming a specific borrow model), but Miri is not the determiner of what is UB and what isn't. As much as we have an actual definition, that determiner would be the Reference, and the Reference is actually quite lax at the moment both in what it confidently declares certainly UB and what it declares certainly not UB.

3 Likes

I think when I first read it, I read the first sentence, and took it to be 100% true, and didn't manage to read (or at any rate comprehend) the other sentences leaving me with an understanding that was more wrong than right. Perhaps the hedging could be relegated to a footnote rather than the headline, to help people like myself start with the right general impression.

This library UB appears to be a case of having a "safety line" which is not meant to be crossed, but the actual UB doesn't occur unless you go a bit further on, rather than stepping back to safety.

For example I imagine calling setlen with an incorrect value, then immediately calling setlen to restore the old correct value probably will not produce UB in spite of what the documentation says. I can certainly understand that Miri will only detect the actual UB rather than crossing the safety line.

[ It also seems possible my imagination here is wrong, so I wouldn't want to rely on this ]

There’s a related point regarding alignment requirements: If you ask the allocator for a lower alignment than actually required, you might still get lucky and be given a block of memory that is of the alignment you need. Miri will only detect an issue if the alignment is actually wrong on the run that it sees.

This is slightly more problematic than the type layout case because it can become instantiated UB between runs, and not just between compilations.

3 Likes

This is a good analogy and I think you do correctly understand the difference between "library" and "language" UB here. The one caveat that I'll add is that std (and any other library) implementation can change, and with that change can change when library UB is turned into language UB.

The overly pedantic may also get into "latent" UB versus "manifest" UB, especially when it's a class of UB reserved by Rust but not communicated to LLVM. But it's still UB and shall be avoided just the same.

Miri has an optional symbolic alignment mode which prohibits ever taking advantage of such "happenstance" alignment, which is nice to have. Unfortunately it prohibits perfectly allowed code which checks the runtime alignment (e.g. why align_to is allowed to not provide an aligned segment) thus being off by default.

Maybe now that the warning infrastructure exists for int2ptr casts (exposed provenance), the default symbolic alignment mode could warn when happenstance alignment is relied on?

3 Likes

I've seen that with my code. Placing things on the stack: 1. Was more stable. If the alignment was correct in a given function it tended to stay correct. 2. Was more likely to trigger Miri. The compiler seems to want to waste as little stack space as possible. For example, it really did seem like, when there were two or more 1-byte aligned things, a coin flip determined if they ended up at odd addresses.

On 64 bit computers, everything heap allocated seems to be 8-byte aligned. Which makes sense. And also makes it difficult to impossible to find alignment mistakes.

That would have helped me and would be appreciated!

1 Like

std could read all or some of the values there and then forget them when you do the first set_len call, which would make this immediate UB. This behavior is not precluded by the API contract, therefore it is a permissible future implementation, perhaps one added for debug purposes. Even if the method said it's O(1) that would not save your skin since it could randomly select some element and read that.
Without explicit API guarantee you can't distinguish between immediate and deferred UB.

1 Like

Yes, but in this context Miri can only check the code ( without treating the std library as special in some way, or reading and somehow comprehending the safety comment ), and since the actual code is not UB by itself ( it simply assigns the assigned length after checking it against capacity ) there is no way for Miri to detect it is UB.

1 Like

I was saying that your assumption

For example I imagine calling setlen with an incorrect value, then immediately calling setlen to restore the old correct value probably will not produce UB in spite of what the documentation says.

is incorrect because you can't assume a particular implementation.

But if the library did something inside set_len that's immediate UB then yeah, miri would likely detect that.

Vec could use specialization to read the contents on set_len whenever T = u32. So it's not true that this kind of library UB can't be language UB.

Of course, that's a pretty contrived example, but the same principle applies for other examples of library UB without being contrived. Library UB can be made into language UB without it being a breaking change.

1 Like

That's why I also said:

[ It also seems possible my imagination here is wrong, so I wouldn't want to rely on this ]

I think it is pretty obvious though that the std library UB comment is really stating the related language UB economically, and it is really an academic point whether it has the same meaning or an extra meaning, as (1) nobody is going to call setlen with an invalid value (except by mistake) - it isn't useful, and also the code for setlen is not going to change either.

On x86-64, it should be 16 byte aligned, because this is the alignment guaranteed by malloc (or HeapAlloc on Windows). malloc is guaranteed by the C standard to return pointers sufficiently aligned for all native C types. That would seem to be 8 on 64-bit systems, but x86-64 always has SSE2 instructions and types, and SSE2 types require alignment of 16 for full performance.

You can look up the constant MIN_ALIGN in the sources of Rust stdlib. Allocations with alignment not exceeding MIN_ALIGN are just forwarded as-is to the OS memory allocation routine.

2 Likes

Imagine a variant on Vec that checks a global "low memory" state, and immediately reduces capacity if possible when you call set_len, resize, remove or other methods that could reduce the number of elements in the Vec while in low memory state. This complies with the current API contracts of Vec, but would then break your code, since the capacity would have reduced after the first call to set_len.

1 Like

Caveat: it only applies to the default allocator. For example, with a simple bump-based global allocator it would not be true.

1 Like

Another interesting case is references to uninitialized memory. All currently proposed models permit such, but it's also current consensus that Rust as currently defined (i.e. by the Reference) doesn't permit such, and we'd be able to say &mut *MaybeUninit::uninit().as_mut_ptr() is immediate language UB if we wanted to.

Note, however, that mu.assume_init_mut() is still library UB even if &mut *mu.as_mut_ptr() isn't language UB, as there's a library-level assertion that the contained value has been initialized. It could even be implemented to cause language UB by doing something along the lines of mem::forget(mu.assume_init_read()) which asserts initialization without any further side effects.


The Reference effectively lists a number of superpowers which unsafe is permitted to do, and anything which isn't on that list is "undefined behavior" in the weak and informal sense that nothing has provided a definition for what the behavior is. But Rust is attempting to treat UB in a more principled manner and "define" exactly when UB operationally "happens" on the abstract machine that we're emulating. This gives us a lot of nice properties, like the ability to have Miri be a perfect sanitizing implementation in the first place.

C++ is perhaps slowly approaching this — anything constexpr does have perfect UB sanitization possible — but it's still imperfect. C++20 permits allocation during sanitized consteval, but even though Rust's const still significantly tails C++ constexpr, what C++ constexpr can evaluate is still behind what Miri unleashed can do. (The most obvious example being that the <cmath> header functionality isn't made constexpr until C++26.)

(I'd be quite surprised if it doesn't exist yet, but I'm not aware of any cargo-miri equivalent for C++ which takes the constexpr evaluator and tries to run the runtime program with that sanitizing virtual machine interpreter. Nor to make stdio work in such an environment.)

3 Likes

That's the kind of pedantic distinction that I hope won't be officially in the language. What are the reasons to permit &mut uninit other than "people think they are allowed to do that with io::Write"?

It's, not, it's part of the library, not the language :stuck_out_tongue_closed_eyes:

But in practicality, the method will have the straightforward definition. Relying on that is technically relying on implementation details, but std isn't adversarial.

The std rarely documents "validity" and "safety" independently, because it's a subtle detail that doesn't matter in 99% of cases. Just because something has defined behavior doesn't mean that deliberately relying on that behavior is endorsed practice.

Because any scheme to make it invalid at a language level is worse without conferring any benefits to compiler reasoning. (The compiler can only reason based on validity requirements, but developers should be reasoning based on safety requirements.) I wrote about &mut uninit specifically previously:

In reductive short, the reason to permit it is that there isn't any compelling reason to forbid it, and, all else being equal, less UB is always preferable.

It is actively seeming like we're wanting for a category of "erroneous behavior" for things like this — behavior which has a defined result, but which is still considered an unsafe programming error that sanitizing environments should be able to diagnose without being considered incorrect false positives.

Exposed provenance is another notable candidate for "erroneous behavior," although probably the most contentious one since it's standard practice on mainstream targets (and doesn't have a practical stable alternative yet).

2 Likes