What doesn't Miri catch?

geebee22 · May 11, 2024, 6:51pm

I am reading here:

All that said, be aware that Miri does not catch every violation of the Rust specification in your program, not least because there is no such specification. Miri uses its own approximation of what is and is not Undefined Behavior in Rust. To the best of our knowledge, all Undefined Behavior that has the potential to affect a program's correctness is being detected by Miri (modulo bugs), but you should consult the Reference for the official definition of Undefined Behavior,

I am having some trouble understanding this, as to me it almost seems to contradict itself. I think it would be clearer if it said Miri does in principle detect all UB, but it may have bugs, or there may be gray areas where it isn't entirely clear what is UB and what isn't. Also Miri (especially, I think, without Tree Borrows enabled) might reject valid programs. But anyway, is there an example of a program with clear UB that Miri doesn't catch?

alice · May 11, 2024, 8:10pm

One interesting case is that of layouts. You might write code that makes assumptions about layout that are not guaranteed. If the layout happens to match what you assumed, then miri will accept it. For instance, casting &FieldType to &StructType works for

struct StructType {
    field: FieldType,
}

even though StructType does not have a #[repr(transparent)] annotation. This is because although StructType could have a layout with padding at the beginning, this does not happen in practice.
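For contrast, here is a minimal sketch of the same cast made sound by guaranteeing the layout. FieldType is never defined in the thread, so the tuple struct below is a stand-in, and wrap_ref is a hypothetical helper name.

```rust
// Stand-in for the FieldType mentioned above (an assumption for illustration).
#[derive(Debug, PartialEq)]
struct FieldType(u32);

// With repr(transparent), StructType is guaranteed to have the same layout
// as its single field, so the cast below no longer relies on luck.
#[repr(transparent)]
struct StructType {
    field: FieldType,
}

// Hypothetical helper: reinterpret a reference to the field as a reference
// to the wrapper.
fn wrap_ref(f: &FieldType) -> &StructType {
    // SAFETY: StructType is repr(transparent) over FieldType, so both types
    // have identical size, alignment, and validity requirements.
    unsafe { &*(f as *const FieldType as *const StructType) }
}
```

Without the attribute the same function would very likely still pass miri today, which is exactly the problem described above: the test run only observes the layout the compiler happened to choose.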

alice · May 11, 2024, 8:17pm

Another is that of language vs library UB. Miri only catches language UB, so this will pass miri:

let mut v: Vec<u8> = Vec::with_capacity(10);
unsafe { v.set_len(10) };

even though its documentation says:

Safety

new_len must be less than or equal to capacity().

The elements at old_len..new_len must be initialized.

The call to set_len clearly violates the second safety requirement and is therefore UB. But miri only catches language UB, and this is only library UB.

(Of course, once you actually read the uninitialized bytes, that turns it into language UB.)
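A sketch of one way to stay inside the documented contract is to initialize the spare capacity first and only then call set_len. The function name filled is made up for this example.

```rust
// Hypothetical helper: build a Vec<u8> of length `n` where every element is
// `byte`, going through the uninitialized spare capacity.
fn filled(n: usize, byte: u8) -> Vec<u8> {
    let mut v: Vec<u8> = Vec::with_capacity(n);

    // spare_capacity_mut exposes the uninitialized tail as MaybeUninit<u8>.
    // The real capacity may exceed `n`, so only the first `n` slots are used.
    for slot in v.spare_capacity_mut().iter_mut().take(n) {
        slot.write(byte);
    }

    // SAFETY: n <= capacity (requested above), and elements 0..n were just
    // initialized, so both documented requirements of set_len hold.
    unsafe { v.set_len(n) };
    v
}
```

Here neither the library contract nor the language rules are violated, so there is nothing for miri to miss.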

CAD97 · May 11, 2024, 9:05pm

(member of T-opsem but this is not normative)

Miri does catch all instances of clear language UB; if a case were clear and Miri didn't catch it yet, we'd update Miri to catch it, as it's clearly UB (and as a corollary, to be clearly UB, it is relatively straightforward to detect).

The closest to a counterexample would likely be latent library UB, where the library contract declares something to be UB but no immediate detectable language UB occurs yet.

This statement is in significant part hedging. As it says, we believe Miri accurately diagnoses any language UB in an execution (at least with strict provenance (i.e. absence of int2ptr casts) and assuming a specific borrow model), but Miri is not the determiner of what is UB and what isn't. As much as we have an actual definition, that determiner would be the Reference, and the Reference is actually quite lax at the moment both in what it confidently declares certainly UB and what it declares certainly not UB.

geebee22 · May 11, 2024, 9:22pm

I think when I first read it, I read the first sentence, and took it to be 100% true, and didn't manage to read (or at any rate comprehend) the other sentences leaving me with an understanding that was more wrong than right. Perhaps the hedging could be relegated to a footnote rather than the headline, to help people like myself start with the right general impression.

geebee22 · May 12, 2024, 7:11am

This library UB appears to be a case of having a "safety line" which is not meant to be crossed, but the actual UB doesn't occur unless you go a bit further on, rather than stepping back to safety.

For example I imagine calling set_len with an incorrect value, then immediately calling set_len to restore the old correct value probably will not produce UB in spite of what the documentation says. I can certainly understand that Miri will only detect the actual UB rather than crossing the safety line.

[ It also seems possible my imagination here is wrong, so I wouldn't want to rely on this ]

2e71828 · May 12, 2024, 7:49am

There’s a related point regarding alignment requirements: If you ask the allocator for a lower alignment than actually required, you might still get lucky and be given a block of memory that is of the alignment you need. Miri will only detect an issue if the alignment is actually wrong on the run that it sees.

This is slightly more problematic than the type layout case because it can become instantiated UB between runs, and not just between compilations.
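As a rough illustration, the sketch below requests memory with the layout the stored type actually needs, so the result does not depend on what the allocator happens to hand out. The function name roundtrip_u64 is an assumption for the example; the buggy variant appears only in a comment.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

// Hypothetical helper: store a u64 in a fresh heap allocation and read it back.
fn roundtrip_u64(value: u64) -> u64 {
    // Correct: size and alignment are both taken from the type being stored.
    // The mistake discussed above would be something like
    // `Layout::from_size_align(8, 1)`, which only works when the allocator
    // happens to return a sufficiently aligned block on that particular run.
    let layout = Layout::new::<u64>();

    unsafe {
        let raw = alloc(layout);
        if raw.is_null() {
            handle_alloc_error(layout);
        }
        let p = raw.cast::<u64>();
        // SAFETY: `raw` is non-null and was allocated with u64's layout, so it
        // is valid and suitably aligned for writing and reading one u64.
        p.write(value);
        let out = p.read();
        // SAFETY: same pointer and same layout as the allocation above.
        dealloc(raw, layout);
        out
    }
}
```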

CAD97 · May 12, 2024, 9:07am

This is a good analogy and I think you do correctly understand the difference between "library" and "language" UB here. The one caveat that I'll add is that the implementation of std (and of any other library) can change, and such a change can move the point at which library UB turns into language UB.

The overly pedantic may also get into "latent" UB versus "manifest" UB, especially when it's a class of UB reserved by Rust but not communicated to LLVM. But it's still UB and shall be avoided just the same.

Miri has an optional symbolic alignment mode which prohibits ever taking advantage of such "happenstance" alignment, which is nice to have. Unfortunately it also rejects perfectly allowed code which checks the runtime alignment (this is why align_to is allowed to not provide an aligned segment), and so it is off by default.
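To show what alignment-agnostic code can look like, here is a small sketch built on slice::align_to that produces the same answer however the slice is split. The function name byte_sum is made up. As far as I know the symbolic mode is enabled with the -Zmiri-symbolic-alignment-check flag, but treat that spelling as something to confirm in the Miri README.

```rust
// Hypothetical helper: sum all bytes of a slice, processing the aligned middle
// part as u32 words. The result must not depend on where the split falls.
fn byte_sum(bytes: &[u8]) -> u64 {
    // SAFETY: every bit pattern is a valid u32, so viewing suitably aligned
    // bytes as u32 values is fine. align_to may legally return everything in
    // `prefix` and leave `words` empty.
    let (prefix, words, suffix) = unsafe { bytes.align_to::<u32>() };

    // Unaligned edges are handled one byte at a time.
    let edges: u64 = prefix
        .iter()
        .chain(suffix.iter())
        .map(|&b| u64::from(b))
        .sum();

    // The aligned middle is read as words, then split back into bytes.
    let middle: u64 = words
        .iter()
        .flat_map(|w| w.to_ne_bytes())
        .map(u64::from)
        .sum();

    edges + middle
}
```

Because the function never assumes that any particular element ends up in the aligned middle, it stays correct whether align_to returns a large aligned segment or none at all.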

Maybe now that the warning infrastructure exists for int2ptr casts (exposed provenance), the default symbolic alignment mode could warn when happenstance alignment is relied on?

Coding-Badly · May 12, 2024, 3:18pm

I've seen that with my code. Placing things on the stack: 1. Was more stable. If the alignment was correct in a given function it tended to stay correct. 2. Was more likely to trigger Miri. The compiler seems to want to waste as little stack space as possible. For example, it really did seem like, when there were two or more 1-byte aligned things, a coin flip determined if they ended up at odd addresses.

On 64 bit computers, everything heap allocated seems to be 8-byte aligned. Which makes sense. And also makes it difficult to impossible to find alignment mistakes.

That would have helped me and would be appreciated!

the8472 · May 12, 2024, 3:56pm

std could read all or some of the values there and then forget them when you do the first set_len call, which would make this immediate UB. This behavior is not precluded by the API contract, therefore it is a permissible future implementation, perhaps one added for debug purposes. Even if the method said it's O(1) that would not save your skin since it could randomly select some element and read that.
Without an explicit API guarantee, you can't distinguish between immediate and deferred UB.

geebee22 · May 12, 2024, 4:51pm

Yes, but in this context Miri can only check the code ( without treating the std library as special in some way, or reading and somehow comprehending the safety comment ), and since the actual code is not UB by itself ( it simply assigns the assigned length after checking it against capacity ) there is no way for Miri to detect it is UB.

the8472 · May 12, 2024, 8:52pm

I was saying that your assumption

For example I imagine calling set_len with an incorrect value, then immediately calling set_len to restore the old correct value probably will not produce UB in spite of what the documentation says.

is incorrect because you can't assume a particular implementation.

But if the library did something inside set_len that's immediate UB then yeah, miri would likely detect that.

alice · May 12, 2024, 9:07pm

Vec could use specialization to read the contents on set_len whenever T = u32. So it's not true that this kind of library UB can't be language UB.

Of course, that's a pretty contrived example, but the same principle applies for other examples of library UB without being contrived. Library UB can be made into language UB without it being a breaking change.
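To make the principle concrete, here is a sketch of an entirely hypothetical buffer type (Buf is not a std type) whose set_len has the same kind of contract as Vec::set_len but eagerly reads the newly exposed elements. Callers who respect the contract cannot tell the difference; callers who violate it would hit language UB inside set_len itself.

```rust
use std::mem::MaybeUninit;

/// Hypothetical fixed-capacity byte buffer, invented for illustration.
pub struct Buf {
    storage: Box<[MaybeUninit<u8>]>,
    len: usize,
    checksum: u8,
}

impl Buf {
    pub fn with_capacity(cap: usize) -> Self {
        Buf {
            storage: vec![MaybeUninit::<u8>::uninit(); cap].into_boxed_slice(),
            len: 0,
            checksum: 0,
        }
    }

    /// The not-yet-initialized tail of the buffer.
    pub fn spare(&mut self) -> &mut [MaybeUninit<u8>] {
        &mut self.storage[self.len..]
    }

    /// # Safety
    /// `new_len` must not exceed the capacity, and the elements at
    /// `old_len..new_len` must be initialized.
    pub unsafe fn set_len(&mut self, new_len: usize) {
        // Permitted by the contract: read every newly exposed element.
        // If the caller lied about initialization, this read is language UB.
        let start = self.len.min(new_len);
        for slot in &self.storage[start..new_len] {
            self.checksum ^= unsafe { slot.assume_init_read() };
        }
        self.len = new_len;
    }

    pub fn checksum(&self) -> u8 {
        self.checksum
    }
}
```

Swapping a trivial set_len for this one would not be a breaking change under the stated safety contract, which is why relying on the current implementation is fragile.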

geebee22 · May 12, 2024, 9:28pm

That's why I also said:

[ It also seems possible my imagination here is wrong, so I wouldn't want to rely on this ]

I think it is pretty obvious though that the std library UB comment is really stating the related language UB economically, and it is really an academic point whether it has the same meaning or an extra meaning, as (1) nobody is going to call set_len with an invalid value (except by mistake), since it isn't useful, and (2) the code for set_len is not going to change either.

afetisov · May 13, 2024, 12:59am

On x86-64, it should be 16 byte aligned, because this is the alignment guaranteed by malloc (or HeapAlloc on Windows). malloc is guaranteed by the C standard to return pointers sufficiently aligned for all native C types. That would seem to be 8 on 64-bit systems, but x86-64 always has SSE2 instructions and types, and SSE2 types require alignment of 16 for full performance.

You can look up the constant MIN_ALIGN in the sources of Rust stdlib. Allocations with alignment not exceeding MIN_ALIGN are just forwarded as-is to the OS memory allocation routine.

farnz · May 13, 2024, 11:11am

Imagine a variant on Vec that checks a global "low memory" state, and immediately reduces capacity if possible when you call set_len, resize, remove or other methods that could reduce the number of elements in the Vec while in low memory state. This complies with the current API contracts of Vec, but would then break your code, since the capacity would have reduced after the first call to set_len.

newpavlov · May 13, 2024, 4:47pm

Caveat: it only applies to the default allocator. For example, with a simple bump-based global allocator it would not be true.

CAD97 · May 13, 2024, 8:34pm

Another interesting case is references to uninitialized memory. All currently proposed models permit such, but it's also current consensus that Rust as currently defined (i.e. by the Reference) doesn't permit such, and we'd be able to say &mut *MaybeUninit::uninit().as_mut_ptr() is immediate language UB if we wanted to.

Note, however, that mu.assume_init_mut() is still library UB even if &mut *mu.as_mut_ptr() isn't language UB, as there's a library-level assertion that the contained value has been initialized. It could even be implemented to cause language UB by doing something along the lines of mem::forget(mu.assume_init_read()) which asserts initialization without any further side effects.
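For reference, a minimal sketch of the ordering that satisfies both the language rules and the library contract: initialize through MaybeUninit::write first, and only then call assume_init_mut. The function name bump is made up.

```rust
use std::mem::MaybeUninit;

// Hypothetical helper: initialize a slot, mutate it through a reference,
// then take the value out.
fn bump() -> u32 {
    let mut mu = MaybeUninit::<u32>::uninit();

    // Initialize before asserting initialization.
    mu.write(41);

    // SAFETY: `mu` was initialized by the write above, so the library-level
    // precondition of assume_init_mut holds.
    let r: &mut u32 = unsafe { mu.assume_init_mut() };
    *r += 1;

    // SAFETY: still initialized; the reference `r` is no longer used.
    unsafe { mu.assume_init() }
}
```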

The Reference effectively lists a number of superpowers which unsafe is permitted to do, and anything which isn't on that list is "undefined behavior" in the weak and informal sense that nothing has provided a definition for what the behavior is. But Rust is attempting to treat UB in a more principled manner and "define" exactly when UB operationally "happens" on the abstract machine that we're emulating. This gives us a lot of nice properties, like the ability to have Miri be a perfect sanitizing implementation in the first place.

C++ is perhaps slowly approaching this (anything constexpr does have perfect UB sanitization possible), but it's still imperfect. C++20 permits allocation during sanitized consteval, but even though Rust's const still significantly trails C++ constexpr, what C++ constexpr can evaluate is still behind what Miri unleashed can do. (The most obvious example being that the <cmath> header functionality isn't made constexpr until C++26.)

(I'd be quite surprised if it doesn't exist yet, but I'm not aware of any cargo-miri equivalent for C++ which takes the constexpr evaluator and tries to run the runtime program with that sanitizing virtual machine interpreter. Nor to make stdio work in such an environment.)

afetisov · May 14, 2024, 5:17pm

That's the kind of pedantic distinction that I hope won't be officially in the language. What are the reasons to permit &mut uninit other than "people think they are allowed to do that with io::Write"?

CAD97 · May 15, 2024, 3:58am

It's not; it's part of the library, not the language.

But in practicality, the method will have the straightforward definition. Relying on that is technically relying on implementation details, but std isn't adversarial.

The std rarely documents "validity" and "safety" independently, because it's a subtle detail that doesn't matter in 99% of cases. Just because something has defined behavior doesn't mean that deliberately relying on that behavior is endorsed practice.

Because any scheme to make it invalid at a language level is worse without conferring any benefits to compiler reasoning. (The compiler can only reason based on validity requirements, but developers should be reasoning based on safety requirements.) I wrote about &mut uninit specifically previously:

github.com/rust-lang/unsafe-code-guidelines

Document current justification for not requiring recursive reference validity (in particular, `&mut uninit` not being immediate UB)

opened 04:47AM - 05 Jul 22 UTC

CAD97

This post is a draft of my understanding of the problem space, and justification for `&mut uninit` not being UB to hold or pass around between functions.

### What is UB?

This is a very brief summary; see [the glossary](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#undefined-behavior), [various blog posts](https://raphlinus.github.io/programming/rust/2018/08/17/undefined-behavior.html), and https://github.com/rust-lang/unsafe-code-guidelines/issues/253 for more.

In short, UB is a *language-level contract* that some situation *does not happen*. Formally speaking, a *program* cannot "have UB;" UB is a property of some *execution* resulting in some [behavior considered undefined](https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html). Note that this is the *formal* meaning of **undefined**; encountering UB *retroactively* removes *any and all* guarantees about the program execution[^1].

[^1]: The *smallest possible caveat* may informally apply here: external synchronization via observable effects. At each point the Abstract Machine does something observable outside of the Abstract Machine (i.e. does FFI e.g. IO), the observable state of the Abstract Machine must be in the state defined by the execution to this point. The only way for UB to retroactively unguarantee the already observed behavior is if it is not a valid execution to stop the AM at the observable point before the UB occurs. This is, however, merely an informal argument; the guarantee does in fact no longer exist, it is just that there is no known way for a compiler to take advantage of this.

    ***However***, note that not all behavior you may expect to be externally observable necessarily is, and neither is all behavior implemented via FFI. So long as the Abstract Machine has a definition of the operation which does not leave the AM, it is not considered externally observable. The canonical example of this is that while allocation is implemented by calling into the host OS, the AM has an internal definition of allocation and an implementation may arbitrarily call the OS allocation APIs in any defined manner.

Depending on who you ask, UB in C++ may have originally been about allowing *implementation-defined behavior* and implementations to diverge on how they implement the language. Even if this *is* the case, though, every commercial C++ compiler uses UB under the modern understanding for optimization, and a language without UB is one that is *very difficult* if not *impossible* to optimize[^2]. And more importantly, ***Rust is not C++***.

[^2]: You may disagree here, and point to scripting languages like ECMAScript or even Safe Rust as languages that can be optimized while not having UB. But the insight here is that they *still have UB*; the difference is solely that the UB is statically prevented from happening in the surface language. As soon as you go to lower the higher-level language to another target, the set of syntactically valid possibilities extends beyond that of the surface language, and any operation which would be forbidden in the surface language is UB.

With the benefit of hindsight from the experience of C/C++ and other language design in the past 50 years, Rust takes a much more deliberate approach to UB. In particular:

- UB should be *detectable.* It should be practical to write a perfect sanitizing implementation of the Abstract Machine which can say with certainty whether a specific execution is defined (did not attempt to execute UB).
- UB should be *justified.* All other things being equal, it is better for more programs to be defined, because we want it to be reasonable to write programs without UB. As such, making some operation UB should be backed by the properties learned about the program outweighing the cost of developers having to manually prove that invalid operations do not occur.
- UB should be *operational.* This ties into the previous two points; an axiomatic assertion of some property prevents clear diagnosis of UB and doesn't serve the language-provided-guarantee property of promising that some set of operations are invalid and will not occur.

### Why isn't `&mut uninit` currently considered UB by Miri?

In short: because it does no operation that is undefined. Expanding on that a little:

- Validity of memory (e.g. that bytes are not undefined) is asserted when the AM does a [typed copy](https://github.com/rust-lang/unsafe-code-guidelines/issues/84) of the memory from one place to another place.
  - The memory making up the reference itself is asserted to be initialized, non-null, and aligned.
- Validity of borrowing is asserted when references are considered "used."
  - This includes at least when the reference is converted into a place (dereferenced) and when a reference is used as a function argument, as well as when a function taking the reference returns.
- Until the memory at the referenced place undergoes a typed copy, its validity is *not* asserted.

Additionally, writing to an uninitialized place is allowed as this consists of

- running the drop glue for the place at the given type,
  - If the type does not have drop glue, this trivially does not do a typed copy from the place.
- then doing a typed copy of the value to write into the place,
- neither of which assert the validity of the preexisting memory.

Note, however, that *writing* uninit into `&mut init` is *always UB*. This is because this does a typed copy of the written value, which asserts that the written value is initialized.

### Why would `&mut uninit` be considered UB?

There are two operational ways that references to uninitialized memory could be made operationally UB: in borrow validation or during conversion between references and places.

- **Borrow validation**: in addition to the retag operations, the memory validity of the place would be asserted.
- **Ref-to-place**: whenever a reference is converted into a place (dereferenced), the memory validity of the place would be asserted.
- **Place-to-ref**: whenever a place is converted into a reference, the memory validity of the place would be asserted.

But how much memory validity? The easy answer is full memory validity at the referenced type; the minimal answer for the desired property is just that the memory is initialized. However, checking bytes are initialized still requires full type information to know which bytes are potentially padding and thus are allowed to be uninitialized, so full memory validity is simpler to check and cost us nothing extra on the implementation.

There are subtle differences to the properties derivable from when exactly memory validity is asserted, but the purpose of this document is to discuss the fundamental reasons for/against using any of them generally.

### What otherwise valid programs are made UB?

There's at least two notable losses, one obvious, and one not so obvious.

The obvious one is just any program using a type like `&mut [u8]` to reference potentially-uninitialized memory. Many existing implementations of [`io::Read`](https://doc.rust-lang.org/std/io/trait.Read.html#tymethod.read) are written to carefully avoid reading the provided buffer before writing to it, such that they might be used to read into an uninitialized buffer. There is [an existing accepted RFC](https://rust-lang.github.io/rfcs/2930-read-buf.html) allowing for safely reading into an uninitialized buffer[^3], but it would be very unfortunate to make nearly all existing code unsound.

[^3]: In general, you *should* prefer using types like `&mut [MaybeUninit<u8>]` rather than `&[u8]` for writing into potentially uninitialized memory. Even if the existence of the reference is not in and of itself UB, it is still *wildly unsafe*, and using types that allow and potentially track uninitialized memory much better describes the semantics of your program and prevent accidentally exposing references to uninit to downstream code (which is still unsound).

The less obvious one is with pointers. Writing into an uninitialized place (via `=` assignment expression) becomes UB, even if that place is behind a raw pointer. This is because the drop glue semantically calls `std::ptr::drop_in_place(&mut place)`, creating a mutable reference to the place. You can potentially recover writes to places not asserting memory validity of the place by semantically only creating the reference if the place's type has drop glue, but this has further complications around generics (as MIR is produced for the generic function in polymorphic form, and `ptr::drop_in_place` must be called there). It is perhaps a better idea to use `ptr::write` for writing into uninitialized memory anyway, but this adds an additional subtle pitfall to what are supposed to be raw pointers with mostly C-like semantics (so long as you don't create any references).

### What benefit is to be gained?

What `&mut uninit` being UB would theoretically provide is that references could be marked as "`dereferenceable(noundef N)`" where they are currently marked `dereferenceable(N)` in the LLVM backend. Pointee memory validation during borrow validation would likely be enough to justify this; validation, and ref-to-place or place-to-ref time could be enough for "`dereferenceable_on_entry(noundef N)`" if reborrowing for a function argument counts as doing a ref-to-place-to-ref conversion (how you would write it in source, `&*ref`).

However, there is at the time of writing *no known optimization benefit* to eagerly marking references as pointing to known-init memory, neither practical *nor theoretical*. This is due to a simple observation: when the memory is read by the source program, it then undergoes a typed copy which asserts that it is memory valid. So the only *potential* optimization lost is speculative reads. However, we already justify that references must be dereferenceable by the borrow validity rules, so it is perfectly fine to speculatively read memory from behind a reference. It is even valid to make decisions based on the value before it is semantically guaranteed to be read by the source program, so long as the speculative execution can deal with speculation being driven by uninit (e.g. by `freeze`ing it to an arbitrary-but-consistent noundef byte value).

So if there's no optimization benefit to eagerly checking for references pointing to uninit memory, the benefit is solely in diagnosing ill-formed programs. By eagerly checking, the existence of references-to-uninit can be diagnosed when they are created rather than when the uninitialized memory is read[^4], properly blaming the creator of the reference rather than the place just doing safe reads of a safe reference.

[^4]: It is for this reason that *if* one of the measures for making `&mut uninit` UB eagerly were to be taken, the author suggests putting the check on **place-to-ref** conversions.

However, this "victim blaming" is actually fundamental to how Miri works to diagnose *operational* UB. Miri does not (and *cannot*) understand your library's safety preconditions. The only thing Miri diagnoses is when the code violates the conditions of the Abstract Machine (exhibits UB), and will point at the point where the violation happened as the culprit, even if the bug in the code is instead a far-removed violation of a library's contract invoking "library UB[^5]". Miri only cares about and diagnoses whether a specific execution of a specific complete compilation graph encounters language UB, and using this to show soundness is left to careful application of tests.

[^5]: We say an execution's use of a library API exhibits "library UB" if it violates the documented preconditions of the library functions. If "library UB" is caused, the library has full "permission" to cause language-level UB at any later point. The program's behavior is still defined unless language UB is encountered.

### Why is this contentious?

It is the author's belief that `&mut uninit` is somewhat unique in the space of defined-but-unsafe Rust, in that this is a *safety invariant* on an otherwise very strict primitive type. Of the primitive types,

- `!` is always invalid.
- `()` is always valid.
- Simple numerics `iNN`, `uNN`, `fNN`, only have the trivial noundef validity invariant.
  - (*mumbles mumbles* provenance; IIUC the current plan is to strip on load and by-value-transmute?)
- `char`, `bool` only have validity invariants.
- `[_; N]` and `[_]` inherit all safety/validity invariants from the contained type.
- `*const _` and `*mut _` only have the trivial noundef validity invariant.
  - (*mumbles mumbles* provenance; IIUC the current plan is to angelically have provided the correct provenance if used?)
- `str` has officially been decided to have the validity invariant of `[u8]` and for valid UTF-8 to be a safety invariant.
  - However, this decision is still somewhat contentious,
  - and technically the decision was made for `&str` and not `str`-by-value.
  - The author posits that the reason this is contentious is that it means the `str` *primitive* has a nonempty safety invariant
    - and perhaps the resolution to this is to say `str` *isn't* (semantically) a primitive, it just *looks like* one, but is actually just `struct str([u8])`.
- `Box` is... special, but generally not considered as a primitive type outside of the compiler.

References thus are special among primitive types in that they

- have a nontrivial memory validity invariant,
  - noundef, nonnull, aligned
- have complicated borrow validity requirements, and
  - see: Stacked Borrows retags
- have a *safety* invariant that the pointee is both memory-valid and safe.

This is likely unavoidable, as [memory validity being shallow / not following references](https://github.com/rust-lang/unsafe-code-guidelines/issues/77#issuecomment-519997799) is itself a very desirable property, both for reasoning about unsafe code and for implementing the sanitizer.

But I think this can be resolved as solely a teaching problem. The answer to "can a reference point to uninitialized memory" should be "no\*, use `MaybeUninit` or a wrapper of it," where the asterisk is "`unsafe` can break the rules, but unless you break the rules, you don't have to worry about it." I think in many ways many people are too eager to be *correct* to remember when it's okay and even better to put forward a simplifying lie, and then refine later as necessary.

Having an operational model of "what you *can* do" with `unsafe` is important for being able to reason about unsafe code and to write complicated unsafe code. But for teaching unsafe, it's almost certainly better to stick to relaxing the safe rules for a significant period.

### TL;DR

It is the author's conclusion that:

- References to uninit primitives are clearly nonproblematic to allow in the opsem.
- Retags of `enum`s need to do a read if we want ["active variant" `MaybeUninit` tracking](https://github.com/rust-lang/unsafe-code-guidelines/issues/346#issuecomment-1175723345).
- Teaching proper "choose your weakenings" use of `unsafe` is difficult.
  - And likely currently exposes "`&uninit` is fine, actually" too soon.
- The model that better fits _developer reasoning_ is more "clean up after yourself", i.e. *not* https://github.com/rust-lang/unsafe-code-guidelines/issues/84
  - *Specifically* because this behaves more like normal safety reasoning than allowing safe types to be used in unsafe states.
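The point in the issue text about assignment versus `ptr::write` can be sketched as follows, using a type that has drop glue. The function name init_string is an assumption for the example.

```rust
use std::mem::MaybeUninit;

// Hypothetical helper: initialize a String inside uninitialized storage.
fn init_string() -> String {
    let mut slot = MaybeUninit::<String>::uninit();

    unsafe {
        // Writing `*slot.as_mut_ptr() = String::from("hi")` here would first
        // run String's drop glue on the old, uninitialized contents, which
        // reads garbage and is UB under any of the models discussed.
        //
        // A raw pointer write overwrites the place without dropping or
        // reading whatever was there before.
        slot.as_mut_ptr().write(String::from("hi"));

        // SAFETY: the slot was fully initialized by the write above.
        slot.assume_init()
    }
}
```

For a type without drop glue, such as u8, plain assignment through the raw pointer does no typed copy from the old contents, which is the case the issue text argues should stay defined.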

In reductive short, the reason to permit it is that there isn't any compelling reason to forbid it, and, all else being equal, less UB is always preferable.

It increasingly seems like we want a category of "erroneous behavior" for things like this: behavior which has a defined result, but which is still considered an unsafe programming error that sanitizing environments should be able to diagnose without the diagnosis being considered an incorrect false positive.

Exposed provenance is another notable candidate for "erroneous behavior," although probably the most contentious one since it's standard practice on mainstream targets (and doesn't have a practical stable alternative yet).
