Uninitialized memory and FFI (again)

I know that if you have a region or buffer of uninitialized memory, say, from malloc(), it's UB to read from that memory - you have to initialize it first by writing to it.

However, you can pass uninitialized memory to the read method of the Read trait, as long as the implementation guarantees that it only ever writes to the memory and returns the correct number of bytes written. That memory is then considered initialized.

My question is, how is it that it is then initialized? You pass a pointer and a length to libc::read(), how does the compiler "know" the kernel has written data into it?

Same thing for libc::mmap(), you call a libc function, get back a pointer and a length, you call slice::from_raw_parts and then that memory is considered initialized? How?

If it is because the pointer is passed to or gotten from an FFI function, how does the compiler know the length of the memory region?

And if it is FFI, how come memory acquired from libc::malloc() is considered uninitialized?

I hope someone can explain :slight_smile:

The compiler knows because we told it, by using unsafe code that would be unsound if the memory were uninitialized.

2 Likes

The compiler doesn't know that the memory is initialized, it assumes the memory is initialized. As with anything related to UB, the compiler doesn't need to know that a program doesn't invoke UB, it simply assumes that there is no UB and optimizes accordingly.

This is a library contract that the compiler knows nothing about. After a call to Read::read or libc::read, you will likely use the bytes in some way, that usage will almost certainly require that the bytes be initialized to avoid UB. So at that point the compiler is free to assume that the bytes are initialized.

You, the programmer, know that the data is initialized, and by acting on that knowledge you invoke operations that require the data to be initialized. This informs the compiler that the data is initialized.

You called slice::from_raw_parts which requires the data is initialized, by calling it you are telling Rust that the data is initialized. You told it the length when you provided it to slice::from_raw_parts. The compiler assumes you are right because you are calling an unsafe function.
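A minimal sketch of what "telling the compiler" looks like, using only std. The helper name `fill_prefix` is made up for illustration; it stands in for whatever really writes the bytes (the kernel, an FFI call). Nothing checks the claim made in the unsafe block. It is only sound because the loop really did write those bytes first.

```rust
use std::mem::MaybeUninit;
use std::slice;

/// Hypothetical helper: writes `value` into the first `n` slots of `buf`
/// and returns that prefix as an ordinary byte slice.
fn fill_prefix(buf: &mut [MaybeUninit<u8>], n: usize, value: u8) -> &[u8] {
    let n = n.min(buf.len());
    for slot in &mut buf[..n] {
        slot.write(value);
    }
    // SAFETY: the loop above initialized exactly the first `n` bytes, and
    // `n <= buf.len()`, so the pointer/length pair covers initialized memory.
    // The compiler does not verify this; it trusts the claim.
    unsafe { slice::from_raw_parts(buf.as_ptr() as *const u8, n) }
}

fn main() {
    let mut buf = [MaybeUninit::<u8>::uninit(); 8];
    let filled = fill_prefix(&mut buf, 3, 0xAB);
    println!("{:?}", filled);
}
```

If the length passed to from_raw_parts were larger than what was actually written, the code would still compile, and it would be UB.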

This one is trickier. It comes down to the fact that malloc is usually intrinsically tied to the compiler providing it. So here the compiler actually does know that malloc returns uninitialized data.

6 Likes

Thanks for the reply!

So, in theory, I could get some uninitialized memory using Vec::with_capacity(4096), then use unsafe { set_len(4096) }. Then get the pointer and length of that. Next, with slice::from_raw_parts, get a memory range that I still have not initialized but which the compiler assumes is initialized (be it with random data).
And I would have a slice which has random data, but is considered initialized by the compiler, so it's completely safe to read/write from? Good, so.
Why don't we use that for getting a &mut [u8] to Read::read into?

I know I must be missing something obvious, or it would already have been done this way. Please enlighten me. :slight_smile:

No. Uninitialized memory doesn't contain random bytes. It contains... uninitialized data. Optimizing compilers like LLVM model a byte internally as something like an enum with more than 256 variants: one for an uninitialized value, one for a poisoned value, one for initialized-but-unknown, and so on. Reading from uninitialized bytes is UB, which means you're violating the language specification, much like a syntax error does. You shouldn't expect a meaningful result from code with a syntax error, and likewise you shouldn't expect a meaningful result from code that triggers UB. The only difference is that UB is assumed to be absent rather than eagerly checked, so the compiler produces a garbage binary instead of a compile error.
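To make the "enum with more than 256 variants" idea concrete, here is a toy model in Rust. It is only an illustration, not LLVM's actual representation: the names `AbstractByte` and `read_u8` are invented, and poison is left out. In real Rust the `Uninit` case is UB; the model reports it as an error so it can be observed.

```rust
/// Toy model of one byte as the abstract machine sees it.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AbstractByte {
    /// Never written: there is no value here at all, not even a random one.
    Uninit,
    /// Written with a concrete value (256 possibilities).
    Init(u8),
}

/// Models a typed read of a `u8` from one abstract byte.
fn read_u8(byte: AbstractByte) -> Result<u8, &'static str> {
    match byte {
        AbstractByte::Init(v) => Ok(v),
        AbstractByte::Uninit => Err("undefined behavior: read of uninitialized byte"),
    }
}

fn main() {
    let mut memory = [AbstractByte::Uninit; 4]; // like a fresh allocation
    memory[0] = AbstractByte::Init(42); // like a write
    println!("{:?}", read_u8(memory[0]));
    println!("{:?}", read_u8(memory[1]));
}
```

The hardware only has the 256 `Init` values, which is why looking at the machine can't tell you whether a program is correct.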

8 Likes

Uninitialized != random data

If you want to learn more about uninitialized memory, read this amazing article by @RalfJung

https://www.ralfj.de/blog/2019/07/14/uninit.html

3 Likes

Thanks for your reply! Still, why, or how, does libc::read guarantee in any way to the compiler that the memory referred to by the pointer that's passed to it, let alone the size of the allocation, is considered 'initialized' by the compiler after a call to libc::read?

Welcome to the C world, where safety is a matter of documentation, convention, and not making mistakes. Nothing is guaranteed automatically by the signature of libc::read. You read the documentation, which says that libc::read can take an allocated but possibly uninitialized region of memory and that, if the return value n is positive, the first n bytes of the passed buffer have been initialized. Then you use the filled portion of the buffer, assuming it's initialized. If the libc implementation doesn't live up to its documentation, everything gets messed up and you get a garbage binary. If you misread the documentation and use the unfilled portion of the buffer, everything gets messed up and you get a garbage binary. It's you who guarantees these things manually.

And that's why we should use safe Rust as much as possible. In safe Rust you're protected; outside it, you pay for your own mistakes.
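A sketch of that contract in code. Since libc isn't part of std, `mock_read` below is a hypothetical stand-in written in Rust that follows the same documented rule: it only writes to the destination and returns how many bytes it wrote. `read_reported` is likewise an invented name for the caller's side.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for `libc::read`: copies up to `len` bytes from
/// `src` into `dst` and returns how many it wrote. It never reads `dst`.
///
/// # Safety
/// `dst` must be valid for writes of `len` bytes.
unsafe fn mock_read(src: &[u8], dst: *mut u8, len: usize) -> isize {
    let n = src.len().min(len);
    // SAFETY: the caller guarantees `dst` is valid for `len >= n` bytes.
    unsafe { std::ptr::copy_nonoverlapping(src.as_ptr(), dst, n) };
    n as isize
}

/// Reads into an uninitialized stack buffer and returns only the bytes
/// that the call reported as written.
fn read_reported(src: &[u8]) -> Option<Vec<u8>> {
    let mut buf = [MaybeUninit::<u8>::uninit(); 16];
    // SAFETY: `buf` is a valid 16-byte allocation.
    let n = unsafe { mock_read(src, buf.as_mut_ptr() as *mut u8, buf.len()) };
    if n < 0 {
        // A real read() reports errors as -1; nothing can be assumed written.
        return None;
    }
    let n = n as usize;
    // SAFETY: by the documented contract the first `n` bytes are initialized.
    // `buf[n..]` is still uninitialized and must not be read.
    let filled = unsafe { std::slice::from_raw_parts(buf.as_ptr() as *const u8, n) };
    Some(filled.to_vec())
}

fn main() {
    println!("{:?}", read_reported(b"hello"));
}
```

The only thing connecting the return value to "these bytes are initialized" is the documentation and the SAFETY comment that relies on it.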

4 Likes

To pass the allocated memory to read() you need to turn it to a &mut [u8]. The moment you use unsafe code to turn it into a mutable reference you are promising the compiler that the memory actually was initialized... That means you can't just pass an uninitialized buffer to read() without (for example) zeroing it out.
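For comparison, a small sketch of the safe route: zero the buffer first, then hand it to read(). The function name `read_some` is just for illustration, and a byte slice is used as the reader so the example runs without any files.

```rust
use std::io::{self, Read};

/// Reads at most `cap` bytes from `reader` into a zero-initialized buffer.
fn read_some<R: Read>(mut reader: R, cap: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; cap]; // every byte is initialized up front
    let n = reader.read(&mut buf)?; // a single read call; it may be short
    buf.truncate(n); // keep only what was actually read
    Ok(buf)
}

fn main() -> io::Result<()> {
    let data = read_some(&b"hello world"[..], 5)?;
    println!("{:?}", data);
    Ok(())
}
```

The cost is the up-front zeroing, which is exactly the overhead people try to avoid with uninitialized buffers.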

It is considered initialized because you told it the data was initialized by using an unsafe block and calling the function.

The std::slice::from_raw_parts() function states that it must only ever be called with initialized data, and that invariant must be upheld by the caller.

No. Vec::with_capacity() followed by Vec::set_len() without actually initializing the buffer is unsound. The # Safety section under Vec::set_len() explicitly states:

Safety

  • new_len must be less than or equal to capacity() .
  • The elements at old_len..new_len must be initialized.

(emphasis mine)

Not initializing the data breaks Vec's invariants and will trigger UB the next time that uninitialized part of the buffer is accessed.
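A sketch of a use of set_len() that does uphold both conditions: write into the spare capacity first, then publish the new length. `extend_via_spare` is a made-up helper name; `Vec::spare_capacity_mut` is the std method that exposes the unused capacity as `MaybeUninit` slots.

```rust
/// Appends `n` copies of `value` by writing into the vector's spare
/// capacity and then publishing the new length with `set_len`.
fn extend_via_spare(v: &mut Vec<u8>, n: usize, value: u8) {
    v.reserve(n); // guarantees capacity >= len + n
    let old_len = v.len();
    let spare = v.spare_capacity_mut(); // &mut [MaybeUninit<u8>]
    for slot in &mut spare[..n] {
        slot.write(value);
    }
    // SAFETY: `old_len + n <= capacity` thanks to `reserve`, and the
    // elements at `old_len..old_len + n` were initialized by the loop.
    unsafe { v.set_len(old_len + n) };
}

fn main() {
    let mut v = vec![1u8, 2];
    extend_via_spare(&mut v, 3, 9);
    println!("{:?}", v);
}
```

Dropping the loop while keeping the set_len() call would compile just the same, which is the unsound version discussed above.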

2 Likes

You can read more on the problems with uninitialized data and the Read trait in RFC 2930. The RFC proposes a solution and the RFC was accepted (see the PR and issue links at the top), so at some point there will be a safe way to use Read with buffers that aren't initialized upfront.

1 Like

No, it's not completely safe. When I use the word "assume" in the context of unsafe code, it does not mean "change the world such that it becomes initialized"; instead I mean "we assume it's true, and if we are wrong, the world crashes and burns".

As an analogy, imagine if you're driving a car, and there's a tree in front of the car. If you assume that there is no tree in front of the car, then the price for your invalid assumption is that you crash into a tree. It's not like the assumption makes the tree disappear.

In the case of unsafe code, the price for invalid assumptions is undefined behavior, which means the compiler is allowed to compile your program into literally anything.

11 Likes

Ah yes. I think I should have been more specific. I understand that many bit patterns simply are not valid for many types: bool, pointers, enums, etc. In that case, if you simply say "that memory is now initialized", the world might crash and burn as soon as you look at it.

But in the case of &mut [u8], a simple range of bytes, which you often use with I/O, all bit patterns are valid.

I'm more interested in how the optimizer sees uninitialized memory, and Undefined Behaviour. I read the excellent blog post by ralfj: "What The Hardware Does" is not What Your Program Does: Uninitialized Memory.

Based on that post, I tried a couple of things in Godbolt. It does appear that using FFI makes the optimizer think "hey, I cannot treat this memory as uninitialized anymore, it contains something valid". And so does black_box() (although it has a big warning in the docs). Even calling an unsafe function that writes to the first byte is enough.

So I'm curious as to what the rules are. Because it appears we can get a chunk of uninitialized bytes (say with let mut v = Vec::with_capacity(4096); unsafe { v.set_len(4096); }) and then use one of the methods above to get initialized memory, containing random bytes.

Now I just found this issue on ptr::freeze which is exactly what I am talking about, only the people posting there are way smarter than me :). It appears that yes, you can do this, and in fact they were about to - but there is another problem - uninitialized memory can actually change "under your feet" if you read from it before writing it, and that is an OS issue. It happens when the allocator uses MADV_FREE, which jemalloc does for example. I'm not sure if this can be solved. Anyway, I learned a lot by chasing this.

Just because all bit-patterns are valid does not mean that it being uninitialized would be valid.

If the code has UB, it may do anything. This includes behaving like you expect it to. :wink:

To see an example of this breaking something, there is a great example in the ReadBuf RFC.

3 Likes

I encountered ReadBuf already when I ported a project to Tokio 1.0. The RFC is indeed interesting, and I see that it also used Godbolt for the UB example :). Now,

Just because all bit-patterns are valid does not mean that it being uninitialized would be valid.

Well, if you can convince the LLVM optimizer that it no longer contains undef or poison data, it is valid, I understand now. That is explained in the ptr::freeze issue. However, the MADV_FREE thing throws a spanner in the works.

For now, in my project, I am experimenting with keeping a pool of Vecs around that have already been initialized (using Vec::resize(size, 0)). It's probably a premature optimization, but it's fun to work these things out :slight_smile:
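A rough sketch of what such a pool could look like, under the assumption that all buffers share one fixed size. The type `BufferPool` and its `get`/`put` methods are invented for illustration and are not taken from any particular project.

```rust
/// Hypothetical pool of zero-initialized buffers of one fixed size.
struct BufferPool {
    size: usize,
    free: Vec<Vec<u8>>,
}

impl BufferPool {
    fn new(size: usize) -> Self {
        BufferPool { size, free: Vec::new() }
    }

    /// Hands out a buffer of exactly `size` initialized bytes, reusing a
    /// returned one when available.
    fn get(&mut self) -> Vec<u8> {
        match self.free.pop() {
            Some(buf) => buf,
            None => vec![0u8; self.size],
        }
    }

    /// Returns a buffer to the pool. Its contents are stale but still
    /// initialized, so the next user can pass it to `Read::read` as-is.
    fn put(&mut self, mut buf: Vec<u8>) {
        buf.resize(self.size, 0); // restore the length if it was changed
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new(4096);
    let mut buf = pool.get();
    buf[0] = 1;
    pool.put(buf);
    let again = pool.get();
    println!("{} bytes, first = {}", again.len(), again[0]);
}
```

The zeroing cost is paid once per buffer instead of once per read. One thing to keep in mind is that reused buffers carry old data, so callers must still only look at the bytes the latest read reported.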

UB is about the language specification, not about compiler implementation details. It may be true that the LLVM IR for your code doesn't contain any undef or poison data, so it happens to work, but the language doesn't guarantee that, which means even slightly different code or a different compiler version could break your expectations. If you want to rely on subtle implementation details of rustc, you should check the generated LLVM IR whenever the code or the compiler version changes.

3 Likes

Or pin your code to a fixed, and thus increasingly outdated, version of rustc. Even if today's LLVM backend optimizer appears not to scramble your program due to UB, there is absolutely no guarantee that the next LLVM point release won't do so, let alone a new major release a year from now. Likewise for rustc, whose releases occur every six weeks.

1 Like

For that matter, I wouldn't be surprised if the exact same versions of LLVM and rustc started miscompiling UB-ridden programs. Some compiler optimizations can be formulated using randomized, heuristic algorithms, so why not?
