Should/can one trust invariants in serialized data?

This is a very general question for which I've heard widely diverging answers, so I'd like to hear other people's opinions and try to debias myself.

Context: you have a serialization system and you have serialized data which satisfies invariants; for example, strings are serialized as sequences of bytes, and they are UTF-8 because that's the invariant that was true when you serialized the string.

Now you access this data in a direct way: it can be zero-copy deserialization, memory-mapping, some kind of superfast deserialization—the point is that you transfer those bytes directly into memory to create a string.

In this scenario, is it acceptable to use something like str::from_utf8_unchecked in a method that is not unsafe itself? The logic here is that the invariant was valid upon serialization, and thus we can rely on the same invariant to be true upon deserialization.
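For concreteness, here is a minimal sketch of the pattern in question (the function name and buffer handling are hypothetical):

    /// Hypothetical zero-copy accessor: interprets a slice of a
    /// memory-mapped buffer directly as a string, skipping validation.
    fn read_str(bytes: &[u8]) -> &str {
        // SAFETY(claimed): the bytes were valid UTF-8 when serialized.
        // Whether this claim is acceptable is exactly what I'm asking.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }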

The main objection to this idea is that the file might have been tampered with or might come from malicious sources. My personal viewpoint is that security is a system property that emerges from a number of processes. If you have bad sources, change your sources. If your files are tampered with, check your system. But defensively treating every kind of external data at the language level attacks the problem in the wrong place. Rust is a memory-safe language, not a secure language (as TARmageddon recently showed). And obsessively checking data for errors has a significant performance cost.

I also think in general that the purpose of Rust is not to play defense against malicious actors—the purpose is to help you write efficient, correct software. I see it more as a safety net than as armor. And I cannot keep myself from thinking that this is somehow the mental model of the compiler team, or else I cannot explain why the bug at the basis of cve-rs has been open for ten years. It's not that they don't care, but that it doesn't make sense to invest resources in fixing an obscure problem that will never happen in real code. But, then again, here I'm trying to read people's minds, which is rarely successful.

Any kind of comment really appreciated.

You must definitely check the data if there is any use case where the data may come from an untrusted source.

Imagine someone builds an image viewer that doesn't check for invalid inputs. Downloading and viewing images from untrusted sources is a perfectly normal use case, and checking image files for validity (in another program) before opening them would be ridiculous.

The one use case where you could argue that you don't have to check is when it would be impossible to modify the file without also being able to modify your program directly. And that is extremely narrow.

1 Like

But it is physically impossible to write Rust code that uses only data from trusted sources; that notion is not part of the language. "If there is any use case" effectively means "always", so it doesn't help much with the problem.

Unless you accept the documentation as a guarantee: that is, it is properly documented that the specific structure must not be used with data from untrusted sources.

If such a statement appeared in the documentation (and, of course, it cannot be normative), would you consider it OK to call the unchecked method in a safe method?

No, you cannot trust the serialized data itself, since in general it cannot encode the respective invariants. Generally you'd need an additional fallible parsing step after the raw data's deserialization in order to regain the invariants.
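In Rust terms, that fallible step is just the checked conversion, roughly:

    // The checked counterpart: a fallible parsing step that
    // re-establishes the UTF-8 invariant at deserialization time.
    fn read_str_checked(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
        std::str::from_utf8(bytes)
    }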

See also: Robustness principle - Wikipedia

2 Likes

First of all, you don't need file tampering or untrusted sources to get back a broken file. It happened to me once: the application was supposed to serialize floating-point numbers plus some metadata, and one file (out of hundreds) had 1 bit flipped, resulting in invalid metadata and a segfault in the C code reading back the data. It was not that important because we were just converting output from a mass spectrometer's proprietary data format to a human-readable one, and even a segfault wouldn't do a lot of damage (it was just annoying).

So, my rule of thumb is: if nothing in the program touches security (even remotely, like, for example, using the data to generate file names), you can skip validating the input (and accept that your invariants might not hold); otherwise, always validate it.

4 Likes

Good point, but I should add this: the main scenario is one in which the users themselves generated the serialized data. So it's not a "from others" situation. I have the original non-binary data, I pack it into a nice binary form, and I want to use it efficiently, as I built the structure myself. Validation is not viable there because I have in mind very large data structures (hundreds of GB) that are memory-mapped, where the advantage is that you only read from disk the parts you're actually using.

Of course there's no way to enforce at the language level that the serialized data has been generated by yourself.

You verify it. Period. Or at least provide a way to turn the verification on.

The problem with not verifying is that errors may propagate into your program in ways that are very hard to predict, and when your program finally crashes, it's very hard to trace the damage back to one flipped bit somewhere in a multi-gigabyte file.

And if files are hundreds of gigabytes in size and kept for a long time (no one produces hundred-gigabyte files just to throw them away), then the chance of encountering corruption somewhere approaches one.

If verification is too costly to do every time, then you need a mode that does verification, or you need to add checksums and error recovery to your format.

It's not about the “religion of Rust” or “how you use unsafe” but about simple practicalities: commodity hardware produces bitflips relatively often; a few bitflips per day are typical. Most are harmless, but if you process hundreds of gigabytes of data, then sooner or later you will need to deal with damage. Ergo: either checksums or validation.
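As an illustration of the checksum route, here is a minimal sketch using the crc32fast crate (the block layout is made up for the example):

    use crc32fast::Hasher;

    // Verify a block of serialized data against a stored CRC32 checksum
    // before handing its bytes to any _unchecked deserialization path.
    fn verify_block(payload: &[u8], stored_crc: u32) -> Result<(), String> {
        let mut hasher = Hasher::new();
        hasher.update(payload);
        if hasher.finalize() == stored_crc {
            Ok(())
        } else {
            Err("checksum mismatch: block is corrupt".to_string())
        }
    }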

4 Likes

Yes, that's what MD5 sums are for. You do this externally—you don't offload it onto Rust.

Then you need to document it and that should probably be enough.

P.S. Also: please use SHA1, not MD5, if you have a choice. All modern CPUs support SHA1 hardware acceleration and CRC32 hardware acceleration; nobody supports MD5 hardware acceleration. Thus it's better to use either CRC32 (very fast but weak) or SHA1 (slower, but more robust), not MD5 (slowest and least robust, so pure disadvantages).

6 Likes

If I can run your program in such a way it executes UB, you are the one who has written a CVE worthy bug.

It's trivial to feed an arbitrary file to a program.

Thus your program must not be able to execute UB regardless of the files fed to it (unless you're cool with introducing CVE worthy bugs and being the topic of TARmageddon-like articles, I suppose).

5 Likes

But it must at least be possible for user-generated data. Otherwise memory-mapping of immutable large data structures becomes impossible.

User-generated on-disk state is as safe as in-memory state IMHO.

Incidentally, TARmageddon is due to a logical bug—it has nothing to do with UB.

The core problem to me is that the reason UB is so scary is because it means you can't trust what you see.

If your argument here was "the bad data will just hit a reliable arch::abort()", then I'd be 100% in agreement with you, because if you're messing with serialized data and you get a guaranteed crash then I agree that's totally fine -- the user just shouldn't do that.

But if we could guarantee that all UB would do was cause a crash, it wouldn't be UB -- it'd be defined behaviour that it crashes.

UB is scary because you might get telemetry for things that never actually happened, because once UB "happens" then the impossible has already happened.

(Remember that from_utf8_unchecked really can lead to UB: code that later consumes the str can and does assume that invalid bytes are impossible, so it may do things like out-of-bounds reads into jump tables, jumping into the middle of unknown code depending on whatever memory happened to be there, and then continuing from a place that was never intended to be a jump target.)

If you want to make a system-level argument, then I do think that's possible. Like if you're storing something serialized in a "not accessible to the user" folder in Android, say, then I'd accept the "well if they went in as root and mucked with it then they deserve it", same as "attached a debugger and mucked with it" and the like. But if it's just a normal file saved on, say, a normal Windows user's "My Documents" folder, then no, I don't think you have a system-level soundness argument.

Similarly if you ship a serialized file in your app that you install and permission similar to the executable, then I think you could make a system-level argument that they shouldn't be messing with that file same as they shouldn't be messing with the binary itself. But that's a very small subset of "anything serialized".


If you're just thinking about str-like things, though, consider using the bstr crate instead. That way you can have text that is "conventionally UTF-8, but still defined behaviour even if it's not", and then (so long as you can guarantee it's initialized) you don't have the UB possibility even if it came from an untrusted place.
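A small sketch of that approach, using bstr's ByteSlice extension trait:

    use bstr::ByteSlice;

    fn main() {
        // Conventionally UTF-8, but any byte sequence is defined behaviour.
        let raw: &[u8] = b"abc \xFF def"; // \xFF is not valid UTF-8
        // Lossy iteration yields U+FFFD for invalid bytes instead of UB.
        for ch in raw.chars() {
            print!("{ch}");
        }
        println!();
        // Display also works lossily, no validation step required.
        println!("{}", raw.as_bstr());
    }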

6 Likes

No, the strings are an easy example, but we map all kinds of data structures into memory. Having programmed in C for 40 years, I'm painfully aware of the dark places UB can lead you to.

In my situation there's no binary shipment of anything—the users have the sources, build their binaries, and use those binaries to build the binary data that will be mapped into memory. No binary data, executable or not, is used directly (i.e., other than as a file with a specific format) or delivered over the network. I see this process as a fairly controlled extension of the program state.

No, you shouldn't trust external inputs with language-level invariants.

You can mess up application-level logic with bad data if you want, but Rust's invariants must never be broken. Even if it's 100% user's fault, it still must not be allowed to ever happen. Even if the input file promises something, you can't let it break that promise, so you have to verify anyway (e.g. you still have to verify UTF-8 encoding of every string in a file format that only allows UTF-8 strings).

The _unchecked functions are for removing redundant checks if you already have a guarantee the data is valid (e.g. if you've decoded a string from a packed 6-bit per char encoding, this guarantees it will be valid UTF-8 subset).
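For instance, a sketch of that kind of justified use (the table and function are hypothetical):

    // Hypothetical 6-bit packed text: every index selects from an ASCII
    // table, so the decoded bytes are valid UTF-8 by construction.
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    fn decode_6bit(indices: &[u8]) -> String {
        let bytes: Vec<u8> = indices
            .iter()
            .map(|&i| TABLE[(i & 0x3F) as usize]) // mask keeps the index in range
            .collect();
        // SAFETY: every byte comes from TABLE, which is pure ASCII, and
        // ASCII is a subset of UTF-8, so the check would be redundant.
        unsafe { String::from_utf8_unchecked(bytes) }
    }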

7 Likes

Personally, I'd go with this:

If the invariant being violated leads to an application-level fault - a Result::Err, a panic, or even to the process aborting or entering an infinite loop - then the question of how to handle it is a matter of design preferences and your users' tolerance.

If, on the other hand, the invariant being violated would lead to undefined behaviour, then you must enforce the invariant at all times. Data known to have been good when it was written may not be good when it's read back - your program can't reasonably guarantee that the file as read back is the same file it wrote, hasn't been edited (maliciously or merely creatively), hasn't been corrupted by storage issues, hasn't been inadvertently replaced with some other file, and so on, no matter how careful you are.

Rust's convention of only allowing UB inside of unsafe items helps immensely with making the latter tractable. If your program never calls any unsafe items, and if you hew to this convention yourself, then you're left with the first case, and you have a design question more than a correctness one. If your code does call unsafe items, then you should be checking (at least with assert! or debug_assert!, if not with something more nuanced) that the unsafe item's invariants hold, or writing a comment explaining why they must hold no matter what happens in the preceding code.
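For example, a guarded version of the zero-copy accessor from earlier in the thread, using debug_assert! as suggested (names are again hypothetical):

    // Debug builds pay for full validation; release builds keep the
    // zero-copy fast path while still documenting the invariant.
    fn read_str_guarded(bytes: &[u8]) -> &str {
        debug_assert!(
            std::str::from_utf8(bytes).is_ok(),
            "serialized string is not valid UTF-8: file is corrupt or hostile"
        );
        // SAFETY: relies on the serializer having written valid UTF-8;
        // see the debug_assert! above and the surrounding discussion.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }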

To your example: check that the data you read actually is UTF-8, rather than assuming. It's a cheap check, and it prevents UB from calling from_utf8_unchecked with non-UTF-8 data. This isn't specifically about security - it's about being able to have confidence that your code does anything like what you intended, which I see as more fundamental than that.

5 Likes

So everybody seems to be saying that you can't trust what is in the file. But doesn't that mean that the memmap2 crate is unusable since the file might get modified outside your control, which would immediately cause UB?

The docs say:

All file-backed memory map constructors are marked unsafe because of the potential for Undefined Behavior (UB) using the map if the underlying file is subsequently modified, in or out of process.
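For reference, this is roughly what that unsafe contract looks like at the call site (a minimal memmap2 sketch; the file name is made up):

    use memmap2::Mmap;
    use std::fs::File;

    fn main() -> std::io::Result<()> {
        let file = File::open("data.bin")?; // hypothetical data file
        // SAFETY: we must guarantee the file is not modified for the
        // lifetime of the map; this is the obligation the docs describe,
        // and the language cannot enforce it for us.
        let map = unsafe { Mmap::map(&file)? };
        println!("mapped {} bytes", map.len());
        Ok(())
    }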

1 Like

That's the current state of affairs. It's largely unstudied what the complete set of proofs necessary for correctly using memory-mapped files would be, but it likely starts with using a dedicated OS user identity for the service.

We use mmap-rs, but it's the same—that's Rust's elephant in the room.

You really can't do anything big server-side without memory mapping, but memory mapping inherently breaks Rust's mental model, since part of your process state becomes modifiable externally to the process itself.

It's of course a general problem (it happens in C++ or Java too), but it's something that people just do.

An interesting piece of info: stdx contains a memory-mapping crate. If you don't know the guy, look him up, as there probably wouldn't be Rust as we know it without him. And yes, he realizes you need memory mapping even if it doesn't fit Rust's memory-safety model.

1 Like

Note that we are talking about a file that includes built-in checksums that are verified at some point. I would put it in a bucket similar to “AV software injected into your process that corrupts your heap”: do such things exist? Yes. Do we want to cater to them by adding checks for invariants that, by construction, shouldn't be violated? No.

Memory corruption exists. File corruption exists, too. But if we are dealing with a file that's protected by construction and by checksums, then tampering with it belongs in the “malicious intent” bucket, similar to working with /proc/$PID/mem: something possible, but not worth protecting against.

2 Likes