Polkadot postmortem (kind of caused by Rust nightly change)

In what way, exactly? It was closed because it's not a bug, and that fact was backed up by professional arguments. Then brson did go ahead and comment "Wow, the hostility in this thread is stunning.", which is not really a reasonable point of view. There was no hostility, not even any heat, in the thread.

Moderation note: Let's please try to stick specifically to the postmortem here.

4 Likes

I really wanted to not talk anymore about it, but one redditor brought a really great point that binary search that does not commit to first/last element is of very limited use. It pretty much only good for existence check.

So I guess Add `lower_bound`, `upper_bound` and `equal_range` for slices where T: Ord to complement `binary_search` ¡ Issue #2184 ¡ rust-lang/rfcs ¡ GitHub is much more important now, because there's no generally useful way to find lower/upper bound in a sorted array in stdlib now? I mean... according to documentation there never was.

That's a weird take. "What do you do with a result that points to 1 of multiple equal elements in a sequence?"

Pop it from the array. Why are distinct elements comparing equal if you need to distinguish them? Why use binary search with a predicate that doesn't distinguish the things you care about?

7 Likes

It's good not only for removing, but inserting. If you insert an element corresponding to the value you searched for at the returned index (either Ok or Err), you maintain sorted order. You don't need the first or last element for that. (It's also why Err caries an index payload.)

If you're trying to do something where you do need the first or last element, you can scan left or right afterwards. Or alternatively, and arguably a better approach, your comparison should take insertion order or whatever else is distinguishing your elements into account so that they don't compare Equal if you can't treat them... equally.

Or to take a step back, if you need the first or last equal element, you're not really searching for equality: you're trying to partition your splice. Prior to the change, partition_point was stabilized; it can find that position without you scanning. Before partition_point, there wasn't an ergonomic method as far as I'm aware, but you could do it like so:

// say `compare` is what you would pass if you were searching for equality
let partition = |x| match compare(x) {
    Less | Equal => Less,
    Greater => Greater,
};

let index = slice.binary_search_by(partition).unwrap_or_else(|i| i);
// Either nothing equal is in the slice, but could be inserted at index
// Or the last equal thing is at `index-1` and inserting at `index` will
// make the inserted element the last one among equals

// Alternatively
let partition = |x| match compare(x) {
    Less => Less,
    Equal | Greater => Greater,
};

let index = slice.binary_search_by(partition).unwrap_or_else(|i| i);
// Either nothing equal is in the slice, but could be inserted at index
// Or the first equal thing is at `index` and inserting at `index` will
// make the inserted element the first one among equals

And in the context of the postmortem... Polkadot was not relying on the specific behavior. Their bug arose due to a different index being returned in different compiler versions, not on the last matching index no longer being returned.

9 Likes

Having read the postmortem and the reddit etc I can't for the life of me see how this is Rust issue.

When using binary_search_by where the search criteria can match many elements it's like throwing a bunch of red balls into a bucket, then a bunch of blue balls, then shaking it all up. On asking someone to pull a red ball out of the bucket one cannot complain that they got the wrong red ball. Nor that they may select a different red ball every time this procedure is done.

If the actual red ball is important then they need to be distinguishable some how. It's not the job of the bucket to do that.

Perhaps someone could enlighten me as to how Polkadot, or any program, can be expected to work when it is searching for a specific thing but only searches with a criteria that cannot distinguish the thing exactly?

Seems like looking up a friends phone number by their fist name only and then wondering why the wrong person answers the call.

The design seems suspect to me.

5 Likes

It would be incredibly irresponsible to develop software in a way that lets such an incident occur. If your software can cause physical harm to human beings, you must establish a suitable development process that lets you assess and control this risk, this is how pretty much all safety-critical software is developed (where safety means safety of operators, passengers, patients, etc., not memory safety).

In the Polkadot incident, the stakes were much lower – it "just" locked up 20 billion dollars worth of highly speculative investments, but if they had any interest in this not happening they would have established similar processes.

And, to reiterate this again, this is not the fault of any individual engineer, it's an organizational failure (systemic even, because there exists no accountability in tech) and the responsibility should lie with whoever runs the company building Polkadot.

13 Likes

We stabilized partition_point in 1.52:

let lower = xs.partition_point(|it| it <  needle);
let upper = xs.partition_point(|it| it <= needle);
8 Likes

I'll note that this is far from the only API in Rust that isn't guaranteed to be reproducible across compiler versions or platforms or ... For example, relying on needs_drop to be consistent is also disallowed by its documentation. And there are other changes like f32::powi on Windows returns different results between 1.44 and 1.45 beta ¡ Issue #73420 ¡ rust-lang/rust ¡ GitHub that are considered fine.

I would expect anyone who wants exactly-consistent results between compiler versions and platforms and such to have pinning and replay tests for that, no matter which programming language they're using. For example, Contraption Maker made a testcompat script to make sure that things behave the same across builds and platforms -- and that's an indie game, hardly something with massive resources or anything safety-critical.

I reject this parallel. UB is a completely different animal because it ruins the ability to reason about your code and the behaviour you're seeing. If something is UB, you get no usable information from its behaviour. Importantly, a unit test tells you nothing; whether it passed or failed cannot say anything about whether UB was encountered.

We can certainly talk about what the best APIs and semantics and such should be, but nothing going on here is anything even close to the level of danger of pervasive UB.

18 Likes

There's a significant difference here, though, that affects the practical considerations of human error and this thread talking about UB in C++ goes into depth about it, and thus makes the difference between whether what documentation says is sufficient to handle the error or nor.

The core difference is that UB has "spooky action at a difference"; this massively affects the debugging process.

When a program misbehaves on a certain set of inputs, one way to debug it is to check intermediate states - you're looking for a part of the program that gives the wrong outputs for a set of inputs. dbg! and (C) printf debugging are this sort of debugging.

With UB, I have an "exciting" time where I find a part of the program that gives the wrong output for its inputs; I isolate it, and it starts giving the right outputs, but when I run it in the context of the whole program again, it (or somewhere else) goes wrong. This is a pain to track down - it can affect any part of the program, not just the bit that invokes UB and triggers the problem, and is thus really problematic to handle, not least because automation (bisection etc) can't help find the bug (as it may depend both on old code that invokes UB and new code that is miscompiled because the old code relies on UB).

In contrast, with this class of problem, when I identify the part of the program that gives the wrong output for its input (which is not a trivial task), I can isolate it and it will continue to give the wrong output for its input. I can then dig into that part until I find the bit that doesn't behave as I expected.

Once I've found the problematic piece of code in either case, it's then a case of working out why it's buggy, and this is where documentation comes in. But it's the last thing that matters in an error situation, not the first - very few people write code that deliberately contains a subtle bug.

4 Likes

Maybe it would be helpful to the discussion to clarify exactly how Polkadot was invoking this behaviour. To quote:

As we use [binary_search_by] in the runtime, it can lead to a slightly different ordering of the data that is stored in the state, which leads to a different storage root.

So it seems that the candidate selection process required: first finding candidates that had the correct value by some measure (knowing that multiple candidates might fit this criterion), and then choosing one of the eligible candidates by an arbitrary but consistent process.

Reading between the lines, it seems that the “arbitrary but consistent” requirement may have been thought about at the final step, where it's time to select an element from a list. But at that point the problem was already baked in, because the algorithm that built the list in the first place had somehow made use of binary_search_by and invoked its undefined behaviour.

If that’s true, it's not hard to see how it might have escaped them that they were leaning on something that wasn't guaranteed. At the same time, it does seem like evidence of an algorithm that wasn't thought through with the rigour you'd expect for a system where any inconsistency is phenomenally expensive. It was also inherently dangerous to rely on the nightly build for this exquisitely runtime-sensitive code, no matter the exigency.

I do agree that this reveals some tradeoffs any API author has to make: you have to choose between retaining the ability to fix bugs and staying ‘bug-for-bug compatible’; you have to choose between allowing undefined behaviour to happen and explicitly crashing, potentially paying a performance cost for doing so. (Not to mention angering developers who may have found undefined behaviour acceptable—in this case, imagine the experience of using a binary search that crashes whenever there's more than one match!)

In this case, the undefined behaviour burned Polkadot because Rust tweaked its algorithm. In principle, it could have burned Polkadot because of a firmware change on the machines running their nodes, or because Node changed something about its memory allocation algorithm (if they had decided to rely also on the nightly build of Node, which would have been similarly unwise). We reasonably expect API implementors to try to avoid contingencies like that, even with expressly undefined behaviour, but at the end of the day that's exactly what the phrase ‘undefined behaviour’ communicates: ‘rely on this at your peril, because what prints foo today may not print foo tomorrow’.

I'm not sure how helpful this is, but I'll add what I think might help in the mission of protecting developers from themselves: when the behaviour in a case is not specified, write an alternate version of that section of code which behaves with the maximum level of caprice possible. (In this case, that would mean an implementation of binary_search_by that detects when there are multiple matchable candidates, and decides between them randomly.) Provide tooling to invoke this alternate version of the algorithm when running tests. That way at least if the developer was responsible enough to write tests that capture this case, they’d be alerted to the problem by the intermittent failures.

Obviously, such alternate code paths would add a lot to the cost of maintaining the code. But it's the only way I can think of to reduce the risk of undefined behaviour ‘gotchas’ worming their way into projects to emerge at the worst possible moment. I don't think there's any other reasonable remedy—apart from versioned binaries, which Rust is obviously already doing.

That's nondeterministic, not UB. The fact that's it's nondeterministic is why it can change freely.

8 Likes

By ‘today’, I mean ‘prior to a change of version/environment/architecture’ and by ‘tomorrow’ I mean ‘after a change of version/environment/architecture’. Maybe that was too poetic.

This raises a question though: I would have thought that non-determinism is one of the possibilities allowed for by the phrase “undefined behaviour”. My understanding is: determinism is among the many things that UB explicitly does not guarantee. Unless I explicitly say “the behaviour in this case is undefined, but deterministic”, the API consumer should not assume determinism in that case.

It might be, it probably is, but to state that it is deterministic is to define the behaviour—at least in that narrow regard. Is that wrong?

I guess it's by the by because Polkadot didn't assume deterministic behaviour here, just behaviour that's consistent across versions (which we know is incorrect to assume), but I am curious.

I think the term you want is unspecified behavior, which is well-behaved but subject to change. Whereas a program that exhibits undefined behavior is totally invalid.

12 Likes

Oh, I see. Thanks.

Tldr at the bottom.

This has been very interesting, and there are strong points on both sides. Having just read all of it, I would put myself somewhere in the middle.

When I saw the binary search documentation, I interpreted it to mean that they implemented it in such a way as to have better performance at the expense of not being able to provide a describable guarantee of which index would be returned when there are duplicates. The binary search algorithm has no need for randomness, so it would seem that it's a reasonable assumption that despite not having that guarantee, it is still a deterministic function. Based on that interpretation, I think it is completely reasonable for polkadot's engineers to think that they could use it the way they did.

However, this does not mean that the rust team is to blame. The documentation was unambiguous, and they didn't do anything that they said they wouldn't. The docs are not intentionally vague either, it just requires a little extra critical thinking to realize that assuming determinism between versions is not guaranteed.

Weighing these against each other, I believe the best course of action would be to clarify the docs to explicitly warn that the outcome may change from version to version. It could say something like "If there are duplicates, it is not guaranteed which one will be returned. It is deterministic, however exact behavior is subject to change." I acknowledge that this is more wordy than what was there before, but it likely would have prevented this mistake and it gives a better picture to the reader as to how the function can be used than what is currently there. Those who are angry about the situation and think the existing docs are fine as they are are correct, but I hope they can see how this could be an improvement.

The only blame I would assign in this situation is regarding the tweet sent out by polkadot, saying that it was the fault of the rust team for introducing a bug. That was completely untrue, and the subsequent aggression in response to issue 85773 is unsurprising given that context. I would not be in a great mood either after someone sent out a tweet blaming me for their mistake.

The tldr/conclusion is really just three things. First, this situation could have been prevented by the docs being written or read more carefully. Second, the onus in this situation is definitively on polkadot, although the docs could be made more clear. Last, tweeting false accusations is uncool and will obviously not get the people you're accusing on your side. Thanks for coming to my Ted talk.

8 Likes

No, there really aren't. There's no need for hostility, but we don't have to play "both sides" here to avoid hurting feelings. Polkadot dropped the ball on multiple levels: not just the initial oops, but in not having a code review process, not having unit or integration tests that would have caught the regression, running nightly despite all that, and failing to understand why any of those things are wrong.

Rust developers are guilty of... what exactly? Not exhaustively documenting every guarantee that isn't made? That's not a reasonable expectation. You can't anticipate all the possible ways in which somebody down the road may imagine a promise you didn't make. Sure, maybe in this case it wouldn't be a bad idea to add a clause to the documentation. But this was (probably) a one-off; if there's a next time, it'll come from an unexpected direction, just like it did this time (no Rust developer had reason to believe anyone was relying on the exact behavior of binary_search_by).

As @jschievink mentioned earlier, this incident does speak to an organizational problem in Polkadot, and it goes beyond that: accountability in software development is basically non-existent. You can see this every time a big data leak happens: "Some nasty hackers stole all our customer data from the last 4 years! Sorry! :woozy_face:" In a particularly serious case, like one I was involved in several years ago, the victims might get a period of complimentary identity theft protection by way of apology. But investigation usually stops at finding out who the hackers were - there's rarely any accountability for the people who design and deploy these systems.

I always wonder when major failures like this happen: Have the people involved taken a course in engineering ethics? Do they know about the risks of letting publicity concerns override known engineering issues? Do they care about performing even basic checks of the design before handing it off to implementation? Will they face criminal investigations or lawsuits because of their lapse of judgment? Probably not! Fortunately (as far as we know) nobody has died because of this incident. But "well, nobody actually died" is, I think, a pretty low bar to cross. I'd like it if we didn't have to actually kill innocent people before starting to take responsibility for bad engineering.

Increasing the degree of accountability in tech starts with correctly ascribing responsibility. Saying "well, there were failures on both sides" (when one side is clearly at fault) just reinforces the culture of non-accountability.

The thing is, I think a lot of technologists are actually OK with non-accountability. There is a lot of freedom in tech right now being as unregulated as it is, nothing like building hotels or space shuttles. And many people are rightfully wary of software development becoming a profession with testing, licensure, gatekeepers and so on. But I don't think this Wild West situation is sustainable in the long term. If we don't hold each other accountable, eventually, other people will.

17 Likes

Folks, as I said above, please stick to the postmortem. If this conversation continues to veer off course, then I'm going to lock it.

As always, if you have concerns about moderation, please contact the mods privately: rust-mods@rust-lang.org.

1 Like

Here's a quick overview

Implementation-defined behaviour: Behaviour that is allowed to vary from implementation to implementation as long as it's documented. (eg. GCC, Clang, and MSVC may return different results.)

Unspecified behaviour: Behaviour that a program is allowed to invoke, but which has no stability guarantees even within a single implementation.

Undefined behaviour: Behaviour that the compiler has promised its optimizers will never happen and, as such, they're allowed to rework the code based on that assumption.

Here is a list I made of a bunch of resources on undefined behaviour. Pay special attention to Why undefined behavior may call a never-called function.

(TL;DR: Because the spec says that it's undefined behaviour to call Do() before assigning something to it, and Do is static, and NeverCalled is the only thing that modifies it, the compiler builds an invocation of system("rm -rf /"); into the compiled output because the only explanation that's consistent with the specified rules is that some other compilation unit the compiler can't see will take care of calling NeverCalled() before calling main().)

That is why undefined behaviour is so insidious... because the compiler may rewrite anything else within the unit of optimization to prevent it from being invoked, and that the effects potentially get even more confusing when later optimization passes build on the original break between what you expected and what the spec allows to exist.

I haven't confirmed what Clang optimizer pass actually causes the behaviour in the NeverCalled example, but, more abstractly, NeverCalled gets called because of function inlining. It's the only possible way to achieve valid code, so inline it to avoid some unnecessary function call overhead.

1 Like

Too much thinking going on here for my mind.

The problem was some software looking up some data in some collection without a search criteria complete enough to narrow down the search to a particular, unique, element.

Why would anyone rely on such a search to return the same result every time, everywhere, forever?

Anyway, I feel it's not productive to look around and cast blame for this failure on the Rust team or the Polkadot team or whoever.

At the end of the day some software, taken as whole, created by many people, failed. Including Rust the Rust standard library the application and whatever.

The question then is, what can be done to help prevent such failure for others in the future?

For example LOUDER WARNINGS in the documentation might help. Or better practices of programmers.