Polkadot postmortem (kind of caused by Rust nightly change)

I have no association with Polkadot, but I think it's relevant so Rust community might find it interesting.

TL;DR:

Rust standard library function implementation detail was changed (OK w.r.t. the documentation which stated that the exact result is not defined), which caused a disagreement in consensus rules between versions of Polkadot software (built using different compiler version).

While it seems to me that technically it's a fault of Polkadot developers, in practice its not great that we have an API that has a pitfall like this. :man_shrugging:

It's not really a pitfall, though. The problematic case is explicitly documented, and the commit message for the change in question even elaborated on whether it is acceptable.

When programming, there are all sorts of legitimate edge cases where a programmer needs to consider e.g. existence and unicity, or ordering, for that matter. This is a very common scenario, and to be honest, I cannot really blame the Rust standard library in this case. When possibly equal values are used in a way that their identity or order matters, e.g. hashing, one's first thought should usually be "how do I guarantee consistent results even in the presence of ambiguity?". The developers of Polkadot missed this detail this time.

8 Likes

Let me just start by saying that I understand that point of view, and I find it completely reasonable argument. But I don't think it's the most important PoV here.

The way I see it - Rust community generally tries really hard to provide the best possible APIs. And all things equal, having an API that is more deterministic and predictable is just better than having an API that is less deterministic. Rust did have an implementation that was effectively both fast and strictly defined and deterministic.

Saying that "something was documented, and it's user fault for not being aware of it" is technically absolutely correct but is ... for a lack of a better way to describe it ... just very C++-ish thing to say. :smiley:

The reality is that developers make mistakes like these all the time, and part that people love about Rust is that it helps them as much as technically possible to do the right thing.

We can't just look at being technically correct here. We used to have an implementation that behaved in certain predictable way. Had it been completely not deterministic (like HashMaps are for important security reasons), developers would pretty quickly discover it in the early phase and fix their code. Changing it between rust versions is ... just not great, especially for marginal perf gain.

And even if it wasn't for changing the current behavior to a new one: the predictable behavior, is just better from a perspective of reliability, determinism, defensive programming. Even if we weren't changing the behavior, but writing this API from scratch I would vote for making an API that is more predictable.

Another way I think about it is with compassion: Could I have made a mistake like that? Yes I could. Could Rust have saved my distracted and ignorant butt? Yes it could. So why wouldn't it?

8 Likes

Predictable seems subjective,

And the function is deterministic afaict

I'm curious what exactly the "fix" is? Commit so hard to 'stability' that even undocumented behavior and implementation details must be preserved? That even behaviour that is documented as implementation defined has to be preserved?

I guess "implementation defined" is the correct name here which I should have been using.

Keep the current implementation, change the documentation to specify the current behavior as the guaranteed one, add a new function binary_search_by_<something> that is explicitly opting into faster, but potentially implementation defined behavior. I'm not exactly sure what <something> would be, but in the same module sort has a sort_unstable which is very similar solution - most non-surpring default shortest name, and longest name with a suffix for auto-complete-driven education purposes, allowing explicit (hopefully) more educated opt-in. Maybe even the same _unstable word would work here.

https://www.hyrumslaw.com/ :man_shrugging:

1 Like

sort_unstable refers to a type of sorting, not to some "unstability" in the API.

2 Likes

I don't think bikesheading the exact name is very important yet, but sorting "stability" property is defining its handling of duplicates, and here we are talking about how binary search is handling duplicates. I did a quick google search and indeed it looks like I might not be stretching it and "stable" vs "unstable" binary search is a thing (I'm kind of tired and about to go to sleep, so I might be just confused)

Note that even all the way back in 1.0.0, this freedom was called out explicitly in the documentation and had an example: https://doc.rust-lang.org/1.0.0/std/primitive.slice.html#method.binary_search_by

I'd be much more concerned if this was a case where the clarification was added recently, or something. But I just don't feel that bad about this one.

I don't think that's a good choice here, because the actual behaviour isn't something reasonable to specify as anything other than the code -- it depends on the splitting details. (In both the current one and the previous one. Neither has a nice way to specify what it does.)

I'd be all in favour of getting things like C++'s equal_range/lower_bound/upper_bound, though, which are predictable. (Add `lower_bound`, `upper_bound` and `equal_range` for slices where T: Ord to complement `binary_search` · Issue #2184 · rust-lang/rfcs · GitHub .) One could then talk about possibly deprecating binary_search if it turned out that most code moved over because it thought the binary_search wasn't the desired API.

Well, assuming a reasonable ordering and equivalence definition. This is the kind of thing that's always possible to get wrong already -- nothing stops someone from making an Ord that always returns Less, or which doesn't match Eq (a common problem if deriving one but not the other), or any number of things that are also in the "you need to read the documentation and get it right" category.

6 Likes

Here's the PR so others don't have to search for it. (And if you don't want to click either: it was discussed and cratered; partition_point was also stabilized.)

"[T]he implementation has become the interface, and any changes to it will violate consumer expectations" -- yep, that happened to Polkadot here (indirectly too; if I read that report right, they relied on the method being deterministic across rustc versions (stable and nightly), not on returning the last element specifically). It doesn't mean you should never make changes ever again, just that there's an impact from doing so. I'm not unsympathetic to people being burned, but it seems like this change was made in a reasonable fashion to me, with attention being paid to that impact.

Going back now would also be a change across (stable) versions, and wouldn't help Polkadot anymore.

6 Likes

I completely agree that the change was handled in a reasonable fashion. It just occurred to me while thinking about this incident, that it would have been better if the API was strictly specified in the first place.

It reminded me about the HN article I've read recently: Have you ever hurt yourself from your own code? | Hacker News

And when I think about it: assuming Rust becoming a dominant systems programming language for the next 50 years, there's a non-negligible probability of some people somewhere hurting themselves or others by making mistake in handling dups in binary search. If we have binary_search_by (stable - returning well specified index) and binary_search_by_unstable (fast, unstable) - some people would educate themselves/avoided that mistake because of the IDE auto-completion suggesting them two versions, and some that would made a mistake, would avoided the consequences of their mistake.

Oh, I wasn't aware of these, and I like such API much better. Thanks!

3 Likes

I think this is a good point, too. Did polkadot have this behaviour in a crate? Did that crate have tests that could have caught this?

I'm guessing no, since it didn't sound like they had test failures when they upgraded compiler versions...

If they did have tests for it, but the code isn't on crates.io, that reminds me of one of those things that might be able to come back again now that the foundation is a thing: There's a (not that expensive) donation tier for getting notified about crater runs in progress, so you can run your private code against the thing and report breakage, without needing to expose the codebase.


workingjubilee made a nice post about this on GitHub with some good research that I'd like to amplify: https://github.com/rust-lang/rust/issues/85773#issuecomment-850235297

It being valid for a binary search to return any element is a sufficiently well-known property of the underlying algorithm that even Wikipedia mentions it.

And from there we can find that C's bsearch does not specify which element should be returned.
java.util.Array.binarySearch also does not specify.
C#'s List.BinarySearch also does not specify.

8 Likes

Ouch. This thread came really poorly. Very unlike Rust community, IMO. Especially when the issue was raised by one of the long time and well know Rust contributor and community member like @brson (Hi!!!)

Seriously, I really hope that we can put aside the fact that it was some crypto project (animosity with "crypto" space is well known) and their own rationale is not stellar, and acknowledge the fact that in multiple people independently report that "it's documented, so it's user's fault" is not a Rust-like engineering spirit.

2 Likes

The topic was closed on GitHub for being too inflammatory, so you bring it up on another official channel to have the flame war that was prevented on GitHub?

I don’t see what question remains. The matter has been described and explained. Increasingly hostile statements of opinion are unproductive. That is why the topic was closed on GitHub and should be closed here.

4 Likes

I think it's a bit more nuanced than that. I certainly agree we don't want to cast aspersions on users.

But I'd also say that "don't rely on things that aren't documented" is a Rust-like engineering spirit, and it's basically the contrapositive (without the "blame" part). That spirit is why there are comments on this forum like

it just feels a bit off, since [...] it relies on being called in a specific order.

Which lead to PRs like the following (that even go through FCP) to document behaviour so it can be relied upon:

Ouch again? :slight_smile:

I have opened the discussion independently here accidentally at the same time. I just discovered there was a github issue about it. I don't think it was anyone's intention to start any flame war, and I don't think I'm provoking anyone, neither do I look to blame anyone for a reasonable decision.

Maybe the unfair explanations of Polkadot developers are triggering a defensive response, but neither me nor (afaik) brson are affiliated with Polkadot, and I'm coming to discuss if it wouldn't be better to leave the current implementation as is and/or come up with APIs that limit a chance of mistakes like this.

Yes, exactly. Obviously there's a nuance here and tradeoffs to be made.

I'd like to note that (if I read this right) the permissable behavior of the implementation here was narrowed for the sake of the user. Which is aligned with what I would consider the best API for binary_search_by. What the documentation says is important, but not as much as the practical considerations of human error. All the UB in C++ is also well documented.

I don't see anything wrong with that GitHub thread (edit: I've been informed there was a worse post in there previously)

Several people explained that the behavior Polkadot wanted was never guaranteed, and that many other languages leave this detail unspecified as well. Deciding that the problem is not considered a Rust bug, and that Rust won't change in response is not the same as animosity/hostility.

In this case it's not even some arbitrary stubbornness or language-lawyering, but genuine difficulty with guaranteeing specific behavior for this algorithm without sacrificing performance.

11 Likes

I think it's unfortunate that the discussion was triggered by this Polkadot incident, because Polkadot devs are trying to save face by shifting focus to change in Rust compiler, and in response a lot of people in Rust community focus on who's technically at fault, and getting defensive. Instead of having a humane and compassionate discussion about the best API, it's just seems to come down to who's technically correct. That's my reading of what happened in that thread, and from what I can tell also brson's one.

Had it been some random developer working on a robotic arm software in spare time, getting hit in the head after upgrading Rust compiler, it might have been a different discussion. (which of course I'm happy that's not the case, but that's kind of what I'm trying to potentially prevent)

I would completely agree with such tradeoff if the difference was huge, but I don't think that's the case, because according to Auto merge of #74024 - Folyd:master, r=m-ou-se · rust-lang/rust@caca212 · GitHub the new implementation is just slightly faster in case with many dups, and marginally slower on big dataset when there are none. I agree it's better, but not a day and night better, so I'm not sure if it's worth losing the predictability. And it wouldn't be a first case where Rust like to provide the version that is slightly slower but less tricky as a default, and opt-in variant for performance. Exactly like in sort vs sort_unstable.

Anyway, my intention is not to stir a flamewar, I think I have nothing more to add, and I think the opposite view is reasonable, even if I don't find it the best choice.

3 Likes

The result of such discussion will be recorded in the API documentation. Technical specifications are where you record the lessons that are learned. If you don't read and follow the documentation... you end up with broken software. Changing the API doesn't fix that.

Technical correctness is the means by which understands the API. It's not a diversion from the problem, and it's not an alternative to being humane and compassionate. It's how you adhere to what is already known.

2 Likes

I agree that there's a potentially interesting/beneficial conversation to be had about the topic, and also that it's difficult to have under the context of a specific incident. There's been an unfortunate amount of these sort of touchy/angry/blaming incidents lately it seems (in the programming communities generally). I don't want to get into it in this context either -- I had another response half-written but deleted it; seemed to likely to give the wrong impression or trigger an angry reaction from someone else.

Maybe bring it back up as it's own idea, outside of the context of a specific incident, after some amount of time has passed. It's clear to me you care about the topic on its own merits.

3 Likes