Why do we believe a piece of software is correct?

I was just viewing a video, where the speaker pointed out that Dijkstra's vision of proving programs correct has not been fully realised, and he then appears to conclude that testing is the only way we can tell our program is correct ( I might be mis-characterising what he said, but it seemed that way to me ). [ I am not including the video at this stage, as I'd prefer to talk about the topic, at least at first, rather than this particular person. Maybe I will give the video a bit later. ]

I see it differently. When we write some software, it SHOULD be correct even before any testing is done, because we have reasoned carefully ( albeit informally ) that it does what we want. Certainly we do ALSO test the software, but the reason for this is that we know we are fallible, and it's easy to make mistakes. But I would say the belief that a piece of software is correct should rest primarily on the reasoning process, which amounts to an informal proof of correctness. The testing should be secondary.

How do you see it?

5 Likes

The way I see it is that human beings make mistakes while reasoning, the same reason people make calculation mistakes. Tests are a safeguard against the current programmer's mistake and more importantly the future programmer who is making changes. Rust's soundness from its type-system (among other things) is a safeguard against programmer mistakes.
If programmers didn't make mistakes, then it'd have been a whole lot easier. But they do. That's just the way it is.

Not only hasn't it been "fully realized", but I would say that it hasn't been realized even partially outside of very rare niches. As an industry we are continuously moving farther and farther from this vision. Even Rust, with it's emphasis on correctness and safety (relative to other mainstream languages), does not have a full formal model and relies on LLVM with its huge heap of hand written (formally) unverified code transformation algorithms, which do result in miscompilations from time to time.

The problem is that this reasoning is fragile and humans do make mistakes and often miss important edge cases in their reasoning. The tests play role of imperfect guard rails, which help with catching a program misbehaving on cases which previously were working correctly (according to a human who wrote the test).

8 Likes

It's both. 90% of the time it's correct by design, and then testing exists to catch the remaining 10%.

2 Likes

First of all you have to define what is "correct".

If one is going to do a rigorous formal verification of a piece of software there has to be a formal definition of what it is the software should do. And what it should not do I guess. That formal definition of requirements might run into more lines of some requirements specification language than the actual code you have written to satisfy it.

We can now give that formal requirements specification and the programs source code to somebody and they can verify it. Or better still have a software that does all that mechanical proof work for us.

So far, so good. But wait. How do you know that the requirements specification you have written is correct? Bear in mind it may be a bigger document than the source code you are trying to verify.

So, to my mind it is hopeless. We can't even define what is correct yet, let alone check our program is correct.

Have you even ever had such water tight requirements specification? Some vague notion of "what we want" does not cut it.

Personally I can't be doing with all that thinking. I just write code that looks like it might be going in the right direction and keep throwing it at the compiler until it compiles. Then rearrange it as required until whoever wants the program is happy with what it does. Repeat as necessary. I think the young'n's call it "agile" now a days, as if they invented something :slight_smile:

My code is a lot more robust and predictable since I discovered Rust but I'm not sure the correctness has.

7 Likes

We don't? I don't think anybody would claim that any piece of "average" software (an HTTP web server, a web browser, a Rust compiler, a desktop operating system, an AAA game) is correct in any strict sense of the world. With the current state of technology, it takes an extraordinary amount of effort to deliver software which can be believed to be reasonably correct.

Rather, it's that we don't actually need correct software for overwhelming majority of cases.

There are some narrow areas where you do actually care about correctness, but a question about those areas should be narrowly formulated.

13 Likes

Can you give an example of what software you think does not need to be correct?

I have some ideas in mind but I'll save them for now as I'm wondering what you are thinking there.

1 Like

I will say that it's quite ok to start off this way, and I certainly do this myself. But at some point, if you are doing your best to have software that works as intended, you review the code with care, and do not rely solely on tests. In other words, when I wrote "When we write some software, it SHOULD be correct even before any testing is done," I was simplifying the process.

1 Like

You could think of it kinda like tolerances. If you're building a house, the floor doesn't need to be perfectly level, the corners don't need to be perfectly square, etc. And similarly, for most software it's totally fine if it crashes every once in a while.

Sure, ideally software wouldn't crash and every 2x4 would be exactly 1½"×3½", but that's just not how it is.

5 Likes

After decades of working on safety-critical software for avionics and the military I think I have done more than enough designing, reviewing and testing.

In my current projects it's hopeless. The clients change their minds about what they want ten minutes after the software is delivered....

You are probably misinterpreting something the speaker said. Of course you don't rely solely on tests, you think carefully about the logic when you code. Trial-and-error coding is a very unproductive approach. But this doesn't contradict the first statement.

What he seems to have meant is that since we don't use a formally verified proof system, we make mistakes, and thus, rely on testing to verify we didn't make a mistake somewhere. This agrees with what you're saying.

If you have a few hundred lines of code that has not been tested, it's very likely there are some mistakes, and in that sense, testing is the only way to see whether it works.

Well, I did already acknowledge this possibility, so here it is ( 1 hour 45 minutes 30 seconds in), see what you think. It seemed "off" to me.

I have one example in mind, the automated subway line 14 in Paris which I used to take everyday. Its software was "famously" formally designed and implemented. (the only reference I could find quickly)

First real success was Meteor line 14 driverless metro in Paris: Over 110 000 lines of B models were written, generating 86 000 lines of Ada. No bugs were detected after the proofs, neither at the functional validation, at the integration validation, at on-site test, nor since the metro lines operate (October 1998). The safety-critical software is still in version 1.0 in year 2007, without any bug detected so far.

I think that's quite a feat of (software) engineering. (bridge engineers would laugh at us)

For more examples and falling into a wikihole

12 Likes

Seems like he's just advising people to test their code to me :slight_smile:

But I scrolled back by 2 minutes, and what he says there makes no sense: he claims that the reason Dijkstra didn't like goto statements is because of the undecidability of the halting problem for some algorithms with gotos. This makes no sense, the halting problem has nothing to do with goto statements. The halting problem is undecidable even in languages without gotos, that's not the reason he didn't like gotos.

The speaker explicitly says that tests do not prove program correctness, but that "we demonstrate that it (a program) is not incorrect by surrounding it with tests".

1 Like

You can become more confident that your specification is correct by proving non-trivial properties of this specification. For example for seL4 proving that the specification guarantees that the cia triple (confidentiality, integrity, availability) is satisfied.

See also "Does seL4 have zero bugs?" at Frequently Asked Questions on seL4 | seL4 docs

I feel like testing is more of a statistical proof than a rigorous one, although maybe "proof" is too strong a term. We're essentially checking a bunch of "interesting" cases and making sure the code behaves as expected, then inferring that the rest of the code is probably correct too.

On the other hand, using the type system and encapsulation lets you approach correctness from a different angle. It sets an expectation that given your type's invariants are enforced correctly (e.g. via privacy then checked with tests) and assuming you don't use any escape hatches (e.g. unsafe), any code that uses your type will also satisfy those invariants. A good example of this is the Vec and String types for uninitialized memory and ensuring valid UTF-8, respectively. And the borrow checker, of course.

The type system can't incorporate everything though, even with cool things like range and dependent types, so this just acts as a "lower bound" for correctness. However, if you combine it with a smattering of tests to act as sanity checks, you can be reasonably confident that your code does what you expect.

And normally that's good enough.

1 Like

Insofar Dijkstra actually thought this was possible, he was simply uninformed on the matter, and that seems unlikely to me. It would be equivalent to solving the acceptance problem, something that has been proven not to be possible.

This isn't true either. All a test can do is show that a very specific path of code yields the expected result. But even 100% test coverage doesn't prove (or even show) a program to be correct. All it means is that the specific bits of the software being tested yield the expected result.

It should, yes. But given combinatorial explosion of the state space the larger the software becomes, it's fairly easy to become overwhelmed. Most people seem to deal with that by...well by not dealing with it. A good example of that is utilizing use cases. The unit and integration tests make sure those the software handles those use cases as expected, but anything else is then pretty much up for grabs. And this is by necessity too, because it isn't free in terms of time and money to reduce the number of bugs in software. So there's an economical component to it as well, even if it's FOSS on GitHub.

Another way to view this is in terms of happy path vs error paths. For each happy code path, there is a multitude of error paths. Many people consider it easier to just not deal with the error paths (this is e.g. how Java software ends up with catch-all clauses). That that results in brittle software is either something they don't consider at all, or something those people basically don't care about.
Of course Rust is a little different in that regard in that it pretty much forces you to at least acknowledge (e.g. .unwrap()) or properly deal with errors. But it still can't do the proper error handling for you, and until/unless something like true AGI arises (and makes humans, and human reasoning, obsolete in the process) it's unlikely that it ever will.

This is not right.

The undecidability of the acceptance problem simply means that it's possible to write code that has unpredictable behavior.

It doesn't mean that it's not possible to formally prove the correctness of code that is designed to be correct.

1 Like

It means it can't be done in a fully automated fashion, which in turn means that it can't be done for arbitrary software because the cost will be too high i.e. it can't and won't be done due to economic constraints.