Can we automatically check that code from ChatGPT is correct?

Check AI-Generated Code Perfectly and Automatically | by Carl M. Kadie | Jun, 2023 | Medium (free)

Used Kani to validate Rust code from ChatGPT. On small problems, Kani does a great job telling good code from bad code. On bigger problems, ... not so much.

No, I don't think this is going to be realistic for usefully-sized (and usefully-complex) programs anytime soon. And that's exactly the most serious issue with code automatically-generated by a poorly-understood (by the average LLM user) black box. IMO if one wants to be sure that some generated code is correct, it's not enough to skim over it, you have to understand what it does and how it achieves its goals. And that's empirically a lot harder than just writing the code yourself in the first place could have been.


How do you define correct?

Presumably that requires a 100% complete and unambiguous requirements definition. Whatever it is checking the ChatGPT code can then check it against that requirements definition. If you don't have that then, no.

Alternatively your requirements could be specified as a mass of tests that the generated code must pass. If you don't have that then no.

How is checking code that came from CharGPT different than checking code written by any human that tries to implement your requirements? We do that all the time, with human code reviewers of course.


Evaluating the correctness of arbitrary programs is undecidable.


Good question!
The first problem I looked at was finding the length of an inclusive range, such as, 3..=10 where the endpoints were for, example, i32s. I was willing to assume that this code:

((end as i128) — (start as i128) + 1))

-- that is, casting to a much bigger type -- will give the answer I want.

In general, we can often write obviously correct but slow code. The hope is to leverage “obviously correct but slow” code to create “obscurely correct and fast” code, such as my original code:

end.overflowing_sub(start).0 as u8 as usize + 1

(Measuring “slow” and “fast” requires benchmarking.)

ChatGPT couldn't find a solution to this problem.

However, it did figure out this:

In Rust, a and b are i32s. How can I test that a+1 < b without overflow?.

I was willing to assume that this was correct: (a as i128 + 1 < b as i128). ChatGPT 4.0: suggested this:

match a.checked_add(1) {
    Some(sum) => sum < b,
    None => false,

which is mathematically equivalent, but much faster (and 3% faster than my original code).

The last problem I looked at related to implementing "set union" by inserting a range into a sort, disjoint list of ranges. Neither ChatGPT nor Kani were up to solving the problem. However, my intuition is that correctness could be defined relative to doing the set union with a simpler, less efficient, data structure (+ some overflow checks and invariants).

  • Carl
1 Like

While true, this is somewhat vacuous. It's not particularly interesting to not be able to determine if a process will exit with the correct answer in 500 years or never for most people, so the halting problem is not really a practical issue. Likewise Godel's incompleteness theorem (highly simplified) merely states that there are always statements that you can't get a true or false value for. Just mark any programs containing such statements as being not proven correct, problem handled.

There's not really any theory that proves that you can't make a practical software correctness prover - which is convenient, because we've already made several!


Are you sure?

I recall having to debug code that accidentally got stuck in and endless loop in some odd situation that the authors had not though would happen. The author could have been me :slight_smile:

Embedded system designers worry about these kind of hang ups which is why they employ watchdog timers to reboot the devices.

It would be nice if a compiler or proof system could spot this.


The undecidability of the halting problem is of practical interest in one way, and it is not in another. Observe that compilers and analysis tools already detect some endless loops:

warning: function cannot return without recursing
 --> src/
1 | fn foo() {
  | ^^^^^^^^ cannot return without recursing
2 |     foo();
  |     ----- recursive call site

The theory of the halting problem doesn't contradict reality and say that this can't happen. It tells us that we will never be able to detect all of them in all possible programs — it tells us to not hold out hope for a perfect static analyzer. In this sense, it is “practical”. But it is “impractical” in that:

  1. The theory doesn't tell us much about how to build a static checker that is useful even if it is imperfect.
  2. As @simonbuchan mentioned, it is very easy to write a program that is provably terminating — perhaps even one in a language that has argumentless function calls and no other form of computation — but which will not terminate before your computer has a hardware failure.

So, if we want to do static analysis, the halting problem only tells us where not to look, and it also doesn't even fully answer the question we care about. Go forth and build your best approximations anyway, knowing that they are approximations.

To see another angle on what the halting problem does and doesn't tell us, consider a consequence of it: the Rust borrow checker rejects some programs that, if they were instead to be compiled and run, would not in fact encounter any aliasing violating the Rust reference rules. These are sound programs that the compiler rejects anyway. The halting problem tells us that there will always be such programs.

This is the world we actually live in, and so we make the best of it: we write safe code that is within the set of code the compiler can prove is sound, and we design languages so that that set has (as much as we can manage) well-understood boundaries which programmers don't regularly find themselves wandering outside.

There's a moderately famous remark (by Harold Abelson) that: "Programs are meant to be read by humans and only incidentally for computers to execute." This is meant as a reminder to write code that can be understood and therefore can be maintained — to pay attention to properties other than whether it produces the right output if executed. But, in any language that has static analysis the compiler is expected to perform (a type system, perhaps), the compiler must “read” the program too; the compiler is constructing a proof that the program is valid, for some set of validity axioms chosen by the language designers and lemmas chosen by the compiler authors. And one of the factors the language designers should consider is making it easy to write programs that the compiler can understand, just as they should also make it easy to write programs that humans can understand.

This is also to some extent a tradeoff. The smarter the borrow checker (or even moreso, the trait solver) gets, the less obvious the boundaries of what it can understand become to humans. You can make the compiler smarter and increase how many programs it accepts that humans might write and say “this is obviously correct and the compiler should accept it”. But, any time you do that, you're also likely to be increasing the complexity of the boundary between accepted programs and rejected-but-valid programs, and thus making the definition of the language more complex, and the compiler and language harder to work on without accidentally breaking something (in the form of rejecting a previously accepted program, because a new choice of “proof strategy” doesn't handle a case the old one did).

— whew, that got a bit tangential. Hope it helps someone anyway.


Frame query as a Coq ( ) theorem, and ask for code w/ Coq proof.

Run the Coq proof through the Coq verifier.

This side steps the "but halting problem" arguments since verifying "if P a valid Coq proof that Q halts" is decidable.

I think there would already be value in having these tools produce programs that actually compile by hooking a fail-fast check, build, test step (including dependencies and all that) into the prediction loop and then doing a tree search for valid programs. Compiler errors could be fed back to the model too.

That would be an improvement over unconstrained predictions.

This integration would have to be pretty low-latency to allow fast exploration and probably has to be done locally (not with a provider that charges per prompt) to be cost-efficient.

Adding tests and equivalence-proofs would be steps beyond those basics.

1 Like

There are many "levels" of incorrectness, for example:

  • the program doesn’t compile
  • the program is unsound (exhibits UB)
  • the program is sound but silently works incorrectly due to a logic error
  • the program panics due to a logic error
  • the program makes assumptions about the environment that are incorrect in some cases
  • the program makes assumptions about the input that are incorrect in some cases
  • the program works according to one interpretation of the spec but the spec is ambiguous
  • the program works unambiguously according to the spec but the spec was not what was actually intended
  • the program works according to the spec in general but not in edge cases
  • the program works according to the spec in general but the spec failed to cover edge cases
  • the program works according to the spec, and the spec was what was intended, but still not what was actually needed,

and even

  • the program (intentionally or not) works as intended but not according to the spec because the spec was buggy!
1 Like

And perhaps finally:

  • The program perfectly conforms to the Spec. The Spec is perfect. The customer has changed their mind!

When working on testing the flight control software of the Boeing 777 I found what I thought was a bug that would cause loss of control of the rudder (not "could", " would"). The code was good, the spec had the problem. On raising it with Boeing they came back and said "No, the spec is as it is intended to be". Boeing did not change their mind until the second or third test flight when the test pilot got furious mad at losing control of the rudder during a stall test.

Ah, sorry for reminiscing. Luckily I don't work in that world anymore.


How do you plan on precisely classifying those statements? Hint: That is also undecidable.

Wow. It's probably nice to live in some alternate universe where effective correctness provers exist.

Can you tell for us, who live in this universe, where automated theorem provers are extremely limited and pitiful how is it works for you over there? Have the wormhole you have used to reach our universe was designed with these tools or not?

</ sarcasm >

Practically speaking theorem provers are extremely limited because they hit, very quickly, not the Rice's theorem but combinatorial explosion.

In practice what people are using are prover assistants, prover verifiers and other such things.

In theory it should be possible to force ChatGPT to produce guidance for Coq or other such tool but in practice it's hard to convince ChatGPT to even produce correct piece of code if it's more than half-dozen of lines, making something that Coq would accept is a pipe-dream.

Whether the usual AI-style approach (add 10x more processing power to that model and if that wouldn't help do 100x more) would solve that issue is unclear, but today… it's almost impossible to prompt ChatGPT to write something verifyable unless problem you are discussing is trivial.

It's pretty nice dictionary tool for the APIs, but that's it. Program have to be written by human after ChatGPT would show said human some functions which can be useful for the task. And even then it can be quite non-trivial to make it produce the functions you may want to actually use.

Nah, it doesn't. Instead of “but halting problem” you now have “sorry, I couldn't do that” problem. IOW: instead of succeeding to produce pile of garbrage with some useful results in that pile ChatGPT would now start producing excuses with explanation about why it couldn't do anything.

Hardly an improvement.

If you would try that right now, today, they you would find out that there are no exploration it all: after you force ChatGPT to fix few compiler errors an attempt to push further would just produce programs which would have some checks omitted to fit the new bugs you asked about! You go in circles without ever producing satisfactory result.

I recommend you to actually try ChatGPT before you would build all these dream castles in your mind. Maybe one day GPT-10 would be useful for your plan, but GPT-4 as it exists today... nope.

And all these levels are way, way, WAY beyond what ChatGPT may produce today.

What it produces today is “random pile of code which looks vaguely similar to what your professor wants to see on exam to fool such professor into giving you good grade if he doesn't have enough time to try to understand what you wrote”.

Even bringing it to the first step “code actually compiles and thus can be tested” is very often a very frustrating battle.

I'm talking about low-level properties of LLMs in general, not what is exposed to end users.
The models can be adapted to perform a tree-search and explore multiple alternatives at a time, token by token (these are called sampling strategies), skipping over parts of the search space that was already sampled.

That's more or less how the policy network is hooked into MCTS in AlphaZero and its successors, which includes AlphaDev which has made headlines recently. For AlphaDev the goal was to write a better algorithm directly in assembly. For general programming we can start with a lower bar, searching for things that work at all, on the other hand it needs to generate valid high-level code, not assembly. That's where the LLM would come in.

I know nothing of correctness pavers but...

Is it not the case that no matter where the code came from (AI or human) if one wants to use a automated correctness prover on it one needs to provide the prover with a set of rules to prove correctness against?

I get the feeling that for non-trivial programs coming up with those rules would be harder than writing the code oneself.

There are completely automatic provers. They don't require a prover tactic, but do everything automatically.

Very powerful in a few narrow niches, but entirely useless outside of these narrow niches.

And there are prover assistants. These are much more useful, but they are always working with human, not independently.

You, basically, need to understand what your program does. And yes writing program with added proof for Wuffs is much harder than to write correct correct program without proof.

Which is actually harder to write generate than assembly. Because assembly is always “correct” is some sense. It may not do what you want it to do, but it always does something.

Most of the high-level code is non-compileable especially in languages like Rust where there are highly advanced rules which must be satisfied to make code compileable.

This means it's much harder to meaningfully change it.

Sure. And that's why I say that something like GPT-10 may learn to write complicated and correct code. But not today. Today's models can easily write short and correct tirivial code or complicated, nontrivial and incorrect code. But not simultaneously correct and nontrivial code.

I cannot imagine how that could be. I mean tell me, is this program correct?

fn main() {
    println!("Hello, world!");

How could you possibly know? How could a prover?

Automatic provers have a set of rules builtin. For example no UB or no panics. And in addition you may be able to annotate additional constraints as pre and post conditions for functions or by adding explicit panics for cases that shouldn't be allowed. This together forms the specification of the program. Unlike prove assistants they don't require you to provide a step by step proof that the program is correct according to the specification, instead they go look for a proof on their own using for example SAT/SMT solving.

I'm not suggesting that current tools could be used to generate large, error-free programs. I'm just saying that there are some incremental steps that can be taken which would provide utility.

For example the current autocomplete assistants don't even type-check their suggestions. They also complete rightwards of the cursor.
Instead a "fill in some details for this rough sketch of a method in the selected text region and then minimize the compile errors" (as optimizing outer loop, not prompting) could already be an improvement in utility even if they still produce some logic errors.
That doesn't need much improvement in the capabilities of the models, only better and faster harnesses.