Back to Posts

Arguing for Correctness

January 10, 2020  /  ~5-minute read

Home | Blog | Portfolio | Contact

One of the most important jobs of a software engineer is convincing people that a program is correct with respect to some specification or other. Perhaps most often this person is ourself, but it can also be a teammate, project manager, or client.

It would be lovely if we could formulate “correctness” as a logical conclusion, which we could deductively argue using properties of our programs as premises. That is, if we could actually prove that a program is correct. The field of formal methods studies this formulation, but the current techniques for “proving” a program tend to be heavy-weight and inaccessible to the everyday programmer.

Because of this, it’s much more common to use inductive reasoning to argue for correctness; each premise we present is evidence suggesting correctness. The more evidence we present, the more strongly we may believe the conclusion that the program is correct.

A wise man proportions his belief to the evidence.

David Hume

Evidence is said to be “stronger” if, given that evidence, the conclusion is more likely. Perhaps the most common example of such an argument is unit testing. When we write a test suite, we implicitly argue: “these test cases form a representative sample of the program’s possible inputs, so correctness on these cases suggests correctness for all cases” (a form of inductive generalization ).

The strength of the argument from unit testing is predicated on just how well the test cases capture the entire input space. We often see a couple tests for “normal” inputs and a couple for checking corner cases, like invalid inputs (type systems help us in this regard by rejecting programs which which supply arguments of the wrong type to a function, ruling out that class of invalid input).

Hold on. There’s another argument for correctness hiding in there. When we record the types of a program, particularly in a statically-typed language, we implicitly argue: “the type checker rejects ill-typed programs, therefore, if the program type checks, then it is correct to the degree that the type captures the spec.” For example, if our program has the function capitalize, which takes a string to a string, and we type check it, then it would be redundant (or impossible, depending on the language) to write a unit test to check how capitalize handles invalid inputs, say, of type integer.

This makes type checking a particularly strong argument for correctness, since it rules out classes of errors rather than individual error conditions, as is the case with unit testing.

Unfortunately, it’s still possible for well-typed programs to be incorrect. The canonical example is real number division. Let’s write a spec: “real number division is defined as the quotient of any two real numbers, except when the divisor is zero, in which case it is undefined.” In most languages, the built in division operation has a type analogous to “takes two numbers as arguments and returns a number.” The type fails to mention that division is undefined for a divisor of zero, making division by zero perfectly legal according to the type system. A function which is incorrect in this manner is called a partial function; it is undefined for some of its domain.

What a partial function does when it encounters a part of its domain for which it’s undefined is a matter of preference. Matt Noonan outlines three common idioms in Ghosts of Departed Proofs :

All of the above require extra implementation code to handle the partial cases, which costs CPU cycles at run-time and adds noise to the source code which distracts from the actual behavior of the program.

An alternative to these contortions is to simply write total functions; functions which are defined for their entire domain. Real number division is only defined for a nonzero divisor, so why should we write a type which claims it’s defined for any pair of numbers? All we need is a type system expressive enough to encode this refinement (which we can think of as subtracting the value 0 from a numerical type). The type should now read “takes a number and a nonzero number and returns a number.” By recording the stated property that division is undefined when the divisor is zero, this refined type comes closer to the actual spec. (All that remains is the property that the returned value is actually the quotient of the inputs, which we can argue with unit tests.) In this way, bringing the types of a program closer to its spec generates stronger evidence for the program’s correctness.

Another way to look at this refinement is that we’ve taken a logical error and made it a type error. We’ve taken a potential crash and turned it into a compile-time error, enhancing safety. The more properties stated by a spec we can put into a type system, the stronger an argument for correctness a type check can make.

When it comes to assessing correctness, data scientists have a bit of an edge in that the algorithms they implement have very concrete functional specs, namely, their mathematical definitions. However, some of these algorithms have pretty complicated properties and preconditions, and it isn’t obvious how to encode those in types. The next post will discuss how this is possible.