Specifications and Verification
First Example: Counting Set Bits
Most developers are used to techniques like testing, continuous integration, and thoughtful documentation that can help prevent mistakes from being introduced into a system during its development and evolution. These techniques can be relatively effective, but they risk missing certain classes of bugs. For the most important systems, like those that protect human life or information security, it can make sense to also use formal verification, in which a program is mathematically proved to be correct for all inputs.
Testing takes the actual binary executable and runs it on a subset of the possible inputs to check for expected outputs. The downside of this approach is that tests may miss some critical case. Compared to testing, verification is the process of building a mathematical model of the software and proving properties of that model for all possible inputs.
In this lesson you’ll learn how to use a system called SAW, the Software Analysis Workbench, to build models of functions written in C. You’ll learn how to specify what those functions are supposed to do, and how to write a program in SAWScript that orchestrates the proof that the functions meet their specifications.
The Code
The first program to be verified is pop_count
. This function takes a 32-bit integer and returns the number of bits that are set (“populated”). For example pop_count(0)
is 0, pop_count(3)
is 2, and pop_count(8)
is 1. This description is an English language specification of the pop_count
function. A specification can be written in a number of formats, including English sentences, but also in machine-readable forms. The advantage of machine-readable specifications is that they can be used as part of an automated workflow.
Note: The pop_count
function has uses in many kinds of algorithms and has an interesting folklore.
Here is a sophisticated implementation of pop_count
from the book Hacker’s Delight by Henry S. Warren Jr.:
Exercise: A Safe and a Broken pop_count
pop_count
Write a version of pop_count
that you believe to be correct, and also a version that includes the kind of error that could be made by accident (perhaps by adding a typo to the optimized version). Add them as pop_count_ok
and pop_count_broken1
to popcount.c
in the examples/intro directory.
Testing Programs
You’re not likely to be able to convince yourself that the optimized pop_count
function is correct just by inspection. A unit test, like the following pop_check
can help:
There are some downsides to testing only with chosen values, however. First off, these tests are usually selected by the author of the code, and there is a risk that important values are not tested. This can be ameliorated by a systematic, disciplined approach to choosing test values, but it can never be completely eliminated. This particular unit test is likely to catch egregiously buggy versions of popcount
, but not subtle or tricky bugs.
A second approach to testing is to choose many random values at each execution. This approach may eventually find subtle or tricky mistakes, but not reliably or in a predictable amount of time.
Testing with random values requires an executable specification. This specification may just describe some properties of the output (e.g. that the length of two appended lists is the sum of the lengths of the input lists, or that the output of a sorting function is sorted), or it may be a simpler, more straightforward version of the code that uses an easier algorithm. An executable specification for popcount
can loop over the bits in the word, masking them off one at a time. While this implementation is straightforward, it is also slow.
The function random_value_test
performs randomized testing of a provided population count function, comparing its output to that of pop_spec
. When they are not identical, it prints the offending input, which can aid in debugging.
Finally, one could attempt to exhaustively check the values by enumerating and testing all possible combinations. In the simple case of pop_count
, which only takes one 32-bit integer, this will take about 20 seconds. With a 64-bit version of the program, however, the test would take longer than a normal human lifetime, so this technique is not practical for ongoing software development.
The way formal verification addresses this is by reasoning about mathematical models of a program, which allows it to eliminate huge regions of the state space with single steps. There are many tools and techniques for performing full formal verification, each suitable to different classes of problem. SAW is particularly suited to imperative programs that don’t contain potentially-unbounded loops. In general, the cost of verification is that it requires specialized knowledge and developing mathematical proofs can take much longer than writing test cases. However, for many programs, automated tools like SAW can be used with similar levels of effort to testing, but resulting in much stronger guarantees. At the same time, re-checking a proof can sometimes be much faster than testing large parts of the input space, leading to quicker feedback during development.
Exercise: Testing popcount
popcount
Write a test that detects the defects in your pop_count_broken1
function, and also check that your pop_count_ok
and the optimized pop_count
function have no defects by using manual and random testing. How much confidence do those techniques provide?
Finally, consider pop_count_broken2
, which is only incorrect for exactly one input value. Check how often the randomized test detects the one buggy input.
Symbolic Execution
The way SAW can prove properties about programs is by converting them into an internal representation that is much closer to a pure mathematical function. For instance, pop_count
might be converted to a function like:
There are complications, of course, such as what to do with conditional branches, but as a user of the tool you won’t have to worry about them except when they introduce limitations to what you can reason about. The main such limitation is that symbolic simulation can’t effectively deal with loops whose termination depends on a symbolic value. For example, this simple implementation of add
would not be easily analyzed:
Running SAW
Note: This section uses a library of SAW helpers, in the file helpers.saw
. If you’re comparing this text to the SAW manual, you may notice that a few operations have been abbreviated.
SAW is a tool for extracting models from compiled programs and then applying both automatic and manual reasoning to compare them against a specification of some kind. SAW builds models of programs by symbolically executing them, and is capable of building models from LLVM bitcode, JVM bytecode, x86 machine code, Rust’s MIR internal representation, and a number of other formats.
The first step to verifying pop_count
with SAW is to use clang
to construct its representation in LLVM bitcode. It is important to pass clang
the -O1
flag, because important symbols are stripped at higher optimization levels, while lower optimization levels yield code that is less amenable to symbolic execution. The -g flag leaves symbols in the output which helps SAW produce helpful messages when verification fails. It can be convenient to include this rule in a Makefile
:
After building the LLVM bitcode file (by typing make popcount.bc
), the next step is to use SAW to verify that the program meets its specification. SAW is controlled using a language called SAWScript. SAWScript contains commands for loading code artifacts, for describing program specifications, for comparing code artifacts to specifications, and for helping SAW in situations when fully automatic proofs are impossible.
The specific fact to be verified using SAW is that pop_count
and pop_spec
always return the same answer, no matter their input. For any particular input, this can be checked using pop_spec_check
:
The SAWScript to verify pop_count
is really checking that pop_spec_check
always returns true.
To execute the verification, we invoke saw
on the SAWScript file:
The Proof succeeded!
message indicates to us that our pop_spec_check
function returns True for all possible inputs. Hooray!
Returning to the SAWScript we used, it has three parts:
Lines 1–2 load helper functions and the LLVM module to be verified. This step builds the model from your code.
Lines 4–8 defines the
pop_is_ok
SAWScript specification, which sets up the symbolic inputs to thepop_spec
function, calls the function on those symbolic inputs, and asserts that the return value is True.Line 10 instructs SAW to verify that
pop_is_ok
is true for all possible input values.
The LLVM module is loaded using the llvm_load_module
command. This command takes a string that contains the filename as an argument, and returns the module itself. In SAWScript, the results of a command are saved using the <-
operator; here, the name popmod
is made to refer to the module.
SAW specifications have three main parts:
Preconditions which state what the code being verified may assume to be true when it is called,
Instructions for executing the code.
Postconditions which state what the code must ensure to be true after it is called.
The function is invoked using the execute
command, which takes an array of SAWScript variables that correspond to the function’s arguments. The function being executed is the one named by the string argument in the call to llvm_verify
.
In the postcondition, the expected return value of the function is specified using returns
. In this example, the function is expected to return TRUE
.
Translated to English, pop_is_ok
says:
In other words, pop_is_ok
wraps the C function pop_spec_check
. This C function computes the believed-correct result (by calling pop_spec
), calls the pop_count
function we are analyzing and returns TRUE if the results agree. The SAW wrapper creates the symbolic input variable, executes the function on its input, and ensures that the return value is TRUE
.
Note: SAWScript distinguishes between defining a name and saving the result of a command. Use let
to define a name, which may refer to a command or a value, and <-
to run a command and save the result under the given name. Defining a command with let
is analogous to defining a C function, and invoking commands with <-
is analogous to calling it.
The arguments to llvm_verify
(on line 10 above) are popmod
, which specifies the LLVM module that contains the code to be verified; "pop_spec_check"
, the C function to be symbolically executed; and pop_is_ok
, the SAW specification to check "pop_spec_check"
against. The empty list ([]
) is an optional list of previously proven statements, which is used in larger verification projects as described later in this tutorial. This verification script provides the same level of assurance that exhaustive testing would provide, but it completes in a tiny fraction of the time, fast enough to be part of a standard CI (continuous integration) workflow.
Exercise: Verifying Clever Versions of popcount
popcount
The following versions of popcount
are quite different from the preceding implementations, but they should always return the same value. For both pop_count_mul
and pop_count_sparse
, do the following:
Write a C function, analogous to
pop_spec_check
, that relatespop_spec
to the new implementation.Use
pop_is_ok
inpop.saw
together with additional calls tollvm_verify
to asserts that the modified versionspop_spec_check
also always return true. The string argument tollvm_verify
states the name of the C function being verified - modify it to point to your new specification.Use SAW to verify the implementation. Remember to rebuild the bitcode file after modifying the C sources.
Exercise: Verifying Your pop_count
Implementations
pop_count
ImplementationsVerification is useful for more than just carefully-chosen examples. This exercise is about your programs.
Start with your solutions pop_count_ok
and pop_count_broken1
from the first exercise. Repeat the tasks from the previous exercise, creating specifications and extending pop.saw
to attempt to verify the functions.
As in the output above, you should see one successful verification (for the wrapper corresponding to pop_count_ok
) and one failed one (for pop_count_broken1
). SAW’s messages for failed verifications are quite verbose, but the most important part is the counterexample, which is a concrete input value for which the program fails to return TRUE
. Next apply verification popcount_broken2
from the exercise above, which is only incorrect for exactly one input value, you will see SAW comes up with exactly that counterexample without any guidance from you.
Last updated