Measuring the coverage achieved by symbolic execution

Cristian Cadar and Timotej Kapus

This blog post is meant for symbolic execution users and developers who are interested in measuring the coverage achieved by a symbolic execution tool as a proxy for the effectiveness of a symbolic execution campaign.  We use KLEE as an example tool, but this discussion is applicable to other implementations as well.  We also use statement coverage in our discussion, but other forms of coverage raise similar issues. 

Suppose you run KLEE on a program for a few hours and you want to measure the coverage achieved by KLEE in this time.  There are two main ways to achieve this: rely on the coverage reported by KLEE or measure the coverage externally using a tool such as GCov.  We describe each in turn and then discuss how they differ from one another and when each of them should be used.

Coverage measured internally ("internal coverage")

The first option is to simply rely on the coverage reported by KLEE.  A tool like KLEE considers a statement to be covered if it has been executed symbolically.  In this case, coverage is measured at the level at which the symbolic execution tool operates — in KLEE's case, in terms of LLVM instructions.

Coverage measured externally ("external coverage")

Instead of using internal coverage, one can take the test inputs generated by KLEE and replay them on a native version of the program instrumented to gather coverage information — e.g., using GCov.  Indeed, one of the main strengths of symbolic execution is that it can generate test inputs covering all the paths that it explores in the program.  These test inputs are a great interface for many different software engineering tasks such as augmenting existing test suites, demonstrating the bugs found, communicating with greybox fuzzing tools, debugging and repairing programs, among others.

An important advantage of measuring coverage using these test inputs is that this can be done independently of the symbolic execution tool that generated them.  This is usually what developers want: to obtain a set of concrete test inputs that increase the coverage of their code, completely untied to the tool that generated them.

Using the generated test inputs to measure coverage also makes it possible to compare a symbolic execution tool with other types of tools that can perform test input generation, including other symbolic execution tools.  Furthermore, this independent way of measuring coverage can also eliminate any bugs related to the way in which the symbolic execution tool reports coverage (while bugs in tools like GCov are of course possible, they can be more easily caught, either via manual validation or comparison with similar tools).

Differences between internal and external coverage

As mentioned before, the two coverage metrics have important differences.  One difference is that the two metrics are often measured at different levels: the external coverage is typically measured at the source level, while the internal coverage is measured at the level at which the symbolic execution tool operates (e.g., LLVM bitcode or CIL).

However, there is another more fundamental difference between the two metrics.  Let's illustrate this with an example. 

Consider a (contrived) run of KLEE in which KLEE executes the statements shown in green within a certain time budget.  The nodes denoted as a[y], a[y+1], a[z] and a[z+1] stand for statements accessing these memory locations.
                           if x > 0 
                        /           \
                       /             \ 
                  if y > 0         if z > 0  {x=1, y=1, z=1}
                   /   \             /   \
                  /     \           /     \
{x=0, y=0, z=0} a[y]   a[y+1]      a[z]  a[z+1]

In this scenario, KLEE covered 4 out of 7 statements and generated 2 test inputs: {x=0, y=0, z=0} for the completed path ending in a[y] and {x=1, y=1, z=1} for the partially-explored path ending in if z > 0.  However, when the 2 test inputs are replayed, they cover 5 out of the 7 statements.  The additional statement is the one involving a[z+1], which wasn't covered by KLEE during symbolic execution.  From the perspective of using KLEE as a test input generation tool, counting a[z+1] as covered is fair game: KLEE generated the test {x=1, y=1, z=1}, which indeed covers that statement.

However, from a bug-finding perspective, there is a big difference between statement a[y], which was covered by KLEE during symbolic execution, and statement a[z+1], which was not.  One major strength of symbolic execution is that it can reason about all possible values at the statements it explores.  When KLEE executed a[y], it asked whether there are any values of y for which a[y] causes a buffer overflow at that point.  By contrast, since KLEE didn't execute statement a[z+1], it did not reason about all possible values of z that could trigger an overflow.  The test case generated at the point if z > 0 chooses only one possible value for z, in our case z = 1.  Assuming array a has more than 2 elements, this value doesn't cause a buffer overflow at statement a[z+1], although other values causing an overflow might exist on that path.  Therefore, if KLEE is used to find bugs, internally-measured coverage is typically a better indicator of effectiveness than externally-measured coverage.

There is another important difference between the two metrics, which is not captured by this simple example.  When developers measure code coverage for their application, they only include their own code, ignoring that of dependencies.  While this is the right decision from the perspective of testing that application, for the symbolic execution engine, the application code is only a subset of all the code to be analysed.  As a result, internal coverage is more appropriate for tracking the progress of a symbolic execution campaign.  Of course, one could also include all the code under analysis in the external coverage measurements.  However, in practice this is harder than it sounds: including libc coverage is not straightforward, and it's even harder to include coverage of the environment-model code that symbolic execution engines ship (e.g., KLEE's POSIX models).

To show the difference between external and internal coverage, we used KLEE to test each of the following applications for 2 hours: bc (calculator), m4 (macro preprocessor), magick_jpg and magick_png (ImageMagick configured to convert JPG to PNG and back), sqlite3 (database system) and tcpdump (tool for dumping network packets).  Into each application we injected hard-to-find bugs in every basic block — hard to find in the sense that each bug is triggered by a unique input value.

The following figure plots the number of unique faults found by KLEE against external (measured by GCov, in terms of lines covered in the application code itself) and internal coverage (reported by KLEE, in terms of LLVM instructions covered in all the code under analysis). 

The figure captures some of the differences between internal and external coverage.  As discussed above, the two metrics are often measured at different levels: in this case source lines of code (external) and LLVM instructions (internal).  However, a more interesting difference can be seen by looking at the data points for sqlite3 and tcpdump: sqlite3 has significantly higher external coverage than tcpdump, but slightly lower internal coverage.  This is likely because sqlite3 gets "luckier" with the inputs generated for partially explored paths, as they end up covering more new statements in the application code itself than in the case of tcpdump.

More importantly, the figure shows that the number of unique faults found increases linearly with internal coverage, whereas there is no such correlation with externally-measured coverage.  In the case of sqlite3 and tcpdump, comparable internal coverage leads to a comparable number of bugs found, despite the much higher external coverage of sqlite3.  When KLEE executes a buggy statement, it reasons about all possible values for that statement; therefore, it will always find the bug if it can be triggered on that path.  Thus, the more instructions are executed symbolically, the more bugs are found.  By contrast, when a concrete input reaches a statement that KLEE didn't execute symbolically, it is unlikely to hit the bug, as the input will not have the right value.


While our fault-injection campaign is contrived, it nevertheless illustrates well the differences between internal and external coverage, and how the number of faults found correlates well with internal, but not with external coverage.  Deciding whether to measure coverage inside or outside of KLEE depends on the specific application scenario for which KLEE (and symbolic execution more generally) is used.  As discussed in this post, the factors to consider include whether the metric is meant to act as a proxy for the effort spent symbolically reasoning about potential bugs, to help understand where a symbolic execution campaign spends most of its time, to measure the coverage achieved in a certain part of the code under analysis, or to compare coverage across tools, among others.

