Wednesday, 2 November 2016

EXE: 10 years later

I've recently returned from Vienna where our CCS 2006 paper on EXE, co-authored with Vijay, Peter, David and Dawson, has received the ACM CCS Test of Time Award, which "recognizes works that had significant impact on the field, as seen from a ten-year perspective."

Besides contemplating the fact that I'm old enough now to have papers that are a decade old, I was also reflecting on the various decisions taken while building EXE, as I see them ten years later.

The CCS'06 paper on EXE was one of the first papers on modern symbolic execution—"modern" in the sense that symbolic execution was originally introduced in the 70s, but at a time and in a form that was not ready for prime time.  EXE was the direct successor of EGT, a system in which we introduced the main ingredients of dynamic symbolic execution (DSE), but in a simplified form (and which was developed concurrently with the DART system, which introduced a different flavour of DSE). EXE is also the direct predecessor of KLEE, a better-known system (which, unlike EXE, was open-sourced and then used and extended by many research and industry groups) that inherits many of EXE's design decisions, but drops many others.

EXE benefitted from the ongoing revolution in SAT solving, which is certainly one of DSE's main enablers.  But it also introduced a second key ingredient, mixed concrete-symbolic execution (also introduced by the DART paper mentioned above, but in a different form): a really simple idea, but one with a significant impact on generality and scalability.  Mixed concrete-symbolic execution simply means that if all the operands of an instruction are concrete, the instruction is executed as in the original program (and in the case of EXE, essentially in its unmodified form). This makes it easy to interact with uninstrumented code, such as closed libraries and the OS, to handle difficult cases such as floating-point arithmetic by continuing execution with partially concrete inputs, and to run symbolically only the code that depends on the symbolic inputs, without having to extract it explicitly from a larger system (which is what enabled us to construct malicious disk images with EXE in a scalable fashion, while instrumenting an entire user-mode version of Linux).
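
To make the rule concrete, here is a minimal Python sketch of mixed concrete-symbolic execution (all names are mine for illustration, not EXE's actual code): an operation whose operands are all concrete runs natively, while an operation touching a symbolic value builds a symbolic expression instead.

```python
class Sym:
    """A node in a symbolic expression tree."""
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __repr__(self):
        return f"({self.op} {' '.join(map(repr, self.args))})"

def is_concrete(v):
    return not isinstance(v, Sym)

def execute_add(a, b):
    # The mixed execution rule: fall back to native execution when possible.
    if is_concrete(a) and is_concrete(b):
        return a + b              # all-concrete: run as in the original program
    return Sym("add", a, b)       # at least one symbolic operand: build an expression

x = Sym("input", 0)               # a symbolic input byte
print(execute_add(2, 3))          # -> 5 (purely concrete, no instrumentation cost)
print(execute_add(x, 3))          # -> (add (input 0) 3)
```

Code that never touches a symbolic value pays (almost) no cost, which is what makes interacting with large amounts of uninstrumented code practical.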

The other core strength of EXE was its co-designed constraint solver STP.  At the time STP was developed, there was nothing resembling it.  Our experience with a previous state-of-the-art solver in EGT had been disappointing, both in terms of the performance and accuracy required in a DSE context, and in terms of generic software reliability.  While nowadays there are many other competing solvers, some of which are much more capable than STP in terms of the theories they can handle, STP still seems to outperform most of them for standard DSE usage.  In other words, STP remains true to its original design: support the minimum needed by symbolic execution, but do it efficiently. To achieve this, David and Vijay recognized from the beginning the advantage of eager translation to SAT, which allowed STP to exploit the recent advances in SAT solving technology.  But perhaps the even more important reason STP made such a big impact in EXE is a reflection of the way in which it was developed: in our regular meetings, Dawson, Peter and I would try to understand how STP operates, while Vijay and David would try to understand the constraint queries representing the latest bottleneck in EXE (or as we put it in the paper, our approach was to "start simple, measure bottlenecks on real workloads and tune to exactly these cases").  As a result, I think our collaboration worked extremely well, with very little wasted time on both sides, something that I've tried to emulate in subsequent collaborations.

While STP was instrumental in EXE's scalability on the constraint solving side, we also recognized the need for targeted constraint solving optimizations at the level of the symbolic execution itself. Symbolic execution uses constraint solving in a markedly different way from other program analysis techniques such as bounded model checking or static program verification: it issues a huge number of queries, but these can be easily decomposed into independent sub-queries, and they share many similarities among themselves.  That's why two such targeted optimizations introduced by EXE, constraint independence and constraint caching, have survived in KLEE, with the former at least being present in pretty much all other DSE systems.
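
The constraint independence idea is easy to sketch. In this illustrative Python fragment (the representation is hypothetical, not EXE's), each path constraint carries the set of variables it mentions, and a query only needs the constraints transitively connected to its own variables:

```python
def relevant_constraints(path_constraints, query_vars):
    """Select only the constraints transitively sharing variables with
    query_vars; the rest cannot affect the query's satisfiability."""
    vars_ = set(query_vars)
    selected, remaining = [], list(path_constraints)
    changed = True
    while changed:                      # fixed point: keep absorbing constraints
        changed = False
        still = []
        for expr, cvars in remaining:
            if cvars & vars_:           # shares a variable with the query set
                selected.append((expr, cvars))
                vars_ |= cvars          # its variables may link in more constraints
                changed = True
            else:
                still.append((expr, cvars))
        remaining = still
    return selected

pc = [("x < 10", {"x"}), ("y == 3", {"y"}), ("x + z > 5", {"x", "z"})]
# A query about x needs the x and z constraints, but not the one on y:
print([expr for expr, _ in relevant_constraints(pc, {"x"})])
```

Passing the solver this smaller, independent subset (and caching results keyed on it) is what makes the huge query volume of DSE manageable.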

There are other aspects of EXE which made it into its successor KLEE, such as having search heuristics as a first-class concept, expressing all constraints in terms of the initial symbolic input bytes, having models that are written (and executed) as actual code, and the simple user interface and output format.  But there are also several aspects that didn't make it into KLEE.

One example is the pointer tracking performed by EXE, based on techniques used by Safe C compilers, which allowed it to quickly determine the referent object during a pointer dereference. While in theory this is the right thing to do, the many corner cases introduced while interacting with uninstrumented code led us to adopt a much simpler but less efficient solution in KLEE, in which we scan the entire list of memory blocks to find referent objects. Perhaps a hybrid solution here would work best, in which pointer tracking would only be adopted for the unproblematic cases. 
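To illustrate the simpler KLEE-style solution, here is a small Python sketch (names are mine, not KLEE's actual code) of resolving a pointer dereference by scanning the list of live memory blocks:

```python
class MemoryBlock:
    """A single allocated object in the symbolic memory model."""
    def __init__(self, base, size):
        self.base, self.size = base, size
    def contains(self, addr):
        return self.base <= addr < self.base + self.size

def find_referent(blocks, addr):
    # Linear scan over all live blocks: O(n) per dereference, versus the
    # O(1) lookup that EXE's Safe-C-style pointer tracking provided.
    for block in blocks:
        if block.contains(addr):
            return block
    return None   # no referent: an out-of-bounds dereference to report

blocks = [MemoryBlock(0x1000, 64), MemoryBlock(0x2000, 16)]
print(find_referent(blocks, 0x2008).base == 0x2000)   # inside the second block
print(find_referent(blocks, 0x3000))                  # -> None (out of bounds)
```

The hybrid mentioned above would use tracked referents on the fast path and fall back to this scan only when tracking information is unavailable.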

Another example is the static instrumentation done by EXE, as opposed to KLEE's interpreter-like dynamic instrumentation. Dynamic instrumentation has many advantages, but there are tradeoffs involved.  For instance, the source-to-source transformation made some debugging tasks easier and it also made it possible to run native code at full speed (by comparison, consider that a large concrete memcpy can easily become a performance issue in KLEE). This being said, EXE's static instrumentation was tightly coupled with its forking mechanism based on UNIX's fork, which was more problematic.  The fork-based design did have some neat advantages, such as free copy-on-write sharing, and the straightforward ability to explore multiple paths concurrently.  But it also prevented other optimizations, such as a finer-grained sharing of data across paths.  The main problem though was the pain involved in debugging a complex multi-process application, despite the utilities we developed to help with this.  
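
The fork-based forking mechanism can be sketched in a few lines of Python (a toy illustration with hypothetical names, not EXE's actual implementation): at a branch on a symbolic condition, the process calls fork() and the kernel's copy-on-write lazily shares all unmodified memory between the two paths.

```python
import os

def fork_at_branch(true_path, false_path):
    """Fork the process at a symbolic branch: the child explores one
    side, the parent the other, with memory shared copy-on-write."""
    pid = os.fork()
    if pid == 0:               # child: explore the true path...
        true_path()
        os._exit(0)            # ...then terminate; its state dies with it
    os.waitpid(pid, 0)         # parent waits here (could instead run concurrently)
    false_path()

taken = []
fork_at_branch(lambda: None,                   # child's work is invisible to the parent
               lambda: taken.append("false"))
print(taken)                                   # the parent sees only its own path
```

Note how the child's writes never reach the parent: each path gets its own copy of any page it touches, which is exactly the free sharing mentioned above, and also exactly why debugging a tree of such processes is painful.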

The final lesson, I think, is one of persistence.  While I consider EXE one of the most beautiful papers I've written so far, its impact would not have been the same if we just stopped working on this topic after CCS 2006 (and it is very tempting to abandon existing lines of research, as publishing "follow-up" research, regardless of its value, is usually harder).  Despite some tough times initially, I am really glad we didn't.

Tuesday, 20 September 2016

Comprehensively testing software patches with symbolic execution

A large fraction of the costs of maintaining software applications is associated with detecting and fixing errors introduced by patches. Patches are prone to introducing failures, and as a result users are often reluctant to upgrade their software to the most recent version, relying instead on older versions, which typically expose a reduced set of features and are frequently susceptible to critical bugs and security vulnerabilities. To properly test a patch, each line of code in the patch, and all the new behaviours introduced by the patch, should be covered by at least one test case. However, the large effort involved in coming up with relevant test cases means that such thorough testing rarely happens in practice.

...  the full blog post can be read on the IEEE Software Blog

Monday, 23 November 2015

Multi-Version Execution Defeats a Compiler-Bug-Based Backdoor

Cristian Cadar, Luís Pina, John Regehr

What should you do if you’re worried that someone might have exploited a compiler bug to introduce a backdoor into code that you are running? One option is to find a bug-free compiler. Another is to run versions of the code produced by multiple compilers and to compare the results (of course, under the additional assumption that the same bug does not affect all the compilers). For some programs, such as those whose only effect is to produce a text file, comparing the output is easy. For others, such as servers, this is more difficult and specialized system support is required.

Today we’ll look at using Varan the Unbelievable to defeat the sudo backdoor from the PoC||GTFO article. Varan is a multi-version execution system that exploits the fact that if you have some unused cores, running additional copies of a program can be cheap. Varan designates a leader process whose system call activity is recorded in a shared ring buffer, and one or more follower processes that read results out of the ring buffer instead of actually issuing system calls.  

Compilers have a lot of freedom while generating code, but the sequence of system calls executed by a program represents its external behaviour and in most cases the compiler is not free to change it at all. There might be slight variations, e.g. due to different compilers using different libraries, but these can be easily handled by Varan. Since all correctly compiled variants of a program should have the same external behaviour, any divergence in the sequence of system calls across versions flags a potential security attack, in which case Varan stops the program before any harm is done.

Typically, Varan runs the leader process at full speed while also recording the results of its system calls into the ring buffer. However, when used in a security-sensitive setting, Varan can designate some system calls as blocking, meaning that the leader cannot execute those syscalls until all followers have reached that same program point without diverging.  For sudo, we designate execve as blocking, since that is a point at which sudo might perform an irrevocably bad action.
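
The core checks can be sketched in a few lines of Python. This is only a toy model (Varan itself works at the binary level on a shared ring buffer, and all names here are mine): followers replay the leader's recorded syscall sequence, any mismatch is a divergence, and the leader must stall on designated blocking syscalls until the followers catch up.

```python
from collections import deque

BLOCKING = {"execve"}   # syscalls the leader may not run ahead of the followers

class RingBuffer:
    """Toy stand-in for Varan's shared ring buffer of recorded syscalls."""
    def __init__(self):
        self.buf = deque()
    def record(self, syscall):
        self.buf.append(syscall)
    def next(self):
        return self.buf.popleft()

def leader_may_execute(syscall, followers_in_sync):
    # The leader stalls on blocking syscalls (execve for sudo) until every
    # follower has reached the same point without diverging.
    return syscall not in BLOCKING or followers_in_sync

def check_follower(ring, follower_trace):
    """Replay the leader's recorded syscalls against a follower's trace."""
    for follower_call in follower_trace:
        leader_call = ring.next()
        if leader_call != follower_call:
            return f"divergence: leader {leader_call!r} vs follower {follower_call!r}"
    return "ok"

ring = RingBuffer()
for call in ["open", "read", "write"]:
    ring.record(call)
print(check_follower(ring, ["open", "read", "execve"]))  # the backdoor diverges here
```

In the sudo example below, the gcc-compiled and clang-compiled versions play exactly these leader and follower roles.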

So here’s the setup:
  1. We have a patched version of sudo 1.8.13 from the PoC||GTFO article. It runs correctly and securely when compiled by a correct C compiler, but improperly gives away root privileges when compiled by Clang 3.3 because the patch was designed to trigger a wrong-code bug in that compiler.
  2. We are going to pretend that we don’t know about the Clang bug and the backdoor. We compile two versions of the patched sudo: one with Clang 3.3, the other with the default system compiler, GCC 4.8.4.
  3. We run these executables under Varan. Since the critical system call execve is blocking, it doesn’t much matter which version is the leader and which is the follower.

Now let’s visit an Ubuntu 14.04 VM where both versions of sudo (setuid root, of course) and Varan are installed. We’re using a user account that is not in the sudoers file -- it should not be allowed to get root privileges under any circumstances. First let’s make sure that a sudo that was properly compiled (using GCC) works as expected:

$ /home/varan/sudo-1.8.13/install/bin/sudo-gcc cat /etc/shadow
test is not in the sudoers file.  This incident will be reported.

Next, we make sure that the backdoor is functioning as intended:
$ /home/varan/sudo-1.8.13/install/bin/sudo-clang cat /etc/shadow

So far so good. Next let’s try the gcc-compiled sudo as the leader with the backdoored sudo as the follower:

$ vx-suid /home/varan/sudo-1.8.13/install/bin/sudo-gcc \
         /home/varan/sudo-1.8.13/install/bin/sudo-clang -- cat /etc/shadow
test is not in the sudoers file.  This incident will be reported.

What happened here is that the gcc-compiled leader runs as before, since it doesn’t ever try to execute an execve call.  When the backdoored follower tries to execute the malicious execve call, Varan detects the divergence and terminates both processes safely.

Now let’s try switching around the leader and follower, i.e., run the backdoored sudo as the leader with the gcc-compiled sudo as the follower:
$ vx-suid /home/varan/sudo-1.8.13/install/bin/sudo-clang \
         /home/varan/sudo-1.8.13/install/bin/sudo-gcc -- cat /etc/shadow

This time the leader tries to execute the malicious execve call, and Varan blocks its execution until the follower reaches the same system call or diverges. In this case, the follower tries to execute a write system call (to print “test is not in the sudoers file…”) and thus Varan detects divergence and again terminates execution safely.

In this example, we only ran two versions in parallel, but Varan can run more than two versions.  In terms of performance and resource utilization, security applications like sudo are a great match for multi-version execution: they are not CPU-bound, so any performance degradation is imperceptible to the user, and the extra cores are needed only briefly, during the critical security validation checks.  We are looking into applying this approach to other critical security applications (e.g. ssh-agent and password managers), and are investigating a way of hardening executables by generating a single binary with Varan and a bunch of versions, each version generated by a different compiler. We can then deploy this hardened executable instead of the original program.

Of course, Varan can detect misbehavior other than compiler-bug-based backdoors. Divergence could be caused by a memory or CPU glitch, by a plain old compiler bug that is triggered unintentionally instead of being triggered by an adversarial patch, or by a situation where an application-level undefined behavior bug has been exploited by only one of the compilers, or even where both compilers exploited the bug but not in precisely the same way. A nice thing about N-version programming at the system call level is that it won’t bother us about transient divergences that do not manifest as externally visible behaviour through a system call.

We’ll end by pointing out a piece of previous work along these lines: the Boeing 777 uses compiler-based and also hardware-based N-version diversity: there is a single version of the Ada avionics software that is compiled by three different compilers and then it runs on three different processors: a 486, a 68040, and an AMD 29050.

Thursday, 26 June 2014

Can we improve the journal review process in computer science?

As in all scientific disciplines, peer review plays a core role in computer science research. But one aspect that sets apart our discipline from others is that most areas of computer science are driven by conference rather than journal publications. This has been discussed numerous times, and whether our community should change its publication culture is a controversial subject; just Google "conferences vs. journals in computer science" if you are not familiar with this debate.

One key advantage of conference publications is their quick reviewing cycle, which is often assumed to be in conflict with a careful, high-quality reviewing process. While the high reviewing load and the tight deadlines imposed on program committee (PC) members of top conferences do indeed endanger the peer review process, I argue that the conference reviewing process is in many ways of higher-quality than that used by journals. I will discuss two such aspects below.

(1) Conference reviewing puts a lot of emphasis on discussions among PC members.  Top CS conferences have both an online discussion stage, typically lasting a couple of weeks, as well as an in-person PC meeting taking one or two days.  Given the emphasis that our community (justifiably!) places on these discussions, I find it surprising that this is completely absent from journal reviewing. When I review for a journal, I typically only receive the other reviewers' comments (if at all!), without any opportunity to further discuss the paper and clarify our respective positions. Contrast this with the conference reviewing process, where I often have lengthy discussions with the other PC members, both online and in-person, which I believe considerably improves our understanding of the paper and the overall reviewing process.

(2) Conferences have a much easier time convincing top researchers to dedicate a lot of their time to reviewing submissions.  This has an impact both on the quality of the reviews (with experts sometimes unwilling to review for journals) and on the length of the reviewing cycle. If there is any chance for our community to adopt journals more widely (and I'm not arguing here either way), journals need to have a faster publication cycle: having to wait more than one year until your paper is published (and in many cases longer!) is certainly not a big selling point. Whenever I ask editors of top journals in our field why things take so long, they mention the difficulty of finding reviewers. Conversely, I know many fellow junior and mid-career researchers who admit to declining journal reviews as a general rule. However, these same people would rarely refuse the invitation to join the PC of a top conference, despite the latter requiring ten times more work! Why?

The answer is simple: journal reviewers are assigned a secondary role compared to conference reviewers. For instance, while they make a recommendation, they don't have any opportunity to further discuss the paper and their position with their co-reviewers, the final decision being taken solely by the editor.

More importantly, journals provide few incentives to reviewers compared to conferences. Firstly, the PC of a top conference is widely advertised, giving its members a lot of visibility in their community. Contrast this to journals, where only the editorial staff gets this benefit. Secondly, being part of a conference PC gives you the opportunity to meet and have technical discussions with top researchers in your community, both online and in person (not to mention that conference PCs sometimes organize informal workshops co-located with the PC meeting).  Contrast this to journal reviewing, where you don't get the chance to exchange a single word with your co-reviewers!

These are not the only strengths of conference reviewing, and there are certainly many weaknesses as well when compared to journal reviewing. But in relation to some of the points above, I believe journals could improve their reviewing process. In particular, why not give journal reviewers the possibility to have an online discussion? Our community justifiably puts a lot of emphasis on discussions among conference reviewers, and there is no reason for not having this be a core part of the journal reviewing process as well.