Techniques for Ensuring Software Correctness (Dec2017)

Techniques for Ensuring Software Correctness: What Worked in Past and Present
(Quick revision of 2016 comment on Lobste.rs)


I find it helps to look to prior successes and tradeoffs to try to understand how to best do engineering. People and teams I've identified as independent inventors of software engineering were Bob Barton of Burroughs B5000, Dijkstra w/ THE OS, and Margaret Hamilton in Apollo & 001 Tool Suite. Each had pieces of the problem with some culminating with later work into the TCSEC and Orange Book’s B3/A1 classes of high-assurance security. At IBM, Fagan invented formal, software inspections for lower defects. Harlan Mills invented a method, Cleanroom Software Engineering, that consistently resulted in low defects even on first use. Became my baseline.

In parallel, the safety-critical embedded community continued using all kinds of their own methods or variations on common ones. They often published "lessons learned" reports with useful information. CompSci researchers continued working on formal methods, type systems, testing strategies, and runtimes trying to figure stuff out with numerous successes. Private companies differentiating on high-quality systems used a combination of formal methods (esp Altran/Praxis or Galois), safer languages (all), and DSL’s (iMatix and Galois). They published their own reports on these activities. There's much to learn in them.

Let’s start with what high-assurance projects showed would help by many real-world demonstrators:

    Ambiguous, English requirements created problems. The requirements, functional or otherwise, must be specified in a precise, unambiguous form. They used formal specifications for that combined with English descriptions so people knew what they were looking at.

    Design should be modular with easy to follow control flow plus careful attention to correctness of interfaces and any assumptions modules have about interfacing. All of them did this. B5000 did it with argument and type checks at both compiler level and CPU level during runtime. Hamilton noted interface errors were 80+% of problems on Apollo development. Her USL and 001 Tool Suite attempted to knock out those 80% of flaws using formal specs of functions/interfaces with generators for code matching stated specs. This showed up in larger, commercial use with Meyer's Eiffel Method/Language. At Praxis, SPARK Ada later combined it with static analysis to prove properties of programs. Ada 2012 and a bunch of other languages added some form of contracts.

    Highest-assurance had formal proof that the design matched the requirements or security policy. Medium-assurance did this informally where each part mapped to other parts with clear, convincing arguments they corresponded. Almost every project that applied formal proofs, whether it helped a lot or barely, noted that simply redoing specs and code simple enough for provers to use itself caught errors without doing the proving process. Lasting lesson there.

    Since Bob Barton, it was known that there would be less problems if the language, CPU, and OS made common activities safe to perform by default. His B5000 did that with it being immune to most forms of code injection today. MULTICS’s reverse stack and non-null-terminated strings with PL/0 made it immune to some problems. Wirth’s languages checked bounds, argument types, etc. Hansen had one, Concurrent Pascal, free of race-conditions. Eiffel’s SCOOP took that further with Rust doing it again. Various ML’s, Haskell, and Idris had advanced type systems that let developers enforce even more properties at compile time. The goal of all of these was designing something so that the average person using them wouldn't have to worry about entire classes of problems. Theoretically, it would be more productive since the method automated or reduce the burden of eliminating those errors. More typing led to more time so tools often did prototyping in unsafe mode with quick tests to make sure it worked initially. If it's too much of a problem, then I recommend using quickest methods for real-time prototyping with full analysis and testing of complex modules running in background, on clusters, or overnight to get more thorough results.

    Review of all that. Burroughs did it implicitly with Barton seeming to invent the concept of formalized reviews in an off-hand comment or two. Apollo team did it religiously. Commercial sector got the message when Fagan introduced Software Inspection Process at IBM that dedicated percentage of time to finding bugs based on checklists of kind that keep cropping up. Worked wonders. Later, OpenVMS team alternated one week building, one week reviewing/fixing based on priority, and all tests ran during weekend. One of most reliable OS’s ever built. Microsoft did a 180 in security when a manager from high-assurance security, Steve Lipner, gave them a watered-down, ship-quick version of this list in the form of Security Development Lifecycle (SDL). The reviews & QA tests knocked 0-days down to lower than most FOSS that I can tell. So, review, review, review!

    Testing. Came in many forms with many benefits. Acceptance tests were about customers as much as proving the tech works. Unit tests checked every module and improved quality but test-building and running vs other QA trade-off is still debated. Cleanroom decided bugs users see are worse than those they don’t. So they did usage-driven testing based on models of all the ways the users might use the interfaces to the product. Formal specs or Design-by-Contract allowed test generation based on specs of actual behavior. You don’t manage those tests at all but still get the benefit. Fuzz testing caught errors manual tests often missed so they’re good idea. So far, automated testing from formal specs and/or types combined with manual tests for what those can’t easily express seems like strongest combo for baseline. Use a combo of property-based and random testing overnight fixing based on importance in requirements or severity of damage. A major benefit of Design-by-Contract combined with fuzzing is that failures of a contract take you right to the source of the problem with laser-like focus.

    It was known since early days that compilers were complicated beasts that could trash your software’s correctness. Wirth went with simplified ones that did quick iterations with OK performance. Ada compilers tried to do everything under the sun for high safety and performance but with high build times and licensing costs. LISP’s had ideal method of REPL for instant results, incremental compilation of individual functions for quick iteration of code with acceptable performance, and full compilation with static types for best performance. As stated above, the latter can be done in background or batched overnight. It also showed the complex functionality could be macro’d onto a simpler language that’s easy to compile. 4GL’s did that for BASIC as well with tremendous productivity boosts. Certifying compilers showed up that converted HLL to assembly with evidence they’d always be correct. My recent proposal is to add macros, Design-by-Contract, Rust’s temporal safety, and a few other things to a language like Modula-3 that’s already ultra-fast to compile into production code. Give it LISP-like combo of fast interpreter, incremental compilation, and fully-optimizing compilation where it does these things in background as development happens. Fully test it in overnight runs that do spec-based test generation, acceptance/regression testing, fuzz testing, full analysis, and certifying compilation with tests against the fast versions for correctness checking. Come into work next day to find either what problems to fix or ultra-optimized, correct code ready to integrate into whatever you’re about to build.

    Information-flow analysis. The early version was to detect covert channels that are the bane of high-assurance security’s existence. The only thing we couldn’t reliably, consistently eliminate. The initial method, Kemmerer's Shared Resource Matrix, modeled the system as interacting parts to find what components they shared. If they can read/write it, it’s a storage channel. If they can modify & time it, it’s a timing channel. Later on, such concepts evolved into information flow control that labeled various variables in accordance with your correctness or security model to then track via type systems or other analysis how information flows in your system. Violations of your expectations vs code’s reality are problems to be corrected. Design-by-contract can do some of this but maybe not all. Open problem. Already practical enough that tools such SIF exist for web applications where the tool automatically partitions the app between components on the client (efficiency) and server (security). Such methods could benefit software in general.

    Recovery-oriented computing and fault-tolerance. What most methods in this stage share are fail fast, fail safe, fail as obviously as possible with error messages pointing in the right direction if possible, recover immediately where possible, and allow full reboot to clean state per component or whole system when that's not possible. Hamilton’s people proved this out in Apollo using asynchronous communicating processes, hardware redundancy, code to handle failures of processes/components, and a human-in-the-loop method to let people do what they do better. Tandem NonStop had a clever combination of hardware and software to achieve a highly-parallel, five-nine's platform for mission-critical systems. OpenVMS got close to that with decade plus uptimes by just focusing on code review, error handling in kernel, versioned files, standardized locking/batching/whatever to do hard stuff for apps, and bulletproof clustering. CompSci went from there with exotic stuff such as Byzantine fault-tolerance to handle more problems with increasing complexity or performance hits. Recent research such as the Copilot monitoring scheme and Micro-Restarts paper show that one can still react to lots of problems using simple, efficient techniques.

    Build-systems. I’ll briefly mention this since it doesn’t cause most problems. Developers should protect integrity of software keeping track of all changes to know where problems came from. If security-critical project, it should run on strong TCB with access controls per project or maybe even file to reduce malicious input. Automate any tedious things while letting people decide on the rest as “managing” software in build systems has so many angles I’m still learning basic things in new contexts. Immutable store with per commit integrity, access controls for confidentiality/malice, and replication for availability are the basics. Transport-layer security optionally with cryptographic, port knocking are nice layers on top of that.

    Static and dynamic analysis. These came later in research for lighter, formal methods. The best static analysis tools… probably Astree Analyzer for C and SPARK Ada’s built-in one… can prove software free of common errors if you can express it in the constrained way necessary for the analyzers to understand it. Many model-checkers and SMT solvers were formed that could automatically prove properties about specific software. This ties into spec or property based testing where the solvers can handle what they’re good at with tests handling others. Such thinking led to dynamic analysis that combined specs and/or static analysis techniques with actually running the program. The benefits of that, from safety to extra optimization, led to my idea of staged compilation process that does quick checks in interpreter early on, slightly-slower ones on sub-optimized EXE in background, and full battery with full optimization last. In any case, developers should be using whatever automated analysis or testing tools they can for their language/app.

    Functional/OOP programming. Each group claimed their methods of development prevented flaws common in structured, imperative programming esp as things got big. More flexibility and safety they say. Subsets of these principles were shown to do exactly that. Smalltalk was the early success and is reigning king of OOP model in productivity with Eiffel maybe champ in productivity vs correctness tradeoff. Functional finally matured enough with Ocaml and Haskell ecosystems to do amazing things. Using functional in as pure a style as possible to combine with automated analysis and testing methods seems like great promise in work involved vs correctness obtained. OOP I can’t speak on as I avoided it and have less data on how experts in say Smalltalk or Eiffel might do that.

    CPU or ASIC level. I left this last as it’s often the least-likely to be available. B5000 architecture made it immune-by-default to stack overflows, out-of-bounds accesses, mis-matched arguments in function calls, and malicious data becoming new code due to app flaws. Wirth and Jurg’s Lilith computer designed the CPU, assembly, compiler, and HLL to work together ideally so no abstraction gaps posed a problem. The Secure Ada Target and Army Secure OS (ASOS) projects made tagged hardware plus high-security runtimes that enforced at runtime many protections which Ada may or may not catch at compile time. Although those didn't take off, numerous Java CPU’s doing something similar saw success in marketplace. Recently, at high-end, Cambridge’s CHERI processor makes C apps capability-secure with any violation of Principle of Least Authority (POLA) being caught by the CPU itself. The SAFE processor at crash-safe.org has a policy-neutral, metadata engine built-in that can enforce at runtime something like 50 different policies on software individually or in varous combinations. In such systems, policies might be specified like in previous sections as software is developed with build system automatically producing safe code and micro-policies to enforce them at CPU level. Given enforcement vs efficiency tradeoff, developers would keep in production version the most important policies such as memory safety or leak prevention. I'll quickly mention some techniques for boosting the hardware reliability itself: asynchronous logic knocking out timing issues, triplicated checking w/ voter to knock out general issues, and SOI/rad-hard processes knocking out many EM or radiation issues.

So, there’s what over a decade studying high-assurance systems taught me had big payoff in various projects. Methods such as Cleanroom, formal specification esp of interface expectations, spec-based testing, safe languages, static analysis, and easy iteration w/ more thorough testing later on have collectively always paid off. Small teams with limited expertise or resources applied combinations of these to get better than average results. The companies like Altran or Galois that consistently applied the more rigorous methods rarely had failures in production with most defects being non-critical. All the field reports showing the success of above methods constitutes empirical evidence that strongly argues for starting with those methods in future projects aiming to get similar results. Also, experimenting with different combinations or instances of these strategies might find new time/work/defect tradeoffs that are even better.

Nick P.
Security Researcher/Engineer
(High-assurance and anti-nation state focus)