Manos Kapritsos is a Postdoctoral Researcher at Microsoft
Research in Redmond, WA. He received his Ph.D. from the University of Texas
at Austin in 2014. His research focuses on designing reliable distributed
systems, by applying fault-tolerant replication to combat machine failures
and using formal verification to ensure software correctness.
Reliability is a first-order concern in modern distributed
systems. Even large, well-provisioned systems such as Gmail and Amazon Web
Services can be brought down by failures, incurring millions of dollars of
cost and hurting company reputation. Such service outages are typically
caused by either hardware failures or software bugs. The systems community
has developed various techniques for dealing with both kinds of failures
(e.g. replication, software testing), but those techniques come at a
significant cost. For example, our replication techniques for handling
hardware failures are incompatible with multithreaded execution, forcing a
stark choice between reliability and performance. As for guarding against
software failures, our only real option today is to test our system as best
we can and hope we have not missed any subtle bugs. In principle there
exists another option, formal verification, that fully addresses this
problem, but its overhead in both raw performance and programming effort is
considered way too impractical to adopt in real developments.
In this talk, I make the case for Sustainable Reliability, i.e. reliability
techniques that provide strong guarantees without imposing unnecessary
overhead that limits their practicality. My talk covers the challenges faced
by both hardware and software failures and proposes novel techniques in each
area. In particular, I will describe how we can reconcile replication and
multithreaded execution by rethinking the architecture of replicated
systems. The resulting system, Eve, offers an unprecedented combination of
strong guarantees and high performance. I will also describe IronFleet, a
new methodology that brings formal verification of distributed systems
within the realm of practicality. Despite its strong guarantees, IronFleet
incurs a very reasonable overhead in both performance and programming
effort.