MPI-INF Logo
Campus Event Calendar

Event Entry

What and Who

Sustainable Reliability for Distributed Systems

Manos Kapritsos
Microsoft Research, Redmond
SWS Colloquium

Manos Kapritsos is a Postdoctoral Researcher at Microsoft
Research in Redmond, WA. He received his Ph.D. from the University of Texas
at Austin in 2014. His research focuses on designing reliable distributed
systems, by applying fault-tolerant replication to combat machine failures
and using formal verification to ensure software correctness.
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Expert Audience
English

Date, Time and Location

Monday, 11 April 2016
10:30
60 Minutes
G26
111
Kaiserslautern

Abstract

Reliability is a first-order concern in modern distributed
systems. Even large, well-provisioned systems such as Gmail and Amazon Web
Services can be brought down by failures, incurring millions of dollars of
cost and hurting company reputation. Such service outages are typically
caused by either hardware failures or software bugs. The systems community
has developed various techniques for dealing with both kinds of failures
(e.g. replication, software testing), but those techniques come at a
significant cost. For example, our replication techniques for handling
hardware failures are incompatible with multithreaded execution, forcing a
stark choice between reliability and performance. As for guarding against
software failures, our only real option today is to test our system as best
we can and hope we have not missed any subtle bugs. In principle there
exists another option, formal verification, that fully addresses this
problem, but its overhead in both raw performance and programming effort is
considered way too impractical to adopt in real developments.

In this talk, I make the case for Sustainable Reliability, i.e. reliability
techniques that provide strong guarantees without imposing unnecessary
overhead that limits their practicality. My talk covers the challenges faced
by both hardware and software failures and proposes novel techniques in each
area. In particular, I will describe how we can reconcile replication and
multithreaded execution by rethinking the architecture of replicated
systems. The resulting system, Eve, offers an unprecedented combination of
strong guarantees and high performance. I will also describe IronFleet, a
new methodology that brings formal verification of distributed systems
within the realm of practicality. Despite its strong guarantees, IronFleet
incurs a very reasonable overhead in both performance and programming
effort.

Contact

Brigitta Hansen
0681 93039102
--email hidden

Video Broadcast

Yes
Saarbrücken
E1 5
029
passcode not visible
logged in users only

Brigitta Hansen, 03/30/2016 13:52 -- Created document.