Campus Event Calendar

Event Entry

New for: D1, D2, D3, D4, D5

What and Who

A Reboot-based Approach to High Availability

George Candea
Stanford University
SWS Colloquium
AG 1, AG 2, AG 3, AG 4, AG 5, SWS  
MPI Audience

Date, Time and Location

Thursday, 7 April 2005
17:00
-- Not specified --
46.1 - MPII
024
Saarbrücken

Abstract

Application-level software failures are a dominant cause of outages in
large-scale software systems, such as e-commerce, banking, or Internet
services. The exact root cause of these failures is often unknown and
the only cure is to reboot. Unfortunately, rebooting can be
expensive, leading to nontrivial service disruption or downtime even
when clusters and failover are employed.

In this talk I will describe the "crash-only design," a way to build
reboot-friendly systems. I will also present the "microreboot," a
technique for surgically recovering faulty application components
without disturbing the rest. I will argue quantitatively that
recovery-oriented techniques complement bug-reduction efforts and
provide significant improvements in software dependability. We
applied the crash-only design and microreboot technique to a satellite
ground station and an Internet auction system. Without fixing any
bugs, microrebooting recovered most of the same failures as process
restarts, but did so more than an order of magnitude faster and with
an order of magnitude less lost work.

Simple, cheap recovery engenders a new way of thinking about failure
management. First, we can prophylactically microreboot to rejuvenate
a software system by parts; this averts failures induced by software
aging, without ever having to bring the system down. Second, we can
mask failure and recovery from end users through transparent
call-level retries, turning failures into human-tolerable sub-second
latency blips. Finally, because recovery is now very cheap, it makes
sense to microreboot at the slightest hint of failure -- if the
microreboot is indeed necessary, we speed up recovery; if not, the
impact is negligible. As a result, we productively employed failure
detection based on statistical learning, which reduces false negatives
at the cost of more frequent false positives. We also closed the
monitor-diagnose-recover loop and built an autonomously recovering
Internet service, exhibiting orders of magnitude higher availability
than previously possible.
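
The idea of masking a failure behind a microreboot and a transparent call-level retry can be illustrated with a minimal sketch. This is not the speaker's implementation: the Component class, its microreboot() method, the ComponentFailure exception, and the call_with_retry() helper are all hypothetical names invented here, and the "failure" is simulated with a simple flag.

```python
import time


class ComponentFailure(Exception):
    """Hypothetical error type raised when a component call fails."""


class Component:
    """Toy stand-in for a microrebootable application component (assumed API)."""

    def __init__(self, name):
        self.name = name
        self.corrupted = False  # simulated bad internal state
        self.reboots = 0

    def handle(self, request):
        if self.corrupted:
            raise ComponentFailure(f"{self.name} failed on {request!r}")
        return f"{self.name} handled {request}"

    def microreboot(self):
        # Discard and reinitialise only this component's state;
        # the rest of the application keeps running.
        self.corrupted = False
        self.reboots += 1


def call_with_retry(component, request, max_retries=2, delay=0.0):
    """Call the component; on failure, microreboot it and retry the same call."""
    for attempt in range(max_retries + 1):
        try:
            return component.handle(request)
        except ComponentFailure:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            component.microreboot()
            time.sleep(delay)  # optional pause while the component restarts


auction = Component("auction")
auction.corrupted = True  # inject a failure
result = call_with_retry(auction, "bid-42")
```

In this sketch the caller sees only a slightly delayed successful response; the failure and the recovery stay inside call_with_retry(). A real system would also need retried calls to be safe to repeat, which the toy example does not address.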

Tags, Category, Keywords and additional notes

Software Systems

Carina Schmitt, 05/11/2006 14:47
Veronika Weinand, 04/01/2005 11:47 -- Created document.