Campus Event Calendar: George Candea (04/07/2005 in 46.1

Event Entry

What and Who

Title: A Reboot-based Approach to High Availability

Speaker: George Candea, Stanford University

Event type: SWS Colloquium

AG 1, AG 2, AG 3, AG 4, AG 5, SWS

MPI Audience

Date, Time and Location

Thursday, 7 April 2005

17:00

46.1 - MPII

024

Saarbrücken

Abstract

Application-level software failures are a dominant cause of outages in
large-scale software systems, such as e-commerce, banking, or Internet
services. The exact root cause of these failures is often unknown and
the only cure is to reboot. Unfortunately, rebooting can be
expensive, leading to nontrivial service disruption or downtime even
when clusters and failover are employed.

In this talk I will describe the "crash-only design," a way to build
reboot-friendly systems. I will also present the "microreboot," a
technique for surgically recovering faulty application components
without disturbing the rest. I will argue quantitatively that
recovery-oriented techniques complement bug-reduction efforts and
provide significant improvements in software dependability. We
applied the crash-only design and microreboot technique to a satellite
ground station and an Internet auction system. Without fixing any
bugs, microrebooting recovered most of the same failures as process
restarts, but did so more than an order of magnitude faster and with
an order of magnitude savings in lost work.

Simple, cheap recovery engenders a new way of thinking about failure
management. First, we can prophylactically microreboot to rejuvenate
a software system by parts; this averts failures induced by software
aging, without ever having to bring the system down. Second, we can
mask failure and recovery from end users through transparent
call-level retries, turning failures into human-tolerable sub-second
latency blips. Finally, because recovery is now very cheap, it makes
sense to microreboot at the slightest hint of failure: if the
microreboot is indeed necessary, we speed up recovery; if not, the
impact is negligible. As a result, we productively employed failure
detection based on statistical learning, which reduces false negatives
at the cost of more frequent false positives. We also closed the
monitor-diagnose-recover loop and built an autonomously recovering
Internet service, exhibiting orders of magnitude higher availability
than previously possible.
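
As a rough illustration of the microreboot and call-level retry ideas described above, the following Python sketch restarts only the failing component and then retries the call, so the caller never sees the failure. It is a minimal sketch under stated assumptions, not the speaker's implementation: the names Component, microreboot, and call_with_retry are hypothetical, and the fault is injected by hand.

```python
class Component:
    """Hypothetical application component that holds only discardable state."""

    def __init__(self, name):
        self.name = name
        self.reboots = 0
        self._init_state()

    def _init_state(self):
        # Startup and recovery share one code path (the crash-only idea).
        self.healthy = True
        self.cache = {}

    def microreboot(self):
        # Discard and rebuild this component's state only; others are untouched.
        self.reboots += 1
        self._init_state()

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed")
        return f"{self.name} handled {request}"


def call_with_retry(component, request, max_retries=1):
    """Mask a failure by microrebooting the component and retrying the call."""
    for attempt in range(max_retries + 1):
        try:
            return component.handle(request)
        except RuntimeError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            component.microreboot()


# Two independent components; only one of them is made to fail.
bidding = Component("bidding")
search = Component("search")
search.cache["q"] = "cached result"

bidding.healthy = False  # assumption: hand-injected fault for the example
result = call_with_retry(bidding, "bid #1")
```

In this sketch the retry succeeds after a single microreboot of the bidding component, while the search component and its cached state are left undisturbed.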

Tags, Category, Keywords and additional notes

Keywords, Tags:

Software Systems

Carina Schmitt, 05/11/2006 14:47
Veronika Weinand, 04/01/2005 11:47 -- Created document.
