Campus Event Calendar

Event Entry

What and Who

Fingerprinting performance crises in the datacenter

Moises Goldszmidt
Microsoft Research, Silicon Valley
SWS Colloquium


Moises Goldszmidt is a principal researcher at Microsoft Research (Silicon Valley Campus). His research interests include probabilistic reasoning, graphical models, statistical machine learning, and systems. Prior to Microsoft, Moises held similar positions with Hewlett-Packard Labs, SRI International, and Rockwell Science Center, and was a principal scientist with Peakstone Corporation (a start-up). Dr. Goldszmidt holds a PhD in Computer Science from the University of California, Los Angeles (1992). Since 1999, Moises has focused his research on applying statistical pattern recognition and probabilistic reasoning to the modeling, diagnosis, performance forecasting, and control of distributed networked systems.
SWS, RG1  
Expert Audience
English

Date, Time and Location

Monday, 11 May 2009
16:00
60 Minutes
E1 5
5th floor
Saarbrücken

Abstract


We propose a method for significantly reducing troubleshooting and diagnosis time in the datacenter by automatically generating fingerprints of performance crises, enabling fast classification and recognition of recurring instances. We evaluated the approach on data from a production datacenter with hundreds of machines running a 24x7 enterprise-class user-facing application, verifying each identification result with the operators of the datacenter (and the corresponding troubleshooting tickets). The approach achieves 80% identification accuracy in the operations-online setting, with an average time to identification below 15 minutes after the start of a crisis (operators stipulated a deadline of 60 minutes). In an offline setting, where some parameters can be fitted optimally, the accuracy is in the 95%-98% range. After explaining the fingerprinting method and the results, I will end the talk with a discussion of the possibility of predicting crises, and of extending this work to model the operators' repair actions in order to learn models for automated decision making.

Joint work with Peter Bodik and Armando Fox from UC Berkeley, and Hans Andersen from Microsoft.

Contact

Brigitta Hansen
0681 - 9325691

Video Broadcast

Yes
Kaiserslautern
G26
206

Carina Schmitt, 10/13/2016 16:00
Uwe Brahm, 05/18/2009 12:06
Brigitta Hansen, 04/30/2009 09:43 -- Created document.