Fingerprinting performance crises in the datacenter
Moises Goldszmidt
Microsoft Research, Silicon Valley
SWS Colloquium
Moises Goldszmidt is a principal researcher at Microsoft Research (Silicon Valley Campus). His research interests include probabilistic reasoning, graphical models, statistical machine learning, and systems. Prior to Microsoft, Moises held similar positions with Hewlett-Packard Labs, SRI International, and Rockwell Science Center, and was a principal scientist with Peakstone Corporation (a start-up). Dr. Goldszmidt holds a PhD in Computer Science from the University of California, Los Angeles (1992). Since 1999, Moises has focused his research on applying statistical pattern recognition and probabilistic reasoning to the modeling, diagnosis, performance forecasting, and control of distributed networked systems.
We propose a method for significantly reducing troubleshooting and diagnosis time in the datacenter by automatically generating fingerprints of performance crises, enabling fast classification and recognition of recurring instances. We evaluated the approach on data from a production datacenter with hundreds of machines running a 24x7 enterprise-class user-facing application, verifying each identification result with the operators of the datacenter (and the corresponding troubleshooting tickets). In the operations-online setting, the approach achieves 80% identification accuracy, with an average time to identification below 15 minutes after the start of a crisis; the operators stipulated a deadline of 60 minutes. In an offline setting, where some parameters can be fitted optimally, the accuracy is in the 95%-98% range. After explaining the fingerprinting method and the results, I will end the talk with a discussion of the possibility of predicting crises, and of extending this work to model the operator's repair actions for learning models of automated decision making.
Joint work with Peter Bodik and Armando Fox from UC Berkeley, and Hans Andersen from Microsoft.