Naiad: a system for iterative, incremental, and interactive distributed dataflow
Frank McSherry
Microsoft Research Silicon Valley
SWS Distinguished Lecture Series
Frank joined the MSR Silicon Valley lab in 2002, immediately after completing his graduate degree with Anna Karlin at the University of Washington. His interests are in Privacy and large scale Data Mining and Analysis, and the theoretical and practical issues surrounding them. In particular, he helped to develop the recent definition of Differential Privacy, and has been designing and implementing the Privacy Integrated Queries data analysis platform providing these guarantees.
In this talk I'll describe the Naiad system, based on a new model for
low-latency incremental and iterative dataflow. Naiad is designed to
provide three properties we do not think yet exist in a single system:
the expressive power of loops, concurrent vertex execution, and
fine-grained edge completion. Removing any one of these requirements
yields an existing class of solutions (respectively: streaming systems
like StreamInsight, iterative incremental systems like Nephele, and
callback systems like Percolator), but all three together appear to
require a new system design. We will describe Naiad's structured cyclic
dataflow model and protocol for tracking and coordinating outstanding
work, more closely resembling memory fences than traditional distributed
systems barriers. We give several examples of how Naiad can be used to
efficiently implement many of the currently popular "big data"
programming patterns, as well as several new ones, and experimental
results indicating that Naiad's relative performance ranges from "as
good as" to "much better than" existing systems.
This is joint work with Derek G. Murray, Rebecca Isaacs, Michael Isard,
Paul Barham, and Martin Abadi.