Streaming algorithms are designed to model the constraints of computation over massive data sets. The input to a streaming algorithm is a read-only array that may be accessed only sequentially, in a small, constant number of passes. The algorithm may use a small, sublinear amount of extra memory to perform its calculations. The resources to be optimized are the number of passes and the amount of extra memory.
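To make the model concrete, here is a sketch of a classic example (not one of the algorithms from the talk): the Misra-Gries frequent-items algorithm, which makes a single sequential pass over the stream and uses only O(k) counters of extra memory, regardless of the stream's length.

```python
def misra_gries(stream, k):
    """One-pass frequent-items sketch using at most k-1 counters.

    Any item occurring more than n/k times in a stream of length n
    is guaranteed to appear among the returned candidates.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No counter free: decrement every counter and
            # discard those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Item 1 appears 4 times in a stream of length 7 (> 7/2),
# so it must survive as a candidate when k = 2.
candidates = misra_gries([1, 1, 1, 2, 3, 1, 2], k=2)
```

The residual counter values are only estimates; a second pass over the stream can verify the exact frequency of each surviving candidate, which is why pass count and memory are the natural resources to trade off in this model.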
In this talk, we will first review theoretical models for computation over massive data sets. We will then discuss small-space, multiple-pass algorithms for a class of clustering problems from machine learning, called learning mixtures of distributions, along with nearly tight lower bounds on resource usage. We will briefly sketch the general framework of these clustering algorithms, which can also be adapted to solve clustering problems in combinatorial optimization, such as facility location and k-median.
Lastly, we will explore a new problem in convex/concave optimization over massive data sets, motivated by the problem of parameter estimation of general statistical models.