Campus Event Calendar

Event Entry

What and Who

Sampling based algorithms for k-means clustering

Ragesh Jaiswal
IIT Delhi, India
AG1 Mittagsseminar (own work)
AG 1, AG 2, AG 3, AG 4, AG 5, RG1, SWS, MMCI  
AG Audience
English

Date, Time and Location

Tuesday, 28 May 2013
13:00
30 Minutes
E1 4
024
Saarbrücken

Abstract

The k-means clustering problem is defined as follows: given n points in d-dimensional space, find a set of k points (called centers) such that the sum of squared distances from each point to its closest center is minimized. The problem is NP-hard for all non-trivial values of k and d, and various approximation algorithms exist for it. In practice, however, the problem is usually solved with a heuristic that carries no theoretical guarantees on running time or approximation quality. This heuristic, called Lloyd's algorithm, is known to be fast and accurate on most real datasets. It must be started with k initial centers, which it then refines over multiple iterations; the quality of the solution improves with each iteration. So, one way to obtain an approximation guarantee is to initialize Lloyd's algorithm with k centers that themselves give an approximation guarantee. The additional requirement on such an initialization algorithm is that it be simple to implement and extremely fast. One such seeding algorithm that has gained prominence in the past is the k-means++ seeding algorithm.
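The iterative refinement described above can be sketched as follows. This is an illustrative implementation, not code from the talk; the function names and the fixed iteration count are our own choices (Lloyd's algorithm is usually run until the assignments stop changing).

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centers, iterations=10):
    """Sketch of Lloyd's algorithm: alternately assign each point to its
    nearest center, then move each center to the mean of the points
    assigned to it."""
    k = len(centers)
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: group the points by their nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster
        # (a center with no assigned points is left where it is).
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = [sum(coord) / len(cluster) for coord in zip(*cluster)]
    return centers
```

The quality of the final centers depends heavily on the k centers the loop starts from, which is exactly the initialization problem the abstract discusses.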


The k-means++ seeding algorithm is a simple sampling algorithm defined as follows: pick the first center uniformly at random from among the given points. For i > 1, pick a given point to be the i^{th} center with probability proportional to the square of its distance to the nearest of the previously chosen (i-1) centers. Practitioners feel good about this algorithm because it is simple to implement, while theorists feel comfortable because Arthur and Vassilvitskii showed that it gives an O(log k) approximation in expectation. In this talk, we will explore different aspects of this simple sampling technique. We will see how it can be used to obtain a PTAS, observe conditions under which it gives a constant approximation, see some of the limits of this technique, and conclude with some open questions.
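The sampling rule above can be written in a few lines. This is a minimal sketch of the seeding step only, with our own function names; note that a point already chosen as a center has squared distance zero to itself and so cannot be picked again.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seed(points, k, rng=random):
    """Sketch of k-means++ seeding: the first center is uniform over the
    points; each later center is a point sampled with probability
    proportional to its squared distance to the nearest center so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D^2 weights: squared distance of each point to its nearest center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

The k centers returned here would then be handed to Lloyd's algorithm as its starting configuration; the O(log k) guarantee of Arthur and Vassilvitskii already holds for the seeds themselves, in expectation.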

This is a joint work with various colleagues and students at CSE, IIT Delhi.

Contact

G Philip

Geevarghese Philip, 05/27/2013 10:23 -- Created document.