Clustering algorithms generate a partitioning of the data into groups, or clusters, such that data objects assigned to a common cluster are as similar as possible and data objects assigned to different clusters differ as much as possible. By performing a cluster analysis, the user can ideally gain an overview of the major characteristics of a data set without any previous knowledge. In practice, however, performing a cluster analysis is often not easy, since most clustering algorithms require numerous input parameters. Without background knowledge on the data, it is often difficult to find a suitable parameterization, and parameters frequently need to be adjusted in a time-consuming trial-and-error procedure with no guarantee that a useful parameterization will be found. Outliers and noise points in real-world data further complicate the search for a suitable parameterization.
In this talk, I will discuss some novel approaches that are important milestones on the way to parameter-free clustering. The basic idea of these techniques is to relate clustering to data compression: a good clustering summarizes the major characteristics of the data and thus allows the data to be compressed effectively. Based on this principle, also known as the Minimum Description Length (MDL) principle, the algorithm RIC (Robust Information-theoretic Clustering) introduces a quality criterion for clustering in order to improve an arbitrary initial clustering, for example an imperfect clustering obtained with an inappropriate parameterization. In addition, RIC provides effective and efficient algorithms for identifying noise points and outliers. The algorithm OCI (Outlier-robust Clustering using Independent Components) is a standalone algorithm for parameter-free clustering. OCI relies on a very general cluster notion supported by the Exponential Power Distribution and Independent Component Analysis, and provides effective clustering of non-Gaussian data.
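The compression-based quality criterion can be illustrated with a minimal sketch. The code below scores a clustering by a two-part MDL cost: the bits needed to encode the model parameters of each cluster plus the bits needed to encode the data points under a per-cluster Gaussian. This is a hypothetical simplification for one-dimensional data; RIC's actual model selection and coding scheme are considerably richer.

```python
import math

def description_length(clusters):
    """Two-part MDL cost of a 1-D clustering (toy sketch):
    model cost per cluster plus data cost, i.e. the negative
    log-likelihood of each point under its cluster's Gaussian."""
    total = 0.0
    n = sum(len(c) for c in clusters)
    for pts in clusters:
        m = len(pts)
        mean = sum(pts) / m
        var = sum((x - mean) ** 2 for x in pts) / m or 1e-9  # guard degenerate clusters
        # data cost: -log2 of the Gaussian density at each point
        for x in pts:
            p = math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
            total += -math.log2(max(p, 1e-300))
        # model cost: two parameters (mean, variance), BIC-style log2(n) bits
        total += math.log2(n)
    return total

# A clustering that matches the data's true structure compresses better,
# i.e. yields a smaller description length:
good = [[1.0, 1.1, 0.9], [10.0, 10.2, 9.8]]
bad = [[1.0, 1.1, 10.0], [0.9, 10.2, 9.8]]
print(description_length(good) < description_length(bad))
```

In this spirit, a criterion of this form can compare alternative clusterings of the same data without user-supplied parameters: the clustering with the smaller description length is preferred.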
A brief survey of my further research areas, including semi-supervised and supervised learning, concludes this talk.