Julia Stoyanovich is a Visiting Scholar at the University of Pennsylvania. Julia holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst. After receiving her B.S. Julia went on to work for two start-ups and one real company in New York City, where she interacted with, and was puzzled by, a variety of massive datasets. Julia's research focuses on modeling and exploring large datasets in presence of rich semantic and statistical structure. She has recently worked on personalized search and ranking in social content sites, rank-aware clustering in large structured datasets that focus on dating and restaurant reviews, data exploration in repositories of biological objects as diverse as scientific publications, functional genomics experiments and scientific workflows, and representation and inference in large datasets with missing values.
In online applications such as Yahoo! Personals and Trulia.com, users define structured profiles in order to find potentially interesting matches. Typically, profiles are evaluated against large datasets and produce thousands of ranked matches. Highly ranked results tend to be homogeneous, which hinders data exploration. For example, a dating website user who is looking for a partner between 20 and 40 years old, and who sorts the matches by income from higher to lower, will see a large number of matches in their late 30s who hold an MBA degree and work in the financial industry, before seeing any matches in different age groups and walks of life. An alternative to presenting results in a ranked list is to find clusters, identified by a combination of attributes that correlate with rank, and that allow for richer exploration of the result set.
In the first part of this talk I will propose a novel data exploration paradigm, termed rank-aware interval-based clustering. I will formally define the problem and, to solve it, will propose a novel measure of locality, together with a family of clustering quality measures appropriate for this application scenario. These ingredients may be used by a variety of clustering algorithms, and I will present BARAC, a particular subspace-clustering algorithm that enables rank-aware interval-based clustering in domains with heterogeneous attributes. I will present results of a large-scale user study that validates the effectiveness of this approach. I will also demonstrate scalability with an extensive performance evaluation on datasets from Yahoo! Personals, a leading online dating site, and on restaurant data from Yahoo! Local.
In the second part of this talk I will describe on-going work on data exploration for datasets in which multiple alternative rankings are defined over the items, and where each ranking orders only a subset of the items. Such datasets arise naturally in a variety of application domains, including social (e.g., restaurant and movie rating sites) and biological (e.g., analysis of genetic data). In these datasets there is often a need to aggregate multiple rankings, computing, e.g., a single ranked list of differentially expressed genes across a variety of experimental conditions, or of restaurants that are well-liked by one’s friends. I will argue that blindly aggregating multiple rankings into a single list may lead to an uninformative result, because it may not fully leverage opinions of different, possibly disagreeing, groups of judges. I will describe a framework that robustly identifies ranked agreement, i.e., it finds groups of judges whose rankings can be meaningfully aggregated. Finally, I will show how structured attributes of items and of judges can be used to guide the process of identifying ranked agreement, and to describe the resulting consensus rankings to a user.