Constructing such datasets is expensive and time-consuming. However, for many tasks it is possible to obtain abundant unlabeled content annotated with labels correlated with the desired structured predictions. Examples of such correlated labels include document titles and topic tags for text segmentation, and sentiment scores and helpfulness ratings for summarization. The abundance of such weakly supervised data opens an interesting line of research: designing models that leverage these labels to tackle a wide variety of NLP problems.
In this talk I will consider the sentiment summarization problem. I will present statistical models that exploit user-generated numerical aspect ratings to discover the corresponding topics and are therefore able to extract fragments of text discussing these aspects without the need for annotated data. I will also discuss implications for other NLP problems, the generalization performance of the proposed methods, and important open research questions.