New for: D2, D3
As part of this work, we propose a technique, partitioned in-mapper combining, which lets us aggregate data in memory correctly even when the data to be aggregated is larger than the available memory. Experiments on the New York Times Annotated Corpus, which contains roughly 2 million documents, show that our approach runs at least 2 times faster than the naive approach. It obtains more than 90% of the frequent phrases with high precision. Moreover, it finds all highly frequent phrases exactly, along with their accurate counts. Furthermore, with a quick second pass over the data, we provide most of the frequent phrases with their true counts, while still being faster than the naive approach.
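The text above names partitioned in-mapper combining but does not spell out its mechanics, so the following is only a minimal sketch of one plausible reading: the key space is split into hash partitions, and the mapper makes one pass over its input per partition, keeping in memory only the counts for keys in the current partition. The function names, the whitespace tokenization, the use of CRC32 as the partitioning hash, and the `read_records` callback are all illustrative assumptions, not details taken from the source.

```python
import zlib
from collections import Counter
from typing import Callable, Iterable, Iterator, Tuple


def partition_of(key: str, num_partitions: int) -> int:
    """Assign a key to a partition with a stable hash (assumed: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partitioned_in_mapper_combine(
    read_records: Callable[[], Iterable[str]],
    num_partitions: int,
) -> Iterator[Tuple[str, int]]:
    """Count keys with bounded memory by re-reading the input once per partition.

    Assumption: `read_records` can be called repeatedly and returns the same
    records each time (for example, by re-opening the mapper's input split).
    """
    for current in range(num_partitions):
        counts: Counter = Counter()  # holds keys of one partition only
        for record in read_records():
            for key in record.split():  # hypothetical tokenization
                if partition_of(key, num_partitions) == current:
                    counts[key] += 1
        # Flush this partition's aggregates before starting the next one,
        # so each key is emitted exactly once with its complete local count.
        yield from counts.items()


if __name__ == "__main__":
    docs = ["a b a", "b c", "a c c"]
    for key, count in sorted(partitioned_in_mapper_combine(lambda: docs, 3)):
        print(key, count)
```

In this sketch the trade-off is extra reads of the input in exchange for a smaller in-memory table: with more partitions, each pass holds fewer distinct keys, at the cost of one more scan per partition.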