MapReduce is a computing paradigm that has received considerable
attention from both industry and research in recent years. Unlike parallel DBMSs,
MapReduce allows non-expert users to run complex analytical tasks over
very large data sets on very large clusters and clouds. However, this
comes at a price: MapReduce processes tasks in a scan-oriented fashion.
Hence, the performance of Hadoop --- an open-source implementation of
MapReduce --- often does not match that of a well-configured parallel
DBMS. We propose a new type of system named Hadoop++: it boosts task
performance without changing the Hadoop framework at all. To reach this
goal, rather than changing a working system (Hadoop), we inject our
technology at the right places solely through user-defined functions
(UDFs), affecting Hadoop from the inside. This has three important
consequences: First, Hadoop++
significantly outperforms Hadoop. Second, any future changes to Hadoop
can be used directly with Hadoop++ without rewriting any glue code.
Third, Hadoop++ does not need to change the Hadoop interface. Our
experiments show the superiority of Hadoop++ over both Hadoop and
HadoopDB for tasks related to indexing and join processing. In this talk
I will present results from a VLDB 2010 paper as well as more recent work.
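To give a flavor of the injection idea, here is a minimal conceptual sketch (in Python, not Hadoop++'s actual Java code; all names are illustrative): a fixed "framework" driver whose behavior is customized only through user-supplied functions, so new functionality --- such as index-based filtering --- is injected via a reader UDF rather than by modifying the framework itself.

```python
# Conceptual sketch only: a toy MapReduce-style driver. The framework code
# (run_job) never changes; all new behavior is injected via UDFs, loosely
# analogous to how Hadoop++ injects indexing logic through Hadoop's UDFs.

def run_job(records, record_reader, map_fn, reduce_fn):
    """Fixed 'framework' code: behavior is supplied entirely by UDFs."""
    intermediate = {}
    for raw in records:
        for key, value in map_fn(record_reader(raw)):
            intermediate.setdefault(key, []).append(value)
    return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}

def plain_reader(raw):
    # Default reader UDF: pass records through unchanged.
    return raw

def make_indexed_reader(wanted_key):
    # "Injected" reader UDF: filters records with a (toy) index lookup
    # before they reach the mapper -- the framework is untouched.
    def reader(raw):
        key, value = raw
        return (key, value) if key == wanted_key else None
    return reader

def identity_map(rec):
    return [] if rec is None else [rec]

def sum_reduce(key, values):
    return sum(values)

data = [("a", 1), ("b", 2), ("a", 3)]
full = run_job(data, plain_reader, identity_map, sum_reduce)
filtered = run_job(data, make_indexed_reader("a"), identity_map, sum_reduce)
print(full)      # {'a': 4, 'b': 2}
print(filtered)  # {'a': 4}
```

Swapping `plain_reader` for `make_indexed_reader("a")` changes what the job computes without touching `run_job` --- the same design principle, at toy scale, behind keeping the Hadoop framework and its interface unchanged.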