Most state of the art tree learning algorithms require data to reside in memory on a single machine. Adopting this approach for trees on big data is not feasible as the limited resources provided by only one machine lead to scalability problems.
While more scalable implementations of tree learning algorithms have been proposed, they typically require specialized parallel computing architectures rendering those algorithms complex and error-prone.
In this presentation I will introduce two approaches to computing ensembles of regression trees on very large training data sets using the MapReduce framework as an underlying tool. The first approach, inspired by a tree inducting system proposed by
GOOGLE, employs the entire MapReduce cluster to parallely and fully distributedly learn tree ensembles. The second approach exploits locality and independence in the tree learning process.