Understanding and overcoming the statistical limitations of decision trees

Abhineet Agarwal
University of California, Berkeley (UC Berkeley)
Statistics

Decision trees are important both as interpretable models suitable for high-stakes decision-making and as building blocks of state-of-the-art ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not yet well understood. In particular, it is unclear why decision trees sometimes exhibit a prediction performance gap relative to powerful but uninterpretable machine learning methods such as random forests, gradient boosting, and deep learning.
We partially explain this performance gap by proving sharp squared-error generalization lower bounds for any decision tree fitted to data from a sparse additive generative model with C^1 component functions. By connecting decision tree estimation to rate-distortion theory, we establish a lower bound that is, surprisingly, much worse than the minimax rate for estimating such models, which is attainable via penalized kernel methods. This inefficiency is due not to any facet of the tree-growing procedure, but rather to the loss of power for detecting global structure incurred by averaging responses only within each leaf.
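
For concreteness, the sparse additive setting can be written in one standard formulation (added here for orientation, not quoted from the talk) as

    Y = \sum_{j \in S} f_j(X_j) + \varepsilon, \qquad |S| = s \ll d, \qquad f_j \in C^1, \qquad \mathbb{E}[\varepsilon \mid X] = 0,

where X \in \mathbb{R}^d and only the s coordinates in S affect the response. A regression tree predicts the mean response over each axis-aligned leaf, so it must approximate this sum of one-dimensional functions with piecewise-constant cells, which is the source of the loss of power described above.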
Using this new insight, we propose Fast Interpretable Greedy-Tree Sums (FIGS), which generalizes the CART algorithm to grow a flexible number of trees simultaneously in a summation. We give evidence, via simulations and partial oracle theoretical results, that FIGS is able to disentangle the additive components of a model, thereby reducing redundant splits and improving prediction performance. Experiments across a wide array of real-world data sets show that FIGS usually has better prediction performance than other popular rule-based methods when all methods are restricted to just a few splits. To demonstrate the utility of FIGS in high-stakes domains, we apply it to learn clinical decision instruments and show that they outperform other tree-based methods by over 20%.
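
FIGS has an open-source implementation in the imodels Python package; the following minimal sketch shows how it might be fit to data drawn from a sparse additive model. The FIGSRegressor class name, the max_rules split budget, and the synthetic data are assumptions made for illustration here, not details taken from the talk.

    # Sketch: fitting FIGS to data from a sparse additive model.
    # Assumes the `imodels` package exposes FIGSRegressor with a
    # scikit-learn-style interface and a `max_rules` budget on total splits;
    # treat exact parameter names as assumptions, not a definitive API reference.
    import numpy as np
    from imodels import FIGSRegressor  # assumed import path

    rng = np.random.default_rng(0)
    n, d = 500, 10  # n samples, d features; only 3 features are active below

    # Sparse additive ground truth: y = f_1(x_1) + f_2(x_2) + f_3(x_3) + noise
    X = rng.uniform(-1, 1, size=(n, d))
    y = (np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + np.abs(X[:, 2])
         + 0.1 * rng.normal(size=n))

    # FIGS grows several trees in a sum, so each tree can specialize
    # to one additive component instead of re-splitting redundantly.
    model = FIGSRegressor(max_rules=12)  # small split budget keeps it interpretable
    model.fit(X, y)
    preds = model.predict(X)
    print("train MSE:", np.mean((preds - y) ** 2))

The small split budget mirrors the experiments described above, where all rule-based methods are restricted to just a few splits.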

