April 7, 2000
Pruning decision trees and lists with significance tests
Decision trees and lists are potentially powerful predictors that embody an explicit representation of the structure in a dataset. Their accuracy and comprehensibility depend on how concisely the learning algorithm can summarize this structure. The final model should not incorporate spurious effects: patterns that are not genuine features of the underlying domain. Given an efficient mechanism for determining when a particular effect is due to chance alone, the non-predictive parts of a model can be pruned. Pruning requires a sensitive instrument that uses the data to detect whether there is a genuine relationship between the components of a model and the domain. Statistical significance tests are theoretically well-founded tools for doing exactly that.
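The abstract does not spell out a particular test, but the idea of gating pruning on statistical significance can be sketched generically. The function below, a hypothetical illustration rather than the algorithm from the talk, applies a chi-squared test to the 2x2 contingency table formed by a binary split's two children and the two class counts, and recommends pruning when the observed effect is consistent with chance at the chosen significance level:

```python
import math

def chi_squared_prune(left, right, alpha=0.05):
    """Decide whether a binary split looks like chance alone.

    left, right: (positives, negatives) class counts in the two
    children of the split.  Returns True if the split should be
    pruned, i.e. its effect is not significant at level alpha.
    """
    table = [left, right]
    row_totals = [sum(row) for row in table]
    col_totals = [left[0] + right[0], left[1] + right[1]]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # For a 2x2 table (one degree of freedom) the chi-squared
    # survival function has the closed form erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return p_value > alpha
```

A sharply class-separating split such as `(40, 10)` versus `(10, 40)` yields a tiny p-value and is kept, while a near-uniform split such as `(26, 24)` versus `(24, 26)` fails the test and would be pruned.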
In this talk I will discuss pruning algorithms for decision trees and lists that are based on significance tests. I will explain why pruning is often necessary to obtain small and accurate models and show that the performance of standard pruning algorithms can be improved by taking the statistical significance of observations into account. I will compare the effect of parametric and non-parametric tests, show why current pruning algorithms for decision lists often prune too aggressively, and propose a new algorithm for pruning decision lists that is designed to overcome this problem.
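As a companion to the parametric chi-squared test above, the non-parametric alternative mentioned in the talk can be illustrated with a permutation test. This is a generic sketch, not the talk's specific procedure: it asks how often a random reassignment of examples to a split's two children produces a class-purity difference at least as large as the observed one.

```python
import random

def permutation_p_value(left_labels, right_labels, n_perm=2000, seed=0):
    """Non-parametric significance of a binary split.

    left_labels, right_labels: 0/1 class labels of the examples in
    the two children.  Returns the estimated probability that a
    random split of the pooled examples separates the classes at
    least as well as the observed split.
    """
    def purity_gap(a, b):
        # Absolute difference in positive-class proportion.
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = purity_gap(left_labels, right_labels)
    pooled = list(left_labels) + list(right_labels)
    k = len(left_labels)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if purity_gap(pooled[:k], pooled[k:]) >= observed:
            hits += 1
    return hits / n_perm
```

Unlike the parametric version, this makes no distributional assumption about the test statistic, at the cost of the extra computation of resampling.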