March 17, 2000
Learning from Imbalanced Data Sets
As the field of machine learning makes a rapid transition from the status of "academic discipline" to that of "applied science", a myriad of new issues not previously considered by the machine learning research community are now coming to light. One such issue is the problem of imbalanced data sets. Indeed, most learning systems previously designed and tested on toy problems or carefully crafted benchmark data sets assume that their training sets are well balanced. In concept learning, for example, classifiers typically expect the training set to contain as many examples of the positive class as of the negative class.
Unfortunately, this assumption of balance is often violated in real-world settings. Indeed, there are many domains in which one class is better represented than the other. This is the case, for example, in fault-monitoring tasks, where non-faulty examples are plentiful since they typically involve recording from the machine during normal operation, whereas faulty examples involve recording from a malfunctioning machine, which is not always possible, easy, or financially worthwhile.
The purpose of this talk is 1) to demonstrate experimentally that, at least in the case of connectionist systems, class imbalances hinder the performance of standard classifiers; 2) to compare the performance of several approaches previously proposed to deal with the problem; 3) to present an elaboration of a scheme previously used on this problem; and 4) to present a method for combining several schemes in the context of Text Classification.
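To make the notion of rebalancing concrete, the following is a minimal sketch of random over-sampling of the minority class, a commonly used way of rebalancing a training set. It is an illustration only: the abstract does not say which approaches the talk compares, and the function name `oversample_minority`, the 95/5 class split, and the example identifiers are all hypothetical.

```python
import random


def oversample_minority(examples, labels, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest class."""
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is grown to the size of the largest one.
    target = max(len(xs) for xs in by_class.values())

    out_examples, out_labels = [], []
    for y, xs in by_class.items():
        # Sample with replacement to fill the gap (empty for the largest class).
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_examples.append(x)
            out_labels.append(y)
    return out_examples, out_labels


# Hypothetical fault-monitoring data: 95 normal recordings, 5 faulty ones.
examples = [f"normal_{i}" for i in range(95)] + [f"faulty_{i}" for i in range(5)]
labels = [0] * 95 + [1] * 5

balanced_examples, balanced_labels = oversample_minority(examples, labels)
```

Over-sampling adds no new information about the rare class; it only changes how often the learner sees the existing examples, which is one reason such schemes need to be compared experimentally.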