Classifier Ensembles: A Practical Overview by Cory McKay

7.0 Additional methodologies

Bagging and boosting are two common classifier combination approaches that have repeatedly experimentally proven themselves to be successful. They are therefore given special attention in this section. They can be adapted to work with a variety of fusion and selection techniques. Hierarchical and binary approaches can also be very powerful, and included in this sectoin as well.

7.1 Bagging

Bagging, which is short for “bootstrap aggregating,” is an approach where diversity in the component classifiers is created by training them each with slightly different training sets. Bootstrapping is used to generate the training set for each component classifier, which is to say that for a complete training set of size N, each component classifier’s training set is generated by randomly drawing N instances with replacement. This means that each training set, on average, will contain approximately 63.2% of instances (because (1 – 1/N)^N is approximately equal to e^-1, which is approximately equal to 1 – 0.632), and that certain instances will be drawn more than once and some will be drawn not at all. Bootstrapping is useful in situations where only limited training data is available, as all training instances serve as the population from which each component classifiers can draw training samples.

Bagging is particularly appropriate for use with instable classifiers, where small changes in the training set can have a significant effect on the classifier output. This helps to ensure classifier diversity, which causes the ensemble as a whole to be more effective (see Section 8). Decision trees, multi-layer neural networks and condensed nearest neighbour (but not basic nearest neighbour) classifiers are all appropriate instable classifiers.

7.2 Boosting

Boosting techniques actively attempt to generate complementary component classifiers by training each classifier specifically on the mistakes of previous classifiers. There are many varieties of boosting algorithms.

For example, consider the original boosting approach proposed by Shapire (1990), where one has three classifiers, D1, D2 and D3. The training set is then divided into three sets, X1, X2 and X3. D1 is trained on X1 and then tested on X2. All instances that are misclassified by D1 can then be used to train D2, potentially in addition to the entirety of X2. X3 is then used to test D1 and D2, neither of which have previously been exposed to X3. The instances on which D1 and D2 disagree then serve as the training set for D3.

The result of this is that each successive classifier is trained to correctly classify types of patterns that the previous classifiers are least probable of being able to correctly classify. This not only helps to ensure component classifier diversity, but emphasizes complementary performance of the components. Unfortunately, the disadvantage of this approach is that it requires a large training set, as the training set must be divided up into one independent sub-set for each classifier. Furthermore, assuming that each of the component classifiers is relatively competent, as one would hope, there will only be a small number of instances that are misclassified by each component and forwarded on for training to classifiers further down the boosting chain.

Freund and Shapire (1996) have proposed a popular and very effective boosting algorithm named AdaBoost, short for “adaptive boosting,” of which many variants have since been experimented with. This algorithm solves the previous need for a large training set by using the entire training set as a population from which each component classifier may draw samples. The probability that a given sample will be drawn for training by a given component classifier depends on how well the previously trained classifiers have performed on it. The jth classifier is more likely to choose a given sample if the previously trained classifiers 1 to j-1 had a high error rate when tested on that sample. Among the many variants of AdaBoost, there are some deterministic ones which use reweighting rather than the resampling described above.

It is better to use component classifiers with high variance with boosting algorithms. Linear discriminant algorithms, for example, would therefore be a poor choice. Despite the improvements in AdaBoost over Shapire’s original algorithm, boosting does still in general require more training data than bagging does. Boosting is also fairly susceptible to noise and outliers. So, although boosting algorithms can be extremely effective in some cases, one must, as always, be sure that they are appropriate for a particular problem before applying them.

7.3 Hierarchical and binary classification

Hierarchical and binary approaches reduce complex multi-class classification problems to multiple smaller problems, each involving only a sub-set of all possible classes. In essence, this reduces a difficult classification problem into many smaller, hopefully easier problems.

In the case of hierarchical classification, the classes are organized in a tree-like hierarchy, and a classifier is trained for each non-leaf node in the tree. This allows the classifiers to at first make broad classifications among at categories, and then among increasingly fine categories as a descent is made down the tree. Every classification involves fewer candidate classes than the overall number of classes.

Aside from the advantage of lowering the number of classes that each classifier must be able to classify among, this approach also allows each classifier to utilize specialized feature selection techniques for its particular set of candidate classes. The disadvantage of this approach is that it requires some kind of reasonable pre-existing hierarchical relationship among the candidate classes. A further problem is that an initial poor decision at a coarse classification level can lead to a descent through an incorrect branch of a tree.

Binary approaches involve the training of a separate binary classifier for every possible pair of classes (the round-robin approach) or, alternatively, one binary classifier for each class that outputs a yes/no result for its related class. The results of the binary classifications can then be combined using one of a variety of classifier fusion techniques. These approaches have the advantages that each classifier only has to deal with a greatly simplified classification problem and that it enables one to use powerful binary classifiers, such as support-vector machines. The disadvantage is that a misclassification by a single classifier could have a significant effect on the overall classification of the ensemble.

A particularly effective technique for dealing with this sensitivity to individual misclassifications is the error-correcting output codes approach proposed by Dietterich and Bakiri (1995). In this approach, each classifier outputs one of two opinions: a given instance belongs to a class Cj, or it does not. The results of all of the component classifiers are combined into a classification matrix, from which overall classifications can be drawn. The matrix of all possible classifications by all component classifiers provides one with a set of binary code words matching each possible output of the entire set of component classifiers with a particular overall class.

Of course, this results in a situation where a single misclassification will lead to an incorrect ensemble classification. The solution is a clever technique for treating misclassifications as noise and then introducing redundancy to allow noise tolerance that compensates for incorrect classifications. It is a common practice to introduce a certain amount of redundancy in any binary data that is stored or transmitted so that the corruption of one or a few bits will not cause one to lose the meaning of the stored data. Such redundant error-correcting codes, such as basic even/odd parity, can be used to detect and potentially reconstruct corrupted data. This approach can be applied to errors in the classification matrix described above, so that a single misclassification, or “corrupted bit,” does not cause the overall classification to be incorrect. More specifically, one introduces redundant classifiers in such a way as to increase the Hamming distance between the code words denoting different classes

Next: Classifier diversity

Last modified: April 18, 2005.
-top of page-