[ CogSci Summaries home | UP | email ]
http://www.jimdavies.org/summaries/

J.R. Quinlan (1986). Induction of Decision Trees, Machine Learning, (1), 81-106

@InCollection{quinlan1986,
  author =       "J. R. Quinlan",
  title =        "Induction of Decision Trees",
  booktitle =    "Readings in Machine Learning",
  publisher =    "Morgan Kaufmann",
  year =         "1990",
  editor =       "Jude W. Shavlik and Thomas G. Dietterich",
  note =         "Originally published in {\em Machine Learning}
                 1:81--106, 1986",
}

Author of the summary: David Furcy, 1999, dfurcy@cc.gatech.edu

Cite this paper for:

SYSTEMS: ID3, ASSISTANT.
Machine learning method for induction, classification, concept learning.
Decision tree learning, a supervised inductive learning method that can be extended to handle noisy and/or incomplete data.
Information gain and gain ratio heuristics.

Summary

The paper describes a class of decision tree learning methods to perform supervised, batch (non-incremental) inductive learning (e.g., concept learning) and classification tasks. ID3 and ASSISTANT are two instances of this class. The learned decision tree should capture relevant relationships between attributes' values and class information. In addition, such systems typically use information-based heuristics to bias their learning towards shallower trees. The paper also describes several ways to extend the basic algorithm to handle noisy and/or incomplete data, as well as continuous attribute values.

Detailed outline

Induction task

input: A set of training examples, i.e. objects described as a conjunction of discrete-valued attributes. One particular attribute is chosen as the class of the example.

output: A classification rule (in the form of a decision tree) which, given an object, returns a class.

decision trees (DT)
Every leaf node represents a class (several nodes can represent the same class). If there are only two distinct classes, the induction task reduces to concept learning.
Each inner node represents an attribute and has one child for each of the attribute's values.
When classifying an object O, start with the attribute A represented at the root. Go down the branch corresponding to the value of A for O, and so on until a leaf node is reached. That's the class of O.

ID3-type algorithms

To perform induction, start with a set of objects (training examples) C. Choose an attribute A as the root node. Create as many children as the number of values for A. If all the objects in C belong to the same class, stop with a leaf labelled with that class name. Otherwise, distribute the objects in C among the children nodes according to their value for A. Iterate the process for each child.

The main issue is "which attribute to split on at each iteration?" Since many DT's exist that correctly classify the training set, a smaller one is usually preferred (Occam's Razor). But finding the smallest DT is NP-complete. So we need a heuristic.

This paper advocates the use of information theory: the quantity of information I is a real number between 0 and 1 that characterizes the amount of uncertainty in a set C w.r.t. class membership. I=0 if all objects in C belong to the same class. I=1 if they are evenly distributed between two classes. At each iteration, the heuristic minimizes I.

How to deal with noise

Noise in the data can take the form of wrong attribute values (including class information). With noisy data, attributes may become inadequate (not able to discriminate between objects) and DT's can unnecessarily grow in size to fit the noisy data (overfitting).

Overfitting can be avoided by testing the relevance of an attribute for splitting using chi-square-like statistical tests.
If a leaf contains instances of several classes (due, e.g., to inadequate attributes), a label can either be uniquely chosen as the most represented class or based on a probabilistic class membership.

Unknown attribute values

in the training set. Fill in the value:
- using Bayesian statistics (ASSISTANT)
- using the DT method itself
- using 'unknown' as a new value (does not work well)
- in proportion of the relative frequencies of the known values (*)
at classification time: split the object with the unknown value using (*) above repeatedly down to the leaves. Then, sum up the values for each class and select the highest one. This method degrades gracefully.

Other information-based heuristics

The information-gain heuristic favors attributes with many values (independently of their relevance).

Subset heuristic: ASSISTANT imposes that all tests be binary by grouping subsets of values together. This heuristic produces smaller trees but they seem less intelligible. It can also be used to deal with coninuous attributes. It is computationally expensive.

Gain-ratio heuristic:

                  gain(A)
 gain ratio = --------------
               info_value(A)   --> amount of uncertainty in A's values

Among the attributes with above averagegain, choose the one that maximizes this ratio. This heuristic gives smaller trees when the attributes are binary.

Other heuristics: Other, non information-based heuristics (such as statistical) are mentioned.

Summary author's notes:

none

Back to the Cognitive Science Summaries homepage
Cognitive Science Summaries Webmaster:

JimDavies ( jim@jimdavies.org )

Last modified: Sun May 23 12:06:43 EDT 1999