http://www.jimdavies.org/summaries/

@InCollection{quinlan1986,
  author    = "J. R. Quinlan",
  title     = "Induction of Decision Trees",
  booktitle = "Readings in Machine Learning",
  publisher = "Morgan Kaufmann",
  year      = "1990",
  editor    = "Jude W. Shavlik and Thomas G. Dietterich",
  note      = "Originally published in {\em Machine Learning} 1:81--106, 1986",
}

- SYSTEMS: ID3, ASSISTANT.
- Machine learning method for induction, classification, concept learning.
- Decision tree learning, a supervised inductive learning method that can be extended to handle noisy and/or incomplete data.
- Information gain and gain ratio heuristics.

Every leaf node represents a class (several nodes can represent the same class). If there are only two distinct classes, the induction task reduces to concept learning.

Each inner node represents an attribute and has one child for each of the attribute's values.

When classifying an object O, start with the attribute A represented at the root. Go down the branch corresponding to the value of A for O, and so on until a leaf node is reached. That's the class of O.
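This walk can be sketched in Python, assuming a tree stored as nested dicts (each inner node maps one attribute name to its `{value: subtree}` branches, and a leaf is a class label); the tree and attribute names below are illustrative, not from the paper:

```python
# Hypothetical tree representation: an inner node is {attribute: {value: subtree}},
# a leaf is a class label (a plain string).
def classify(tree, obj):
    """Walk from the root to a leaf, following the branch for obj's value."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[obj[attribute]]  # go down the branch for O's value of A
    return tree  # leaf reached: the class of O

# Illustrative tree in the style of Quinlan's weather examples.
tree = {"outlook": {"sunny": {"humidity": {"high": "no", "normal": "yes"}},
                    "overcast": "yes",
                    "rain": "no"}}
```

For example, `classify(tree, {"outlook": "sunny", "humidity": "normal"})` follows the `sunny` branch, then the `normal` branch, and returns the leaf label.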

The main issue is: which attribute should we split on at each iteration? Since many DTs exist that correctly classify the training set, a smaller one is usually preferred (Occam's Razor). But finding the smallest DT is NP-complete, so a heuristic is needed.

This paper advocates the use of information theory: the quantity of information I is a nonnegative real number (between 0 and 1 in the two-class case) that characterizes the amount of uncertainty in a set C w.r.t. class membership. I=0 if all objects in C belong to the same class; with two classes, I=1 if the objects are evenly distributed between them. At each iteration, the heuristic minimizes I.
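This measure is Shannon entropy in bits; a minimal sketch, with the function name and class counts chosen here for illustration:

```python
from math import log2

def info(class_counts):
    """Uncertainty (entropy, in bits) of a set, given its per-class counts.
    0.0 when all objects share one class; 1.0 for a 50/50 two-class split."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)
```

For instance, `info([8, 0])` is 0.0 (pure set), `info([5, 5])` is 1.0 (maximal two-class uncertainty), and a 9-vs-5 split falls in between.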

Overfitting can be avoided by testing the relevance of an attribute for splitting using chi-square-like statistical tests.
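One way to sketch such a test: compute the chi-square statistic for the contingency table of attribute values vs. classes, then compare it to a critical value for the appropriate degrees of freedom before allowing the split. The statistic computation below is standard; its use here as a relevance filter follows the paper's suggestion, but the function shape is my own:

```python
def chi_square(partition):
    """Chi-square statistic for an attribute split.
    partition: one class-count list per attribute value, e.g. [[3, 1], [2, 4]].
    Compares observed counts to those expected if the attribute were
    irrelevant (independent of class membership)."""
    n_classes = len(partition[0])
    total = sum(sum(row) for row in partition)
    class_totals = [sum(row[j] for row in partition) for j in range(n_classes)]
    stat = 0.0
    for row in partition:
        row_total = sum(row)
        for j in range(n_classes):
            expected = row_total * class_totals[j] / total
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat
```

A perfectly class-balanced (irrelevant-looking) attribute yields a statistic of 0, while a perfectly class-separating one yields a large value; only attributes whose statistic exceeds the chosen critical value would be considered for splitting.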

If a leaf contains instances of several classes (due, e.g., to inadequate attributes), a label can either be uniquely chosen as the most represented class or based on a probabilistic class membership.

Unknown attribute values can occur:

- in the training set. Fill in the value:
- using Bayesian statistics (ASSISTANT)
- using the DT method itself
- using 'unknown' as a new value (does not work well)
- in proportion of the relative frequencies of the known values (*)

- at classification time: split the object with the unknown value into fractional objects using (*) above, repeatedly down to the leaves. Then sum the fractions for each class and select the class with the highest total. This method degrades gracefully.

                  gain(A)
    gain ratio = ---------------
                 info_value(A)

where info_value(A) is the amount of uncertainty in A's values. Among the attributes with above-average gain, choose the one that maximizes this ratio. This heuristic gives smaller trees when the attributes are binary.
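A sketch of this criterion, reusing the entropy function from above (the counts are from a hypothetical 14-object, two-class set split three ways by an attribute):

```python
from math import log2

def info(class_counts):
    """Entropy, in bits, of a set given its per-class counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def gain_ratio(parent_counts, partition):
    """gain(A) / info_value(A) for an attribute A that splits the parent set
    into the given per-value class-count lists."""
    total = sum(parent_counts)
    sizes = [sum(row) for row in partition]
    # gain(A): uncertainty removed by splitting on A.
    gain = info(parent_counts) - sum(s / total * info(row)
                                     for s, row in zip(sizes, partition))
    # info_value(A): amount of uncertainty in A's values themselves.
    info_value = info(sizes)
    return gain / info_value
```

Dividing by info_value(A) penalizes attributes with many values, which would otherwise look artificially informative under plain gain.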

- none


Cognitive Science Summaries Webmaster: Jim Davies (jim@jimdavies.org) Last modified: Sun May 23 12:06:43 EDT 1999