
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668-1674.

Author of the summary: David Zach Hambrick, 1998, gt8781a@prism.gatech.edu

If one rejects a radical talent view, then experience is necessary for the acquisition of expertise. That is, one must engage in some form of domain-specific experience in order to acquire expertise. However, Dawes et al.'s review suggests that in some domains, experience may be only weakly related to performance criteria. Research on decision making has contrasted two approaches to making predictions: clinical and actuarial. In the actuarial approach, a decision is based on empirically established relations between predictor variables and an outcome variable. In the clinical approach, decisions are based on the mental processes of the human judge. Research shows that in many decision-making domains, actuarial models are more accurate at predicting complex outcomes than are trained human judges. A third approach combines the clinical and actuarial methods.
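
As a rough illustration of the actuarial approach (not taken from the paper), the sketch below shows what such a rule looks like in code: a fixed, explicit combination of predictor variables whose weights reflect empirical relations with the outcome. All predictor names and weights here are hypothetical.

    # Minimal sketch of an actuarial prediction rule. The predictors and
    # weights are hypothetical, chosen only to illustrate the idea.
    def actuarial_predict(test_score, prior_incidents, age):
        """Return a predicted outcome score from a fixed linear rule."""
        # In practice the weights would be derived empirically (e.g., by
        # regression on prior outcomes), not from a judge's intuition.
        w0, w1, w2, w3 = 0.5, 0.03, 0.40, -0.01
        return w0 + w1 * test_score + w2 * prior_incidents + w3 * age

    # Given the same inputs, the rule always produces the same prediction,
    # which is one source of its advantage over human judges.
    print(actuarial_predict(test_score=70, prior_incidents=2, age=35))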

 

Two conditions must be met to make a fair comparison between clinical and actuarial methods. First, both methods must base their decisions on the same data. However, the rules and strategies used to make the decisions can be derived from different data: for example, the weights in a regression equation might be derived from prior outcomes, while a clinician may base his or her judgments on experience with different cases. Second, inflation of an actuarial model's accuracy due to chance must be avoided. Cross-validation (i.e., applying the model to samples other than the one used to derive it) provides a realistic assessment of how well the model predicts.
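
To make the cross-validation point concrete, here is a minimal sketch (using randomly generated placeholder data, not data from the paper) of deriving regression weights from one sample and then assessing the model on a held-out sample:

    # Sketch of cross-validating an actuarial model: fit weights on one
    # sample of prior cases, then measure prediction error on a fresh
    # sample. The data are random placeholders for illustration.
    import numpy as np

    rng = np.random.default_rng(0)

    def make_sample(n):
        # Two hypothetical predictors and an outcome that depends on them.
        X = rng.normal(size=(n, 2))
        y = 1.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)
        return X, y

    X_train, y_train = make_sample(100)
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    weights, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)  # derive weights

    X_test, y_test = make_sample(50)  # a different sample
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    pred = A_test @ weights
    print("held-out RMSE:", np.sqrt(np.mean((pred - y_test) ** 2)))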

 

The Goldberg rule is an example of a crude actuarial model used in clinical assessment. This simple rule distinguishes between neurosis and psychosis on the basis of MMPI responses. Goldberg found that the rule's diagnostic accuracy exceeded that of human judges in several different settings (70% vs. 62%). Moreover, humans remained less accurate than the rule even after extensive practice in making diagnoses. In another study, Goldberg constructed mathematical models of 29 human judges. In subsequent predictions, he found that the models outperformed the judges themselves. The implication is that the judges were not consistent in applying their own decision rules. As Dawes et al. explain, "the models were more often correct than the very judges on whom they were based. The perfect reliability of the models likely explains their superior performance in this and related studies" (p. 1669). In another study, pathologists rated the prognosis of patients with Hodgkin's disease along nine dimensions. Regression equations were formed relating the judges' ratings to survival time, and these models outperformed the judges in subsequent predictions. The implication is that "the pathologists' ratings produced potentially useful information but that only the actuarial method, which was based on these ratings, tapped their predictive value" (p. 1669).
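
For concreteness, a Goldberg-style rule can be written in a few lines of code. The scales and cutoff below (L + Pa + Sc - Hy - Pt, with a cutoff of 45) follow the form commonly reported for Goldberg's rule, but treat the specifics as illustrative rather than as the exact published formula:

    # Sketch of a Goldberg-style actuarial rule for MMPI profiles.
    # Scale names and cutoff follow the commonly reported form of the
    # rule; the specifics should be treated as illustrative.
    def goldberg_rule(L, Pa, Sc, Hy, Pt):
        """Classify an MMPI profile as 'psychosis' or 'neurosis'."""
        index = L + Pa + Sc - Hy - Pt
        return "psychosis" if index >= 45 else "neurosis"

    print(goldberg_rule(L=50, Pa=65, Sc=70, Hy=60, Pt=62))  # -> psychosis

Because such a rule applies the same weights every time, it has the perfect reliability that Dawes et al. credit for the models' superior performance.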

 

The advantage of actuarial models over human judges has been replicated in many domains, including the prediction of college grades and of criminal recidivism. Moreover, the same results are obtained when judges are allowed to use any information they choose in making their decisions. As Dawes et al. comment, "The various studies can thus be viewed as repeated sampling from a universe of judgment tasks involving the diagnosis and prediction of human behavior" (p. 243). But are there circumstances under which humans beat the models? When predictions are predicated on an internally consistent theory, humans sometimes outperform actuarial models. But as Dawes et al. note, in the social sciences some theories allow for contradictory conclusions. For example, "Prediction of treatment response or violent behavior may rest on psychodynamic theory that permits directly contradictory conclusions and lacks even formal measurement techniques" (p. 243).

 

The broken-leg problem also illustrates how judges might outperform actuarial models. In this problem, a prediction about movie attendance is informed by the observation that an individual who has just broken a leg is unlikely to attend a movie. An actuarial model would not take this rare event into account. "The clinician may beat the actuarial method if able to detect a rare fact and decide accordingly" (p. 243). However, Dawes et al. also note that overall accuracy is greater when the judge relies solely on mathematical prediction, because "When operating freely, clinicians apparently identify too many 'exceptions'" (p. 1671). Clinical judgment may also be preferable in situations where false-negative and false-positive outcomes carry serious consequences.

 

Variables based on human observation are irreplaceable. For example, a computer cannot recognize the floating gait that is a genuine indicator of schizophrenia. But "a unique capacity to observe is not the same as a unique capacity to predict on the basis of integration of observations" (p. 1671). So it seems that actuarial formulas are good at making predictions by integrating information, whereas humans are needed to observe the factors on which those predictions are based. The two methods have complementary strengths.

 

Why are humans so bad at predicting complex outcomes? First, humans do not make decisions the same way every time: judges make different predictions from the same set of data. In other words, their test-retest reliability is low, which is obviously not a problem for an actuarial formula. Second, humans are bad at distinguishing valid from invalid predictor variables, so false beliefs about the predictive value of variables develop. One reason is that feedback on judgments is often not provided. Therefore, "Lacking sufficient or clear information about judgmental accuracy, it is problematic to determine the actual validity, if any, of the variables on which one relies" (p. 1671). False beliefs about the validity of a variable can also form through self-fulfilling prophecies. Similarly, hindsight bias leads people to believe that certain outcomes were more predictable than they actually were.

 

The evidence described above has had a limited impact on how decisions are made in practice. One reason is lack of familiarity with the evidence; another is misunderstanding of it. A common misconception is that group statistics are not relevant to individual cases. Dawes et al. point out that "Although individuals and events may exhibit unique features, they typically share common features with other persons or events that permit tallied observations or generalizations to achieve predictive power" (p. 1672). They observe that the uniqueness argument, taken seriously, would imply that someone forced to play Russian roulette should be indifferent between a gun loaded with one bullet and one loaded with five.

 

Comments

 

This review suggests that experience in clinical assessment may not always be a good measure of expertise; that is, experience in prediction does not always lead to improved accuracy. Of course, in some of the domains discussed, prediction is only one of the practitioner's responsibilities, and there are certainly tasks on which experience does matter.

 

The finding that actuarial models often outperform human judges raises the possibility that humans cannot acquire skill in every domain. Clinical assessment may be analogous to a variable task in which practice does not lead to improvements in performance. The superior performance of actuarial formulas does, however, suggest that there is some regularity in these tasks that can be capitalized on.

 

Questions

 

A question for future thought concerns a taxonomy of expertise domains. How is a decision-making task different from a choose-a-move task in chess? Why does experience matter in one domain but not in the other? Is the difference the availability of feedback? For example, in chess, feedback on a move is provided almost immediately, whereas in clinical assessment, feedback about the accuracy of a diagnosis is sometimes never provided. (It thus seems that feedback is necessary to communicate the consistencies in a task.) How does feedback improve performance? Can domains be classified according to the type of feedback that is provided?

