
Fermüller, C. & Aloimonos, Y. (1995). Vision and Action. Image and Vision Computing, 13(10):725-744.

@Article{fermuller1995vision,
  author  = {C. Ferm{\"u}ller and Y. Aloimonos},
  title   = {Vision and Action},
  journal = {Image and Vision Computing},
  year    = {1995},
  volume  = {13},
  number  = {10},
  pages   = {725--744},
}

Author of the summary: Jim R. Davies, 2000, jim@jimdavies.org

Cite this paper for:

p1. Vision should not be studied in isolation from physiology and the actions to be performed. Vision is not just turning data into representations; it also means being selective about what to focus on.

The classical theory (Marr): create representations at increasing levels of abstraction, from the 2D image to the primal sketch to the 2.5D sketch to an object-centered model. "Pixels to predicates."

Marr's 3 levels of information processing:

1. Computational-theoretic level: problem description, task, boundary conditions
  2. Algorithms and Representation
  3. Implementation (neurons, computer, etc.)

When vision is formalized as an information-processing task, it requires a closed system. But once the environment is included, the system is closed only if every aspect of reality can be modelled. Thus only toy problems can be solved.

p2. Marr's strict hierarchy of representations makes learning, adaptation, and generalization impossible. The separation of the algorithm and hardware levels is also a flaw.

p3. Cognitive neurophysiology has found that vision is not clean and modular but consists of several distributed, cooperative processes.

p4. Cybernetics: study of the relationship between the behavior of dynamical self-regulating systems and their structure.

This study provides: 1. a working model explaining the abstract components of a vision system, and 2. an approach for studying and building an actual vision system.

p5. In this study, the vision system has goals and visual competencies to use. The content of purposive representations consists of addresses of action routines, of which there are two types (p6):

  1. routines that schedule physical actions
  2. routines that schedule information to be retrieved from purposive representations and stored in LTM
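The idea above — purposive representations whose content is addresses of action routines of two kinds — can be sketched as a data structure. Everything here (the class, the dictionary mapping, the example routines and their names) is an illustrative assumption, not the authors' implementation:

```python
# Sketch of a purposive representation (p5-p6): its content maps a
# recognized situation to the address of an action routine. Routines
# come in two types: those scheduling physical actions, and those
# scheduling information to be stored in long-term memory.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class PurposiveRepresentation:
    # "addresses of action routines": here, callables keyed by situation
    routines: Dict[str, Callable[[], str]] = field(default_factory=dict)

    def act(self, situation: str) -> str:
        """Dispatch the situation to its stored action routine."""
        return self.routines[situation]()

# Type 1: schedule a physical action (hypothetical example)
def avoid_obstacle() -> str:
    return "motor: turn left"

# Type 2: schedule information to be stored in LTM (hypothetical example)
def store_landmark() -> str:
    return "memory: store landmark position"

rep = PurposiveRepresentation()
rep.routines["looming object"] = avoid_obstacle
rep.routines["novel landmark"] = store_landmark

print(rep.act("looming object"))  # motor: turn left
```

The point of the sketch is that the representation holds no reconstructed world model, only pointers from situations to actions.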

p7. We need to know how the system interacts with the environment in order to know what representations it uses. This cannot be done with mathematical models alone.

Making task-specific systems is no good. Instead, use an evolutionary method, called the synthetic approach: start with primitive operations, then integrate them, with attention to learning from the very beginning.

p10. Complex visual methods are composed of simpler ones.

p11. Where to start? Look at lower animals with simple vision capabilities. Horridge, working on insect vision, proposed a hierarchical classification of visual capabilities: the most basic abilities are based on motion (running up in complexity to the insects), and shape-based abilities come next.

p12. People studying the visual cortex agree more on the simple properties and less on the higher-level ones.

MT (a.k.a. V5), MST, and FST seem to be involved in motion processing.

V4 does color processing.

Zeki says that V3 does form and motion, and V4 does form and color.

V1 outputs to:

  1. the dorsal path (parietal cortex): where and how
  2. the ventral path (infero-temporal cortex): what (object identity)

p14. The basic motion competencies of Medusa, in order of increasing complexity:

  1. ego-motion estimation (I'm moving)
  2. partial object-motion estimation (direction of translation)
  3. independent motion detection
  4. obstacle avoidance
  5. target pursuit
  6. homing

p16. Start with detectors of very fast motion, which is what we see in peripheral vision.

p23. Recovering optic flow is an optimization problem. By first deriving ego-motion from normal flow (which, unlike full optic flow, is directly available from the image), 3D information becomes available, and the cortex could then approximate optic flow more easily.
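Normal flow is directly measurable because of the aperture problem: from the brightness-constancy equation Ix*u + Iy*v + It = 0, a local measurement only constrains the flow component along the image gradient, whose signed magnitude is -It/|grad I|. A minimal numpy sketch (using simple frame differencing for It, an implementation choice of mine, not the paper's):

```python
import numpy as np

def normal_flow(I0, I1):
    """Normal flow between two grayscale frames: the component of optic
    flow along the local image gradient, from Ix*u + Iy*v + It = 0."""
    I0 = I0.astype(float)
    I1 = I1.astype(float)
    Iy, Ix = np.gradient(I0)        # axis 0 is rows (y), axis 1 is cols (x)
    It = I1 - I0                    # temporal derivative by frame differencing
    mag = np.hypot(Ix, Iy) + 1e-8   # gradient magnitude (eps avoids /0)
    un = -It / mag                  # signed magnitude along the gradient
    # project back onto the unit gradient direction -> (fx, fy) field
    return un * Ix / mag, un * Iy / mag

# Toy check: a horizontal intensity ramp shifted one pixel in +x
x = np.arange(32, dtype=float)
I0 = np.tile(x, (32, 1))
fx, fy = normal_flow(I0, I0 - 1.0)
print(fx[16, 16])  # ~1.0: one pixel of rightward normal flow
```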

p24. People in vision make a big deal about shape and depth from static images. But human studies show that even people can only do this locally; at larger distances they are very inaccurate.

We get detailed information only at the fovea, and the eyes are almost always moving. No information is gained during saccades. Because of this, only relative depth should be computed, and only locally.

There is a computational reason for this as well: stereo cameras can only fixate on a point where the two images are similar enough to match, which covers only a small area.

p27. In computer vision there are two approaches to object recognition: 3D object-centered models and 2D viewer-centered views.

p28. Certain high-level visual cortex cells are gaze-locked; that is, they fire only when the eyes are looking in a particular direction.

p30. Ego-centered representation: knowing your location relative to some fixed point (home). Geo-centered representation: based on landmarks (map-based).

p31. To generalize: when mapping situations to actions, neural networks can interpolate, but they become unpredictable when extrapolating.
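The interpolation-vs-extrapolation point can be demonstrated with any flexible fitted model; here a degree-7 polynomial fit stands in for a trained network (the choice of model, function, and test points is purely illustrative):

```python
import numpy as np

# Fit a flexible model to sin(x) on [0, pi], then query it inside
# and well outside the training range.
x_train = np.linspace(0, np.pi, 50)
y_train = np.sin(x_train)
coeffs = np.polyfit(x_train, y_train, deg=7)

x_in, x_out = 1.5, 3 * np.pi   # interpolation vs extrapolation query
err_in = abs(np.polyval(coeffs, x_in) - np.sin(x_in))
err_out = abs(np.polyval(coeffs, x_out) - np.sin(x_out))
print(err_in, err_out)  # err_out is many orders of magnitude larger
```

Inside the training range the fit is essentially exact; outside it, the high-order terms dominate and the prediction diverges wildly, which is the "unpredictable when extrapolating" behavior.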

p32. The less popular way is to build data structures that store distribution data such that a stimulus vector can easily activate them. Given what we know of animal cognition, this is the better approach.
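One possible reading of such a data structure — stored situations kept as distribution parameters (a mean and a spread) that a stimulus vector activates by similarity — can be sketched as follows. The Gaussian activation rule and all names here are my assumptions, not the authors' proposal:

```python
import numpy as np

class SituationMemory:
    """Stores (label, mean, spread) entries; a stimulus vector activates
    each entry in proportion to its Gaussian similarity to the mean."""

    def __init__(self):
        self.entries = []  # list of (label, mean vector, sigma)

    def store(self, label, mean, sigma=1.0):
        self.entries.append((label, np.asarray(mean, dtype=float), sigma))

    def activate(self, stimulus):
        """Return the best-matching stored situation and its activation."""
        s = np.asarray(stimulus, dtype=float)
        acts = [(label, np.exp(-np.sum((s - m) ** 2) / (2.0 * sig ** 2)))
                for label, m, sig in self.entries]
        return max(acts, key=lambda t: t[1])

mem = SituationMemory()
mem.store("obstacle ahead", [1.0, 0.0])
mem.store("clear path", [0.0, 1.0])
label, act = mem.activate([0.9, 0.1])
print(label)  # obstacle ahead
```

The contrast with the neural-network mapping of p31 is that each stored distribution responds only near its own mean, so nothing is silently extrapolated far from the data.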

Summary author's notes:

Back to the Cognitive Science Summaries homepage
Cognitive Science Summaries Webmaster:
JimDavies (jim@jimdavies.org)
Last modified: Mon Mar 27 19:05:49 EST 2000