[ CogSci Summaries home | UP | email ]

Cédras, C. & M. Shah (1995), Motion-Based Recognition: A Survey, IVC, 13(2):129-155.

@Article{CedrasShah1995,
  author =       "C\'edras, Claudette and Shah, Mubarak Ali",
  year =         "1995",
  month =        mar,
  title =        "Motion Based Recognition: {A} Survey.",
  journal =      "Image and Vision Computing",
  volume =       "13",
  number =       "2",
  pages =        "129--155",
}

Author of the summary: Jim Davies and Brad Singletary, 2000, jim@jimdavies.org, bas@cc.gatech.edu

Cite this paper for:

So this paper is practically a whole vision course (and three times over again in its references), if you please. Instead of addressing specific methods, we're going to take a high-level approach to the use of motion for computer vision.

OK, to understand motion-based recognition we've first got to better understand the semantics of motion in general. In terms of the real world, there are things, and spaces that things can occupy. Perceived motion is the record of a stepwise displacement of a thing through occupiable space. The size of the step can vary, the rate at which the step size changes can vary, ad infinitum. Remember from physics that we have position, velocity, acceleration, jerk, snap, crackle, and pop for the zeroth- through sixth-order differences, each just being the derivative of the last (cameras probably don't make very good approximations of the higher-order differences).
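That derivative chain can be approximated from sampled positions by repeated finite differencing. A minimal sketch, pure Python, with illustrative names and data:

```python
# Hedged sketch: estimating velocity and acceleration from sampled
# positions by repeated finite differencing (names are illustrative).

def finite_difference(samples, dt=1.0):
    """First-order forward difference of a sequence of samples."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]

# Positions of a point sampled once per frame, following x = t^2,
# so velocity should come out ~2t and acceleration ~2.
position = [0.0, 1.0, 4.0, 9.0, 16.0]

velocity = finite_difference(position)        # [1.0, 3.0, 5.0, 7.0]
acceleration = finite_difference(velocity)    # [2.0, 2.0, 2.0]

print(velocity)
print(acceleration)
```

Each further application of `finite_difference` gives the next-order difference (jerk, snap, and so on), and each application amplifies sensor noise, which is why cameras approximate the higher orders poorly.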

At any rate, we have some model we believe in that we use to describe how things move in the world (usually Newtonian motion, though dynamics models are better). It is useful to note that each constraint we add limits the type of motion that can occur in the system. This model is inherently a feature space for later use. If we know how an object is supposed to move, we can limit the feature space considerably (i.e., make it non-infinite). We can also use this model to generate filters that help combat errors in our sensors.

We live in 3 dimensions. A camera sensor is an n×m (2-dimensional) grid of pixels that represents emitted/reflected light from the environment incident on the CCD. Recording a 2D representation of a 3D world is a dimensionality reduction, so it is crucial to see that some motions in R3 may not be directly perceivable by a single (stationary) camera head; i.e., one dimension roughly appears as scale differences only.
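The loss of a dimension can be made concrete with a pinhole-projection sketch (assuming unit focal length; this example is mine, not the paper's): two distinct 3D points project to the same image point, with depth surviving only as scale.

```python
# Hedged sketch of pinhole projection: a 3D point (X, Y, Z) maps to the
# image plane as (f*X/Z, f*Y/Z), so depth Z survives only as scale.

def project(point3d, focal=1.0):
    x, y, z = point3d
    return (focal * x / z, focal * y / z)

near = project((1.0, 1.0, 2.0))   # (0.5, 0.5)
far  = project((2.0, 2.0, 4.0))   # (0.5, 0.5): a bigger, farther point
print(near == far)                # True: two distinct 3D points, one pixel
```

Any point along the same ray through the pinhole lands on the same pixel, which is exactly why a single stationary camera cannot recover depth from one image.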

So one subgoal here is to use cameras to recover parameters of the third dimension (and other parameters, like where the camera is with respect to the environment) that we can't, or can't easily, get from just one image. Another goal is to see what the hell is actually happening over a time period: to build a high-level representation of motion we can use to understand observed image sequences. Note that tracking something implies recognizing it.

The paper spends a lot of time describing prior work in psychology on motion. Vision research that uses models of human behavior is essentially applied psychophysics, so this makes sense. Humans use motion information very well to notice things, measure relative and absolute distances, and predict future motion. If humans can do it, why can't machines? As it turns out, motion is a very powerful, but sometimes computationally expensive, feature space for vision research. One interesting thing to note is that humans have very specialized hardware for motion detection/recognition; for instance, humans don't recognize motion very well if it is upside down.

So how do we see what an environment is shaped like, given motion? One decent technique, structure from motion (SFM), recovers important facts about the structure of viewed 3D scenes from point motions in a sequence of frames. Think: how could I recover an OpenGL-style model of the world, plus my projection/rotation/translation matrix, just from watching things move about an environment? Well, first you'd want a model of the internal camera parameters, followed by a nonlinear optimization of your reckoning of the objects tracked in the world.

So how do we see what happened in an environment, given motion? 3D structure is not enough in practice for robust and accurate recognition; other features are generally required. We can look at computing global information and discrete information as features of the video stream.

One type of global information detailed in the paper is optical flow. Optical flow is the computation of the displacement of each pixel between frames, which yields a vector map of flows in the image. Individual or regional flows may be tracked and analyzed. Another type is motion correspondence, which matches interesting image features and tracks them through time. This gives you an idea of which image features vary together, which can be used to recognize higher-level behaviours.
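As a toy illustration of the region-matching flavor of flow computation (not any specific method from the survey), here is a pure-Python sketch that finds, for one patch in the first frame, the shift into the second frame that minimizes the sum of squared differences:

```python
# Hedged sketch of region-based flow: for one patch in frame 1, find the
# shift into frame 2 with the smallest sum of squared differences (SSD).

def ssd(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def patch(img, r, c, size=2):
    """Flatten a size x size patch anchored at (r, c)."""
    return [img[r + i][c + j] for i in range(size) for j in range(size)]

def best_shift(f1, f2, r, c, search=2, size=2):
    """Search a small window of shifts and return the lowest-SSD one."""
    ref = patch(f1, r, c, size)
    best = None
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr <= len(f2) - size and 0 <= cc <= len(f2[0]) - size:
                cost = ssd(ref, patch(f2, rr, cc, size))
                if best is None or cost < best[0]:
                    best = (cost, (dr, dc))
    return best[1]

# A bright 2x2 blob moves one pixel to the right between frames.
frame1 = [[0] * 5 for _ in range(5)]
frame2 = [[0] * 5 for _ in range(5)]
for i in (1, 2):
    for j in (1, 2):
        frame1[i][j] = 9
        frame2[i][j + 1] = 9

print(best_shift(frame1, frame2, 1, 1))   # (0, 1): the blob's flow vector
```

Doing this for every pixel (or region) gives the vector map of flows the paragraph describes; real methods add smoothness constraints and subpixel interpolation.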

Low-level features give us the ability to perform more general techniques. Let's talk about a specific motion feature: trajectory. A trajectory is the set of points recorded from motion perceived over space: the coordinates of where something was, all the way to where it is now. It's nice because all motion has a trajectory. Choosing the window size for analysis is important. Note that a trajectory will usually be aliased when initially recorded from a camera. Since whole regions can have trajectories as well as flows, trajectories don't have to be totally granular.

So what if all this motion is overkill? Motion events may be enough to discern between high-level tasks. A motion event is simply something like the presence of motion, absence of motion, a change in direction, a change in some nth-order derivative, or any subjective division of motion space you can easily (hackishly) detect that gives you better recognition rates. Certain classes of motion can also be recognized using an important technique called the trajectory primal sketch: translation, rotation, projectile, and cycloid motions can all be discriminated.
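A crude sketch of one such motion-event detector over a 2D trajectory, flagging sharp direction changes (the threshold and names are mine, not the paper's):

```python
# Hedged sketch of a "motion event" detector: flag frames where the
# heading of a 2D trajectory turns by at least min_turn radians.
import math

def direction_changes(traj, min_turn=math.pi / 2):
    """Indices where the heading turns sharply between adjacent steps."""
    events = []
    for i in range(1, len(traj) - 1):
        (x0, y0), (x1, y1), (x2, y2) = traj[i - 1], traj[i], traj[i + 1]
        a = math.atan2(y1 - y0, x1 - x0)
        b = math.atan2(y2 - y1, x2 - x1)
        # Wrap the heading difference into [-pi, pi] before taking |.|.
        turn = abs(math.atan2(math.sin(b - a), math.cos(b - a)))
        if turn >= min_turn:
            events.append(i)
    return events

# Move right, then bounce straight up: one sharp event at the corner.
traj = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(direction_changes(traj))   # [2]
```

This is exactly the "hackish division of motion space" idea: a handful of such event detectors can be enough to discriminate high-level activities without full 3D reconstruction.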


MLD: moving light displays simulate corresponding point motion (e.g. by attaching lights to joints). People can detect walking and gender from these displays. One theory of how people do this is SFM (structure from motion): people use the motion to re-create a 3D structure representation and recognize that. The other theory (motion-based recognition) is that motion is used to recognize directly, without having to recreate the structure first.

SFM: theoretical solutions of SFM. Just getting the structure from motion does not constitute recognition, since that structure must be analyzed too to categorize it.

Motion based recognition has two steps:

  1. Find the appropriate representation for the objects modeled. This is taken from the raw data. Representations can be low or high level (trajectories, or motion verbs, as in Koller and Tsotsos).
  2. Match the input with a model.
Methods for extracting 2d motion:
  1. motion correspondence
  2. optical flow
Region-based features: "features derived from the use of an extended region or from a whole image"

Figure 1:

Optical flow methods:
  1. differential methods. "Compute the velocity from spatio-temporal derivatives of image intensity"
  2. region-based matching. "... velocity is defined as the shift yielding the best fit between image regions, according to some similarity or distance measure."
  3. energy-based (frequency-based) techniques. "... compute optical flow using the output from the energy of velocity-tuned filters in the Fourier domain."
  4. phase-based techniques. velocity is defined in terms "of the phase behaviour of band-pass filter outputs" (e.g. zero-crossing techniques)
Problems with optical flow: the correspondence problem: how to unambiguously map a point in one frame to a point in the next frame. This is combinatorially explosive, though constraints can be applied to cut it down. Simple trajectories are often not enough for recognition.
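A minimal sketch of one such constraint: a maximum-displacement bound on a greedy nearest-neighbor matcher (illustrative only; real systems use richer constraints like smoothness of motion):

```python
# Hedged sketch of the correspondence problem: greedily match feature
# points across two frames, pruning candidates with a maximum-displacement
# constraint so the combinatorial search stays small.

def match_points(frame_a, frame_b, max_disp=2.0):
    """Map index in frame_a -> index in frame_b, one-to-one, greedily."""
    matches = {}
    taken = set()
    for i, (ax, ay) in enumerate(frame_a):
        best, best_d = None, max_disp
        for j, (bx, by) in enumerate(frame_b):
            if j in taken:
                continue
            d = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
            if d <= best_d:                 # prune far-away candidates
                best, best_d = j, d
        if best is not None:
            matches[i] = best
            taken.add(best)
    return matches

a = [(0, 0), (5, 5)]
b = [(5, 6), (1, 0)]            # each point moved by about one pixel
print(match_points(a, b))       # {0: 1, 1: 0}
```

Without the `max_disp` bound, every point in one frame is a candidate for every point in the next, which is where the combinatorial explosion comes from.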

Relative Motion

In some domains like walking, the relative motion of objects (legs, arms) is more important than their absolute motion.

Common motion: movement relative to observer

Trajectory Primal Sketch (TPS): Gould and Shah. a representation of significant changes in motion. Can distinguish translation, rotation, projectile and cycloid.

Motion boundaries: smooth starts, smooth stops, pauses, impulse starts, impulse stops

Goddard used changes in rotational velocity and direction change of body segments as motion events, which were input to a connectionist system.

Region-based features

large or whole-image region processing.

Polana & Nelson:

  1. mean flow magnitude divided by standard deviation
  2. positive and negative curl and divergence estimates
  3. non-uniformity of flow direction
  4. directional difference statistics in four directions
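The first of these region features can be sketched directly (the flow vectors here are made up for illustration; this is not Polana & Nelson's code):

```python
# Hedged sketch of one Polana & Nelson-style region feature: the mean
# optical-flow magnitude divided by its standard deviation over a region.
import math

def mean_over_std(flow):
    """flow: list of (u, v) vectors for a region. Returns mean/std of |v|."""
    mags = [math.hypot(u, v) for u, v in flow]
    mean = sum(mags) / len(mags)
    var = sum((m - mean) ** 2 for m in mags) / len(mags)
    return mean / math.sqrt(var) if var > 0 else float("inf")

steady = [(1, 0), (1, 0), (1, 0), (1, 0)]      # uniform flow: no spread
jittery = [(1, 0), (0, 1), (3, 0), (0, 0)]     # mixed magnitudes

print(mean_over_std(jittery) < mean_over_std(steady))   # True
```

Uniform motion scores high on this statistic while temporally textured motion (ripples, fluttering leaves) scores low, which is what makes it useful as a region feature.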
Spatio-temporal cube: A cube of images stacked. (x, y, time) are the axes. You can use detection of 3d patterns in the cube to do motion detection.
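A toy sketch of the cube idea (pure Python, illustrative): stack frames into `cube[t][y][x]`, then probe along the time axis at one pixel to detect motion there.

```python
# Hedged sketch of a spatio-temporal cube: stack frames into a cube
# indexed as cube[t][y][x] and probe along the time axis.

def build_cube(frames):
    """Stack a sequence of 2D frames; time becomes the first axis."""
    return list(frames)

def pixel_moved(cube, y, x):
    """True if intensity at (y, x) changes anywhere along the time axis."""
    column = [frame[y][x] for frame in cube]
    return len(set(column)) > 1

# A dot hops from x=0 to x=1 over three 1x3 frames; x=2 stays static.
frames = [[[9, 0, 5]], [[0, 9, 5]], [[0, 9, 5]]]
cube = build_cube(frames)
print(pixel_moved(cube, 0, 0), pixel_moved(cube, 0, 2))   # True False
```

Real spatio-temporal methods look for oriented 3D patterns in this cube (a moving edge traces a slanted plane through it) rather than simple per-pixel change, but the data structure is the same.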

Matching and classification

Most classification algorithms classify an unknown by the model it is closest to according to some distance measure. Thus the summaries below focus on representation, not process.
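A minimal sketch of that closest-model scheme (the feature vectors and class names are made up; the distance measure is Euclidean for concreteness):

```python
# Hedged sketch of nearest-model classification: represent each model and
# the unknown as a feature vector, pick the model closest by some distance.

def classify(unknown, models):
    """models: name -> feature vector. Returns the nearest model's name."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(models, key=lambda name: dist(unknown, models[name]))

# Illustrative features, e.g. (mean speed, stride variability).
models = {
    "walking": [1.0, 0.2],
    "running": [3.0, 0.9],
}
print(classify([1.1, 0.3], models))   # walking
```

The systems summarized below differ mainly in what goes into those vectors (trajectories, flow statistics, joint angles), not in this matching step.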

Ways of representing the unknown and models:

Motion recognition

Cyclical motion

Two cycles are needed for detection. You can do it by noticing ST-curves.
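The two-cycle requirement can be seen in a toy period detector (my sketch, not a method from the survey): self-similarity at a lag can only be confirmed if two full cycles fit in the window.

```python
# Hedged sketch: find the period of a cyclic 1D motion signal (e.g. a
# joint's x-coordinate over time) as the lag with the strongest
# self-similarity. Two full cycles must fit in the window.

def cycle_period(signal, min_lag=2):
    n = len(signal)
    best_lag, best_score = None, None
    for lag in range(min_lag, n // 2 + 1):   # lag <= n/2: two cycles fit
        score = sum(signal[i] * signal[i + lag] for i in range(n - lag))
        score /= (n - lag)                   # normalize by overlap length
        if best_score is None or score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Square-wave "gait" signal with period 4, repeated four times.
signal = [1, 1, -1, -1] * 4
print(cycle_period(signal))   # 4
```

The `n // 2` bound is the code-level version of the observation above: with fewer than two cycles in view there is no lag at which the signal can be compared with itself.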


Petajan: find where the mouth is by locating the nostril. Change the image so that it is just a black blob where the mouth opening is. Cluster training images into 255 groups, pick a representative for each group, and put them all in a code book ordered by black area. Associate them with letters; input is classified as the nearest match.

Finn & Montgomery: distances between distinct points on the mouth.

Mase & Pentland: Used optic flow to find movement-- elongation of mouth and mouth opening.

Martin & Shah: uses a sequence of dense optical flow fields around the mouth, spatially and temporally warped to control for how long the speaker takes to say the utterance.

Kirby et al: Uses a set of eigenvectors (these are like factors of factor analysis.) Eigenlips.

Gesture interpretation

Darrell & Pentland: "uses an automatic view-based approach to build the set of view models from which gesture models will be created." ??? Works in real time.

Davis & Shah: track trajectory of each finger. Works in real time.

Motion verb recognition

Motion verb recognition: "the association of natural language verbs with the motion performed by a moving object in a sequence of images."

Koller, Heinze, Nagel: 119 German verbs in 4 categories: (1) verbs describing the action of the vehicle only, (2) verbs which make reference to the road, (3) to other objects, and (4) to other locations. They were also broken up according to whether they described the beginning, middle, or end of an event. There are 13 attributes (computed from the sequence). Attributes had preconditions, monotonicity conditions, and postconditions, which indicate how acceptable an attribute is for the beginning, middle, and end of an event.

Tsotsos: system: ALVEN. Detected abnormal heart behavior. Used semantic nets, frames, type hierarchies, and inheritance. Markers were put on the heart. With this rich knowledge base, it was able to describe in detail the motion of the heart's LV wall.

Temporal textures classification

Nelson & Polana: ripples on water, wind in leaves, cloth waving in wind. Features based on optical flow fields. Used vector analysis.

Human motion tracking and recognition

labelling: Identifying human's parts from a movie.

tracking: finding part trajectories.

Human motion tracking using motion models

Modelling of the human body

3d tracking uses volumetric models and stick figures, labelling uses 2d models. Stick figures can be described with few parameters. The volumetric models use generalized cylinders.

Marr & Nishihara: hierarchy of cylinders. First the whole body is one cylinder, which has components of other cylinders. Adjunct relations are relations of cylinders to the higher level cylinders. Advantage: can be refined as needed.

O'Rourke & Badler: segments and joints. Flesh is represented as spheres fixed on the segments (600 of them). Joint angle constraints, collision detection.

Hogg: (This system was designed for the law enforcement team at Hazzard County.) Builds on Marr & Nishihara. Uses elliptical cylinders.

Rohr: builds on Marr & Nishihara. Has clothing. He argues that's how we usually see people.

Leung & Yang: apars: ribbons or antiparallel lines. They enclose regions.

modelling of human motion

Uses joint angles. Hogg: left hip, knee, shoulder, elbow. One person used for data. Symmetry assumed.

Rohr: hip, knee, shoulder, elbow. 60 men used for data.

Chen & Lee: used the following constraints: both arms or both legs cannot be in front at the same time; an arm/leg pair cannot swing forward or backward at the same time; shoulders and elbows move cooperatively (same for hip and knee); arm and leg swing trajectories are parallel to the direction of motion; at most one knee has a flexion angle, and when one does, the other leg is perpendicular to the ground.
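The first of these constraints can be sketched as a simple plausibility check (the pose encoding, a signed "forwardness" per leg, is invented here, not Chen & Lee's):

```python
# Hedged sketch of one Chen & Lee-style gait constraint: the legs swing
# out of phase, so both cannot be forward at once. The pose encoding
# (signed forwardness per leg) is made up for illustration.

def legs_consistent(left_forward, right_forward):
    """Reject candidate poses where both legs swing forward together."""
    return not (left_forward > 0 and right_forward > 0)

print(legs_consistent(0.4, -0.4))   # True: legs out of phase, plausible
print(legs_consistent(0.4, 0.2))    # False: implausible gait, prune it
```

In a tracker, checks like this prune implausible joint-angle hypotheses before any expensive matching is done.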

Akita: key frame sequence of stick figures to model movements. Hogg and Rohr both tried this early on and abandoned it.

Recognizing body parts

Akita: like above, uses cones. More sound.

Leung & Yang: annotated sequence of images. uses apars.

recognition of human movements

Johansson: showed that walking can be detected in 200 ms (half a cycle) with MLDs. Abstracts low-level features into higher-level ones. Scenario hierarchy: combined components (things) and assemblies (combinations of things).

Yamato: used HMMs to analyze tennis strokes. Insensitive to noise.

Discrimination between humans from their motion

Rangarajan: same shape, different motions vs. same motion, different shape. Trajectories of joints.

Tsai et al.: Curvature of trajectory, computed frequency. Cycles must be aligned (as of 1995).


future work:

Summary author's notes:

Back to the Cognitive Science Summaries homepage
Cognitive Science Summaries Webmaster:
JimDavies (jim@jimdavies.org)
Last modified: Tue Feb 29 17:21:53 EST 2000