Tijerino, Y.A., Abe, S., and Kishino, F. (1995). Intuitive graphic representation of mental images in a virtual environment through natural language. Computational Models for Integrating Language and Vision, AAAI-95 Fall Symposium Series Working Notes, 29-36.

  author = 	 {Yuri A. Tijerino and Shinji Abe and Fumio Kishino},
  title = 	 {Intuitive graphic representation of mental images in a virtual environment through natural language},
  booktitle = 	 {Computational Models for Integrating Language and Vision, AAAI-95 Fall Symposium Series Working Notes},
  year = 	 {1995},
  pages = 	 {29--36}

Author of the summary: Jozef Lewitzky, 2010, Jozef_lewitzky@hotmail.com

Cite this paper for:

The paper explains how a system could let remotely located people interact with each other through voice and gestures, as if they were together. The authors' system, called the Virtual Space Teleconferencing System (VSTS), uses 3D CG techniques that incorporate voice and hand gestures. It was built in an attempt to create useful human-machine-human interactions.

Using this system, dubbed a WYSIWYS (What You Say Is What You See) system, participants can intuitively represent their ideas in a virtual environment. When people try to express mental images to one another, they generally use voice and hand gestures. This can be augmented by a system that takes those inputs and creates a virtual 3D image. [p29]

When we describe a mental image, it is vague. However, if it were an actual image, it could be manipulated to accommodate suggestions and create a clear vision. In order to create this kind of actual image, the participants need the freedom to manipulate the virtual object in an intuitive way.

To make interactions with the virtual objects intuitive, the authors designed a Wizard-of-Oz experiment to identify the most essential 3D shapes that people use in a virtual environment. Data from this experiment can reveal what is needed to sculpt different objects, rather than just recognize them.

The experiment is set up in a Wizard-of-Oz fashion, meaning that one participant simulates how a computer that understands natural speech might behave. Two participants are seated facing each other, one acting as the computer and the other as the user. The user is given a picture of the object to be made, and the Wizard-of-Oz participant sculpts in clay what the user asks for. However, the user is only allowed to describe the parts of the object, not the object itself. [p30]

Rules of the experiment:

-> Rules for the user:

a. Cannot say the name of the object
b. Can touch but cannot handle clay

-> Rules for the Wizard-of-Oz subject

a. Cannot guess at the object being made
b. Can only ask whether their shaping is on the right track (e.g. "Is this OK?")
c. If the user says a proper noun, the Wizard-of-Oz subject should ignore it.

2. The object to be made should be difficult to shape and is shown to the user only once, for 5 seconds.

3. The user is given some time to create a mental image.

4. The user is given enough time to reconstruct the object.

5. The experiment should be run a second time with the assignments of user and Wizard-of-Oz switched.

The experiment used 8 Japanese and 8 Occidental subjects.

Frame-synchronized cameras were used in the experiment, and the experimenters viewed the footage on a split screen with 20% of the area showing the user and 80% showing the Wizard-of-Oz subject. [p31]


The commands and geometric primitives used in the experiment depended on the person and the situation.

A single noun can refer to many different primitives, depending on the situation and the accompanying gestures.

The computer should take voice as well as hand, head and eye gestures into consideration.

The number of geometric primitives people use is not large.

People tend to start with simple shapes and build them into complex ones.

Gestures play a larger role when the object is rarer or more difficult to explain.

Symmetric relations are important, such as repeating an action on both sides of an object.

Referenced size is more important than absolute size. [p32]
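
The last two observations can be illustrated with a small sketch. It is not taken from the paper: the `Primitive` class and the `scale_relative_to` and `mirror_x` helpers are hypothetical names, and the coordinate convention (symmetry about the x = 0 plane) is an assumption made only for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Primitive:
    """Hypothetical geometric primitive: a label, a centre, and a size."""
    label: str
    x: float
    y: float
    z: float
    size: float


def scale_relative_to(target: Primitive, reference: Primitive, ratio: float) -> Primitive:
    """Resize `target` to `ratio` times the size of `reference`.

    Models an instruction such as "half as big as that one", where the
    size is given relative to another object rather than in absolute units.
    """
    return replace(target, size=reference.size * ratio)


def mirror_x(p: Primitive) -> Primitive:
    """Copy `p` to the opposite side of the x = 0 plane.

    Models an instruction such as "the same on the other side".
    """
    return replace(p, x=-p.x)


body = Primitive("box", 0.0, 0.0, 0.0, 4.0)
wheel = Primitive("cylinder", 1.5, -1.0, 0.0, 1.0)

wheel = scale_relative_to(wheel, body, 0.5)  # "half the size of the body"
other_wheel = mirror_x(wheel)                # "and one on the other side"
```

Storing sizes as ratios to a reference object means the user never has to state an absolute measurement, which matches the observation that people describe size by comparison.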

A 3D object is seen not only as its overall shape but also in terms of its component primitives. The way these primitives relate to one another forms our concept of the object (e.g. a car with huge wheels is still a car because of the positioning of its component squares and circles). There is good reason to believe that we all share this commonsense knowledge, so the system should take it into account.

Translating information about hand gestures to a computer already works well. However, actually manipulating 3D objects in virtual space is still difficult. Manipulating polygons correctly causes large performance problems, so other graphical representations are being explored, in this case superquadrics. [p33]
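
For readers unfamiliar with superquadrics, the sketch below evaluates the standard textbook inside-outside function of a superellipsoid, one common member of the family. It is not the paper's implementation, and this summary does not say which parameterization the authors use; the function name `superellipsoid_f` and the parameter names `a1`, `a2`, `a3` (axis lengths) and `e1`, `e2` (shape exponents) are conventional choices assumed here.

```python
def superellipsoid_f(x, y, z, a1=1.0, a2=1.0, a3=1.0, e1=1.0, e2=1.0):
    """Inside-outside function of a superellipsoid centred at the origin.

    Returns a value below 1 for points inside the surface, exactly 1 on
    the surface, and above 1 outside. With e1 = e2 = 1 the shape is an
    ellipsoid; exponents close to 0 give a box-like shape.
    """
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)


# The same point is outside a unit sphere but inside a box-like shape.
point = (0.9, 0.9, 0.9)
in_sphere = superellipsoid_f(*point) < 1.0
in_boxlike = superellipsoid_f(*point, e1=0.1, e2=0.1) < 1.0
```

The appeal of such a representation is that a handful of parameters can describe a whole range of shapes, rather than a large polygon mesh that must be edited vertex by vertex.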

A prototype application was made for the VSTS which can be used to design Japanese portable shrines.

The prototype takes the user's speech, translates it into strings, and matches each string to the best available candidate. It then gathers information about gestural commands if needed. Lastly, it translates the result into a graphical representation.

This is done through the VSTS, which classifies input into two categories: labels and operations. Labels are the names of primitives, and operations are the actions applied to them.
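
A minimal sketch of this matching and classification step follows. The vocabulary, the `classify` function, and the 0.8 similarity cutoff are all assumptions made for illustration; the paper's actual word lists and matching method are not given in this summary. Approximate matching from Python's standard `difflib` module stands in for "best possible" matching.

```python
import difflib

# Hypothetical vocabulary: each known word is either a label (the name
# of a primitive) or an operation (something done to a primitive).
VOCABULARY = {
    "box": "label",
    "sphere": "label",
    "cylinder": "label",
    "stretch": "operation",
    "rotate": "operation",
    "attach": "operation",
}


def classify(recognized_words, cutoff=0.8):
    """Match each recognized string to its closest vocabulary entry and
    sort the matches into labels and operations. Words with no close
    match (such as "the") are dropped."""
    result = {"label": [], "operation": []}
    for word in recognized_words:
        match = difflib.get_close_matches(
            word.lower(), list(VOCABULARY), n=1, cutoff=cutoff
        )
        if match:
            result[VOCABULARY[match[0]]].append(match[0])
    return result


# A slightly misrecognized utterance still resolves to known words.
parsed = classify("Stretch the cylindre".split())
```

Here `parsed` holds one operation (`stretch`) and one label (`cylinder`), which is the kind of pairing the later steps would turn into a graphical command.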

Some words, such as "this" or "that", can only be interpreted together with a hand gesture. When the system hears one of these words, it attempts to match it with the appropriate hand gesture. The system also needs to know the position of the user in order to perform operations from their viewpoint. [p34]
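
One plausible way to pair such a word with a gesture is by time: pick the pointing gesture closest to the moment the word was spoken. The sketch below assumes this strategy, which the summary does not attribute to the paper; the `Gesture` class, the `resolve_deictic` function, and the one-second `max_gap` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gesture:
    time: float   # seconds since the start of the session
    target: str   # identifier of the object being pointed at


DEICTIC_WORDS = {"this", "that"}


def resolve_deictic(word, word_time, gestures, max_gap=1.0):
    """Return the target of the gesture closest in time to a spoken
    deictic word, or None if the word is not deictic or no gesture
    occurred within `max_gap` seconds of it."""
    if word not in DEICTIC_WORDS or not gestures:
        return None
    nearest = min(gestures, key=lambda g: abs(g.time - word_time))
    if abs(nearest.time - word_time) > max_gap:
        return None
    return nearest.target


gestures = [Gesture(0.5, "roof"), Gesture(2.1, "pillar")]
referent = resolve_deictic("that", 2.0, gestures)
```

In this example `referent` is `"pillar"`, because the gesture at 2.1 seconds is the nearest one to the word spoken at 2.0 seconds.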

Lastly, the system implements the command generated by the combination of voice and gesture.

Final Conclusion

The words spoken to represent primitive objects are too vague for a computer to act on alone; gestures are needed as well. Although the system is so far restricted to portable shrine configurations, the authors plan to extend it to more general 3D shapes.

In short, the paper shows that a system allowing graphical representation of mental images through voice and gestures is possible. The gestures and spoken words are restricted for now, but can be expanded to include more dynamic gestures. The paper also describes current knowledge about 3D shapes as a basis for language-based human-computer interaction.

Summary author's notes:
