[ CogSci Summaries home | UP | email ]

Coyne, B. & Sproat, R. (2001). WordsEye: An Automatic Text-to-Scene Conversion System. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '01), 487--496.

@inproceedings{coyne2001wordseye,
 	author = {Coyne, Bob and Sproat, Richard},
 	title = {WordsEye: an automatic text-to-scene conversion system},
 	booktitle = {SIGGRAPH '01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques},
 	year = {2001},
 	pages = {487--496},
 	publisher = {ACM},
 	address = {New York, NY}
}

Author of the summary: Nicholas Osborne, 2010, nico@nicholasosborne.ca

Cite this paper for:

The actual paper can be found at http://portal.acm.org/ft_gateway.cfm?id=383316&type=pdf&coll=GUIDE&dl=GUIDE

Natural language is an effective medium for describing visual ideas and mental images.

The emergence of language-based 3D scene generation allows anyone to quickly generate 3D scenes without knowledge of special software.[p01]

WordsEye, relying on a large database of 3D models, is a system for automatically converting text into representative 3D scenes. Objects can have shape displacements, spatial tags, and functional properties.


There is a need for a new paradigm in which the creation of 3D scenes is immediate and easy through the use of natural language.

A previous system, 'Put', was limited to spatial arrangements of existing objects and to a limited subset of English.

WordsEye attempts to provide an empty canvas on which users can paint a picture with words, depicting spatial relations and actions performed by objects, from a wide range of input text.[p02]

How it works: [p02]

Example: "John said that the cat was on the table. The animal was next to a bowl of apples", the software would construct a picture of a human character with a cartoon speech bubble coming out of its mouth. In the speech bubble would be a picture of a cat on a table with a bowl of apples next to it.

One acknowledged limitation is the unpredictability of the graphical result.

Linguistic Analysis

Text is tagged and parsed using a part-of-speech tagger and a statistical parser, which generate a parse tree representing the structure of the sentence. The parse tree is then converted into a dependency representation: a list of the words in the sentence showing, for each word, the words it depends on and the words that depend on it. [p02] The dependency representation allows for easier semantic analysis.

said ---> John
  |-----> that ---> was ---> the ---> cat
                     |-----> on ---> the ---> table
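The dependency representation above can be sketched as a simple mapping from each word to its head. This is a minimal illustrative sketch, not WordsEye's actual data structure or API:

```python
# Toy sketch of a dependency representation: each word maps to its head
# (the word it depends on). Illustrative only; not WordsEye's actual API.

def dependents(heads, word):
    """Return the words that depend directly on `word`."""
    return [w for w, h in heads.items() if h == word]

# "John said that the cat was on the table."
heads = {
    "John": "said",     # subject of 'said'
    "that": "said",     # complement clause attaches to 'said'
    "was": "that",
    "cat": "was",
    "the(1)": "cat",    # determiner of 'cat'
    "on": "was",
    "table": "on",
    "the(2)": "table",  # determiner of 'table'
    "said": None,       # root of the sentence
}

print(dependents(heads, "said"))  # ['John', 'that']
print(dependents(heads, "was"))   # ['cat', 'on']
```

Walking this mapping in both directions gives the "words they are dependent on / words dependent on them" view that the semantic analysis consumes.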

Next, the dependency structure is converted into a semantic representation: a description of the entities to be depicted in the scene and the relationships between them. Each element in the list is a representation fragment corresponding to a specific node of the dependency structure. Each node will be depicted by a given 3D object.

The appropriate semantic interpretation is found by a table lookup given the word in question, differing based upon what kind of thing the word denotes. WordNet is used to provide semantic hypernym and hyponym relations. Special prepositions are handled by semantic functions, and verbs are handled by semantic frames. WordsEye has semantic entries for 1,300 English nouns and 2,300 verbs.

WordsEye also interprets anaphoric or coreferring expressions (pronominals like 'he' or 'she') and nouns through associations in the WordNet hierarchy.

Example: the denotation of 'cat' is a subset of the denotation of 'animal', so 'the animal' can refer back to the cat.
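This hypernym-based coreference can be sketched with a tiny hand-rolled hypernym table. WordsEye uses WordNet for these relations; the table below is an illustrative stand-in, not WordNet data:

```python
# Toy sketch of hypernym-based coreference: "the animal" can corefer with
# "the cat" because 'animal' lies on 'cat's hypernym chain. The tiny table
# below is an illustrative stand-in for WordNet.

HYPERNYMS = {
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
    "animal": "organism",
}

def is_kind_of(word, category):
    """True if `category` appears on `word`'s hypernym chain."""
    while word is not None:
        if word == category:
            return True
        word = HYPERNYMS.get(word)  # climb one level up the hierarchy
    return False

print(is_kind_of("cat", "animal"))    # True: 'the animal' may refer to the cat
print(is_kind_of("table", "animal"))  # False: 'table' is not on the chain
```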


Scenes are defined in terms of low-level graphic specifications called depictors, which control 3D object visibility, size, position, orientation, colour, and transparency. They are also used to specify human poses, inverse kinematics (IK), and vertex displacements for facial expressions.[p03]

WordsEye associates additional information with each 3D model:

Skeletons: skeletal control of structures
Shape displacements: associated with the object, used to depict emotions
Parts: polygonal faces representing significant areas of the surface
Colour parts: the set of parts to be coloured when the text specifies the object as having a particular colour
Opacity parts: parts that get transparency
Default size: objects are given a default size (expressed in feet)
Functional properties: used to depict how an object can be used
Spatial tags: to help with spatial relations, the shape of the object must be known
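The per-model metadata listed above can be sketched as a simple record type. Field names here are assumptions for illustration, not WordsEye's actual schema:

```python
# Illustrative sketch of per-model metadata; field names are assumptions,
# not WordsEye's actual schema.
from dataclasses import dataclass, field

@dataclass
class ModelInfo:
    name: str
    default_size_feet: float                            # default size, in feet
    parts: list = field(default_factory=list)           # named surface regions
    colour_parts: list = field(default_factory=list)    # parts recoloured on request
    opacity_parts: list = field(default_factory=list)   # parts that take transparency
    spatial_tags: dict = field(default_factory=dict)    # e.g. {"top surface": "back"}
    functional_properties: list = field(default_factory=list)

cat = ModelInfo(
    name="cat",
    default_size_feet=1.5,
    parts=["head", "back", "tail"],
    colour_parts=["body"],
    spatial_tags={"top surface": "back", "base": "paws"},
)
print(cat.spatial_tags["top surface"])  # the part used for 'on' relations
```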

Spatial relations define the basic layout of the scene. Relative positions such as 'next to', 'behind', and 'facing' are frequently an implicit part of actions and compound objects.

Example: for "The bird is on the cat", WordsEye finds a top-surface spatial tag for the cat (on its back) and a base tag for the bird (under its feet). The bird is then repositioned so that its feet rest on the cat's back.
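A minimal sketch of that tag-matching step, assuming a toy scene representation where spatial tags store heights in the object's local frame (all names and coordinates are illustrative):

```python
# Minimal sketch of the 'on' relation using spatial tags: move the figure
# so its base tag rests on the ground object's top-surface tag.
# Scene representation and coordinates are illustrative assumptions.

def place_on(figure, ground):
    """Move `figure` so its base height matches the ground's top surface."""
    top_y = ground["position"][1] + ground["spatial_tags"]["top surface"]
    base_offset = figure["spatial_tags"]["base"]
    x, _, z = figure["position"]
    figure["position"] = (x, top_y - base_offset, z)

cat = {"position": (0.0, 0.0, 0.0),
       "spatial_tags": {"top surface": 1.2, "base": 0.0}}  # back is 1.2 ft up
bird = {"position": (3.0, 0.0, 0.0),
        "spatial_tags": {"top surface": 0.5, "base": 0.0}}  # feet at local y=0

place_on(bird, cat)      # "The bird is on the cat"
print(bird["position"])  # (3.0, 1.2, 0.0): feet at the cat's back height
```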

Standalone poses: consist of a character in a particular position
Specialized usage: involves a character using a specific instrument
Generic usage: a character interacting with a generic stand-in object
Grip poses: a character holding a specific object in a certain way

Depiction Process

WordsEye's depiction module translates the high level semantic representation produced by the linguistic analysis into low-level depictors.[p06]


  1. Convert the semantic representation from the node structure to a list of typed semantic elements.
  2. Interpret the semantic representation
  3. Assign depictors to each semantic element
  4. Resolve implicit and conflicting constraints
  5. Read in referenced 3D models
  6. Apply each depictor, maintaining constraints, incrementally build up the scene
  7. Add background environment
  8. Adjust the camera (framing the scene)
  9. Render

The main semantic element types are ENTITY (nouns), ACTION (verbs), ATTRIBUTE (adjectives), and RELATION (prepositions). Additional, more specialized types are PATH, TIMESPEC, CONJUNCTION, POSSESSIVE, NEGATION, and CARDINALITY.

In order to depict a sentence, the semantic elements must be made graphically realizable. This is done by applying a set of depiction rules, which are tested for applicability and then applied to translate semantic elements into graphical depictors.

Example: "The cowboy rode the red bicycle to the store"[p06]

  1. Entity: cowboy
  2. Entity: bicycle
  3. Entity: store
  4. Attribute:
    Subject: <element 2>
    Property: red
  5. Action:
    Actor: <element 1>
    Action: ride
    Object: <element 2>
    Path: <element 6>
  6. Path:
    Relation: to
    Figure: <element 5>
    Ground: <element 3>

A set of transduction rules invokes depictors for constraints that are based on common-sense knowledge and are not explicitly stated in the text. For example: if X is next to Y, and X is not already on a surface, put X on the same surface as Y.
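That implicit-constraint rule can be sketched over a toy scene representation (the dictionary structure here is an assumption for illustration):

```python
# Sketch of the common-sense transduction rule: if X is next to Y and X is
# not already on a surface, put X on the same surface as Y.
# The scene representation is an illustrative assumption.

def apply_next_to_rule(scene):
    """Add 'on' constraints implied by 'next to' relations."""
    on = {x: surface for (x, surface) in scene["on"]}
    for x, y in scene["next_to"]:
        if x not in on and y in on:
            scene["on"].append((x, on[y]))  # X inherits Y's supporting surface

scene = {
    "next_to": [("cat", "bowl")],   # "the cat was next to a bowl"
    "on": [("bowl", "table")],      # the bowl is already on the table
}
apply_next_to_rule(scene)
print(scene["on"])  # the cat is placed on the table too
```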

Depiction specifications sometimes conflict with one another; such conflicts are resolved by a transduction rule that removes the tentative depictor.

Constraints are applied in a prioritized manner:

  1. Objects are initialized to their default size and shape
  2. Shape changes and poses are applied
  3. Once objects are in their correct shapes and poses, they are positioned and grouped
  4. Dynamic operations, such as placing objects on paths and IK, are performed

Interpretation, Activities, Environment

For text to be depicted, it must first be interpreted. This is done by relying on the functional properties of objects, and the interpretation can depend on the setting of the scene or on an activity.[p07]

Figurative and Metaphorical Depiction

Sentences can include abstractions or non-physical properties and relations that cannot be directly depicted. These are handled using the following techniques:[p08]

Textualization: generate a 3D extruded text of the word
Emblematization: when an entity cannot be depicted directly, some 3D object is used as an emblem for it
Characterization: a special type of emblematization for human characters in their roles, solved by using clothing or by having the character hold an emblem
Conventional icons: thought bubbles, a red circle with a slash
Literalization: figurative or metaphorical meanings depicted in a literal manner
Personification: a metaphorical statement depicted in a human role
Degeneralization: general categorical terms cannot be depicted directly; this is solved by picking a specific object instance of the same class