Wolfgang Wahlster
Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany
Text and images are ubiquitous in human communication and there are deep connections between the use of these two modalities. Whereas humans easily become experts at mapping from images to text (e.g., a radio reporter describing a soccer game) or from text to images (e.g., a cartoonist transforming a story into a comic strip), such complex transformations are a great challenge for computer systems. In multimodal communication humans utilize a combination of text and images (e.g., illustrated books, subtitles for animations), taking advantage of both the individual strengths of each communication mode and the fact that both modes can be employed in parallel. Allowing the two modalities to refer to and depend upon each other is a key to the richness of multimodal communication. Recently, a new generation of intelligent multimodal human-computer interfaces has emerged with the ability to interpret some forms of multimodal input and to generate coordinated multimodal output.
In recent years, researchers have begun to explore how to translate visual information into natural language [McK94]. Starting from a sequence of digitized video frames, a vision system constructs a geometrical representation of the observed scene, including the type and location of all visible objects on a discrete time scale. Then spatial relations between the recognized objects and motion events are extracted and condensed into hypothesized plans and plan interactions between the observed agents. These conceptual structures are finally mapped onto natural language constructs including spatial prepositions, motion verbs, temporal adverbs or conjunctions, and causal clauses. In terms of reference semantics, this means that explicit links between sensory data and natural language expressions are established by a bottom-up process.
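The bottom-up mapping described above can be illustrated with a minimal sketch. It assumes that the vision system has already delivered object positions on a discrete time scale. The names Observation, spatial_relation, motion_event and describe, the distance threshold, and the tiny vocabulary are hypothetical choices made for this illustration; they are not taken from any of the cited systems, and plan recognition is omitted entirely.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Position of one recognized object at one discrete time step (hypothetical format)."""
    name: str
    t: int
    x: float
    y: float


def spatial_relation(a, b):
    """Pick a coarse locative expression for a relative to b.

    Assumes image coordinates: x grows to the right, y grows downward.
    """
    dx, dy = a.x - b.x, a.y - b.y
    if abs(dx) >= abs(dy):
        return "to the right of" if dx > 0 else "to the left of"
    return "below" if dy > 0 else "above"


def motion_event(track, landmark, eps=0.5):
    """Classify the motion of a track relative to a landmark by comparing distances."""
    start = math.hypot(track[0].x - landmark.x, track[0].y - landmark.y)
    end = math.hypot(track[-1].x - landmark.x, track[-1].y - landmark.y)
    if end < start - eps:
        return "approaches"
    if end > start + eps:
        return "moves away from"
    return "keeps its distance from"


def describe(track, landmark):
    """Map the extracted motion event and spatial relation onto one sentence."""
    verb = motion_event(track, landmark)
    relation = spatial_relation(track[-1], landmark)
    return f"The {track[0].name} {verb} the {landmark.name} and is now {relation} it."
```

Calling describe on a track that moves from x = 0 to x = 4 toward a landmark at x = 10 yields a sentence built from a motion verb and a spatial preposition, each of which is linked directly to the underlying geometric data.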
While early systems like HAM-ANS [WMJB83], LandScan [BJKZ85], and NAOS [Neu89] generated retrospective natural language scene descriptions after the processing of the complete image sequence, current systems like VITRA [Wah89] aim at an incremental analysis of the visual input to produce simultaneous narration. VITRA incrementally generates reports about real-world traffic scenes or short video clips of soccer matches. The most challenging open question in this research field is a tighter coordination of perception and language production by integrating the current bottom-up cascaded architectures with top-down and expectation-driven processing of images, such that text production can influence the low-level vision processing, e.g., by focusing on particular objects and by providing active control of the vision sensors.
A great practical advantage of natural language image description is that the degree of condensation of the visual information can be selected to suit the application. There are many promising applications in medical technology, remote sensing, traffic control and other surveillance tasks.
Only a small number of researchers have dealt with the inverse direction, the generation of images from natural language text. The work in this area of natural language processing has shown how a physically based semantics of motion verbs and locative prepositions can be seen as conveying spatial, kinematic and temporal constraints, thereby enabling a system to create an animated graphical simulation of events described by natural language utterances.
The AnimNL project [BPW93] aims to enable
people to use natural language instructions as high-level
specifications to guide animated human figures through a task. The
system is able to interpret simple instructional texts in terms of
intentions that the simulated agent should adopt, desired constraints
on the agent's behavior and expectations about what will happen in
the animation. In the ANTLIMA system
[SS93] the generation of animations from text is
based on the assumption that the descriptions always refer to the
most typical case of a spatial relation or motion. Typicality
potential fields are used to characterize the default distribution
for the location and velocity of objects, the duration of events, and
the temporal relation between events. In ANTLIMA and the
SPRINT system [YYI92] all objects in the
described scene are moved to a position with maximal
typicality using a hill-climbing algorithm. If the
location of an object is described by several spatial predications
holding simultaneously, the algebraic average of the corresponding
typicality distributions is used to compute the position of the
object in the animation.
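The placement step can be sketched as follows. This is a minimal illustration under stated assumptions, not the ANTLIMA or SPRINT implementation: each typicality field is modeled as a hypothetical Gaussian-shaped bump over 2-D positions, the "algebraic average" is read as the arithmetic mean of the field values, and the hill-climbing search moves on a fixed-step grid. The function names and parameters are invented for this example.

```python
import math


def typicality(center, spread):
    """Return a typicality field peaking at center (assumed Gaussian shape)."""
    cx, cy = center

    def field(x, y):
        return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * spread ** 2))

    return field


def average(fields):
    """Combine several simultaneously holding spatial predications by averaging."""
    def combined(x, y):
        return sum(f(x, y) for f in fields) / len(fields)

    return combined


def hill_climb(field, start, step=0.1, max_iters=10_000):
    """Move to the best of the eight neighboring grid points until none is better."""
    x, y = start
    best = field(x, y)
    for _ in range(max_iters):
        neighbors = [
            (x + dx * step, y + dy * step)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        ]
        nx, ny = max(neighbors, key=lambda p: field(*p))
        value = field(nx, ny)
        if value <= best:
            break  # local maximum of typicality reached
        x, y, best = nx, ny, value
    return (x, y), best
```

With two overlapping fields centered at (2, 0) and (4, 0), the averaged field has its maximum midway between them, so an object started at the origin ends up close to (3, 0). Hill climbing only finds a local maximum, so the result depends on the starting position when the combined field has several peaks.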
There is an expanding range of exciting applications for these methods, such as advanced simulation, entertainment, animation and computer-aided design (CAD) systems.
Whereas mapping images to text is a process of abstraction, mapping
text to images is a process of concretion
(see the figure below). However, in many situations the
appropriate level of detail can only be achieved by a combination of
text and images.
Figure: Generating and Transforming Presentations in Different Modes and Media.
A new generation of intelligent multimodal systems [May93] goes beyond the standard canned text, predesigned graphics and prerecorded images and sounds typically found in commercial multimedia systems of today. A basic principle underlying these so-called intellimedia systems is that the various constituents of a multimodal communication should be generated on the fly from a common representation of what is to be conveyed without using any preplanned text or images. It is an important goal of such systems not simply to merge the verbalization and visualization results of a text generator and a graphics generator, but to carefully coordinate them in such a way that they generate a multiplicative improvement in communication capabilities. Such multimodal presentation systems are highly adaptive, since all presentation decisions are postponed until runtime. The quest for adaptation is based on the fact that it is impossible to anticipate the needs and requirements of each potential user in an infinite number of presentation situations.
The most advanced multimodal presentation systems that generate text
illustrated by 3-D graphics and animations are COMET
[FM93] and WIP [WAF93].
COMET generates directions for maintenance and repair of a
portable radio, and WIP designs multimodal explanations in
German and English on using an espresso machine,
assembling a lawn mower, or installing a modem.
Intelligent multimodal presentation systems include a number of key processes: content planning (determining what information should be presented in a given situation), mode selection (apportioning the selected information to text and graphics), presentation design (determining how text and graphics can be used to communicate the selected information), and coordination (resolving conflicts and maintaining consistency between text and graphics).
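The four processes can be pictured as stages of a small pipeline. The sketch below is purely illustrative: the Fact record, the relevance threshold, the rule that spatial information goes to graphics, and the linking step are all hypothetical simplifications and do not describe how COMET or WIP actually work.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    """One piece of information to convey (hypothetical representation)."""
    subject: str
    content: str
    kind: str         # "spatial" or "abstract" (assumed classification)
    relevance: float  # assumed score in [0, 1] for the current situation


def plan_content(facts, threshold=0.5):
    """Content planning: decide what should be presented in this situation."""
    return [f for f in facts if f.relevance >= threshold]


def select_mode(fact):
    """Mode selection: apportion information to graphics or text (toy rule)."""
    return "graphics" if fact.kind == "spatial" else "text"


def design(fact, mode):
    """Presentation design: decide how the chosen mode expresses the fact."""
    if mode == "graphics":
        item = f"depict {fact.content}"
    else:
        item = fact.content[0].upper() + fact.content[1:] + "."
    return {"mode": mode, "subject": fact.subject, "item": item}


def coordinate(items):
    """Coordination: mark text items whose subject is also depicted,
    so that text and picture can be kept consistent with each other."""
    depicted = {i["subject"] for i in items if i["mode"] == "graphics"}
    for i in items:
        i["linked"] = i["mode"] == "text" and i["subject"] in depicted
    return items


def present(facts):
    """Run all four stages at presentation time, with nothing preplanned."""
    selected = plan_content(facts)
    return coordinate([design(f, select_mode(f)) for f in selected])
```

Because every decision is taken inside present, changing the facts or the threshold at runtime changes the resulting presentation, which reflects the adaptivity argument made above for such systems.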
Figure: A text-picture combination generated by the WIP-System.
An important synergistic use of multimodality in systems generating
text-picture combinations is the disambiguation of referring
expressions. An accompanying picture often makes clear what the
intended object of a referring expression is. For example, a
technical name for an object unknown to the user may be introduced by
clearly singling out the intended object in the accompanying
illustration (see the figure above). In addition,
WIP and COMET can generate cross-modal expressions like "The on/off switch is shown in the upper left part
of the picture," to establish referential relationships of
representations in one modality to representations in another
modality.
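A cross-modal expression of this kind can be derived from the layout of the generated picture. The following minimal sketch assumes that the position of the depicted object is available as a bounding box in picture coordinates with the origin in the upper left corner. The division of the picture into thirds and the function names are assumptions made for this example, not the method used by WIP or COMET.

```python
def picture_region(bbox, width, height):
    """Name the part of the picture that contains the center of bbox.

    bbox is (x0, y0, x1, y1); the picture is split into thirds along each
    axis, with y growing downward (assumed convention).
    """
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    col = "left" if cx < width / 3 else "right" if cx > 2 * width / 3 else ""
    row = "upper" if cy < height / 3 else "lower" if cy > 2 * height / 3 else ""
    if row and col:
        return f"{row} {col} part"
    if row or col:
        return f"{row or col} part"
    return "center"


def cross_modal_reference(name, bbox, width, height):
    """Generate a sentence that refers from the text to the picture."""
    region = picture_region(bbox, width, height)
    return f"The {name} is shown in the {region} of the picture."
```

For an object named "on/off switch" whose bounding box lies near the top left corner of a 300 by 200 picture, the function produces the sentence quoted above.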
The research so far has shown that it is possible to adapt many of the fundamental concepts developed to date in computational linguistics in such a way that they become useful for text-picture combinations as well. In particular, semantic and pragmatic concepts like communicative acts, coherence, focus, reference, discourse model, user model, implicature, anaphora, rhetorical relations and scope ambiguity take on an extended meaning in the context of multimodal communication.
Areas that require further investigation include the question of how to reason about multiple modes so that the system can block false implicatures and ensure that the generated text-picture combination is unambiguous; the role of layout as a rhetorical force that influences the intentional and attentional state of the viewer; the integration of facial animation and speech of the presentation agent; and the formalization of design knowledge for creating interactive presentations.
Key applications for intellimedia systems are multimodal helpware, information retrieval and analysis, authoring, training, monitoring, and decision support.