Multimodal Access to Spatial Data

Jerry R. Hobbs and Andrew Kehler

Artificial Intelligence Center
SRI International

CONTACT INFORMATION

Jerry R. Hobbs
SRI International
333 Ravenswood Avenue
Menlo Park, CA 94025
Phone: (415) 859-2229
Fax: (415) 859-3735
Email: hobbs@ai.sri.com

WWW PAGE

http://www.ai.sri.com/~kehler/multimodal.html

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

multimodal interfaces, spoken and gestural reference, discourse structure, spatial language

PROJECT SUMMARY

SRI International's project, ``Multimodal Access to Spatial Data,'' focuses on the use of language and gesture in interacting with a computerized terrain model, in the context of solving spatial problems.

There is a broad class of crisis situations that can be abstractly characterized in a similar fashion: The goal of a team of people is to gain access to specific structures or objectives through a terrain that presents obstacles and is subject to continuously changing conditions. Examples include fire fighting operations, earthquake and flood response, rescues, tank battles, urban warfare, and hostage situations. Various route-planning and travel-planning tasks can be viewed in this way as well. The members of the team interact in part through language and in part through a model of the terrain, either a map or some richer representation. They use the terrain model to identify structures, terrain features, routes of access, and obstacles along these routes, and they must update the model as conditions change.

The terrain model is essentially a dynamic database of geographical information. It is potentially very rich in information, containing many levels of detail and best viewed with a specific focus or perspective. Thus, for complex tasks it is essential to have a computer-based presentation of the terrain model.

The most convenient means for interacting with such a computerized terrain model would be natural language and gesture. (By ``gesture'' we mean the use of pointing or the drawing of simple figures on a display.) This raises the problem of reference, broadly construed as the recognition of the mapping from the way meanings are expressed to the entities in the model that they indicate. This includes the problem of resolving referential expressions in context, on the basis of their form and content. But another aspect of the problem of reference arises from the fact that the conceptualizations of space that underlie natural language and gesture, on the one hand, and current terrain models, on the other, are radically different. There is a significant gap that must be bridged. The essentially topological conceptualization of space that underlies natural language must be mapped into the more geometric representations of the terrain model.
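To make the gap concrete, the following is a minimal sketch of how two qualitative spatial terms, ``near'' and ``between,'' might be given geometric tests over map coordinates. It is an illustration only, not the project's method: the Feature class, the function names, the threshold and tolerance parameters, and the sample coordinates are all assumptions introduced here.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Feature:
    """A named terrain feature at a map coordinate (hypothetical schema)."""
    name: str
    x: float
    y: float


def near(a, b, threshold):
    """Qualitative 'near' as a distance cutoff; the cutoff is an assumption."""
    return hypot(a.x - b.x, a.y - b.y) <= threshold


def between(p, a, b, tolerance=0.25):
    """Qualitative 'between': p projects strictly inside segment a-b and lies
    within tolerance * |ab| of it. The tolerance value is an assumption."""
    dx, dy = b.x - a.x, b.y - a.y
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return False  # a and b coincide, so nothing lies between them
    # Position of p's projection along the segment (0 at a, 1 at b).
    t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / length_sq
    if not 0 < t < 1:
        return False
    # Perpendicular distance from p to its projection on the segment.
    proj_x, proj_y = a.x + t * dx, a.y + t * dy
    return hypot(p.x - proj_x, p.y - proj_y) <= tolerance * length_sq ** 0.5


# Made-up sample features for illustration.
ridge = Feature("ridge", 0.0, 0.0)
station = Feature("station", 10.0, 0.0)
creek = Feature("creek", 5.0, 1.0)

print(between(creek, ridge, station))
print(near(creek, station, threshold=2.0))
```

Even in this toy form, the cutoffs have to be chosen by hand, which is one face of the mismatch described above: the linguistic term carries no numeric threshold, while the geometric test cannot be run without one.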

We propose to investigate the properties of interactions with the terrain model and between team members as they would occur in such a crisis situation. Our focus will be on elucidating the mapping from natural language and gesture to the terrain model. Specifically, we will investigate

Our first task has been to design ``Wizard of Oz'' experiments to elicit the coordinated use of language and gesture in interacting with a terrain model. Because of the software that is already available at SRI, our first experiment, to be carried out in the summer of 1997, will involve travel planning with a computer-based map and other information about a city to be toured. We have run pilot sessions on this set-up and have turned up a wide variety of styles of interacting with such a system, from users who try to keep their inputs simple for the computer with tightly coordinated speech and gesture, to users who ramble on and on with little use of gesture. One of the problems we face in this task is how to encourage the use of gesture without biasing users toward a particular small set of gestures.

The second scenario we have been exploring would involve expert or trainee fire-fighters directing resources to objectives while using a terrain model rich in topographic and other information. The design of this experiment is in a much more preliminary stage.

In addition to the experimental work, we are looking at fundamental aspects of how spatial information is represented in language. Specifically, we are developing an axiomatic theory of scales, or scalar notions, which underlie our conceptualizations of space and other phenomena. This work will be reported on at the AAAI Workshop on Language and Space, in Providence, RI, July 27 and 28.

Two problems are the focus of this project: multimodal reference and how discourse structure influences it, and the nature of the mapping between the linguistic and conceptual representations of language and gesture and the more geometric representation of terrain models. Both are of significant scientific interest and practical utility.

PROJECT REFERENCES

Cheyer, A. and L. Julia. 1995. Multimodal maps: An agent-based approach. In Proceedings of the International Conference on Cooperative Multimodal Communication, Eindhoven, The Netherlands, May.

Hobbs, Jerry R. 1997. Toward a Theory of Scales. To appear in the Proceedings of the AAAI Workshop on Language and Space, Providence, RI, July.

Moran, Douglas B. and Adam J. Cheyer. 1995. Intelligent agent-based user interfaces. In Proceedings of International Workshop on Human Interface Technology 95 (IWHIT'95), pages 7--10, Aizu-Wakamatsu, Fukushima, Japan, 12-13 October. The University of Aizu.

Moran, Douglas B., Adam J. Cheyer, Luc E. Julia, and David L. Martin. 1997. The open agent architecture and its multimodal user interface. In Proceedings of the 1997 International Conference on Intelligent User Interfaces (IUI97).

AREA BACKGROUND

Efficient communication with computers requires the use of multiple modalities, including speech, writing, drawing, and gesture. A communicative act that combines several modalities - e.g., one in which a user simultaneously utters a pronoun and points to an object on a computer screen - is a natural and convenient means for establishing reference to an object. Likewise, a spoken command to "scroll the display" and an arrow drawn on the screen together tell the system what to do and how to do it.

In communicative acts such as these, considering the speech or the gesture alone is not enough to determine what is being said. Communication in such a setting succeeds only through a mutually constraining combination of language and gesture, in which the language may be interpretable only with respect to the gesture, and vice versa.
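The mutual constraint can be sketched in a few lines. The example below is a hedged illustration, not a description of any system cited here: the DisplayObject class, the resolve_reference function, the max_distance cutoff, and the sample map objects are all hypothetical. The spoken noun restricts the type of the referent, the pointing gesture restricts its location, and only the combination selects a single object.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class DisplayObject:
    """An object shown on the map display (hypothetical schema)."""
    name: str
    kind: str
    x: float
    y: float


def resolve_reference(objects, spoken_kind, point, max_distance=50.0):
    """Return the object of the spoken kind closest to the pointed location.

    spoken_kind is the type named in speech, or None for a bare pronoun.
    point is the (x, y) screen position of the gesture.
    Returns None when no matching object lies within max_distance.
    """
    px, py = point

    def distance(obj):
        return hypot(obj.x - px, obj.y - py)

    # Language constrains the candidate set by type.
    candidates = [o for o in objects
                  if spoken_kind is None or o.kind == spoken_kind]
    # Gesture constrains the candidate set by location.
    in_range = [o for o in candidates if distance(o) <= max_distance]
    return min(in_range, key=distance) if in_range else None


# Made-up display contents for illustration.
display = [
    DisplayObject("Grand Hotel", "hotel", 100.0, 100.0),
    DisplayObject("City Museum", "museum", 110.0, 100.0),
    DisplayObject("Park Hotel", "hotel", 300.0, 300.0),
]

# "this hotel" plus a point that lands slightly closer to the museum.
print(resolve_reference(display, "hotel", (108.0, 100.0)))
```

With these sample objects, the word "hotel" alone is ambiguous between two hotels, and the point alone falls closest to the museum; neither channel is sufficient without the other.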

Although there is a substantial literature on natural language use in interactive settings, less has been done in multimodal environments. For instance, research on reference in natural language processing has generally focused on the relationship between linguistic expressions and the salience of objects that have previously been mentioned. When a user interacts with a multimodal system, however, entities that have not been mentioned are also salient in the shared context provided by the visual display, and references to them are central to the interaction. Hence the need for basic research to enhance our understanding of the use of natural language in general, and of reference in particular, in a multimodal environment.

AREA REFERENCES

Allen, James F., Bradford W. Miller, Eric K. Ringger, and Teresa Sikorski. 1996. A robust system for natural language spoken dialogue. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL-96), Santa Cruz, CA.

Andre, Elisabeth, Guido Bosch, Gerd Herzog, and Thomas Rist. 1987. Coping with the intrinsic and deictic uses of spatial prepositions. In K. Jorrand and L. Sgurev, editors, Artificial Intelligence II: Methodology, Systems, Applications. Amsterdam: North-Holland.

Bolt, R. A. 1980. Put-that-there: Voice and gesture at the graphics interface. ACM Computer Graphics, 14(3):262--270.

Cheyer, A. and L. Julia. 1995. Multimodal maps: An agent-based approach. In Proceedings of the International Conference on Cooperative Multimodal Communication, Eindhoven, The Netherlands, May.

Cohen, P. R., A. Cheyer, M. Wang, and S. C. Baeg. 1994. An open agent architecture. In O. Etzioni, editor, Proceedings of the AAAI Spring Symposium Series on Software Agents, pages 1--8, Menlo Park, California, March. American Association for Artificial Intelligence.

Cohen, Philip R., Mary Dalrymple, Douglas B. Moran, Fernando C. N. Pereira, Joseph W. Sullivan, Robert A. Gargan, Jon L. Schlossberg, and Sherman W. Tyler. 1989. Synergistic use of direct manipulation and natural language. In Human Factors in Computing Systems: CHI'89 Conference Proc., pages 227--234, New York.

Gapp, Klaus-Peter. 1994. Basic meanings of spatial relations: Computation and evaluation in 3d space. In Proceedings of AAAI-94, pages 1411--1417.

Hanne, K. H. and H. J. Bullinger. 1992. Multimodal communication: Integrating text and gestures. In Multimedia Interface Design, pages 127--138. ACM Press.

Hauptmann, A. G. and P. McAvinney. 1993. Gestures with speech for graphic manipulation. International Journal of Man-Machine Studies, 38:231--249.

Huls, Carla, Wim Claassen, and Edwin Bos. 1995. Automatic Referent Resolution of Deictic and Anaphoric Expressions. Computational Linguistics, 21(1):59--79.

Mignot, C. and N. Carbonell. 1996. Oral and Gestural Command: An Empirical Study. Techniques et Sciences Informatiques, 15(10):1399--1428.

Moran, Douglas B. and Adam J. Cheyer. 1995. Intelligent agent-based user interfaces. In Proceedings of International Workshop on Human Interface Technology 95 (IWHIT'95), pages 7--10, Aizu-Wakamatsu, Fukushima, Japan, 12-13 October. The University of Aizu.

Moran, Douglas B., Adam J. Cheyer, Luc E. Julia, and David L. Martin. 1997. The open agent architecture and its multimodal user interface. In Proceedings of the 1997 International Conference on Intelligent User Interfaces (IUI97).

Neal, Jeannette G., Zuzana Dobes, Keith E. Bettinger, and Jong S. Byoun. 1988. Multimodal references in human-computer dialogue. In Proceedings of AAAI-88, pages 819--823. Morgan Kaufmann.

Novak Jr., Gordon S. and William C. Bulko. 1990. Understanding natural language with diagrams. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 465--470, Boston.

Oviatt, S. L. 1992. Pen/voice: Complementary multimodal communication. In Proc. of Speech Tech'92, pages 238--241.

Oviatt, S. L. 1996. Multimodal interfaces for dynamic interactive maps. In Human Factors in Computing Systems (CHI'96). ACM Press.

Rajagopalan, Raman. 1994. A model for integrated qualitative spatial and dynamic reasoning about physical systems. In Proceedings of AAAI-94, pages 1411--1417.

Siroux, J., M. Guyomard, F. Multon, and C. Remondeau. 1995. Oral and Gestural Activities of the Users in the GEORAL system. In Proceedings of the First International Workshop on Intelligence and Multimodality in Multimedia Interfaces: Research and Applications.

Srihari, Rohini. 1995. Computational models for integrating linguistic and visual information: A survey. Artificial Intelligence Review, 8(5,6).

Vo, M. T. and A. Waibel. 1993. A multi-modal human-computer interface: Combination of gesture and speech recognition. In INTERCHI '93, Adjunct Proceedings, pages 231--249, Amsterdam.

RELATED PROGRAM AREAS

Other Communication Modalities, Adaptive Human Interfaces, Usability and User-Centered Design