A Unified Framework for Multimodal Conversational Behaviors in Interactive Humanoid Agents

Justine Cassell

Media Lab
MIT

CONTACT INFORMATION

E15-315
20 Ames Street
Cambridge, MA 02139
Phone: (617) 253-4899
Fax: (617) 253-6215
Email: justine@media.mit.edu

WWW PAGE

http://www.media.mit.edu/~justine/

PROGRAM AREA

Other Communication Modalities

KEYWORDS

gesture, facial animation, conversation, multimodal interaction, interactive agent

PROJECT SUMMARY

Humans communicate with one another using speech, prosodic cues, hand gestures, gaze and facial expression. Social scientists have studied the interaction among these communicative phenomena to a limited extent, but for the most part in a descriptive manner not amenable to implementation in computer systems. In the development of interactive systems, that interaction has been largely ignored: gestures have been studied in the absence of speech, or speech and discourse have been studied in the absence of gesture and facial expression. And yet the dominant paradigm in human-computer interaction today is natural face-to-face interaction, with the computer as conversational partner. Our long-term goal, in terms of both underlying theory and computational implementation, is to be able to synthesize and understand natural face-to-face conversational behavior -- that is, spontaneous gesture and facial movements in the context of speech with intonation.

Our specific goal for this project is to continue our work on determining the rules that govern the interactions among gesture, facial expression, speech, information structure and turn-taking, so that we can automatically generate these behaviors and exploit them in human-computer interaction. The approach described here also advances our previous work on synthesizing conversation between autonomous agents (Cassell et al., 1994a; Cassell et al., 1994b). Here we propose to fulfill two promissory notes made in our earlier work: to extend our framework to interaction between an autonomous agent and a human, and to generate the verbal and nonverbal behaviors truly from scratch -- from semantic features upwards.

PROJECT REFERENCES

http://gn.www.media.mit.edu/groups/gn/publications.html

AREA BACKGROUND

In order to build a computational human-like conversational partner, one must understand how human-human conversation works. This requires an understanding of discourse structure: how we organize information in language to achieve certain communicative goals -- to emphasize certain things as important, to relate a topic to a previous topic, to let the listener know what the overall structure of the discourse will be. It also requires an understanding of conversational structure: how we organize talk to achieve certain interactional goals -- to make the other person feel heard, to make sure we get heard, to maintain cultural norms of politeness, to start the conversation, segue from one topic to another, and end the conversation gracefully. Some of discourse structure and conversational structure is conveyed in language, some is conveyed through intonation, and some is conveyed nonverbally, through gaze patterns and gesture. Gaze and gesture are in fact the most powerful cues to who is talking, and who is about to take a turn talking. Gesture also conveys information about the entities being referred to in a discourse and about the speaker's beliefs concerning what is being said.

Some of the concerns that arise in translating this research into the computational domain are traditional computational linguistics concerns: how to generate language from knowledge, beliefs and plans. Others have more to do with interactive systems and interface systems in general: how to recognize the intentions of a user from the user's behavior, including the user's language. Recognizing the user's intentions requires, in turn, two things. The first is a way to recognize the form of the behaviors; in this phase of the project we are turning away from using hardware (cybergloves and an eyetracker) to gather data and turning to computer vision systems to recognize gesture and gaze. The second is a way to interpret those behaviors: an architecture that allows the integration of multimodal behaviors without privileging any one of those behaviors over the others, based on a semantics that includes information from language, gesture and gaze.

AREA REFERENCES

Bolt, R.A. (1987). The integrated multi-modal interface. Transactions of the Institute of Electronics, Information and Communication Engineers (Japan), J79-D(11), 2017-2025.

Chen, D., Pieper, S., Singh, S., Rosen, J. & Zeltzer, D. (1993). The virtual sailor: an implementation of interactive human body modeling. In Proc. 1993 Virtual Reality Annual International Symposium, Seattle, WA.

Condon, W. & Ogston, W. (1971). Speech and body motion synchrony of the speaker-hearer. In D.H. Horton & J.J. Jenkins, eds., The Perception of Language, 150-184. Academic Press.

Duncan, S. (1974). Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology, 23 (2): 283-292.

Ekman, P. (1979). About brows: emotional and conversational signals. In M. von Cranach, K. Foppa, W. Lepenies, and D. Ploog, eds., Human Ethology: Claims and Limits of a New Discipline, 169-248. Cambridge University Press.

Essa, I. & Pentland, A. (1994). A vision system for observing and extracting facial action parameters. In Proceedings of Computer Vision and Pattern Recognition (CVPR 94), 76-83.

Koons, D.B., Sparrell, C.J. & Thorisson, K.R. (1993). Integrating simultaneous input from speech, gaze and hand gestures. In M.T. Maybury (Ed.), Intelligent Multi-Media Interfaces. Cambridge, MA: AAAI Press/MIT Press.

McKeown, K. (1985). Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge University Press.

McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago: University of Chicago Press.

Pierrehumbert, J. (1993). Prosody, intonation, and speech technology. In M. Bates & R. Weischedel, eds., Challenges in Natural Language Processing, 257-280. Cambridge University Press.

Quek, F. (1994). Toward a vision-based hand gesture interface. In Proceedings of the Virtual Reality System Technology Conference, 17-29, August 23-26, 1994, Singapore.

Takeuchi, A. & Nagao, K. (1993). Communicative facial displays as a new conversational modality. In ACM/IFIP INTERCHI '93, Amsterdam.

For further references, see: http://justine.www.media.mit.edu/people/justine/disc-bib96.htm

RELATED PROGRAM AREAS

Virtual Environments

Speech and Natural Language Understanding

Intelligent Interactive Systems for Persons with Disabilities