Media Lab
MIT
Our specific goal for this project is to continue determining the rules that govern the interactions among gesture, facial expression, speech, information structure and turn-taking, so that we can generate these behaviors automatically and exploit them in human-computer interaction. The approach described here also advances our previous work on synthesizing conversation between autonomous agents (Cassell et al., 1994a; Cassell et al., 1994b). Here we propose to fulfill two promissory notes made in our earlier work: to extend our framework to interaction between an autonomous agent and a human, and to generate the verbal and nonverbal behaviors truly from scratch -- from semantic features upwards.
Some of the concerns that arise in translating this research into the computational domain are traditional computational linguistics concerns: how to generate language from knowledge, beliefs and plans. Others have more to do with interactive systems and interfaces in general: how to recognize the intentions of a user from the user's behavior, including the user's language. Recognizing the user's intentions requires, in turn, two things. The first is a way to recognize the form of the behaviors: in this phase of the project we are turning away from hardware (cybergloves and an eyetracker) for gathering data and toward computer vision systems that recognize gesture and gaze. The second is a way to interpret those behaviors: an architecture that integrates multimodal behaviors without privileging any one of them over the others, based on a semantics that includes information from language, gesture and gaze.
Chen, D., Pieper, S., Singh, S., Rosen, J. & Zeltzer, D. (1993). The virtual sailor: an implementation of interactive human body modeling. In Proc. 1993 Virtual Reality Annual International Symposium, Seattle, WA.
Condon, W. & Ogston, W. (1971). Speech and body motion synchrony of the speaker-hearer. In D.H. Horton & J.J. Jenkins, eds., The Perception of Language, 150-184. Academic Press.
Duncan, S. (1974). Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology, 23 (2): 283-292.
Ekman, P. (1979). About brows: emotional and conversational signals. In M. von Cranach, K. Foppa, W. Lepenies, and D. Ploog, eds., Human Ethology: Claims and Limits of a New Discipline, 169-248. Cambridge University Press.
Essa, I. & Pentland, A. (1994). A vision system for observing and extracting facial action parameters. In Proceedings of Computer Vision and Pattern Recognition (CVPR 94), 76-83.
Koons, D.B., Sparrell, C.J. & Thorisson, K.R. (1993). Integrating simultaneous input from speech, gaze and hand gestures. In M.T. Maybury, ed., Intelligent Multi-Media Interfaces. Cambridge, MA: AAAI Press/MIT Press.
McKeown, K. (1985). Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge University Press.
McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago: University of Chicago Press.
Pierrehumbert, J. (1993). Prosody, intonation, and speech technology. In M. Bates & R. Weischedel, eds., Challenges in Natural Language Processing, 257-280. Cambridge University Press.
Quek, F. (1994). Toward a vision-based hand gesture interface. In Proceedings of the Virtual Reality System Technology Conference, 17-29. August 23-26, 1994, Singapore.
Takeuchi, A. & Nagao, K. (1993). Communicative facial displays as a new conversational modality. In ACM/IFIP INTERCHI '93, Amsterdam.
For further references, see: http://justine.www.media.mit.edu/people/justine/disc-bib96.htm
Speech and Natural Language Understanding
Intelligent Interactive Systems for Persons with Disabilities