Vision-based Hand Gesture Analysis in a Multimodal Interface for Controlling Virtual Environments

Thomas S. Huang

The Beckman Institute and the Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

Rajeev Sharma

Department of Computer Science and Engineering
Pennsylvania State University at University Park

CONTACT INFORMATION

The Beckman Institute
405 N. Mathews Avenue
Urbana, IL 61801
Phone: (217) 244-1638
Fax: (217) 244-8371
Email: huang@ifp.uiuc.edu

WWW PAGE

http://www.beckman.uiuc.edu/groups/IFP/people/Huang.html
http://www.cse.psu.edu/~rsharma

PROGRAM AREA

Other Communication Modalities

KEYWORDS

vision-based gesture analysis, gesture recognition, speech/gesture integration, multimodal interface, virtual environments.

PROJECT SUMMARY

To fully exploit the potential that virtual reality offers and will offer in the future as a means of visualizing and interacting with information, it is important to develop "natural" means of interacting with the virtual display. Clearly, the most natural means of human communication is multimodal, involving a mixture of speech, hand and body movement, facial expression, and eye motion. The goal of this research is to explore a multimodal framework where several interaction modes will be used as means of manipulating a 3D virtual display. The main focus is to explore the use of free hand gestures for manipulating virtual objects using a set of strategically positioned video cameras and to study the interaction of hand gestures with speech, gaze, and the content of a virtual display.

The project involves the development of computer vision techniques that can extract the user's hand from the background, track the hand/arm motion, use context to distinguish a meaningful gesture from unintentional hand movements, and resolve conflicts between gestures from multiple users. A key challenge of gesture recognition in the multimodal setting of a VR system is to find ways of improving recognition performance using, for example, speech recognition and gaze direction. Another challenge is to develop appropriate computational architectures for integrating two or more interaction modalities, such as speech and hand gestures.
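As an illustration of how a second modality might improve gesture recognition, the following is a minimal sketch of log-linear late fusion, in which per-command scores from a gesture recognizer are combined with per-command scores from a speech recognizer. It is not the project's method. The function name `fuse_scores`, the `speech_weight` parameter, the command labels, and the probability values are all hypothetical and chosen only for illustration.

```python
import math


def fuse_scores(gesture_probs, speech_probs, speech_weight=0.5):
    """Combine two per-command probability tables by log-linear late fusion.

    Hypothetical helper for illustration; the weighting scheme is an
    assumption, not something specified by the project description.
    """
    eps = 1e-9  # avoids log(0) for commands a recognizer rules out entirely
    commands = gesture_probs.keys() & speech_probs.keys()
    if not commands:
        raise ValueError("the two recognizers share no command labels")

    # Weighted sum of log-probabilities for each shared command.
    log_scores = {
        c: (1.0 - speech_weight) * math.log(gesture_probs[c] + eps)
        + speech_weight * math.log(speech_probs[c] + eps)
        for c in commands
    }

    # Renormalize so the fused scores sum to one.
    peak = max(log_scores.values())
    unnormalized = {c: math.exp(s - peak) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# Made-up scores: the gesture recognizer alone slightly prefers "rotate",
# while the speech recognizer strongly suggests "move".
gesture = {"rotate": 0.45, "move": 0.40, "stop": 0.15}
speech = {"rotate": 0.10, "move": 0.80, "stop": 0.10}

fused = fuse_scores(gesture, speech)
best = max(fused, key=fused.get)
print(best, fused)
```

In this made-up case the ambiguous gesture is resolved in favor of the command supported by speech. A real system would also have to align the two streams in time and handle recognizers with different vocabularies, which this sketch ignores.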

PROJECT REFERENCES

V. I. Pavlovic, R. Sharma, and T. S. Huang. "Visual interpretation of hand gestures for human-computer interaction: A review." IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), July 1997.

R. Sharma, T. S. Huang, V. I. Pavlovic, Y. Zhao, Z. Lo, S. Chu, K. Schulten, A. Dalke, J. Phillips, M. Zeller, and W. Humphrey. "Speech/gesture interface to a visual computing environment for molecular biologists." In Proc. International Conference on Pattern Recognition, pp. 964-968, 1996, Vienna, Austria.

T. S. Huang, V. I. Pavlovic, and R. Sharma. "Speech/gesture-based human computer interface in virtual environments." In Proc. Workshop on the Integration of Gesture in Language and Speech (WIGLS), pp. 41-57, October 1996, Wilmington, DE.

V. I. Pavlovic, R. Sharma, and T. S. Huang. "Gestural Interface to a Visual Computing Environment for Molecular Biologists." In Second International Conference on Automatic Face and Gesture Recognition, pp. 30-35, October 1996, Killington, VT.

Yusuf Azoz. "Vision-Based Human Arm Tracking For Gesture Analysis." MS Thesis, Pennsylvania State University, Department of Electrical Engineering, 1997.

R. Sharma, V. I. Pavlovic, and T. S. Huang. "A multimodal framework for interacting with virtual environments." In C. A. Ntuen and E. H. Park, editors, Human Interaction with Complex Systems. pp. 53-71, Kluwer Academic Publishers, 1996.

AREA BACKGROUND

Although there has been tremendous progress in recent years in 3-D immersive display, or virtual reality (VR), technologies, the corresponding human-computer interaction (HCI) technologies have lagged behind. For example, current interfaces involve the use of heavy headsets, datagloves, tethers, and other VR devices. Even though the use of such devices may be justified by the highly specialized application domains of today's VR technologies, the "everyday" VR user of the future may well be deterred or distracted by such cumbersome tools. Further, in everyday life, natural communication between people consists of a complex mixture of speech, body movements, facial expressions, and eye motions. Thus a "natural" human-computer interface should be multimodal. In the past, attempts have been made to study some of the natural modes of communication, such as speech and simple hand gestures, and to incorporate them into human-computer interfaces. However, very little work exists in which multiple elements of human communication have been incorporated into the HCI.

The communication mode that seems most relevant to the manipulation of physical objects is hand motion, also called hand gestures. To keep the interaction natural, it is necessary that there be a minimal number of devices attached to the user. For speech, this goal has been achieved by receiving the signal with statically mounted arrays of microphones. To accomplish the same level of naturalness for HCI using hand gestures, it is possible to use computer vision techniques for analyzing free hand gestures. Although some progress has been made in developing computer vision-based gesture recognition techniques, the problem is far from being solved. It is hoped that computer vision-based gesture recognition can be greatly improved by exploiting the other sensor modalities that might also be present in a multimodal human-computer interface.

AREA REFERENCES

A. G. Hauptmann and P. McAvinney. "Gesture with speech for graphics manipulation." International Journal of Man-Machine Studies, 38(2):231--249, Feb. 1993.

J. Streeck. "Gesture as communication I: its coordination with gaze and speech." Communication Monographs, 60:275--299, December 1993.

V. I. Pavlovic, R. Sharma, and T. S. Huang. "Visual interpretation of hand gestures for human-computer interaction: A review." IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), July 1997.

M. T. Vo and C. Wood. "Building an application framework for speech and pen input integration in multimodal learning interfaces." In Proc. International Conference on Acoustics, Speech, and Signal Processing, pp. 3545--3548, 1996.

P. R. Cohen, M. Johnston, D. McGee, S. Oviatt, and J. Pittman. "QuickSet: Multimodal Interaction for Simulation Set-Up and Control." In Proc. of the 5th Applied Natural Language Processing Meeting, 1997, Washington, DC.

F. K. H. Quek. "Eyes in the interface." Image and Vision Computing, vol. 13, August 1995.

J. M. Rehg and T. Kanade. "Model-based tracking of self-occluding articulated objects." In Proc. IEEE International Conference on Computer Vision, pp. 612--617, June 1995, Cambridge, MA.

T. E. Starner and A. Pentland. "Visual recognition of American Sign Language using hidden Markov models." In Proc. International Workshop on Automatic Face and Gesture Recognition, pp. 189--194, June 1995, Zurich, Switzerland.

RELATED PROGRAM AREAS

Virtual Environments, Speech and Natural Language Understanding.

POTENTIAL RELATED PROJECTS

Speech and Natural Language Understanding: Gesture recognition shares some of the problems that are being addressed in speech and natural language understanding. Further, the integration of gesture with speech for human-computer interaction would require collaboration with researchers involved in speech understanding for human-computer interfaces.

Virtual Environments: Integration of gesture recognition into the control of virtual environments could benefit from collaboration with researchers involved in human factors studies for virtual environments. This collaboration could shed some light on how to evaluate the use of glove-free gestures as an interaction modality in virtual environments.