
9.4 Modality Integration: Speech and Gesture

Yacine Bellik
LIMSI-CNRS, Orsay, France

Speech and gesture are the means of expression most used in communication between human beings. People begin learning to use them in the first years of life. They should therefore be the privileged modalities for communicating with computers [HM93]. Compared with speech, research that aims to integrate gesture as a means of expression (not only as a means of object manipulation) in Human-Computer Interaction (HCI) has begun only recently. This work was made possible by the appearance of new devices, in particular datagloves, which report the hand configuration (flexing angles of the fingers) at any moment and track the position of the hand in 3D space.

Multimodality aims not only at making several modalities coexist in an interactive system, but above all at making them cooperate [CNS93,Sal90]. For instance, if the user wants to move an object using a speech recognition system and a touch screen, as in the figure below, it is enough to say "put that there" while pointing first at the object and then at its new position [Bol80].


Figure: Working with a multimodal interface including speech and gesture. The user speaks while pointing at the touch screen to manipulate the objects. The time correlation between pointing gestures and spoken utterances is important in determining the meaning of the user's action.

In human communication, the use of speech and gesture is completely coordinated. Unfortunately, unlike the human means of communication, the devices used to interact with computers were not designed to cooperate.

For instance, response times can differ greatly between devices: a speech recognition system needs more time to recognize a word than a touch screen driver needs to compute the coordinates of a pointing gesture. As a result, the system receives an information stream in an order that does not correspond to the real chronological order of the user's actions (like a sentence whose words have been mixed up). This can lead to misinterpretation of the user's statements.
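
The ordering problem described above can be illustrated with a small simulation. This is only a sketch: the `Event` class, the event contents, and the latency values (0.8 s for speech, 0.05 s for the touch screen) are assumptions chosen for illustration, not measurements from any real device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    modality: str       # "speech" or "gesture"
    content: str        # recognized word or gesture label
    produced_at: float  # seconds: when the user actually acted
    latency: float      # seconds: assumed device processing delay

    @property
    def arrives_at(self) -> float:
        # Moment at which the system receives the recognized event.
        return self.produced_at + self.latency


def arrival_order(events):
    """Order in which the system sees the events."""
    return [e.content for e in sorted(events, key=lambda e: e.arrives_at)]


def chronological_order(events):
    """Order in which the user produced the events."""
    return [e.content for e in sorted(events, key=lambda e: e.produced_at)]


# Hypothetical "put that there" interaction.
EVENTS = [
    Event("speech", "put", 0.0, 0.8),
    Event("speech", "that", 0.4, 0.8),
    Event("gesture", "point(object)", 0.5, 0.05),
    Event("speech", "there", 1.0, 0.8),
    Event("gesture", "point(target)", 1.1, 0.05),
]

if __name__ == "__main__":
    print("arrival:      ", arrival_order(EVENTS))
    print("chronological:", chronological_order(EVENTS))
```

With these assumed latencies, both pointing gestures reach the system before the deictic words they accompany, which is the kind of mixed-up stream the text describes.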

The fusion of information coming from speech and gesture is a major problem. Which criteria should be used to decide whether one piece of information is to be fused with another, and at what level of abstraction should this fusion be done? On the one hand, fusion at the lexical level allows generic multimodal interface tools to be designed, though fusion errors may occur. On the other hand, fusion at the semantic level is more robust because it exploits many more criteria, but it is in general application-dependent. It is also important to handle possible semantic conflicts between speech and gesture and to exploit information redundancy when it occurs.
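
As a rough illustration of the last point, the following sketch compares the object named in speech with the object designated by gesture and labels the outcome. The function name, the labels, and the use of plain object identifiers are all assumptions made for this example; a real semantic fusion component would rely on application knowledge that is not modelled here.

```python
from typing import Optional, Tuple


def combine_referents(
    spoken: Optional[str], pointed: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Compare the object named in speech with the one designated by gesture.

    Returns a (status, referent) pair. The status labels are hypothetical.
    """
    if spoken is not None and pointed is not None:
        if spoken == pointed:
            # Both modalities agree: redundancy can raise confidence.
            return ("redundant", spoken)
        # The modalities disagree: the application must resolve the conflict.
        return ("conflict", None)
    if spoken is not None:
        return ("speech_only", spoken)
    if pointed is not None:
        return ("gesture_only", pointed)
    return ("unresolved", None)


if __name__ == "__main__":
    print(combine_referents("red_circle", "red_circle"))
    print(combine_referents("red_circle", "blue_square"))
    print(combine_referents(None, "blue_square"))
```

How a conflict is resolved (asking the user, trusting the more reliable modality, and so on) is left open here because it depends on the application.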

Time is an important factor in interfaces that integrate speech and gesture [Bel95]. It is one of the basic criteria, necessary but not sufficient, for the fusion process, and it makes it possible to reconstruct the real chronological order of the information. It is therefore necessary to assign timestamps to all messages (words, gestures, etc.) produced by the user.
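
A minimal sketch of how timestamps might drive a fusion step is given below. It is not the method of [Bel95]: the deictic vocabulary, the `max_gap` threshold of 0.5 s, and the nearest-gesture rule are assumptions made for illustration. In line with the remark that time is necessary but not sufficient, temporal proximity is the only criterion used here, so the result should be read as a candidate pairing rather than a final interpretation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed set of words that need a gesture to be resolved.
DEICTIC_WORDS = {"this", "that", "here", "there"}


@dataclass(frozen=True)
class Word:
    text: str
    time: float  # timestamp in seconds, when the word was uttered


@dataclass(frozen=True)
class Pointing:
    x: int
    y: int
    time: float  # timestamp in seconds, when the screen was touched


def fuse(
    words: List[Word], gestures: List[Pointing], max_gap: float = 0.5
) -> List[Tuple[str, Optional[Pointing]]]:
    """Pair each deictic word with the nearest unused pointing gesture.

    Inputs may be given in arrival order; timestamps restore the
    chronological order before pairing.
    """
    remaining = sorted(gestures, key=lambda g: g.time)
    fused = []
    for word in sorted(words, key=lambda w: w.time):
        match = None
        if word.text in DEICTIC_WORDS and remaining:
            nearest = min(remaining, key=lambda g: abs(g.time - word.time))
            if abs(nearest.time - word.time) <= max_gap:
                match = nearest
                remaining.remove(nearest)  # a gesture is consumed only once
        fused.append((word.text, match))
    return fused


if __name__ == "__main__":
    # Words deliberately listed out of order, as they might arrive.
    words = [Word("there", 1.0), Word("put", 0.0), Word("that", 0.4)]
    gestures = [Pointing(200, 150, 1.1), Pointing(10, 20, 0.5)]
    for text, gesture in fuse(words, gestures):
        print(text, gesture)
```

A deictic word with no gesture inside the time window is returned with `None`, leaving it to a later stage (for example, a semantic check) to decide what to do.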

It is also important to take into account the characteristics of each modality [Ber93] and their technological constraints. For instance, operations that require high security should be assigned to the modalities with the lowest risk of recognition errors, or should demand redundancy to reduce this risk. It may be necessary to define a multimodal grammar. Ideally, this grammar should also take into account other parameters such as the user's state, the current task, and the environment (for instance, a high noise level will prohibit the use of speech).
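
The constraints above can be sketched as a simple selection policy. This is not a multimodal grammar, only an illustration of the kind of rules one might encode. The error rates, the noise limit of 70 dB, the security threshold, and all names are hypothetical values picked for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Illustrative recognition error rates; real values depend on the devices.
ERROR_RATE: Dict[str, float] = {"speech": 0.10, "touch": 0.02, "dataglove": 0.15}
NOISE_LIMIT_DB = 70.0        # assumed level above which speech is unusable
MAX_ERROR_FOR_SECURE = 0.05  # assumed bound for high-security operations


@dataclass(frozen=True)
class Context:
    noise_db: float
    high_security: bool


def plan_input(
    ctx: Context, error_rate: Dict[str, float] = ERROR_RATE
) -> Tuple[List[str], bool]:
    """Return (usable modalities, whether redundant confirmation is required)."""
    # A noisy environment rules out speech.
    usable = [
        m for m in error_rate
        if not (m == "speech" and ctx.noise_db > NOISE_LIMIT_DB)
    ]
    usable.sort(key=lambda m: error_rate[m])  # most reliable first
    if not ctx.high_security:
        return usable, False
    reliable = [m for m in usable if error_rate[m] <= MAX_ERROR_FOR_SECURE]
    if reliable:
        return reliable, False
    # No single modality is reliable enough: demand redundancy instead.
    return usable, True


if __name__ == "__main__":
    print(plan_input(Context(noise_db=40.0, high_security=False)))
    print(plan_input(Context(noise_db=85.0, high_security=False)))
    print(plan_input(Context(noise_db=40.0, high_security=True)))
```

Other parameters mentioned in the text, such as the user's state and the current task, could be added as further fields of the hypothetical `Context` class.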

9.4.1 Future Directions

The effectiveness of a multimodal interface depends in large part on the performance of each modality taken separately. Although remarkable progress has been made in speech processing, more effort should be devoted to improving gesture recognition systems, in particular for continuous gestures. Systems with tactile and/or force feedback, which are becoming more and more numerous, should improve the comfort of gesture use in the near future, in particular for 3D applications.


