Christian Benoit,
Dominic W. Massaro,
& Michael M. Cohen
Université Stendhal, Grenoble, France
University of California, Santa Cruz, California, USA
A view of the speaker's face provides valuable and effective information for human speech perception and recognition. Visible speech is particularly effective when the auditory speech is degraded by noise, bandwidth filtering, or hearing impairment [SP54,Erb75,Sum79,Mas87,BMK94].
The strong influence of visible speech is not limited to situations with degraded auditory input, however. A perceiver's recognition of an auditory-visual syllable reflects the contribution of both sound and sight. When an auditory syllable /ba/ is dubbed onto a videotape of a speaker saying /ga/, subjects perceive the speaker to be saying /da/ [MM76].
There is thus evidence that (1) synthetic faces increase the intelligibility of synthetic speech, but (2) only on condition that facial gestures and speech sounds are coherent. To reach this goal, the articulatory parameters of the facial animation have to be controlled so that it looks and sounds as though the auditory output is generated by the visual displacements of the articulators. Desynchrony or incoherence between the two modalities not only fails to increase speech intelligibility; it might even decrease it.
Most existing parametric models of the human face have been developed with a view to optimizing the visual rendering of facial expressions [Par74,PB81,BL85,Wat87,MTPT88,VY92]. Few models have focused on the specific articulation of speech gestures. [STHN90,BLMA92,HL94] prestored a limited set of facial images occurring in the natural production of speech in order to synchronize diphone concatenation and viseme display in a text-to-audio-visual speech synthesizer. Ultimately, coarticulation effects and transition smoothing are simulated much more naturally by parametric models specially controlled for visual speech animation, such as the 3-D lip model developed by [GMAB94] or the 3-D model of the whole face adapted to speech control by [CM90]. These two models are displayed in the figure below.
Figure: Left panel: Gouraud shading of the face model originally developed by [Par74] and adapted to speech gestures by [CM93]. A dozen parameters allow the synthetic face to be correctly controlled for speech. Right panel: wireframe structure of the 3-D model of the lips developed by [GMAB94]. The internal and external contours of the model can take all the possible shapes of natural lips speaking with a neutral expression.
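As a rough illustration of why a parametric model lends itself to transition smoothing, the sketch below interpolates a small set of articulatory parameters between successive viseme targets. It is a minimal sketch under stated assumptions, not the control scheme of [CM93] or [GMAB94]: the parameter names (`lip_height`, `lip_width`, `jaw_opening`), the target values, and the cosine easing are all hypothetical, and plain keyframe interpolation is a much simpler technique than a true coarticulation model.

```python
import math

# Hypothetical viseme targets: parameter name -> value in [0, 1].
# Names and values are illustrative only; they come from no cited model.
VISEME_TARGETS = {
    "b": {"lip_height": 0.0, "lip_width": 0.5, "jaw_opening": 0.1},
    "a": {"lip_height": 0.9, "lip_width": 0.6, "jaw_opening": 0.8},
    "u": {"lip_height": 0.3, "lip_width": 0.1, "jaw_opening": 0.3},
}


def ease(t):
    """Cosine ease-in/ease-out: maps 0..1 onto 0..1 with zero slope at both ends."""
    return 0.5 - 0.5 * math.cos(math.pi * t)


def interpolate(keyframes, time):
    """Return the parameter values at `time`.

    `keyframes` is a list of (time_in_seconds, viseme) pairs with strictly
    increasing times. Before the first keyframe and after the last one,
    the nearest target is held.
    """
    if time <= keyframes[0][0]:
        return dict(VISEME_TARGETS[keyframes[0][1]])
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t0 <= time <= t1:
            w = ease((time - t0) / (t1 - t0))
            start, end = VISEME_TARGETS[v0], VISEME_TARGETS[v1]
            # Blend every parameter between the two surrounding targets.
            return {name: (1 - w) * start[name] + w * end[name] for name in start}
    return dict(VISEME_TARGETS[keyframes[-1][1]])


# Usage: a /ba/ followed by /u/, sampled halfway through the first transition.
keyframes = [(0.0, "b"), (0.1, "a"), (0.3, "u")]
frame = interpolate(keyframes, 0.05)
```

With prestored images, by contrast, only the discrete targets are available; intermediate lip shapes such as `frame` cannot be generated without some form of parametric control.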
A clear gain in intelligibility due to the coherent animation of a synthetic face has been observed at the University of California, Santa Cruz, by improving the Parke model [CM93] and then synchronizing it to the MITalk rule-based speech synthesizer, although no quantitative measurements are yet available. In parallel, intelligibility tests have been carried out at ICP-Grenoble to compare the benefit of seeing the natural face, a synthetic face, or synthetic lips while listening to natural speech under various conditions of acoustic degradation [GGMCB94].
Whatever the degradation level, two thirds of the missing information is compensated by the sight of the speaker's entire face; half is compensated by the sight of a synthetic face controlled through six parameters measured directly on the original speaker's face; and a third is compensated by the sight of a 3-D model of the lips controlled through only four of these command parameters (without the teeth, the tongue or the jaw being visible). All these findings suggest that technological spin-offs can be expected in two main areas of application.
On the one hand, even though the quality of some text-to-speech synthesizers is now such that simple messages are very intelligible when synthesized in clear acoustic conditions, this is no longer the case when the message is less predictable (proper names, numbers, complex sentences, etc.) or when the speech synthesizer is used in a natural environment (e.g., over the telephone network or in public places with background noise). In such cases, the display of a synthetic face coherently animated in synchrony with the synthetic speech makes the synthesizer sound more intelligible and look more pleasant and natural.
On the other hand, the quality of computer graphics rendering is now such that human faces can be imitated very naturally. Today, audiences no longer accept synthetic actors that behave as if their voices were dubbed from another language. There is thus strong pressure from the movie and entertainment industries to automate the lip-synchronization process so that the actors' facial gestures look natural.
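The fractions of missing information quoted above for the natural face, the synthetic face and the lip model can be read as a relative gain: the improvement of the audio-visual score over the audio-only score, divided by the room for improvement left by the audio-only score. The sketch below assumes this reading; it is not necessarily the exact procedure used in [GGMCB94], and the function name and the scores in the usage example are hypothetical.

```python
def compensated_fraction(audio_only, audio_visual):
    """Fraction of the information missing from audition alone that vision restores.

    Both scores are proportions of correct responses in [0, 1], measured at
    the same level of acoustic degradation. (Assumed definition; see lead-in.)
    """
    missing = 1.0 - audio_only
    if missing <= 0.0:
        raise ValueError("audio-only score is already perfect; nothing is missing")
    return (audio_visual - audio_only) / missing


# Usage with made-up scores: 40% correct when listening only, 80% correct
# when the face is also visible, so about two thirds of the gap is closed.
gain = compensated_fraction(0.40, 0.80)
```

Normalizing by the missing information is what allows a single figure to be compared across degradation levels, since the raw difference between the two scores necessarily shrinks as the audio-only score approaches 1.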
To conclude, research in the area of visible speech is a fruitful paradigm for psychological inquiry [Mas87]. Video analysis of human faces is a simple investigation technique that allows a better understanding of how speech is produced by humans [AL91]. Face and lip modeling allows experimenters to manipulate controlled stimuli and to evaluate hypotheses and descriptive parametrizations in terms of the visual and bimodal intelligibility of speech. Finally, the bimodal integration of facial animation and acoustic synthesis is a fascinating challenge that calls for a better description and comprehension of each language in which this technology is developed. It is also a necessary and promising step towards the realization of autonomous agents in human-machine virtual interfaces.