Alan J. Goldschen
Center of Innovative Technology, Herndon, Virginia, USA
A machine should be capable of performing automatic speech recognition through the use of several knowledge sources, analogous, to a certain extent, to those sources that humans use [EL90]. Current speech recognizers rely primarily on acoustic information from the speaker, supplemented in noisy environments by secondary knowledge sources such as grammar and prosody. One secondary source that has been largely ignored is optical information (from the face and in particular the oral-cavity region of a speaker), which is often redundant with the acoustic information and is often not corrupted by the processes that cause the acoustical noise [Sil93]. In noisy environments, humans rely on a combination of speech (acoustical) and visual (optical) sources, and this combination yields an effective signal-to-noise gain of 10 to 12 dB [Bro90]. Analogously, machine recognition should improve when the acoustical source is combined with an optical source that contains information from the facial region, such as gestures, expressions, head position, eyebrows, eyes, ears, mouth, teeth, tongue, cheeks, jaw, neck, and hair [PBV94]. Human facial expressions provide information about emotion (anger, surprise), truthfulness, temperament (hostility), and personality (shyness) [EHSH93]. Furthermore, human speech production and facial expression are inherently linked by a synchrony phenomenon, in which changes in speech and in facial movement often occur simultaneously [PBV94,CO71]. An eye blink may occur at the beginning or end of a word, while oral-cavity movements may cease at the end of a sentence.
In human speech perception experiments, the optical information is complementary to the acoustic information because many of the phones that are close to each other acoustically are very distant from each other visually [Sum87]. Visually similar phones such as /p/, /b/, and /m/ form a viseme, a specific oral-cavity movement that corresponds to a phone [Fis68]. It appears that the consonant phone-to-viseme mapping is many-to-one [Fin86,Gol93] and the vowel phone-to-viseme mapping is nearly one-to-one [Gol93]. For example, the phone /p/ appears visually similar to the phones /b/ and /m/, and at a signal-to-noise ratio of zero, /p/ is acoustically similar to the phones /t/, /k/, /f/, /th/, and /s/ [Sum87]. Using both sources of information, humans (or machines) can determine the phone /p/, because it is the only phone common to both confusion sets. However, this fusion of acoustical and optical sources sometimes causes humans to perceive a phone different from either the acoustically or the optically presented phone, a phenomenon known as the McGurk effect [MM76]. In general, the perception of speech in noise improves greatly when both acoustical and optical sources are presented, because the sources are complementary.
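The /p/ example can be sketched as an intersection of two confusion sets. This is only an illustrative sketch: the two sets are taken from the example in the text, while the names `ACOUSTIC_CONFUSIONS`, `VISEME_GROUPS`, and `fuse_candidates` are hypothetical and do not come from any cited system.

```python
# Illustrative sketch only; the data structures and function name are assumptions.

# Phones acoustically confusable with /p/ at a signal-to-noise ratio of zero,
# as listed in the text (the phone itself is included in its own set).
ACOUSTIC_CONFUSIONS = {"p": {"p", "t", "k", "f", "th", "s"}}

# One viseme group from the text: /p/, /b/, /m/ look alike on the lips.
# The consonant phone-to-viseme mapping is many-to-one.
VISEME_GROUPS = [{"p", "b", "m"}]


def fuse_candidates(acoustic_candidates, optical_candidates):
    """Keep only the phones that both sources consider plausible."""
    return set(acoustic_candidates) & set(optical_candidates)


acoustic = ACOUSTIC_CONFUSIONS["p"]   # what the ear cannot tell apart
optical = VISEME_GROUPS[0]            # what the eye cannot tell apart
print(fuse_candidates(acoustic, optical))
```

Neither source alone narrows the choice to one phone, but the intersection leaves a single candidate. A real recognizer would work with scores rather than hard sets, so that effects such as the McGurk effect (a fused percept matching neither input) are not ruled out by construction.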
Some speech researchers are developing systems that use the complementary acoustical and optical sources of information to improve their acoustic recognizers, especially in noisy environments. These systems primarily focus on integrating optical information from the oral-cavity region of a speaker (automatic lipreading) with acoustic information. The acoustic source often consists of a sequence of vectors containing linear predictive coefficients or filter bank coefficients, or some variation of them [RS78,DPH93]. The optical source consists of a sequence of vectors containing static oral-cavity features such as the area, perimeter, height, and width of the oral cavity [Pet84,PBBB88], jaw opening [SWL92], lip rounding, and the number of regions or blobs in the oral cavity [Gol93,GGP92,GGP94]. Other researchers model the dynamic movements of the oral cavity using derivatives [Gol93,Smi89,Nis86], surface learning [BOK94], deformable templates [HPS94,RM94], or optical flow techniques [PM89,MP91].
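As a rough illustration of the static features named above, the following sketch computes area, perimeter, height, and width from a binary mask of the oral cavity, plus a simple frame-to-frame difference as a stand-in for a derivative feature. It assumes the mask has already been segmented from a video frame; the function names, the pixel-edge definition of perimeter, and the first-difference approximation are assumptions made for this example, not the methods of the cited systems.

```python
# Illustrative sketch only; segmentation of the oral cavity is assumed done.

def oral_cavity_features(mask):
    """Static features of a binary mask (list of rows of 0/1 values)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    pixels = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c]]
    if not pixels:
        return {"area": 0, "perimeter": 0, "height": 0, "width": 0}

    def is_set(r, c):
        return 0 <= r < rows and 0 <= c < cols and bool(mask[r][c])

    # Perimeter counted as pixel edges that border background (4-connectivity).
    perimeter = sum(
        1
        for r, c in pixels
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
        if not is_set(r + dr, c + dc)
    )
    row_ids = [r for r, _ in pixels]
    col_ids = [c for _, c in pixels]
    return {
        "area": len(pixels),                       # number of cavity pixels
        "perimeter": perimeter,
        "height": max(row_ids) - min(row_ids) + 1,  # bounding-box height
        "width": max(col_ids) - min(col_ids) + 1,   # bounding-box width
    }


def frame_deltas(values):
    """First differences of one feature across consecutive frames."""
    return [b - a for a, b in zip(values, values[1:])]


frame = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(oral_cavity_features(frame))
```

Applying `oral_cavity_features` to every frame gives the sequence of static vectors, and `frame_deltas` applied to one feature (for example, the area over time) gives a crude dynamic feature.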
There have been two basic approaches towards building a system that uses both acoustical and optical information. The first approach uses a comparator to merge the two independently recognized acoustical and optical events. This comparator may consist of a set of rules (e.g., if the top two phones from the acoustic recognizer are /t/ and /p/, then choose the one that has the higher ranking from the optical recognizer) [PBBB88] or a fuzzy logic integrator (e.g., one that provides linear weights associated with the acoustically and optically recognized phones) [Sil93,Sil94]. The second approach performs recognition using a vector that includes both acoustical and optical information. Such systems typically use neural networks to combine the optical information with the acoustic information to improve the signal-to-noise ratio before phonemic recognition [YGS89,BOK94,BHMW93,SWL92,Sil94].
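The two approaches can be contrasted in a short sketch. The rule-based comparator follows the /t/ versus /p/ example in the text; the weighted comparator is a plain linear combination that only loosely stands in for a fuzzy logic integrator; and the last function shows the joint vector that the second approach would pass to a classifier such as a neural network. All function names, the fallback behavior, and the default weight are assumptions made for illustration.

```python
# Illustrative sketch only; names, weights, and fallbacks are assumptions.

def rule_comparator(acoustic_ranking, optical_ranking):
    """First approach, rule form: among the acoustic top two, pick the
    phone that the optical recognizer ranks higher."""
    top_two = acoustic_ranking[:2]
    for phone in optical_ranking:          # best optical candidate first
        if phone in top_two:
            return phone
    return acoustic_ranking[0]             # assumed fallback: trust acoustics


def weighted_comparator(acoustic_scores, optical_scores, w_acoustic=0.5):
    """First approach, weighted form: linear combination of per-phone scores."""
    phones = set(acoustic_scores) | set(optical_scores)
    combined = {
        p: w_acoustic * acoustic_scores.get(p, 0.0)
        + (1.0 - w_acoustic) * optical_scores.get(p, 0.0)
        for p in phones
    }
    return max(combined, key=combined.get)


def joint_vector(acoustic_vector, optical_vector):
    """Second approach: one vector holding both sources, to be fed to a
    single recognizer (not implemented here)."""
    return list(acoustic_vector) + list(optical_vector)


print(rule_comparator(["t", "p", "k"], ["p", "b", "m", "t"]))
print(weighted_comparator({"t": 0.5, "p": 0.4}, {"p": 0.6, "b": 0.3}))
print(joint_vector([1.0, 2.0], [3.0]))
```

In the comparator designs the two recognizers stay independent and can be developed separately, whereas the joint-vector design lets a single model learn the interaction between the sources. In a noisy setting one would presumably lower `w_acoustic`, though how the cited systems set their weights is not described here.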
Regardless of the signal-to-noise ratio, most systems perform better using both acoustical and optical sources of information than when using only one source of information [BOK94,BHMW93,MA94,Pet84,PBBB88,Sil94,Sil93,Smi89,SWL92,YGS89]. At a signal-to-noise ratio of zero with a 500-word task, [Sil93] achieves word recognition accuracies of 38%, 22%, and 58%, respectively, using acoustical information, optical information, and both sources of information. Similarly, for a German alphabetical letter recognition task, [BHMW93] achieve recognition accuracies of 47%, 32%, and 77%, respectively, using acoustical information, optical information, and both sources of information.
In summary, most of the current systems use an optical source containing information from the oral-cavity region of a speaker (lipreading) to improve the robustness of the information from the acoustic source. Future systems will likely improve this optical source and use additional features from the facial region.