A principal focus of research at CSLU has been practical speech recognition using neural networks. "Practical" means that the algorithm can run in approximately real time on readily available platforms. The theoretical motivation behind this work has been the belief that knowledge sources that are known to be used by people when processing speech should also be used by recognition algorithms.
Work on robust spoken dialogues is proceeding on three fronts: (a) robust signal representations and processing strategies; (b) new approaches to modeling phonemes and words; and (c) developing tools and techniques for understanding and designing more graceful dialogues. Work in each of these areas focuses on improving and enabling robust human/machine interactions using voice.
In the area of environmental robustness, research by Hynek Hermansky, Nelson Morgan and Steve Greenberg continues to explore the potential of the RASTA signal processing technique, which has proved to be robust to changes in channel conditions. Experiments during the past year have focused on processing related to properties of the spectral trajectories, as interpreted by RASTA-like filters or by a modulation spectrum. RASTA-like temporal filtering of the amplitude compressed power spectrum of speech has been investigated for enhancing the intelligibility of noisy speech. This work has also yielded a general methodology for the design of temporal RASTA filters from clean and noisy speech data, and the methodology has been used to motivate signal processing used in speech recognition algorithms.
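The temporal filtering described above can be sketched in a few lines. This is a minimal illustration, not the filters designed in this work: it applies the commonly cited RASTA band-pass transfer function, 0.1 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1), to the log-energy trajectory of a single frequency band. The pole value 0.98 is an assumption (implementations vary), the usual four-frame time advance is omitted, and the function name `rasta_filter` is hypothetical.

```python
import math


def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter one band's log-energy trajectory (one value per frame).

    Assumed coefficients: numerator 0.1 * (2 + z^-1 - z^-3 - 2 z^-4),
    denominator 1 - pole * z^-1. The numerator taps sum to zero, so any
    constant offset in the trajectory is suppressed.
    """
    numer = [0.2, 0.1, 0.0, -0.1, -0.2]
    out = []
    prev_y = 0.0
    for n in range(len(trajectory)):
        # FIR part: a smoothed differentiator over the last five frames.
        fir = sum(b * trajectory[n - k] for k, b in enumerate(numer) if n - k >= 0)
        # IIR part: a leaky integrator with a single pole.
        y = fir + pole * prev_y
        out.append(y)
        prev_y = y
    return out


# Toy trajectory: a slow modulation standing in for speech in one band.
clean = [math.sin(2 * math.pi * 4 * n / 100.0) for n in range(1000)]
# A fixed channel gain is a constant offset in the log domain.
shifted = [x + 3.0 for x in clean]

filtered_clean = rasta_filter(clean)
filtered_shifted = rasta_filter(shifted)
```

Once the start-up transient has died away, `filtered_clean` and `filtered_shifted` agree closely, which is the sense in which this kind of filtering is insensitive to a fixed change in channel conditions.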
These researchers have also begun work on multi-band automatic speech recognition (ASR). This research is motivated by psychoacoustic and physiological data showing that perceivers are able to combine information from relatively independent frequency bands. Human perception is robust to loss of information in specific frequency bands (due to redundancy across bands), whereas ASR systems degrade dramatically when such loss occurs. It was demonstrated that by selecting the appropriate subset of sub-bands, training an independent probability estimator on each band, and then combining the estimates with a separately trained non-linear classifier, the multi-band approach can dramatically improve recognition accuracy in the presence of sinusoidal additive noise.
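The benefit of choosing which bands to trust can be illustrated with a toy example. Note the substitution: the work described above merges the per-band estimates with a separately trained non-linear classifier, whereas this sketch uses a much simpler normalized product of posteriors. The posterior values, the class labels, and the function name `combine_bands` are all hypothetical.

```python
import math


def combine_bands(band_posteriors, selected):
    """Merge per-band class posteriors over the selected bands.

    Simplified stand-in for a trained merger: multiply the posteriors of the
    selected bands (summing logs for numerical stability) and renormalize.
    """
    classes = list(band_posteriors[selected[0]])
    log_scores = {
        c: sum(math.log(band_posteriors[b][c]) for b in selected) for c in classes
    }
    peak = max(log_scores.values())
    unnorm = {c: math.exp(s - peak) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


# Hypothetical outputs of four independent per-band estimators for one frame.
band_posteriors = [
    {"yes": 0.70, "no": 0.30},
    {"yes": 0.60, "no": 0.40},
    {"yes": 0.02, "no": 0.98},  # band corrupted by a narrow-band (sinusoidal) noise
    {"yes": 0.65, "no": 0.35},
]

all_bands = combine_bands(band_posteriors, [0, 1, 2, 3])
clean_bands = combine_bands(band_posteriors, [0, 1, 3])

best_all = max(all_bands, key=all_bands.get)        # decision using every band
best_clean = max(clean_bands, key=clean_bands.get)  # decision without band 2
```

In this toy case the single corrupted band outvotes the three reliable ones when all bands are merged, while leaving it out restores the decision the clean bands agree on.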
Work on improving feature representations has been conducted under the guidance of Ron Cole, Etienne Barnard and Hynek Hermansky. A recent thesis by Philipp Schmid successfully used a new robust formant-tracking algorithm to do phoneme recognition. Zhihong Hu's thesis research is using formants and other representations to better model the dynamics of speech over time. Tim Carmell is using the theoretical framework of auditory scene analysis to discover and process the important auditory features of a speech signal. Sarel van Vuuren is working on preprocessing the signal to exploit syllable-length correlations in the structure of speech. John-Paul Hosom is experimenting with alternate modeling units called diphones that span parts of traditional phonemes in order to more effectively and directly model the coarticulation effects between phonemes.
To complement the discovery of better features, CSLU is also trying to improve the ability of neural networks to use those features. Wei Wei is studying the ways in which neural networks fail to achieve their theoretical promise of modeling posterior probabilities in order to improve on these shortcomings. Xin Tu is trying to improve the performance of neural networks by using an error criterion during training that is more relevant to the real goal of the network: accurate recognition of words.
Adapting recognizers to new speakers or conditions is an important approach to robustness. Work by Dan Burnett and Mark Fanty developed a technique to adapt very rapidly (even with just one word) by modifying front-end (signal processing) parameters so as to obtain the maximum recognition score on the adaptation utterance. A number of parameters were tried, with the best results obtained using a parameter that shifts the Bark scale offset up or down during the PLP processing. This warps the frequency mapping in a way that is intended to mimic the effects of different vocal tract lengths. The approach was tested by adapting an adult-trained digit recognizer to children's speech using the TI-digits corpus (band limited to 4 kHz). Without adaptation, the string error rate on children's speech is 9.6%. Adapting using just a single digit lowers that to 4.2%. Adapting using seven digits lowers that to 3.5%. This eliminates nearly two thirds of the errors (a relative reduction of about 64%), but still does not equal the adult error rate of under 1%.
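The adaptation step amounts to a one-dimensional search: try several Bark offsets and keep the one that gives the best recognition score on the adaptation utterance. The sketch below shows only that search. The Hz-to-Bark mapping is the one usually given for PLP, 6 asinh(f / 600); the candidate grid, the function names, and especially `toy_score` (a placeholder for a real recognizer's score) are assumptions made for illustration.

```python
import math


def hz_to_bark(freq_hz, offset=0.0):
    """PLP-style Hz-to-Bark warping with an additive offset in Bark."""
    return 6.0 * math.asinh(freq_hz / 600.0) + offset


def adapt_offset(utterance, score_fn, candidates):
    """Return the candidate offset that maximizes the recognition score."""
    return max(candidates, key=lambda off: score_fn(utterance, off))


def toy_score(utterance, offset):
    # Hypothetical stand-in for a recognizer: the score peaks at the offset
    # that best matches this speaker. A real system would re-run the front
    # end with the shifted Bark scale and score the utterance.
    return -(offset - utterance["best_offset"]) ** 2


# Candidate offsets from -2.0 to 2.0 Bark in steps of 0.25 (assumed grid).
candidates = [step / 4.0 for step in range(-8, 9)]
chosen = adapt_offset({"best_offset": 0.5}, toy_score, candidates)
```

Because only a single scalar is estimated, a very short adaptation utterance can be enough to pick a value, which is consistent with the single-digit result reported above.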
Work on graceful and robust spoken dialogue systems, undertaken by David Novick, Brian Hansen, Stephen Sutton and Karen Ward, attempts to shed light on the nature of effective communication. Brian has investigated a principled approach to the creation of system prompts; this approach was evaluated in the context of a spoken language system used to conduct a census interview, resulting in over 97% informative responses. Karen has studied prosodic cues that accompany acknowledgments. Stephen has applied dialogue knowledge by building a dialogue design tool (CSLUrp), which is part of the CSLU toolkit. These tools allow researchers to observe and/or review all stages of human-computer interaction during or after system use. For example, the researcher can view the system's recognition response and override it to perform "Wizard" studies to investigate different prompts, conversational repair strategies, and so forth.
Sangita Tibrewala and Hynek Hermansky: Multi-band and adaptation approaches to robust speech recognition, to appear in Proceedings of EUROSPEECH 97, Rhodos, Greece, September 1997.
Sarel van Vuuren and Hynek Hermansky: Data-driven design of RASTA-like filters, to appear in Proceedings of EUROSPEECH 97, Rhodos, Greece, September 1997.
Noboru Kanedera, Takayuki Arai, Hynek Hermansky and Misha Pavel: On the importance of various modulation frequencies for speech recognition, to appear in Proceedings of EUROSPEECH 97, Rhodos, Greece, September 1997.
A robust spoken language system is one that works well for many people under many conditions and fails gracefully when the demands placed upon the system exceed its capabilities. To a first approximation, spoken language systems fail because of bad speech recognition or bad dialogue modeling. Robust spoken dialogue systems require advances in each of these areas. We focus here on recognition problems.
At the level of recognition, spoken language systems typically fail because they are presented with speech input that is sufficiently different from the data with which they are trained. The sources of variability that cause systems to fail include environmental noise, channel characteristics, speaker differences and utterances that are not in the system's recognition vocabulary. Much of the research in the past has been focused on how to improve statistical modeling techniques so that variability can be modeled from vast amounts of speech data. Another approach is to analyze the main sources of variability that cause systems to fail and to perform basic research leading to more effective computational strategies.
In the area of computer speech recognition, hidden Markov modeling is the 800-pound gorilla. Most systems today rely on frame-based statistical modeling techniques. A serious limitation of these techniques is the difficulty of incorporating linguistic knowledge into the recognition paradigm. The IBM speech group, one of the pioneers of speech recognition using hidden Markov models, worked with linguists for several years to incorporate syntactic and semantic knowledge into IBM's systems, always with the same result: an increase in word recognition error rates. This led Bob Mercer, then of the IBM speech group, to assert in a keynote address to a speech recognition workshop that the most effective technique IBM has found for decreasing error rates is to fire a linguist.
The difficulty of incorporating linguistic knowledge into the dominant research paradigm stands as a major stumbling block to progress. Accurate speech recognition requires the integration of diverse acoustic cues, such as stop bursts, formant movements, changes in pitch and comparison of acoustic features across segments. Similarly, speech understanding requires the integration of these acoustic cues with syntactic, semantic, pragmatic and situational knowledge. No paradigm exists today that allows these information sources to be combined in a principled way that improves system performance. The result is that those with the most knowledge about human communication and spoken language are largely excluded from the research process.
Fortunately, the situation is changing. It is now widely recognized that new approaches are needed, and exciting research is underway worldwide to discover alternatives to purely statistical approaches. Studies of human communication are yielding new insights into speech production, perception and human conversation, and researchers are investigating how to use this knowledge to improve spoken language technology.
Survey of the State of the Art in Human Language Technology
Articles by the MIT Speech Group