
1.4 Robust Speech Recognition

Richard M. Stern
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Robustness in speech recognition refers to the need to maintain good recognition accuracy even when the quality of the input speech is degraded, or when the acoustical, articulatory, or phonetic characteristics of speech in the training and testing environments differ. Obstacles to robust recognition include acoustical degradations produced by additive noise, the effects of linear filtering, nonlinearities in transduction or transmission, and impulsive interfering sources, as well as diminished accuracy caused by changes in articulation produced by the presence of high-intensity noise sources. Some of these sources of variability are illustrated in the figure below. Speaker-to-speaker differences impose a different type of variability, producing variations in speech rate, co-articulation, context, and dialect. Even systems that are designed to be speaker independent exhibit dramatic degradations in recognition accuracy when training and testing conditions differ [CH92,Jua91].


Figure: Schematic representation of some of the sources of variability that can degrade speech recognition accuracy, along with compensation procedures that improve environmental robustness.

Speech recognition systems have become much more robust in recent years with respect to both speaker variability and acoustical variability. In addition to achieving speaker independence, many current systems can also automatically compensate for modest amounts of acoustical degradation caused by the effects of unknown noise and unknown linear filtering.

As speech recognition and spoken language technologies are being transferred to real applications, the need for greater robustness in recognition technology is becoming increasingly apparent. Nevertheless, the performance of even the best state-of-the-art systems tends to deteriorate when speech is transmitted over telephone lines, when the signal-to-noise ratio (SNR) is extremely low (particularly when the unwanted noise consists of speech from other talkers), and when the speaker's native language is not the one with which the system was trained.

Substantial progress has also been made over the last decade in the dynamic adaptation of speech recognition systems to new speakers, with techniques that modify or warp the systems' phonetic representations to reflect the acoustical characteristics of individual speakers [GL91,HL93,SCK87]. Speech recognition systems have also become more robust in recent years, particularly with regard to slowly-varying acoustical sources of degradation.

In this section we focus on approaches to environmental robustness. We begin with a discussion of dynamic adaptation techniques for unknown acoustical environments and speakers. We then discuss two popular alternative approaches to robustness: the use of multiple microphones, and the use of signal processing based on models of auditory physiology and perception.

1.4.1 Dynamic Parameter Adaptation

Dynamic adaptation of either the features that are input to the recognition system, or of the system's internally stored representations of possible utterances, is the most direct approach to environmental and speaker adaptation. We discuss separately three different approaches to speaker and environmental adaptation: (1) the use of optimal estimation procedures to obtain new parameter values in the testing conditions; (2) the development of compensation procedures based on empirical comparisons of speech in the training and testing environments; and (3) the use of high-pass filtering of parameter values to improve robustness.

Optimal Parameter Estimation:

Many successful robustness techniques are based on a formal statistical model that characterizes the differences between the speech used to train and to test the system. Parameter values of these models are estimated from samples of speech in the testing environments, and either the features of the incoming speech or the internally-stored representations of speech in the system are modified. Typical structural models for adaptation to acoustical variability assume that speech is corrupted either by additive noise with an unknown power spectrum [PB84,Eph92,EW90,GY92,LBB92,BdSN92], or by a combination of additive noise and linear filtering [AS90a]. Much of the early work in robust recognition involved a re-implementation of techniques developed to remove additive noise for the purpose of speech enhancement, as reviewed in section 12.3. The fact that such approaches were able to substantially reduce error rates in machine recognition of speech even though they were largely ineffective in improving human speech intelligibility (when measured objectively) [LO79] is one indication of the limited capabilities of automatic speech recognition systems, compared to human speech perception.
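
The combined noise-and-filtering model mentioned above is commonly written in the following form. The notation here is an assumption introduced for illustration and does not come from the cited papers: x is the clean speech, h the impulse response of the unknown linear filter, n the additive noise, and y the observed speech.

```latex
% Time domain: clean speech x passes through an unknown linear filter h
% and is then corrupted by additive noise n.
\begin{align}
  y[t] &= (x * h)[t] + n[t] \\
% Power-spectral domain, assuming speech and noise are uncorrelated:
  P_y(\omega) &\approx P_x(\omega)\,\lvert H(\omega)\rvert^{2} + P_n(\omega)
\end{align}
```

Under this model, the adaptation problem amounts to estimating the unknown quantities (the filter response and the noise power spectrum) from speech observed in the testing environment.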

Approaches to speaker adaptation are similar in principle, except that the models are more commonly general statistical models of feature variability [GL91,HL93], rather than models of the sources of speaker-to-speaker variability. Solution of the estimation problems frequently requires either analytical or numerical approximations, or the use of iterative estimation techniques such as the estimate-maximize (EM) algorithm [DLR77]. These approaches have all been successful in applications where the assumptions of the models are reasonably valid, but they are limited in some cases by computational complexity.

Another popular approach is to use knowledge of background noise drawn from examples to transform the means and variances of phonetic models that had been developed for clean speech to enable these models to characterize speech in background noise [VM90,GY92]. The technique known as parallel model combination [GY92] extends this approach, providing an analytical model of the degradation that accounts for both additive and convolutional noise. These methods work reasonably well, but they are computationally costly at present, and they rely on accurate estimates of the background noise.
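
As a rough illustration of transforming clean-speech model means to account for noise, the sketch below uses the simple "log-add" approximation: means expressed as log power in each frequency channel are combined by adding the corresponding powers in the linear domain. This is a deliberate simplification and not the full parallel model combination of [GY92]; it ignores variances and convolutional effects, and the function name and example values are hypothetical.

```python
import math


def log_add_compensate(clean_log_means, noise_log_means):
    """Combine clean-speech and noise means, channel by channel.

    Both inputs are assumed to be natural-log power values. The powers
    are added in the linear domain and converted back to the log domain.
    """
    return [
        math.log(math.exp(s) + math.exp(n))
        for s, n in zip(clean_log_means, noise_log_means)
    ]


# Hypothetical three-channel example: the noise barely changes the
# high-energy channel but dominates the low-energy one.
clean = [10.0, 8.0, 2.0]
noise = [3.0, 3.0, 3.0]
noisy = log_add_compensate(clean, noise)
```

The compensated mean in each channel is never smaller than either input, so channels in which the speech is weak are pulled toward the noise level while strong channels are left nearly unchanged.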

Empirical Feature Comparison:

Empirical comparisons of features derived from high-quality speech with features of speech that is simultaneously recorded under degraded conditions can be used (instead of a structural model) to compensate for mismatches between training and testing conditions. In these algorithms, the combined effects of environmental and speaker variability are typically characterized as additive perturbations to the features. Several successful empirically-based robustness algorithms have been described that either apply additive correction vectors to the features derived from incoming speech waveforms [NW94,LSAM94] or apply additive correction vectors to the statistical parameters characterizing the internal representations of these features in the recognition system (e.g., [AMS94,LSAM94]). (In the latter case the variances of the templates may also be modified.) Recognition accuracy can be substantially improved by allowing the correction vectors to depend on SNR, specific location in parameter space within a given SNR, or presumed phoneme identity [NW94,LSAM94]. For example, the numerical difference between cepstral coefficients derived on a frame-by-frame basis from high-quality speech and simultaneously recorded speech that is degraded by both noise and filtering primarily reflects the degradations introduced by the filtering at high SNRs, and the effects of the noise at low SNRs. This general approach can be extended to cases when the testing environment is unknown a priori, by developing ensembles of correction vectors in parallel for a number of different testing conditions, and by subsequently applying the set of correction vectors (or acoustic models) from the condition that is deemed to be most likely to have produced the incoming speech. In cases where the test condition is not one of those used to train correction vectors, recognition accuracy can be further improved by interpolating the correction vectors or statistics representing the best candidate conditions.
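
The idea of SNR-dependent additive correction vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited paper: the frames are short feature vectors from a hypothetical stereo recording, the per-frame SNR is assumed to be given, and the 5 dB bin width and all function names are arbitrary choices.

```python
from collections import defaultdict


def train_corrections(clean_frames, degraded_frames, snrs_db, bin_width_db=5.0):
    """Average the (clean - degraded) feature difference within each SNR bin."""
    sums = {}
    counts = defaultdict(int)
    for clean, degraded, snr in zip(clean_frames, degraded_frames, snrs_db):
        bin_index = int(snr // bin_width_db)
        diff = [c - d for c, d in zip(clean, degraded)]
        if bin_index not in sums:
            sums[bin_index] = [0.0] * len(diff)
        sums[bin_index] = [s + x for s, x in zip(sums[bin_index], diff)]
        counts[bin_index] += 1
    return {b: [s / counts[b] for s in total] for b, total in sums.items()}


def compensate(frame, snr_db, corrections, bin_width_db=5.0):
    """Add the correction vector of the matching (or nearest trained) SNR bin."""
    bin_index = int(snr_db // bin_width_db)
    if bin_index not in corrections:
        # No stereo data was seen at this SNR: fall back to the nearest bin.
        bin_index = min(corrections, key=lambda b: abs(b - bin_index))
    return [f + c for f, c in zip(frame, corrections[bin_index])]


# Hypothetical stereo data: two high-SNR frames and one low-SNR frame.
clean = [[1.0, 2.0], [1.5, 2.5], [0.0, 1.0]]
degraded = [[0.5, 2.0], [1.0, 2.5], [1.0, 3.0]]
snrs = [22.0, 24.0, 3.0]

corrections = train_corrections(clean, degraded, snrs)
restored_high = compensate([1.0, 2.5], 21.0, corrections)
restored_low = compensate([1.0, 3.0], 1.0, corrections)
```

Making the correction depend on further variables, such as a codebook index or presumed phoneme identity, would amount to extending the dictionary key beyond the SNR bin.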

Empirically-derived compensation procedures are extremely simple, and they are quite effective in cases when the testing conditions are reasonably similar to one of the conditions used to develop correction vectors. For example, in a recent evaluation using speech from a number of unknown microphones in a 5000-word continuous dictation task, the use of adaptation techniques based on empirical comparisons of feature values reduced the error rate by 40% relative to a baseline system with only cepstral mean normalization (described below). Nevertheless, the empirical approaches have the disadvantage of requiring stereo databases of speech that are simultaneously recorded in the training environment and the testing environment.

Cepstral High-pass Filtering:

The third major adaptation technique is cepstral high-pass filtering, which provides a remarkable amount of robustness at almost zero computational cost [HMBK91,HMR91]. In the well-known RASTA method [HMBK91], a high-pass (or band-pass) filter is applied to a log-spectral representation of speech such as the cepstral coefficients. In cepstral mean normalization (CMN), high-pass filtering is accomplished by subtracting the short-term average of cepstral vectors from the incoming cepstral coefficients.

The original motivation for the RASTA and CMN algorithms is discussed elsewhere in this survey. These algorithms compensate directly for the effects of unknown linear filtering because they force the average values of cepstral coefficients to be zero in both the training and testing domains, and hence equal to each other. An extension to the RASTA algorithm known as J-RASTA [KMH94] can also compensate for noise at low SNRs. In an evaluation using 13 isolated digits over telephone lines, it was shown [KMH94] that the J-RASTA method reduced error rates by as much as 55% relative to RASTA when both noise and filtering effects are present. Cepstral high-pass filtering is so inexpensive and effective that it is currently embedded in some form in virtually all systems that are required to perform robust recognition.
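
The following sketch illustrates why mean subtraction removes the effect of a fixed linear filter. It assumes that a stationary filter appears as an approximately constant additive offset in the cepstral domain, and it subtracts the mean over the whole utterance rather than a running short-term average. The function name and the toy two-coefficient frames are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of an utterance."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / n for i in range(dim)]
    return [[frame[i] - mean[i] for i in range(dim)] for frame in frames]


# Toy cepstral frames for one utterance (two coefficients per frame).
clean = [[1.0, -0.5], [2.0, 0.5], [3.0, 0.0]]

# Assumed effect of an unknown channel: the same offset added to every frame.
channel_offset = [0.7, -0.3]
filtered = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

norm_clean = cepstral_mean_normalize(clean)
norm_filtered = cepstral_mean_normalize(filtered)
```

After normalization the two utterances agree to within floating-point rounding, and each coefficient averages to zero, which is the property the text relies on.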

1.4.2 Use of Multiple Microphones

Further improvements in recognition accuracy can be obtained at lower SNRs by the use of multiple microphones. As noted in the discussion on speech enhancement in section 12.3, microphone arrays can, in principle, produce directionally sensitive gain patterns that can be adjusted to increase sensitivity to the speaker and reduce sensitivity in the direction of competing sound sources. In fact, results of recent pilot experiments in office environments [CLP94,SS93] confirm that the use of delay-and-sum beamformers in combination with a post-processing algorithm that compensates for the spectral coloration introduced by the array itself can reduce recognition error rates by as much as 61%.
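
A delay-and-sum beamformer can be sketched in a few lines. The version below assumes that the arrival delay at each microphone is already known and is a whole number of samples; it omits the post-processing for spectral coloration mentioned above. The function name and the toy signals are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its known integer delay and average the result.

    channels: list of sample lists, one per microphone.
    delays:   arrival delay of the desired source at each microphone, in samples.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(length)
    ]


# Toy example: the same source reaches three microphones 0, 1, and 2 samples late.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
mics = [source, [0.0] + source, [0.0, 0.0] + source]
output = delay_and_sum(mics, [0, 1, 2])
```

Signals arriving from the steered direction add coherently, as in this example, whereas sounds from other directions are misaligned after the delays are applied and are attenuated by the averaging.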

Array processors that make use of the more general minimum mean square error (MMSE)-based classical adaptive filtering techniques can work well when signal degradation is dominated by additive independent noise, but they do not perform well in reverberant environments when the distortion is at least in part a delayed version of the desired speech signal [Pet89,AS90b]. (This problem can be avoided by adapting only during non-speech segments [VC90].)

A third approach to microphone array processing is the use of cross-correlation-based algorithms, which have the ability to reinforce the components of a sound field arriving from a particular azimuth angle. These algorithms are appealing because they are similar to the processing performed by the human binaural system, but thus far they have demonstrated only a modest superiority over the simpler delay-and-sum approaches [SS93].
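
One building block related to these methods is the estimation of the inter-microphone delay, which corresponds to an arrival angle, from the peak of the cross-correlation between two channels. The sketch below shows only that step and is not the algorithm evaluated in [SS93]; the function name and the toy signals are hypothetical.

```python
def estimate_delay(reference, delayed, max_lag):
    """Return the lag in samples that maximizes the cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t, r in enumerate(reference):
            u = t + lag
            if 0 <= u < len(delayed):
                score += r * delayed[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example: the second channel is the first one delayed by three samples.
first = [0.0, 0.0, 1.0, 3.0, -2.0, 0.5, 0.0, 0.0, 0.0, 0.0]
second = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.0, -2.0, 0.5, 0.0]
lag = estimate_delay(first, second, max_lag=5)
```

Given the microphone spacing and the speed of sound, such a lag could be converted to an azimuth, and components that are consistent with the chosen lag could then be emphasized.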

1.4.3 Use of Physiologically Motivated Signal Processing

A number of signal processing schemes have been developed for speech recognition systems that mimic various aspects of human auditory physiology and perception (e.g., [Coh89,Ghi88,Lyo82,Sen88,Her90,PRH91]). Such auditory models typically consist of a bank of bandpass filters (representing auditory frequency selectivity) followed by nonlinear interactions within and across channels (representing hair-cell transduction, lateral suppression, and other effects). The nonlinear processing is (in some cases) followed by a mechanism to extract detailed timing information as a function of frequency [Sen88,DLS90].

Recent evaluations indicate that auditory models can indeed provide better recognition accuracy than traditional cepstral representations when the quality of the incoming speech degrades, or when training and testing conditions differ [HL89,MZ90]. Nevertheless, auditory models have not yet been able to demonstrate better recognition accuracy than the most effective dynamic adaptation algorithms, and conventional adaptation techniques are far less computationally costly [Ohs93]. It is possible that the success of auditory models has been limited thus far because most of the evaluations were performed using hidden Markov model classifiers, which are not well matched to the statistical properties of features produced by auditory models. Other researchers suggest that we have not yet identified the features of the models' outputs that will ultimately provide superior performance. The approach of auditory modeling continues to merit further attention, particularly with the goal of resolving these issues.

1.4.4 Future Directions

Despite its importance, robust speech recognition has become a vital area of research only recently. To date, major successes in environmental adaptation have been limited either to relatively benign domains (typically with limited amounts of quasi-stationary additive noise and/or linear filtering) or to domains in which a great deal of environment-specific training data are available. Speaker adaptation algorithms have been successful in providing improved recognition for native speakers of languages other than the one with which a system is trained, but recognition accuracy obtained using non-native speakers remains substantially worse, even with speaker adaptation (e.g., [PFF95]).

At present, it is fair to say that hardly any of the major limitations to robust recognition cited in section 1.1 have been satisfactorily resolved. It is suggested that success in the following key problem areas is likely to accelerate the development and deployment of practical speech-based applications.

Speech over Telephone Lines:

Recognition of telephone speech is difficult because each telephone channel has its own unique SNR and frequency response. Speech over telephone lines can be further corrupted by transient interference and nonlinear distortion. Telephone-based applications must be able to adapt to new channels on the basis of a very small amount of channel-specific data.

Low-SNR Environments:

Even with state-of-the-art compensation techniques, recognition accuracy degrades when the channel SNR decreases below about 15 dB, even though humans can obtain excellent recognition accuracy at lower SNRs.

Co-channel Speech Interference:

Interference by other talkers poses a much more difficult challenge to robust recognition than interference by broadband noise sources. So far, efforts to exploit speech-specific information to reduce the effects of co-channel interference by other talkers have been largely unsuccessful.

Rapid Adaptation for Non-native Speakers:

In today's pluralistic and highly mobile society, successful spoken-language applications must be able to cope with the speech of non-native as well as native speakers. Continued development of non-intrusive rapid adaptation to the accents of non-native speakers will be needed to ensure commercial success.

Common Speech Corpora with Realistic Degradations:

Continued rapid progress in robust recognition will depend on the formulation, collection, transcription, and dissemination of speech corpora that contain realistic examples of the degradations encountered in practical environments. The selection of appropriate tasks and domains for shared database resources is best accomplished through the collaboration of technology developers, applications developers, and end users. The contents of these databases should be realistic enough to be useful as an impetus for solutions to actual problems, even in cases for which it may be difficult to calibrate the degradation for the purpose of evaluation.

