Robustness in speech recognition refers to the need to maintain good
recognition accuracy even when the quality of the input speech is
degraded, or when the acoustical, articulatory, or phonetic
characteristics of speech in the training and testing environments
differ. Obstacles to robust recognition include acoustical
degradations produced by additive noise, the effects of linear
filtering, nonlinearities in transduction or transmission, and
impulsive interfering sources, as well as diminished accuracy
caused by changes in articulation produced by the presence of
high-intensity noise sources. Some of these sources of variability
are illustrated in the figure below. Speaker-to-speaker differences
impose a different type of variability, producing variations in
speech rate, co-articulation, context, and dialect. Even systems
that are designed to be speaker independent exhibit dramatic
degradations in recognition accuracy when training and testing
conditions differ [CH 92,Jua91].
Figure: Schematic representation of some of the sources of
variability that can degrade speech recognition accuracy, along with
compensation procedures that improve environmental robustness.
Speech recognition systems have become much more robust in recent years with respect to both speaker variability and acoustical variability. In addition to achieving speaker independence, many current systems can also automatically compensate for modest amounts of acoustical degradation caused by the effects of unknown noise and unknown linear filtering.
As speech recognition and spoken language technologies are being transferred to real applications, the need for greater robustness in recognition technology is becoming increasingly apparent. Nevertheless, the performance of even the best state-of-the-art systems tends to deteriorate when speech is transmitted over telephone lines, when the signal-to-noise ratio (SNR) is extremely low (particularly when the unwanted noise consists of speech from other talkers), and when the speaker's native language is not the one with which the system was trained.
Substantial progress has also been made over the last decade in the dynamic adaptation of speech recognition systems to new speakers, with techniques that modify or warp the systems' phonetic representations to reflect the acoustical characteristics of individual speakers [GL91,HL93,SCK87]. Speech recognition systems have also become more robust in recent years, particularly with regard to slowly-varying acoustical sources of degradation.
In this section we focus on approaches to environmental robustness. We begin with a discussion of dynamic adaptation techniques for unknown acoustical environments and speakers. We then discuss two popular alternative approaches to robustness, the use of multiple microphones, and the use of signal processing based on models of auditory physiology and perception.
Dynamic adaptation of either the features that are input to the recognition system, or of the system's internally stored representations of possible utterances, is the most direct approach to environmental and speaker adaptation. We discuss separately three different approaches to speaker and environmental adaptation: (1) the use of optimal estimation procedures to obtain new parameter values in the testing conditions; (2) the development of compensation procedures based on empirical comparisons of speech in the training and testing environments; and (3) the use of high-pass filtering of parameter values to improve robustness.
92],
or by a combination of additive noise and linear
filtering [AS90a]. Much of
the early work in robust recognition involved a re-implementation of
techniques developed to remove additive noise for the purpose of
speech enhancement, as reviewed in section 12.3.
The fact that such approaches were able to substantially reduce error
rates in machine recognition of speech even though they were largely
ineffective in improving human speech intelligibility (when measured
objectively) [LO79] is one indication of the
limited capabilities of automatic speech recognition systems,
compared to human speech perception.
Approaches to speaker adaptation are similar in principle, except that the models are more commonly general statistical models of feature variability [GL91,HL93], rather than models of the sources of speaker-to-speaker variability. Solution of the estimation problems frequently requires either analytical or numerical approximations, or the use of iterative estimation techniques such as the estimate-maximize (EM) algorithm [DLR77]. These approaches have all been successful in applications where the assumptions of the models are reasonably valid, but they are limited in some cases by computational complexity.
Another popular approach is to use knowledge of background noise drawn from examples to transform the means and variances of phonetic models that had been developed for clean speech, so that these models can characterize speech in background noise [VM90,GY92]. The technique known as parallel model combination [GY92] extends this approach, providing an analytical model of the degradation that accounts for both additive and convolutional noise. These methods work reasonably well, but they are computationally costly at present, and they rely on accurate estimates of the background noise.
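As a rough illustration of transforming clean-speech model means to account for background noise, the following Python sketch assumes that speech and noise add in the linear spectral domain and that each model mean is a vector of log-spectral energies. It adjusts means only; the variances and the convolutional term handled by the methods cited above are omitted. The function name combine_log_means and the numerical values are hypothetical and are not drawn from the cited work.

```python
import math


def combine_log_means(speech_log_mean, noise_log_mean):
    """Combine log-spectral means, assuming energies add in the linear domain."""
    return [
        math.log(math.exp(s) + math.exp(n))
        for s, n in zip(speech_log_mean, noise_log_mean)
    ]


# Hypothetical log-energy means for three frequency channels.
clean_speech_mean = [5.0, 2.0, 0.0]
background_noise_mean = [1.0, 2.0, 3.0]

noisy_speech_mean = combine_log_means(clean_speech_mean, background_noise_mean)
```

Under these assumptions, a channel in which speech dominates is left nearly unchanged, while a channel in which the noise dominates moves toward the noise mean. This also shows why such methods depend on an accurate estimate of the background noise.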
Empirically-derived compensation procedures are extremely simple, and they are quite effective in cases when the testing conditions are reasonably similar to one of the conditions used to develop correction vectors. For example, in a recent evaluation using speech from a number of unknown microphones in a 5000-word continuous dictation task, the use of adaptation techniques based on empirical comparisons of feature values reduced the error rate by 40% relative to a baseline system with only cepstral mean normalization (described below). Nevertheless, the empirical approaches have the disadvantage of requiring stereo databases of speech that are simultaneously recorded in the training environment and the testing environment.
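The role of the stereo database can be made concrete with a minimal Python sketch. It assumes the simplest possible form of empirical compensation: a single correction vector, estimated as the average difference between simultaneously recorded feature frames from the training and testing environments. Practical procedures typically use many correction vectors, selected according to conditions in each frame; the function names and toy feature values here are hypothetical.

```python
def estimate_correction(training_env_frames, testing_env_frames):
    """Average frame-by-frame difference between paired (stereo) recordings."""
    n_frames = len(training_env_frames)
    dim = len(training_env_frames[0])
    return [
        sum(a[k] - b[k] for a, b in zip(training_env_frames, testing_env_frames))
        / n_frames
        for k in range(dim)
    ]


def compensate(frame, correction):
    """Map a testing-environment frame toward the training environment."""
    return [x + r for x, r in zip(frame, correction)]


# Toy stereo data: the testing environment shifts each feature by a fixed
# offset plus small frame-to-frame perturbations (made-up values).
training_env = [[1.0, 2.0], [2.0, 1.0], [0.0, 3.0], [3.0, 0.0]]
offset = [0.5, -0.2]
perturbations = [[0.1, -0.1], [-0.1, 0.1], [0.1, 0.1], [-0.1, -0.1]]
testing_env = [
    [x + o + p for x, o, p in zip(frame, offset, pert)]
    for frame, pert in zip(training_env, perturbations)
]

correction = estimate_correction(training_env, testing_env)
restored = compensate([1.5 + 0.5, 1.5 - 0.2], correction)  # new test frame
```

The sketch relies on paired frames throughout, which is the practical limitation noted above: without simultaneous recordings there is nothing from which to estimate the correction.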
The original motivation for the RASTA and CMN
algorithms is discussed in section
. These algorithms compensate directly for the effects of unknown linear
filtering because they force the average values of cepstral
coefficients to be zero in both the training and testing domains, and
hence equal to each other. An extension to the RASTA algorithm
known as J-RASTA [KMH 94] can also compensate for noise at low
SNRs. In an evaluation using 13 isolated digits over telephone
lines, it was shown [KMH 94] that the J-RASTA method reduced
error rates by as much as 55 percent relative to RASTA when both
noise and filtering effects are present.
Cepstral high-pass filtering is so inexpensive
and effective that it is currently embedded in some form in virtually
all systems that are required to perform robust recognition.
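The zero-mean property can be illustrated with a minimal Python sketch of cepstral mean normalization. It assumes that an utterance is a list of cepstral frames and that an unknown linear filter appears as a constant offset added to every frame in the cepstral domain. The function name cepstral_mean_normalize and the toy values are illustrative and are not taken from any of the systems cited here.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient utterance mean from every cepstral frame."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy utterance: 4 frames of 3 cepstral coefficients (made-up values).
clean = [[1.0, 0.5, -0.2], [0.8, 0.1, 0.3], [1.4, -0.3, 0.0], [0.6, 0.2, -0.5]]

# Assumed channel: a fixed linear filter adds the same offset to every frame.
channel_offset = [0.7, -0.4, 0.25]
filtered = [[c + h for c, h in zip(frame, channel_offset)] for frame in clean]

normalized_clean = cepstral_mean_normalize(clean)
normalized_filtered = cepstral_mean_normalize(filtered)
```

Under this assumption the two normalized utterances agree up to rounding error, so the constant channel term no longer distinguishes the training and testing conditions. Additive noise does not behave as a constant cepstral offset, which is consistent with the need for extensions at low SNRs.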
Further improvements in recognition accuracy can be obtained at lower
SNRs by the use of multiple
microphones. As noted in the discussion
on speech enhancement in section 12.3, microphone
arrays can, in principle, produce
directionally sensitive gain patterns that can be adjusted to
increase sensitivity to the speaker and reduce sensitivity in the
direction of competing sound sources. In fact, results of recent
pilot experiments in office environments [CLP 94,SS93] confirm
that the use of delay-and-sum beamformers in combination with a
post-processing algorithm that compensates for the spectral
coloration introduced by the array itself can reduce recognition
error rates by as much as 61%.
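The principle of delay-and-sum beamforming can be sketched in a few lines of Python. The sketch assumes integer sample delays that are known in advance, a three-microphone array, and pure tones standing in for the talker and the competing source; all names and numerical values are hypothetical. The post-processing stage that compensates for spectral coloration is not included.

```python
import math


def delay_and_sum(channels, steering_delays):
    """Advance each channel by its steering delay (in samples) and average."""
    n_out = min(len(ch) - d for ch, d in zip(channels, steering_delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, steering_delays)) / len(channels)
        for n in range(n_out)
    ]


def tone(period, t):
    """Unit-amplitude sinusoid with the given period in samples."""
    return math.sin(2.0 * math.pi * t / period)


def rms(values):
    """Root-mean-square value of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


n_samples = 400
target_delays = [0, 3, 6]        # assumed arrival delays of the talker
interferer_delays = [0, 10, 20]  # assumed arrival delays of the competing source

# Each microphone receives the talker plus the interferer, each with its own delay.
channels = [
    [tone(50, n - td) + tone(30, n - nd) for n in range(n_samples)]
    for td, nd in zip(target_delays, interferer_delays)
]

output = delay_and_sum(channels, target_delays)

# Residual interference at a single microphone and after beamforming.
single_mic_error = [channels[0][n] - tone(50, n) for n in range(len(output))]
beamformed_error = [output[n] - tone(50, n) for n in range(len(output))]
```

Steering by the talker's delays aligns the talker's signal across channels so that it adds coherently, while the interferer remains misaligned and is partly cancelled by the averaging. Comparing rms(beamformed_error) with rms(single_mic_error) gives a simple measure of the attenuation in this toy configuration.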
Array processors that make use of the more general minimum mean square error (MMSE)-based classical adaptive filtering techniques can work well when signal degradation is dominated by additive independent noise, but they do not perform well in reverberant environments, where the distortion is at least in part a delayed version of the desired speech signal [Pet89,AS90b]. (This problem can be avoided by adapting only during non-speech segments [VC90].)
A third approach to microphone array processing is the use of cross-correlation-based algorithms, which have the ability to reinforce the components of a sound field arriving from a particular azimuth angle. These algorithms are appealing because they are similar to the processing performed by the human binaural system, but thus far they have demonstrated only a modest superiority over the simpler delay-and-sum approaches [SS93].
A number of signal processing schemes have been developed for speech
recognition systems that mimic various aspects of human auditory
physiology and perception (e.g.,
[Coh89,Ghi88,Lyo82,Sen88,Her90,PRH 91]). Such auditory models
typically consist of a bank of bandpass filters (representing
auditory frequency selectivity)
followed by nonlinear interactions within and across channels
(representing hair-cell transduction, lateral suppression,
and other effects). The nonlinear processing is (in some cases)
followed by a mechanism to extract detailed timing information as a
function of frequency [Sen88,DLS90].
Recent evaluations indicate that auditory models can indeed provide better recognition accuracy than traditional cepstral representations when the quality of the incoming speech degrades, or when training and testing conditions differ [HL89,MZ90]. Nevertheless, auditory models have not yet been able to demonstrate better recognition accuracy than the most effective dynamic adaptation algorithms, and conventional adaptation techniques are far less computationally costly [Ohs93]. It is possible that the success of auditory models has been limited thus far because most of the evaluations were performed using hidden Markov model classifiers, which are not well matched to the statistical properties of features produced by auditory models. Other researchers suggest that we have not yet identified the features of the models' outputs that will ultimately provide superior performance. The approach of auditory modeling continues to merit further attention, particularly with the goal of resolving these issues.
Despite its importance, robust speech recognition has become a vital
area of research only recently. To date, major successes in
environmental adaptation have been limited either to relatively
benign domains (typically with limited amounts of quasi-stationary
additive noise and/or linear filtering) or to domains in which a
great deal of environment-specific training data are available.
Speaker adaptation algorithms have been successful in providing
improved recognition for native speakers of languages other than the
one with which a system is trained, but recognition accuracy obtained
using non-native speakers remains substantially worse, even with
speaker adaptation (e.g., [PFF 95]).
At present, it is fair to say that hardly any of the major limitations to robust recognition cited in section 1.1 have been satisfactorily resolved. It is suggested that success in the following key problem areas is likely to accelerate the development and deployment of practical speech-based applications.
Continued rapid progress in robust recognition will depend on the
formulation, collection, transcription, and dissemination of speech
corpora that contain realistic examples of the
degradations encountered in practical environments. The selection of
appropriate tasks and domains for shared database resources is best
accomplished through the collaboration of technology developers,
applications developers, and end users. The contents of these
databases should be realistic enough to be useful as an impetus for
solutions to actual problems, even in cases for which it may be
difficult to calibrate the degradation for the purpose of
evaluation.