John Makhoul
BBN Systems and Technologies, Cambridge, Massachusetts, USA
Digital signal processing (DSP) techniques have been at the heart of progress in speech processing during the last 25 years [RS78]. Simultaneously, speech processing has been an important catalyst for the development of DSP theory and practice. Today, DSP methods are used in speech analysis, synthesis, coding, recognition, and enhancement, as well as voice modification, speaker recognition, and language identification.
DSP techniques have also been very useful in written language recognition in all its forms (on-line, off-line, printed, handwritten). Some of the methods include preprocessing techniques for noise removal, normalizing transformations for line width and slant removal, global transforms (e.g., Fourier transform, correlation), and various feature extraction methods. Local features include the computation of slopes, local densities, variable masks, etc., while others deal with various geometrical characteristics of letters (e.g., strokes, loops). For summaries of various DSP techniques employed in written language recognition, the reader is referred to [IOO91,TSW90], as well as the following edited special issues: [Imp94,PM92,IS92].
This section is a brief summary of DSP techniques that are in use today, or that may be useful in the future, especially in the speech recognition area. Many of these techniques are also useful in other areas of speech processing.
In theory, it should be possible to recognize speech directly from the digitized waveform. However, because of the large variability of the speech signal, it is a good idea to perform some form of feature extraction that would reduce that variability. In particular, computing the envelope of the short-term spectrum reduces the variability significantly by smoothing the detailed spectrum, thus eliminating various source information, such as whether the sound is voiced or fricated and, if voiced, the effect of the periodicity or pitch. For nontonal languages, such as English, the loss of source information does not appear to affect recognition performance much because it turns out that the spectral envelope is highly correlated with the source information. However, for tonal languages, such as Mandarin Chinese, it is important to include an estimate of the fundamental frequency as an additional feature to aid in the recognition of tones [HYC94].
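One simple way to obtain a fundamental frequency estimate of the kind mentioned above is to pick the strongest peak of the short-term autocorrelation. The sketch below is illustrative only and is not claimed to be the method of [HYC94]; the function name `estimate_f0`, the 50-400 Hz search range, and the synthetic test tone are assumptions made for the example.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Rough F0 estimate (Hz) from the largest autocorrelation peak.

    The search range f0_min..f0_max is an assumed, typical speech range.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)

    def autocorr(lag):
        # Unnormalized autocorrelation of the frame at the given lag.
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    best_lag = max(range(lag_min, lag_max + 1), key=autocorr)
    return sample_rate / best_lag

# Usage: a synthetic 100 Hz tone sampled at 8 kHz (one 50 ms frame).
fs = 8000
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
f0 = estimate_f0(tone, fs)
```

Real speech would need voicing detection and protection against octave errors, which this sketch omits.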
To capture the dynamics of the vocal tract movements, the short-term spectrum is typically computed every 10-20 ms using a window of 20-30 ms. The spectrum can be represented directly in terms of the signal's Fourier coefficients or as the set of power values at the outputs from a bank of filters. The envelope of the spectrum can be represented indirectly in terms of the parameters of an all-pole model, using linear predictive coding (LPC), or in terms of the first dozen or so coefficients of the cepstrum, that is, the inverse Fourier transform of the logarithm of the spectrum.
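The framing step can be sketched as follows, assuming a 25 ms Hamming window advanced every 10 ms (both values fall inside the ranges quoted above; the Hamming window itself is an assumption, as the text does not name a window shape). The function names `frames` and `power_spectrum` are hypothetical, and the discrete Fourier transform is written out naively for clarity rather than speed.

```python
import cmath
import math

def frames(signal, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Yield Hamming-windowed frames of frame_ms, advanced every step_ms."""
    size = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1))
              for n in range(size)]
    for start in range(0, len(signal) - size + 1, step):
        yield [signal[start + n] * window[n] for n in range(size)]

def power_spectrum(frame):
    """Power at DFT bins 0..N/2 (naive O(N^2) DFT, for illustration only)."""
    n_pts = len(frame)
    spectrum = []
    for k in range(n_pts // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * n / n_pts)
                    for n, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

# Usage: 0.1 s of a 1000 Hz tone at 8 kHz gives 200-sample frames.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]
first_frame = next(frames(tone, fs))
spec = power_spectrum(first_frame)
peak_bin = max(range(len(spec)), key=lambda k: spec[k])  # bin 25 = 1000 Hz
```

In practice a fast Fourier transform would replace the naive loop; the frame-by-frame structure is the point here.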
One reason for computing the short-term spectrum is that the cochlea of the human ear performs a quasi-frequency analysis. The analysis in the cochlea takes place on a nonlinear frequency scale (approximated by the Bark scale or the mel scale). This scale is approximately linear up to about 1000 Hz and is approximately logarithmic thereafter. So, in the feature extraction, it is very common to warp the frequency axis after the spectral computation.
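As an illustration of such a warping, the sketch below uses one widely used analytic approximation of the mel scale, mel = 2595 log10(1 + f/700). This particular formula is an assumption of the example and is not taken from the text; other approximations exist. The helper `mel_spaced_centers` is a hypothetical name.

```python
import math

def hz_to_mel(freq_hz):
    """Assumed mel approximation: near-linear at low f, logarithmic at high f."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_centers(n_centers, low_hz, high_hz):
    """Center frequencies (Hz) that are equally spaced on the mel axis."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_centers + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_centers)]

# Usage: ten analysis bands between 0 and 4000 Hz.
centers = mel_spaced_centers(10, 0.0, 4000.0)
```

Equal steps on the mel axis map to bands that grow wider in hertz as frequency increases, which is the intended effect of the warping.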
Researchers have experimented with many different types of features for use in speech recognition [RJ93]. Variations on the basic spectral computation, such as the inclusion of time and frequency masking, have been shown to provide some benefit in certain cases [ASKT93,BA94,Her90]. The use of auditory models as the basis of feature extraction has been useful in some systems [Coh89], especially in noisy environments [HRBP91].
Perhaps the most popular features used for speech recognition today are mel-frequency cepstral coefficients (MFCCs) [DM80]. These coefficients are obtained by taking the inverse Fourier transform of the log spectrum after it is warped according to the mel scale. Additional discussion of feature extraction issues can be found in section 1.3 and section .
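A minimal sketch of this computation is given below, under several stated assumptions: the mel warping is realized with a bank of triangular filters, the mel scale uses the common approximation 2595 log10(1 + f/700), and the inverse transform of the real log spectrum is realized as a type-II discrete cosine transform. None of these choices is specified in the text, and the filter count, the number of coefficients, and all function names are hypothetical.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)  # assumed mel approximation

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel axis.

    n_bins is the number of power-spectrum bins covering 0..sample_rate/2.
    """
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = (sample_rate / 2.0) / (n_bins - 1)
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        weights = []
        for k in range(n_bins):
            f = k * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mfcc(power_spec, sample_rate, n_filters=20, n_ceps=12, floor=1e-10):
    """Log mel-band energies followed by a DCT-II; coefficient 0 is skipped."""
    bank = mel_filterbank(n_filters, len(power_spec), sample_rate)
    log_e = [math.log(max(floor, sum(w * p for w, p in zip(row, power_spec))))
             for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e))
            for c in range(1, n_ceps + 1)]

# Usage: a synthetic, smoothly decaying 129-bin power spectrum at 8 kHz.
spec = [1.0 / (1 + k) for k in range(129)]
coeffs = mfcc(spec, 8000)
```

Because coefficient 0 is omitted, multiplying the whole power spectrum by a constant gain leaves the returned coefficients unchanged, since the gain only adds a constant to every log band energy.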
Spectral distortions due to various channels, such as a different microphone or telephone, can have enormous effects on the performance of speech recognition systems. To render recognition systems more robust to such distortions, many researchers perform some form of removal of the average spectrum. In the cepstral domain, spectral removal amounts to subtracting out the average cepstrum. Typically, the average cepstrum is estimated over a period of time equal to about one sentence (a few seconds), and that average is updated on an ongoing basis to track any changes in the channel. Other similarly simple methods of filtering the cepstral coefficients have been proposed for removing channel effects [HMH93]. All these methods have been very effective in combating recognition problems due to channel effects. Further discussion of issues related to robust speech recognition can be found in section 1.4.
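Subtracting the average cepstrum can be sketched in a few lines. A fixed linear channel multiplies the spectrum, so it adds an approximately constant vector to every frame's cepstrum, and removing the mean removes that vector. The batch function below averages over all frames supplied (for example, one sentence); the running variant, with its exponential forgetting factor `alpha`, is only one assumed way of updating the average on an ongoing basis. Function names are hypothetical.

```python
def cepstral_mean_subtract(frames):
    """Subtract the per-dimension mean cepstrum computed over all frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

def running_cms(frames, alpha=0.995):
    """Online variant: track the mean with an exponentially decaying average.

    alpha is an assumed forgetting factor; values near 1 average over
    a longer stretch of speech.
    """
    mean = list(frames[0])
    out = []
    for f in frames:
        mean = [alpha * m + (1.0 - alpha) * x for m, x in zip(mean, f)]
        out.append([x - m for x, m in zip(f, mean)])
    return out

# Usage: the same two frames with and without a constant channel offset.
clean = [[1.0, 2.0], [3.0, 6.0]]
channel = [0.5, -1.0]
distorted = [[x + c for x, c in zip(f, channel)] for f in clean]
normalized_clean = cepstral_mean_subtract(clean)
normalized_distorted = cepstral_mean_subtract(distorted)
```

In this toy case the two normalized sequences coincide, which is the sense in which the method removes a stationary channel.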
For recognition systems that use hidden Markov models, it is important to be able to estimate probability distributions of the computed feature vectors. Because these distributions are defined over a high-dimensional space, it is often easier to start by quantizing each feature vector to one of a relatively small number of template vectors, which together make up what is called a codebook. A typical codebook would contain about 256 or 512 template vectors. Estimating probability distributions over this finite set of templates then becomes a much simpler task. The process of mapping a feature vector to one of a finite number of template vectors is known as vector quantization [MRG85]. The process takes a feature vector as input and finds the template vector in the codebook that is closest in distance. The identity of that template is then used in the recognition system.
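The quantization step and the resulting discrete distribution can be sketched as follows. The text does not name a distance measure, so squared Euclidean distance is assumed here; the tiny three-entry codebook and the function names are hypothetical, and codebook training is not shown.

```python
from collections import Counter

def quantize(vector, codebook):
    """Index of the codebook template nearest to vector (squared Euclidean)."""
    def sq_dist(template):
        return sum((v - t) ** 2 for v, t in zip(vector, template))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def template_frequencies(indices, codebook_size):
    """Relative frequency of each template index (unsmoothed estimate)."""
    counts = Counter(indices)
    return [counts[i] / len(indices) for i in range(codebook_size)]

# Usage: a toy 3-entry codebook in two dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
features = [[0.9, 1.2], [1.1, 0.8], [3.0, 3.5], [0.1, -0.2]]
indices = [quantize(f, codebook) for f in features]
probs = template_frequencies(indices, len(codebook))
```

A practical system would smooth these counts so that templates unseen in training do not receive zero probability.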
Historically, there has been an ongoing search for features that are resistant to speaker, noise, and channel variations. In spite of the relative success of MFCCs as basic features for recognition, there is a general belief that more can be done. One challenge is to develop ways in which our knowledge of the speech signal, and of speech production and perception, can be incorporated more effectively into recognition methods. For example, the fact that speakers have different vocal tract lengths could be used to develop more compact models for improved speaker-independent recognition. Another challenge is to integrate speech analysis into the training optimization process. For the near term, such integration will no doubt result in massive increases in computation that may not be affordable.
There have been recent developments in DSP that point to potential future use of new nonlinear signal processing techniques for speech recognition purposes. Artificial neural networks, which are capable of computing arbitrary nonlinear functions, have been explored extensively for purposes of speech recognition, usually as an adjunct to or substitute for hidden Markov models. However, it is possible that neural networks may be best utilized for the computation of new feature vectors that would rival today's best features.
Work by [MKQ92] with instantaneous energy operators, which have been shown to separate amplitude and frequency modulations, may be useful in discovering such modulations in the speech signal and, therefore, may be the source of new features for speech recognition. The more general quadratic operators proposed by [AF92] offer a rich family of possible operators that can be used to compute a large number of features with new properties, which should have some utility for speech processing in general and speech recognition in particular.
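To make the idea of an instantaneous energy operator concrete, the sketch below assumes the discrete-time Teager-Kaiser form, psi[x(n)] = x(n)^2 - x(n-1) x(n+1). The text does not spell out the operator, so this form and the function name `teager_energy` are assumptions of the example. For a pure tone A cos(Wn) the operator returns the constant A^2 sin^2(W), so its output depends on both amplitude and frequency.

```python
import math

def teager_energy(x):
    """Discrete Teager-Kaiser operator on interior samples of x.

    Returns len(x) - 2 values, one for each sample that has both neighbors.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Usage: a tone with amplitude 2 and frequency 0.3 rad/sample.
amp, omega = 2.0, 0.3
tone = [amp * math.cos(omega * n) for n in range(50)]
psi = teager_energy(tone)
expected = (amp * math.sin(omega)) ** 2  # constant predicted for a pure tone
```

Separating the amplitude and frequency contributions requires further processing of this output, which is not shown here.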