In statistically based automatic speech recognition, the speech
waveform is sampled at a rate between 6.6 kHz
and 20 kHz and processed to produce a new representation as a
sequence of vectors containing values of what are generally called
parameters. The vectors typically comprise
between 10 and 20 parameters, and are usually computed every 10 or 20
msec. These parameter values are then used in succeeding stages in
the estimation of the probability that the portion of waveform just
analyzed corresponds to a particular phonetic event that occurs in
the phone-sized or whole-word reference unit being hypothesized. In
practice, the representation and the probability
estimation interact strongly: what one
person sees as part of the representation another may see as part of
the probability estimation process. For most systems, though, we can
apply the criterion that if a process is applied to all speech it is
part of the representation, while if its application is contingent on
the phonetic hypothesis being tested it is part of the later matching
stage.
Representations aim to preserve the information needed to determine the phonetic identity of a portion of speech while being as impervious as possible to factors such as speaker differences, effects introduced by communications channels, and paralinguistic factors such as the emotional state of the speaker. They also aim to be as compact as possible.
Representations used in current speech recognizers (see the figure
below) concentrate
primarily on properties of the speech signal attributable to the
shape of the vocal tract rather than to the excitation, whether
generated by a vocal-tract constriction or by the larynx.
Representations are sensitive to whether the
vocal folds are vibrating or not (the voiced/unvoiced
distinction), but try to ignore effects due to variations in their
frequency of vibration (the fundamental frequency, f0).
Figure: Examples of representations used in current speech recognizers.
(a) time-varying waveform of the word "speech", showing changes in
amplitude (y axis) over time (x axis);
(b) speech spectrogram of (a), in terms of frequency
(y axis), time (x axis) and amplitude (darkness of the pattern); (c)
expanded waveform of the vowel "ee" (underlined in b);
(d) spectrum of the vowel "ee", in terms of amplitude (y axis) and
frequency (x axis); and (e) mel-scale spectrogram.
Representations are almost always derived from the short-term power spectrum; that is, the short-term phase structure is ignored. This is primarily because our ears are largely insensitive to phase effects. Consequently, speech communication and recording equipment often does not preserve the phase structure of the original waveform, and such equipment as well as factors such as room acoustics can alter the phase spectrum in ways that would disturb a phase-sensitive speech recognizer even though a human listener would not notice them.
The power spectrum is, moreover, almost always represented on a log scale. When the gain applied to a signal varies, the shape of the log power spectrum is preserved; the spectrum is simply shifted up or down. More complicated linear filtering, caused for example by room acoustics or by variations between telephone lines, appears as a convolutional effect on the waveform and as a multiplicative effect on the linear power spectrum, and becomes simply an additive constant on the log power spectrum. Indeed, a voiced speech waveform amounts to the convolution of a quasi-periodic excitation signal and a time-varying filter determined largely by the configuration of the vocal tract. These two components are easier to separate in the log-power domain, where they are additive. Finally, the statistical distributions of log power spectra for speech have properties convenient for statistically based speech recognition that are not shared by linear power spectra, for example. Because the log of zero is unbounded (it tends to minus infinity), there is a problem in representing very low energy parts of the spectrum. The log function therefore needs a lower bound both to limit the numerical range and to prevent excessive sensitivity to the low-energy, noise-dominated parts of the spectrum.
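As a small illustration of the two points made here (a gain change becomes a constant shift, and the log needs a lower bound), the following sketch converts a few made-up power values to decibels. The function name log_power_db and the floor value of 1e-10 are assumptions chosen for the example, not values taken from any particular recognizer.

```python
import math

def log_power_db(power, floor=1e-10):
    """Convert linear power values to dB, clamping at `floor` (an assumed value)."""
    return [10.0 * math.log10(max(p, floor)) for p in power]

spectrum = [4.0, 1.0, 0.25, 0.0]        # made-up linear power values
louder = [4.0 * p for p in spectrum]    # amplitude gain of 2 = power gain of 4

base = log_power_db(spectrum)
shifted = log_power_db(louder)

# Every non-floored component moves by the same amount (about 6.02 dB),
# so the shape of the log spectrum is unchanged.
offsets = [s - b for s, b in zip(shifted, base)]
```

The last component is zero in both spectra, so it sits on the floor in both cases and its offset is zero; this is the clamping behavior the lower bound is meant to provide.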
Before computing short-term power spectra, the waveform is usually processed by a simple pre-emphasis filter giving a 6 dB/octave increase in gain over most of its range to make the average speech spectrum roughly flat.
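A pre-emphasis filter of this kind is often realized as a first-order difference. The sketch below assumes that form; the function name pre_emphasize and the coefficient 0.97 are illustrative assumptions, not values given in the text.

```python
def pre_emphasize(samples, coeff=0.97):
    """First-order difference filter: y[n] = x[n] - coeff * x[n-1].

    `coeff` is an assumed value; it sets how strongly low frequencies
    are attenuated relative to high frequencies.
    """
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out
```

A constant (zero-frequency) input is almost cancelled, while a signal alternating in sign at the highest representable frequency is nearly doubled in amplitude, which is the rising gain characteristic described.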
The short-term spectra are often derived by taking successive
overlapping portions of the pre-emphasized waveform, typically 25 msec
long, tapering both ends with a bell-shaped window function, and
applying a Fourier transform. The resulting
power spectrum has undesirable harmonic fine structure at multiples
of the fundamental frequency, f0. This can be reduced by grouping
neighboring sets of components together to form about 20 frequency
bands before converting to log power. These bands are often made
successively broader with increasing frequency above 1 kHz, usually
according to the technical mel
frequency scale [DM80],
reflecting the frequency resolution of the human ear. A less common
alternative to the process just described is to compute the energy in
the bands directly using a bank of digital filters. The results are
similar.
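The processing of a single frame described in this paragraph can be sketched as follows. Several choices here are assumptions made for the example: a Hamming window stands in for the bell-shaped taper, the bands are rectangular (components are simply summed), and the warp function is a hypothetical stand-in that is linear up to 1 kHz and logarithmic above it, not the exact scale of [DM80]. A naive discrete Fourier transform is used to keep the sketch dependency-free.

```python
import cmath
import math

def hamming(n):
    """Bell-shaped taper (Hamming window) of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (O(N^2), fine for a sketch)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append(abs(s) ** 2)
    return spec

def warp(f_hz):
    """Hypothetical frequency warping: linear to 1 kHz, logarithmic above."""
    return f_hz if f_hz <= 1000.0 else 1000.0 * (1.0 + math.log(f_hz / 1000.0))

def unwarp(m):
    """Inverse of warp()."""
    return m if m <= 1000.0 else 1000.0 * math.exp(m / 1000.0 - 1.0)

def band_edges(n_bands, f_max):
    """Band edges in Hz, equally spaced on the warped scale."""
    step = warp(f_max) / n_bands
    return [unwarp(i * step) for i in range(n_bands + 1)]

def log_band_energies(frame, sample_rate, n_bands=20, floor=1e-10):
    """Window one frame, sum spectral power into bands, return log energies in dB."""
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]
    spec = power_spectrum(windowed)
    edges = band_edges(n_bands, sample_rate / 2.0)
    hz_per_bin = sample_rate / len(frame)
    energies = [0.0] * n_bands
    for k, p in enumerate(spec):
        f = k * hz_per_bin
        # Find the band containing f; the top edge falls into the last band.
        band = next((b for b in range(n_bands) if edges[b] <= f < edges[b + 1]),
                    n_bands - 1)
        energies[band] += p
    return [10.0 * math.log10(max(e, floor)) for e in energies]
```

For example, a 25 msec frame at an assumed 8 kHz sampling rate is 200 samples long, and log_band_energies(frame, 8000) returns 20 values. Summing neighboring components is what smooths out the harmonic fine structure; production systems commonly use overlapping, tapered bands instead of the hard-edged ones used here.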
Since the shape of the spectrum imposed by the vocal tract is smooth,
energy levels in adjacent bands tend to be correlated. Removing the
correlation allows the number of parameters to be reduced while
preserving the useful information. It also makes it easier to
compute reasonably accurate probability estimates in a subsequent
statistical matching process. The cosine
transform (a version of the Fourier
transform using only cosine basis functions) converts the set of log
energies to a set of cepstral coefficients, which turn out to
be largely uncorrelated. Compared with the number of bands,
typically only about half as many of these cepstral
coefficients need be kept. The zeroth coefficient, C0, reflects the
overall level of the log spectrum, while the remaining coefficients
describe the shape of the log spectrum independent of its overall
level: C1 measures the balance between the upper and lower
halves of the spectrum, and the higher-order coefficients are
concerned with increasingly finer features in the spectrum.
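The cosine transform step can be sketched in a few lines. The scaling used here (dividing by the number of bands, so that the zeroth coefficient equals the mean log energy) is one convention among several and is an assumption of this example, as is the function name cepstrum.

```python
import math

def cepstrum(log_energies, n_coeffs):
    """Cosine transform (DCT-II form) of a list of log band energies.

    With this assumed scaling, coefficient 0 is the mean log energy and
    coefficient 1 is positive when the lower half of the spectrum carries
    more energy than the upper half.
    """
    n = len(log_energies)
    return [
        sum(e * math.cos(math.pi * k * (j + 0.5) / n)
            for j, e in enumerate(log_energies)) / n
        for k in range(n_coeffs)
    ]
```

With 20 bands, cepstrum(log_energies, 10) keeps about half as many coefficients as there are bands. Adding a constant to every log energy changes only coefficient 0, which is why the remaining coefficients are independent of overall level.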
To the extent that the vocal tract can be regarded as a lossless unbranched acoustic tube with plane-wave sound propagation along it, its effect on the excitation signal is that of a series of resonances; that is, the vocal tract can be modeled as an all-pole filter. For many speech sounds in favorable acoustic conditions, this is a good approximation. A technique known as linear predictive coding (LPC) [MG76] or autoregressive modeling in effect fits the parameters of an all-pole filter to the speech spectrum, though the spectrum itself need never be computed explicitly. This provides a popular alternative method of deriving cepstral coefficients.
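One common way of fitting the all-pole model is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method; the text does not say which LPC variant any given system uses, and the function names are illustrative. Note that the power spectrum is never computed: the fit works from the frame's autocorrelation values.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one (windowed) frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] so that
    x[n] is approximated by sum_k a[k] * x[n-k].

    Returns (coefficients, residual_energy).
    """
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this model order.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

For a signal that decays by a factor of 0.9 per sample, a first-order fit recovers a coefficient very close to 0.9, corresponding to a single pole at that position.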
LPC has problems with certain signal degradations and is not so convenient for producing mel-scale cepstral coefficients. Perceptual Linear Prediction (PLP) combines the LPC and filter-bank approaches by fitting an all-pole model to the set of energies (or, strictly, loudness levels) produced by a perceptually motivated filter bank, and then computing the cepstrum from the model parameters [Her90].
Many systems augment information on the short-term power spectrum with information on its rate of change over time. The simplest way to obtain this dynamic information would be to take the difference between consecutive frames. However, this turns out to be too sensitive to random interframe variations. Consequently, linear trends are estimated over sequences of typically five or seven frames [Fur86b].
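Estimating a linear trend over several frames amounts to a least-squares slope fit. The sketch below uses a five-frame span (two frames either side of the current one); replicating the first and last frames at the edges is an assumption of this example, and the function name delta is illustrative.

```python
def delta(frames, half_width=2):
    """Least-squares slope of each parameter over 2*half_width + 1 frames.

    `frames` is a list of equal-length parameter vectors. Frames beyond
    either end are replaced by the nearest edge frame (an assumed policy).
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, half_width + 1))
    out = []
    for t in range(n):
        vec = []
        for d in range(dim):
            num = sum(
                k * (frames[min(n - 1, t + k)][d] - frames[max(0, t - k)][d])
                for k in range(1, half_width + 1)
            )
            vec.append(num / denom)
        out.append(vec)
    return out
```

For a parameter that rises by 2 per frame, the interior delta values are exactly 2. Because only differences between frames enter the calculation, adding a constant to every frame leaves the result unchanged.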
Some systems go further and estimate acceleration features as well as linear rates of change. These second-order dynamic features need even longer sequences of frames for reliable estimation [AH89].
Steady factors affecting the shape or overall level of the spectrum
(such as the characteristics of a particular telephone link) appear
as constant offsets in the log spectrum and cepstrum. In a
technique called blind deconvolution
[SCI75], the long-term average
cepstrum is computed and this average is subtracted from the
individual frames. This method is largely confined to non-real-time
experimental systems. Since they are based on differences, however,
dynamic features are intrinsically immune to such constant effects.
Consequently, while C0
is usually cast aside, its dynamic
equivalent, ΔC0, depending only on relative
rather than absolute energy levels, is widely used.
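Subtracting an average cepstrum from every frame can be sketched as below. The average here is taken over whatever frames are passed in (for instance a whole utterance), which is why the approach does not suit real-time use; the function name subtract_mean is illustrative.

```python
def subtract_mean(frames):
    """Subtract the per-coefficient average over all frames from each frame."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

If a fixed channel adds the same offset to every cepstral vector, the offset is absorbed into the average and removed, so the normalized frames are the same with or without the channel.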
If first-order dynamic parameters are passed through a leaky integrator, something close to the original static parameters is recovered, except that constant and very slowly varying features are reduced to zero, thus giving independence from constant or slowly varying channel characteristics. This technique, amounting to band-pass filtering of sequences of log power spectra and sometimes called RASTA, is better suited than blind deconvolution to real-time systems [HMH93]. A similar technique applied to sequences of power spectra before logs are taken is capable of reducing the effect of steady or slowly varying additive noise [HMR91].
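The idea of differencing followed by leaky integration can be sketched for a single parameter track. This is a deliberately simplified illustration, not the filter published in [HMH93]: it uses a plain frame-to-frame difference rather than a multi-frame trend estimate, so it only removes slow variation and does not smooth fast variation, and the leak value 0.98 is an assumption.

```python
def leaky_integrated_delta(track, leak=0.98):
    """Difference a parameter track, then pass it through a leaky integrator.

    state[t] = leak * state[t-1] + (x[t] - x[t-1]), with `leak` an assumed
    value slightly below 1. Works frame by frame, so it suits streaming use.
    """
    if not track:
        return []
    out = []
    prev_in = track[0]
    state = 0.0
    for x in track:
        state = leak * state + (x - prev_in)
        prev_in = x
        out.append(state)
    return out
```

A constant input produces all zeros, and a step change appears at full size and then decays gradually, so steady channel offsets vanish while changes in the signal are preserved for a while.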
Because cepstral coefficients are largely uncorrelated, a
computationally efficient method of obtaining reasonably good
probability estimates in the subsequent matching process consists of
calculating Euclidean distances from
reference model vectors after suitably weighting the coefficients.
Various weighting schemes have been used. An empirical scheme that
works well derives the weights for the first 16 coefficients from the
positive half cycle of a sine wave [JRW86].
For PLP cepstral coefficients, weighting each coefficient by
its index (root power sum (RPS) weighting), giving C0 a
weight of zero, etc., has proved effective. Statistically based
methods weight coefficients by the inverse of their standard
deviations computed about their overall means, or preferably computed
about the means for the corresponding speech sound and then averaged
over all speech sounds (so-called grand-variance
weighting)
[LMP87].
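A weighted Euclidean distance, together with two of the weighting schemes mentioned, can be sketched as follows. The function names are illustrative, and inverse_std_weights implements only the simpler variant in which standard deviations are computed about the overall means; the grand-variance variant would additionally need the frames to be labeled by speech sound.

```python
import math

def weighted_distance(x, ref, weights):
    """Euclidean distance between two vectors after per-coefficient weighting."""
    return math.sqrt(sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x, ref)))

def index_weights(n):
    """Weight each coefficient by its index, so coefficient 0 gets weight zero."""
    return [float(k) for k in range(n)]

def inverse_std_weights(vectors):
    """Weight each coefficient by the inverse of its overall standard deviation."""
    n = len(vectors)
    dim = len(vectors[0])
    weights = []
    for d in range(dim):
        mean = sum(v[d] for v in vectors) / n
        var = sum((v[d] - mean) ** 2 for v in vectors) / n
        # A coefficient that never varies carries no information here.
        weights.append(1.0 / math.sqrt(var) if var > 0 else 0.0)
    return weights
```

With index weighting, a difference in coefficient 0 contributes nothing to the distance, however large it is, while higher coefficients count progressively more.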
While cepstral coefficients are substantially uncorrelated, a technique called principal components analysis (PCA) can provide a transformation that can completely remove linear dependencies between sets of variables. This method can be used to de-correlate not just sets of energy levels across a spectrum but also combinations of parameter sets such as dynamic and static features, PLP and non-PLP parameters. A double application of PCA with a weighting operation, known as linear discriminant analysis (LDA), can take into account the discriminative information needed to distinguish between speech sounds to generate a set of parameters, sometimes called IMELDA coefficients, suitably weighted for Euclidean-distance calculations. Good performance has been reported with a much reduced set of IMELDA coefficients, and there is evidence that incorporating degraded signals in the analysis can improve robustness to the degradations while not harming performance on undegraded data [HL89].
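To make the decorrelating effect of PCA concrete, the sketch below handles the smallest possible case, two variables, where the principal axes can be found in closed form as a rotation. This is only an illustration of the principle: real systems apply PCA to much larger parameter sets using a general eigenvalue routine, and the function names here are assumptions.

```python
import math

def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def pca_rotate_2d(xs, ys):
    """Rotate mean-removed 2-D data onto its principal axes.

    Returns (u, v), where u lies along the direction of greatest variance
    and u and v are linearly uncorrelated.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cxx = covariance(xs, xs)
    cyy = covariance(ys, ys)
    cxy = covariance(xs, ys)
    # Rotation angle that zeroes the off-diagonal covariance term.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    u = [c * (x - mx) + s * (y - my) for x, y in zip(xs, ys)]
    v = [-s * (x - mx) + c * (y - my) for x, y in zip(xs, ys)]
    return u, v
```

After the rotation the covariance between the two outputs is zero to within rounding error, and most of the variance is concentrated in the first output, which is what allows later components to be discarded.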
The vast majority of major commercial and experimental systems use representations akin to those described here. However, in striving to develop better representations, wavelet transforms [Dau90] are being explored, and neural network methods are being used to provide non-linear operations on log spectral representations. Work continues on representations more closely reflecting auditory properties [Gre88] and on representations reconstructing articulatory gestures from the speech signal [SS94]. This latter work is challenging because there is a one-to-many mapping between the speech spectrum and the articulatory settings that could produce it. It is attractive because it holds out the promise of a small set of smoothly varying parameters that could deal in a simple and principled way with the interactions that occur between neighboring phonemes and with the effects of differences in speaking rate and of carefulness of enunciation.
As we noted earlier, current representations concentrate on the spectrum envelope and ignore fundamental frequency; yet we know that even in isolated-word recognition fundamental frequency contours are an important cue to lexical identity not only in tonal languages such as Chinese but also in languages such as English where they correlate with lexical stress. In continuous speech recognition fundamental frequency contours can potentially contribute valuable information on syntactic structure and on the intentions of the speaker (e.g., No, I said 2 5 7). The challenges here lie not in deriving fundamental frequency but in knowing how to separate out the various kinds of information that it encodes (speaker identity, speaker state, syntactic structure, lexical stress, speaker intention, etc.) and how to integrate this information into decisions otherwise based on identifying sequences of phonetic events.
The ultimate challenge is to match the superior performance of human listeners over automatic recognizers. This superiority is especially marked when there is little material to allow adaptation to the voice of the current speaker, and when the acoustic conditions are difficult. The fact that it persists even when nonsense words are used shows that it exists at least partly at the acoustic/phonetic level and cannot be explained purely by superior language modeling in the brain. It confirms that there is still much to be done in developing better representations of the speech signal. For additional references, see [RS78] and [Hun93].