Renato De Mori
& Fabio Brugnara
McGill University, Montréal, Québéc, Canada
Istituto per la Ricerca Scientifica e Tecnologica, Trento, Italy
Modern architectures for Automatic Speech Recognition (ASR) are mostly software architectures generating a sequence of word hypotheses from an acoustic signal. The most popular algorithms implemented in these architectures are based on statistical methods. Other approaches can be found in [WL90] where a collection of papers describes a variety of systems with historical reviews and mathematical foundations.
A vector
of acoustic features is
computed every 10 to 30 msec. Details of this component can be found
in section
. Various possible choices of
vectors together with their impact on recognition performance are
discussed in [HUGN93].
Sequences of vectors of acoustic parameters are treated as
observations of acoustic word models used to compute
,
the probability of observing a sequence
of vectors when a word sequence W is pronounced.
Given a sequence
, a word sequence
is
generated by the ASR system with a search process based on the
rule:

corresponds to the candidate having maximum
a-posteriori probability (MAP).
is computed by
Acoustic Models (AM), while
is computed by
Language Models (LM).
For large vocabularies, search is performed in two steps. The first generates a word lattice of the n-best word sequences with simple models to compute approximate likelihoods in real-time. In the second step more accurate likelihoods are compared with a limited number of hypotheses. Some systems generate a single word sequence hypothesis with a single step. The search produces an hypothesized word sequence if the task is dictation. If the task is understanding then a conceptual structure is obtained with a process that may involve more than two steps. Ways for automatically learning and extracting these structures are described in [KDMM94].
In a statistical framework, an inventory of elementary probabilistic models of basic linguistic units (e.g., phonemes) is used to build word representations. A sequence of acoustic parameters, extracted from a spoken utterance, is seen as a realization of a concatenation of elementary processes described by hidden Markov models (HMMs). An HMM is a composition of two stochastic processes, a hidden Markov chain, which accounts for temporal variability, and an observable process, which accounts for spectral variability. This combination has proven to be powerful enough to cope with the most important sources of speech ambiguity, and flexible enough to allow the realization of recognition systems with dictionaries of tens of thousands of words.
A hidden Markov model is defined as a pair of stochastic processes
. The
process is a first order Markov
chain, and is not directly observable, while the
process is a
sequence of random variables taking values in the space of acoustic
parameters, or observations.
Two formal assumptions characterize HMMs as used in speech recognition. The first-order Markov hypothesis states that history has no influence on the chain's future evolution if the present is specified, and the output independence hypothesis states that neither chain evolution nor past observations influence the present observation if the last chain transition is specified.
Letting
be a variable representing observations and
be variables representing model states, the model
can be represented by the following parameters:

with the following definitions:

A useful tutorial on the topic can be found in [Rab89].
HMMs can be classified according to the nature of the elements of the B matrix, which are distribution functions.
Distributions are defined on finite spaces in the so called
discrete HMMs. In this case,
observations are vectors of symbols in a finite alphabet of N
different elements. For each one of the Q vector components, a
discrete density
is defined, and the
distribution is obtained by multiplying the probabilities of each
component. Notice that this definition assumes that the different
components are independent. Figure
shows an example of
a discrete HMM with one-dimensional
observations. Distributions are associated with model transitions.
Figure: Example of a discrete HMM. A transition probability and an
output distribution on the symbol set is associated with every
transition.
Another possibility is to define distributions as probability
densities on continuous observation spaces. In this case, strong
restrictions have to be imposed on the functional form of the
distributions, in order to have a manageable number of statistical
parameters to estimate. The most popular approach is to characterize
the model transitions with mixtures of base densities g of a family
G having a simple parametric form. The base densities
are usually Gaussian or Laplacian,
and can be parameterized by the mean vector and the covariance
matrix. HMMs with these kinds of distributions are usually
referred to as continuous HMMs.
In order to model complex distributions in this way a rather large
number of base densities has to be used in every mixture. This may
require a very large training corpus of data for the estimation of
the distribution parameters. Problems arising when the available
corpus is not large enough can be alleviated by sharing distributions
among transitions of different models. In semicontinuous
HMMs [HAJ90], for
example, all mixtures are expressed in terms of a common set of base
densities. Different mixtures are characterized only by different
weights.
A common generalization of semicontinuous modeling consists of
interpreting the input vector y as composed of several components
, each of which is associated with a different
set of base distributions. The components are assumed to be
statistically independent, hence the distributions associated with
model transitions are products of the component density functions.
Computation of probabilities with discrete models is faster than with continuous models, nevertheless it is possible to speed up the mixture densities computation by applying vector quantization (VQ) on the gaussians of the mixtures [Boc93].
Parameters of statistical models are estimated by iterative learning algorithms [Rab89] in which the likelihood of a set of training data is guaranteed to increase at each step.
[BDFK92] propose a method for extracting additional acoustic parameters and performing transformations of all the extracted parameters using a Neural Network (NN) architecture whose weights are obtained by an algorithm that, at the same time, estimates the coefficients of the distributions of the acoustic models. Estimation is driven by an optimization criterion that tries to minimize the overall recognition error.
Words are usually represented by networks of phonemes. Each path in a word network represents a pronunciation of the word.
The same phoneme can have different acoustic distributions of observations if pronounced in different contexts. Allophone models of a phoneme are models of that phoneme in different contexts. The decision as to how many allophones should be considered for a given phoneme may depend on many factors, e.g., the availability of enough training data to infer the model parameters.
A conceptually interesting approach is that of
polyphones
[STNE
92]. In principle, an allophone
should be considered for every different word in which a phoneme
appears. If the vocabulary is large, it is unlikely that there are
enough data to train all these allophone models, so models for
allophones of phonemes are considered at a different level of detail
(word, syllable, triphone, diphone, context independent
phoneme). Probability distributions for an allophone having a certain
degree of generality can be obtained by mixing the distributions of
more detailed allophone models. The loss in specificity is
compensated by a more robust estimation of the statistical parameters
due to the increasing of the ratio between training data and free
parameters to estimate.
Another approach consists of choosing allophones by
clustering possible contexts. This choice
can be made automatically with Classification and Regression
Trees (CART). A CART is a binary tree having a phoneme at the
root and, associated with each node
, a question
about the
context. Questions
are of the type, ``Is the previous phoneme
a nasal consonant?'' For each possible answer (YES or
NO) there is a link to another node with which other questions
are associated. There are algorithms for growing and pruning CARTs
based on automatically assigning questions to a node from a manually
determined pool of questions. The leaves of the tree may be simply
labeled by an allophone symbol. Papers by [BdSG
91]
and [HL91] provide examples of the application of this
concept and references to the description of a formalism for training
and using CARTs.
Each allophone model is an HMM made of states, transitions and probability distributions. In order to improve the estimation of the statistical parameters of these models, some distributions can be the same or tied. For example, the distributions for the central portion of the allophones of a given phoneme can be tied reflecting the fact that they represent the stable (context-independent) physical realization of the central part of the phoneme, uttered with a stationary configuration of the vocal tract.
In general, all the models can be built by sharing distributions taken from a pool of, say, a few thousand cluster distributions called senones. Details on this approach can be found in [HH93].
Word models or allophone models can also be built by concatenation of
basic structures made by states, transitions and distributions. These
units, called fenones, were introduced by
[BBdS
93b]. Richer models of the same type
but using more sophisticated building blocks, called
multones, are described in
[BBdS
93a].
Another approach consists of having clusters of distributions characterized by the same set of Gaussian probability density functions. Allophone distributions are built by considering mixtures with the same components but with different weights [DM94].
The probability
of a sequence of words
is
computed by a Language Model (LM). In general
can be
expressed as follows:

Motivations for this approach and methods for computing these probabilities are described in the following section.
H2>1.5.4: Generation of Word Hypotheses
Generation of word hypotheses can result in a single sequence of words, in a collection of the n-best word sequences, or in a lattice of partially overlapping word hypotheses.
This generation is a search process in which a sequence of vectors of acoustic features is compared with word models. In this section, some distinctive characteristics of the computations involved in speech recognition algorithms will be described, first focusing on the case of a single-word utterance, and then considering the extension to continuous speech recognition.
In general, the speech signal and its transformations do not exhibit clear indication of word boundaries, so word boundary detection is part of the hypothesization process carried out as a search. In this process, all the word models are compared with a sequence of acoustic features. In the probabilistic framework, ``comparison'' between an acoustic sequence and a model involves the computation of the probability that the model assigns to the given sequence. This is the key ingredient of the recognition process. In this computation, the following quantities are used:
:
and being in state i at time t
:
given that the model is in state i at time
t

:
along the best
path ending in state i at time t:

and
can be used to compute the total emission
probability
as
An approximation for computing this probability consists of following
only the path of maximum probability. This can be done with the
quantity:
The computations of all the above probabilities share a common
framework, employing a matrix called a trellis,
depicted in Figure
. For the sake of simplicity, we
can assume that the HMM in Figure
represents
a word and that the input signal corresponds to the pronunciation of
an isolated word.
Every trellis column holds the values of one of the just introduced
probabilities for a partial sequence ending at different time
instants, and every interval between two columns corresponds to an
input frame. The arrows in the trellis represent model transitions
composing possible paths in the model from the initial time instant
to the final one. The computation proceeds in a column-wise manner,
at every time frame updating the scores of the nodes in a column by
means of recursion formulas which involve the values of an adjacent
column, the transition probabilities of
the models, and the values of the output distributions for the
corresponding frame. For
and
coefficients, the
computation starts from the leftmost column, whose values are
initialized with the values of
, and ends at the opposite
side, computing the final value with (
) or
(
). For the
coefficients, the computation goes
from right to left.
The algorithm for computing
coefficients is known as the
Viterbi algorithm, and can be seen as an
application of dynamic programming for finding a maximum probability
path in a graph with weighted arcs. The recursion formula for its
computation is the following:

By keeping track of the state j giving the maximum value in the above recursion formula, it is possible, at the end of the input sequence, to retrieve the states visited by the best path, thus performing a sort of time-alignment of input frames with models states.
All these algorithms have a time complexity
, where M is
the number of transitions with non-zero probability and T is the
length of the input sequence. M can be at most equal to
,
where S is the number of states in the model, but is usually much
lower, since the transition probability matrix is generally
sparse. In fact, a common choice in speech recognition is to impose
severe constraints on the allowed state sequences, for example
for j<i, j>i+2, as is the case of the model in Figure
.
In general, recognition is based on a search process which takes into account all the possible segmentations of the input sequence into words, and the a-priori probabilities that the LM assigns to sequences of words.
Good results can be obtained with simple LMs based on
bigram or trigram probabilities. As an
example, let us consider a bigram language model. This model can be
conveniently incorporated into a finite state automaton as shown in
Figure
, where dashed arcs correspond to transitions
between words with probabilities of the LM.
Figure: Bigram LM represented as a weighted word graph.
stands for
,
stands for
. The leftmost
node is the starting node, rightmost ones are finals.
After substitution of the word-labeled arcs with the corresponding HMMs, the resulting automaton becomes a big HMM itself, on which a Viterbi search for the most probable path, given an observation sequence, can be carried out. The dashed arcs are to be treated as empty transitions, i.e., transitions without an associated output distribution. This requires some generalization of the Viterbi algorithm. During the execution of the Viterbi algorithm, a minimum of backtracking information is kept to allow the reconstruction of the best path in terms of word labels. Note that the solution provided by this search is suboptimal in the sense that it gives the probability of a single state sequence of the composite model and not the total emission probability of the best word model sequence. In practice, however, it has been observed that the path probabilities computed with the above mentioned algorithms exhibit a dominance property, consisting of a single state sequence accounting for most of the total probability [ME91].
The composite model grows with the vocabulary, and can lead to large search spaces. Nevertheless the uneven distribution of probabilities among different paths can help. It turns out that, when the number of states is large, at every time instant, a large portion of states have an accumulated likelihood which is much less than the highest one, so that it is very unlikely that a path passing through one of these states would become the best path at the end of the utterance. This consideration leads to a complexity reduction technique called beam search [NMNP92], consisting of neglecting states whose accumulated score is lower than the best one minus a given threshold. In this way, computation needed to expand bad nodes is avoided. It is clear from the naivety of the pruning criterion that this reduction technique has the undesirable property of being not admissible, possibly causing the loss of the best path. In practice, good tuning of the beam threshold results in a gain in speed by an order of magnitude, while introducing a negligible amount of search errors.
When the dictionary is of the order of tens of thousands of words, the network becomes too big, and others methods have to be considered.
At present, different techniques exist for dealing with very large
vocabularies. Most of them use multi-pass algorithms. Each pass
prepares information for the next one, reducing the size of the
search space. Details of these methods can be found in
[AHH93,ADNS94,MBDW93,KAM
94].
In a first phase a set of candidate interpretations is represented in
an object called word lattice, whose
structure varies in different systems: it may contain only hypotheses
on the location of words, or it may carry a record of acoustic scores
as well. The construction of the word lattice may involve only the
execution of a Viterbi beam-search with memorization
of word scoring and localization, as in [ADNS94], or
may itself require multiple steps, as in
[AHH93,MBDW93,KAM
94].
Since the word lattice is only an intermediate result, to be
inspected by other detailed methods, its generation is performed with
a bigram language model, and often with simplified
acoustic models.
The word hypotheses in the lattice are scored with a more accurate
language model, and sometimes with more detailed acoustic models.
Lattice rescoring may require new calculations of HMM
probabilities [MBDW93], may proceed on the basis
of precomputed probabilities only
[ADNS94,AHH93], or even exploit acoustic
models which are not HMMs [KAM
94]. In
[AHH93], the last step is based on an
search [Nil71] on the word
lattice, allowing the application of a long distance language
model, i.e., a model where the probability of
a word may not only depend on its immediate predecessor. In
[ADNS94] a dynamic
programming algorithm, using
trigram probabilities, is performed.
A method which does not make use of the word lattice is presented in
[Pau94]. Inspired by one of the first methods proposed for
continuous speech recognition (CSR)
[Jel69], it combines both powerful language
modeling and detailed acoustic modeling in a single step,
performing an
based search.
Interesting software architectures for ASR have been recently developed. They provide acceptable recognition performance almost in real time for dictation of large vocabularies (more than 10,000 words). Pure software solutions require, at the moment, a considerable amount of central memory. Special boards make it possible to run interesting applications on PCs.
There are aspects of the best current systems that still need
improvement. The best systems do not perform equally well with
different speakers and different speaking environments. Two
important aspects, namely recognition in noise and speaker
adaptation, are discussed in section
. They
have difficulty in handling out of vocabulary words, hesitations,
false starts and other phenomena typical
of spontaneous speech. Rudimentary
understanding capabilities are available for speech understanding in
limited domains. Key research challenges for the future are acoustic
robustness, use of better acoustic
features and models, use of multiple word pronunciations and
efficient constraints for the access of a very large lexicon,
sophisticated and multiple language models capable of
representing various types of contexts, rich
methods for extracting conceptual representations from word
hypotheses and automatic learning methods for extracting various
types of knowledge from corpora.