Words are usually represented by networks of phonemes. Each path in a word network represents a pronunciation of the word.
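As a minimal sketch of this idea, a word network can be stored as a small directed graph whose arcs carry phoneme labels, so that listing the paths lists the pronunciations. The word, the phoneme symbols, and the names `Network`, `TOMATO`, and `pronunciations` are illustrative assumptions, and the network is assumed to be acyclic.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: node id -> list of (phoneme label, next node).
Network = Dict[int, List[Tuple[str, int]]]

# Illustrative network for one word with two alternative vowels at node 3.
TOMATO: Network = {
    0: [("t", 1)],
    1: [("ah", 2)],
    2: [("m", 3)],
    3: [("ey", 4), ("aa", 4)],
    4: [("t", 5)],
    5: [("ow", 6)],
    6: [],  # final node, no outgoing arcs
}


def pronunciations(net: Network, start: int, final: int) -> List[List[str]]:
    """Enumerate the phoneme sequence of every path from start to final.

    Assumes the network has no cycles.
    """
    if start == final:
        return [[]]
    paths: List[List[str]] = []
    for phoneme, next_node in net[start]:
        for tail in pronunciations(net, next_node, final):
            paths.append([phoneme] + tail)
    return paths


if __name__ == "__main__":
    for path in pronunciations(TOMATO, 0, 6):
        print(" ".join(path))
```

Each printed line corresponds to one path, and hence to one pronunciation, of the word.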
The same phoneme can have different acoustic distributions of observations if pronounced in different contexts. Allophone models of a phoneme are models of that phoneme in different contexts. The decision as to how many allophones should be considered for a given phoneme may depend on many factors, e.g., the availability of enough training data to infer the model parameters.
A conceptually interesting approach is that of polyphones [STNE+92]. In principle, an allophone should be considered for every different word in which a phoneme appears. If the vocabulary is large, it is unlikely that there are enough data to train all these allophone models, so allophone models are considered at different levels of detail (word, syllable, triphone, diphone, context-independent phoneme). Probability distributions for an allophone with a given degree of generality can be obtained by mixing the distributions of more detailed allophone models. The loss in specificity is compensated by a more robust estimation of the statistical parameters, because the ratio of training data to free parameters increases.
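The mixing step can be illustrated with discrete output distributions. The sketch below is only an assumption about how such a mixture might be formed: the weights are taken to be made-up training counts, and the names `Dist`, `mix`, `tri_a`, and `tri_b` are hypothetical, not taken from the cited work.

```python
from typing import Dict, List, Tuple

# A discrete output distribution: observation symbol -> probability.
Dist = Dict[str, float]


def mix(components: List[Tuple[float, Dist]]) -> Dist:
    """Return the weighted mixture of the given distributions.

    Weights are normalized so that they sum to one.
    """
    total = sum(weight for weight, _ in components)
    mixed: Dist = {}
    for weight, dist in components:
        for symbol, prob in dist.items():
            mixed[symbol] = mixed.get(symbol, 0.0) + (weight / total) * prob
    return mixed


# Two hypothetical detailed (e.g., triphone-level) distributions of one phoneme.
tri_a: Dist = {"v1": 0.7, "v2": 0.3}
tri_b: Dist = {"v1": 0.2, "v2": 0.8}

# Assumption: each detailed model is weighted by its (invented) training count.
general = mix([(300.0, tri_a), (100.0, tri_b)])

if __name__ == "__main__":
    print(general)
```

The more general distribution is estimated from the pooled evidence of both detailed models, which is where the gain in robustness comes from.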
Another approach consists of choosing allophones by clustering possible contexts. This choice can be made automatically with Classification and Regression Trees (CART). A CART is a binary tree with a phoneme at the root and, associated with each node, a question about the context. Questions are of the type, "Is the previous phoneme a nasal consonant?" For each possible answer (YES or NO) there is a link to another node, with which further questions are associated. There are algorithms for growing and pruning CARTs that automatically assign questions to nodes from a manually determined pool of questions. The leaves of the tree may simply be labeled by an allophone symbol. Papers by [BdSG+91] and [HL91] provide examples of the application of this concept and references to the description of a formalism for training and using CARTs.
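Using an already grown tree can be sketched as follows; growing and pruning are not shown. The tree, the phoneme classes, the allophone labels, and all names (`Context`, `Node`, `Leaf`, `classify`, `TREE`) are hypothetical and serve only to illustrate how YES/NO questions about the context lead to an allophone symbol.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Context:
    previous: str   # phoneme preceding the one being modeled
    following: str  # phoneme following it


@dataclass
class Leaf:
    allophone: str  # allophone symbol labeling this leaf


@dataclass
class Node:
    question: Callable[[Context], bool]  # YES/NO question about the context
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


# Illustrative phoneme classes for the question pool.
NASALS = {"m", "n", "ng"}
LIQUIDS = {"l", "r"}

# Hypothetical tree for a single phoneme, here written "ae".
TREE = Node(
    question=lambda c: c.previous in NASALS,       # previous phoneme nasal?
    yes=Leaf("ae_1"),
    no=Node(
        question=lambda c: c.following in LIQUIDS,  # following phoneme liquid?
        yes=Leaf("ae_2"),
        no=Leaf("ae_3"),
    ),
)


def classify(tree: Union[Node, Leaf], ctx: Context) -> str:
    """Follow YES/NO links from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(ctx) else tree.no
    return tree.allophone


if __name__ == "__main__":
    print(classify(TREE, Context(previous="m", following="t")))
```

All contexts that reach the same leaf share one allophone model, which is how the tree clusters contexts.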
Each allophone model is an HMM made of states, transitions, and probability distributions. To improve the estimation of the statistical parameters of these models, some distributions can be shared, or tied. For example, the distributions for the central portion of the allophones of a given phoneme can be tied, reflecting the fact that they represent the stable (context-independent) physical realization of the central part of the phoneme, uttered with a stationary configuration of the vocal tract.
In general, all the models can be built by sharing distributions taken from a pool of, say, a few thousand cluster distributions called senones. Details on this approach can be found in [HH93].
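Tying can be pictured as states that store an index into a shared pool instead of owning a distribution. The following sketch is an illustration under simple assumptions (discrete distributions, three states per model, a pool of four entries); the names `POOL`, `MODELS`, `output_prob`, and `distinct_distributions` are hypothetical and do not reproduce the procedure of [HH93].

```python
from typing import Dict, List

# Shared pool of distributions (placeholder values); the list index is the id.
POOL: List[Dict[str, float]] = [
    {"v1": 0.9, "v2": 0.1},  # id 0
    {"v1": 0.5, "v2": 0.5},  # id 1
    {"v1": 0.1, "v2": 0.9},  # id 2
    {"v1": 0.3, "v2": 0.7},  # id 3
]

# Each allophone model has three states; each state stores a pool id.
# The central and final states are tied across the two allophones.
MODELS: Dict[str, List[int]] = {
    "ae_after_nasal": [0, 1, 2],
    "ae_after_stop": [3, 1, 2],
}


def output_prob(model: str, state: int, symbol: str) -> float:
    """Look up the output probability through the shared pool."""
    return POOL[MODELS[model][state]][symbol]


def distinct_distributions() -> int:
    """Count the distributions that actually need to be estimated."""
    return len({pool_id for ids in MODELS.values() for pool_id in ids})


if __name__ == "__main__":
    total_states = sum(len(ids) for ids in MODELS.values())
    print(total_states, "states share", distinct_distributions(), "distributions")
```

Because tied states point at the same pool entry, training data observed in either allophone contributes to the same parameters.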
Word models or allophone models can also be built by concatenating basic structures made of states, transitions, and distributions. These units, called fenones, were introduced by [BBdS+93b]. Richer models of the same type but using more sophisticated building blocks, called multones, are described in [BBdS+93a].
Another approach consists of having clusters of distributions characterized by the same set of Gaussian probability density functions. Allophone distributions are built by considering mixtures with the same components but with different weights [DM94].
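A minimal sketch of such shared-component mixtures follows, assuming one-dimensional Gaussians for brevity. The component parameters, the weight vectors, and the names `SHARED`, `WEIGHTS`, `gaussian_pdf`, and `density` are invented for illustration and are not taken from [DM94].

```python
import math
from typing import Dict, List, Tuple

# One cluster's shared Gaussian components as (mean, variance) pairs.
SHARED: List[Tuple[float, float]] = [(-1.0, 1.0), (0.0, 0.5), (2.0, 1.5)]

# Each allophone keeps only its own mixture weights over the shared components.
WEIGHTS: Dict[str, List[float]] = {
    "allo_a": [0.7, 0.2, 0.1],
    "allo_b": [0.1, 0.3, 0.6],
}


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def density(x: float, weights: List[float]) -> float:
    """Mixture density: same components for all allophones, different weights."""
    if len(weights) != len(SHARED):
        raise ValueError("one weight per shared component is required")
    return sum(w * gaussian_pdf(x, m, v) for w, (m, v) in zip(weights, SHARED))


if __name__ == "__main__":
    for name, w in WEIGHTS.items():
        print(name, density(-1.0, w), density(2.0, w))
```

Only the weight vectors are specific to an allophone, so the number of free parameters per allophone stays small while the Gaussians are estimated from all the data in the cluster.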