The goal of a speech recognition algorithm is to map a speech signal to the words that were spoken. The speech research community has proposed many different approaches to this problem. The recognizer described here first recognizes phonemes (the basic sounds of speech) and then describes each word by its pronunciation in terms of these phonemes. In this chapter we will learn how to define phoneme probability estimators within the development environment. Although we attempt to keep things simple, some knowledge of how speech units are modeled will definitely help in understanding the process involved in defining phoneme probability estimators.
The syntax rules for the phoneme models are as follows. Each output of the phoneme probability estimator is represented by a model. The collection of models describes the relation between the word pronunciation in terms of the base symbol set and the outputs of the particular phoneme probability estimator. In the simplest case each phoneme is represented by a corresponding context independent model, or monophone. However, there can be a wide range of variation in the way that phonemes are realized, and much of this variation depends on context. As a result monophones are poor discriminators. One way to improve discrimination is to use context dependent models. Such models are called triphones if they account for both the left and right context, or biphones if they account for either the left or the right context but not both. Currently only biphone modeling is supported.
Each model is defined by its nucleus (phoneme) and an optional left or right context.
phoneme = char{char}
Here char represents any character except one of the meta characters <> $ = ;. These may however be escaped using a single backslash. The left or right context of the model may be defined either as a single phoneme, or as a list of phonemes.
phoneme_list = phoneme{`` '' phoneme}
Variables are defined by a leading $ character. A variable defines a list or grouping of phonemes which can be used to describe the left or right context of a particular biphone model.
name = char{char}
variable = $name
context = variable ``='' phoneme_list ``;''
A biphone model may therefore be dependent on a particular phoneme to the left or right of the nucleus (center phoneme), or on a group of phonemes defined using variables. The following syntax can then be used to define a particular model.
model = ``<'' phoneme ``>'' |
phoneme ``<'' phoneme |
phoneme ``>'' phoneme |
variable ``<'' phoneme |
phoneme ``>'' variable
The angle brackets <> denote either the left or the right context. Context independent models are defined with angle brackets on both sides of the center phoneme. Using this syntax a phoneme can be modeled as (a) a one part model (context independent, left dependent or right dependent), (b) a two part model (a left biphone model followed by a right biphone model) or (c) a three part model (a left biphone model followed by a context independent model followed by a right biphone model).
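The model syntax above can be illustrated with a small classifier. This is only a sketch of how a single model token might be recognized, not the toolkit's own parser; the names model_kind, find_unescaped and classify_model are hypothetical, and the handling of the backslash escape follows the description of meta characters given earlier.

```c
#include <stddef.h>
#include <string.h>

/* Kinds of one part model allowed by the grammar (hypothetical names). */
typedef enum {
    MODEL_INVALID,
    MODEL_CONTEXT_INDEPENDENT, /* <A>              */
    MODEL_LEFT_DEPENDENT,      /* x<A  or  $var<A  */
    MODEL_RIGHT_DEPENDENT      /* A>x  or  A>$var  */
} model_kind;

/* Index of the first unescaped occurrence of c in s, or -1 if none.
   A backslash escapes the character that follows it. */
static int find_unescaped(const char *s, char c)
{
    int i;
    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] == '\\' && s[i + 1] != '\0') {
            i++;               /* skip the escaped character */
            continue;
        }
        if (s[i] == c)
            return i;
    }
    return -1;
}

/* Classify one whitespace-free model token such as "<A>" or "$nas<A". */
model_kind classify_model(const char *tok)
{
    int lt = find_unescaped(tok, '<');
    int gt = find_unescaped(tok, '>');
    size_t len = strlen(tok);

    if (lt == 0 && len > 2 && gt == (int)len - 1)
        return MODEL_CONTEXT_INDEPENDENT;
    if (lt > 0 && gt < 0 && (size_t)lt < len - 1)
        return MODEL_LEFT_DEPENDENT;
    if (gt > 0 && lt < 0 && (size_t)gt < len - 1)
        return MODEL_RIGHT_DEPENDENT;
    return MODEL_INVALID;
}
```

With this sketch, classify_model("$nas<A") yields MODEL_LEFT_DEPENDENT, and a token with an escaped meta character such as \>r>T (written "\\>r>T" in C) is treated as a right dependent model whose nucleus is \>r.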
The complete description of the outputs of the phoneme estimator is defined by a list of models, where the order of the list corresponds to the order of the outputs of the particular phoneme probability estimator.
model_list = model{`` '' model}
estimator = ``define'' model_list ``;''
Multiple define commands may be used throughout the probability estimator definition file. The final set of models will then be defined as the concatenation of each list of models defined by the separate instances of the define command.
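Because the order of the models fixes the order of the estimator outputs, the position of a model in the concatenated lists can be computed directly. The following is a minimal sketch under stated assumptions: the function output_index and the array example_defines are hypothetical, output indices are assumed to be zero-based, and each list is assumed to hold only the models between define and the closing semicolon.

```c
#include <stddef.h>
#include <string.h>

/* Two hypothetical define lists, as if split over two define commands. */
static const char *const example_defines[] = {
    "<.pau> <j> <E>",
    "<s> <n> <oU>"
};

/* Return the zero-based output index of `model` in the concatenation of
   all lists, or -1 if the model is not defined. Models are separated by
   spaces, tabs or newlines. */
int output_index(const char *const *lists, size_t nlists, const char *model)
{
    size_t want = strlen(model);
    int index = 0;
    size_t i;

    if (want == 0)
        return -1;
    for (i = 0; i < nlists; i++) {
        const char *p = lists[i];
        p += strspn(p, " \t\n");              /* skip leading separators */
        while (*p != '\0') {
            size_t len = strcspn(p, " \t\n"); /* length of this model */
            if (len == want && strncmp(p, model, want) == 0)
                return index;
            index++;
            p += len;
            p += strspn(p, " \t\n");
        }
    }
    return -1;
}
```

For the two example lists, output_index(example_defines, 2, "<s>") returns 3, since the second list continues the numbering of the first.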
In the Viterbi search, the length of a particular phoneme determines how important it is to the overall score. Because they are only a short component of the path, very short phonemes influence the score less than long ones, which can lead to misrecognitions. A significant reduction in errors results if we impose minimum and maximum durations for phonemes, although too-long phonemes are a less frequent source of errors.
Rather than apply absolute minimums or maximums, our recognizers impose a per-frame penalty on the score for segments which are too long or too short. This gives the recognizer some flexibility in overcoming poor word models or sloppy articulation where a segment may really be missing. These duration limits are derived from "generic" samples of the phonemes. The following syntax defines a duration model.
mindur = digit{digit}
maxdur = digit{digit}
duration = phoneme mindur maxdur |
model mindur maxdur
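The per-frame penalty described above can be sketched as follows. This is an illustration only: the function name duration_penalty, the frame length and the penalty weight are assumptions that do not come from the toolkit, and the sketch assumes that a partial frame outside the limits counts as a whole frame.

```c
/* Hypothetical per-frame duration penalty.
   dur_ms            : hypothesized segment duration in milliseconds
   min_ms, max_ms    : limits from the duration model
   frame_ms          : frame length in milliseconds (assumed, must be > 0)
   penalty_per_frame : score penalty per offending frame (assumed) */
double duration_penalty(int dur_ms, int min_ms, int max_ms,
                        int frame_ms, double penalty_per_frame)
{
    int excess_ms = 0;
    int frames;

    if (dur_ms < min_ms)
        excess_ms = min_ms - dur_ms;      /* segment is too short */
    else if (dur_ms > max_ms)
        excess_ms = dur_ms - max_ms;      /* segment is too long */

    /* Round the excess up to a whole number of frames. */
    frames = (excess_ms + frame_ms - 1) / frame_ms;
    return frames * penalty_per_frame;
}
```

A segment inside its limits is not penalized, while one outside them is penalized in proportion to how far it strays, so the search can still accept a very short segment when the acoustic evidence is strong enough.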
Minimum and maximum durations are specified in milliseconds. Model durations which are not specified are calculated from the base phoneme duration models using the transformation rules described in Table 4.1.
Table 4.1: Duration model rules for calculating duration models from base phoneme durations.
The table reads as follows. For the two-part phoneme case, the first line describes a phoneme whose first part is a left-dependent model (x<A) and whose second part is a right-dependent model (A>x). Based on Table 4.1, the durations computed for these models would therefore be:
The following two cases are a bit unconventional. The second line describes a phoneme whose first part is a left-dependent model (x<A) and whose second part is a context-independent model (<A>). The durations computed for these models would therefore be:
Finally, the third line describes a phoneme whose first part is a context-independent model (<A>) and whose second part is a right-dependent model (A>x). The durations computed for these models would therefore be:
In order to fully specify the models described, a duration model needs to be specified at least for each of the unique phoneme nuclei. If minimum and maximum durations are specified for some of the predefined models, then these durations will override the default durations which are calculated from the base phoneme minimum and maximum durations. The list of duration models is specified using the duration command.
duration_list = duration{`` '' duration}
duration_models = ``duration'' duration_list ``;''
Multiple duration commands may be used throughout the phoneme probability description file. The final set of duration models will then be defined as the concatenation of each set of duration models defined.
When designing phoneme probability estimators it sometimes happens that certain models need to be defined, but can not be trained due to a lack of training data. The tie command can then be used to tie this particular model to a model having similar characteristics. The tie command ties a model to a model, or a list of models to a model.
tied_models = ``tie'' model model_list ``;''
Similar to the define and duration commands, multiple tie commands may be used throughout the phoneme probability description file. The final set of tied models will then be defined as the concatenation of the sets defined by each of the individual tie commands.
Phoneme mapping allows the complete sharing of models defined with the same nuclei. For example, all models defined using the phoneme ``I_x'' will have much the same characteristics as models defined using the phoneme ``I'', with the main difference relating to the underlying duration models of each. Using the map command we only need to define models for one of the symbols (let us say ``I''), and then all models can be duplicated for ``I_x'' by mapping ``I_x'' to ``I''.
The map command maps either a phoneme to a phoneme, or a list of phonemes to a phoneme.
mappings = ``map'' phoneme phoneme_list ``;''
Multiple map commands may be used throughout the probability names description file. The final list of phoneme mappings is formed as the concatenation of all mappings defined by each of the individual map commands.
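The effect of a mapping can be pictured as a simple lookup from a phoneme to the phoneme whose models it shares. The sketch below is not the toolkit's implementation; the type phoneme_map, the function resolve_phoneme and the table example_maps are hypothetical, and the table assumes a command of the form map I I_x; in which the first phoneme is the one that actually has models.

```c
#include <stddef.h>
#include <string.h>

/* One entry per mapped phoneme (hypothetical representation). */
typedef struct {
    const char *target;  /* phoneme whose models are shared */
    const char *source;  /* phoneme that is mapped onto the target */
} phoneme_map;

/* Table corresponding to the assumed command "map I I_x;". */
static const phoneme_map example_maps[] = {
    { "I", "I_x" }
};

/* Return the phoneme whose models should be used for ph.
   Phonemes without a mapping are returned unchanged. */
const char *resolve_phoneme(const phoneme_map *maps, size_t n, const char *ph)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (strcmp(maps[i].source, ph) == 0)
            return maps[i].target;
    }
    return ph;
}
```

Only the choice of models is redirected in this sketch. As noted above, the duration models of the two phonemes may still differ, so a duration lookup would keep using the original symbol.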
This section describes the process involved in defining the base word
pronunciations for the words ``yes'' and ``no'' in terms of a simple
context independent phoneme recognizer which can recognize only the
phonemes /silence j E s n oU/. Since all of these phonemes are
context independent, each phoneme is uniquely described as consisting
of only one part. For context dependent modeling, phonemes are
typically described as having two or even three parts.
The buildNames function creates a structure which contains all the information needed for pronouncing words according to the output categories of the specific phoneme classifier. This structure will be referred to as the probability estimator names description object, or names object for short. The source code discussed below can be found in the file yesno.c.
char *yesno_net = "\
define <.pau> <j> <E> <s> <n> <oU>;\
duration .pau 49 5000 \
j 18 177 \
E 40 226 \
s 49 270 \
n 28 193 \
oU 59 397; ";
.
.
main()
{
.
.
probnameT *p;
.
.
p = buildNames(result, yesno_net);
.
.
}
A complete listing of duration models for the base worldbet symbols can be found in the general purpose recognizer description file (mark1.desc).
In the example above, the names object consists of the phonemes
/.pau j E s n oU/, with corresponding neural network category names
/<.pau> <j> <E> <s> <n> <oU>/. This object can be used to create
word models which are described in terms of the neural network output
classes rather than the base worldbet symbols.
main()
{
.
.
wordT *words[2];
wordT *xwords[2];
.
.
/* create yes no word models */
for(i=0; i<2; i++)
words[i] = buildWord(result, yesno[i].pro, yesno[i].name);
/* build context expanded word models */
for(i=0; i<2; i++) {
xwords[i] = expandWord(result, words[i], p);
printWord(xwords[i]);
}
}
The yesno example above produces the following output describing the pronunciation models in terms of the probability names estimator defined.
system prompt% yesno
<j> <E> <s>
<n> <oU>
system prompt%
Things become a little more complicated when dealing with context
dependent phonetic modeling. The file mark1.desc contains the
model definitions needed for the general purpose recognizer supplied
with the development environment. We will refer to this file,
to explain the process involved.
The first phase consists of deciding how each phoneme will be described, namely which are one part phonemes, which are two part phonemes, and so on. This depends very much on the design of the phoneme probability estimator. For our purposes we have chosen to model all English phonemes as a collection of one part (i.e. context independent), two part, three part and also right dependent phonemes. Right dependent phonemes are phonemes that are influenced only by the phonemes, or context, to their right.
Table 4.2: General purpose recognizer design.
For each of the multipart phonemes, we need to decide upon the left and right context of each phoneme. In the most specific case we would have each phoneme dependent on each other phoneme. This is however not necessary. Studying the contextual influences of phonemes, we find that phonemes can be grouped according to their influence on the phoneme in question. For example, in the general purpose recognizer the phoneme /A/ has three parts: a left context part, a context independent part and a right context part. For both the left and the right context we find that the phonemes /m n N N= n=/ all have roughly the same contextual influence on the base phoneme /A/. These can therefore be grouped together, as is done in the general purpose recognizer.
$sil = .pau pc kc tc tSc dZc .garbage;         /* silence models */
$bck = \> A aU oU w l l= U U_x;                /* back vowels */
$mid = E @ ^ &;                                /* mid vowels */
$fnt = u i: I_x I ei aI \>i j;                 /* front vowels */
$ret = 9r 3r &r;                               /* retro flex */
$nas = m n N N= n=;                            /* nasals */
$obs = ph kh th h s z Z S T f ts dZ;           /* obstruents */
$wev = dZc bc dc gc b d g D t( d( d_( th_( v;  /* what ever */
.
.
.
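To make the role of these groupings concrete, the sketch below looks up the group to which a neighbouring phoneme belongs, which is the information needed to pick a model such as $nas<A. The code is illustrative only: the type context_group and the functions has_member and find_group are hypothetical, the table simply copies the groupings listed above (with the backslash written as it appears in the description file), and returning the first matching group is an assumption, since dZc is listed in both $sil and $wev.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;     /* variable name, including the leading $ */
    const char *members;  /* space separated phonemes */
} context_group;

/* Groupings copied from the description file excerpt. */
static const context_group groups[] = {
    { "$sil", ".pau pc kc tc tSc dZc .garbage" },
    { "$bck", "\\> A aU oU w l l= U U_x" },
    { "$mid", "E @ ^ &" },
    { "$fnt", "u i: I_x I ei aI \\>i j" },
    { "$ret", "9r 3r &r" },
    { "$nas", "m n N N= n=" },
    { "$obs", "ph kh th h s z Z S T f ts dZ" },
    { "$wev", "dZc bc dc gc b d g D t( d( d_( th_( v" }
};

/* Return 1 if ph is one of the whole phonemes in the list, else 0. */
static int has_member(const char *members, const char *ph)
{
    size_t want = strlen(ph);
    const char *p = members;

    if (want == 0)
        return 0;
    while (*p != '\0') {
        size_t len = strcspn(p, " ");     /* length of this phoneme */
        if (len == want && strncmp(p, ph, want) == 0)
            return 1;
        p += len;
        p += strspn(p, " ");              /* skip separators */
    }
    return 0;
}

/* Return the name of the first group containing ph, or NULL. */
const char *find_group(const char *ph)
{
    size_t i;
    for (i = 0; i < sizeof groups / sizeof groups[0]; i++) {
        if (has_member(groups[i].members, ph))
            return groups[i].name;
    }
    return NULL;
}
```

Whole-phoneme comparison matters here: n and n= are distinct symbols, so a plain substring search would give wrong answers.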
In many cases we find that some phonemes are very hard to recognize as independent entities, because they have much the same characteristics as other, very similar phonemes. The neural network therefore has a hard time distinguishing among them. One of the design steps is to identify these phonemes and then map them to a single representative phoneme. This information is also needed when describing the word pronunciation models according to the specific phoneme probability estimator. For example, all voiced closures /bc dc gc dZc/ can be mapped to a general voiced closure /vc/.
.
.
.
/* map some phonemes to what we have */
map vc bc dc gc dZc;   /* voiced closure */
map uc pc tc kc tSc;   /* unvoiced closure */
map A \>;
map l l;
map n n= N N=;
map m m=;
map ^ &;
map 3r &r;
map s Z;
map A 5;
.
.
.
The next phase is to define the output probability names according to the syntax specified above. For our general purpose recognizer, the neural network output class names are defined as follows:
.
.
.
/* now define the outputs of the nnet */
define <.pau> <.br> <vc> <uc>;
define $obs<f $fnt<f $bck<f $sil<f $wev<f $nas<f $mid<f $ret<f
f>$fnt f>$bck f>$sil f>$wev f>$nas f>$mid f>$obs f>$ret;
define $obs<v $fnt<v $bck<v $sil<v $wev<v $nas<v $mid<v $ret<v
v>$fnt v>$bck v>$sil v>$wev v>$nas v>$mid v>$obs v>$ret;
define $obs<T $fnt<T $bck<T $sil<T $wev<T $nas<T $mid<T $ret<T
T>$fnt T>$bck T>$sil T>$wev T>$nas T>$mid T>$obs T>$ret;
define $obs<D $fnt<D $bck<D $sil<D $wev<D $nas<D $mid<D $ret<D
D>$fnt D>$bck D>$sil D>$wev D>$nas D>$mid D>$obs D>$ret;
.
.
.
We also need to define the set of duration models for each of our models defined above. In this case we define the duration models for the underlying base phonemes, and rely on the automatic table translation to compute the duration models for the models specified.
.
.
.
/* specify base duration models */
duration .garbage 49 5000
.pau 10 5000
dZc 20 261
bc 23 172
dc 15 153
gc 20 163
pc 30 163
tc 20 168
kc 25 159
tSc 20 168
f 37 247
v 26 172
.
.
.
Given all these information sources we are now ready to define the probability names description object. These data are then used when defining word pronunciation models in terms of the outputs of the neural network category names.
The file digit.desc contains the model definitions needed for the continuous digit recognizer supplied with the development environment. With the exception of the following groupings
$sil = .pau tc .garbage;
$denl = s ks th;
$denr = s th;
the digit recognizer was designed using full context dependent models for each of the phonemes which occur in continuous digit strings. Full context dependent modeling consists of a set of biphones where each context of the center phoneme is a single phoneme, rather than a grouping of phonemes as found in the previous example.
.
.
.
define f<\>r <\>r> \>r>$sil \>r>T \>r>ei \>r>f \>r>w \>r>z \>r>oU
\>r>n \>r>$denr;
define v<^2 ^2>n n<^3 ^3>$sil ^3>ei ^3>f ^3>w ^3>oU;
.
.
Note that the models are defined using a combination of single
phonemes as either the left or right context as well as variables
denoting phoneme groupings.
Full context dependent modeling however has the disadvantage that not all possible context dependent models may be found in the training set. The tie command may therefore be used to tie models for which no data could be found in the training set to models which are similar enough and for which sufficient training data could be found.
.
.
/* define here all models which are not in nnet and tie them to
the silence context dependent model */
define ^3<z
^3<T
^3<s
^3<n;
define ^3>z
^3>T
^3>s
^3>n;
tie $sil<z ^3<z;
tie $sil<T ^3<T;
tie $sil<s ^3<s;
tie $sil<n ^3<n;
tie ^3>$sil ^3>z ^3>T ^3>s ^3>n;
.
.
In the digit recognizer we chose to tie all models which need to be defined, but which do not have sufficient training data, to the equivalent silence dependent model.