Next: Word ModelsLexical trees Up: Using CSLU-C for Speech Previous: Feature Extraction

 

Phoneme Probability Models

The goal of a speech recognition algorithm is to map a speech signal to the words that were spoken. The speech research community has proposed many different approaches to this problem. CSLU-C recognizes phonemes (the basic sounds of speech) and then uses these phonemes to describe how each word is pronounced. In this chapter we will learn how to define phoneme probability estimators within the CSLU-C environment. Although we attempt to keep things simple, some knowledge of how speech units are modeled will help in understanding the process of defining phoneme probability estimators.

Phoneme probability model definition

Model definition

The syntax rules for the phoneme models are as follows. Each output of the phoneme probability estimator is represented by a model. The collection of models describes the relation between the word pronunciation, expressed in the base symbol set, and the outputs of the particular phoneme probability estimator. In the simplest case each phoneme is represented by a corresponding context independent model, or monophone. However, there can be a wide range of variation in the way that phonemes are realized, and much of this variation depends on context. As a result, monophones are poor discriminators. One way to improve discrimination is to use context dependent models. Such models are called triphones if they account for both the left and right context, or biphones if they account for either the left or the right context but not both. CSLU-C currently supports only biphone modeling.

Each model is defined by its nucleus (phoneme) and an optional left or right context.

          phoneme = char{char}

Here char represents any character except one of the meta characters < > $ = ;. These may, however, be escaped using a single backslash. The left or right context of the model may be defined either as a single phoneme or as a list of phonemes.

          phoneme_list = phoneme{`` '' phoneme}
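
As an illustration of the character rules above, the following C sketch reads one phoneme from a definition string. The names is_meta_char and read_phoneme are hypothetical and not part of the toolkit. The sketch also assumes that whitespace ends a phoneme and that the escaping backslash is dropped from the stored name; neither assumption is stated in the text.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Meta characters named in the text: < > $ = ; */
int is_meta_char(char c)
{
    return c != '\0' && strchr("<>$=;", c) != NULL;
}

/*
 * Copy one phoneme from src into out (capacity cap, including the NUL).
 * A backslash makes the following character literal (assumption: the
 * backslash itself is not stored). Reading stops at whitespace, at an
 * unescaped meta character, or at the end of the string. Characters
 * that do not fit in out are consumed but not stored.
 * Returns the number of source characters consumed.
 */
size_t read_phoneme(const char *src, char *out, size_t cap)
{
    size_t i = 0, n = 0;

    assert(src != NULL && out != NULL && cap > 0);
    while (src[i] != '\0' && !isspace((unsigned char)src[i]) &&
           !is_meta_char(src[i])) {
        if (src[i] == '\\' && src[i + 1] != '\0')
            i++;                /* skip the backslash, keep the next char */
        if (n + 1 < cap)
            out[n++] = src[i];
        i++;
    }
    out[n] = '\0';
    return i;
}
```

Calling read_phoneme on the text `\>i j` (written "\\>i j" in C source) consumes three characters and stores the two-character name `>i`.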

Variables are defined by a leading $ character. A variable defines a list or grouping of phonemes which can be used to describe the left or right context of a particular biphone model.

          name = char{char}
          variable = $name
          context = variable ``='' phoneme_list ``;''

A biphone model may therefore depend on a particular phoneme to the left or right of the nucleus (center phoneme), or on a group of phonemes defined using a variable. The following syntax can then be used to define a particular model.

          model = ``<'' phoneme ``>''      |
                  phoneme ``<'' phoneme    |
                  phoneme ``>'' phoneme    |
                  variable ``<'' phoneme   |
                  phoneme ``>'' variable

The angle brackets < and > mark the left or the right context. Context independent models are written with brackets on both sides of the center phoneme. Using this syntax a phoneme can be modeled as (a) a one part model (context independent, left dependent or right dependent), (b) a two part model (a left biphone followed by a right biphone) or (c) a three part model (a left biphone, followed by a context independent model, followed by a right biphone).
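
The five model forms can be told apart by where the unescaped brackets sit. The C sketch below shows one way to do this; classify_model, find_unescaped and the enum names are hypothetical and are not toolkit functions, and the sketch does not validate the phoneme or variable names themselves.

```c
#include <assert.h>
#include <stddef.h>

enum model_kind {
    MODEL_INVALID,
    MODEL_INDEPENDENT,   /* <A>          */
    MODEL_LEFT,          /* x<A or $v<A  */
    MODEL_RIGHT          /* A>x or A>$v  */
};

/* Return a pointer to the first occurrence of target that is not
   preceded by an escaping backslash, or NULL if there is none. */
static const char *find_unescaped(const char *s, char target)
{
    for (; *s != '\0'; s++) {
        if (*s == '\\' && s[1] != '\0') {
            s++;                 /* step onto the escaped character */
            continue;            /* the loop increment steps past it */
        }
        if (*s == target)
            return s;
    }
    return NULL;
}

enum model_kind classify_model(const char *m)
{
    const char *lt, *gt;

    assert(m != NULL);
    lt = find_unescaped(m, '<');
    gt = find_unescaped(m, '>');

    if (lt == m) {
        /* "<A>": nonempty nucleus, closing bracket is the last char */
        if (gt != NULL && gt > m + 1 && gt[1] == '\0')
            return MODEL_INDEPENDENT;
        return MODEL_INVALID;
    }
    if (lt != NULL && gt == NULL && lt[1] != '\0')
        return MODEL_LEFT;       /* context, then '<', then nucleus */
    if (gt != NULL && lt == NULL && gt != m && gt[1] != '\0')
        return MODEL_RIGHT;      /* nucleus, then '>', then context */
    return MODEL_INVALID;
}
```

Under these assumptions `$nas<A` is classified as left dependent, `A>$nas` as right dependent and `<A>` as context independent; an escaped bracket such as the one in `f<\>r` is treated as part of the phoneme name.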

The complete description of the outputs of the phoneme estimator is defined by a list of models, where the order of the list corresponds to the order of the outputs of the particular phoneme probability estimator.

         model_list  = model{`` '' model}
         estimator = ``define'' model_list ``;''

Multiple define commands may be used throughout the probability estimator definition file. The final set of models will then be defined as the concatenation of each list of models defined by the separate instances of the define command.

Duration modeling

  In the Viterbi search, the length of a particular phoneme determines how important it is to the overall score. Because they are only a short component of the path, very short phonemes influence the score less than long ones, which can lead to misrecognitions. A significant reduction in errors results if we impose minimum and maximum durations for phonemes, although too-long phonemes are a less frequent source of errors.

Rather than apply absolute minimums or maximums, our recognizers impose a per-frame penalty on the score for segments which are too long or too short. This gives the recognizer some flexibility in overcoming poor word models or sloppy articulation where a segment may really be missing. These duration limits are derived from "generic" samples of the phonemes. The following syntax defines a duration model.

          mindur = digit{digit}
          maxdur = digit{digit}
          duration = phoneme mindur maxdur   |
                     model   mindur maxdur

Minimum and maximum durations are specified in milliseconds. Model durations which are not specified are calculated from the base phoneme durations using the transformation rules described in Table 4.1.

  [table body not available]
Table 4.1:   Duration model rules for calculating duration models from base phoneme durations.

The table reads as follows. For the two-part phoneme case, the first line describes a phoneme whose first part is a left-dependent phoneme (x<A) and whose second part is a right-dependent phoneme (A>x). Based on Table 4.1, the durations computed for these models would therefore be:

[equations for the minimum and maximum durations of x<A and A>x not available]

The following two models are a bit unconventional. The second line describes a phoneme whose first part is a left-dependent phoneme (x<A) and whose second part is a context-independent phoneme (<A>). The durations computed for these models would therefore be:

[equations for the minimum and maximum durations of x<A and <A> not available]

Finally, the third line describes a phoneme, with the first part a context-independent phoneme (<A>) and the second part a right-dependent phoneme (A>x). The durations computed for these models would therefore be:

[equations for the minimum and maximum durations of <A> and A>x not available]

To fully specify the models, a duration model must be given at least for each unique phoneme nucleus. If minimum and maximum durations are specified for some of the predefined models, then these durations will override the default durations which are calculated from the base phoneme minimum and maximum durations. The list of duration models is specified using the duration command.

          duration_list    = duration{`` '' duration}
          duration_models  = ``duration'' duration_list ``;''

Multiple duration commands may be used throughout the phoneme probability description file. The final set of duration models will then be defined as the concatenation of each set of duration models defined.
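
The per-frame penalty described above can be sketched in C as follows. This is only an illustration: the function name duration_penalty, the linear form of the penalty, the frame length and the rounding up to whole frames are all assumptions, since the text does not give the exact rule used by the recognizers.

```c
#include <assert.h>

/*
 * Penalty for a segment lasting duration_ms, given the limits from a
 * duration model. Assumption: every frame by which the segment falls
 * short of min_ms or exceeds max_ms costs penalty_per_frame, and a
 * partial frame counts as a whole one. frame_ms is the frame step.
 */
double duration_penalty(int duration_ms, int min_ms, int max_ms,
                        int frame_ms, double penalty_per_frame)
{
    int excess_ms = 0;

    assert(frame_ms > 0 && min_ms <= max_ms);
    if (duration_ms < min_ms)
        excess_ms = min_ms - duration_ms;      /* too short */
    else if (duration_ms > max_ms)
        excess_ms = duration_ms - max_ms;      /* too long  */

    /* integer division rounds the excess up to whole frames */
    return penalty_per_frame * ((excess_ms + frame_ms - 1) / frame_ms);
}
```

With limits of 40 and 226 ms, a hypothetical 10 ms frame and a penalty of 1.5 per frame, a 100 ms segment costs nothing, a 20 ms segment is two frames too short and costs 3.0, and a 250 ms segment is counted as three frames too long and costs 4.5.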

Model tying

When designing phoneme probability estimators it sometimes happens that certain models need to be defined but cannot be trained because of a lack of training data. The tie command can then be used to tie such a model to a model with similar characteristics. The tie command ties a single model, or a list of models, to a target model; the target model is named first.

          tied_models = ``tie'' model  model_list ``;''

As with the define and duration commands, multiple tie commands may be used throughout the phoneme probability description file. The final set of tied models is the concatenation of the models tied by each individual tie command.

Phoneme mapping

Phoneme mapping allows the complete sharing of models defined with the same nucleus. For example, all models defined using the phoneme ``I_x'' will have much the same characteristics as models defined using the phoneme ``I'', with the main difference relating to the underlying duration models of each. Using the map command we only need to define models for one of the symbols (say ``I''), and then all models can be duplicated for ``I_x'' by mapping ``I_x'' to ``I''.

The map command maps either a single phoneme or a list of phonemes to a target phoneme; the target phoneme is named first.

         mappings = ``map'' phoneme phoneme_list ``;''

Multiple map commands may be used throughout the probability names description file. The final list of phoneme mappings is formed as the concatenation of all mappings defined by each of the individual map commands.

A simple yes/no recognizer

This section describes the process involved in defining the base word pronunciations for the words ``yes'' and ``no'' in terms of a simple context independent phoneme recognizer which can recognize only the phonemes /silence j E s n oU/. Since all of these phonemes are context independent, each phoneme is uniquely described as consisting of only one part. For context dependent modeling, phonemes are typically described as having two or even three parts.

The buildNames function creates a structure which contains all the information needed to pronounce words in terms of the output categories of the specific phoneme classifier. This structure will be referred to as the probability estimator names description object, or names object for short. The source code discussed below can be found in the file yesno.c.

 

char *yesno_net = "\
define <.pau> <j> <E> <s> <n> <oU>;\
duration .pau 49  5000 \
         j    18  177 \
         E    40  226 \
         s    49  270 \
         n    28  193 \
         oU   59  397; ";
        .
        .
main()
{
        .
        .
  probnameT *p;
        .
        .
  p = buildNames(result, yesno_net);
        .
        .
}

A complete listing of duration models for the base Worldbet symbols can be found in the general purpose recognizer description file (mark1.desc).

In the example above, the names object consists of the phonemes /.pau j E s n oU/, with corresponding neural network category names /<.pau> <j> <E> <s> <n> <oU>/. This object can be used to create word models which are described in terms of the neural network output classes rather than the base worldbet symbols.

     

main()
{
        .
        .
  wordT *words[2];
  wordT *xwords[2];
        .
        .
  /* create yes no word models */
  for(i=0; i<2; i++) 
    words[i] = buildWord(result, yesno[i].pro, yesno[i].name);

  /* build context expanded word models */
  for(i=0; i<2; i++) {
    xwords[i] = expandWord(result, words[i], p);
    printWord(xwords[i]);
  }
}

The yesno example above produces the following output, which describes the pronunciation models in terms of the probability estimator names defined above.

system prompt% yesno
<j> <E> <s> 
<n> <oU> 
system prompt%

A bit more complicated example

Things become a little more complicated when dealing with context dependent phonetic modeling. The file mark1.desc contains the model definitions needed for the general purpose recognizer supplied with the development environment. We will refer to this file, to explain the process involved.

The first phase consists of deciding how each phoneme will be described, namely which are one part phonemes, which are two part phonemes, and so on. This depends very much on the design of the phoneme probability estimator. For our purposes we have chosen to model all English phonemes as a collection of one part (i.e. context independent), two part, three part and right dependent phonemes. Right dependent phonemes are phonemes which are influenced only by the phonemes, or context, to their right.

  [table body not available]
Table 4.2:   General purpose recognizer design.

For each of the multipart phonemes, we need to decide upon the left and right context of each phoneme. In the most specific case we would have each phoneme dependent on every other phoneme. This is, however, not necessary. Studying the contextual influences of phonemes, we find that phonemes can be grouped according to their influence on the phoneme in question. For example, in the general purpose recognizer the phoneme /A/ has three parts: a left context part, a context independent part and a right context part. For both the left and the right context we find that the phonemes /m n N N= n=/ all have roughly the same contextual influence on the base phoneme /A/. These can therefore be grouped together, as is done in the general purpose recognizer.

$sil = .pau pc kc tc tSc dZc .garbage;          /* silence models */
$bck = \>  A  aU oU w  l l= U U_x;              /* back vowels    */
$mid = E  @  ^  &;                              /* mid vowels     */
$fnt = u  i: I_x I ei aI \>i j;                 /* front vowels   */
$ret = 9r 3r &r;                                /* retro flex     */
$nas = m n N N= n=;                             /* nasals         */
$obs = ph kh th h s z Z S T f  ts dZ;           /* obstruents     */
$wev = dZc bc dc gc b d g D t( d( d_( th_( v;   /* what ever      */
.
.
.
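
To show how such groupings are used, the C sketch below looks up the variable that a context phoneme belongs to, which is the name that would appear in the biphone model (for example `$nas<A` when /A/ follows /n/). The function context_group and the table layout are hypothetical, and only three of the groups listed above are included to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct context_group_def {
    const char *name;      /* variable name, including the leading $ */
    const char *members;   /* phonemes separated by single spaces    */
};

/* Three of the groupings from the listing above. */
static const struct context_group_def groups[] = {
    { "$mid", "E @ ^ &" },
    { "$ret", "9r 3r &r" },
    { "$nas", "m n N N= n=" },
};

/* Return the name of the first group that lists phoneme as a member,
   or NULL if no group in the table contains it. Names are compared
   as whole tokens, so "N" does not match "N=". */
const char *context_group(const char *phoneme)
{
    size_t len, i;

    assert(phoneme != NULL);
    len = strlen(phoneme);
    for (i = 0; i < sizeof groups / sizeof groups[0]; i++) {
        const char *m = groups[i].members;
        while (*m != '\0') {
            size_t tok = strcspn(m, " ");   /* length of this member */
            if (tok == len && strncmp(m, phoneme, len) == 0)
                return groups[i].name;
            m += tok;
            m += strspn(m, " ");            /* skip the separator */
        }
    }
    return NULL;
}
```

Here context_group("n") and context_group("N=") both return "$nas", so the same model serves every nasal context; a phoneme outside the three groups shown, such as f, returns NULL.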

In many cases we find that some phonemes are very hard to recognize as independent entities because they have nearly the same characteristics as other, very similar phonemes. For this reason the neural network has a hard time distinguishing among them. One of the design steps is to identify these phonemes and then map them to a central representative phoneme. This information is also needed when describing the word pronunciation models according to the specific phoneme probability estimator. For example, all voiced closures /bc dc gc dZc/ can be mapped to a general voiced closure /vc/.

.
.
.
/* map some phonemes to what we have */
map vc  bc dc gc dZc;                           /* voiced closure   */
map uc  pc tc kc tSc;                           /* unvoiced closure */
map A   \>;
map l   l;
map n   n= N N=;
map m   m=;
map ^   &;
map 3r  &r;
map s   Z;
map A   5;
.
.
.
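
The effect of these map commands can be pictured as a simple lookup from a phoneme to the phoneme that represents it. The C sketch below hard-codes a few of the mappings from the listing above; map_phoneme and the table are hypothetical and say nothing about how the toolkit stores mappings internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct phoneme_mapping {
    const char *from;   /* phoneme as it appears in a pronunciation */
    const char *to;     /* phoneme whose models are shared          */
};

/* A subset of the map commands shown above. */
static const struct phoneme_mapping mappings[] = {
    { "bc", "vc" }, { "dc", "vc" }, { "gc", "vc" }, { "dZc", "vc" },
    { "pc", "uc" }, { "tc", "uc" }, { "kc", "uc" }, { "tSc", "uc" },
    { "n=", "n" },  { "N", "n" },   { "N=", "n" },
    { "m=", "m" },
};

/* Return the representative phoneme for p, or p itself if p is not
   mapped to anything in the table. */
const char *map_phoneme(const char *p)
{
    size_t i;

    assert(p != NULL);
    for (i = 0; i < sizeof mappings / sizeof mappings[0]; i++) {
        if (strcmp(mappings[i].from, p) == 0)
            return mappings[i].to;
    }
    return p;
}
```

With this table, map_phoneme("dZc") yields "vc" and map_phoneme("N=") yields "n", while an unmapped phoneme such as f is returned unchanged.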

The next phase is to define the output probability names according to the syntax specified above. For our general purpose recognizer, the neural network output class names are defined as follows:

.
.
.
/* now define the outputs of the nnet */
define  <.pau>  <.br>  <vc>  <uc>;              
define  $obs<f  $fnt<f  $bck<f  $sil<f  $wev<f  $nas<f  $mid<f  $ret<f
        f>$fnt  f>$bck  f>$sil  f>$wev  f>$nas  f>$mid  f>$obs  f>$ret;

define  $obs<v  $fnt<v  $bck<v  $sil<v  $wev<v  $nas<v  $mid<v  $ret<v
        v>$fnt  v>$bck  v>$sil  v>$wev  v>$nas  v>$mid  v>$obs  v>$ret;

define  $obs<T  $fnt<T  $bck<T  $sil<T  $wev<T  $nas<T  $mid<T  $ret<T
        T>$fnt  T>$bck  T>$sil  T>$wev  T>$nas  T>$mid  T>$obs  T>$ret;

define  $obs<D  $fnt<D  $bck<D  $sil<D  $wev<D $nas<D  $mid<D  $ret<D
        D>$fnt  D>$bck  D>$sil  D>$wev  D>$nas  D>$mid  D>$obs  D>$ret;
.
.
.

We also need to define the set of duration models for each of our models defined above. In this case we define the duration models for the underlying base phonemes, and rely on the automatic table translation to compute the duration models for the models specified.

.
.
.
/* specify base duration models */
duration .garbage   49  5000
         .pau       10  5000
         dZc        20  261
         bc         23  172
         dc         15  153
         gc         20  163
         pc         30  163
         tc         20  168
         kc         25  159
         tSc        20  168 
         f          37  247
         v          26  172
.
.
.

Given all these information sources we are now ready to define the probability names description object. These data are then used when defining word pronunciation models in terms of the outputs of the neural network category names.

Model tying - the digit recognizer

The file digit.desc contains the model definitions needed for the continuous digit recognizer supplied with the development environment. With the exception of the following groupings

$sil  = .pau tc .garbage;
$denl = s ks th;
$denr = s th;

the digit recognizer was designed using full context dependent models for each of the phonemes which occur in continuous digit strings. Full context dependent modeling consists of a set of biphones where each context of the center phoneme is a single phoneme, rather than a grouping of phonemes as in the previous example.

.
.
.
define  f<\>r  <\>r>  \>r>$sil  \>r>T  \>r>ei  \>r>f  \>r>w  \>r>z  \>r>oU
        \>r>n  \>r>$denr;

define  v<^2  ^2>n  n<^3  ^3>$sil  ^3>ei  ^3>f  ^3>w  ^3>oU;
.
.

Note that the models are defined using a combination of single phonemes as either the left or right context as well as variables denoting phoneme groupings.

Full context dependent modeling, however, has the disadvantage that not all possible context dependent models may be found in the training set. The tie command may therefore be used to tie models for which no data could be found in the training set to models which are similar enough and for which sufficient training data is available.

.
.
/* define here all models which are not in nnet and tie them to 
   the silence context dependent model */
define  ^3<z
        ^3<T
        ^3<s
        ^3<n;

define  ^3>z
        ^3>T
        ^3>s
        ^3>n;

tie $sil<z   ^3<z;
tie $sil<T   ^3<T;
tie $sil<s   ^3<s;
tie $sil<n   ^3<n;

tie ^3>$sil  ^3>z ^3>T ^3>s ^3>n;
.
.

In the digit recognizer we chose to tie every model which must be defined, but which does not have sufficient training data, to the equivalent silence dependent model.
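
The choice of tie target in the listing above follows a regular pattern: the context of the untrained model is replaced by $sil. The C sketch below builds that target name for a given model. It is only an illustration of the naming pattern; silence_tie_target is a hypothetical helper and the toolkit does not necessarily offer anything like it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Write into out the silence dependent model that a left or right
 * dependent model would be tied to: "x<A" gives "$sil<A" and
 * "A>x" gives "A>$sil". Escaped brackets are treated as part of a name.
 * Returns 0 on success, or -1 if model is not a left or right
 * dependent model or if the result does not fit in out.
 */
int silence_tie_target(const char *model, char *out, size_t cap)
{
    const char *p;
    int n;

    assert(model != NULL && out != NULL && cap > 0);

    /* find the first unescaped bracket */
    for (p = model; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0') {
            p++;                  /* step onto the escaped character */
            continue;
        }
        if (*p == '<' || *p == '>')
            break;
    }
    /* no bracket, bracket first (as in "<A>"), or nothing after it */
    if (*p == '\0' || p == model || p[1] == '\0')
        return -1;

    if (*p == '<')
        n = snprintf(out, cap, "$sil<%s", p + 1);
    else
        n = snprintf(out, cap, "%.*s>$sil", (int)(p - model), model);

    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

For the models in the listing, `^3<z` produces `$sil<z` and `^3>n` produces `^3>$sil`, matching the tie commands shown; a context independent model such as `<s>` is rejected.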



Johan Schalkwyk
Wed Nov 27 10:08:24 PST 1996