Training Hidden Markov Model/Artificial Neural Network (HMM/ANN) Hybrids for Automatic Speech Recognition (ASR)

John-Paul Hosom, Jacques de Villiers, Ron Cole, Mark Fanty, Johan Schalkwyk, Yonghong Yan, Wei Wei
Center for Spoken Language Understanding (CSLU)
OGI School of Science & Engineering (OGI)
Oregon Health & Science University (OHSU)

Version 1.1: February, 1999
Version 2.0: February, 2006

Contents

1. Introduction
    1.1 Setup
    1.2 Additional Information
2. General Concepts and Notation
    2.1 Quick Overview of Frame-Based Speech Recognition
    2.2 Specifying Categories
    2.3 Example of Specifying Categories
    2.4 Finding Examples to Train On
        2.4.1 Overfitting and Datasets
        2.4.2 Filtering
        2.4.3 Finding Categories
        2.4.4 Number of Examples per Category
    2.5 Training the Network
        2.5.1 Generating Data
        2.5.2 Number of Hidden Nodes
        2.5.3 Negative Penalty
        2.5.4 Number of Training Iterations
        2.5.5 Re-Training on Force-Aligned Data
    2.6 Evaluation
        2.6.1 Word-Level Evaluation
        2.6.2 Choosing the Best Iteration
        2.6.3 Testing
3. Overall Procedure
    3.1 Create Descriptions
    3.2 Find Data
    3.3 Select Data for Training
    3.4 Train and Evaluate
    3.5 Re-Train
    3.6 Evaluate Test Set
4. Complete Example
5. File Formats
    wav files
    txt files
    label files
    corpora file
    info file
    grammar file
    lexicon file
    parts file
    spec file
    files file
    dur file
    counts file
    examples file
    vec file
    neural-network files
    summary file
    ali files
6. Script and Program Usage
    asr.tcl  
    checkvec.exe
    fa.tcl
    find_dur.tcl
    find_files.tcl
    gen_catfiles.tcl
    gen_spec.tcl
    gen_examples.tcl
    nntrain.exe
    pick_examples.tcl
    revise_spec.tcl
    select_best.tcl

1. Introduction

This tutorial describes one method used at Oregon Health & Science University's Center for Spoken Language Understanding (CSLU) for creating automatic speech recognition (ASR) systems called Hidden Markov Model/Artificial Neural Network (HMM/ANN) hybrids, using the CSLU Toolkit.  The CSLU Toolkit contains tools for speech recognition, speech synthesis, facial animation, audio I/O, and other interface tools.  This Toolkit performs all lower-level operations using "C" code, and higher-level operations using a scripting language called "Tcl."  This allows a balance of speed and flexibility that would not be possible with any one programming language. Included in this tutorial are some general concepts behind training such a recognizer, step-by-step instructions on how to train a recognizer, and a description of Tcl scripts that can be used to automate parts of this process.

1.1 Setup

In order to use the scripts mentioned in this tutorial, you must have the CSLU Toolkit installed on your machine. Currently, the CSLU Toolkit is only supported in a Windows environment.

The "Path" environment variable should be modified as follows (assuming Windows XP): click Start » Settings » Control Panel » System. Then click on the "Advanced" tab, and click on the "Environment Variables" button. Under the "System variables" heading, select the "Path" variable. Click on the "Edit" button, which pops up a new window called "Edit System Variable." In this new window, there is an area for the "Variable value". In the corresponding entry field, there are a number of paths, each separated by a semicolon. Assuming that you have installed the CSLU Toolkit into the default directory (C:\Program Files\CSLU), add the following paths to this list (separated by semicolons):
     C:\Program Files\CSLU\Tcl80\bin
     C:\Program Files\CSLU\Toolkit\2.0\bin
     C:\Program Files\CSLU\Toolkit\2.0\shlib
     C:\Program Files\CSLU\Toolkit\2.0\script\sview_1.0
     C:\Program Files\CSLU\Toolkit\2.0\script\training_1.0
Click on "OK" on these windows to finalize the settings.

The .tcl extension can be associated with either a command-line Tcl script, or a GUI Tk script. The default association is with a Tk script, but this should be modified to execute a command-line Tcl script as follows: click Start » Settings » Control Panel » Folder Options.  Then click on the "File Types" tab and scroll down until the "TCL" extension is visible.  Highlight this extension with a single mouse click, so that the bottom of this window shows "Details for TCL Extension."  Click on the "Advanced" button, which pops up a window called "Edit File Type."  Single-click on the action labeled "Tclsh", and then click on the "Set Default" button to the right.  Then click on the "Edit" button just above the "Set Default" button.  This pops up yet another window called "Editing action for type: Tcl/Tk".  In this window, there is an entry box labeled "Application used to perform action:".  Make sure that this entry box contains the following (assuming that CSLU Toolkit installation was in the default directory):
     "C:\Program Files\CSLU\tcl80\bin\tclsh80.exe" "%1" %*%
(note especially the %*% at the end).  Click on "OK" on all of these windows to finalize the settings.

The training process uses commands entered from a DOS prompt; a DOS command window can be found on Windows XP from Start » Programs » Accessories » Command Prompt.  Since it will be used often, it is recommended that the command window be resized for a width of 80 characters, screen buffer height of 2000 lines, window size height of 40 or 50 lines, and screen text font color of pure white for maximum visibility.  A command window can be added to the Start menu for easy access. 

In order to follow the examples in this tutorial, you may want to use the same data files.  These files have been put into a ZIP file containing all waveform and transcription files.  This file is available by clicking here.  The size of this compressed ZIP file is 7.6 MB, and the size of the data files is about 10 MB.  In addition, the configuration files used in the tutorial have been put in a ZIP file.  This ZIP file is available here. Some of these configuration files will need to be modified to reflect your system's path information and other relevant information.

1.2 Additional Information

In this document, phonetic symbols are represented using  the American English subset of Worldbet, which is an ASCII encoding of the International Phonetic Alphabet (IPA) [J. Hieronymus, 1995]. This tutorial has been supported by an NSF "Graduate Research Traineeships" award (grant number 9354959) and the CSLU Member companies. The views expressed in this tutorial do not necessarily represent those of the sponsoring agency and companies.
 

2. General Concepts and Notation

The general steps to creating an HMM/ANN speech recognition system are:

  1. Specify the (sub-)phonetic categories that the neural network will classify.
  2. Find examples of each of these categories in the speech data.
  3. Perform a number of iterations of neural-network training.  The output of each neural network is an estimation of the probabilities of each of the specified categories given a single time-point within a speech waveform. Select the best neural network (and adjust other system parameters) by evaluating each network on a small partition of the speech data that is not used for training or testing.  Evaluation is performed by using the estimated probabilities obtained from a neural network within a Hidden Markov Model framework.
  4. Evaluate the selected best network on a test set of speech data.

2.1 Quick Overview of Frame-Based Speech Recognition

Frame-based speech recognition has the following steps, illustrated in Figure 1:

Figure 1. Overview of frame-based speech recognition using a Hidden Markov Model/Artificial Neural Network (HMM/ANN) architecture.

  1. Divide the speech waveform into  frames, where each frame is a small segment of speech that contains an equal number of waveform samples. In this tutorial, we will assume a frame size of 10 msec.
  2. Compute features for each frame. These features can be thought of as a representation of the spectral envelope of the speech at that frame, and at a small number of surrounding frames called the "context window".
  3. Classify the features in each frame into phonetic-based categories using a neural network. The outputs of the neural network are estimates of the probability of each phonetic category, given the speech features at this frame.  When the neural network is used to classify all frames, this creates a matrix of probabilities, with F columns and C rows, where F is the number of frames and C is the number of categories.
  4. Use the matrix of probabilities, a set of pronunciation models, and a priori information about each category's duration to determine the most likely word(s) using a Viterbi search.
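
Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration, not Toolkit code: the 8000 Hz sampling rate, the function names, and the context window of two frames on either side are assumptions made for the example.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_msec=10):
    """Split a waveform into consecutive frames of equal length.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_msec // 1000  # samples per frame
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def context_window(frame_index, n_frames, half_width=2):
    """Return the frame indices that make up the context window of one frame.

    Indices are clamped at the utterance boundaries, so windows near the
    start or end of the utterance repeat the edge frame.
    """
    return [min(max(frame_index + k, 0), n_frames - 1)
            for k in range(-half_width, half_width + 1)]


# One second of dummy audio at the assumed 8 kHz sampling rate.
waveform = [0] * 8000
frames = split_into_frames(waveform)
print(len(frames), len(frames[0]))       # 100 80
print(context_window(0, len(frames)))    # [0, 0, 0, 1, 2]
print(context_window(50, len(frames)))   # [48, 49, 50, 51, 52]
```

With F frames and C categories, classifying every frame then fills the F-column, C-row probability matrix described in step 3.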

For a much more detailed explanation about HMMs and HMM/ANN hybrids, lecture notes are available.
 

2.2 Specifying Categories

In order to determine the categories that the network will classify, the following three things need to be done:

  1. The designer of the recognizer needs to determine the pronunciations for each of the words that will be recognized. More accurate pronunciation models will generally yield better recognition rates.
  2. Quite often, context-dependent phoneme models are used, which means that the model for a phoneme varies depending on the phonemes that precede or follow this phoneme.  For example, the phoneme /aI/ preceded by a /w/ will have a different model name than an /aI/ preceded by an /n/.  The surrounding contexts (/w/ and /n/ in this example) may be specific phonemes, or groups (clusters) of phonemes.  The grouping of phonemes into similar clusters is specified by the person designing the recognizer.
  3. Finally, when constructing context-dependent phoneme models, each phoneme to be recognized is divided into one, two, or three sub-sections or segments. Each sub-phonetic segment corresponds to one category to be recognized. If a phoneme is specified as having only one segment, then it is used without the context of surrounding phonemes.  If a phoneme is specified as having two segments, then the left segment (sub-phonetic category) is dependent on the preceding phoneme, and the right segment is dependent on the following phoneme.  If a phoneme is specified as having three segments, then the first segment is dependent on the preceding phoneme, the middle segment is independent of surrounding phonemes, and the third segment is dependent on the following phoneme.  If a phoneme is specified as being "right-dependent," it has only one segment, but this segment is dependent on the context of the following phoneme.  Right-dependent categories are typically used for oral stop phonemes.  This method of constructing context-dependent models is somewhat different from the standard triphone or biphone model, focusing the context dependencies on the regions of the phoneme that are most affected by that context.  The number of parts to split each phoneme into is specified by the person designing the recognizer.

Figure 2 shows an illustration of this kind of context-dependent modeling. In this figure, an example is given for the modeling of the word "lines", written in Worldbet as /l aI n z/. Here, the /l/ is split into two parts, the /aI/ is split into three parts, /n/ into two parts, and the /z/ is modeled using one part. There are eight clusters (or groups) of phonemes used for contexts; each cluster represents a broad class of sounds. For this recognizer, the /l/ phoneme is assigned to the $lat cluster (for lateral phonemes), the /aI/ phoneme is assigned both to the $bck_r cluster of back vowels occurring in a right-hand context and to the $fnt_l cluster of front vowels occurring in a left-hand context, and both /n/ and /z/ are assigned to the $alv (alveolar) cluster. (The /aI/ phoneme is unusual in that it begins as a back vowel and ends as a front vowel; therefore, it cannot be grouped into only one of the $bck (back-vowel) or $fnt (front-vowel) clusters.  The solution is to consider whether the /aI/ is occurring to the left of a phoneme (a left context), in which case it is always a front vowel, or whether it is occurring to the right of a phoneme (a right context), in which case it is always a back vowel.)

Figure 2. Context-Dependent Modeling

The context-dependent phonetic categories that the network will be trained on can be determined from the phonetic-level pronunciation models, the groupings of phonemes into clusters of similar phones, and the number of parts to split each phoneme into.

2.3 Example of Specifying Categories

To give an example of how these pronunciation models, clusters, and parts can be determined, we'll use the example of recognizing the isolated words "three", "tea", "zero", and "five".  (This example uses isolated words, that is, words that are surrounded by pauses or silence; the CSLU Toolkit can also be used for recognizing continuous speech.)

First, we come up with some initial pronunciations:

     word      pronunciation
     three     T 9r i:
     tea       tc th i:
     zero      z i: 9r oU
     five      f aI v

We may want to modify these pronunciations, because the /i:/ in "zero" is often pronounced differently from the /i:/ in "three" and "tea". To account for this difference in pronunciation, we can use our own symbol, /i:_x/, to represent the front vowel in "zero". Making this change gives us the following pronunciation models:

     word      pronunciation
     three     T 9r i:
     tea       tc th i:
     zero      z i:_x 9r oU
     five      f aI v

Next, we will determine the number of parts to use for each phoneme. In the table below, "1" means that the phoneme will be context-independent, "2" means that the phoneme will be split into two parts, "3" means that the phoneme will be split into three parts, and "r" means that the phoneme will be "right-dependent":

     phone     parts
     T         1
     9r        2
     i:        3
     tc        1
     th        r
     z         1
     i:_x      2
     oU        3
     f         1
     aI        3
     v         1
     .pau      1

The /.pau/ symbol is used for the pause that is assumed to occur between words.  Now, let's look at the spectrograms of the vowel /i:/ in "three" and "tea". In this case, the vowel /i:/ is the same, but it looks very different when it is preceded by a /9r/ compared to when it is preceded by a /th/ (see Figure 3).


Figure 3. Example of vowel /i:/ in different phonetic contexts.

In this case, we make the initial third of the /i:/ (since it is split into three parts) dependent on a preceding retroflex (/9r/) in one case and dependent on a preceding alveolar sound (/th/ or /z/) in the other case. We usually group the phonemes in a left or right context according to their broad phonetic category; for example, the following groupings can be used (the dollar sign indicates a variable that represents the group of listed phones):

  group   phonemes in group    description
  $bck   oU   back vowels
  $fnt   i: i:_x   front vowels
  $ret   9r   retroflex sounds
  $alvden   T v th z   dentals, labiodentals, and alveolars
  $sil   .pau tc /BOU /EOU   silence or closure

Notice the two symbols /BOU and /EOU in the "phonemes in group" column.  These are two special symbols defined by the recognizer; /BOU stands for "beginning of utterance" and /EOU stands for "end of utterance."  They are not symbols that need to be trained on, but they are put into context clusters so that the recognizer knows what context-dependent category to assign to the first and last phonemes in an utterance.

This general scheme is relatively straightforward.  However, notice that it then becomes difficult to classify diphthongs such as /aI/, because the phoneme starts as a back vowel and ends as a front vowel. The solution is to modify the categories in the following way:

  group   phonemes in group    description
  $bck_l   oU   back vowels to the left of a target phoneme
  $bck_r   oU aI   back vowels to the right of a target phoneme
  $fnt_l   i: i:_x aI   front vowels to the left of a target phoneme
  $fnt_r   i: i:_x   front vowels to the right of a target phoneme
  $ret   9r   retroflex sounds
  $alvden   T v th z   dentals, labiodentals, and alveolars
  $sil   .pau tc /BOU /EOU   silence or closure

 
First, we have added "_l" and "_r" suffixes to the variable names in question, to indicate whether the phonemes in this grouping occur on the left or right side of the phoneme being classified. Then, because /aI/ has the characteristics of a back vowel when it appears in a right-hand context, it has been put in the grouping $bck_r; because /aI/ has the characteristics of a front vowel in a left-hand context, /aI/ has also been put in the grouping $fnt_l. This method of grouping into left or right contexts is illustrated in Figure 4:


Figure 4. Illustration of labeling a diphthong in the word "five".

The format for specifying different categories is [left_context]<phone>[right_context], so for example the category for /.pau/ will be <.pau>, the category for the initial third of /i:/ in the context of the $alvden cluster (dentals, labiodentals, and alveolars) will be $alvden<i:, the middle third of /i:/ will be <i:>, and the right third of /i:/ in the context of silence will be i:>$sil.

Given all this information, it can easily (if tediously) be determined that the 23 categories we need to train on are:
 

     <.pau>          $alvden<9r      $fnt_l<9r       9r>$fnt_r
     9r>$bck_r       <T>             f<aI            <aI>
     aI>$alvden      <f>             $ret<i:         $alvden<i:
     <i:>            i:>$sil         $alvden<i:_x    i:_x>$ret
     $ret<oU         <oU>            oU>$sil         <tc>
     th>$fnt_r       <v>             <z>



In the following sections, a Tcl script called "gen_spec.tcl" is described; this script can be used to automate the process of determining categories and creating a specification file of what categories a classifier must use.
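
To make the derivation concrete, the enumeration can be sketched in Python. This is a rough illustration of the rules in Sections 2.2 and 2.3, not a reimplementation of gen_spec.tcl. It assumes that every word is surrounded by /.pau/, that clusters without an "_l" or "_r" suffix apply on both sides, and that a context phoneme belonging to no cluster (such as /f/) is used under its own name; all function and variable names are made up for the example.

```python
LEXICON = {
    "three": ["T", "9r", "i:"],
    "tea":   ["tc", "th", "i:"],
    "zero":  ["z", "i:_x", "9r", "oU"],
    "five":  ["f", "aI", "v"],
}

# "1", "2", "3" = number of parts; "r" = right-dependent.
PARTS = {"T": "1", "9r": "2", "i:": "3", "tc": "1", "th": "r", "z": "1",
         "i:_x": "2", "oU": "3", "f": "1", "aI": "3", "v": "1", ".pau": "1"}

# Clusters that apply to both left and right contexts.
SHARED = {"$ret": ["9r"],
          "$alvden": ["T", "v", "th", "z"],
          "$sil": [".pau", "tc", "/BOU", "/EOU"]}
# Clusters used when the context phoneme is to the LEFT of the target.
LEFT = {**SHARED, "$bck_l": ["oU"], "$fnt_l": ["i:", "i:_x", "aI"]}
# Clusters used when the context phoneme is to the RIGHT of the target.
RIGHT = {**SHARED, "$bck_r": ["oU", "aI"], "$fnt_r": ["i:", "i:_x"]}


def cluster_of(phone, clusters):
    """Return the cluster containing `phone`, or the phone itself if none does."""
    for name, members in clusters.items():
        if phone in members:
            return name
    return phone  # assumption: unclustered phonemes act as their own context


def categories_for(lexicon):
    """Enumerate the context-dependent categories needed for isolated words."""
    cats = {"<.pau>"}
    for pron in lexicon.values():
        seq = [".pau"] + pron + [".pau"]
        # Each triple is (preceding phoneme, target phoneme, following phoneme).
        for prev, phone, nxt in zip(seq, seq[1:], seq[2:]):
            left = f"{cluster_of(prev, LEFT)}<{phone}"
            right = f"{phone}>{cluster_of(nxt, RIGHT)}"
            kind = PARTS[phone]
            if kind == "1":
                cats.add(f"<{phone}>")
            elif kind == "2":
                cats.update([left, right])
            elif kind == "3":
                cats.update([left, f"<{phone}>", right])
            elif kind == "r":
                cats.add(right)
    return sorted(cats)


categories = categories_for(LEXICON)
print(len(categories))
print(" ".join(categories))
```

Changing an entry in PARTS or moving a phoneme between clusters and re-running the sketch shows how quickly the category inventory grows or shrinks.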

Different settings for the "parts" and "context clusters" can yield significantly different word-level performance in the final recognizer.  The values that yield best results will depend on a number of factors, including the vocabulary size and grammar.  The general goal is to create categories that will have enough examples for training (maximize the number of examples per category) and also maximize the difference of models for different words.
 

2.4 Finding Examples to Train On

2.4.1 Overfitting and Datasets
As a neural network is trained, the weights of the network are adjusted to minimize the classification error on the training data. For each adjustment of the weights, we have a new iteration (or epoch) of the training process. We can keep generating new iterations until the error no longer decreases. At this point, we have learned the training data to the extent that it is possible.

However, when we train a neural network, we aren't really interested in learning as much as possible about the training data. Instead, we are interested in learning as much as possible about the general properties of the training data, so that when we evaluate on test data, our model is still accurate. By learning the general properties of the data instead of the details that are specific to the training data, we are best able to classify a new utterance that is not in the training set.

In order to determine which iteration of network weights has best learned the general properties of the data, we use a separate (usually smaller) set of data to evaluate each iteration.  Evaluation is conducted at the word level, meaning that the network is used in combination with a Viterbi search to perform word-level recognition; the set of network weights that maximizes word-level accuracy is selected as the "best" and final network.  This second set of data is called the "development" set (or cross-validation set). Because this development set has not been used to adjust the network weights during training, it can be used to evaluate the network's ability to recognize phonetic categories, as opposed to the classifier's ability to recognize (possibly irrelevant) details of the training set. The larger this development set is, the more confidence we can have in the general classification properties of the network.

Once we have determined the best network, we need to evaluate its performance on a test set. In order to have an honest evaluation, the data in the test set must not occur in either the training set or the development set.  In addition, for a speaker-independent recognizer, no speaker in the test set may have utterances in the training or development sets.

This means that given a corpus containing our target words, we must divide it into at least three parts: one part for training, one for development, and one for testing. If we have a large enough corpus, we may further divide the development set into subsets, so that as we evaluate and make modifications to our recognizer, we are not tuning performance to one set of development data.
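
One simple way to keep the three partitions speaker-disjoint is to assign whole speakers, rather than individual utterances, to a partition. The sketch below is an assumption-laden illustration and not how find_files.tcl works: the 60/20/20 split, the function names, and the (speaker, filename) representation are all hypothetical.

```python
import hashlib


def partition_for(speaker_id, dev_pct=20, test_pct=20):
    """Assign a speaker to 'train', 'develop', or 'test'.

    The assignment depends only on the speaker ID, so every utterance
    from the same speaker lands in the same partition.
    """
    digest = hashlib.md5(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "develop"
    return "train"


def split_corpus(utterances):
    """Split an iterable of (speaker_id, filename) pairs into three partitions."""
    sets = {"train": [], "develop": [], "test": []}
    for speaker, filename in utterances:
        sets[partition_for(speaker)].append((speaker, filename))
    return sets


# Hypothetical corpus: 200 utterances from 40 speakers.
corpus = [(f"speaker{i % 40:02d}", f"utt{i:04d}.wav") for i in range(200)]
for name, items in split_corpus(corpus).items():
    print(name, len(items), "utterances")
```

Because the hash is deterministic, adding new utterances later never moves an existing speaker from one partition to another.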

2.4.2 Filtering
When selecting data for training, development, and testing, we can apply various filters to selectively reduce the amount of data. In one case, we may have utterances in our corpus that don't occur in our target vocabulary. In this case, we may want to filter so that words not in our vocabulary list are not included in our datasets. For example, if we are training a digits recognizer and we are using the CSLU Numbers corpus for training, we may want to remove out-of-vocabulary utterances that contain numbers such as "first", "twelve", and "fifty". In another case, we may have so much data that training or evaluation would take too long. In this case, we can filter so that we take every Nth utterance for use in our datasets, where N is some integer greater than 1. For example, we may want to take every sixth waveform for training our digits recognizer, because there are over 6000 utterances available for training on digits. Filtering in this way will still leave over 1000 utterances (or approximately 500 examples of each spoken digit) available for training.
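
Both kinds of filter are easy to express in code. The following Python sketch is illustrative only; the function name, the (filename, transcription) pairs, and the order in which the two filters are applied are assumptions, and the actual filtering parameters are set in the info file.

```python
DIGITS = {"zero", "oh", "one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine"}


def filter_utterances(utterances, vocabulary, every_nth=1):
    """Keep in-vocabulary utterances, then keep every Nth of those.

    utterances: list of (filename, transcription) pairs.
    """
    in_vocab = [(name, text) for name, text in utterances
                if all(word in vocabulary for word in text.split())]
    return in_vocab[::every_nth]


# Hypothetical file names and transcriptions.
utterances = [
    ("a.wav", "three five"),
    ("b.wav", "twelve"),       # out of vocabulary
    ("c.wav", "zero"),
    ("d.wav", "fifty one"),    # out of vocabulary
    ("e.wav", "five five"),
    ("f.wav", "one two"),
]
print(filter_utterances(utterances, DIGITS, every_nth=2))
# [('a.wav', 'three five'), ('e.wav', 'five five')]
```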
 
2.4.3 Finding Categories
Once we know which files we'll use for training, we need to find examples of each context-dependent phonetic category that we'll train on. This can be done in one of two ways: using data that has been hand-labeled at the phonetic level, or by using forced alignment.

2.4.4 Number of Examples per Category
Finally, the designer of a recognizer must decide how many examples (10-msec frames of speech that have been associated with a particular context-dependent phonetic category) of each category to train on. Networks with decent performance can be trained using up to 500 examples per category, but sometimes 2000 or more examples are used. In order to get best performance, generally all examples in the training set should be used. However, training with all examples may be very time-consuming.

If some categories have very few or no training examples, then there are two options. The first option is to use an additional corpus that contains examples of these infrequent classes. The second option is to "tie" these infrequent categories to phonetically similar categories that do have enough training examples. Categories tied in this way will not be trained on, and during recognition their probabilities will be set equal to the probabilities of the categories that they were tied to.
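
The tying option can be pictured as a lookup table that is consulted at recognition time. The sketch below is a simplified illustration, not the logic of revise_spec.tcl: the example counts, the threshold, and the hand-written list of "similar" categories are all hypothetical.

```python
def tie_sparse_categories(counts, min_examples, similar):
    """Return a {sparse_category: category_it_is_tied_to} mapping.

    counts:  number of training examples found for each category.
    similar: for each category, phonetically similar candidates to tie to,
             in order of preference (supplied by the designer).
    """
    ties = {}
    for category, n in counts.items():
        if n >= min_examples:
            continue  # enough data; this category is trained normally
        for candidate in similar.get(category, []):
            if counts.get(candidate, 0) >= min_examples:
                ties[category] = candidate
                break
    return ties


def expand_probabilities(trained_probs, ties):
    """Give each tied category the probability of the category it is tied to."""
    probs = dict(trained_probs)
    for tied, target in ties.items():
        probs[tied] = trained_probs[target]
    return probs


# Hypothetical example counts for three categories.
counts = {"$ret<i:": 900, "$alvden<i:": 12, "<i:>": 2000}
similar = {"$alvden<i:": ["$ret<i:"]}
ties = tie_sparse_categories(counts, min_examples=100, similar=similar)
print(ties)  # {'$alvden<i:': '$ret<i:'}

# Network outputs for one frame cover only the trained categories.
frame_probs = {"$ret<i:": 0.25, "<i:>": 0.5}
print(expand_probabilities(frame_probs, ties))
```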
 

2.5 Training the Network

2.5.1 Generating Data
Once the examples to train on have been found, and the number of training examples per category has been determined, the actual data that will be trained on are collected and stored in a "vector file". This vector file contains, for each training example, the acoustic features that will be input to the neural network and the target category that the network is supposed to learn. (One set of training features and the target category is called a "vector"; it can also be called an "example".)

2.5.2 Number of Hidden Nodes
At CSLU, we use 3-layer feed-forward networks. The number of input nodes is the number of acoustic features, and the number of output nodes is the number of categories to be trained on. The designer of a recognizer must decide how many hidden nodes the network should have; in general, we have found 200 to 300 hidden nodes to be a reasonable number.

2.5.3 Negative Penalty
When using a large number of examples per category, it is nearly inevitable that some categories will have far fewer examples than others, making it difficult to learn these sparse categories. This difficulty in training is due to the fact that there are many more negative examples than positive examples for a sparse category, where negative examples are examples for which the category being trained on has a target value of 0, and positive examples are examples for which the category being trained on has a target value of 1. As a result, these sparse categories often have very small output values that don't reflect the actual posterior probabilities that we want to obtain. To adjust for this, the amount that each negative example contributes to the total error is weighted by a value proportional to the number of examples in that negative category; this value is called a "negative penalty". Training can be done either with or without this negative penalty. A more thorough discussion of the negative penalty can be found in a paper by Wei and van Vuuren from ICASSP-98, "Improved Neural Network Training of Inter-Word Context Units for Connected Digit Recognition."

2.5.4 Number of Training Iterations
It is almost never necessary to continue training until the training error stops decreasing; the best performance on the development set will almost always occur at an earlier iteration. Often, best performance on the development set occurs after about 20 to 30 iterations, and so training is done for a fixed number of iterations, usually between 30 and 45.

2.5.5 Re-Training on Force-Aligned Data
As described above, forced alignment can be used to generate labels for training. In order to generate initial labels using forced alignment, we usually use a general-purpose recognizer. We can also use forced alignment to re-train a network; in this case, we use our current-best network to generate the forced-alignment labels and then train again using these new labels. This re-training often yields better results.

2.6 Evaluation

2.6.1 Word-Level Evaluation
Once we have trained for, say, 30 iterations, we need to determine which iteration has the best performance on the development set. To do this, we recognize each utterance in the development set using the network weights from each iteration and a Viterbi search. We evaluate the performance at each iteration in terms of substitution errors, insertion errors, and deletion errors.  The overall accuracy of a network iteration is defined to be 100% - (Sub + Ins + Del), where Sub is the percentage of substitution errors, Ins is the percentage of insertion errors, and Del is the percentage of deletion errors. We can also measure the "sentence-level accuracy", which is the number of utterances (or entire waveforms) recognized correctly divided by the total number of utterances in the development set.
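
As a worked illustration of these definitions, the following Python sketch counts substitutions, insertions, and deletions with a standard minimum-edit-distance alignment and then computes both accuracies. It is not the Toolkit's scoring code; it assumes that the error percentages are taken relative to the number of words in the reference transcriptions, and the function names and sample transcriptions are invented.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, insertions, deletions) for two word lists,
    taken from one minimum-edit-distance alignment."""
    R, H = len(reference), len(hypothesis)
    # cost[i][j] = (total errors, subs, ins, dels) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (H + 1) for _ in range(R + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        cost[i][0] = (i, 0, 0, i)          # only deletions
    for j in range(1, H + 1):
        cost[0][j] = (j, 0, j, 0)          # only insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            t, s, n, d = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (t, s, n, d)        # match, no new error
            else:
                diag = (t + 1, s + 1, n, d)
            t, s, n, d = cost[i][j - 1]
            ins = (t + 1, s, n + 1, d)
            t, s, n, d = cost[i - 1][j]
            dele = (t + 1, s, n, d + 1)
            cost[i][j] = min(diag, ins, dele)
    return cost[R][H][1:]


def word_accuracy(pairs):
    """100% - (Sub + Ins + Del), with percentages relative to reference words."""
    subs = ins = dels = n_ref = 0
    for ref, hyp in pairs:
        s, i, d = count_errors(ref.split(), hyp.split())
        subs, ins, dels = subs + s, ins + i, dels + d
        n_ref += len(ref.split())
    return 100.0 * (1 - (subs + ins + dels) / n_ref)


def sentence_accuracy(pairs):
    """Percentage of utterances whose word sequence is recognized exactly."""
    correct = sum(1 for ref, hyp in pairs if ref.split() == hyp.split())
    return 100.0 * correct / len(pairs)


# (reference, recognized) pairs: one substitution, one deletion, one insertion.
pairs = [("three five zero", "three nine zero"),
         ("five five", "five"),
         ("zero", "zero oh"),
         ("tea", "tea")]
print(round(word_accuracy(pairs), 1))   # 57.1  (3 errors in 7 reference words)
print(sentence_accuracy(pairs))         # 25.0
```

Note that word-level accuracy defined this way can be negative when the recognizer inserts many extra words.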

2.6.2 Choosing the Best Iteration

Usually, we choose the network iteration with the best word-level accuracy; if two iterations have equal word-level accuracy, we select the one with the greater sentence-level accuracy.
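
This selection rule amounts to comparing (word accuracy, sentence accuracy) pairs. A minimal sketch in Python, using made-up accuracy figures and a hypothetical function name (select_best.tcl performs the actual selection in the Toolkit):

```python
def choose_best_iteration(results):
    """Pick the iteration with the highest word-level accuracy,
    breaking ties by sentence-level accuracy.

    results: {iteration: (word_accuracy, sentence_accuracy)}
    """
    # Tuples compare element by element, so word accuracy is compared first.
    return max(results, key=lambda iteration: results[iteration])


# Hypothetical development-set results for four iterations.
results = {18: (95.1, 88.0), 22: (95.6, 89.5), 27: (95.6, 90.2), 30: (95.3, 90.0)}
print(choose_best_iteration(results))  # 27
```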

2.6.3 Testing
Once we have finished developing a recognizer, we evaluate the final performance on the test set, in terms of word-level and sentence-level accuracy. It is important, however, that once evaluation is done on the test set, the recognizer is not further modified based on these test-set results. In order to ensure that such modifications are not done, the test set is usually reserved until just before the recognizer is put into general-purpose use (or just before publishing results in a journal or at a conference).
 

3. Overall Procedure

Given the background described in the previous section, the process of training a recognizer becomes relatively simple. This section gives the "recipe" for this training process.  

3.1 Create Descriptions

The first step is to create a description of the recognizer and describe how the data will be selected for training. The files that need to be created are:

corpora file
Create a "corpora.txt" file if one doesn't yet exist. The corpora.txt file contains a master list of each corpus, and the location and format of the files in that corpus. The format of this file is given below; there is no automated way of generating this file, but it is easy to modify by hand. The same corpora file can be used for all training tasks.
info files
Create "info" files for training, development, and testing. These info files must be created by hand; the format is given below in Section 5. An info file contains all of the information that is necessary to find examples for training, development, or testing. This info file includes the partition (train, develop, test), how to select the data for the required partition (i.e. filtering parameters, as described above), the basename of the recognizer, the minimum number of examples requested for each category, and corpus-dependent information. One info file is required for each of the tasks of training, re-training using forced alignment, development, and testing.
grammar file
Create a "grammar" file that specifies the grammar that will be used to recognize words.  The format of a grammar file is a modification of the ABNF format published by the W3C.  The exact format used here is described in the Statenet documentation.
lexicon file
Create a "lexicon" file that specifies the pronunciation of each word in the grammar.  The format of a lexicon file is given below.
parts file
Create a "parts" file, which specifies how many parts to split each phoneme into, and what context clusters to use. Once again, this must be created by hand, and the format is given in Section 5.

3.2 Find Data

Given the files created above, the scripts to use in order to find data files for training are:

find_files.tcl
Use "find_files.tcl" to find files for training, development, and testing. This script must be called once for each set of files. At this stage, any filters are applied and the corpus is searched for files that are appropriate for the given partition (such as training or testing).
gen_spec.tcl
Use "gen_spec.tcl" to generate a specification file that contains a list of the categories to train on.  This script uses the info, grammar, lexicon, and parts files to create a "spec" file.  The specification file contains, in addition to the categories used by the recognizer for training and recognition, the specific frame size, sampling rate, the location of code used to compute acoustic features, the context clusters, and any phonetic mappings.
gen_catfiles.tcl
Use "gen_catfiles.tcl" to create time-aligned categories from text transcriptions or from phonetic time-aligned transcriptions. These categories are written to separate files with the extension ".cat", which are put in sub-directories that mirror the directory structure of the corpus (or corpora) being used.
revise_spec.tcl
Use "revise_spec.tcl" to (a) tie categories that don't have enough training examples to categories that do have sufficient examples, and (b) update the minimum and maximum duration parameters for each category.  This script relies on the output files created by "gen_catfiles.tcl", which indicate the number of examples available for each category, as well as the duration information.  The output of "revise_spec.tcl" is a modified "spec" file.

3.3 Select Data for Training

Once the files have been selected, the category files have been created, and the spec file is correct, then we can use the following scripts and programs to select frames for training:

pick_examples.tcl
Use "pick_examples.tcl" to select examples to train on.  The output of this script is an "examples" file, which is used directly by the next script, gen_examples.tcl.
gen_examples.tcl
Use "gen_examples.tcl" to create acoustic feature vectors and their associated category information, for each frame to be trained on.  This script creates a binary file with the extension ".vec" (for vectors of features).
checkvec.exe
Use "checkvec" to make sure that the data in the .vec file are valid.  This program also prints out the number of categories and the number of examples of each category.  The number of categories is needed when running nntrain.exe.

3.4 Train and Evaluate

nntrain.exe
Use "nntrain" to train the neural network, using the vector file as training data.  A weights file is written at each training iteration.
select_best.tcl
Use "select_best.tcl" to find the best iteration of the network using the set of development files.

3.5 Re-Train

Create force-aligned data using the best iteration of the network that was just trained. To do this, create an info file for forced alignment that specifies a new directory in which to put the category files and a forced-alignment script to use to create the new .cat files. Then use "find_files.tcl", "gen_spec.tcl", "gen_catfiles.tcl", and "revise_spec.tcl" to generate the force-aligned labels and create a new .spec file. Then repeat Sections 3.3 and 3.4 to create a network trained on this force-aligned data.

3.6 Evaluate Test Set

Use "select_best.tcl" to evaluate the final best network's performance on the test set. These are the final results that are acceptable for publication.
 
 

4. Complete Example

To illustrate the procedure described above, this section gives the example of training a continuous-speech digits recognizer. All commands should be entered using a DOS command window.  First make sure that the environment is properly set up as described in Section 1.1.   Text given in bold indicates commands that are typed from a command window; text in fixed-width font indicates the output from this command. In DOS, all commands must be entered on one line; if a backslash is used in the examples below to continue the command on another line, this must be typed as one line with no backslash when using DOS. The parameters for each script and program are explained in Section 6. The data files that are used in this example are located in a zip file available for downloading (make sure that you preserve the directory structure of the files in the zip file).  The configuration files (and two scripts, "fa.tcl" and "remap_tutorial.tcl") used in this tutorial have been put in a ZIP file, available here.  You may need to change some information in these files to reflect your directory structure or other information.  The changes that are needed should be clear as the tutorial progresses.  Section 5 describes the format of these files so that you can change them or create them from scratch later on, in order to train on another task or train using different parameters.

If you are familiar with the previous version of this training process, note that there are several differences.  The .vocab file has been replaced by two files, a .lexicon file and a .grammar file.  The format of the .lexicon file is similar to, but slightly different from, the format of the .vocab file, in order to be more consistent with ABNF style.  The .olddesc and .desc files have been replaced with a new format, called a .spec (specification) file.  The use of hscript.exe is no longer necessary.  There are other significant differences as well, but these may not be as noticeable.  If you successfully used the old version, but are having difficulties with the new version, please read the instructions carefully, as there may be subtle changes in the procedure.

[Step 1] In this initialization step, set up the directory structure that you will use. It is recommended that you create one directory for each "project", where a project contains all of the files created during the training of a network. For this example, we will be using a project directory called \tutorial\digit. Note that some files (vector files in particular) may take up a large amount of disk space; you may want to delete these files after you are finished using them. Now is a good time to make sure that your path contains the locations of the training scripts as well as the stand-alone C programs used for training. To check this, if you type "gen_spec.tcl" in your project directory, you should get the following:

and if you type "checkvec" in your project directory, you should get the following:

If you don't get these responses, contact the person who installed the CSLU Toolkit to find the location of the "script\training_1.0" directory and the "bin" directory within the Toolkit directory hierarchy.  The default locations are C:\Program Files\CSLU\Toolkit\2.0\script\training_1.0 and C:\Program Files\CSLU\Toolkit\2.0\bin.  Modify your "path" environment variable to include the correct paths, as described in Section 1.1.

[Step 2] Create a corpora file, called "corpora.txt". For this tutorial, the corpora.txt file might look like this (assuming that the tutorial data are stored in \tutorial\data):

The "format" field specifies the format of files in this corpus, using a regular expression.  The parentheses are used in combination with the "ID:" field to determine the speaker ID associated with a file.  (And, in turn, the speaker ID is used to make sure that the three partitions of training, development, and test data are speaker-independent.)  It is probably also a good idea to make sure that your filenames have the same format as specified in the "corpora" file; the format is case sensitive, so NU-78.zipcode.wav is different from nu-78.zipcode.wav.  Also, note that the path names are specified using a forward slash (unix style) instead of a backslash (MS style).
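As a rough illustration of how a parenthesized group in the "format" pattern can yield a speaker ID, here is a minimal Python sketch. The pattern shown is hypothetical (it is not copied from the tutorial's corpora.txt), and Python's "re" module stands in for the Toolkit's own regular-expression handling, so details may differ.

```python
import re

# Hypothetical "format" pattern; the parenthesized group captures the speaker ID.
# The actual pattern in corpora.txt may look different.
FORMAT = re.compile(r"NU-([0-9]+)\.[a-z]+\.wav")

def speaker_id(filename):
    """Return the speaker ID captured by the pattern, or None if the name does not match."""
    m = FORMAT.fullmatch(filename)
    return int(m.group(1)) if m else None

print(speaker_id("NU-78.zipcode.wav"))  # name follows the pattern
print(speaker_id("nu-78.zipcode.wav"))  # lower-case prefix: the pattern is case sensitive
```

The second call finds no match because the pattern is case sensitive, which is why filenames should follow the format exactly.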

[Step 3] Create info files for training, development, and testing. They will be called digit.train.info, digit.dev.info, and digit.test.info. We will only request up to 200 examples per category, so that this tutorial doesn't take more time than necessary to run through. If one wanted to maximize accuracy, it would be better to use all available examples. To specify all examples, use the keyword ALL instead of 200 in the "want:" field in digit.train.info.

For the digit.train.info file, we are specifying that we want training data from the Numbers corpus, and we will put time-aligned category labels in the numbers_train subdirectory (specified in the "partition:", "name:", and "cat_path:" fields). We require the presence of waveform, phonetically-labeled, and text transcription files in order to do this (specified in the "require:" field with "w" to require the waveform, "p" to require the phonetically-labeled files, and "t" to require the word-level text transcription files), and we'll use 3/5 of available files (specified in the "partition:" field, where the first part, {expr $ID % 5}, maps the speaker ID to one of five values (0 through 4), and the second part, {0 1 2}, selects values 0, 1, and 2 for training). We won't skip over any files (specified in the "filter:" field, where "1+1" takes all files), but we will require that all of the words in the text file are words that we want to recognize (specified in the "lexicon:" field with the lexicon file that contains all of the target words). We will remap the hand-labeled phonetic files (which can have a high degree of variability in the phoneme identities used to represent a word) to a consistent set of phonemes using the remap_tutorial.tcl script (specified in the "remap:" field, which specifies that "remap_tutorial.tcl" will be executed to do this remapping).  In addition, we specify that the sampling frequency of the waveforms is 8000 Hz, and the recognizer will use a 10-msec frame rate (in the "sampling_freq:" and "frame_size:" fields).  The "min_samp:" field has no effect when using only one corpus; this field, and all other fields, are explained in more detail in the description of the info file format.
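The effect of the "partition:" field can be sketched in a few lines of Python. The training values {0 1 2} come from the description above; the values used here for development (3) and testing (4) are assumptions made only for illustration, since the corresponding info files are not shown.

```python
def partition_of(speaker_id, n_buckets=5, train=(0, 1, 2), dev=(3,), test=(4,)):
    """Map a speaker ID to a partition name, mimicking {expr $ID % 5}.

    The dev and test bucket values are assumed for this sketch.
    """
    bucket = speaker_id % n_buckets
    if bucket in train:
        return "train"
    if bucket in dev:
        return "dev"
    if bucket in test:
        return "test"
    return None  # speaker not used in any partition

# Every file from one speaker lands in the same partition,
# which keeps the three partitions speaker-independent.
for sid in (10, 78, 14):
    print(sid, partition_of(sid))
```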

[Step 4] Create a grammar file, called digit.grammar.  This file contains the grammar that the recognizer will use.  In this case, the grammar specifies that a digit is any one of the words "zero", "oh", "one", ... "nine".  It also specifies that the top-level grammar (using the default symbol $grammar) allows an optional "separator" word called "sep*" (which may be pause or "garbage"), followed by one or more repetitions of a digit followed by an optional separator, and finally ending with an optional separator.

type digit.grammar
$digit   = zero | oh | one | two | three | four | five | six |
           seven | eight | nine;
$grammar = [sep*%%] ($digit [sep*%%])<+> [sep*%%];
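To make the word order allowed by this grammar concrete, the following Python sketch checks whether a word sequence fits the pattern "optional separator, one or more digits each followed by an optional separator, optional final separator". It is only an illustration of the structure described above: it ignores the "%%" annotation, and it is not how the Toolkit parses or applies grammar files.

```python
import re

DIGITS = {"zero", "oh", "one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine"}

# One character per word: "d" for a digit word, "s" for the separator word.
PATTERN = re.compile(r"s?(ds?)+s?")

def accepts(words):
    """Return True if the word sequence fits the digit grammar's structure."""
    symbols = []
    for w in words:
        if w in DIGITS:
            symbols.append("d")
        elif w == "sep*":
            symbols.append("s")
        else:
            return False  # word is not in the vocabulary
    return PATTERN.fullmatch("".join(symbols)) is not None

print(accepts(["sep*", "one", "two", "sep*"]))
print(accepts(["sep*"]))
```

A sequence consisting only of a separator is rejected, because at least one digit is required.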

[Step 5] Create a lexicon file, called digit.lexicon. This file contains the target words and their pronunciations.  Here you can see that the "sep*" word has been defined as pause, followed by optional garbage, followed by another pause.  Also, the remapping script will map all occurrences of the phoneme sequence /oU 9r/ (which occurs in the word "four") to the symbol />r/, because these two phonemes are heavily coarticulated and may be better represented as one phoneme.  Because the ">" is a pre-defined symbol that can be used in the grammar (to specify a repeat operator, among other things), the .lexicon file and .parts file must precede this symbol with a backslash to indicate that it is a phoneme symbol and not a grammar symbol, leading to the symbol "\>r" for the representation of the vowel and final consonant in the word "four".
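The merging of /oU 9r/ into />r/ can be pictured with a short Python sketch that works on a plain list of phoneme symbols. This is a simplified stand-in: the actual remap_tutorial.tcl script operates on time-aligned label files, and the sample pronunciation below is hypothetical.

```python
def merge_sequence(phonemes, pattern=("oU", "9r"), replacement=">r"):
    """Replace every occurrence of the phoneme sequence `pattern` with one symbol."""
    out = []
    i = 0
    n = len(pattern)
    while i < len(phonemes):
        if tuple(phonemes[i:i + n]) == tuple(pattern):
            out.append(replacement)
            i += n  # skip over the whole matched sequence
        else:
            out.append(phonemes[i])
            i += 1
    return out

# Hypothetical phoneme string for the word "four".
print(merge_sequence(["f", "oU", "9r"]))
```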

[Step 6] Create a parts file, called digit.parts. This contains the number of parts that each phoneme will be split into, the groupings of phonemes into clusters of similar phonemes, and mappings from one phoneme to another symbol.  In this case, the unvoiced closures /tc/ and /kc/ are mapped to the single symbol /uc/, which we hereby define as a "generic" unvoiced closure.  We then train on the /uc/ symbol, although we specify word pronunciations using /tc/ and /kc/. 

[Step 7] Run find_files.tcl in order to find files suitable for training. The output is written to digit.train.numbers.files; this filename is constructed from the basename, the partition, and the corpus. The reason that the user doesn't specify the output filename on the command line is that it is possible, when using several corpora, to create several output files; it seems easier to have the filenames automatically determined than to have the user specify one filename for each corpus.
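The naming convention for the output file can be written down as a one-line Python function. This is a sketch of the convention as described here, not code taken from find_files.tcl.

```python
def files_name(basename, partition, corpus):
    """Build the output filename from the basename, the partition, and the corpus."""
    return ".".join([basename, partition, corpus, "files"])

print(files_name("digit", "train", "numbers"))
print(files_name("digit", "dev", "numbers"))
```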

Then, run find_files.tcl a second and third time to find files suitable for development and testing:

[Step 8] Run gen_spec.tcl to determine the context-dependent categories that will be classified by the recognizer. The input files are the info, grammar, lexicon, and parts files. The output file is the spec file; this specification file contains not only the list of the context-dependent categories, but also some other information about the recognizer that we will be creating.

[Step 9] Run gen_catfiles.tcl to take the list of files for training (digit.train.numbers.files) and create time-aligned labels of categories to train on. The input file (other than digit.train.info and corpora.txt) is digit.train.numbers.files. If specified in digit.train.info, the script in the "remap:" field will be used, or the script in the "force_cat:" or "force_phn:" fields will be used (in this case, we haven't specified the "force_cat:" or "force_phn:" fields because we are not yet doing forced alignment). The category label files that are created are stored in the directory that is specified in digit.train.info in the "cat_path:" field.  The gen_catfiles.tcl script also creates two other output files: the "dur" file and the "counts" file. The dur file contains minimum and maximum duration limits for each category, as determined from the category label files; the counts file lists the number of occurrences (and total time in msec) of each category.

This script may generate messages as it runs. These are simply messages to the user that some labels are being merged, deleted, or ignored when converting from hand labels to categories. These messages come from the remapping script, in this case remap_tutorial.tcl. No action needs to be taken by the user. At the end, for each category, the duration that is at the bottom 2nd percentile of all durations for that category is written to the dur file as the minimum duration, and the longest duration of the category is written to the dur file as the maximum duration. These limits help the Viterbi search refrain from inserting very short or very long words during recognition.
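The duration limits described above can be sketched in Python as follows. The percentile convention (nearest rank on the sorted durations) is an assumption of this sketch; gen_catfiles.tcl may compute the 2nd percentile slightly differently.

```python
def duration_limits(durations_ms, percentile=2):
    """Return (minimum, maximum) duration limits for one category.

    minimum: the duration at the given bottom percentile (nearest-rank, assumed)
    maximum: the longest observed duration
    """
    if not durations_ms:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_ms)
    # Nearest-rank percentile using integer arithmetic: ceil(percentile * n / 100).
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1], ordered[-1]

# 100 made-up durations: 10, 20, ..., 1000 msec.
print(duration_limits(list(range(10, 1010, 10))))
```

Using a low percentile rather than the absolute shortest duration keeps a single unusually short label from setting the minimum.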

[Step 10] Run revise_spec.tcl to make sure that we have enough examples of each category to train on, and to add duration limits to the spec file. If there are not enough examples of a category, this script allows us to tie these categories to categories with more examples. This is the only interactive script in the entire training and recognition process. The input files are the input spec file, the counts file, and the dur file. The output of this script is a new spec file that contains category tie information and duration limits information.

In this case, we have tied the schwa vowel in the context of various subsequent phonemes to the schwa in the context of following silence; /ei/ in the context of a preceding dental to /ei/ in the context of a preceding alveolar, and other changes.  In general, if in doubt and a context-independent category exists (e.g. <ei>), then it is acceptable to tie to this context-independent category.  Because these are categories that are infrequent in the training data, it is also not very likely for these categories to be used in recognition, and so the recognizer should not be very sensitive to the tie categories that are selected.

[Step 11] Run pick_examples.tcl to select frames for training, from the files created by gen_catfiles.tcl. The input files are digit.train.info, corpora.txt, digit.train.spec, and digit.train.numbers.files.   The output of this script is the file digit.train.examples, which contains an ASCII list of files, the frames to be used in each file, and the categories corresponding to these frames.

[Step 12] Run gen_examples.tcl to compute acoustic features for all of the frames given in digit.train.examples. The input files are digit.train.info, digit.train.spec, and digit.train.examples. The computed features and the associated target category values are stored in the binary output file digit.train.vec. Note that if you want to use features that are different from the standard features, you can write the Tcl code used to create the new features; the location of your code can be specified in the info file using the "featuresURI" and "contextURI" fields.  Also, the description of the format of the vector file given in Section 5 may be of interest.

[Step 13] Run checkvec.exe to make sure that the vector file that has been created has the correct format, and that every category has at least one example to train on. The numbers in the left column are the values corresponding to each category (from 1 to the total number of categories), and the numbers in the right column are the number of examples for each category. The input file is digit.train.vec; the only output goes to the screen for the user to check, but may be piped to a file using "> checkvec_output.txt" at the end of the DOS command.

[Step 14] Run nntrain.exe to train the neural network on the vector file digit.train.vec. This program creates a weights file at each iteration; we will select the best weights file after training for 30 iterations. The -l option indicates that the negative penalty will be adjusted to compensate for varying numbers of examples per category; -sn 88 and -sv 88 are random-number seeds; -f digitnet specifies the basename "digitnet" for the output files; -a 3 130 200 128 specifies the architecture of the net: 3 layers, with 130 nodes in the first layer, 200 nodes in the hidden layer, and 128 nodes in the output layer. The value 30 specifies training for 30 iterations, and the last parameter is the vector file to use for training.  Output files will, in this case, be called digitnet.0, digitnet.1, digitnet.2, ... , digitnet.30, with one output file for each iteration.

Notes: For specifying the architecture, note that the number of nodes in the first layer will always be 130 for the standard feature set. The number of hidden nodes is decided by the user, but 200 is a reasonable number. The number of output nodes (128 in this case) must match the largest value in the left column of the output of checkvec.exe.  The number 128 used in this example may change, depending on the number of states that have been tied and the information in the .grammar, .lexicon, and .parts files. 
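If the checkvec output has been saved to a text file, the number of output nodes can be read off automatically. The Python sketch below assumes only what is stated above: a left column of category values and a right column of example counts. The sample text is made up to show the layout and is not real checkvec output.

```python
def summarize_counts(text):
    """Return (number of output nodes, list of categories with no examples)."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only lines that consist of exactly two integers.
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            counts[int(fields[0])] = int(fields[1])
    if not counts:
        raise ValueError("no category counts found")
    n_outputs = max(counts)
    missing = [c for c in range(1, n_outputs + 1) if counts.get(c, 0) == 0]
    return n_outputs, missing

def architecture_args(n_outputs, n_inputs=130, n_hidden=200):
    """Format the values that follow the -a option: layers, input, hidden, output."""
    return f"3 {n_inputs} {n_hidden} {n_outputs}"

sample = "1 200\n2 150\n3 0\n4 75\n"  # hypothetical layout, made-up numbers
print(summarize_counts(sample))
print(architecture_args(128))
```

Any category reported in the second return value has no training examples and should be investigated before training.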

The only input file to nntrain.exe is the vector file; the output files are the neural-network weights files for each iteration (the default names are nnet.X, where X is an integer from 0 to the number of iterations).

[Step 15] Run select_best.tcl to evaluate the performance of each iteration (weight file) on the development-set data. This script may take a long time, especially if there are many files in the development set. This script calls two other scripts, "asr_multi.tcl" and "eval_results.tcl", which are located in the same directory as select_best.tcl, namely ...\CSLU\Toolkit\2.0\script\training_1.0.  The input files are the neural-network files created by nntrain.exe, the digit.dev.numbers.files file and all waveform and text files specified in digit.dev.numbers.files, the grammar file, the lexicon file, and the spec file. The output files are ali files (with basename "wrdalign_digitnet" in this example) and a summary file.  The summary file shows the performance on each iteration (the WrdAcc% column shows word-level accuracy, and the SntCorr column shows the percentage of "sentences" (entire digit sequences in this case) correctly recognized), as well as the resulting best iteration.

Note that training on only 200 examples per category has a negative influence on results; when trained using all available examples, instead of 200 per category (using the keyword "ALL" instead of 200 in digit.train.info), results on the same development data were 97.07% word accuracy and 88.00% sentence accuracy. The drawback to training on all examples is that pick_examples.tcl, gen_examples.tcl, and especially nntrain.exe take longer to execute. 

[Step 16] Now we have finished the first cycle of training. If we are happy with the level of performance on the development set, we can stop the training process and evaluate on the test set (Step 19).  If we want to try to improve performance on the development set, we can do another cycle of training using force-aligned data. We can create another .info file for doing forced alignment, using the training file as a template. This new file will be called digit.trainfa.info:

(Note that in the "force_cat:" field, the script and associated parameters are specified on two lines. No special marker (such as a backslash) is required.)

We have changed the partition name (to "trainfa") and the path for category files (to "numbers_trainfa"). Also, by specifying "require: wt", we will now require the existence of .wav files and .txt files but not .phn files (because we will create time-aligned phonetic labels from the text transcriptions using the lexicon file and forced alignment). We also add a new field to the corpus description, indicating that we want to do forced alignment and create labels at the "category" level (as opposed to the phoneme level). Also note that this line specifies using iteration 17 from the training we just finished, since iteration 17 had the best word-level performance. Because we are doing forced alignment, it is no longer necessary to use the remapping script that re-maps labels created by hand to the set of labels used by our recognizer.

[Step 17] Now we once again find the files we want to use for training by running find_files.tcl, and then we generate category-level time-aligned labels by running gen_catfiles.tcl. As part of the process of creating category-level label files, we also automatically create new dur and counts files. Finally, we create a new spec file with the new information in the dur and counts files using revise_spec.tcl.  In this case, we don't tie categories with 3 or more occurrences, more for demonstrating the options available than for any theoretically sound reason.

[Step 18] Then, we repeat the training steps to train and select the best force-aligned network:

Note that the neural network weights files have the basename "digitfanet". The results from select_best are:

With the development set, it is acceptable to vary the system parameters to try to maximize performance.  For example, the insertion rate is slightly higher than the deletion rate in this example.  So, performance might improve by reducing the value of the garbage parameter from 10 to, say, 7, so that the insertion rate will decrease.  (This will often cause the deletion rate to increase... the objective here is to get the lowest combined error rate, which often occurs when the insertion and deletion error rates are nearly equal.)  If we try it:

select_best.tcl digitfanet digit.dev.numbers.files digit.grammar \
     digit.lexicon digit.trainfa.spec digit.devfa.summary -g 7 -b 15

we see that the best word-level accuracy of 95.27% is slightly worse than the accuracy of 95.95% with garbage value of 10.  So, we leave the garbage value set at 10.
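To see how substitutions, deletions, and insertions combine into a word-level accuracy, the Python sketch below aligns a reference word string with a recognized word string using a minimum-edit alignment and then computes 100 * (N - S - D - I) / N, where N is the number of reference words. This is the conventional definition of word accuracy and is an assumption here; the scoring performed by eval_results.tcl may differ in detail.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total errors, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)      # all reference words deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)      # all recognized words inserted
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n)            # correct word
            else:
                diag = (e + 1, s + 1, d, n)    # substitution
            e, s, d, n = cost[i - 1][j]
            dele = (e + 1, s, d + 1, n)        # deletion
            e, s, d, n = cost[i][j - 1]
            ins = (e + 1, s, d, n + 1)         # insertion
            cost[i][j] = min(diag, dele, ins)
    return cost[-1][-1][1:]

def word_accuracy(ref, hyp):
    """Word accuracy in percent: 100 * (N - S - D - I) / N."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    s, d, i = align_counts(ref, hyp)
    return 100.0 * (len(ref) - s - d - i) / len(ref)

print(align_counts("one two three".split(), "one three".split()))
print(word_accuracy("one two".split(), "one oh two".split()))
```

Because insertions count against accuracy just as deletions do, a parameter change that lowers one rate while raising the other only helps if the combined error count goes down.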

[Step 19] The resulting network, digitfanet.17, is the final network. The last step is to evaluate this network on the test set.  Here we run select_best.tcl again, but only evaluate on network iteration 17 using the "-o 17" option.

In this case, the output is

Here, the final result of 96.40% word accuracy is slightly better than the word accuracy on the development set, and the sentence-level accuracy of 78.95% is slightly worse than the sentence-level accuracy on the development set.  Usually, the performance on the test set is slightly worse than the performance on the development set for both word-level and sentence-level accuracy, because the development-set performance is the maximum performance over a number of iterations and/or garbage values, while test-set performance reflects a single evaluation meant to indicate performance that can be expected of a final system on unseen data. 
 

5. File Formats

In the following file formats, text in fixed-font bold is a keyword that must be used verbatim. Italicized items in brackets <> must be substituted with the proper values.
 

wav file
A wav file contains the speech waveform that is to be trained on or recognized. The format for wav files may be either Microsoft .wav format or NIST Sphere ulaw format.
 

txt file
A txt file contains a text transcription of the words in a speech waveform. This file is simply an ASCII file containing the words separated by spaces, and it can be created by any text editor that outputs ordinary .txt files.
 

label files (.phn, .cat, .wrd)
Label files, which usually have the extension .phn, .cat, or .wrd, contain time-aligned labels of a waveform utterance. If the file has the extension .phn, then the labels are phonetic labels; if the file has the .cat extension, then the labels are neural-network output categories (usually context-dependent sub-phonetic units); and if the file has the extension .wrd, then the labels are words. A label file has the following format:

where:

The values for <begin_time> and <end_time> are measured in frames (so if <value> is 1.0, then time is measured in milliseconds; if <value> is 10.0, then time is measured in centi-seconds). The <end_time> of one label is usually the same as the <begin_time> of the next label.
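The relationship between label times and milliseconds can be illustrated with a small Python sketch: each time is multiplied by <value>. The triples below are hypothetical in-memory labels, not a parsed label file, and the label names are placeholders.

```python
def to_milliseconds(labels, value):
    """Convert (begin_time, end_time, label) triples to milliseconds.

    `value` is the number of milliseconds per time unit, as described above.
    """
    return [(begin * value, end * value, name) for begin, end, name in labels]

# With a value of 10.0, a time of 12 corresponds to 120 msec.
print(to_milliseconds([(0, 12, "label1"), (12, 30, "label2")], 10.0))
```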
 

corpora file
The corpora file contains descriptions of all corpora:

where a <corpus description> has the following format:

where:

info file
An info file has the following format:

The number of corpora specified in an info file is theoretically unlimited, but there must be at least one. Note that each field has a semicolon at the end.