Training Neural Networks for Speech Recognition

John-Paul Hosom, Ron Cole, Mark Fanty, Johan Schalkwyk, Yonghong Yan, Wei Wei
Center for Spoken Language Understanding (CSLU)

Oregon Graduate Institute of Science and Technology
February 2, 1999

Contents

1. Introduction 
2. General Concepts and Notation
    2.1 Quick Review of Frame-Based Speech Recognition
    2.2 Specifying Categories
    2.3 Example of Specifying Categories
    2.4 Finding Samples to Train On
        2.4.1 Overfitting and Datasets
        2.4.2 Filtering
        2.4.3 Finding Categories
        2.4.4 Number of Samples per Category
    2.5 Training the Network
        2.5.1 Generating Data
        2.5.2 Shuffling Data
        2.5.3 Number of Hidden Nodes
        2.5.4 Negative Penalty
        2.5.5 Number of Training Iterations
        2.5.6 Re-Training on Force-Aligned Data
        2.5.7 Forward-Backward (Embedded) Training
    2.6 Evaluation
        2.6.1 Word-Level Evaluation
        2.6.2 Choosing the Best Iteration
        2.6.3 Testing
3. Overall Procedure
    3.1 Create Descriptions
    3.2 Find Data
    3.3 Select Data for Training
    3.4 Train and Evaluate
    3.5 Re-Train
    3.6 Evaluate Test Set
4. Complete Example
5. File Formats
    wav files
    txt files
    label files
    corpora file
    cull file
    info file
    vocab file
    parts file
    desc file
    olddesc file
    files file
    dur file
    counts file
    pick file
    vec file
    neural-network files
    summary file
    ali files
6. Script and Program Usage
    browse.tcl
    categories.tcl
    checkvec.exe
    cull5.tcl
    find_best.tcl
    find_dur.tcl
    find_files.tcl
    force.tcl
    genvec.tcl
    gen_catfiles.tcl
    hmm_embed.tcl
    hnncheckvec.exe
    hnntrain.exe
    hscript.exe
    nntrain.exe
    pickframes.tcl
    recog.tcl
    recog_cslushdigit.tcl
    remap_genpur.tcl
    revise_desc.tcl
    shuffle.exe
    update_descdur.tcl

1. Introduction

This tutorial describes the method used at CSLU for creating neural-network-based speech recognizers. Included in this tutorial are some general concepts behind training a recognizer, step-by-step instructions on how to train a recognizer, and a description of Tcl scripts that can be used to automate parts of this process.  This process has been further automated by Ben Serridge at the Tlatoa speech group (http://info.pue.udlap.mx/~sistemas/tlatoa/), and their simpler training process (still using the CSLU Toolkit) is available at http://info.pue.udlap.mx/~sistemas/tlatoa/howto/train_nnet.html.   In addition, they have provided a hand-labeling guide and additional documentation at http://info.pue.udlap.mx/~sistemas/tlatoa/howto/hand_label.html and http://info.pue.udlap.mx/~sistemas/tlatoa/techdoc/techdoc.html.

In order to use the scripts mentioned in this tutorial, you must have the CSLU Toolkit installed on your machine. Make sure that your path includes the location of the Toolkit's stand-alone executable files (usually located in the "bin" directory, for example C:\CSLU\Toolkit\2.0\bin), the scripts used for training (usually located in the "script\training_1.0" directory, for example C:\CSLU\Toolkit\2.0\script\training_1.0), and the location of the Toolkit's "shlib" directory (usually C:\CSLU\Toolkit\2.0\shlib). In order to follow the example provided in this tutorial, you may want to use the same data files. These files have been put into a zip file containing all waveform and transcription files. The size of this compressed "zip" file is 7.6MB, and the size of all the data files is about 10MB. The CSLU Toolkit and corpora are free of charge for non-profit use (universities, high schools, and individuals may download the Toolkit at no charge). For more information on the Toolkit and CSLU corpora, visit our WWW site at http://speech.bme.ogi.edu/.

In this document, phonetic symbols are represented using  Worldbet, which is an ASCII encoding of the International Phonetic Alphabet (IPA) [J. Hieronymus, 1995].

The word phone is used extensively in this tutorial; according to  Webster's Dictionary, a phone is "a speech sound considered as a physical event without regard to its place in the sound system of a language." So, the word phone is used here to refer to the phonetic events that we want to classify, whether or not they correspond to phonemes in the language.

Please send all questions, comments, and bug reports to hosom at cslu dot ogi dot edu.


This work has been supported by an NSF "Graduate Research Traineeships" award (grant number 9354959) and the CSLU Member companies. The views expressed in this tutorial do not necessarily represent those of the sponsoring agency and companies.
 

2. General Concepts and Notation

The general steps to creating a neural-network based recognizer are:

  1. Specify the phonetic categories that the network will recognize.
  2. Find many samples of each of these categories in the speech data.
  3. Train a network to recognize these categories.
  4. Evaluate the network performance using a test set.

2.1 Quick Review of Frame-Based Speech Recognition

Frame-based speech recognition has the following steps, illustrated in Figure 1:

Figure 1. Overview of Frame-Based Speech Recognition using Neural Networks.

  1. Divide the waveform into  frames, where each frame is a small segment of speech that contains an equal number of waveform samples. In this tutorial, we will assume a frame size of 10msec.
  2. Compute features for each frame. These features usually describe the spectral envelope of the speech at that frame and at a small number of surrounding frames.
  3. Classify the features in each frame into phonetic-based categories using a neural network. The outputs of the neural network are used as estimates of the probability, for each phonetic category, that the current frame contains that category.
  4. Use the matrix of probabilities and a set of pronunciation models to determine the most likely word(s). Searching is done with a Viterbi search.
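
As an illustration of step 1, the following is a minimal Python sketch of dividing a waveform into 10-msec frames. It is not part of the CSLU Toolkit; the function name split_into_frames and the 8000 Hz sampling rate are assumptions made for this example.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_msec=10):
    """Divide a waveform into consecutive frames of equal length.

    The 8000 Hz default is an assumption for this sketch.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_msec // 1000  # samples per frame
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of silent audio at the assumed sampling rate
frames = split_into_frames([0] * 8000)
print(len(frames), len(frames[0]))
```

Each frame produced this way would then be passed to feature computation (step 2), which is not sketched here.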

For a more detailed explanation, see the on-line recognition tutorial.
 

2.2 Specifying Categories

In order to determine the categories that the network will classify, the following three things need to be done:

  1. The designer of the recognizer needs to determine the pronunciations for each of the words that will be recognized. More accurate pronunciation models will generally yield better recognition rates.
  2. Quite often, we also use context-dependent phone models, which means that one phone is classified differently depending on the phones that surround it (for example, an /aI/ following a /w/ is classified differently from an /aI/ following an /h/). The surrounding context may contain a group of phones or just a single phone. (Using groups of phones reduces the number of categories that need to be classified.) The grouping of phones into clusters of similar phones must be done by the person designing the recognizer.
  3. Finally, when constructing context-dependent phone models, we divide each phone to be recognized into one, two, or three parts. Each sub-phone segment corresponds to one category to be recognized. If we keep a phone as one part, then it is used without the context of surrounding phones. If we divide it into two parts, then the left half of the phone model (the left sub-phone) is dependent on the preceding phone, and the right half of the phone model (the right sub-phone) is dependent on the following phone. If the phone is split into three parts, then the first third is dependent on the preceding phone, the middle third is independent of surrounding phones, and the last third is dependent on the following phone. One final option is to keep the phone as one part, but make it dependent on the following phone; this is called a right-dependent phone and it is used mostly for stop consonants. The designer of the recognizer needs to decide how many parts each phone will be split into.

Figure 2 shows an illustration of this kind of context-dependent modeling. In this figure, an example is given for the modeling of the word "yes", written in Worldbet as /j E s/. Here, the /j/ is split into two parts, the /E/ is split into three parts, and the /s/ is split into two parts. There are eight groups of phones used for contexts; each group represents a broad category of sounds. For the vowel /E/ in a general-purpose recognizer, there are eight categories for the left third, one category for the middle third, and eight categories for the right third, yielding a total of 17 categories for this 3-part phone. The /j/, on the other hand, would have 16 categories in a general-purpose recognizer, because it is split into only left and right halves.

Figure 2. Context-Dependent Modeling

The context-dependent phonetic categories that the network will be trained on can be determined from the phonetic-level pronunciation models, the groupings of phones into clusters of similar phones, and the number of parts to split each phone into.
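
The category counts quoted for /E/ and /j/ follow from simple arithmetic, sketched below in Python. The function name is hypothetical, and the counts are upper bounds that assume every context group can occur on each side of the phone, as in a general-purpose recognizer.

```python
def categories_per_phone(parts, n_context_groups):
    """Return the number of categories contributed by one phone.

    parts is 1, 2, 3, or "r" (right-dependent).
    Assumes every context group can occur next to the phone.
    """
    if parts == 1:
        return 1                         # context-independent
    if parts == 2:
        return 2 * n_context_groups      # left half + right half
    if parts == 3:
        return 2 * n_context_groups + 1  # left third + middle + right third
    if parts == "r":
        return n_context_groups          # one part, depends on following phone
    raise ValueError("parts must be 1, 2, 3, or 'r'")


print(categories_per_phone(3, 8))  # the vowel /E/
print(categories_per_phone(2, 8))  # the glide /j/
```

With eight context groups this gives 17 categories for a three-part phone and 16 for a two-part phone, matching the figures above.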

2.3 Example of Specifying Categories

To give an example of how these three items can be determined, we'll use the example of recognizing the isolated words "three", "tea", "zero", and "five".

First, we can come up with some initial pronunciations:

     word       pronunciation
     three      T 9r i:
     tea        tc th i:
     zero       z i: 9r oU
     five       f aI v

We may want to modify these pronunciations, because the /i:/ in "zero" is often pronounced differently from the /i:/ in "three" and "tea". To account for this difference in pronunciation, we can use our own symbol, /i:_x/, to represent the front vowel in "zero". Making this change gives us the following pronunciation models:

     word       pronunciation
     three      T 9r i:
     tea        tc th i:
     zero       z i:_x 9r oU
     five       f aI v

Next, we will determine the number of parts to use for each phone. In the table below, "1" means that the phone will be context-independent, "2" means that the phone will be split into two parts, "3" means that the phone will be split into three parts, and "r" means that the phone will be "right-dependent":

     phone      parts
     T          1
     9r         2
     i:         3
     tc         1
     th         r
     z          1
     i:_x       2
     oU         3
     f          1
     aI         3
     v          1

Now, let's look at the /i:/ in "three" and "tea". In this case, the vowel /i:/ is the same, but it looks very different when it follows a /9r/ compared to when it follows a /th/ (see Figure 3).


Figure 3. Example of vowel /i:/ in different contexts.

In this case, we make the left third of the /i:/ (since it is split into three parts) dependent on a preceding retroflex (/9r/) in one case and dependent on a preceding alveolar sound (/th/ or /z/) in the other case. We usually group the phones in a left or right context according to their broad phonetic category; for example, the following groupings can be used (the dollar sign indicates a variable that represents the group of listed phones):

  group   phones in group    description
  $bck   oU   back vowels
  $fnt   i: i:_x   front vowels
  $ret   9r   retroflex sounds
  $den   T v th z   dentals, labiodentals, and alveolars
  $sil   .pau tc   silence or closure

But notice that it then becomes difficult to classify diphthongs such as /aI/, because the phone starts as a back vowel and ends as a front vowel. The current solution is to modify the categories in the following way:

  group   phones in group    description
  $bck_l   oU   back vowels to the left of a phone
  $bck_r   oU aI   back vowels to the right of a phone
  $fnt_l   i: i:_x aI   front vowels to the left of a phone
  $fnt_r   i: i:_x   front vowels to the right of a phone
  $ret   9r   retroflex sounds
  $den   T v th z   dentals, labiodentals, and alveolars
  $sil   .pau tc   silence or closure

 
First, we have added "_l" and "_r" to the variable names in question, to indicate whether the phones in this grouping occur on the left or right side of the phone being classified. Then, because /aI/ looks like a back vowel when it appears to the right of a phone, it has been put in the grouping $bck_r; because /aI/ looks like a front vowel when it appears to the left of a phone, /aI/ has also been put in the grouping $fnt_l. This method of grouping into left or right contexts is illustrated in Figure 4:


Figure 4. Illustration of labeling a diphthong in the word "five".

The format for specifying different categories is [left_context]<phone>[right_context], so for example the category for /.pau/ will be <.pau>, the category for the left third of /i:/ in the context of dental sounds will be $den<i:, the middle third of /i:/ will be <i:>, and the right third of /i:/ in the context of silence will be i:>$sil.
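
The naming convention can be made concrete with a small Python helper. This is only a sketch of the string format described above; the function category_name is hypothetical and is not the logic used by categories.tcl.

```python
def category_name(phone, left_context=None, right_context=None):
    """Format a category name as [left_context]<phone>[right_context].

    A sub-phone depends on at most one side, so giving both
    contexts is treated as an error in this sketch.
    """
    if left_context is not None and right_context is not None:
        raise ValueError("a category has a left or a right context, not both")
    if left_context is not None:
        return f"{left_context}<{phone}"   # left part of the phone
    if right_context is not None:
        return f"{phone}>{right_context}"  # right part of the phone
    return f"<{phone}>"                    # context-independent part


print(category_name(".pau"))
print(category_name("i:", left_context="$den"))
print(category_name("i:", right_context="$sil"))
```

These three calls reproduce the examples given in the text: <.pau>, $den<i:, and i:>$sil.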

Given all this information, it can easily (if tediously) be determined that the 28 categories we need to train on are:

<.pau>
$den<9r
$fnt_l<9r
9r>$bck_r
9r>$fnt_r
<T>
f<aI
<aI>
aI>$den
<f>
$den<i:
$ret<i:
<i:>
i:>$den
i:>$sil
i:>f
$den<i:_x
<i:_x>
i:_x>$ret
$ret<oU
<oU>
oU>$den
oU>$sil
oU>f
<tc>
th>$fnt_r
<v>
<z>

In the following sections, a Tcl script called "categories.tcl" is described; this script can be used to automate the process of determining categories.
 

2.4 Finding Samples to Train On

2.4.1 Overfitting and Datasets
As we train a network, we keep adjusting the neural network weights to minimize the error in our training data. For each adjustment of the weights, we have a new iteration (or epoch) in the training process. We can keep generating new iterations until the error no longer decreases. At this point, we have learned the training data to the extent that it is possible.

However, when we train a neural network, we aren't interested in learning the training data. Instead, we are interested in learning the general properties of the training data. By learning the general properties of the data instead of the details that are specific to the training data, we are best able to classify a new utterance not in the training set.

In order to determine which iteration of network weights has best learned the general properties of the data, we use a separate (usually smaller) set of data to evaluate each iteration. This second set of data is called the "development" set (or cross-validation set). Because this development set has not been used to adjust the network weights during training, it can be used to evaluate the network's ability to recognize phonetic categories, as opposed to (possibly irrelevant) details in the training set. The larger this development set is, the more confidence we can have in the general classification properties of the network.

Once we have determined the best network, we need to evaluate its performance on a test set. In order to have an honest evaluation, the data in the test set must not occur in either the training set or the development set.

This means that given a corpus containing our target words, we must divide it into at least three parts: one part for training, one for development, and one for testing. If we have a large enough corpus, we may further divide the development set into subsets, so that as we evaluate and make modifications to our recognizer, we are not tuning performance to one set of development data.

Finally, at CSLU we leave 5% of our target corpus for independent, third-party evaluation. This 5% is culled from the entire corpus before dividing into training, development, or test sets.
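
The partitioning described in this section can be sketched in Python as follows. This is an illustration only: the random split, the 60/20/20 proportions, and the function name partition_corpus are assumptions, and the selection rules actually used by cull5.tcl and find_files.tcl may differ.

```python
import random


def partition_corpus(files, cull_fraction=0.05, train_fraction=0.6,
                     dev_fraction=0.2, seed=0):
    """Cull a fraction of the corpus, then split the rest three ways.

    The proportions and the random split are assumptions of this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)

    # The third-party evaluation data is removed before any other split.
    n_cull = round(len(shuffled) * cull_fraction)
    culled, rest = shuffled[:n_cull], shuffled[n_cull:]

    n_train = round(len(rest) * train_fraction)
    n_dev = round(len(rest) * dev_fraction)
    return {
        "cull": culled,
        "train": rest[:n_train],
        "dev": rest[n_train:n_train + n_dev],
        "test": rest[n_train + n_dev:],
    }


# Hypothetical file names, for illustration only
parts = partition_corpus([f"utt{i}.wav" for i in range(1000)])
print({name: len(members) for name, members in parts.items()})
```

The important property is that the four sets are disjoint, so that no utterance used for training or development can influence the test results.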
 
2.4.2 Filtering
When selecting data for training, development, and testing, we can apply various filters to reduce the amount of data. In one case, we may have utterances in our corpus that don't occur in our target vocabulary. In this case, we may want to filter so that words not in our vocabulary list are not included in our datasets. For example, if we are training a digits recognizer and we are using the CSLU Numbers corpus for training, we may want to remove out-of-vocabulary utterances that contain numbers such as "first", "twelve", and "fifty". In another case, we may have so much data that training or evaluation would take too long. In this case, we can filter so that we take every Nth utterance for use in our datasets, where N is some integer greater than 1. For example, we may want to take every sixth waveform for training our digits recognizer, because there are over 6000 waveforms available for training on digits. Filtering in this way will still leave over 1000 waveforms (or approximately 5000 examples of each digit) available for training.
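
Both kinds of filter are simple to express. The Python sketch below is illustrative only; the file names, transcriptions, and the exact digit vocabulary (including "oh") are hypothetical, and in the actual training process the filters are specified in the info file rather than written as code.

```python
def in_vocabulary(transcription, vocabulary):
    """True if every word of the transcription is in the vocabulary."""
    return all(word in vocabulary for word in transcription.lower().split())


def every_nth(items, n):
    """Keep every Nth item, starting with the first."""
    return items[::n]


# Hypothetical digit vocabulary and utterances
DIGITS = {"zero", "oh", "one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine"}
utterances = [
    ("utt1.wav", "three five"),
    ("utt2.wav", "fifty first"),
    ("utt3.wav", "zero nine"),
    ("utt4.wav", "twelve"),
    ("utt5.wav", "oh two"),
]

kept = [u for u in utterances if in_vocabulary(u[1], DIGITS)]
subset = every_nth(kept, 2)
print([name for name, _ in subset])
```

Here the vocabulary filter removes the two utterances containing "fifty", "first", and "twelve", and the every-Nth filter (with N = 2) then halves what remains.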
 
2.4.3 Finding Categories
Once we know which files we'll be using for training, we need to find samples of each category that we'll train on. This can be done in one of two ways: using data that has been hand-labeled at the phonetic level, or using forced alignment.

2.4.4 Number of Samples per Category
Finally, the designer of a recognizer must decide how many samples of each category to train on. Usually, networks with decent performance can be trained using up to 500 samples per category, but sometimes 2000 or more samples are used. In order to get best performance, all samples in the training set should be used. However, training with all samples may be very time-consuming.

If some categories have very few or no training samples, then there are two options. The first option is to use an additional corpus that contains samples of these infrequent classes. The second option is to "tie" these infrequent categories to phonetically similar categories that do have enough training samples. Categories tied in this way will not be trained on, and during recognition their probabilities will be set equal to the probabilities of the categories that they were tied to.
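
The effect of tying at recognition time can be sketched as follows. This Python fragment is an illustration, not Toolkit code; the function apply_ties, the probability values, and the particular choice of which category is tied to which are all hypothetical.

```python
def apply_ties(probabilities, ties):
    """Give each tied category the probability of the category it is tied to.

    probabilities: dict mapping trained category -> network output
    ties:          dict mapping tied category -> trained category
    """
    result = dict(probabilities)
    for tied_category, trained_category in ties.items():
        result[tied_category] = probabilities[trained_category]
    return result


# Hypothetical network outputs for one frame
frame_probs = {"oU>$den": 0.30, "oU>$sil": 0.05, "<oU>": 0.65}
# Hypothetical tie: "oU>f" had too few samples, so it borrows from "oU>$den"
print(apply_ties(frame_probs, {"oU>f": "oU>$den"}))
```

The network itself has no output node for the tied category; the copy is made only when the matrix of probabilities is handed to the search.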
 

2.5 Training the Network

2.5.1 Generating Data
Once the categories to train on have been found, and the number of samples per category has been determined, the actual data that will be trained on are collected and stored in a "vector file". This vector file contains, for each training sample, the features that will be input to the neural network and the target category. (One set of training features and the target category is called a "vector"; it is also called a "sample".)

2.5.3 Number of Hidden Nodes
At CSLU, we use 3-layer feed-forward networks. The number of input nodes is the number of spectral features, and the number of output nodes is the number of categories to be trained on. The designer of a recognizer must decide how many hidden nodes the network should have; in general, we have found 200 hidden nodes to be a reasonable number.

2.5.4 Negative Penalty
When using a large number of samples per category, it is nearly inevitable that some categories will have far fewer samples than others, making it difficult to learn these sparse categories. This difficulty in training is due to the fact that there are many more negative samples than positive samples for a sparse category, where negative samples are samples for which the category being trained on has a target value of 0, and positive samples are samples for which the category being trained on has a target value of 1. As a result, these sparse categories often have very small output values that don't reflect the actual posterior probabilities that we want to obtain. To adjust for this, the amount that each negative sample contributes to the total error is weighted by a value proportional to the number of samples in that negative category; this value is called a "negative penalty". Training can be done either with or without this negative penalty. A more thorough discussion of the negative penalty can be found in the paper by Wei and van Vuuren at ICASSP-98, "Improved Neural Network Training of Inter-Word Context Units for Connected Digit Recognition."

2.5.5 Number of Training Iterations
It is almost never necessary to continue training until the training error stops decreasing; the best performance on the development set will almost always happen much sooner. Usually, best performance on the development set occurs after 20 to 30 iterations, and so training is done for a fixed number of iterations, usually 30 to 40.

2.5.6 Re-Training on Force-Aligned Data
As described above, forced alignment can be used to generate labels for training. In order to generate initial labels using forced alignment, we usually use a general-purpose recognizer. We can also use forced alignment to re-train a network; in this case, we use our current-best network to generate the forced-alignment labels and then train again using these new labels. This re-training often yields better results.

2.5.7 Forward-Backward (Embedded) Training
One final method for improving results uses "forward-backward" or "embedded" training. In forward-backward training, the targets of the neural network are not binary values, but posterior probabilities. These probabilities are determined using the forward-backward algorithm, in which a previously-trained neural network is used to compute the observation probabilities. (The forward-backward algorithm is usually used for training a Hidden Markov Model, and a good tutorial on this subject is given in Chapter 6 of Rabiner and Juang's book "Fundamentals of Speech Recognition".) The use of the forward-backward algorithm for training a neural network is described in a paper by Yan, Fanty, and Cole at ICASSP-97, "Speech Recognition Using Neural Networks with Forward-Backward Probability Generated Targets".

2.6 Evaluation

2.6.1 Word-Level Evaluation
Once we have trained for, say, 30 iterations, we need to determine which iteration has the best performance on the development set. To do this, we recognize each utterance in the development set using the network weights from each iteration. If the number of words in each utterance is not known beforehand, we need to evaluate the performance at each iteration in terms of substitution errors, insertion errors, and deletion errors. (If the number of words is known beforehand, then we only need to measure substitution errors, but the same method can be used). The overall accuracy of a network iteration is defined to be 100% - (Sub + Ins + Del), where Sub is the percentage of substitution errors, Ins is the percentage of insertion errors, and Del is the percentage of deletion errors. We can also measure the "sentence-level accuracy", which is the number of utterances (or waveforms) recognized correctly divided by the total number of utterances in the development set.
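
The two accuracy measures can be computed as in the Python sketch below. The function names are hypothetical, and one assumption is made that the text does not state explicitly: the error percentages are taken relative to the number of words in the reference transcriptions.

```python
def word_accuracy(n_reference_words, n_sub, n_ins, n_del):
    """Word-level accuracy: 100% - (Sub + Ins + Del).

    Assumption: each error percentage is relative to the number of
    words in the reference transcriptions.
    """
    to_percent = 100.0 / n_reference_words
    return 100.0 - (n_sub + n_ins + n_del) * to_percent


def sentence_accuracy(n_correct_utterances, n_utterances):
    """Percentage of utterances recognized with no errors at all."""
    return 100.0 * n_correct_utterances / n_utterances


# Hypothetical counts for a development set
print(word_accuracy(500, n_sub=10, n_ins=5, n_del=5))
print(sentence_accuracy(90, 100))
```

Note that because insertions are counted as errors, the word-level accuracy defined this way can fall below zero for a very poor recognizer.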

2.6.2 Choosing the Best Iteration

Usually, we choose the network iteration with the best word-level accuracy; if two or more iterations have equal word-level accuracy, we select the one with the greater sentence-level accuracy.
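
This selection rule amounts to a two-key comparison, sketched here in Python. The function best_iteration and the accuracy values are hypothetical; in practice this selection is done by find_best.tcl.

```python
def best_iteration(results):
    """Pick the iteration with the best word-level accuracy.

    results is a list of (iteration, word_accuracy, sentence_accuracy).
    Ties in word accuracy are broken by sentence accuracy; if both are
    equal, the earliest iteration in the list is kept (an arbitrary
    choice made for this sketch).
    """
    return max(results, key=lambda r: (r[1], r[2]))[0]


# Hypothetical development-set results
results = [
    (18, 95.2, 84.0),
    (22, 96.1, 85.5),
    (25, 96.1, 86.0),
    (30, 95.8, 85.9),
]
print(best_iteration(results))
```

Iterations 22 and 25 have the same word-level accuracy in this example, so the higher sentence-level accuracy of iteration 25 decides the choice.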

2.6.3 Testing
Once we have finished developing a recognizer, we evaluate the final performance on the test set, in terms of word-level and sentence-level accuracy. It is important, however, that once evaluation is done on the test set, the recognizer is not further modified based on these test-set results. In order to make sure that such modifications are not done, the test set is usually reserved until just before the recognizer is put into general-purpose use (or just before publishing results in a journal or at a conference).
 

3. Overall Procedure

Given the background described in the previous section, the process of training a recognizer becomes relatively simple. This section gives the "recipe" for this training process.

3.1 Create Descriptions

The first step is to create a description of the recognizer and describe how the data will be selected for training. The files that need to be created are:

corpora file
Create a "corpora" file if one doesn't yet exist. The corpora file contains a master list of each corpus and the location and format of the files in that corpus. The format of this corpora file is given below; there is no automated way of generating this file, but it is easy to modify by hand. The same corpora file should be used for all training tasks.
cull file
Create "cull" files, if necessary. A cull file is a list of files in a corpus that won't be used for training, development, or in-house testing. Usually, 5% of the entire corpus is put into this cull file. The script cull5.tcl can be used to generate a cull file for a particular corpus.
info files
Create "info" files for training, development, and testing. These info files must be created by hand; the format is given below in Section 5. An info file contains all of the information that is necessary to find samples for training, development, or testing. This info file includes the partition (train, develop, test), how to select the data for the required partition, the basename of the recognizer, the minimum number of samples requested for each category, and corpus-dependent information. One info file is required for each of the tasks of training, re-training using forced alignment, forward-backward training, development, and testing.
vocab file
Create a "vocab" file with the vocabulary, pronunciations, and grammar for the task. This must also be created by hand, and the format is given below.
parts file
Create a "parts" file, which specifies how many parts to split each phone into, and what context groupings to use. Once again, this must be created by hand, and the format is given in Section 5.

3.2 Find Data

Given the files created above, the scripts to use in order to find data files for training are:

find_files.tcl
Use "find_files.tcl" to find files for training, development, and testing. This script must be called once for each set of files. At this stage, any filters are applied and the corpus is searched for files that are appropriate for the given partition (such as training or testing).
categories.tcl
Use "categories.tcl" to generate categories to train on. This script uses the info, vocab, and parts files to create a "desc" file. A desc file contains a description of the recognizer for use by other training and recognition scripts. An "olddesc" file is also created. This file also contains a description, but in an older format. (The "olddesc" file will, in future versions of the training process, be obsolete. When this time comes, the olddesc file will no longer be generated. For now, however, the "desc" file is used in some training scripts, and the "olddesc" file is used in recognition scripts.)
gen_catfiles.tcl
Use "gen_catfiles.tcl" to create time-aligned categories from text transcriptions or from phonetic time-aligned transcriptions. These categories are written to either
(a) a "master label file", which contains all categories for training in one file, or
(b) separate files with the extension ".cat", which are put in sub-directories that mirror the directory structure of the corpus (or corpora) being used.
revise_desc.tcl
Use "revise_desc.tcl" to make sure that all categories have enough samples for training. If some categories don't have enough samples, then either the vocab and parts files need to be modified (and the entire process repeated), or these sparse categories need to be "tied" to categories with more data. This script is also used to set minimum and maximum duration limits for each category, based on the categories in the training data. The use of these duration limits is optional, as the CSLU Toolkit does have default limits for every phone; however, performance usually improves by using the durations from the categories that are trained on. This script will revise the contents of the desc and olddesc files.
hscript.exe
Use "hscript" to create other files that will be used in training and recognition. These files have the extensions ".rr", ".list", and ".0". The ".rr" file contains a binary description of the recognizer, with information such as the list of categories being recognized. The ".list" file contains an ASCII list of all the categories. The ".0" file is an initial HMM model, used later in the forward-backward training stage.

3.3 Select Data for Training

Once the files have been selected, the category files have been created, and the desc file is correct, then we can use the following scripts and programs to select frames for training:

pickframes.tcl
Use "pickframes.tcl" to select samples to train on. The output of this script is a "pick" file, which is used directly by genvec.tcl.
genvec.tcl
Use "genvec.tcl" to create features for each frame to be trained on.
checkvec.exe
Use "checkvec" to make sure that the data in the vector file is valid.

3.4 Train and Evaluate

nntrain.exe
Use "nntrain" to train the network on the vector file.
find_best.tcl
Use "find_best.tcl" to find the best iteration of the network using the set of development files.
browse.tcl
Use "browse.tcl" to evaluate errors. The errors that are made may give clues about necessary revisions to the recognizer. Repeat steps in the development process, as necessary.

3.5 Re-Train

Create force-aligned data using the best iteration of the network that was just trained. To do this, create an info file for forced alignment that specifies a new directory in which to put the category files and a forced-alignment script to use. Then use "find_files.tcl" and "gen_catfiles.tcl" to generate the force-aligned labels.

Repeat Sections 3.3 and 3.4 to create a network trained on this force-aligned data.

Given a network trained on force-aligned data, create a third network using the forward-backward method. To construct such a "forward-backward network", do the following:

  1. Create a vector file using hmm_embed.tcl
  2. Check this vector file for errors using hnncheckvec.exe
  3. Train on this vector file using hnntrain.exe
  4. Find the best iteration using find_best.tcl

Repeat this cycle to create another forward-backward network. This final network (the result of the second cycle of forward-backward training) should have the best performance of all the networks, although sometimes the first forward-backward network has better performance.

3.6 Evaluate Test Set

Use "find_best.tcl" to evaluate the best network's performance on the test set. These are the final results that are acceptable for publication.
 
 

4. Complete Example

To illustrate the procedure described above, the example of training a continuous-speech digits recognizer is given in this section. All commands should be entered using a DOS or Unix command window. (In Windows, click on Start, then Programs, then Command Prompt to bring up a DOS window. In Windows 95, all Tcl commands should be prefaced with "tclsh80" in order to properly invoke the Tcl interpreter.) Text given in bold indicates commands that are typed from a command window; text in courier font indicates the output from this command. In DOS, all commands must be entered on one line; if a backslash is used in the examples below to continue the command on another line, this must be typed as one line with no backslash when using DOS. The parameters for each script and program are explained in Section 6. The data files that are used in this example are located in a zip file available for downloading (make sure that you preserve the directory structure of the files in the zip file).

[Step 1] In this initialization step, set up the directory structure that you will use. We recommend that you create one directory for each "project", where a project contains all of the files created during the training of a network. For this example, we will be using a project directory called \tutorial\digit. Note that some files (vector files in particular) may take up a large amount of disk space; you may want to delete these files after you are finished using them. Now is a good time to make sure that your path contains the location of the training scripts as well as the stand-alone C programs used for training. To check this, if you type "categories.tcl" in your project directory, you should get the following:

and if you type "checkvec" in your project directory, you should get the following:

If you don't get these responses, contact the person who installed the Toolkit to find the location of the "script\training_1.0" directory and the "bin" directory within the Toolkit directory hierarchy.

It may also be helpful to copy the recog.tcl script from the Toolkit's "script\training_1.0" directory into a convenient directory (such as \tutorial\src), as you will need to specify the path to this file several times. Finally, you can copy the files corpora, digit.train.info, digit.dev.info, digit.test.info, digit.vocab, digit.parts, and remap_tutorial.tcl from the links provided here into your project directory (\tutorial\digit). This will save you some typing in the following Steps 2 through 6. Note that for Windows users, downloading these files will often result in different filenames, where the first "." has been replaced by "_" or the 4th letter of the extension has been removed; we recommend that you rename these files back to their original names. For this example, these files will only require minor modifications (for the path information), but Section 5 describes the format of these files so that you can change them or create them from scratch later on, in order to train on another task or train using different parameters.
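The renaming described above can be scripted. The following is a minimal Python sketch, not part of the Toolkit, that handles only the case where the first "." was replaced by "_". The helper names (mangled_name, plan_renames, restore_names) are hypothetical, and the truncated-extension case is left to manual renaming.

```python
import os

# File names the tutorial expects in the project directory.
EXPECTED = [
    "corpora",
    "digit.train.info",
    "digit.dev.info",
    "digit.test.info",
    "digit.vocab",
    "digit.parts",
    "remap_tutorial.tcl",
]


def mangled_name(name):
    """Return the name with its first "." replaced by "_"."""
    return name.replace(".", "_", 1)


def plan_renames(present, expected=EXPECTED):
    """Map mangled names found in `present` back to their expected names."""
    present = set(present)
    renames = {}
    for name in expected:
        candidate = mangled_name(name)
        # Only plan a rename when the correct name is absent and the
        # mangled variant is actually there.
        if name not in present and candidate != name and candidate in present:
            renames[candidate] = name
    return renames


def restore_names(directory):
    """Apply the planned renames inside `directory` (hypothetical helper)."""
    for old, new in plan_renames(os.listdir(directory)).items():
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```

Calling plan_renames on a directory listing first shows what would change before restore_names touches any file.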

[Step 2] Create a corpora file, called "corpora" (no extension). For this tutorial, the corpora file might look like this (assuming that the tutorial data is stored in \tutorial\data):

It is probably also a good idea to make sure that your filenames have the same format as specified in the "corpora" file; the format is case sensitive, so NU-78.zipcode.wav is different from nu-78.zipcode.wav.
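To see why case matters, here is a small Python sketch of case-sensitive filename matching. The pattern below is hypothetical and chosen only to fit the example name NU-78.zipcode.wav; the real pattern is whatever your corpora file specifies, and Tcl regular-expression syntax may differ in detail from Python's.

```python
import re

# Hypothetical pattern for base names such as "NU-78.zipcode".
# The parenthesized group captures the call number.
PATTERN = re.compile(r"^NU-([0-9]+)\.[a-z]+$")


def call_number(filename):
    """Return the call number if the base name matches, else None."""
    base = filename.rsplit(".", 1)[0]  # drop the final extension, e.g. ".wav"
    match = PATTERN.match(base)
    return match.group(1) if match else None
```

With this pattern, call_number("NU-78.zipcode.wav") yields "78", while call_number("nu-78.zipcode.wav") yields None because the lowercase prefix does not match.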

[Step 3] To start with, we run cull5.tcl to remove 5% of all available data for third-party evaluation. This creates a cull file called numbers.cull5. This cull file simply contains a list of waveform files that won't be used for training, development, or testing.

[Step 4] Create info files for training, development, and testing. They will be called digit.train.info, digit.dev.info, and digit.test.info. We will only request 200 samples per category, so that this tutorial doesn't take more time than necessary. If one were constructing a real-life network, it would be better to use all available samples. To specify all samples, use the keyword ALL instead of 200 in the "want:" field in digit.train.info.


For the digit.train.info file, we are specifying that we want training data from the numbers corpus, and we will put time-aligned category labels in the \tutorial\digit\numbers_train directory (specified in the "partition:", "name:", and "cat_path:" fields). We require the presence of waveform, phonetically-labeled, and text transcription files in order to do this (specified in the "require:" field), and we'll use 3/5 of available files (specified in the "partition:" field). We won't skip over any files (specified in the "filter:" field), but we will require that all of the vocabulary words in the text file are words we want to recognize (specified in the "vocab:" field). We will remap the hand-labeled phonetic files (which can have a high degree of variability in the phonemes used to represent a word) to a consistent set of phonemes using the remap_tutorial.tcl script (specified in the "remap:" field). For more information about the meaning of the various fields, see the description of the info file format.

 

[Step 5] Create a vocab file, called digit.vocab. This file contains the words, their pronunciations, and the grammar to be used during recognition.

[Step 6] Create a parts file, called digit.parts. This contains the number of parts that each phoneme will be split into, the groupings of phones into clusters of similar phones, and mappings from one phone to another.

[Step 7] Run find_files.tcl in order to find files suitable for training. The input files (besides digit.train.info and corpora) are \tutorial\digit\numbers.cull5 and digit.vocab. The output file is digit.train.numbers.files; this filename is constructed from the basename, the partition, and the corpus. The reason that the user doesn't specify the output filename on the command line is that it is possible, when using several corpora, to create several output files; it seems easier to have the filenames automatically determined than to have the user specify one filename for each corpus.
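The naming rule for the output file can be written out as a short Python sketch. The function name files_list_name is hypothetical; the rule itself (basename, partition, corpus, then the "files" extension) is the one described above.

```python
def files_list_name(basename, partition, corpus):
    """Build the name of the file list produced for one corpus."""
    return ".".join([basename, partition, corpus, "files"])
```

For this tutorial, files_list_name("digit", "train", "numbers") gives digit.train.numbers.files, and the development partition gives digit.dev.numbers.files.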

Then, run find_files.tcl a second and third time to find files suitable for development and testing:

[Step 8] Run categories.tcl to determine the context-dependent categories that will be classified by the recognizer. The input files are the vocab, parts, and info files. The output files are the desc and olddesc files; these files contain not only the list of the context-dependent categories, but also some other information about the recognizer that we will be creating.

[Step 9] Run gen_catfiles.tcl to take the list of files for training (digit.train.numbers.files) and create time-aligned labels of categories to train on. The input file (other than digit.train.info and corpora) is digit.train.numbers.files. If specified in digit.train.info, the script in the "remap:" field will be used, or the script in the "force_cat:" or "force_phn:" fields will be used (in this case, we haven't specified the "force_cat:" or "force_phn:" fields because we are not yet doing forced alignment). The category label files that are created are stored in the directory that is specified in digit.train.info in the "cat_path:" field. (For the February 1999 release of the Toolkit, there is also the option to store the category labels in one file, called a "master label file"; this master label file is specified instead of the "cat_path:" field, after the partition information in the .info file.) The gen_catfiles.tcl script also creates two other output files: the "dur" file and the "counts" file. The dur file contains minimum and maximum duration limits for each category, as determined from the category label files; the counts file lists the number of occurrences (and total time in msec) of each category.

This script may generate messages such as

These are simply messages to the user that some labels are being merged or deleted when converting from hand labels to categories. These messages come from the remapping script, in this case remap_tutorial.tcl. No action needs to be taken by the user. At the end, for each category, the duration that is at the bottom 2nd percentile of all durations for that category is written to the dur file as the minimum duration, and the longest duration of the category is written to the dur file as the maximum duration. These limits help the Viterbi search refrain from inserting very short words during recognition.
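The duration limits described above can be illustrated with a short Python sketch. It assumes a nearest-rank definition of the percentile, which may differ from the rule gen_catfiles.tcl actually uses, and the function name duration_limits is hypothetical.

```python
import math


def duration_limits(durations_ms, percentile=2.0):
    """Return (minimum, maximum) duration limits for one category.

    The minimum is the nearest-rank `percentile` of the durations
    (an assumed definition); the maximum is the longest duration.
    """
    ordered = sorted(durations_ms)
    if not ordered:
        raise ValueError("no durations for this category")
    # Nearest-rank percentile: smallest rank covering `percentile` percent.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1], ordered[-1]
```

Using a low percentile instead of the absolute shortest duration keeps a few unusually short labels from setting the minimum too low.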

[Step 10] Run revise_desc.tcl to make sure that we have enough samples of each category to train on, and to add duration limits to the desc and olddesc files. If there are not enough samples of a category, this script allows us to tie these categories to categories with more samples. This is the only interactive script in the entire training and recognition process. The input files are the counts file, the dur file, the desc file, and the olddesc file. The outputs of this script are modifications to the desc and olddesc files to include category tying information and duration limits information.

[Step 11] Run hscript.exe to create digit.rr, digit.list, and digit.0 from the revised digit.train.desc file. The digit.rr file contains a binary description of the recognizer; the digit.list file contains an ASCII list of the categories; and the digit.0 file contains an initial HMM description that is used in some of the cslush function calls. (These cslush function calls are designed to work with both neural networks and HMMs, and an HMM description is required even when using or training neural networks.)

[Step 12] Run pickframes.tcl to select frames for training, from the files created by gen_catfiles. The input files (other than digit.train.info and corpora) are digit.rr, digit.list, digit.0, and digit.train.numbers.files.  (For the February 1999 release of the Toolkit, if a master label file is generated, then this is also an input file to pickframes.tcl.) The output of this script is the file digit.train.pick, which contains a binary list of files, the frames to be used in each file, and the categories corresponding to these frames.

[Step 13] Run genvec.tcl to compute features for all of the frames given in digit.train.pick. The input files are digit.train.info, digit.train.olddesc, and digit.train.pick. The features that are computed, and the target category values, are stored in the binary output file digit.train.vec. Note that if you want to use features that are different from the standard 130 features, you can write the code used to create the new features; the location of your code can be specified in the olddesc file. Also, the description of the format of the vector file given in Section 5 may be of interest.

[Step 14] Run checkvec.exe to make sure that the vector file that we created has the correct format, and that every category has at least one sample to train on. The numbers on the left are the numbers corresponding to each category, and the numbers on the right are the number of samples for that category. The input file is digit.train.vec; the only output goes to the screen for the user to check.
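If the screen output is long, the same check can be done programmatically. The Python sketch below assumes the (category number, sample count) pairs have already been read from the checkvec.exe output into a list; it does not parse the program's actual output format, and the function name categories_below is hypothetical.

```python
def categories_below(pairs, minimum=1):
    """Return category numbers whose sample count is below `minimum`.

    `pairs` is a list of (category_number, sample_count) tuples.
    """
    return [category for category, count in pairs if count < minimum]
```

With the default minimum of 1, the result lists the categories that have no training samples at all; a larger minimum flags sparsely populated categories as well.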

[Step 15] In order to train the neural network, one of two methods can be used. The first is the older method, which is an executable program that generates weights in floating-point format. This older method is the only method available for releases of the Toolkit before February 1999. The second, newer method is a Tcl script that generates weights in double-precision format. The advantage of using the program instead of the script is that the program can be scheduled to run even when the user is logged out (in Windows NT), whereas the user must remain logged in to train using the script. Also, the results from the executable program can be somewhat better than the results from the newer method. The recognition scripts will work with files generated using either method. As a result, for now we recommend using the older method with the executable program nntrain.exe.

For the executable program method: Run nntrain.exe to train the neural network on the vector file digit.train.vec. This program creates a weights file at each iteration; we will select the best weights file after training for 30 iterations. The -l option indicates that the negative penalty will be adjusted to compensate for varying numbers of samples per category; -sn 88 and -sv 88 are random-number seeds; -a 3 130 200 161 specifies the architecture of the net: 3 layers, with 130 nodes in the first layer, 200 nodes in the hidden layer, and 161 nodes in the output layer. The value 30 specifies training for 30 iterations, and the last parameter is the vector file to use for training.

For the Tcl script method: Run train_nnet.tcl to train the neural network on the vector file digit.train.vec. This script creates a weights file at each iteration; we will select the best weights file after training for 30 iterations. The first argument is the vector file, the second argument (in quotes) gives the size of each layer, the next argument is the number of training iterations, and the final argument is the name of the output log file.

Notes: For specifying the architecture, note that the number of nodes in the first layer will always be 130 for the standard feature set. The number of hidden nodes is decided by the user, but 200 is a reasonable number. The number of output nodes to use is written in a comment in the digit.desc file after running revise_desc.tcl (this is the same as the number of categories that are not tied, and excludes the <.garbage> category). Also, the output of checkvec.exe indicates the number of output nodes to use in the third layer of the network, since the last pair of numbers from checkvec.exe gives the final output number (161 for this example) and the number of samples of that last output. The number 161 used in this example may change, depending on the number of states that have been tied and the information in the .vocab and .parts files.
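As a sanity check on an architecture specification such as "3 130 200 161", the Python sketch below verifies that the layer count matches the number of sizes given and counts the connections between layers. It assumes the layers are fully connected and ignores any bias terms, since the internal layout used by nntrain.exe is not described here; both function names are hypothetical.

```python
def parse_architecture(spec):
    """Parse a spec like "3 130 200 161" into a list of layer sizes."""
    values = [int(v) for v in spec.split()]
    num_layers, sizes = values[0], values[1:]
    if num_layers != len(sizes):
        raise ValueError("layer count does not match number of sizes")
    return sizes


def connection_count(sizes):
    """Count connections between adjacent layers (fully connected, no biases)."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))
```

For the tutorial's network this gives 130 x 200 + 200 x 161 = 58200 connections, which is a rough indication of how training time grows with the number of hidden nodes.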

The only input file to either nntrain.exe or train_nnet.tcl is the vector file; the output files are the neural-network weights files for each iteration (the default names are nnet.X, where X is an integer from 0 to the number of iterations).

 

[Step 16] Run find_best.tcl to evaluate the performance of each iteration (weight file) on the development-set data. This script may take a long time, especially if there are many files in the development set. The invocation below assumes that the script for doing recognition is located in \tutorial\src\recog.tcl. The input files are the neural-network files created by nntrain, the digit.dev.numbers.files file, the vocab file, and the olddesc file, as well as digit.rr. The output files are ali files and a summary file.

Note that training on only 200 samples per category has a large influence on results; when I trained using all available samples instead of 200 per category (using the keyword "ALL" instead of 200 in digit.train.info), results were more than 30% better. The drawback to training on all samples is that nntrain.exe or train_nnet.tcl takes longer. If you have time and want better results, it is beneficial to use as much data as possible.

[Step 17] At this point, it may be helpful to see what kinds of errors are being made. We can browse through the development set, looking at the waveform, spectrogram, and word results for each error. The script browse.tcl will go through the alignment file of the best iteration and find errors. It will then perform recognition and create a wrd file and cat file of the result. At the question-mark prompt, you can type "-e" to find the next error, type <return> to go to the next file in the list of files, or type "q" to quit the program. (There are other options, but these are the most commonly used.)

In a separate DOS or unix window, while browse.tcl is still running, start the program SpeechView in order to display the waveform, spectrogram, and results of recognition.

Here, speechview is told to update the contents of the display whenever the contents of one of the files changes. It will create a waveform display of temp.wav, a spectrogram display of that waveform, a label display of the hand-labeled data (if it exists), a label display of the categories that were recognized, and a label display of the words that were recognized. (The speechview program is part of the CSLU Toolkit.)

Now, in the window with the browse.tcl script, you can search for errors in recognition, and the results should be displayed automatically in the SpeechView program.
 

[Step 18] Now we have finished the first cycle of training. If we are happy with the level of performance on the development set, we can stop the training process and evaluate on the test set (Step 26). If we do skip to the evaluation step, then it is not permitted to re-train if we are unhappy with the test-set results. If we want to try to improve performance on the development set, we can do another cycle of training using force-aligned data. We can create another info file for doing forced alignment, using the training file as a template. This new file will be called digit.trainfa.info:

(Note that in the "force_cat:" field, the script and associated parameters are specified on two lines. No special marker (such as a backslash) is required.)

Note that we have changed the partition name (to "trainfa") and the path for category files (to "numbers_trainfa"). Also, by specifying "require: wt", we will now require the existence of .wav files and .txt files but not .phn files (because we will create labels from the text transcriptions using forced alignment). We also add a new field to the corpus description, indicating that we want to do forced alignment and create labels at the category level (as opposed to the phone or word level). Also note that we will do forced alignment using iteration 28 from the training we just finished, since iteration 28 had the best word-level performance. Because we are doing forced alignment, it is no longer necessary to use the remapping script that re-maps labels created by hand to the set of labels used by our recognizer.

[Step 19] Now we once again find the files we want to use for training by running find_files.tcl, and then we generate category-level time-aligned labels by running gen_catfiles.tcl. As part of the process of creating category-level label files, we also automatically create new dur and counts files. Finally, we update the duration limits in the desc and olddesc files with the new information in the dur and counts files.

[Step 20] Then, we repeat the training steps to train and select the best force-aligned network:

Note that the weights files have the basename "fa". The results from find_best are:

[Step 21] To create a network using the forward-backward method, we create a vector file using hmm_embed.tcl. First, however, we will create a new info file, specifying that we will deal with forward-backward training (partition name "trainfb"), that we want to use the labels generated from forced alignment, and that we want to use all available samples for training.

[Step 22] Then we check the file for errors using hnncheckvec.exe. This time, the header size of the vector file is 8 bytes, and the vector size is 544 bytes (130 features x 4 bytes + 3 targets x 4 bytes + 3 target values x 4 bytes):
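The size arithmetic above can be reproduced in a few lines of Python. This is only a sketch: the 4-byte value size, the 8-byte header, and the counts of 130 features, 3 targets, and 3 target values come from the text, while the idea that the file is a header followed by whole vectors is an assumption, and the function names are hypothetical.

```python
BYTES_PER_VALUE = 4
HEADER_BYTES = 8


def vector_size_bytes(num_features=130, num_targets=3):
    """Size of one vector: features, target indices, and target values."""
    return (num_features + num_targets + num_targets) * BYTES_PER_VALUE


def vector_count(file_size_bytes, vector_bytes=None):
    """Number of vectors in a file, assuming header plus whole vectors."""
    if vector_bytes is None:
        vector_bytes = vector_size_bytes()
    payload = file_size_bytes - HEADER_BYTES
    if payload < 0 or payload % vector_bytes != 0:
        raise ValueError("file size is not a header plus whole vectors")
    return payload // vector_bytes
```

Here vector_size_bytes() evaluates to (130 + 3 + 3) x 4 = 544, matching the value reported for this step.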

[Step 23] Next, we train on this vector file using hnntrain.exe:

[Step 24] Finally, we select the best iteration using find_best.tcl:

Although the word-level accuracy is the same as with force-aligned training, the sentence-level accuracy is 7% better.

[Step 25] We then repeat the cycle of forward-backward training one more time:

In this case, the two forward-backward networks have the same performance, so further training will probably not yield better results. Usually, two cycles of forward-backward training are enough.

[Step 26] The resulting network is the final network. The last step is to evaluate this network on the test set.


 
 

5. File Formats

In the following file formats, text in fixed-font bold is a keyword that must be used verbatim. Italicized items in brackets <> must be substituted with the proper values.
 

wav file
A wav file contains the speech waveform that is to be trained on or recognized. The format for wav files in the CSLU Toolkit is (unfortunately, sometimes) not the Microsoft .wav format; it is the NIST Sphere ulaw format. This format is described at http://vision1.cs.umr.edu/~johns/links/music/audiofile2.html and there is software available on the WWW for converting waveform files into different formats.
 

txt file
A txt file contains a text transcription of the words in a speech waveform. This file is simply an ASCII file containing the words separated by spaces, and it can be created by any text editor that outputs ordinary .txt files.
 

label files (.phn, .cat, .wrd)
Label files, which usually have the extension .phn, .cat, or .wrd, contain time-aligned labels of a waveform utterance. If the file has the extension .phn, then the labels are phonetic labels; if the file has the .cat extension, then the labels are neural-network output categories (context-dependent sub-phone units); and if the file has the extension .wrd, then the labels are words. A label file has the following format:

where:

<value>
is the number of milliseconds in one frame of speech (usually this value is 1.0).
<begin_time>
is the time at which <label> starts
<end_time>
is the time at which <label> ends
<label>
is the word, phone, or category label for the segment of speech

The values for <begin_time> and <end_time> are measured in frames (so if <value> is 1.0, then time is measured in milliseconds; if <value> is 10.0, then time is measured in centiseconds). The <end_time> of one label is usually the same as the <begin_time> of the next label.
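The relationship between frame counts and real time can be sketched in Python as follows. The function name to_milliseconds is hypothetical; the conversion itself follows from the definition of <value> as the number of milliseconds per frame.

```python
def to_milliseconds(time_in_frames, ms_per_frame):
    """Convert a label time from frames to milliseconds."""
    return time_in_frames * ms_per_frame
```

For example, a <begin_time> of 250 with a <value> of 1.0 and a <begin_time> of 25 with a <value> of 10.0 both correspond to 250 milliseconds.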
 

corpora file

The corpora file contains descriptions of all corpora:

where a <corpus description> has the following format:

where:

<corpus_name>
is a name used to describe the corpus. The format for <corpus_name> is the same as for any Tcl variable name.
<path_to_wav_files>
is the full path to the directory containing waveform files. It is assumed that in this directory will be sub-directories, and that the actual files will be in these sub-directories.
<path_to_phn_files>
is the full path to the directory containing time-aligned phonetic label files. It is assumed that in this directory will be sub-directories, and that the actual files will be in these sub-directories.
<path_to_txt_files>
is the full path to the directory containing text transcription files. It is assumed that in this directory will be sub-directories, and that the actual files will be in these sub-directories.
<regular_expression_for_parsing_filenames>
is a regular expression, enclosed in curly braces {}, that will succeed when used to parse the base name of a file that belongs in the corpus. It can also be used to extract the call number from the filename, for use in determining the caller ID.
<extension_for_wav_files>
is the filename extension for waveform files. Usually, the value is "wav".
<extension_for_phn_files>
is the filename extension for time-aligned phonetic label files. Usually, the value is "phn".
<extension_for_txt_files>
is the filename