John-Paul Hosom, Ron Cole, Mark Fanty,
Johan Schalkwyk, Yonghong Yan, Wei Wei
Center for Spoken Language Understanding (CSLU)
Oregon Graduate Institute of Science and Technology
February 2, 1999
1. Introduction
This tutorial describes the method used at CSLU for creating neural-network-based speech recognizers. Included in this tutorial are some general concepts behind training a recognizer, step-by-step instructions on how to train a recognizer, and a description of Tcl scripts that can be used to automate parts of this process. This process has been further automated by Ben Serridge at the Tlatoa speech group (http://info.pue.udlap.mx/~sistemas/tlatoa/), and their simpler training process (still using the CSLU Toolkit) is available at http://info.pue.udlap.mx/~sistemas/tlatoa/howto/train_nnet.html. In addition, they have provided a hand-labeling guide and additional documentation at http://info.pue.udlap.mx/~sistemas/tlatoa/howto/hand_label.html and http://info.pue.udlap.mx/~sistemas/tlatoa/techdoc/techdoc.html.
In order to use the scripts mentioned in this tutorial, you must have the CSLU Toolkit installed on your machine. Make sure that your path includes the location of the Toolkit's stand-alone executable files (usually located in the "bin" directory, for example C:\CSLU\Toolkit\2.0\bin), the scripts used for training (usually located in the "script\training_1.0" directory, for example C:\CSLU\Toolkit\2.0\script\training_1.0), and the location of the Toolkit's "shlib" directory (usually C:\CSLU\Toolkit\2.0\shlib). In order to follow the example provided in this tutorial, you may want to use the same data files. These files have been put into a zip file containing all waveform and transcription files. The size of this compressed "zip" file is 7.6MB, and the size of all the data files is about 10MB. The CSLU Toolkit and corpora are free of charge for non-profit use (universities, high schools, and individuals may download the Toolkit at no charge). For more information on the Toolkit and CSLU corpora, visit our WWW site at http://speech.bme.ogi.edu/.
In this document, phonetic symbols are represented using Worldbet, which is an ASCII encoding of the International Phonetic Alphabet (IPA) [J. Hieronymus, 1995].
The word phone is used extensively in this tutorial; according to Webster's Dictionary, a phone is "a speech sound considered as a physical event without regard to its place in the sound system of a language." So, the word phone is used here to refer to the phonetic events that we want to classify, whether or not they correspond to phonemes in the language.
Please send all questions, comments, and bug reports to hosom at cslu dot ogi dot edu.
The general steps to creating a neural-network based recognizer are:
Frame-based speech recognition has the following steps, illustrated in Figure 1:
Figure 1. Overview of Frame-Based Speech Recognition using Neural Networks.
For a more detailed explanation, see the on-line
recognition tutorial.
In order to determine the categories that the network will classify, the following three things need to be done:
Figure 2 shows an illustration of this kind of context-dependent modeling. In this figure, an example is given for the modeling of the word "yes", written in Worldbet as /j E s/. Here, the /j/ is split into two parts, the /E/ is split into three parts, and the /s/ is split into two parts. There are eight groups of phones used for contexts; each group represents a broad category of sounds. For the vowel /E/ in a general-purpose recognizer, there are eight categories for the left third, one category for the middle third, and eight categories for the right third, yielding a total of 17 categories for this 3-part phone. The /j/, on the other hand, would have 16 categories in a general-purpose recognizer, because it is split into only left and right halves.
Figure 2. Context-Dependent Modeling
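The category counts quoted above follow from simple arithmetic. The following is a small sketch of that arithmetic only; the function name is illustrative and is not part of the Toolkit. It assumes that each context-dependent part contributes one category per context group and that the middle part of a three-part phone contributes exactly one category.

```python
def count_categories(num_parts, num_context_groups):
    """Count the categories for one phone under the scheme described above."""
    if num_parts == 1:
        return 1                               # context-independent phone
    if num_parts == 2:
        return 2 * num_context_groups          # left half + right half
    if num_parts == 3:
        return 2 * num_context_groups + 1      # left third + middle + right third
    raise ValueError("this sketch only handles 1, 2, or 3 parts")


print(count_categories(3, 8))  # /E/ with eight context groups: 17
print(count_categories(2, 8))  # /j/ with eight context groups: 16
```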
The context-dependent phonetic categories that the network will be trained on can be determined from the phonetic-level pronunciation models, the groupings of phones into clusters of similar phones, and the number of parts to split each phoneme into.
To give an example of how these three items can be determined, we'll use the example of recognizing the isolated words "three", "tea", "zero", and "five".
First, we can come up with some initial pronunciations:
| word | pronunciation |
| three | T 9r i: |
| tea | tc th i: |
| zero | z i: 9r oU |
| five | f aI v |
We may want to modify these pronunciations, because the /i:/ in "zero" is often pronounced differently from the /i:/ in "three" and "tea". To account for this difference in pronunciation, we can use our own symbol, /i:_x/, to represent the front vowel in "zero". Making this change gives us the following pronunciation models:
| word | pronunciation |
| three | T 9r i: |
| tea | tc th i: |
| zero | z i:_x 9r oU |
| five | f aI v |
Next, we will determine the number of parts to use for each phone. In the table below, "1" means that the phone will be context-independent, "2" means that the phone will be split into two parts, "3" means that the phone will be split into three parts, and "r" means that the phone will be "right-dependent":
| phone | parts |
| T | 1 |
| 9r | 2 |
| i: | 3 |
| tc | 1 |
| th | r |
| z | 1 |
| i:_x | 3 |
| oU | 3 |
| f | 1 |
| aI | 3 |
| v | 1 |
Now, let's look at the /i:/ in "three" and "tea". In this case, the vowel /i:/ is the same, but it looks very different when it follows a /9r/ compared to when it follows a /th/ (see Figure 3).
Figure 3. Example of vowel /i:/ in different contexts.
In this case, we make the left third of the /i:/ (since it is split into three parts) dependent on a preceding retroflex (/9r/) in one case and dependent on a preceding alveolar sound (/th/ or /z/) in the other case. We usually group the phones in a left or right context according to their broad phonetic category; for example, the following groupings can be used (the dollar sign indicates a variable that represents the group of listed phones):
| group | phones in group | description |
| $bck | oU | back vowels |
| $fnt | i: i:_x | front vowels |
| $ret | 9r | retroflex sounds |
| $den | T v th z | dentals, labiodentals, and alveolars |
| $sil | .pau tc | silence or closure |
But notice that it then becomes difficult to classify diphthongs such as /aI/, because the phone starts as a back vowel and ends as a front vowel. The current solution is to modify the categories in the following way:
| group | phones in group | description |
| $bck_l | oU | back vowels to the left of a phone |
| $bck_r | oU aI | back vowels to the right of a phone |
| $fnt_l | i: i:_x aI | front vowels to the left of a phone |
| $fnt_r | i: i:_x | front vowels to the right of a phone |
| $ret | 9r | retroflex sounds |
| $den | T v th z | dentals, labiodentals, and alveolars |
| $sil | .pau tc | silence or closure |
First, we have added "_l" and "_r" to the variable names in question,
to indicate whether the phones in this grouping occur on the left or right side of the
phone being classified. Then, because /aI/ looks like a back vowel when it appears to the
right of a phone, it has been put in the grouping $bck_r; because /aI/ looks like a front
vowel when it appears to the left of a phone, /aI/ has also been put in the grouping
$fnt_l. This method of grouping into left or right contexts is illustrated in Figure 4:
Figure 4. Illustration of labeling a diphthong in the word "five".
The format for specifying different categories is [left_context]<phone>[right_context], so for example the category for /.pau/ will be <.pau>, the category for the left third of /i:/ in the context of dental sounds will be $den<i:, the middle third of /i:/ will be <i:>, and the right third of /i:/ in the context of silence will be i:>$sil.
Given all this information, it can easily (if tediously) be determined that the 28
categories we need to train on are:
<.pau>
$den<9r
$fnt_l<9r
9r>$bck_r
9r>$fnt_r
<T>
f<aI
<aI>
aI>$den
<f>
$den<i:
$ret<i:
<i:>
i:>$den
i:>$sil
i:>f
$den<i:_x
<i:_x>
i:_x>$ret
$ret<oU
<oU>
oU>$den
oU>$sil
oU>f
<tc>
th>$fnt_r
<v>
<z>
In the following sections, a Tcl script called "categories.tcl" is described; this
script can be used to automate the process of determining categories.
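To make the derivation of the category list concrete, here is a minimal Python sketch that reproduces the 28 categories from the pronunciations, part counts, and left/right groupings given above. It is not the algorithm used by categories.tcl, whose internals are not shown here. The sketch rests on two assumptions: a word-final phone may be followed by a pause or by the first phone of any word (and a word-initial phone may be preceded by a pause or by the last phone of any word), and /i:_x/ is split into three parts, as the <i:_x> entry in the list implies. All function and variable names are illustrative.

```python
WORDS = {
    "three": ["T", "9r", "i:"],
    "tea":   ["tc", "th", "i:"],
    "zero":  ["z", "i:_x", "9r", "oU"],
    "five":  ["f", "aI", "v"],
}
# "1" = context-independent, "2"/"3" = number of parts, "r" = right-dependent.
# Assumption: /i:_x/ has three parts, as the <i:_x> entry in the list implies.
PARTS = {".pau": "1", "T": "1", "9r": "2", "i:": "3", "tc": "1", "th": "r",
         "z": "1", "i:_x": "3", "oU": "3", "f": "1", "aI": "3", "v": "1"}
SHARED = {"$ret": ["9r"], "$den": ["T", "v", "th", "z"], "$sil": [".pau", "tc"]}
LEFT_GROUPS = {"$bck_l": ["oU"], "$fnt_l": ["i:", "i:_x", "aI"], **SHARED}
RIGHT_GROUPS = {"$bck_r": ["oU", "aI"], "$fnt_r": ["i:", "i:_x"], **SHARED}


def context(phone, groups):
    """Return the group containing the phone, or the phone itself if ungrouped."""
    for name, members in groups.items():
        if phone in members:
            return name
    return phone


def find_categories(words, parts, left_groups, right_groups):
    first_phones = {pron[0] for pron in words.values()}
    last_phones = {pron[-1] for pron in words.values()}
    cats = {"<.pau>"}
    for pron in words.values():
        for i, phone in enumerate(pron):
            # Neighbors inside the word, or any possible neighbor at a word edge.
            lefts = [pron[i - 1]] if i > 0 else sorted(last_phones | {".pau"})
            rights = ([pron[i + 1]] if i < len(pron) - 1
                      else sorted(first_phones | {".pau"}))
            kind = parts[phone]
            if kind == "1":
                cats.add(f"<{phone}>")
                continue
            if kind in ("2", "3"):
                cats.update(f"{context(l, left_groups)}<{phone}" for l in lefts)
            if kind == "3":
                cats.add(f"<{phone}>")
            # Two-part, three-part, and right-dependent phones all get right contexts.
            cats.update(f"{phone}>{context(r, right_groups)}" for r in rights)
    return cats


categories = find_categories(WORDS, PARTS, LEFT_GROUPS, RIGHT_GROUPS)
print(len(categories))  # 28
for cat in sorted(categories):
    print(cat)
```

Note that /f/ belongs to no group in this example, so it serves as its own context (f<aI, i:>f, oU>f).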
2.4.1 Overfitting and Datasets
As we train a network, we keep adjusting the neural network weights to minimize the error
in our training data. For each adjustment of the weights, we have a new iteration (or
epoch) in the training process. We can keep generating new iterations until the error no
longer decreases. At this point, we have learned the training data to the extent that it
is possible.
However, when we train a neural network, we aren't interested in learning the training data. Instead, we are interested in learning the general properties of the training data. By learning the general properties of the data instead of the details that are specific to the training data, we are best able to classify a new utterance not in the training set.
In order to determine which iteration of network weights has best learned the general properties of the data, we use a separate (usually smaller) set of data to evaluate each iteration. This second set of data is called the "development" set (or cross-validation set). Because this development set has not been used to adjust the network weights during training, it can be used to evaluate the network's ability to recognize phonetic categories, as opposed to (possibly irrelevant) details in the training set. The larger this development set is, the more confidence we can have in the general classification properties of the network.
Once we have determined the best network, we need to evaluate its performance on a test set. In order to have an honest evaluation, the data in the test set must not occur in either the training set or the development set.
This means that given a corpus containing our target words, we must divide it into at least three parts: one part for training, one for development, and one for testing. If we have a large enough corpus, we may further divide the development set into subsets, so that as we evaluate and make modifications to our recognizer, we are not tuning performance to one set of development data.
Finally, at CSLU we leave 5% of our target corpus for independent, third-party
evaluation. This 5% is culled from the entire corpus before dividing into training,
development, or test sets.
2.4.2 Filtering
When selecting data for training, development, and testing, we can apply various filters
to reduce the amount of data. In one case, we may have utterances in our corpus that don't
occur in our target vocabulary. In this case, we may want to filter so that words not in
our vocabulary list are not included in our datasets. For example, if we are training a
digits recognizer and we are using the CSLU Numbers corpus for training, we may want to
remove out-of-vocabulary utterances that contain numbers such as "first",
"twelve", and "fifty". In another case, we may have so much data that
training or evaluation would take too long. In this case, we can filter so that we take
every Nth utterance for use in our datasets, where N is some integer greater
than 1. For example, we may want to take every sixth waveform for training our digits
recognizer, because there are over 6000 waveforms available for training on digits.
Filtering in this way will still leave over 1000 waveforms (or approximately 500 examples
of each digit) available for training.
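The two kinds of filtering can be sketched in a few lines of Python. This is an illustration only, not the Toolkit's implementation: the function name, the file names, and the choice to apply the vocabulary filter before taking every Nth utterance are all assumptions.

```python
def filter_utterances(utterances, vocabulary, every_nth=1):
    """Keep in-vocabulary utterances, then take every Nth of those.

    utterances: list of (wave_name, transcription) pairs (hypothetical layout).
    """
    vocab = set(vocabulary)
    in_vocab = [(name, text) for name, text in utterances
                if all(word in vocab for word in text.split())]
    return in_vocab[::every_nth]


DIGITS = ["zero", "oh", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
utterances = [
    ("a.wav", "one two"),
    ("b.wav", "twelve"),       # out of vocabulary
    ("c.wav", "five"),
    ("d.wav", "fifty one"),    # out of vocabulary
    ("e.wav", "nine nine"),
    ("f.wav", "oh"),
]
kept = filter_utterances(utterances, DIGITS, every_nth=2)
print([name for name, _ in kept])  # ['a.wav', 'e.wav']
```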
2.4.3 Finding Categories
Once we know which files we'll be using for training, we need to find samples of each
category that we'll train on. This can be done in one of two ways: using data that has been
hand-labeled at the phonetic level, or using forced alignment.
Hand-Labeled Data
Many corpora at OGI have been labeled with time information at the phonetic level by
professional labelers. If training is to be done on this hand-labeled data, then the
labels must be re-mapped from the phonetic level to the (context-dependent) category
level. For example, a hand-labeled file for the isolated digit "three" might
contain this information:
0 53 .pau
53 113 T
113 170 9r
170 229 i:
229 273 .pau
where the first item is the start time in milliseconds, the second item is the end time in milliseconds, and the third item is the phonetic-level label. In order to train on this data, it needs to be re-mapped into the following set of time-aligned labels:
0 53 <.pau>
53 113 <T>
113 142 $den<9r
142 170 9r>$fnt_r
170 190 $ret<i:
190 209 <i:>
209 229 i:>$sil
229 273 <.pau>
A set of Tcl scripts to automate this process will be described later. Also, some general modifications may be made to the hand-labeled data so that the data is more suited for training; for example, we may want to ignore very short pauses. Again, there are scripts described below that will automate this for us.
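As an illustration of the re-mapping shown above, the following Python sketch converts the phonetic labels for "three" into category labels. It is not the Toolkit's remapping script. It assumes that each phone is divided into equal-length parts with boundaries rounded to the nearest millisecond, which happens to reproduce the times in the example, and it uses only the groups needed for this word. All names are illustrative.

```python
PARTS = {".pau": 1, "T": 1, "9r": 2, "i:": 3}
SHARED = {"$ret": ["9r"], "$den": ["T", "v", "th", "z"], "$sil": [".pau", "tc"]}
LEFT_GROUPS = {"$bck_l": ["oU"], "$fnt_l": ["i:", "i:_x", "aI"], **SHARED}
RIGHT_GROUPS = {"$bck_r": ["oU", "aI"], "$fnt_r": ["i:", "i:_x"], **SHARED}


def context(phone, groups):
    """Return the group containing the phone, or the phone itself if ungrouped."""
    for name, members in groups.items():
        if phone in members:
            return name
    return phone


def remap(labels):
    """labels: list of (start_ms, end_ms, phone). Returns category-level labels."""
    out = []
    for i, (start, end, phone) in enumerate(labels):
        n = PARTS[phone]
        if n == 1:
            out.append((start, end, f"<{phone}>"))
            continue
        # Treat the edges of the file as silence (an assumption of this sketch).
        prev_phone = labels[i - 1][2] if i > 0 else ".pau"
        next_phone = labels[i + 1][2] if i < len(labels) - 1 else ".pau"
        left = f"{context(prev_phone, LEFT_GROUPS)}<{phone}"
        right = f"{phone}>{context(next_phone, RIGHT_GROUPS)}"
        names = [left, right] if n == 2 else [left, f"<{phone}>", right]
        # Equal-length parts, boundaries rounded to the nearest millisecond.
        bounds = [int(start + k * (end - start) / n + 0.5) for k in range(n + 1)]
        for k, name in enumerate(names):
            out.append((bounds[k], bounds[k + 1], name))
    return out


hand_labels = [(0, 53, ".pau"), (53, 113, "T"), (113, 170, "9r"),
               (170, 229, "i:"), (229, 273, ".pau")]
for start, end, category in remap(hand_labels):
    print(start, end, category)
```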
Force-Aligned Data
Often, the corpus we want to train on has text transcriptions but no time-aligned phonetic
labels. In this case, we can create either phonetic labels or category labels using a
process called "forced alignment".
Forced alignment is the process of using an existing recognizer to recognize a training utterance, where the grammar and vocabulary are restricted to be the correct result. (The correct result is the word-level transcription, which must be known). The result of forced alignment is a set of time-aligned labels that give the existing recognizer's best alignment of the correct phones or categories. If the existing recognizer is good, then the labels will have good time alignments. These labels can then be used for training a new recognizer. Even if the existing recognizer is not so good, this process can be used to determine an initial set of categories.
2.4.4 Number of Samples per Category
Finally, the designer of a recognizer must decide how many samples of each category to
train on. Usually, networks with decent performance can be trained using up to 500 samples
per category, but sometimes 2000 or more samples are used. In order to get best
performance, all samples in the training set should be used. However, training with all
samples may be very time-consuming.
If some categories have very few or no training samples, then there are two options.
The first option is to use an additional corpus that contains samples of these infrequent
classes. The second option is to "tie" these infrequent categories to
phonetically similar categories that do have enough training samples. Categories tied in
this way will not be trained on, and during recognition their probabilities will be set
equal to the probabilities of the categories that they were tied to.
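The effect of tying at recognition time can be sketched as follows. This is a simplified illustration, not Toolkit code: the function name is hypothetical, the probabilities are made up, and the particular tie (an unseen $ret<i: borrowing from $den<i:) is chosen only as an example.

```python
def apply_ties(outputs, ties):
    """Give each tied category the probability of the category it is tied to.

    outputs: dict mapping trained category -> network output for one frame.
    ties:    dict mapping tied (untrained) category -> trained category.
    """
    full = dict(outputs)
    for tied, source in ties.items():
        full[tied] = outputs[source]
    return full


frame_outputs = {"$den<i:": 0.6, "<i:>": 0.3, "i:>$sil": 0.1}   # made-up values
ties = {"$ret<i:": "$den<i:"}                                    # hypothetical tie
print(apply_ties(frame_outputs, ties)["$ret<i:"])  # 0.6
```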
2.5.1 Generating Data
Once the categories to train on have been found, and the number of samples per category
has been determined, the actual data that will be trained on are collected and stored in a
"vector file". This vector file contains, for each training sample, the features
that will be input to the neural network and the target category. (One set of training
features and the target category is called a "vector"; it is also called a
"sample".)
2.5.3 Number of Hidden Nodes
At CSLU, we use 3-layer feed-forward networks. The number of input nodes is the number of
spectral features, and the number of output nodes is the number of categories to be
trained on. The designer of a recognizer must decide how many hidden nodes the network
should have; in general, we have found 200 hidden nodes to be a reasonable number.
2.5.4 Negative Penalty
When using a large number of samples per category, it is nearly inevitable that some
categories will have far fewer samples than others, making it difficult to learn these
sparse categories. This difficulty in training is due to the fact that there are many more
negative samples than positive samples for a sparse category, where negative samples are
samples for which the category being trained on has a target value of 0, and positive
samples are samples for which the category being trained on has a target value of 1. As a
result, these sparse categories often have very small output values that don't reflect the
actual posterior probabilities that we want to obtain. To adjust for this, the amount that
each negative sample contributes to the total error is weighted by a value proportional to
the number of samples in that negative category; this value is called a "negative
penalty". Training can be done either with or without this negative penalty. A more
thorough discussion of the negative penalty can be found in the paper by Wei and van
Vuuren at ICASSP-98, "Improved Neural Network Training of Inter-Word Context Units
for Connected Digit Recognition."
2.5.5 Number of Training Iterations
It is almost never necessary to continue training until the training error stops
decreasing; the best performance on the development set will almost always happen much
sooner. Usually, best performance on the development set occurs after 20 to 30 iterations,
and so training is done for a fixed number of iterations, usually 30 to 40.
2.5.6 Re-Training on Force-Aligned Data
As described above, forced alignment can be used to generate labels for training. In order
to generate initial labels using forced alignment, we usually use a general-purpose
recognizer. We can also use forced alignment to re-train a network; in this case, we use
our current-best network to generate the forced-alignment labels and then train again
using these new labels. This re-training often yields better results.
2.5.7 Forward-Backward (Embedded) Training
One final method for improving results uses "forward-backward" or
"embedded" training. In forward-backward training, the targets of the neural
network are not binary values, but posterior probabilities. These probabilities are
determined using the forward-backward algorithm, in which a previously-trained neural
network is used to compute the observation probabilities. (The forward-backward algorithm
is usually used for training a Hidden Markov Model, and a good tutorial on this subject is
given in Rabiner and Juang's book "Fundamentals of Speech Recognition", in
Chapter 6). The use of the forward-backward algorithm for training a neural network
is described in a paper by Yan, Fanty, and Cole at ICASSP-97, "Speech Recognition Using Neural
Networks with Forward-Backward Probability Generated Targets".
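To show what "posterior probabilities as targets" means in practice, here is a minimal Python sketch of the forward-backward computation for one utterance. It is a textbook-style illustration, not the method from the cited paper or the Toolkit's implementation. It assumes the categories of the utterance form a strict left-to-right sequence, that staying in a category and advancing to the next are equally likely, and that the utterance has at least as many frames as categories. No scaling is applied, so it is only suitable for short sequences.

```python
def forward_backward_targets(obs):
    """Compute per-frame posteriors over a left-to-right category sequence.

    obs[t][s]: score from a previously trained network for the s-th category
               of the utterance (in its known order) at frame t.
    Returns gamma, where gamma[t][s] is the posterior of category s at frame t.
    """
    num_frames, num_states = len(obs), len(obs[0])
    alpha = [[0.0] * num_states for _ in range(num_frames)]
    beta = [[0.0] * num_states for _ in range(num_frames)]

    alpha[0][0] = obs[0][0]                       # must start in the first category
    for t in range(1, num_frames):
        for s in range(num_states):
            stay = alpha[t - 1][s]
            advance = alpha[t - 1][s - 1] if s > 0 else 0.0
            alpha[t][s] = (stay + advance) * obs[t][s]

    beta[num_frames - 1][num_states - 1] = 1.0    # must end in the last category
    for t in range(num_frames - 2, -1, -1):
        for s in range(num_states):
            stay = beta[t + 1][s] * obs[t + 1][s]
            advance = (beta[t + 1][s + 1] * obs[t + 1][s + 1]
                       if s + 1 < num_states else 0.0)
            beta[t][s] = stay + advance

    gamma = []
    for t in range(num_frames):
        row = [alpha[t][s] * beta[t][s] for s in range(num_states)]
        total = sum(row)
        gamma.append([value / total for value in row])
    return gamma


# Made-up network outputs for 4 frames and 2 categories.
scores = [[0.9, 0.1], [0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
targets = forward_backward_targets(scores)
for row in targets:
    print([round(value, 3) for value in row])
```

Each row of the result sums to 1 and would replace the binary (0 or 1) targets for that frame when the network is trained again.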
2.6.1 Word-Level Evaluation
Once we have trained for, say, 30 iterations, we need to determine which iteration has the
best performance on the development set. To do this, we recognize each utterance in the
development set using the network weights from each iteration. If the number of words in
each utterance is not known beforehand, we need to evaluate the performance at each
iteration in terms of substitution errors, insertion errors, and deletion errors. (If the
number of words is known beforehand, then we only need to measure substitution errors, but
the same method can be used). The overall accuracy of a network iteration is defined to be
100% - (Sub + Ins + Del), where Sub is the percentage of
substitution errors, Ins is the percentage of insertion errors, and Del is
the percentage of deletion errors. We can also measure the "sentence-level
accuracy", which is the number of utterances (or waveforms) recognized correctly
divided by the total number of utterances in the development set.
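The accuracy measures above can be computed with a standard minimum-edit-distance alignment between the correct word sequence and the recognized word sequence. The Python sketch below is one way to do this and is not the Toolkit's scoring code. It assumes that the error percentages are taken relative to the total number of words in the correct transcriptions, and it reports sentence-level accuracy as a percentage. The function names and the sample utterances are illustrative.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, insertions, deletions) from a minimum-edit alignment."""
    rows, cols = len(reference), len(hypothesis)
    # cost[i][j] = (total, subs, ins, dels) for reference[:i] versus hypothesis[:j]
    cost = [[None] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows + 1):
        cost[i][0] = (i, 0, 0, i)          # all reference words deleted
    for j in range(1, cols + 1):
        cost[0][j] = (j, 0, j, 0)          # all hypothesis words inserted
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            total, subs, ins, dels = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (total, subs, ins, dels)
            else:
                diagonal = (total + 1, subs + 1, ins, dels)
            total, subs, ins, dels = cost[i][j - 1]
            insertion = (total + 1, subs, ins + 1, dels)
            total, subs, ins, dels = cost[i - 1][j]
            deletion = (total + 1, subs, ins, dels + 1)
            cost[i][j] = min(diagonal, insertion, deletion)
    return cost[rows][cols][1:]


def accuracies(pairs):
    """pairs: list of (reference_words, hypothesis_words). Returns percentages."""
    num_words = sum(len(ref) for ref, _ in pairs)
    errors = sum(sum(count_errors(ref, hyp)) for ref, hyp in pairs)
    word_accuracy = 100.0 - 100.0 * errors / num_words
    correct = sum(1 for ref, hyp in pairs if ref == hyp)
    sentence_accuracy = 100.0 * correct / len(pairs)
    return word_accuracy, sentence_accuracy


pairs = [
    ("one two three".split(), "one two three".split()),   # correct
    ("four five".split(), "four nine five".split()),      # one insertion
    ("six seven".split(), "six".split()),                  # one deletion
    ("eight oh".split(), "eight zero".split()),            # one substitution
]
word_acc, sent_acc = accuracies(pairs)
print(f"{word_acc:.2f} {sent_acc:.2f}")  # 66.67 25.00
```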
2.6.2 Choosing the Best Iteration
Usually, we choose the network iteration with the best word-level accuracy; if two iterations have equal word-level accuracy, we select the one with the greater sentence-level accuracy.
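This selection rule amounts to a two-key comparison. A minimal sketch in Python follows; the function name is hypothetical, the accuracy figures are invented for illustration, and this is not the find_best.tcl script.

```python
def best_iteration(results):
    """results: list of (iteration, word_accuracy, sentence_accuracy) tuples.

    Word-level accuracy is compared first; sentence-level accuracy breaks ties.
    """
    return max(results, key=lambda r: (r[1], r[2]))[0]


# Invented development-set results for four iterations.
results = [(18, 95.1, 80.2), (22, 95.6, 81.0), (27, 95.6, 82.3), (30, 95.4, 82.5)]
print(best_iteration(results))  # 27
```

If two iterations tie on both measures, this sketch returns the earlier one in the list, which is an arbitrary choice.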
2.6.3 Testing
Once we have finished developing a recognizer, we evaluate the final performance on the
test set, in terms of word-level and sentence-level accuracy. It is important, however,
that once evaluation is done on the test set, the recognizer is not further modified based
on these test-set results. In order to make sure that such modifications are not done, the
test set is usually reserved until just before the recognizer is put into general-purpose
use (or just before publishing results in a journal or at a conference).
Given the background described in the previous section, the process of training a recognizer becomes relatively simple. This section gives the "recipe" for this training process.
The first step is to create a description of the recognizer and describe how the data will be selected for training. The files that need to be created are:
Given the files created above, the scripts to use in order to find data files for training are:
Once the files have been selected, the category files have been created, and the desc file is correct, then we can use the following scripts and programs to select frames for training:
Create force-aligned data using the best iteration of the network that was just trained. To do this, create an info file for forced alignment that specifies a new directory in which to put the category files and a forced-alignment script to use. Then use "find_files.tcl" and "gen_catfiles.tcl" to generate the force-aligned labels.
Repeat Sections 3.3 and 3.4 to create a network trained on this force-aligned data.
Given a network trained on force-aligned data, create a third network using the forward-backward method. To construct such a "forward-backward network", do the following:
Repeat this cycle to create another forward-backward network. This final network (the result of the second cycle of forward-backward training) should have the best performance of all the networks, although sometimes the first forward-backward network has better performance.
Use "find_best.tcl" to
evaluate the best network's performance on the test set. These are the final results that
are acceptable for publication.
To illustrate the procedure described above, this section gives the example of training a continuous-speech digits recognizer. All commands should be entered using a DOS or Unix command window. (In Windows, click on Start, then Programs, then Command Prompt to bring up a DOS window. In Windows 95, all Tcl commands should be prefaced with "tclsh80" in order to properly invoke the Tcl interpreter.) Text given in bold indicates commands that are typed from a command window; text in courier font indicates the output from this command. In DOS, all commands must be entered on one line; if a backslash is used in the examples below to continue the command on another line, this must be typed as one line with no backslash when using DOS. The parameters for each script and program are explained in Section 6. The data files that are used in this example are located in a zip file available for downloading (make sure that you preserve the directory structure of the files in the zip file).
[Step 1] In this initialization step, set up the directory structure that you will use. We recommend that you create one directory for each "project", where a project contains all of the files created during the training of a network. For this example, we will be using a project directory called \tutorial\digit. Note that some files (vector files in particular) may take up a large amount of disk space; you may want to delete these files after you are finished using them. Now is a good time to make sure that your path contains the location of the training scripts as well as the stand-alone C programs used for training. To check this, if you type "categories.tcl" in your project directory, you should get the following:
categories.tcl
Usage: categories.tcl <.info file> <.vocab file> <.parts file> <.desc file> <.olddesc file> [-isolated]
and if you type "checkvec" in your project directory, you should get the following:
checkvec
give vec file
If you don't get these responses, contact the person who installed the Toolkit to find the location of the "script\training_1.0" directory and the "bin" directory within the Toolkit directory hierarchy.
It may also be convenient to copy the recog.tcl script from the Toolkit's "script\training_1.0" directory into a convenient directory (such as \tutorial\src), as you will need to specify the path to this file several times. Finally, you can copy the files corpora, digit.train.info, digit.dev.info, digit.test.info, digit.vocab, digit.parts, and remap_tutorial.tcl from the links provided here into your project directory (\tutorial\digit). This will save you some typing in the following Steps 2 through 6. Note that for Windows users, downloading these files will often result in different filenames, where the first "." has been replaced by "_" or the 4th letter of the extension has been removed; we recommend that you rename these files back to their original names. For this example, these files will only require minor modifications (for the path information), but Section 5 describes the format of these files so that you can change them or create them from scratch later on, in order to train on another task or train using different parameters.
[Step 2] Create a corpora file, called "corpora" (no extension). For this tutorial, the corpora file might look like this (assuming that the tutorial data is stored in \tutorial\data):
type corpora
corpus: numbers
wav_path /tutorial/data/speechfiles
txt_path /tutorial/data/txtfiles
phn_path /tutorial/data/phnfiles
format {NU-([0-9]+)\.[A-Za-z0-9_]+}
wav_ext wav
txt_ext txt
phn_ext phn
cat_ext cat
cull_file /tutorial/digit/numbers.cull5
ID: {regexp $format $filename filematch ID}
It is probably also a good idea to make sure that your filenames have the same format as specified in the "corpora" file; the format is case sensitive, so NU-78.zipcode.wav is different from nu-78.zipcode.wav.
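To see how the "format" pattern picks out a file's ID, and why case matters, the same regular expression can be tried outside the Toolkit. The sketch below uses Python's re module in place of the Tcl regexp command shown in the corpora file; the function name is illustrative, and returning the ID as an integer is a choice made for this example.

```python
import re

# The pattern from the "format" field of the corpora file.
FORMAT = r"NU-([0-9]+)\.[A-Za-z0-9_]+"


def file_id(filename):
    """Return the numeric ID captured by the pattern, or None if there is no match."""
    match = re.search(FORMAT, filename)
    return int(match.group(1)) if match else None


print(file_id("NU-78.zipcode.wav"))  # 78
print(file_id("nu-78.zipcode.wav"))  # None (the pattern is case sensitive)
```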
[Step 3] To start with, we run cull5.tcl to remove 5% of all available data for third-party evaluation. This creates a cull file called numbers.cull5. This cull file simply contains a list of waveform files that won't be used for training, development, or testing.
cull5.tcl numbers corpora
Please be patient... this may take a minute or two.
Found 653 wav files in corpus 'numbers'
Culled 32 out of 653 files (4.90%)
Done.
[Step 4]
Create info files for training, development, and testing. They will be called digit.train.info, digit.dev.info, and digit.test.info. We will only request 200 samples per category, so that this tutorial doesn't take more time than necessary. If one were constructing a real-life network, it would be better to use all available samples. To specify all samples, use the keyword ALL instead of 200 in the "want:" field in digit.train.info.
For the digit.train.info file, we are specifying that we want training data from the numbers corpus, and we will put time-aligned category labels in the \tutorial\digit\numbers_train directory (specified in the "partition:", "name:", and "cat_path:" fields). We require the presence of waveform, phonetically-labeled, and text transcription files in order to do this (specified in the "require:" field), and we'll use 3/5 of available files (specified in the "partition:" field). We won't skip over any files (specified in the "filter:" field), but we will require that all of the vocabulary words in the text file are words we want to recognize (specified in the "vocab:" field). We will remap the hand-labeled phonetic files (which can have a high degree of variability in the phonemes used to represent a word) to a consistent set of phonemes using the remap_tutorial.tcl script (specified in the "remap:" field). For more information about the meaning of the various fields, see the description of the info file format.
type digit.train.info
basename: digit;
partition: train;
vector_size: 130;
corpus: name: numbers
cat_path: /tutorial/digit/numbers_train
require: wpt
partition: "{expr $ID % 5} {0 1 2}"
filter: 1+1
vocab: digit.vocab
remap: /tutorial/digit/remap_tutorial.tcl
want: 200;
type digit.dev.info
partition: dev;
basename: digit;
corpus: name: numbers
require: wt
partition: "{expr $ID % 5} {3}"
filter: 1+1
vocab: digit.vocab;
type digit.test.info
partition: test;
basename: digit;
corpus: name: numbers
require: wt
partition: "{expr $ID % 5} {4}"
filter: 1+1
vocab: digit.vocab;
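The three "partition:" fields divide the corpus by the remainder of each file's ID when divided by 5. The following Python sketch mirrors that rule for illustration; it is not how the Toolkit evaluates the field (the info files use a Tcl expr), and the names are hypothetical.

```python
# Remainders of ID % 5 assigned to each partition, as in the three info files.
PARTITIONS = {"train": {0, 1, 2}, "dev": {3}, "test": {4}}


def partition_of(file_id):
    """Return the partition that a file ID falls into."""
    remainder = file_id % 5
    for name, remainders in PARTITIONS.items():
        if remainder in remainders:
            return name
    raise ValueError("remainder not assigned to any partition")


for file_id in (25, 46, 47, 78, 79):
    print(file_id, partition_of(file_id))
# 25 train, 46 train, 47 train, 78 dev, 79 test
```

Because the remainders do not overlap, no file can appear in more than one of the three sets.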
[Step 5] Create a vocab file, called digit.vocab. This file contains the words, their pronunciations, and the grammar to be used during recognition.
type digit.vocab
zero {z I 9r oU} ;
oh {oU} ;
one {w ^ n} ;
two {uc th u} ;
three {T 9r i:} ;
four {f >r} ;
five {f aI v} ;
six {s I uc ks} ;
seven {s E v & n} ;
eight {ei uc [th]} ;
nine {n aI n} ;
separator {.pau [.garbage] .pau} ;
$digit = zero | oh | one | two | three | four | five | six | seven | eight | nine;
$grammar = ([separator%%] < $digit [separator%%] > [separator%%]);
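In the pronunciations above, square brackets mark an optional phone, so "eight" and "separator" each stand for two alternative pronunciations. The Python sketch below expands such entries. It is an illustration only, not the Toolkit's vocab-file parser, and it assumes that each bracketed item is a single symbol with no spaces inside the brackets.

```python
def expand_optional(pronunciation):
    """Expand [x] optional symbols into every alternative pronunciation."""
    alternatives = [[]]
    for symbol in pronunciation.split():
        if symbol.startswith("[") and symbol.endswith("]"):
            # Keep the alternatives without the symbol and add copies with it.
            with_symbol = [alt + [symbol[1:-1]] for alt in alternatives]
            alternatives = alternatives + with_symbol
        else:
            alternatives = [alt + [symbol] for alt in alternatives]
    return [" ".join(alt) for alt in alternatives]


print(expand_optional("ei uc [th]"))            # ['ei uc', 'ei uc th']
print(expand_optional(".pau [.garbage] .pau"))  # ['.pau .pau', '.pau .garbage .pau']
```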
[Step 6]
Create a parts file, called digit.parts. This contains the number of parts that each phoneme will be split into, the groupings of phones into clusters of similar phones, and mappings from one phone to another.
type digit.parts
.pau 1 ;
uc 1 ;
f 2 ;
v 2 ;
T 2 ;
s 2 ;
z 2 ;
n 2 ;
w 2 ;
I 2 ;
ks 2 ;
& 2 ;
^ 2 ;
9r 2 ;
E 2 ;
\>r 3 ;
i: 3 ;
u 3 ;
ei 3 ;
aI 3 ;
oU 3 ;
th r ;
$sil = .pau tc .garbage;
$den_l = s z th T ks;
$den_r = s z th T;
$lab = f v;
$ret_l = 9r \>r;
$ret_r = 9r ;
$bck_l = oU u;
$bck_r = oU w;
map uc tc;
map uc kc;
[Step 7]
Run find_files.tcl in order to find files suitable for training. The input files (besides digit.train.info and corpora) are \tutorial\digit\numbers.cull5 and digit.vocab. The output file is digit.train.numbers.files; this filename is constructed from the basename, the partition, and the corpus. The reason that the user doesn't specify the output filename on the command line is that it is possible, when using several corpora, to create several output files; it seems easier to have the filenames automatically determined than to have the user specify one filename for each corpus.
find_files.tcl digit.train.info corpora
Basename: digit
Partition: train
Corpus: numbers
cat_ext: cat
txt_ext: txt
partition: {expr $ID % 5} {0 1 2}
cat_path: /tutorial/digit/numbers_train
txt_path: /tutorial/data/txtfiles
remap: /tutorial/digit/remap_tutorial.tcl
phn_ext: phn
filter: 1+1
wav_ext: wav
cull_file: /tutorial/digit/numbers.cull5
name: numbers
phn_path: /tutorial/data/phnfiles
want: 200
wav_path: /tutorial/data/speechfiles
vocab: digit.vocab
require: wp
Total of 530 wave files to be used
NU-25.zipcode.wav
NU-30.zipcode.wav
NU-46.streetaddr.wav
NU-47.zipcode.wav
(etc)
NU-596.zipcode.wav
NU-597.streetaddr.wav
Final count of 530 files for this corpus
Done.
Then, run find_files.tcl a second and third time to find files suitable for development and testing:
find_files.tcl digit.dev.info corpora
find_files.tcl digit.test.info corpora
[Step 8]
Run categories.tcl to determine the context-dependent categories that will be classified by the recognizer. The input files are the vocab, parts, and info files. The output files are the desc and olddesc files; these files contain not only the list of the context-dependent categories, but also some other information about the recognizer that we will be creating.
categories.tcl digit.train.info digit.vocab digit.parts digit.train.desc digit.train.olddesc
Basename: digit
Partition: train
Corpus: numbers
partition: {expr $ID % 5} {0 1 2}
cat_path: /tutorial/digit/numbers_train
remap: /tutorial/digit/remap_tutorial.tcl
filter: 1+1
name: numbers
want: 200
vocab: digit.vocab
require: wp
zero: z I 9r oU
oh: oU
one: w ^ n
two: uc th u
three: T 9r i:
four: f GREATER_THANr
five: f aI v
six: s I uc ks
seven: s E v & n
eight: ei uc
eight: ei uc th
nine: n aI n
separator: .pau .garbage .pau
separator: .pau .pau
word begin = z oU w uc T f s ei n .pau
word end = oU n u i: GREATER_THANr v ks uc th .pau
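The "word begin" and "word end" lines in this output list the phones that can start or end a word, which is the information needed to work out contexts across word boundaries. As an illustration (not the code of categories.tcl), the Python sketch below derives both lines from the expanded pronunciations printed above, keeping the first occurrence of each phone in order. The pronunciations are copied from the listing, including the GREATER_THANr spelling that the script prints for ">r".

```python
PRONUNCIATIONS = [
    ("zero", "z I 9r oU"), ("oh", "oU"), ("one", "w ^ n"), ("two", "uc th u"),
    ("three", "T 9r i:"), ("four", "f GREATER_THANr"), ("five", "f aI v"),
    ("six", "s I uc ks"), ("seven", "s E v & n"), ("eight", "ei uc"),
    ("eight", "ei uc th"), ("nine", "n aI n"),
    ("separator", ".pau .garbage .pau"), ("separator", ".pau .pau"),
]


def unique_in_order(items):
    """Drop repeats while keeping the order of first appearance."""
    return list(dict.fromkeys(items))


word_begin = unique_in_order(pron.split()[0] for _, pron in PRONUNCIATIONS)
word_end = unique_in_order(pron.split()[-1] for _, pron in PRONUNCIATIONS)
print("word begin =", " ".join(word_begin))
print("word end =", " ".join(word_end))
```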
[Step 9]
Run gen_catfiles.tcl to take the list of files for training (digit.train.numbers.files) and create time-aligned labels of categories to train on. The input file (other than digit.train.info and corpora) is digit.train.numbers.files. If specified in digit.train.info, the script in the "remap:" field will be used, or the script in the "force_cat:" or "force_phn:" fields will be used (in this case, we haven't specified the "force_cat:" or "force_phn:" fields because we are not yet doing forced alignment). The category label files that are created are stored in the directory that is specified in digit.train.info in the "cat_path:" field. (For the February 1999 release of the Toolkit, there is also the option to store the category labels in one file, called a "master label file"; this master label file is specified instead of the "cat_path:" field, after the partition information in the .info file.) The gen_catfiles.tcl script also creates two other output files: the "dur" file and the "counts" file. The dur file contains minimum and maximum duration limits for each category, as determined from the category label files; the counts file lists the number of occurrences (and total time in msec) of each category.
gen_catfiles.tcl digit.train.info digit.train.desc corpora digit.train.dur digit.train.counts
Basename: digit
Partition: train
Corpus: numbers
cat_ext: cat
txt_ext: txt
partition: {expr $ID % 5} {0 1 2}
cat_path: /tutorial/digit/numbers_train
txt_path: /tutorial/data/txtfiles
remap: /tutorial/digit/remap_tutorial.tcl
phn_ext: phn
filter: 1+1
wav_ext: wav
cull_file: /tutorial/digit/numbers.cull5
name: numbers
phn_path: /tutorial/data/phnfiles
want: 200
wav_path: /tutorial/data/speechfiles
vocab: digit.vocab
require: wp
Creating directory /tutorial/digit/numbers_train/0
Created file NU-25.zipcode.cat: oh seven three oh six
Created file NU-30.zipcode.cat: one zero zero one four
Created file NU-46.streetaddr.cat: one six
Created file NU-47.zipcode.cat: one one three five four
(etc)
Created file NU-596.streetaddr.cat: nine eight zero four
Created file NU-596.zipcode.cat: eight seven one one two
Created file NU-597.streetaddr.cat: one
Sorting durations... taking lowest 5% and top 100% of durations
Done.
This script may generate messages such as
Merging 2323 2344 .tc with right (.pau) (too short)
These are simply messages to the user that some labels are being merged or deleted when converting from hand labels to categories. These messages come from the remapping script, in this case remap_tutorial.tcl. No action needs to be taken by the user. At the end, for each category, the duration that is at the bottom 2nd percentile of all durations for that category is written to the dur file as the minimum duration, and the longest duration of the category is written to the dur file as the maximum duration. These limits help the Viterbi search refrain from inserting very short words during recognition.
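The duration limits described above can be sketched as follows. This is only an illustration in Python, not the code used by gen_catfiles.tcl: the function name duration_limits, the nearest-rank percentile rule, and the sample durations are all assumptions made for the example. The percentile is left as a parameter.

```python
import math

def duration_limits(durations_ms, low_percentile=2.0):
    """Return (min_limit, max_limit) for one category.

    min_limit is the duration found at the given low percentile
    (nearest-rank rule, an assumption); max_limit is the longest
    duration observed for the category.
    """
    if not durations_ms:
        raise ValueError("no durations for this category")
    ordered = sorted(durations_ms)
    # Nearest rank: smallest value with at least p% of the data at or below it.
    rank = max(1, math.ceil(low_percentile * len(ordered) / 100.0))
    return ordered[rank - 1], ordered[-1]

# Hypothetical durations (msec) for one category: one very short
# outlier followed by 99 more typical values.
example = [12] + list(range(40, 139))
limits = duration_limits(example)
print(limits)
```

With a low percentile as the minimum, a single unusually short label (12 msec here) does not become the minimum duration for the category.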
[Step 10] Run revise_desc.tcl to make sure that we have enough samples of each category to train on, and to add duration limits to the desc and olddesc files. If there are not enough samples of a category, this script allows us to tie that category to a category with more samples. This is the only interactive script in the entire training and recognition process. The input files are the counts file, the dur file, the desc file, and the olddesc file. The output of this script is a modified desc file and a modified olddesc file, which now include category-tying information and duration limits.
revise_desc.tcl digit.train.counts digit.train.dur digit.train.desc \
digit.train.olddesc -min 3
The following states have already been tied:
(none)
Warnings:
Category uc<ei has 2 occurrences (total duration of 161 msec)
Category ks>ei has 2 occurrences (total duration of 131 msec)
Category n<n has 0 occurrences (total duration of 0 msec)
Category n>n has 0 occurrences (total duration of 0 msec)
Category uc<oU has 2 occurrences (total duration of 224 msec)
Category th>ei has 1 occurrences (total duration of 85 msec)
Category th>n has 2 occurrences (total duration of 96 msec)
Category th>uc has 0 occurrences (total duration of 0 msec)
Do you want this program to create ties? yes
All Available Categories:
$lab<& &>n
<.pau> $den_l<9r I<9r
9r>$bck_r 9r>i: $den_l<E
E>$lab $lab<>r
<>r> >r>$bck_r >r>$den_r >r>$lab
>r>$sil >r>ei
>r>n >r>uc
$den_l<I I>$ret_r I>uc
$bck_l<T $den_l<T $lab<T
$ret_l<T $sil<T
i:<T n<T
uc<T T>$ret_r
w<^ ^>n
$lab<aI n<aI
<aI> aI>$lab aI>n $bck_l<ei
$den_l<ei $lab<ei
$ret_l<ei $sil<ei
i:<ei n<ei
<ei> ei>uc $bck_l<f $den_l<f
$lab<f $ret_l<f
$sil<f i:<f
n<f uc<f
f>>r f>aI
$ret_l<i: <i:> i:>$bck_r
i:>$den_r i:>$lab i:>$sil
i:>ei i:>n
i:>uc uc<ks ks>$bck_r
ks>$den_r ks>$lab
ks>$sil ks>n ks>uc
$bck_l<n $den_l<n $lab<n
$ret_l<n $sil<n
&<n ^<n aI<n
i:<n uc<n
n>$bck_r n>$den_r n>$lab
n>$sil n>aI n>ei
n>uc $bck_l<oU $den_l<oU $lab<oU
$ret_l<oU $sil<oU
i:<oU n<oU
<oU> oU>$bck_r oU>$den_r
oU>$lab oU>$sil
oU>ei oU>n oU>uc
$bck_l<s $den_l<s $lab<s
$ret_l<s $sil<s
i:<s n<s uc<s
s>E s>I
th>$bck_r th>$den_r th>$lab
th>$sil th>u $den_l<u
<u> u>$bck_r
u>$den_r u>$lab
u>$sil u>ei
u>n u>uc
<uc>
E<v aI<v v>$bck_r
v>$den_r v>$lab
v>$sil v>&
v>ei
v>n v>uc $bck_l<w
$den_l<w $lab<w $ret_l<w $sil<w
i:<w
n<w uc<w
w>^ $bck_l<z $den_l<z $lab<z $ret_l<z
$sil<z
i:<z n<z
uc<z z>I
Tie category uc<ei (2 occurrences) to which category? no
Tie category ks>ei (2 occurrences) to which category? no
Tie category n<n (0 occurrences) to which category? $sil<n
Tie category n>n (0 occurrences) to which category? n>$sil
Tie category uc<oU (2 occurrences) to which category? no
Tie category th>ei (1 occurrences) to which category? no
Tie category th>n (2 occurrences) to which category? th>$sil
Tie category th>uc (0 occurrences) to which category? th>$sil
Would you like to create duration limits with data from the durations file? yes
Modifying duration information in digit.desc...
Confirmation: Do you really want to change the desc file 'digit.desc'? yes
Done.
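The check behind the warnings in this session can be sketched in a few lines. The sketch below is in Python and is not the code of revise_desc.tcl; the function names, the in-memory dictionary (the real counts file format is not shown here), and the two counts marked as hypothetical are assumptions. The threshold of 3 corresponds to the "-min 3" option, and the answers are the ones typed in the session above.

```python
def find_sparse_categories(counts, min_samples=3):
    """Return the categories that have fewer than min_samples occurrences."""
    return [cat for cat, n in counts.items() if n < min_samples]

def resolve_ties(sparse, answers):
    """Map each sparse category to its tie target; 'no' leaves it untied."""
    return {cat: answers[cat] for cat in sparse
            if answers.get(cat, "no") != "no"}

counts = {
    "uc<ei": 2, "ks>ei": 2, "n<n": 0, "n>n": 0,
    "uc<oU": 2, "th>ei": 1, "th>n": 2, "th>uc": 0,
    "$sil<n": 150,  # hypothetical count for a well-populated category
    "th>$sil": 40,  # hypothetical count for a well-populated category
}
answers = {
    "uc<ei": "no", "ks>ei": "no", "n<n": "$sil<n", "n>n": "n>$sil",
    "uc<oU": "no", "th>ei": "no", "th>n": "th>$sil", "th>uc": "th>$sil",
}

sparse = find_sparse_categories(counts)
ties = resolve_ties(sparse, answers)
for source, target in ties.items():
    print("tie:", source, "->", target)
```

Categories answered with "no" stay as separate outputs and are trained on the few samples that exist.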
[Step 11]
Run hscript.exe to create digit.rr, digit.list, and digit.0 from the revised digit.train.desc file. The digit.rr file contains a binary description of the recognizer; the digit.list file contains an ASCII list of the categories; and the digit.0 file contains an initial HMM description that is used in some of the cslush function calls. (These cslush function calls are designed to work with both neural networks and HMMs, and an HMM description is required even when using or training neural networks.)
hscript.exe digit.train.desc
opening file: digit.train.desc
Input model:
Output model: digit.0 digit.list digit.rr
tie: $sil<n to n<n (1)->(1)
tie: n>$sil to n>n (1)->(1)
tie: th>$sil to th>n (1)->(1)
tie: th>$sil to th>uc (1)->(1)
stateNum = 162
transpNum = 166
[Step 12]
Run pickframes.tcl to select frames for training, from the files created by gen_catfiles. The input files (other than digit.train.info and corpora) are digit.rr, digit.list, digit.0, and digit.train.numbers.files. (For the February 1999 release of the Toolkit, if a master label file is generated, then this is also an input file to pickframes.tcl.) The output of this script is the file digit.train.pick, which contains a binary list of files, the frames to be used in each file, and the categories corresponding to these frames.
pickframes.tcl digit.train.info corpora digit.train.pick
Basename: digit
Partition: train
Corpus: numbers
cat_ext: cat
txt_ext: txt
partition: {expr $ID % 5} {0 1 2}
cat_path: /tutorial/digit/numbers_train
txt_path: /tutorial/data/txtfiles
remap: /tutorial/digit/remap_tutorial.tcl
phn_ext: phn
filter: 1+1
wav_ext: wav
cull_file: /tutorial/digit/numbers.cull5
name: numbers
phn_path: /tutorial/data/phnfiles
want: 200
wav_path: /tutorial/data/speechfiles
vocab: digit.vocab
require: wp
digit.train.numbers.files, 200, 0
Picking frames for digit.train.numbers.files...
{{$lab<&} {0 200}} {&>n {1 200}} {<.pau> {2 200}} {{$den_l<9r}
(etc)
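The effect of the "want: 200" setting, at most 200 frames for each category, can be illustrated with a short Python sketch. How pickframes.tcl actually chooses which frames to keep is not described here, so the uniform random sampling, the function name pick_frames, and the frame counts below are assumptions made for the example.

```python
import random

def pick_frames(frames_by_category, want=200, seed=0):
    """Keep at most `want` frames per category (random subset, an assumption)."""
    rng = random.Random(seed)
    picked = {}
    for category, frames in frames_by_category.items():
        if len(frames) <= want:
            picked[category] = list(frames)       # too few: keep them all
        else:
            picked[category] = sorted(rng.sample(frames, want))
    return picked

# Hypothetical frame indices for three categories.
available = {
    "<.pau>": list(range(5000)),
    "&>n": list(range(480)),
    "rare": list(range(54)),
}
picked = pick_frames(available, want=200)
for category, frames in picked.items():
    print(category, len(frames))
```

Categories with fewer frames than the requested number simply contribute everything they have, which is consistent with checkvec.exe later reporting some categories with fewer than 200 samples.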
[Step 13]
Run genvec.tcl to compute features for all of the frames given in digit.train.pick. The input files are digit.train.info, digit.train.olddesc, and digit.train.pick. The features that are computed, and the target category values, are stored in the binary output file digit.train.vec. Note that if you want to use features that are different from the standard 130 features, you can write the code used to create the new features; the location of your code can be specified in the olddesc file. Also, the description of the format of the vector file given in Section 5 may be of interest.
genvec.tcl digit.train.info digit.train.olddesc digit.train.pick digit.train.vec
/tutorial/data/speechfiles/0/NU-25.zipcode.wav
/tutorial/data/speechfiles/0/NU-30.zipcode.wav
/tutorial/data/speechfiles/0/NU-46.streetaddr.wav
/tutorial/data/speechfiles/0/NU-47.zipcode.wav
/tutorial/data/speechfiles/0/NU-51.other2.wav
/tutorial/data/speechfiles/0/NU-51.other3.wav
/tutorial/data/speechfiles/0/NU-51.zipcode.wav
(etc)
[Step 14]
Run checkvec.exe to make sure that the vector file that we created has the correct format, and that every category has at least one sample to train on. The numbers on the left are the numbers corresponding to each category, and the numbers on the right are the number of samples for that category. The input file is digit.train.vec; the only output goes to the screen for the user to check.
checkvec.exe digit.train.vec
1: 200
2: 200
3: 198
4: 200
5: 200
6: 200
(etc)
156: 54
157: 200
158: 30
159: 153
160: 18
161: 200
21509 vectors with 130 features
[Step 15]
In order to train the neural network, one of two methods can be used. The first is the older method, which is an executable program that generates weights in floating-point format. This older method is the only method available for releases of the Toolkit before February 1999. The second, newer method is a Tcl script that generates weights in double-precision format. The advantage of using the program instead of the script is that the program can be scheduled to run even when the user is logged out (in Windows NT), whereas the user must remain logged in to train using the script. Also, the results from the executable program can be somewhat better than the results from the newer method. The recognition scripts will work with files generated using either method. As a result, for now we recommend using the older method with the executable program nntrain.exe.
For the executable program method:
Run nntrain.exe to train the neural network on the vector file digit.train.vec. This program creates a weights file at each iteration; we will select the best weights file after training for 30 iterations. The -l option indicates that the negative penalty will be adjusted to compensate for varying numbers of samples per category; -sn 88 and -sv 88 are random-number seeds; -a 3 130 200 161 specifies the architecture of the net: 3 layers, with 130 nodes in the first layer, 200 nodes in the hidden layer, and 161 nodes in the output layer. The value 30 specifies training for 30 iterations, and the last parameter is the vector file to use for training.
nntrain.exe -l -sn 88 -sv 88 -a 3 130 200 161 30 digit.train.vec
creating net with seed 88
3 layers: 131 200 161
learning rate 0.050000
momentum 0.000000
negative weight 1.000000
training file digit.train.vec
numvec: 21509; tau: 107545.000000
vectors chosen in 1 blocks of 21509 with seed 88
iter 1: learn_rate 0.041667; total error is 75884.882813
iter 2: learn_rate 0.035714; total error is 57095.183594
iter 3: learn_rate 0.031250; total error is 50935.089844
iter 4: learn_rate 0.027778; total error is 46711.523438
(etc)
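The learning rates printed in the log above are consistent with a decay of the form rate = initial_rate / (1 + vectors_seen / tau), where vectors_seen is the iteration number times numvec. This is an inference from the printed numbers, not a documented description of nntrain.exe; the function name and the formula in this Python sketch are therefore assumptions.

```python
def learn_rate(iteration, initial_rate=0.05, numvec=21509, tau=107545.0):
    """Decayed learning rate after `iteration` passes over the training vectors.

    initial_rate, numvec, and tau are the values printed in the log;
    the formula itself is inferred from the logged rates.
    """
    vectors_seen = iteration * numvec
    return initial_rate / (1.0 + vectors_seen / tau)

rates = [round(learn_rate(k), 6) for k in range(1, 5)]
for k, rate in enumerate(rates, start=1):
    print("iter", k, "learn_rate", rate)
```

Note that tau in the log is exactly five times numvec, so under this reading the rate falls to half of its initial value after five iterations.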
For the Tcl script method:
Run train_nnet.tcl to train the neural network on the vector file digit.train.vec. This script creates a weights file at each iteration; we will select the best weights file after training for 30 iterations. The first argument is the vector file, the second argument (in quotes) gives the size of each layer, the next argument is the number of training iterations, and the final argument is the name of the output log file.
train_nnet.tcl digit.train.vec "130 200 161" 30 training.log
Notes: For specifying the architecture, note that the number of nodes in the first layer will always be 130 for the standard feature set. The number of hidden nodes is decided by the user, but 200 is a reasonable number. The number of output nodes to use is written in a comment in the digit.desc file after running revise_desc.tcl. (This is the same as the number of categories that are not tied, and it excludes the <.garbage> category.) Also, the output of checkvec.exe indicates the number of output nodes to use in the third layer of the network, since the last pair of numbers from checkvec.exe gives the final output number (161 for this example) and the number of samples of that last output. The number 161 used in this example may change, depending on the number of states that have been tied and the information in the .vocab and .parts files.
The only input file to either nntrain.exe or train_nnet.tcl is the vector file; the output files are the neural-network weights files for each iteration (the default names are nnet.X, where X is an integer from 0 to the number of iterations).
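As an illustration of the note on choosing the architecture, the following Python sketch reads checkvec-style output and takes the highest category number as the size of the output layer. The function name parse_checkvec is hypothetical, and the sample text is an abbreviated copy of the checkvec.exe listing shown in Step 14; the sketch is not part of the Toolkit.

```python
def parse_checkvec(text):
    """Return {category_number: sample_count} from checkvec-style output."""
    counts = {}
    for line in text.splitlines():
        left, sep, right = line.partition(":")
        # Keep only "number: number" lines; the summary line has no colon.
        if sep and left.strip().isdigit() and right.strip().isdigit():
            counts[int(left)] = int(right)
    return counts

sample_output = """1: 200
2: 200
3: 198
156: 54
160: 18
161: 200
21509 vectors with 130 features
"""

counts = parse_checkvec(sample_output)
num_outputs = max(counts)                 # final category number
architecture = [130, 200, num_outputs]    # 130 standard features, 200 hidden nodes
print(architecture)
```

The resulting three numbers are the layer sizes that follow "-a 3" on the nntrain.exe command line.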
[Step 16]
Run find_best.tcl to evaluate the performance of each iteration (weight file) on the development-set data. This script may take a long time, especially if there are many files in the development set. The invocation below assumes that the script for doing recognition is located in \tutorial\src\recog.tcl. The input files are the neural-network files created by nntrain, the digit.dev.numbers.files file, the vocab file, and the olddesc file, as well as digit.rr. The output files are ali files and a summary file.
find_best.tcl nnet digit.dev.numbers.files digit.vocab \
digit.train.olddesc hand_labels.summary -b 15 -g 10
Garbage value is 10
Basename for the .ali files is wrdalign_nnet
Summary file is hand_labels.summary
Starting Iteration 30...
(etc)
Itr  #Snt  #Words  Sub%   Ins%   Del%   WrdAcc%  SntCorr
15   91    408     3.43%  1.23%  1.47%  93.87%   79.12%
16   91    408     3.43%  1.47%  1.47%  93.63%   79.12%
17   91    408     3.43%  1.23%  1.72%  93.63%   78.02%
18   91    408     3.68%  1.23%  1.47%  93.63%   79.12%
19   91    408     3.92%  0.98%  1.23%  93.87%   78.02%
20   91    408     2.94%  1.23%  1.23%  94.61%   80.22%
21   91    408     2.94%  1.47%  1.23%  94.36%   80.22%
22   91    408     3.68%  1.47%  1.47%  93.38%   76.92%
23   91    408     3.43%  1.23%  1.23%  94.12%   80.22%
24   91    408     3.43%  0.74%  1.23%  94.61%   80.22%
25   91    408     3.68%  0.74%  1.23%  94.36%   79.12%
26   91    408     3.68%  1.23%  1.23%  93.87%   79.12%
27   91    408     3.92%  0.98%  1.23%  93.87%   79.12%
28   91    408     3.19%  0.98%  1.23%  94.61%   82.42%
29   91    408     3.19%  1.47%  1.23%  94.12%   79.12%
30   91    408     3.19%  1.23%  1.23%  94.36%   80.22%
Best results (94.61, 82.42) with network nnet.28
Evaluated 16 networks using 91 files
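The columns of the summary are related in the usual way: word accuracy is 100% minus the substitution, insertion, and deletion rates. The Python sketch below recomputes the word accuracy for a few iterations and picks the best network. It is not the code of find_best.tcl. The error counts are back-computed from the percentages in the summary (for example, 3.19% of 408 words is 13 substitutions), and the selection rule (highest word accuracy, then highest sentence accuracy, then the later iteration) is an assumption that happens to agree with the results reported in this tutorial.

```python
N_WORDS = 408

# (iteration, substitutions, insertions, deletions, SntCorr%)
rows = [
    (20, 12, 5, 5, 80.22),
    (24, 14, 3, 5, 80.22),
    (27, 16, 4, 5, 79.12),
    (28, 13, 4, 5, 82.42),
]

def word_accuracy(sub, ins, dele, n_words=N_WORDS):
    """Word accuracy in percent: 100 * (N - S - I - D) / N."""
    return round(100.0 * (n_words - sub - ins - dele) / n_words, 2)

def best_network(rows):
    """Return (WrdAcc%, SntCorr%, iteration) of the best row (assumed rule)."""
    scored = [(word_accuracy(s, i, d), snt, itr) for itr, s, i, d, snt in rows]
    return max(scored)  # tuples compare element by element

best = best_network(rows)
print("Best results (%.2f, %.2f) with iteration %d" % best)
```

Because insertions count as errors, word accuracy computed this way can be lower than the fraction of reference words that were recognized correctly.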
Note that training on only 200 samples per category has a large influence on results; when I trained using all available samples instead of 200 per category (using the keyword "ALL" instead of 200 in digit.train.info), results were more than 30% better. The drawback to training on all samples is that nntrain.exe or train_nnet.tcl takes longer. If you have time and want better results, it is beneficial to use as much data as possible.
[Step 17] At this point, it may be helpful to see what kinds of errors are being made. We can browse through the development set, looking at the waveform, spectrogram, and word results for each error. The script browse.tcl will go through the alignment file of the best iteration and find errors. It will then perform recognition and create a wrd file and cat file of the result. At the question-mark prompt, you can type "-e" to find the next error, type <return> to go to the next file in the list of files, or type "q" to quit the program. (There are other options, but these are the most commonly used).
browse.tcl digit.dev.numbers.files \tutorial\src\recog.tcl nnet.28 digit.vocab \
digit.train.olddesc -a wrdalign_nnet.28
Files are output using base name temp
Starting with file number 1
Found 91 files to work with.
-----
#2: NU-23.zipcode.wav
Correct: one oh oh oh three
Recognized: one oh oh oh three
//goosnargh/speech/corpora/CSLU/numbers/speechfiles/0/NU-23.zipcode.wav
//goosnargh/speech/corpora/CSLU/numbers/phnfiles/0/NU-23.zipcode.phn
-----
? -e
Searching for next error...
-----
#23: NU-10193.streetaddr.wav
Correct: ### FOUR one three four
Recognized: TWO OH one three four
//goosnargh/speech/corpora/CSLU/numbers/speechfiles/101/NU-10193.streetaddr.wav
.phn file does not exist
-----
?
In a separate DOS or unix window, while browse.tcl is still running, start the program SpeechView in order to display the waveform, spectrogram, and results of recognition.
speechview.tcl -update -Wf temp.wav -S -Lf temp.phn -Lf temp.cat -Lf temp.wrd
Here, speechview is told to update the contents of the display whenever the contents of one of the files changes. It will create a waveform display of temp.wav, a spectrogram display of that waveform, a label display of the hand-labeled data (if it exists), a label display of the categories that were recognized, and a label display of the words that were recognized. (The speechview program is part of the CSLU Toolkit.)
Now, in the window with the browse.tcl script, you can search for errors in
recognition, and the results should be displayed automatically in the SpeechView
program.
[Step 18] Now we have finished the first cycle of training. If we are happy with the level of performance on the development set, we can stop the training process and evaluate on the test set (Step 26). If we do skip to the evaluation step, however, we may not re-train afterwards if we are unhappy with the test-set results, because the test set would then no longer give an unbiased measure of performance. If we want to try to improve performance on the development set, we can do another cycle of training using force-aligned data. We can create another info file for doing forced alignment, using the training file as a template. This new file will be called digit.trainfa.info:
copy digit.train.info digit.trainfa.info
edit digit.trainfa.info
type digit.trainfa.info
basename: digit;
partition: trainfa;
vector_size: 130;
corpus: name: numbers
cat_path: /tutorial/digit/numbers_trainfa
require: wt
force_cat: "/tutorial/src/force.tcl nnet.28 digit.vocab
digit.train.olddesc TXT WAV c OUT"
partition: "{expr $ID % 5} {0 1 2}"
filter: 1+1
vocab: digit.vocab
want: 200;
(Note that in the "force_cat:" field, the script and associated parameters are specified on two lines. No special marker (such as a backslash) is required.)
Note that we have changed the partition name (to "trainfa") and the path for category files (to "numbers_trainfa"). Also, by specifying "require: wt", we will now require the existence of .wav files and .txt files but not .phn files (because we will create labels from the text transcriptions using forced alignment). We also add a new field to the corpus description, indicating that we want to do forced alignment and create labels at the category level (as opposed to the phone or word level). Also note that we will do forced alignment using iteration 28 from the training we just finished, since iteration 28 had the best word-level performance. Because we are doing forced alignment, it is no longer necessary to use the remapping script that re-maps labels created by hand to the set of labels used by our recognizer.
[Step 19] Now we once again find the files we want to use for training by running find_files.tcl, and then we generate category-level time-aligned labels by running gen_catfiles.tcl. As part of the process of creating category-level label files, we also automatically create new dur and counts files. Finally, we update the duration limits in the desc and olddesc files with the new information in the dur and counts files.
find_files.tcl digit.trainfa.info corpora
gen_catfiles.tcl digit.trainfa.info digit.train.desc corpora digit.trainfa.dur \
digit.trainfa.counts
update_descdur.tcl digit.trainfa.dur digit.train.desc digit.train.olddesc \
digit.trainfa.desc digit.trainfa.olddesc
[Step 20]
Then, we repeat the training steps to train and select the best force-aligned network:
pickframes.tcl digit.trainfa.info corpora digit.trainfa.pick
genvec.tcl digit.trainfa.info digit.trainfa.olddesc digit.trainfa.pick \
digit.trainfa.vec
checkvec.exe digit.trainfa.vec
nntrain.exe -f fa -l -sn 88 -sv 88 -a 3 130 200 161 30 digit.trainfa.vec
find_best.tcl fa digit.dev.numbers.files digit.vocab \
digit.trainfa.olddesc fa.summary -b 15 -g 10
Note that the weights files have the basename "fa". The results from find_best are:
Itr  #Snt  #Words  Sub%   Ins%   Del%   WrdAcc%  SntCorr
15   91    408     3.19%  0.98%  1.47%  94.36%   79.12%
16   91    408     2.70%  0.98%  1.23%  95.10%   79.12%
17   91    408     2.94%  1.23%  1.23%  94.61%   78.02%
18   91    408     2.70%  0.98%  1.47%  94.85%   79.12%
19   91    408     2.94%  1.23%  0.74%  95.10%   81.32%
20   91    408     2.94%  1.23%  0.98%  94.85%   81.32%
21   91    408     2.21%  1.23%  0.98%  95.59%   82.42%
22   91    408     2.21%  0.98%  0.98%  95.83%   84.62%
23   91    408     2.94%  0.98%  1.23%  94.85%   80.22%
24   91    408     2.45%  1.23%  0.74%  95.59%   83.52%
25   91    408     3.19%  0.74%  1.23%  94.85%   81.32%
26   91    408     1.96%  0.98%  0.98%  96.08%   83.52%
27   91    408     2.45%  1.23%  1.23%  95.10%   81.32%
28   91    408     2.21%  0.74%  0.74%  96.32%   84.62%
29   91    408     2.45%  1.23%  0.98%  95.34%   82.42%
30   91    408     2.70%  0.74%  0.98%  95.59%   82.42%
Best results (96.32, 84.62) with network fa.28
Evaluated 16 networks using 91 files
[Step 21]
To create a network using the forward-backward method, we create a vector file using hmm_embed.tcl. First, however, we will create a new info file, specifying that we will deal with forward-backward training (partition name "trainfb"), that we want to use the labels generated from forced alignment, and that we want to use all available samples for training.
copy digit.trainfa.info digit.trainfb.info
copy digit.trainfa.numbers.files digit.trainfb.numbers.files
copy digit.trainfa.desc digit.trainfb.desc
copy digit.trainfa.olddesc digit.trainfb.olddesc
If a master label file (MLF) is being used, we should also do the following:
copy digit.trainfa.mlf digit.trainfb.mlf
edit digit.trainfb.info
type digit.trainfb.info
basename: digit;
partition: trainfb;
vector_size: 130;
corpus: name: numbers
cat_path: /tutorial/digit/numbers_trainfb
require: wt
partition: "{expr $ID % 5} {0 1 2}"
filter: 1+1
vocab: digit.vocab
want: ALL;
pickframes.tcl digit.trainfb.info corpora digit.trainfb.pick
Basename: digit
Partition: trainfb
Corpus: numbers
cat_ext: cat
txt_ext: txt
partition: {expr $ID % 5} {0 1 2}
cat_path: /tutorial/digit/numbers_trainfb
txt_path: /tutorial/data/txtfiles
phn_ext: phn
filter: 1+1
wav_ext: wav
cull_file: /tutorial/digit/numbers.cull5
name: numbers
phn_path: /tutorial/data/phnfiles
want: 1000000000
wav_path: /tutorial/data/speechfiles
vocab: digit.vocab
require: wt
digit.trainfb.numbers.files, 1000000000, 0
Picking frames for digit.trainfb.numbers.files...
{{$lab<&} {0 498}} {&>n {1 482}} {<.pau> {2 11160}}
hmm_embed.tcl digit 0 fa.28 digit.trainfb.info corpora digit.trainfb.pick \
digit.trainfb1.vec digit.trainfb.olddesc
Basename: digit
Partition: trainfb
Corpus: numbers
cat_ext: cat
txt_ext: txt
partition: {expr $ID % 5} {0 1 2}
cat_path: /tutorial/digit/numbers_trainfb
txt_path: /tutorial/data/txtfiles
phn_ext: phn
filter: 1+1
wav_ext: wav
cull_file: /tutorial/digit/numbers.cull5
name: numbers
phn_path: /tutorial/data/phnfiles
want: 1000000000
wav_path: /tutorial/data/speechfiles
vocab: digit.vocab
require: wt
prune: 300.0
minmodel: 10.0
mincount: 0
/tutorial/digit/numbers_trainfb/0/NU-25.zipcode.cat
Utterance prob per frame: -2.104426
/tutorial/digit/numbers_trainfb/0/NU-30.zipcode.cat
Utterance prob per frame: -1.293738
/tutorial/digit/numbers_trainfb/0/NU-46.streetaddr.cat
Utterance prob per frame: -1.190075
(etc)
[Step 22]
Then we check the file for errors using hnncheckvec.exe. This time, the header size of the vector file is 8 bytes, and the vector size is 544 bytes (130 features x 4 bytes + 3 targets x 4 bytes + 3 target values x 4 bytes):
hnncheckvec.exe digit.trainfb1.vec
[Step 23]
Next, we train on this vector file using hnntrain.exe:
hnntrain.exe -l -sn 88 -sv 88 -f fb1 -a 3 130 200 161 30 digit.trainfb1.vec
creating net with seed 88
numvec: 93752
numactive:3
(etc)
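The byte counts given in Step 22 for the forward-backward vector file can be turned into a quick sanity check on the file size. The Python sketch below uses only those numbers (an 8-byte header, and 4 bytes for each feature, each target, and each target value); the function names are hypothetical, and the assumption that the file holds exactly numvec records back to back after the header is mine.

```python
HEADER_BYTES = 8
BYTES_PER_VALUE = 4

def vector_bytes(num_features=130, num_active=3):
    """Bytes per vector: features + target indices + target values."""
    return (num_features + num_active + num_active) * BYTES_PER_VALUE

def expected_file_size(num_vectors, num_features=130, num_active=3):
    """Header plus num_vectors fixed-size records (assumed layout)."""
    return HEADER_BYTES + num_vectors * vector_bytes(num_features, num_active)

def vectors_in_file(file_size, num_features=130, num_active=3):
    """Invert expected_file_size; fail if the size does not fit the layout."""
    record = vector_bytes(num_features, num_active)
    payload = file_size - HEADER_BYTES
    if payload < 0 or payload % record != 0:
        raise ValueError("file size does not match the expected record layout")
    return payload // record

size = expected_file_size(93752)   # numvec reported by hnntrain.exe
print(vector_bytes(), size, vectors_in_file(size))
```

A file whose size is not the header plus a whole number of 544-byte records would be a sign that the vector file was truncated or written with different settings.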
[Step 24]
Finally, we select the best iteration using find_best.tcl:
find_best.tcl fb1 digit.dev.numbers.files digit.vocab \
digit.trainfb.olddesc fb1.summary -b 15 -g 10
Itr  #Snt  #Words  Sub%   Ins%   Del%   WrdAcc%  SntCorr
15   91    408     2.94%  1.23%  0.98%  94.85%   81.32%
16   91    408     2.21%  0.98%  1.23%  95.59%   82.42%
17   91    408     2.45%  0.98%  1.47%  95.10%   82.42%
18   91    408     1.96%  0.98%  0.98%  96.08%   84.62%
19   91    408     1.96%  0.98%  0.74%  96.32%   85.71%
20   91    408     2.45%  0.98%  0.98%  95.59%   82.42%
21   91    408     2.21%  0.98%  1.47%  95.34%   81.32%
22   91    408     1.96%  0.74%  0.98%  96.32%   85.71%
23   91    408     1.96%  0.98%  0.98%  96.08%   83.52%
24   91    408     2.21%  0.98%  1.47%  95.34%   82.42%
25   91    408     2.21%  0.98%  0.74%  96.08%   85.71%
26   91    408     2.70%  0.98%  1.47%  94.85%   81.32%
27   91    408     2.21%  0.98%  0.98%  95.83%   82.42%
28   91    408     2.94%  1.23%  1.23%  94.61%   79.12%
29   91    408     2.45%  0.98%  0.98%  95.59%   82.42%
30   91    408     1.72%  0.98%  1.23%  96.08%   84.62%
Best results (96.32, 85.71) with network fb1.22
Evaluated 16 networks using 91 files
Although the word-level accuracy is the same as with force-aligned training, the sentence-level accuracy is higher (85.71% versus 84.62%), which is about a 7% relative reduction in sentence-level errors.
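One way to arrive at the figure of 7% is to compare error rates rather than accuracies. The Python sketch below computes the relative reduction in sentence-level error between the best force-aligned network (84.62% sentences correct) and the best forward-backward network (85.71%). Reading the comparison as a relative error reduction is an interpretation, and the function name is hypothetical.

```python
def relative_error_reduction(old_accuracy, new_accuracy):
    """Relative reduction in error (percent), given two accuracies in percent."""
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (old_error - new_error) / old_error

sentence_gain = relative_error_reduction(84.62, 85.71)
word_gain = relative_error_reduction(96.32, 96.32)
print(round(sentence_gain, 1), round(word_gain, 1))
```

On a development set of 91 sentences, this difference corresponds to a single additional sentence recognized correctly, so it should be interpreted with some caution.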
[Step 25] We then repeat the cycle of forward-backward training one more time:
hmm_embed.tcl digit 1 fb1.22 digit.trainfb.info corpora digit.trainfb.pick \
digit.trainfb2.vec digit.trainfb.olddesc
hnncheckvec.exe digit.trainfb2.vec
hnntrain.exe -l -sn 88 -sv 88 -f fb2 -a 3 130 200 161 30 digit.trainfb2.vec
find_best.tcl fb2 digit.dev.numbers.files digit.vocab \
digit.trainfb.olddesc fb2.summary -b 15 -g 10
Itr  #Snt  #Words  Sub%   Ins%   Del%   WrdAcc%  SntCorr
15   91    408     2.21%  1.23%  1.23%  95.34%   81.32%
16   91    408     1.47%  0.98%  1.23%  96.32%   84.62%
17   91    408     2.45%  1.23%  1.47%  94.85%   82.42%
18   91    408     2.45%  0.98%  1.23%  95.34%   82.42%
19   91    408     2.45%  0.98%  1.23%  95.34%   82.42%
20   91    408     2.45%  0.74%  1.23%  95.59%   83.52%
21   91    408     1.96%  0.98%  1.47%  95.59%   82.42%
22   91    408     2.21%  0.98%  1.23%  95.59%   83.52%
23   91    408     1.72%  0.98%  1.23%  96.08%   83.52%
24   91    408     2.21%  0.98%  1.23%  95.59%   84.62%
25   91    408     1.72%  0.98%  0.98%  96.32%   85.71%
26   91    408     2.45%  0.98%  1.23%  95.34%   82.42%
27   91    408     1.96%  0.74%  1.47%  95.83%   83.52%
28   91    408     1.96%  1.23%  1.23%  95.59%   82.42%
29   91    408     1.96%  1.23%  1.23%  95.59%   83.52%
30   91    408     2.45%  0.98%  1.23%  95.34%   83.52%
Best results (96.32, 85.71) with network fb2.25
Evaluated 16 networks using 91 files
In this case, the two forward-backward networks have the same performance, so further training will probably not yield better results. Usually, two cycles of forward-backward training are enough.
[Step 26] The resulting network is the final network. The last step is to evaluate this network on the test set:
find_best.tcl fb2 digit.test.numbers.files digit.vocab \
digit.trainfb.olddesc test.summary -o 25 -g 10
In the following file formats, text in fixed-font bold is a keyword
that must be used verbatim. Italicized items in brackets <> must be substituted with
the proper values.
wav file
A wav file contains the speech waveform that is to be trained on or recognized. The format
for wav files in the CSLU Toolkit is (unfortunately, sometimes) not the Microsoft
.wav format; it is the NIST Sphere ulaw format. This format is described at http://vision1.cs.umr.edu/~johns/links/music/audiofile2.html
and there is software available on the WWW for converting waveform files into different
formats.
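For readers who want to inspect a wav file before converting it, the header of a NIST Sphere file is plain ASCII and can be read with a few lines of code. The Python sketch below relies on the commonly documented Sphere layout (a first line "NIST_1A", a second line giving the header size in bytes, then "name -type value" lines terminated by "end_head"); it does not use any CSLU Toolkit function, the function name is hypothetical, and the sample header is made up for the example.

```python
def read_sphere_header(data):
    """Parse the ASCII header of a NIST Sphere file held in `data` (bytes)."""
    first = data[:16].decode("ascii", errors="replace").split("\n")
    if len(first) < 2 or first[0].strip() != "NIST_1A":
        raise ValueError("not a NIST Sphere file")
    header_size = int(first[1].strip())
    fields = {}
    header_text = data[:header_size].decode("ascii", errors="replace")
    for raw in header_text.split("\n")[2:]:
        line = raw.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue                      # blank or malformed line
        name, field_type, value = parts
        if field_type == "-i":
            fields[name] = int(value)     # integer field
        elif field_type == "-r":
            fields[name] = float(value)   # real-valued field
        else:
            fields[name] = value          # string field, e.g. -s4
    return header_size, fields

# Made-up header followed by a few bytes standing in for audio samples.
sample = (b"NIST_1A\n   1024\n"
          b"channel_count -i 1\n"
          b"sample_rate -i 8000\n"
          b"sample_coding -s4 ulaw\n"
          b"end_head\n").ljust(1024) + b"\x00" * 16

header_size, fields = read_sphere_header(sample)
print(header_size, fields)
```

The audio samples start at byte offset header_size, so a file that fails the "NIST_1A" check is probably in some other format, such as Microsoft .wav.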
txt file
A txt file contains a text transcription of the words in a speech waveform. This file is
simply an ASCII file containing the words separated by spaces, and it can be created by
any text editor that outputs ordinary .txt files.
label files (.phn, .cat, .wrd)
Label files, which usually have the extension .phn, .cat, or .wrd, contain time-aligned
labels of a waveform utterance. If the file has the extension .phn, then the labels are
phonetic labels; if the file has the .cat extension, then the labels are neural-network
output categories (context-dependent sub-phone units); and if the file has the extension
.wrd, then the labels are words. A label file has the following format:
MillisecondsPerFrame: <value>
END OF HEADER
<begin_time_1> <end_time_1> <label_1>
<begin_time_2> <end_time_2> <label_2>
...
<begin_time_n> <end_time_n> <label_n>
where:
The values for <begin_time> and <end_time> are measured in
frames (so if <value> is 1.0, then time is measured in milliseconds; if <value>
is 10.0, then time is measured in centi-seconds). The <end_time> of one label
is usually the same as the <begin_time> of the next label.
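A label file in this format is easy to read programmatically. The following Python sketch parses the header and the label lines and converts frame numbers to milliseconds using MillisecondsPerFrame. It is an illustration only; the function name parse_label_file and the sample labels are hypothetical, and real label files may contain header lines that this sketch simply skips.

```python
def parse_label_file(text):
    """Return a list of (begin_ms, end_ms, label) from label-file text."""
    lines = [line.strip() for line in text.splitlines()]
    ms_per_frame = None
    body_start = None
    for index, line in enumerate(lines):
        if line.startswith("MillisecondsPerFrame:"):
            ms_per_frame = float(line.split(":", 1)[1])
        elif line == "END OF HEADER":
            body_start = index + 1
            break
    if ms_per_frame is None or body_start is None:
        raise ValueError("missing MillisecondsPerFrame or END OF HEADER")
    labels = []
    for line in lines[body_start:]:
        if not line:
            continue
        begin, end, label = line.split(None, 2)
        # Times in the file are in frames; convert them to milliseconds.
        labels.append((float(begin) * ms_per_frame,
                       float(end) * ms_per_frame,
                       label))
    return labels

sample = """MillisecondsPerFrame: 10.0
END OF HEADER
0 12 .pau
12 21 w
21 30 ^
"""
labels = parse_label_file(sample)
for begin_ms, end_ms, label in labels:
    print(begin_ms, end_ms, label)
```

With MillisecondsPerFrame set to 10.0, the label that spans frames 12 to 21 covers 120 to 210 milliseconds of the waveform.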
The corpora file contains descriptions of all corpora:
<corpus1 description>
<corpus2 description>
<corpus3 description>
...
where a <corpus description> has the following format:
corpus: <corpus_name>
wav_path <path_to_wav_files>
phn_path <path_to_phn_files>
txt_path <path_to_txt_files>
format <regular_expression_for_parsing_filenames>
wav_ext <extension_for_wav_files>
phn_ext <extension_for_phn_files>
txt_ext <extension_for_txt_files>
cat_ext <extension_for_cat_files>
cull_file <location_of_cull_file>
ID: <Tcl_code_for_determining_caller_ID>
where: