Training Neural Networks for Speech Recognition
John-Paul Hosom, Ron Cole, Mark Fanty, Johan Schalkwyk, Yonghong
Yan, Wei Wei
Center for Spoken Language Understanding (CSLU)
Oregon Graduate Institute of Science and Technology
January 29, 1998
Second Draft
Contents
1. Introduction
2. General Concepts and Notation
2.1 Quick Review of Frame-Based Speech Recognition
2.2 Specifying Categories
2.3 Example of Specifying Categories
2.4 Finding Samples to Train On
2.4.1 Overfitting and Datasets
2.4.2 Filtering
2.4.3 Finding Categories
2.4.4 Number of Samples per Category
2.5 Training the Network
2.5.1 Generating Data
2.5.2 Shuffling Data
2.5.3 Number of Hidden Nodes
2.5.4 Negative Penalty
2.5.5 Number of Training Iterations
2.5.6 Re-Training on Force-Aligned Data
2.5.7 Forward-Backward (Embedded) Training
2.6 Evaluation
2.6.1 Word-Level Evaluation
2.6.2 Choosing the Best Iteration
2.6.3 Testing
3. Overall Procedure
3.1 Create Descriptions
3.2 Find Data
3.3 Select Data for Training
3.4 Train and Evaluate
3.5 Re-Train
3.6 Evaluate Test Set
4. Complete Example
5. File Formats
wav files
txt files
label files
corpora file
cull file
info file
vocab file
parts file
desc file
olddesc file
files file
dur file
counts file
pick file
vec, svec files
neural-network files
summary file
ali files
6. Script and Program Usage
browse.tcl
categories.tcl
checkvec.exe
cull5.tcl
find_best.tcl
find_dur.tcl
find_files.tcl
force.tcl
genvec.tcl
gen_catfiles.tcl
hmm_embed.tcl
hnncheckvec.exe
hnntrain.exe
hscript.exe
nntrain.exe
pickframes.tcl
recog.tcl
recog_cslushdigit.tcl
remap_genpur.tcl
revise_desc.tcl
shuffle.exe
update_descdur.tcl
1. Introduction
This tutorial describes the method used at CSLU for creating neural-network-based
speech recognizers. Included in this tutorial are some general concepts
behind training a recognizer, step-by-step instructions on how to train
a recognizer, and a description of Tcl scripts that can be used to automate
parts of this process.
In order to use the scripts mentioned in this tutorial, you must have
the CSLU Toolkit installed on your machine. Make sure that your path includes
the location of the Toolkit's stand-alone executable files (usually located
in the "bin" directory, for example C:\CSLU\Toolkit\2.0\bin) as well as
the scripts used for training (usually located in the "script\training_1.0"
directory, for example C:\CSLU\Toolkit\2.0\script\training_1.0). In order
to follow the example provided in this tutorial, you may want to use the
same data files. These files are located at http://www.cse.ogi.edu/CSLU/toolkit/documentation/userguide/nnet_training/CSLUexamplefiles.zip.
The size of this compressed "zip" file is 7.6MB, and the size of all the
data files is about 10MB. The CSLU Toolkit and corpora are free of charge
for non-profit use (universities, high schools, and individuals may download
the Toolkit at no charge). For more information on the Toolkit and CSLU
corpora, visit our WWW site at http://www.cse.ogi.edu/CSLU.
In this document, phonetic symbols are represented using Worldbet,
which is an ASCII encoding of the International Phonetic Alphabet (IPA)
[J. Hieronymus, 1995].
In this tutorial, the word phone is used extensively; according
to Webster's Dictionary,
a phone is "a speech sound considered as a physical event without regard
to its place in the sound system of a language." So, in this tutorial,
the word phone is used to refer to the phonetic events that we want
to classify, whether or not they correspond to phonemes in the language.
Finally, please send all questions, comments, and bug reports to
hosom@cse.ogi.edu.
2. General Concepts and Notation
The general steps to creating a neural-network based recognizer are:
- Specify the phonetic categories that the network will recognize.
- Find many samples of each of these categories in the speech data.
- Train a network to recognize these categories.
- Evaluate the network performance using a test set.
2.1 Quick Review of Frame-Based Speech Recognition
Frame-based speech recognition has the following five steps, illustrated
in Figure 1:
Figure 1. Overview of Frame-Based Speech Recognition using Neural Networks.
- Divide the waveform into frames, where each frame is a small
segment of speech that contains an equal number of waveform samples. In
this tutorial, we will assume a frame size of 10 msec.
- Compute features for each frame. These features usually describe the spectral
envelope of the speech at that frame and at a small number of surrounding
frames.
- Classify the features in each frame into phonetic-based categories using
a neural network. The outputs of the neural network are used as estimates
of the probability, for each phonetic category, that the current frame
contains that category.
- Use the matrix of probabilities and a set of pronunciation models to determine
the most likely word(s). Searching is done with a Viterbi search.
- Determine the confidence in the most likely result, and either accept it
as the answer or reject the waveform as containing a word not in the vocabulary.
For a more detailed explanation, see the on-line tutorial at http://www.cse.ogi.edu/CSLU/toolkit/documentation/userguide/nnet_recog/recog.html
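As a rough illustration of the first step, the sketch below divides a waveform into 10 msec frames. The function name `split_into_frames` and the 8 kHz sampling rate are assumptions made for this example and are not part of the Toolkit; a trailing partial frame is simply dropped.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_msec=10):
    """Split a list of waveform samples into equal-sized frames.

    sample_rate_hz=8000 is an assumed value (typical of telephone speech);
    frame_msec=10 matches the frame size assumed in this tutorial.
    """
    frame_len = sample_rate_hz * frame_msec // 1000  # samples per frame
    n_frames = len(samples) // frame_len             # drop any partial frame
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Example: 250 samples at 8 kHz give three full frames of 80 samples each.
frames = split_into_frames(list(range(250)))
print(len(frames), len(frames[0]))
```

Features would then be computed separately for each of the returned frames.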
2.2 Specifying Categories
In order to determine the categories that the network will classify, the
following three things need to be done:
- The designer of the recognizer needs to determine the pronunciations for
each of the words that will be recognized. More accurate pronunciation
models will generally yield better recognition rates.
- Quite often, we also use context-dependent phone models, which means that
one phone is classified differently depending on the phones that surround
it (for example, an /aI/ following a /w/ is classified differently from
an /aI/ following an /h/). The surrounding context may contain a group
of phones or just a single phone. (Using groups of phones reduces the number
of categories that need to be classified.) The grouping of phones into
clusters of similar phones must be done by the person designing the recognizer.
- Finally, when constructing context-dependent phone models, we divide each
phone to be recognized into one, two, or three parts. Each sub-phone segment
corresponds to one category to be recognized. If we keep a phone as one
part, then it is used without the context of surrounding phones. If we
divide it into two parts, then the left half of the phone model (the left
sub-phone) is dependent on the preceding phone, and the right half of the
phone model (the right sub-phone) is dependent on the following phone.
If the phone is split into three parts, then the first third is dependent
on the preceding phone, the middle third is independent of surrounding
phones, and the last third is dependent on the following phone. One final
option is to keep the phone as one part, but make it dependent on the following
phone; this is called a right-dependent phone and it is used mostly for
stop consonants. The designer of the recognizer needs to decide how many
parts each phone will be split into.
Figure 2 shows an illustration of this kind of context-dependent modeling.
In this figure, an example is given for the modeling of the word "yes",
written in Worldbet as /j E s/. Here, the /j/ is split into two parts,
the /E/ is split into three parts, and the /s/ is split into two parts.
There are eight groups of phones used for contexts; each group represents
a broad category of sounds. For the vowel /E/ in a general-purpose recognizer,
there are eight categories for the left third, one category for the middle
third, and eight categories for the right third, yielding a total of 17
categories for this 3-part phone. The /j/, on the other hand, would have
16 categories in a general-purpose recognizer, because it is split into
only left and right halves.
Figure 2. Context-Dependent Modeling
The context-dependent phonetic categories that the network will be trained
on can be determined from the phonetic-level pronunciation models, the
groupings of phones into clusters of similar phones, and the number of
parts to split each phone into.
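The category counts quoted above (17 for a three-part phone and 16 for a two-part phone when eight context groups are used) follow from simple arithmetic. The sketch below makes that arithmetic explicit; the function name `count_categories` is hypothetical, and the code assumes that every context group can occur on either side of the phone.

```python
def count_categories(parts, n_left_groups, n_right_groups):
    """Upper bound on the number of categories for one phone.

    parts is "1", "2", "3", or "r" (right-dependent), as used in this tutorial.
    """
    if parts == "1":   # context-independent: a single category
        return 1
    if parts == "2":   # left half per left context, right half per right context
        return n_left_groups + n_right_groups
    if parts == "3":   # as for "2", plus one context-independent middle third
        return n_left_groups + 1 + n_right_groups
    if parts == "r":   # one part, dependent on the following phone only
        return n_right_groups
    raise ValueError("unknown parts value: %r" % (parts,))


print(count_categories("3", 8, 8))  # the vowel /E/ in the example above
print(count_categories("2", 8, 8))  # the /j/ in the example above
```

In a small-vocabulary recognizer the actual number is usually lower, because only the contexts that occur in the pronunciation models are needed.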
2.3 Example of Specifying Categories
To give an example of how these three items can be determined, we'll use
the example of recognizing the isolated words "three", "tea", "zero", and
"five".
First, we can come up with some initial pronunciations:
| word | pronunciation |
| three | T 9r i: |
| tea | tc th i: |
| zero | z i: 9r oU |
| five | f aI v |
We may want to modify these pronunciations, because the /i:/ in "zero"
is often pronounced differently from the /i:/ in "three" and "tea". To
account for this difference in pronunciation, we can use our own symbol,
/i:_x/, to represent the front vowel in "zero". Making this change gives
us the following pronunciation models:
| word | pronunciation |
| three | T 9r i: |
| tea | tc th i: |
| zero | z i:_x 9r oU |
| five | f aI v |
Next, we will determine the number of parts to use for each phone. In the
table below, "1" means that the phone will be context-independent, "2"
means that the phone will be split into two parts, "3" means that the phone
will be split into three parts, and "r" means that the phone will be "right-dependent":
| phone | parts |
| T | 1 |
| 9r | 2 |
| i: | 3 |
| tc | 1 |
| th | r |
| z | 1 |
| i:_x | 2 |
| oU | 3 |
| f | 1 |
| aI | 3 |
| v | 1 |
Now, let's look at the /i:/ in "three" and "tea". In this case, the vowel
/i:/ is the same, but it looks very different when it follows a /9r/ compared
to when it follows a /th/ (see Figure 3).
Figure 3. Example of vowel /i:/ in different contexts.
In this case, we make the left third of the /i:/ (since it is split
into three parts) dependent on a preceding retroflex (/9r/) in one case
and dependent on a preceding alveolar sound (/th/ or /z/) in the other
case. We usually group the phones in a left or right context according
to their broad phonetic category; for example, the following groupings
can be used (the dollar sign indicates a variable that represents the group
of listed phones):
| group | phones in group | description |
| $bck | oU | back vowels |
| $fnt | i: i:_x | front vowels |
| $ret | 9r | retroflex sounds |
| $den | T v th z | dentals, labiodentals, and alveolars |
| $sil | .pau tc | silence or closure |
But notice that it then becomes difficult to classify diphthongs such as
/aI/, because the phone starts as a back vowel and ends as a front vowel.
The current solution is to modify the categories in the following way:
| group | phones in group | description |
| $bck_l | oU | back vowels to the left of a phone |
| $bck_r | oU aI | back vowels to the right of a phone |
| $fnt_l | i: i:_x aI | front vowels to the left of a phone |
| $fnt_r | i: i:_x | front vowels to the right of a phone |
| $ret | 9r | retroflex sounds |
| $den | T v th z | dentals, labiodentals, and alveolars |
| $sil | .pau tc | silence or closure |
First, we have added "_l" and "_r" to the variable names in question,
to indicate whether the phones in this grouping occur on the left or right
side of the phone being classified. Then, because /aI/ looks like a back
vowel when it appears to the right of a phone, it has been put in the grouping
$bck_r; because /aI/ looks like a front vowel when it appears to the left
of a phone, /aI/ has also been put in the grouping $fnt_l. This method
of grouping into left or right contexts is illustrated in Figure 4:
Figure 4. Illustration of labeling a diphthong in the word "five".
The format for specifying different categories is [left_context]<phone>[right_context],
so for example the category for /.pau/ will be <.pau>, the category
for the left third of /i:/ in the context of dental sounds will be $den<i:,
the middle third of /i:/ will be <i:>, and the right third of /i:/ in
the context of silence will be i:>$sil.
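The naming convention just described can be written as a small helper. This is only a sketch of the notation; the function name `category_name` is hypothetical and is not one of the Toolkit scripts.

```python
def category_name(phone, left_context=None, right_context=None):
    """Build a category name in the [left_context]<phone>[right_context] notation.

    A sub-phone depends on at most one side, so passing both contexts is an error.
    """
    if left_context and right_context:
        raise ValueError("a category depends on the left or the right context, not both")
    if left_context:                      # left part of a phone
        return "%s<%s" % (left_context, phone)
    if right_context:                     # right part, or a right-dependent phone
        return "%s>%s" % (phone, right_context)
    return "<%s>" % phone                 # context-independent part


print(category_name(".pau"))                      # <.pau>
print(category_name("i:", left_context="$den"))   # $den<i:
print(category_name("i:"))                        # <i:>
print(category_name("i:", right_context="$sil"))  # i:>$sil
```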
Given all this information, it can easily (if tediously) be determined
that the 28 categories we need to train on are:
| <.pau> | $den<9r | $fnt_l<9r | 9r>$bck_r | 9r>$fnt_r |
| <T> | f<aI | <aI> | aI>$den | <f> |
| $den<i: | $ret<i: | <i:> | i:>$den | i:>$sil |
| i:>f | $den<i:_x | <i:_x> | i:_x>$ret | $ret<oU |
| <oU> | oU>$den | oU>$sil | oU>f | <tc> |
| th>$fnt_r | <v> | <z> | | |
In the following sections, a Tcl script called "categories.tcl"
is described; this script can be used to automate the process of determining
categories.
2.4 Finding Samples to Train On
2.4.1 Overfitting and Datasets
As we train a network, we keep adjusting the neural network weights
to minimize the error in our training data. For each adjustment of the
weights, we have a new iteration (or epoch) in the training process. We
can keep generating new iterations until the error no longer decreases.
At this point, we have learned the training data to the extent that it
is possible.
However, when we train a neural network, we aren't interested in learning
the training data. Instead, we are interested in learning the general
properties of the training data. By learning the general properties
of the data instead of the details that are specific to the training data,
we are best able to classify a new utterance not in the training set.
In order to determine which iteration of network weights has best learned
the general properties of the data, we use a separate (usually smaller)
set of data to evaluate each iteration. This second set of data is called
the "development" set (or cross-validation set). Because this development
set has not been used to adjust the network weights during training, it
can be used to evaluate the network's ability to recognize phonetic categories,
as opposed to (possibly irrelevant) details in the training set. The larger
this development set is, the more confidence we can have in the general
classification properties of the network.
Once we have determined the best network, we need to evaluate its performance
on a test set. In order to have an honest evaluation, the data in the test
set must not occur in either the training set or the development set.
This means that given a corpus containing our target words, we must
divide it into at least three parts: one part for training, one for development,
and one for testing. If we have a large enough corpus, we may further divide
the development set into subsets, so that as we evaluate and make modifications
to our recognizer, we are not tuning performance to one set of development
data.
Finally, at CSLU we leave 5% of our target corpus for independent, third-party
evaluation. This 5% is culled from the entire corpus before dividing into
training, development, or test sets.
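The partitioning described in this section can be sketched as follows. The 5% cull comes from the text above; the function name `partition_corpus`, the 20% development and 20% test fractions, and the random selection are assumptions for illustration only (the Toolkit's own scripts may select files differently).

```python
import random


def partition_corpus(files, cull_fraction=0.05, dev_fraction=0.2,
                     test_fraction=0.2, seed=0):
    """Split a list of file names into cull, train, dev, and test sets.

    The cull set is removed first; the remaining files are then divided.
    The dev/test fractions are illustrative assumptions, not CSLU values.
    """
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = sorted(files)
    rng.shuffle(shuffled)

    n_cull = round(len(shuffled) * cull_fraction)
    culled, rest = shuffled[:n_cull], shuffled[n_cull:]

    n_dev = round(len(rest) * dev_fraction)
    n_test = round(len(rest) * test_fraction)
    return {
        "cull": culled,
        "dev": rest[:n_dev],
        "test": rest[n_dev:n_dev + n_test],
        "train": rest[n_dev + n_test:],
    }


sets = partition_corpus(["utt%03d.wav" % i for i in range(100)])
print({name: len(members) for name, members in sets.items()})
```

Because every file lands in exactly one set, no test utterance can leak into training or development.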
2.4.2 Filtering
When selecting data for training, development, and testing, we can
apply various filters to reduce the amount of data. In one case, we may
have utterances in our corpus that don't occur in our target vocabulary.
In this case, we may want to filter so that words not in our vocabulary
list are not included in our datasets. For example, if we are training
a digits recognizer and we are using the CSLU Numbers corpus for training,
we may want to remove out-of-vocabulary utterances that contain numbers
such as "first", "twelve", and "fifty". In another case, we may have so
much data that training or evaluation would take too long. In this case,
we can filter so that we take every Nth utterance for use in our
datasets, where N is some integer greater than 1. For example, we
may want to take every sixth waveform for training our digits recognizer,
because there are over 6000 waveforms available for training on digits.
Filtering in this way will still leave over 1000 waveforms (or approximately
5000 examples of each digit) available for training.
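Both filters can be expressed in a few lines. In this sketch the function name `filter_utterances`, the (name, words) representation of an utterance, and the file names are all hypothetical; the point is only the order of operations: remove out-of-vocabulary utterances first, then keep every Nth one.

```python
def filter_utterances(utterances, vocabulary, every_nth=1):
    """Keep in-vocabulary utterances, then take every Nth of those.

    utterances is a list of (name, list_of_words) pairs; this layout is an
    assumption made for the example.
    """
    in_vocab = [(name, words) for name, words in utterances
                if all(word in vocabulary for word in words)]
    return in_vocab[::every_nth]


digits = {"zero", "oh", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"}
corpus = [
    ("u1", ["one", "two"]),
    ("u2", ["twelve"]),          # out of vocabulary, removed
    ("u3", ["three"]),
    ("u4", ["four", "first"]),   # out of vocabulary, removed
    ("u5", ["five"]),
    ("u6", ["six"]),
]
print([name for name, _ in filter_utterances(corpus, digits, every_nth=2)])
```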
2.4.3 Finding Categories
Once we know which files we'll be using for training, we need to find
samples of each category that we'll train on. This can be done in one of
two ways: using data that has been hand-labeled at the phonetic level,
or using forced alignment.
Hand-Labeled Data
Many corpora at OGI have been labeled with time information at the
phonetic level by professional labelers. If training is to be done on this
hand-labeled data, then the labels must be re-mapped from the phonetic
level to the (context-dependent) category level. For example, a hand-labeled
file for the isolated digit "three" might contain this information:
0 53 .pau
53 113 T
113 170 9r
170 229 i:
229 273 .pau
where the first item is the start time in milliseconds, the second item
is the end time in milliseconds, and the third item is the phonetic-level
label. In order to train on this data, it needs to be re-mapped into the
following set of time-aligned labels:
0 53 <.pau>
53 113 <T>
113 142 $den<9r
142 170 9r>$fnt_r
170 190 $ret<i:
190 209 <i:>
209 229 i:>$sil
229 273 <.pau>
A set of Tcl scripts to automate this process will be described later.
Also, some general modifications may be made to the hand-labeled data so
that the data is more suited for training; for example, we may want to
ignore very short pauses. Again, there are scripts described below that
will automate this for us.
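To make the re-mapping concrete, the sketch below reproduces the "three" example above. It is not the algorithm used by the Toolkit scripts: the function names are hypothetical, phones are split into equal-duration parts with boundaries rounded to the nearest millisecond, and silence is assumed as the context at the edges of the file. The groups and parts are taken from the tables in Section 2.3, with /.pau/ assumed to be a one-part phone.

```python
LEFT_GROUPS = {"$bck_l": ["oU"], "$fnt_l": ["i:", "i:_x", "aI"], "$ret": ["9r"],
               "$den": ["T", "v", "th", "z"], "$sil": [".pau", "tc"]}
RIGHT_GROUPS = {"$bck_r": ["oU", "aI"], "$fnt_r": ["i:", "i:_x"], "$ret": ["9r"],
                "$den": ["T", "v", "th", "z"], "$sil": [".pau", "tc"]}
PARTS = {".pau": "1", "T": "1", "9r": "2", "i:": "3", "tc": "1", "th": "r",
         "z": "1", "i:_x": "2", "oU": "3", "f": "1", "aI": "3", "v": "1"}


def group_of(phone, groups):
    """Return the context group containing phone, or the phone itself if none."""
    for name, members in groups.items():
        if phone in members:
            return name
    return phone


def boundaries(start, end, n):
    """Split [start, end] into n equal parts, rounding to the nearest msec."""
    return [int(start + (end - start) * k / n + 0.5) for k in range(n + 1)]


def remap(labels):
    """Map (start, end, phone) labels to context-dependent category labels."""
    out = []
    for i, (start, end, phone) in enumerate(labels):
        left = group_of(labels[i - 1][2], LEFT_GROUPS) if i > 0 else "$sil"
        right = (group_of(labels[i + 1][2], RIGHT_GROUPS)
                 if i + 1 < len(labels) else "$sil")
        parts = PARTS[phone]
        if parts == "1":
            out.append((start, end, "<%s>" % phone))
        elif parts == "r":
            out.append((start, end, "%s>%s" % (phone, right)))
        else:
            b = boundaries(start, end, int(parts))
            out.append((b[0], b[1], "%s<%s" % (left, phone)))
            if parts == "3":
                out.append((b[1], b[2], "<%s>" % phone))
            out.append((b[-2], b[-1], "%s>%s" % (phone, right)))
    return out


three = [(0, 53, ".pau"), (53, 113, "T"), (113, 170, "9r"),
         (170, 229, "i:"), (229, 273, ".pau")]
for start, end, category in remap(three):
    print(start, end, category)
```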
Force-Aligned Data
Often, the corpus we want to train on has text transcriptions but no
time-aligned phonetic labels. In this case, we can create either phonetic
labels or category labels using a process called "forced alignment".
Forced alignment is the process of using an existing recognizer to recognize
a training utterance, where the grammar and vocabulary are restricted to
be the correct result. (The correct result is the word-level transcription,
which must be known). The result of forced alignment is a set of time-aligned
labels that give the existing recognizer's best alignment of the correct
phones or categories. If the existing recognizer is good, then the labels
will have good time alignments. These labels can then be used for training
a new recognizer. Even if the existing recognizer is not so good, this
process can be used to determine an initial set of categories.
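The core of forced alignment is a Viterbi search in which the category sequence is fixed and only the boundaries are free. The sketch below shows that idea under simplifying assumptions: the recognizer's outputs are given as a nested list of per-frame log probabilities, categories are referred to by index, each category must occupy at least one frame, and no duration limits are applied. The function name `forced_alignment` is hypothetical and this is not the Toolkit's implementation.

```python
def forced_alignment(log_probs, sequence):
    """Align a known category sequence to frames.

    log_probs[t][c] is the log probability of category c at frame t.
    sequence is the list of category indices in their known order.
    Returns (start_frame, end_frame, category) tuples, end exclusive.
    """
    n_frames, n_states = len(log_probs), len(sequence)
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_probs[0][sequence[0]]      # must start in the first category

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]                              # remain in category s
            move = score[t - 1][s - 1] if s > 0 else neg_inf    # advance from s-1
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + log_probs[t][sequence[s]]

    if score[n_frames - 1][n_states - 1] == neg_inf:
        raise ValueError("too few frames for the given sequence")

    # Trace back from the last category at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()

    # Collapse the frame-level path into labeled segments.
    segments, start = [], 0
    for t in range(1, n_frames + 1):
        if t == n_frames or path[t] != path[t - 1]:
            segments.append((start, t, sequence[path[t - 1]]))
            start = t
    return segments
```

With 10 msec frames, multiplying the returned frame indices by 10 gives label times in milliseconds.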
2.4.4 Number of Samples per Category
Finally, the designer of a recognizer must decide how many samples
of each category to train on. Usually, networks with decent performance
can be trained using up to 500 samples per category, but sometimes 2000
or more samples are used. In order to get best performance, all samples
in the training set should be used. However, training with all samples
may be very time-consuming.
If some categories have very few or no training samples, then there
are two options. The first option is to use an additional corpus that contains
samples of these infrequent classes. The second option is to "tie" these
infrequent categories to phonetically similar categories that do have enough
training samples. Categories tied in this way will not be trained on, and
during recognition their probabilities will be set equal to the probabilities
of the categories that they were tied to.
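The effect of tying at recognition time can be sketched as a simple copy of probabilities. The function name `expand_tied`, the dictionary representation, and the particular categories and values shown are illustrative assumptions only.

```python
def expand_tied(probabilities, ties):
    """Give each tied category the probability of the category it is tied to.

    probabilities maps trained category names to network outputs for one frame.
    ties maps each untrained (tied) category to a trained category.
    """
    expanded = dict(probabilities)
    for tied, target in ties.items():
        expanded[tied] = probabilities[target]
    return expanded


# Hypothetical frame: "oU>f" had too few samples and is tied to "oU>$den".
frame = {"<oU>": 0.6, "oU>$den": 0.3, "oU>$sil": 0.1}
print(expand_tied(frame, {"oU>f": "oU>$den"}))
```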
2.5 Training the Network
2.5.1 Generating Data
Once the categories to train on have been found, and the number of
samples per category has been determined, the actual data that will be
trained on are collected and stored in a "vector file". This vector file
contains, for each training sample, the features that will be input to
the neural network and the target category. (One set of training features
and the target category is called a "vector"; it is also called a "sample".)
2.5.2 Shuffling Data
The order in which the training process selects vectors for training
is important, and it is thought that having vectors in random order improves
the accuracy with which training can be done. As a result, the original
vector file is "shuffled" before training, making the order of training
vectors quasi-random.
2.5.3 Number of Hidden Nodes
At CSLU, we use 3-layer feed-forward networks. The number of input
nodes is the number of spectral features, and the number of output nodes
is the number of categories to be trained on. The designer of a recognizer
must decide how many hidden nodes the network should have; in general,
we have found 200 hidden nodes to be a reasonable number.
2.5.4 Negative Penalty
When using a large number of samples per category, it is nearly inevitable
that some categories will have far fewer samples than others, making it
difficult to learn these sparse categories. This difficulty in training
is due to the fact that there are many more negative samples than positive
samples for a sparse category, where negative samples are samples for which
the category being trained on has a target value of 0, and positive samples
are samples for which the category being trained on has a target value
of 1. As a result, these sparse categories often have very small output
values that don't reflect the actual posterior probabilities that we want
to obtain. To adjust for this, the amount that each negative sample contributes
to the total error is weighted by a value proportional to the number of
samples in that negative category; this value is called a "negative penalty".
Training can be done either with or without this negative penalty. A more
thorough discussion of the negative penalty can be found in the paper by
Wei and van Vuuren at ICASSP-98, "Improved Neural Network Training of Inter-Word
Context Units for Connected Digit Recognition."
2.5.5 Number of Training Iterations
It is almost never necessary to continue training until the training
error stops decreasing; the best performance on the development set will
almost always happen much sooner. Usually, best performance on the development
set occurs after 20 to 30 iterations, and so training is done for a fixed
number of iterations, usually 30 to 40.
2.5.6 Re-Training on Force-Aligned Data
As described above, forced alignment can be used to generate labels
for training. In order to generate initial labels using forced alignment,
we usually use a general-purpose recognizer. We can also use forced alignment
to re-train a network; in this case, we use our current-best network to
generate the forced-alignment labels and then train again using these new
labels. This re-training often yields better results.
2.5.7 Forward-Backward (Embedded) Training
One final method for improving results uses "forward-backward" or "embedded"
training. In forward-backward training, the targets of the neural network
are not binary values, but posterior probabilities. These probabilities
are determined using the forward-backward algorithm, in which a previously-trained
neural network is used to compute the observation probabilities. (The forward-backward
algorithm is usually used for training a Hidden Markov Model, and a good
tutorial on this subject is given in Rabiner and Juang's book "Fundamentals
of Speech Recognition", in Chapter 6.) The use of the forward-backward
algorithm for training a neural network is described in a paper by Yan, Fanty,
and Cole at ICASSP-97, "Speech
Recognition Using Neural Networks with Forward-Backward Probability Generated
Targets".
2.6 Evaluation
2.6.1 Word-Level Evaluation
Once we have trained for, say, 30 iterations, we need to determine
which iteration has the best performance on the development set. To do
this, we recognize each utterance in the development set using the network
weights from each iteration. If the number of words in each utterance is
not known beforehand, we need to evaluate the performance at each iteration
in terms of substitution errors, insertion errors, and deletion errors.
(If the number of words is known beforehand, then we only need to measure
substitution errors, but the same method can be used). The overall accuracy
of a network iteration is defined to be 100% - (Sub + Ins
+ Del), where Sub is the percentage of substitution errors,
Ins is the percentage of insertion errors, and Del is the
percentage of deletion errors. We can also measure the "sentence-level
accuracy", which is the number of utterances (or waveforms) recognized
correctly divided by the total number of utterances in the development
set.
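Substitution, insertion, and deletion counts are usually obtained by aligning the recognized word string with the correct transcription using a minimum-edit-distance alignment. The sketch below shows one way to do this; the function names are hypothetical, and it assumes that the error percentages are taken relative to the number of words in the reference transcription, which the text above does not state explicitly.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, insertions, deletions) for a minimum-error alignment."""
    n_ref, n_hyp = len(reference), len(hypothesis)
    # cost[i][j] = (total, subs, ins, dels) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (n_hyp + 1) for _ in range(n_ref + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n_ref + 1):
        cost[i][0] = (i, 0, 0, i)          # all reference words deleted
    for j in range(1, n_hyp + 1):
        cost[0][j] = (j, 0, j, 0)          # all hypothesis words inserted
    for i in range(1, n_ref + 1):
        for j in range(1, n_hyp + 1):
            total, subs, ins, dels = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (total, subs, ins, dels)          # correct word
            else:
                diagonal = (total + 1, subs + 1, ins, dels)  # substitution
            total, subs, ins, dels = cost[i][j - 1]
            insertion = (total + 1, subs, ins + 1, dels)
            total, subs, ins, dels = cost[i - 1][j]
            deletion = (total + 1, subs, ins, dels + 1)
            cost[i][j] = min(diagonal, insertion, deletion)
    return cost[n_ref][n_hyp][1:]


def word_accuracy(pairs):
    """Word-level accuracy (percent) over (reference, hypothesis) word-list pairs."""
    errors = sum(sum(count_errors(ref, hyp)) for ref, hyp in pairs)
    n_words = sum(len(ref) for ref, _ in pairs)
    return 100.0 - 100.0 * errors / n_words


def sentence_accuracy(pairs):
    """Percentage of utterances recognized with no errors at all."""
    correct = sum(1 for ref, hyp in pairs if ref == hyp)
    return 100.0 * correct / len(pairs)
```

For the reference "one two three four" and the hypothesis "one too three", `count_errors` reports one substitution and one deletion.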
2.6.2 Choosing the Best Iteration
Usually, we choose the network iteration with the best word-level accuracy;
if two or more iterations have equal word-level accuracy, we select the one
with the greatest sentence-level accuracy.
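This selection rule amounts to a sort on two keys. In the sketch below, the function name `choose_best_iteration` and the accuracy figures are hypothetical; if two iterations tie on both measures, the earlier one in the list is returned.

```python
def choose_best_iteration(results):
    """Pick the iteration with the best word accuracy, then sentence accuracy.

    results is a list of (iteration, word_accuracy, sentence_accuracy) tuples.
    """
    best = max(results, key=lambda r: (r[1], r[2]))
    return best[0]


# Made-up development-set results, for illustration only.
dev_results = [
    (10, 95.0, 80.0),
    (20, 96.5, 82.0),
    (25, 96.5, 84.0),   # ties iteration 20 on words, wins on sentences
    (30, 96.0, 85.0),
]
print(choose_best_iteration(dev_results))
```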
2.6.3 Testing
Once we have finished developing a recognizer, we evaluate the final
performance on the test set, in terms of word-level and sentence-level
accuracy. It is important, however, that once evaluation is done on the
test set, the recognizer is not further modified based on these test-set
results. In order to make sure that such modifications are not done, the
test set is usually reserved until just before the recognizer is put into
general-purpose use (or just before publishing results in a journal or
at a conference).
3. Overall Procedure
Given the background described in the previous section, the process of
training a recognizer becomes relatively simple. This section gives the
"recipe" for this training process.
3.1 Create Descriptions
The first step is to create a description of the recognizer and describe
how the data will be selected for training. The files that need to be created
are:
-
corpora file
-
Create a "corpora" file if one doesn't yet exist. The corpora file contains
a master list of each corpus and the location and format of the files in
that corpus. The format of this corpora file is given below; there is no
automated way of generating this file, but it is easy to modify by hand.
The same corpora file should be used for all training tasks.
-
cull file
-
Create "cull" files, if necessary. A cull file is a list of files in a
corpus that won't be used for training, development, or in-house testing.
Usually, 5% of the entire corpus is put into this cull file. The script
cull5.tcl can be used to generate a cull file for a particular corpus.
-
info files
-
Create "info" files for training, development, and testing. These info
files must be created by hand; the format is given below in Section
5. An info file contains all of the information that is necessary to
find samples for training, development, or testing. This info file includes
the partition (train, develop, test), how to select the data for the required
partition, the basename of the recognizer, the minimum number of samples
requested for each category, and corpus-dependent information. One info
file is required for each of the tasks of training, re-training using forced
alignment, forward-backward training, development, and testing.
-
vocab file
-
Create a "vocab" file with the vocabulary, pronunciations, and grammar
for the task. This must also be created by hand, and the format is given
below.
-
parts file
-
Create a "parts" file, which specifies how many parts to split each phoneme
into, and what context groupings to use. Once again, this must be created
by hand, and the format is given in Section
5.
3.2 Find Data
Given the files created above, the scripts to use in order to find data
files for training are:
-
find_files.tcl
-
Use "find_files.tcl" to find files for training, development,
and testing. This script must be called once for each set of files. At
this stage, any filters are applied and the corpus is searched for files
that are appropriate for the given partition (such as training or testing).
-
categories.tcl
-
Use "categories.tcl" to generate categories to train on. This
script uses the info, vocab, and parts files to create a "desc" file. A
desc file contains a description of the recognizer for use by other training
and recognition scripts. An "olddesc" file is also created. This file also
contains a description, but in an older format. (The "olddesc" file will,
in future versions of the training process, be obsolete. When this time
comes, the olddesc file will no longer be generated. For now, however,
the "desc" file is used in some training scripts, and the "olddesc" file
is used in recognition scripts.)
-
gen_catfiles.tcl
-
Use "gen_catfiles.tcl" to create time-aligned categories from
text transcriptions or from phonetic time-aligned transcriptions. These
categories are written to files with the extension ".cat" and are put in
sub-directories that mirror the directory structure of the corpus (or corpora)
being used.
-
revise_desc.tcl
-
Use "revise_desc.tcl" to make sure that all categories have enough
samples for training. If some categories don't have enough samples, then
either the vocab and parts files need to be modified (and the entire process
repeated), or these sparse categories need to be "tied" to categories with
more data. This script is also used to set minimum and maximum duration
limits for each category, based on the categories in the training data.
The use of these duration limits is optional, as the CSLU Toolkit does
have default limits for every phone; however, performance usually improves
by using the durations from the categories that are trained on. This script
will revise the contents of the desc and olddesc files.
-
hscript.exe
-
Use "hscript" to create other files that will be used in training
and recognition. These files have the extensions ".rr", ".list", and ".0".
The ".rr" file contains a binary description of the recognizer, with information
such as the list of categories being recognized. The ".list" file contains
an ASCII list of all the categories. The ".0" file is an initial HMM model,
used later in the forward-backward training stage.
3.3 Select Data for Training
Once the files have been selected, the category files have been created,
and the desc file is correct, then we can use the following scripts and
programs to select frames for training:
-
pickframes.tcl
-
Use "pickframes.tcl" to select samples to train on. The output
of this script is a "pick" file, which is used directly by genvec.tcl.
-
genvec.tcl
-
Use "genvec.tcl" to create features for each frame to be trained
on.
-
shuffle.exe
-
Use "shuffle" to randomize the order of the vectors.
-
checkvec.exe
-
Use "checkvec" to make sure that the data in the vector file is
valid.
3.4 Train and Evaluate
-
nntrain.exe
-
Use "nntrain" to train the network on the vector file.
-
find_best.tcl
-
Use "find_best.tcl" to find the best iteration of the network
using the set of development files.
-
browse.tcl
-
Use "browse.tcl" to evaluate errors. The errors that are made
may give clues about necessary revisions to the recognizer. Repeat steps
in the development process, as necessary.
3.5 Re-Train
Create force-aligned data using the best iteration of the network that
was just trained. To do this, create an info file
for forced alignment that specifies a new directory in which to put the
category files and a forced-alignment script to use. Then use "find_files.tcl"
and "gen_catfiles.tcl" to generate
the force-aligned labels.
Repeat Sections 3.3 and
3.4 to create a network trained on
this force-aligned data.
Given a network trained on force-aligned data, create a third network
using the forward-backward method. To construct such a "forward-backward
network", do the following:
- Create a vector file using hmm_embed.tcl
- Check this vector file for errors using hnncheckvec.exe
- Train on this vector file using hnntrain.exe
- Find the best iteration using find_best.tcl
Repeat this cycle to create another forward-backward network. This final
network (the result of the second cycle of forward-backward training) should
have the best performance of all the networks, although sometimes the
first forward-backward network has better performance.
3.6 Evaluate Test Set
Use "find_best.tcl" to evaluate the
best network's performance on the test set. These are the final results
that are acceptable for publication.
4. Complete Example
To illustrate the procedure described above, the example of training a continuous-speech
digits recognizer is given in this section. Text given in bold indicates
commands that are typed from a command window; text in courier font
indicates the output from this command. In DOS, all commands must be entered
on one line; if a backslash is used in the examples below to continue the
command on another line, this must be typed as one line with no backslash
when using DOS. The parameters for each script and program are explained
in Section 6. The data files
that are used in this example are located in the zip file at the following
address: http://www.cse.ogi.edu/CSLU/toolkit/documentation/userguide/nnet_training/CSLUexamplefiles.zip.
[Step 1] In this initialization
step, set up the directory structure that you will use. We recommend that
you create one directory for each "project", where a project contains all
of the files created during the training of a network. For this example,
we will be using a project directory called \tutorial\digit. Note
that some files (vector files in particular) may take up a large amount
of disk space; you may want to delete these files after you are finished
using them. Now is a good time to make sure that your path contains the
location of the training scripts as well as the stand-alone C programs
used for training. To check this, if you type "categories.tcl"
in your project directory, you should get the following:
categories.tcl
Usage: categories.tcl <.info file> <.vocab file>
<.parts file> <.desc file>
<.olddesc file> [-isolated]
and if you type "checkvec" in your project directory, you should
get the following:
If you don't get these responses, contact the person who installed the
Toolkit to find the location of the "script\training_1.0" directory
and the "bin" directory within the Toolkit directory hierarchy.
It may also be convenient to copy the recog.tcl
script from the Toolkit's "script\training_1.0" directory into
a convenient directory (such as \tutorial\src), as you will need
to specify the path to this file several times. Finally, you can copy the
files corpora,
digit.train.info,
digit.dev.info,
digit.test.info,
digit.vocab,
digit.parts,
and remap_tutorial.tcl
from the links provided here into your project directory (\tutorial\digit).
This will save you some typing in the following Steps 2 through 6. None
of these files will need to be modified for this example, but Section
5 describes the format of these files so that you can change them later
on, in order to train on another task or train using different parameters.
[Step 2] Create a corpora
file, called "corpora" (no extension). For this tutorial,
the corpora file might look like this (assuming that the tutorial data
is stored in \tutorial\data):
type corpora
corpus: numbers
wav_path /tutorial/data/speechfiles
txt_path /tutorial/data/txtfiles
phn_path /tutorial/data/phnfiles
format {NU-([0-9]+)\.[A-Za-z0-9_]+}
wav_ext wav
txt_ext txt
phn_ext phn
cat_ext cat
cull_file /tutorial/digit/numbers.cull5
ID: {regexp $format $filename filematch ID}
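The "format" and "ID:" fields above use a Tcl regular expression to pull a numeric ID out of each file name. As a rough illustration only (the Toolkit does this in Tcl, and the function name extract_id is hypothetical), the same pattern can be tried out in Python:

```python
import re

# Same pattern as the "format" field in the corpora file above.
FORMAT = re.compile(r"NU-([0-9]+)\.[A-Za-z0-9_]+")

def extract_id(filename):
    """Return the numeric ID captured by the pattern, or None if no match."""
    match = FORMAT.search(filename)  # unanchored search, like Tcl's regexp
    return int(match.group(1)) if match else None

print(extract_id("NU-25.zipcode.wav"))
print(extract_id("readme.txt"))
```

A file name that does not match the pattern yields no ID in this sketch; how the Toolkit scripts treat such files is not shown here.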
[Step 3] To start with, we run
cull5.tcl to remove 5% of all available
data for third-party evaluation. This creates a cull
file called numbers.cull5. This cull file simply contains a list of
waveform files that won't be used for training, development, or testing.
cull5.tcl numbers corpora
Please be patient... this may take a minute or two.
Found 653 wav files in corpus 'numbers'
Culled 32 out of 653 files (4.90%)
Done.
[Step 4] Create info
files for training, development, and testing. They will be called digit.train.info,
digit.dev.info, and digit.test.info. We will only request
200 samples per category, so that this tutorial doesn't take more time
than necessary; if one were constructing a real-life network, it would
be better to use all available samples. To specify all samples, use the
keyword ALL instead of 200 in the "want:" field in digit.train.info.
For the digit.train.info file, we are specifying
that we want training data from the numbers corpus, and we will put time-aligned
category labels in the \tutorial\digit\numbers_train directory
(specified in the "partition:", "name:", and "cat_path:" fields). We require
the presence of waveform, phonetically-labeled, and text transcription files
in order to do this (specified in the "require:" field), and we'll use
3/5 of available files (specified in the "partition:" field). We won't
skip over any files (specified in the "filter:" field), but we will require
that all of the vocabulary words in the text file are words we want to
recognize (specified in the "vocab:" field). We will remap the hand-labeled
phonetic files (which can have a high degree of variability in the phonemes
used to represent a word) to a consistent set of phonemes using the remap_tutorial.tcl
script (specified in the "remap:" field). For more information about the
meaning of the various fields, see the description of the info
file format.
type digit.train.info
basename: digit;
partition: train;
vector_size: 130;
corpus: name: numbers
        cat_path: /tutorial/digit/numbers_train
        require: wpt
        partition: "{expr $ID % 5} {0 1 2}"
        filter: 1+1
        vocab: digit.vocab
        remap: /tutorial/digit/remap_tutorial.tcl
        want: 200;
type digit.dev.info
partition: dev;
basename: digit;
corpus: name: numbers
        require: wt
        partition: "{expr $ID % 5} {3}"
        filter: 1+1
        vocab: digit.vocab;
type digit.test.info
partition: test;
basename: digit;
corpus: name: numbers
        require: wt
        partition: "{expr $ID % 5} {4}"
        filter: 1+1
        vocab: digit.vocab;
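The three "partition:" fields above split the corpus by the remainder of each file's ID modulo 5: remainders 0, 1, and 2 go to training, 3 to development, and 4 to testing. The following Python sketch mirrors that rule for illustration; the names PARTITIONS and partition_of are hypothetical, and the real selection is done by the Tcl expression in the info file.

```python
# Residues of ID % 5 assigned to each partition, copied from the three
# "partition:" fields above.
PARTITIONS = {"train": {0, 1, 2}, "dev": {3}, "test": {4}}

def partition_of(file_id):
    """Return the partition whose residue set contains file_id % 5."""
    residue = file_id % 5
    for name, residues in PARTITIONS.items():
        if residue in residues:
            return name
    raise ValueError(f"no partition for residue {residue}")

# IDs taken from file names that appear in this example's output listings.
for file_id in (25, 46, 47, 23, 10193):
    print(file_id, partition_of(file_id))
```

Because every residue from 0 to 4 is assigned to exactly one partition, the three sets do not overlap and no file is shared between training, development, and testing.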
[Step 5] Create a vocab
file, called digit.vocab. This file contains the words, their
pronunciations, and the grammar to be used during recognition.
type digit.vocab
zero {z I 9r oU} ;
oh {oU} ;
one {w ^ n} ;
two {uc th u} ;
three {T 9r i:} ;
four {f >r} ;
five {f aI v} ;
six {s I uc ks} ;
seven {s E v & n} ;
eight {ei uc [th]} ;
nine {n aI n} ;
separator {.pau [.garbage] .pau} ;
$digit = zero | oh | one | two | three | four | five | six | seven | eight | nine;
$grammar = ([separator%%] < $digit [separator%%] > [separator%%]);
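In the vocab file, a phone in square brackets appears to be optional: the categories.tcl output in Step 8 lists both "ei uc" and "ei uc th" for the word eight. Under that reading (an interpretation of the listing, not a statement of the vocab-file specification), the expansion of a pronunciation into its explicit variants can be sketched in Python as follows; expand_pronunciation is a hypothetical helper.

```python
from itertools import product

def expand_pronunciation(pron):
    """Expand square-bracketed optional phones into all explicit variants."""
    choices = []
    for token in pron.split():
        if token.startswith("[") and token.endswith("]"):
            choices.append((token[1:-1], None))  # phone present or absent
        else:
            choices.append((token,))
    variants = []
    for combo in product(*choices):
        variants.append(" ".join(p for p in combo if p is not None))
    return variants

print(expand_pronunciation("ei uc [th]"))
print(expand_pronunciation(".pau [.garbage] .pau"))
```

This sketch handles only single bracketed phones, which is all that the digit vocabulary uses.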
[Step 6] Create a parts
file, called digit.parts. This contains the number of parts
that each phoneme will be split into, the groupings of phones into clusters
of similar phones, and mappings from one phone to another.
type digit.parts
.pau 1 ;
uc 1 ;
f 2 ;
v 2 ;
T 2 ;
s 2 ;
z 2 ;
n 2 ;
w 2 ;
I 2 ;
ks 2 ;
& 2 ;
^ 2 ;
9r 2 ;
E 2 ;
\>r 3 ;
i: 3 ;
u 3 ;
ei 3 ;
aI 3 ;
oU 3 ;
th r ;
$sil = .pau tc .garbage;
$den_l = s z th T ks;
$den_r = s z th T;
$lab = f v;
$ret_l = 9r \>r;
$ret_r = 9r ;
$bck_l = oU u;
$bck_r = oU w;
map uc tc;
map uc kc;
[Step 7] Run find_files.tcl
in order to find files suitable for training. The input files (besides
digit.train.info and corpora) are \tutorial\digit\numbers.cull5
and digit.vocab. The output file is digit.train.numbers.files;
this filename is constructed from the basename, the partition, and the
corpus. The reason that the user doesn't specify the output filename on
the command line is that it is possible, when using several corpora, to
create several output files; it seems easier to have the filenames automatically
determined than to have the user specify one filename for each corpus.
find_files.tcl digit.train.info corpora
Basename: digit
Partition: train
Corpus: numbers
cat_ext:
cat
txt_ext:
txt
partition:
{expr $ID % 5} {0 1 2}
cat_path:
/tutorial/digit/numbers_train
txt_path:
/tutorial/data/txtfiles
remap: /tutorial/digit/remap_tutorial.tcl
phn_ext:
phn
filter: 1+1
wav_ext:
wav
cull_file:
/tutorial/digit/numbers.cull5
name: numbers
phn_path:
/tutorial/data/phnfiles
want: 200
wav_path:
/tutorial/data/speechfiles
vocab: digit.vocab
require:
wp
Total of 530 wave files to be used
NU-25.zipcode.wav
NU-30.zipcode.wav
NU-46.streetaddr.wav
NU-47.zipcode.wav
(etc)
NU-596.zipcode.wav
NU-597.streetaddr.wav
Final count of 530 files for this corpus
Done.
Then, run find_files.tcl a second
and third time to find files suitable for development and testing:
find_files.tcl digit.dev.info corpora
find_files.tcl digit.test.info corpora
[Step 8] Run categories.tcl
to determine the context-dependent categories that will be classified by
the recognizer. The input files are the vocab,
parts, and info files.
The output files are the desc and olddesc
files; these files contain not only the list of the context-dependent categories,
but also some other information about the recognizer that we will be creating.
categories.tcl digit.train.info digit.vocab digit.parts digit.train.desc
digit.train.olddesc
Basename: digit
Parition: train
Corpus: numbers
partition:
{expr $ID % 5} {0 1 2}
cat_path:
/tutorial/digit/numbers_train
remap: /tutorial/digit/remap_tutorial.tcl
filter: 1+1
name: numbers
want: 200
vocab: digit.vocab
require:
wp
zero: z I 9r oU
oh: oU
one: w ^ n
two: uc th u
three: T 9r i:
four: f GREATER_THANr
five: f aI v
six: s I uc ks
seven: s E v & n
eight: ei uc
eight: ei uc th
nine: n aI n
separator: .pau .garbage .pau
separator: .pau .pau
word begin = z oU w uc T f s ei n .pau
word end = oU n u i: GREATER_THANr v ks uc th .pau
[Step 9] Run gen_catfiles.tcl
to take the list of files for training (digit.train.numbers.files)
and create time-aligned labels of categories to train on. The input file
(other than digit.train.info and corpora) is digit.train.numbers.files.
If specified in digit.train.info, the script in the "remap:" field
will be used, or the script in the "force_cat:" or "force_phn:" fields
will be used (in this case, we haven't specified the "force_cat:" or "force_phn:"
fields because we are not yet doing forced alignment). The category label
files that are created are stored in the directory that is specified in
digit.train.info in the "cat_path:" field. The gen_catfiles.tcl
script also creates two other output files: the "dur"
file and the "counts" file. The dur file contains
minimum and maximum duration limits for each category, as determined from
the category label files; the counts file lists the number of occurrences
(and total time in msec) of each category.
gen_catfiles.tcl digit.train.info corpora digit.train.dur digit.train.counts
Basename: digit
Partition: train
Corpus: numbers
cat_ext:
cat
txt_ext:
txt
partition:
{expr $ID % 5} {0 1 2}
cat_path:
/tutorial/digit/numbers_train
txt_path:
/tutorial/data/txtfiles
remap: /tutorial/digit/remap_tutorial.tcl
phn_ext:
phn
filter: 1+1
wav_ext:
wav
cull_file:
/tutorial/digit/numbers.cull5
name: numbers
phn_path:
/tutorial/data/phnfiles
want: 200
wav_path:
/tutorial/data/speechfiles
vocab: digit.vocab
require:
wp
Creating directory /tutorial/digit/numbers_train/0
Created file
NU-25.zipcode.cat: oh seven three oh six
Created file
NU-30.zipcode.cat: one zero zero one four
Created file
NU-46.streetaddr.cat: one six
Created file
NU-47.zipcode.cat: one one three five four
(etc)
Created file
NU-596.streetaddr.cat: nine eight zero four
Created file
NU-596.zipcode.cat: eight seven one one two
Created file
NU-597.streetaddr.cat: one
Sorting durations... taking lowest 5% and top 100% of durations
Done.
This script may generate messages such as
Merging 2323 2344 .tc with right (.pau) (too short)
These are simply messages to the user that some labels are being merged
or deleted when converting from hand labels to categories. These messages
come from the remapping script, in this case remap_tutorial.tcl.
No action needs to be taken by the user. At the end, for each category,
the duration that is at the bottom 5th percentile of all durations for
that category is written to the dur file as the minimum duration, and the
longest duration of the category is written to the dur file as the maximum
duration. These limits help the Viterbi search refrain from inserting very
short words during recognition.
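The duration rule just described (minimum at the 5th percentile, maximum at the longest observed duration) can be sketched in Python. This is an illustration under assumptions: the nearest-rank percentile method and the function name duration_limits are not taken from gen_catfiles.tcl, and the sample durations are made up.

```python
import math

def duration_limits(durations_msec, low_percentile=5):
    """Return (minimum, maximum) duration limits for one category."""
    ordered = sorted(durations_msec)
    # Nearest-rank percentile: an assumption, the script may differ.
    rank = max(1, math.ceil(low_percentile * len(ordered) / 100))
    return ordered[rank - 1], ordered[-1]

# Made-up durations (msec) for a single category.
sample = [30, 42, 45, 50, 55, 58, 60, 61, 63, 65,
          66, 70, 72, 75, 80, 84, 90, 95, 110, 140]
print(duration_limits(sample))
```

Taking a low percentile rather than the absolute shortest duration keeps a few unusually short labels from setting the minimum too low.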
[Step 10] Run revise_desc.tcl
to make sure that we have enough samples of each category to train on,
and to add duration limits to the desc and olddesc files. If there are
not enough samples of a category, this script allows us to tie these categories
to categories with more samples. This is the only interactive script in
the entire training and recognition process. The input files are the counts
file, the dur file, the desc file, and the olddesc file. The outputs of
this script are modifications to the desc and olddesc files to include
category-tying information and duration-limit information.
revise_desc.tcl digit.train.counts digit.train.dur digit.train.desc
digit.train.olddesc -min 3
The following states have already been tied:
(none)
Warnings:
Category uc<ei has 2 occurrences
(total duration of 161 msec)
Category ks>ei has 2 occurrences
(total duration of 131 msec)
Category n<n has 0
occurrences (total duration of
0 msec)
Category n>n has 0 occurrences
(total duration of 0 msec)
Category uc<oU has 2 occurrences
(total duration of 224 msec)
Category th>ei has 1 occurrences
(total duration of 85 msec)
Category th>n has 2 occurrences
(total duration of 96 msec)
Category th>uc has 0 occurrences
(total duration of 0 msec)
Do you want this program to create ties? yes
All Available Categories:
$lab<& &>n
<.pau> $den_l<9r I<9r 9r>$bck_r
9r>i: $den_l<E
E>$lab $lab<>r
<>r> >r>$bck_r >r>$den_r >r>$lab >r>$sil
>r>ei
>r>n >r>uc $den_l<I
I>$ret_r I>uc $bck_l<T
$den_l<T $lab<T
$ret_l<T $sil<T
i:<T n<T
uc<T T>$ret_r w<^
^>n
$lab<aI n<aI
<aI> aI>$lab aI>n $bck_l<ei
$den_l<ei $lab<ei
$ret_l<ei $sil<ei i:<ei
n<ei <ei>
ei>uc $bck_l<f $den_l<f
$lab<f $ret_l<f $sil<f
i:<f n<f
uc<f f>>r f>aI
$ret_l<i: <i:> i:>$bck_r i:>$den_r
i:>$lab i:>$sil i:>ei
i:>n
i:>uc uc<ks ks>$bck_r ks>$den_r
ks>$lab ks>$sil ks>n
ks>uc
$bck_l<n $den_l<n $lab<n
$ret_l<n $sil<n
&<n ^<n
aI<n
i:<n uc<n
n>$bck_r n>$den_r n>$lab
n>$sil n>aI n>ei
n>uc $bck_l<oU $den_l<oU $lab<oU
$ret_l<oU $sil<oU i:<oU
n<oU
<oU> oU>$bck_r oU>$den_r oU>$lab
oU>$sil oU>ei oU>n
oU>uc
$bck_l<s $den_l<s $lab<s
$ret_l<s $sil<s
i:<s n<s
uc<s
s>E
s>I th>$bck_r th>$den_r th>$lab th>$sil
th>u $den_l<u
<u> u>$bck_r u>$den_r
u>$lab u>$sil u>ei
u>n u>uc
<uc>
E<v aI<v v>$bck_r
v>$den_r v>$lab v>$sil
v>&
v>ei v>n
v>uc $bck_l<w $den_l<w $lab<w
$ret_l<w $sil<w
i:<w
n<w uc<w
w>^ $bck_l<z $den_l<z $lab<z
$ret_l<z
$sil<z i:<z
n<z uc<z
z>I
Tie category uc<ei
(2 occurances) to which category? no
Tie category ks>ei
(2 occurances) to which category? no
Tie category
n<n (0 occurances) to which category? $sil<n
Tie category
n>n (0 occurances) to which category? n>$sil
Tie category uc<oU
(2 occurances) to which category? no
Tie category th>ei
(1 occurances) to which category? no
Tie category th>n
(2 occurances) to which category? th>$sil
Tie category th>uc
(0 occurances) to which category? th>$sil
Would you like to create duration limits with data from
the durations file? yes
Modifying duration information in digit.desc...
Confirmation: Do you really want to change the desc file 'digit.desc'?
yes
Done.
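With "-min 3", every category in the warnings above has fewer than three occurrences. The check can be pictured with the short Python sketch below. It assumes the threshold is "fewer than the -min value", which is consistent with the warnings shown but not confirmed by them; low_count_categories is a hypothetical name, and the "<.pau>" count is made up for contrast.

```python
def low_count_categories(counts, min_count):
    """Return categories with fewer than min_count occurrences."""
    return [name for name, n in counts.items() if n < min_count]

# Occurrence counts from the warnings above, plus one made-up
# well-populated category for contrast.
counts = {"uc<ei": 2, "ks>ei": 2, "n<n": 0, "n>n": 0,
          "uc<oU": 2, "th>ei": 1, "th>n": 2, "th>uc": 0,
          "<.pau>": 200}
print(low_count_categories(counts, min_count=3))
```

Each flagged category is then either tied to a better-populated category or left alone, as in the interactive session above.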
[Step 11] Run hscript.exe
to create digit.rr, digit.list, and digit.0
from the revised digit.train.desc file. The digit.rr
file contains a binary description of the recognizer; the digit.list
file contains an ASCII list of the categories; and the digit.0
file contains an initial HMM description that is used in some of the cslush
function calls. (These cslush function calls are designed to work with
both neural networks and HMMs, and an HMM description is required even
when using or training neural networks.) Note that the value of stateNum
below (162) will be the number of output categories of the neural network;
this number should be the same as the number in a comment in the revised
digit.desc file (just before the list of categories).
hscript.exe digit.train.desc
opening file: digit.train.desc
Input model:
Output model: digit.0 digit.list digit.rr
tie: $sil<n to n<n (1)->(1)
tie: n>$sil to n>n (1)->(1)
tie: th>$sil to th>n (1)->(1)
tie: th>$sil to th>uc (1)->(1)
stateNum = 162
transpNum = 166
[Step 12] Run pickframes.tcl
to select frames for training, from the files created by gen_catfiles.
The input files (other than digit.train.info and corpora)
are digit.rr, digit.list, digit.0, and digit.train.numbers.files.
The output of this script is the file digit.train.pick, which
contains a binary list of files, the frames to be used in each file, and
the categories corresponding to these frames.
pickframes.tcl digit.train.info corpora digit.train.pick
Basename: digit
Partition: train
Corpus: numbers
cat_ext:
cat
txt_ext:
txt
partition:
{expr $ID % 5} {0 1 2}
cat_path:
/tutorial/digit/numbers_train
txt_path:
/tutorial/data/txtfiles
remap: /tutorial/digit/remap_tutorial.tcl
phn_ext:
phn
filter: 1+1
wav_ext:
wav
cull_file:
/tutorial/digit/numbers.cull5
name: numbers
phn_path:
/tutorial/data/phnfiles
want: 200
wav_path:
/tutorial/data/speechfiles
vocab: digit.vocab
require:
wp
digit.train.numbers.files, 200, 0
Picking frames for digit.train.numbers.files...
{{$lab<&} {0 200}} {&>n {1 200}} {<.pau> {2 200}}
{{$den_l<9r}
(etc)
[Step 13] Run genvec.tcl
to compute features for all of the frames given in digit.train.pick.
The input files are digit.train.olddesc, and digit.train.pick.
The features that are computed, and the target category values, are stored
in the binary output file digit.train.vec. Note that if you want
to use features that are different from the standard 130 features, you
can write the code used to create the new features; the location of your
code can be specified in the olddesc file.
Also, the description of the format
of the vector file given in Section 5 may be of interest.
genvec.tcl digit.train.olddesc digit.train.pick digit.train.vec
/tutorial/data/speechfiles/0/NU-25.zipcode.wav
/tutorial/data/speechfiles/0/NU-30.zipcode.wav
/tutorial/data/speechfiles/0/NU-46.streetaddr.wav
/tutorial/data/speechfiles/0/NU-47.zipcode.wav
/tutorial/data/speechfiles/0/NU-51.other2.wav
/tutorial/data/speechfiles/0/NU-51.other3.wav
/tutorial/data/speechfiles/0/NU-51.zipcode.wav
(etc)
[Step 14] Run shuffle.exe
to randomize the order of the features. This helps the training program
to better learn the data. This program divides the vector file into a series
of "blocks" and then shuffles the order of each block. In the example below,
-W indicates that feature vectors will be shuffled within a block;
-s 524 indicates that the size of one item to be shuffled is 524
bytes (130 features x 4 bytes per feature
+ 1 target x 4 bytes per target); the
-r 88 specifies a value for the random-number seed. The first vec
file is the input file, and the second vec file is the output file.
shuffle.exe -W -s 524 -r 88 digit.train.vec digit.train.svec
num_blocks = 44
num_full_blocks = 43
num_data = 21509
num_shuffle_items = 500
num_remain_items = 9
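To make the arithmetic and the block-wise shuffling concrete, here is a minimal Python sketch. It is not the implementation of shuffle.exe: the function shuffle_within_blocks is hypothetical, the block size of 500 items is taken from the num_shuffle_items line above, and the treatment of the final partial block is an assumption.

```python
import random

def shuffle_within_blocks(data, item_size, block_items, seed, header=0):
    """Shuffle fixed-size records inside consecutive blocks of a byte string."""
    rng = random.Random(seed)
    head, body = data[:header], data[header:]
    items = [body[i:i + item_size] for i in range(0, len(body), item_size)]
    out = []
    for start in range(0, len(items), block_items):
        block = items[start:start + block_items]
        rng.shuffle(block)  # records never leave their block
        out.extend(block)
    return head + b"".join(out)

item_size = 130 * 4 + 1 * 4           # 130 features + 1 target, 4 bytes each
num_items, block_items = 21509, 500   # values reported in the output above
full_blocks, remainder = divmod(num_items, block_items)
print(item_size, full_blocks, remainder)
```

The printed values (524, 43, and 9) agree with the -s argument and with the num_full_blocks and num_remain_items lines in the output above.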
[Step 15] Run checkvec.exe
to make sure that the vector file that we created has the correct format,
and that every category has at least one sample to train on. The numbers
on the left are the numbers corresponding to each category, and the numbers
on the right are the number of samples for that category. The input file
is digit.train.svec; the only output goes to the screen for the
user to check.
[Step 16] Run nntrain.exe
to train the neural network on the vector file digit.train.svec.
This program creates a weights file at each iteration; we will select the
best weights file after training for 30 iterations. The -l option
indicates that the negative penalty will be adjusted to compensate for
varying numbers of samples per category; -sn 88 and -sv 88
are random-number seeds; -a 3 130 200 161 specifies the architecture
of the net: 3 layers, with 130 nodes in the first layer, 200 nodes in the
hidden layer, and 161 nodes in the output layer. The value 30 specifies
training for 30 iterations, and the last parameter is the vector file to
use for training. Note that the number of nodes in the first layer will
always be 130 for the standard feature set. The number of hidden nodes
is decided by the user, and the number of output nodes to use is written
in a comment in the digit.desc file (this is the same as the number
of categories that are not tied, and excluding the <.garbage> category).
The only input file to nntrain is the vector file; the output files are
the neural-network weights files for each iteration (the default names are
nnet.X, where X is an integer from 0 to
the number of iterations).
nntrain.exe -l -sn 88 -sv 88 -a 3 130 200 161 30 digit.train.svec
creating net with seed 88
3 layers: 131 200 161
learning rate 0.050000
momentum 0.000000
negative weight 1.000000
training file digit.train.svec
numvec: 21509; tau: 107545.000000
vectors chosen in 1 blocks of 21509 with seed 88
iter 1: learn_rate 0.041667; total error is 75884.882813
iter 2: learn_rate 0.035714; total error is 57095.183594
iter 3: learn_rate 0.031250; total error is 50935.089844
iter 4: learn_rate 0.027778; total error is 46711.523438
(etc)
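The learn_rate values in the log above are consistent with a simple decay of the initial rate (0.05) by the number of vectors seen so far divided by tau. The Python sketch below reproduces the four printed values. The formula is inferred from this log only and is not taken from the nntrain source, so treat it as an assumption.

```python
def learn_rate(iteration, initial_rate=0.05, numvec=21509, tau=107545.0):
    """Decayed rate after `iteration` passes through numvec vectors."""
    return initial_rate / (1.0 + iteration * numvec / tau)

for n in range(1, 5):
    print(f"iter {n}: learn_rate {learn_rate(n):.6f}")
```

Note that tau in the log is exactly five times numvec, so under this reading the rate has halved after five iterations.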
[Step 17] Run find_best.tcl
to evaluate the performance of each iteration (weight file) on the development-set
data. This script may take a long time, especially if there are many files
in the development set. The invocation below assumes that the script for
doing recognition is located in \tutorial\src\recog.tcl. The input
files are the neural-network files created by nntrain, the digit.dev.numbers.files
file, the vocab file, and the olddesc file, as well as digit.rr.
The output files are ali files and a summary
file.
find_best.tcl nnet \tutorial\src\recog.tcl digit.dev.numbers.files
digit.vocab \
digit.train.olddesc hand_labels.summary -b 15 -g 10
Garbage value is 10
Basename for the .ali files is wrdalign_nnet
Summary file is hand_labels.summary
Starting Iteration 15...
nnet.15:1: /tutorial/data/speechfiles/0/NU-23.streetaddr.wav
Correct: seven
Result: seven
nnet.15:2: /tutorial/data/speechfiles/0/NU-23.zipcode.wav
Correct: one oh oh oh three
Result: one oh oh oh three
(etc)
nnet.30:90: /tutorial/data/speechfiles/103/NU-10318.zipcode.wav
Correct: nine seven one one six
Result: nine seven one one six
nnet.30:91: /tutorial/data/speechfiles/103/NU-10328.streetaddr.wav
Correct: six six zero four
Result: six six zero four
(etc)
Itr  #Snt  #Words  Sub%   Ins%   Del%   WrdAcc%  SntCorr
15   91    408     3.43%  1.23%  1.47%  93.87%   79.12%
16   91    408     3.43%  1.47%  1.47%  93.63%   79.12%
17   91    408     3.43%  1.23%  1.72%  93.63%   78.02%
18   91    408     3.68%  1.23%  1.47%  93.63%   79.12%
19   91    408     3.92%  0.98%  1.23%  93.87%   78.02%
20   91    408     2.94%  1.23%  1.23%  94.61%   80.22%
21   91    408     2.94%  1.47%  1.23%  94.36%   80.22%
22   91    408     3.68%  1.47%  1.47%  93.38%   76.92%
23   91    408     3.43%  1.23%  1.23%  94.12%   80.22%
24   91    408     3.43%  0.74%  1.23%  94.61%   80.22%
25   91    408     3.68%  0.74%  1.23%  94.36%   79.12%
26   91    408     3.68%  1.23%  1.23%  93.87%   79.12%
27   91    408     3.92%  0.98%  1.23%  93.87%   79.12%
28   91    408     3.19%  0.98%  1.23%  94.61%   82.42%
29   91    408     3.19%  1.47%  1.23%  94.12%   79.12%
30   91    408     3.19%  1.23%  1.23%  94.36%   80.22%
Best results (94.61, 82.42) with network nnet.28
Evaluated 16 networks using 91 files
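The WrdAcc% column above equals 100% minus the substitution, insertion, and deletion percentages. The Python sketch below works through iteration 15 as an example. The error counts (14, 5, and 6) and the 72 correct sentences are inferred from the printed percentages of 408 words and 91 sentences; they are not printed by find_best.tcl, and the function names are hypothetical.

```python
def word_accuracy(num_words, subs, ins, dels):
    """Word accuracy in percent: 100 * (N - S - I - D) / N."""
    return 100.0 * (num_words - subs - ins - dels) / num_words

def percent(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total

# Iteration 15: counts inferred from the printed percentages
# (14 -> 3.43%, 5 -> 1.23%, 6 -> 1.47% of 408 words).
print(f"{word_accuracy(408, 14, 5, 6):.2f}")
print(f"{percent(72, 91):.2f}")
```

Because insertions count against the score, word accuracy computed this way can be lower than the fraction of reference words recognized correctly.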
Note that training on only 200 samples per category has a large influence
on results; when I trained using all available samples instead of 200 per
category (using the keyword "ALL" instead of 200 in digit.train.info),
results were more than 30% better. The drawback to training on all samples
is that nntrain takes longer. If you have time and want better
results, it is beneficial to use as much data as possible.
[Step 18] At this point, it may
be helpful to see what kinds of errors are being made. We can browse through
the development set, looking at the waveform, spectrogram, and word results
for each error. The script browse.tcl
will go through the alignment file of the best iteration and find errors.
It will then perform recognition and create a wrd
file and cat file of the result.
browse.tcl digit.dev.numbers.files \tutorial\src\recog.tcl nnet.28
digit.vocab \
digit.train.olddesc -a
wrdalign_nnet.28
Files are output using base name temp
Starting with file number 1
Found 91 files to work with.
-----
#2: NU-23.zipcode.wav
Correct: one oh oh oh three
Recognized: one oh oh oh three
//goosnargh/speech/corpora/CSLU/numbers/speechfiles/0/NU-23.zipcode.wav
//goosnargh/speech/corpora/CSLU/numbers/phnfiles/0/NU-23.zipcode.phn
-----
? -e
Searching for next error...
-----
#23: NU-10193.streetaddr.wav
Correct: ### FOUR one three four
Recognized: TWO OH one three four
//goosnargh/speech/corpora/CSLU/numbers/speechfiles/101/NU-10193.streetaddr.wav
.phn file does not exist
-----
?
In a separate DOS or unix window, while browse.tcl is still running,
start the program SpeechView in order to display the waveform,
spectrogram, and results of recognition.
speechview.tcl -update -Wf temp.wav -S -Lf temp.phn -Lf temp.cat
-Lf temp.wrd
Here, speechview is told to update the contents of the display
whenever the contents of one of the files changes. It will create a waveform
display of temp.wav, a spectrogram display of that waveform, a
label display of the hand-labeled data (if it exists), a label display
of the categories that were recognized, and a label display of the words
that were recognized. (The speechview program is part of the CSLU
Toolkit.)
Now, in the window with the browse.tcl script, you can search
for errors in recognition, and the results should be displayed automatically
in the SpeechView program.
[Step 19] Now we have finished the
first cycle of training. If we are happy with the level of performance
on the development set, we can stop the training process and evaluate on
the test set (step 27). If we do skip to the evaluation step, then it is
not permitted to re-train if we are unhappy with the test-set results.
If we want to try to improve performance on the development set, we can
do another cycle of training using force-aligned data. We can create another
info file for doing forced alignment, using the training file as a template.
This new file will be called digit.trainfa.info:
copy digit.train.info digit.trainfa.info
edit digit.trainfa.info
type digit.trainfa.info
basename: digit;
partition: trainfa;
vector_size: 130;
corpus: name: numbers
        cat_path: /tutorial/digit/numbers_trainfa
        require: wt
        force_cat: "/tutorial/src/force.tcl nnet.28 digit.vocab
                    digit.train.olddesc TXT WAV c OUT"
        partition: "{expr $ID % 5} {0 1 2}"
        filter: 1+1
        vocab: digit.vocab
        want: 200;
(Note that in the "force_cat:" field, the script and associated parameters
are specified on two lines. No special marker (such as a backslash) is
required.)
Note that we have changed the partition name (to "trainfa"), the path
for category files (to "numbers_trainfa"), and that we will now require
the existence of .txt files instead of .phn files (because we will create
labels from the text transcriptions). We also add a new field to the corpus
description, indicating that we want to do forced alignment and create
labels at the category level (as opposed to the phone or word level). Also
note that we will do forced alignment using iteration 28 from the training
we just finished, since iteration 28 had the best word-level performance.
Because we are doing forced alignment, it is no longer necessary to use
the remapping script that re-maps labels created by hand to the set of
labels used by our recognizer.
[Step 20] Now we once again find
the files we want to use for training by running find_files.tcl,
and then we generate category-level time-aligned labels by running gen_catfiles.tcl.
Next, we create new dur and counts
files, and update the duration limits in the desc and olddesc files:
find_files.tcl digit.trainfa.info corpora
gen_catfiles.tcl digit.trainfa.info corpora
find_dur.tcl digit.trainfa.info corpora digit.trainfa.dur digit.trainfa.counts
update_descdur.tcl digit.trainfa.dur digit.train.desc digit.train.olddesc
\
digit.trainfa.desc digit.trainfa.olddesc
[Step 21] Then, we repeat the training
steps to train and select the best force-aligned network:
pickframes.tcl digit.trainfa.info corpora digit.trainfa.pick
genvec.tcl digit.trainfa.olddesc digit.trainfa.pick digit.trainfa.vec
shuffle.exe -W -s 524 -r 88 digit.trainfa.vec digit.trainfa.svec
checkvec.exe digit.trainfa.svec
nntrain.exe -f fa -l -sn 88 -sv 88 -a 3 130 200 161 30 digit.trainfa.svec
find_best.tcl fa \tutorial\src\recog.tcl digit.dev.numbers.files
digit.vocab \
digit.train.olddesc fa.summary
-b 15 -g 10
Note that the weights files have the basename "fa". The results from find_best
are:
Itr  #Snt  #Words  Sub%   Ins%   Del%   WrdAcc%  SntCorr
15   91    408     3.19%  0.98%  1.47%  94.36%   79.12%
16   91    408     2.70%  0.98%  1.23%  95.10%   79.12%
17   91    408     2.94%  1.23%  1.23%  94.61%   78.02%
18   91    408     2.70%  0.98%  1.47%  94.85%   79.12%
19   91    408     2.94%  1.23%  0.74%  95.10%   81.32%
20   91    408     2.94%  1.23%  0.98%  94.85%   81.32%
21   91    408     2.21%  1.23%  0.98%  95.59%   82.42%
22   91    408     2.21%  0.98%  0.98%  95.83%   84.62%
23   91    408     2.94%  0.98%  1.23%  94.85%   80.22%
24   91    408     2.45%  1.23%  0.74%  95.59%   83.52%
25   91    408     3.19%  0.74%  1.23%  94.85%   81.32%
26   91    408     1.96%  0.98%  0.98%  96.08%   83.52%
27   91    408     2.45%  1.23%  1.23%  95.10%   81.32%
28   91    408     2.21%  0.74%  0.74%  96.32%   84.62%
29   91    408     2.45%  1.23%  0.98%  95.34%   82.42%
30   91    408     2.70%  0.74%  0.98%  95.59%   82.42%
Best results (96.32, 84.62) with network fa.28
Evaluated 16 networks using 91 files
[Step 22] To create a network using
the forward-backward method, we create a vector file using hmm_embed.tcl.
First, however, we will create a new info file, specifying that we will
deal with forward-backward training (partition name "trainfb"), that we
want to use the labels generated from forced alignment, and that we want
to use all available samples for training.
copy digit.trainfa.info digit.trainfb.info
copy digit.trainfa.numbers.files digit.trainfb.numbers.files
copy digit.trainfa.desc digit.trainfb.desc
copy digit.trainfa.olddesc digit.trainfb.olddesc
edit digit.trainfb.info
type digit.trainfb.info
basename: digit;
partition: trainfb;
vector_size: 130;
corpus: name: numbers
        cat_path: /tutorial/digit/numbers_trainfa
        require: wt
        partition: "{expr $ID % 5} {0 1 2}"
        filter: 1+1
        vocab: digit.vocab
        want: ALL;
pickframes.tcl digit.trainfb.info corpora digit.trainfb.pick
Basename: digit
Partition: trainfb
Corpus: numbers
cat_ext:
cat
txt_ext:
txt
partition:
{expr $ID % 5} {0 1 2}
cat_path:
/tutorial/digit/numbers_trainfa
txt_path:
/tutorial/data/txtfiles
phn_ext:
phn
filter: 1+1
wav_ext:
wav
cull_file:
/tutorial/digit/numbers.cull5
name: numbers
phn_path:
/tutorial/data/phnfiles
want: 1000000000
wav_path:
/tutorial/data/speechfiles
vocab: digit.vocab
require:
wt
digit.trainfb.numbers.files, 1000000000, 0
Picking frames for digit.trainfb.numbers.files...
{{$lab<&} {0 498}} {&>n {1 482}} {<.pau> {2 11160}}
hmm_embed.tcl digit 0 fa.28 digit.trainfb.info corpora digit.trainfb.pick
digit.trainfb1.vec \
digit.trainfb.olddesc
Basename: digit
Partition: trainfb
Corpus: numbers
cat_ext:
cat
txt_ext:
txt
partition:
{expr $ID % 5} {0 1 2}
cat_path:
/tutorial/digit/numbers_fa
txt_path:
/tutorial/data/txtfiles
phn_ext:
phn
filter: 1+1
wav_ext:
wav
cull_file:
/tutorial/digit/numbers.cull5
name: numbers
phn_path:
/tutorial/data/phnfiles
want: 1000000000
wav_path:
/tutorial/data/speechfiles
vocab: digit.vocab
require:
wt
prune: 300.0
minmodel: 10.0
mincount: 0
/tutorial/digit/numbers_trainfa/0/NU-25.zipcode.cat
Utterance prob per frame: -2.104426
/tutorial/digit/numbers_trainfa/0/NU-30.zipcode.cat
Utterance prob per frame: -1.293738
/tutorial/digit/numbers_trainfa/0/NU-46.streetaddr.cat
Utterance prob per frame: -1.190075
(etc)
[Step 23] Then we shuffle this vector
file using shuffle.exe and check it for errors using hnncheckvec.exe.
This time, the header size of the vector file is 8 bytes, and the vector
size is 544 bytes (130 features x 4
bytes + 3 targets x 4 bytes + 3 target
values x 4 bytes):
shuffle.exe -W -r 88 -h 8 -s 544 digit.trainfb1.vec digit.trainfb1.svec
hnncheckvec.exe digit.trainfb1.svec
numvec: 93752
numactive: 3
numinputs: 130
0: 1131
1: 1196
2: 32393
3: 2024
4: 1359
5: 1336
(etc)
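The record sizes passed to shuffle.exe with -s follow directly from the number of 4-byte values per vector. The short Python sketch below reproduces both sizes used in this example. The helper record_size is hypothetical, and the final file-size estimate assumes that the file holds nothing but the 8-byte header followed by the records, which the tutorial does not state.

```python
BYTES_PER_VALUE = 4

def record_size(num_features, num_targets, num_target_values=0):
    """Size in bytes of one vector made of 4-byte values."""
    return BYTES_PER_VALUE * (num_features + num_targets + num_target_values)

print(record_size(130, 1))       # vector file shuffled in Step 14
print(record_size(130, 3, 3))    # forward-backward vector file

# Rough size estimate, assuming only a header followed by the records.
header, numvec = 8, 93752
print(header + numvec * record_size(130, 3, 3))
```

The three targets and three target values per vector correspond to the "numactive: 3" line printed by hnncheckvec.exe.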
[Step 24] Next, we train on this vector
file using hnntrain.exe:
hnntrain.exe -l -sn 88 -sv 88 -f fb1 -a 3 130 200 161 30 digit.trainfb1.svec
creating net with seed 88
numvec: 93752
numactive:3
(etc)
[Step 25] Finally, we select the best
iteration using find_best.tcl:
find_best.tcl fb1 \tutorial\src\recog.tcl digit.dev.numbers.files
digit.vocab \
digit.trainfb.olddesc
fb1.summary -b 15 -g 10
Itr  #Snt  #Words  Sub%   Ins%   Del%   WrdAcc%  SntCorr
15   91    408     2.94%  1.23%  0.98%  94.85%   81.32%
16   91    408     2.21%  0.98%  1.23%  95.59%   82.42%
17   91    408     2.45%  0.98%  1.47%  95.10%   82.42%
18   91    408     1.96%  0.98%  0.98%  96.08%   84.62%
19   91    408     1.96%  0.98%  0.74%  96.32%   85.71%
20   91    408     2.45%  0.98%  0.98%  95.59%   82.42%
21   91    408     2.21%  0.98%  1.47%  95.34%   81.32%
22   91    408     1.96%  0.74%  0.98%  96.32%   85.71%
23   91    408     1.96%  0.98%  0.98%  96.08%   83.52%
24   91    408     2.21%  0.98%  1.47%  95.34%   82.42%
25   91    408     2.21%  0.98%  0.74%  96.08%   85.71%
26   91    408     2.70%  0.98%  1.47%  94.85%   81.32%
27   91    408     2.21%  0.98%  0.98%  95.83%   82.42%
28   91    408     2.94%  1.23%  1.23%  94.61%   79.12%
29   91    408     2.45%  0.98%  0.98%  95.59%   82.42%
30   91    408     1.72%  0.98%  1.23%  96.08%   84.62%
Best results (96.32, 85.71) with network fb1.22
Evaluated 16 networks using 91 files
Although the word-level accuracy is the same as with force-aligned training,
the sentence-level accuracy is 7% better.
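The WrdAcc% column in the table above is consistent with the usual definition of word accuracy: 100% minus the substitution, insertion, and deletion rates. The following is a minimal sketch assuming that definition. The error counts of 12, 5, and 4 are inferred from the percentages reported for iteration 15 (they are not printed by find_best.tcl), and the function name is illustrative.

```python
def word_accuracy(num_words: int, subs: int, ins: int, dels: int) -> float:
    """Word accuracy in percent: words minus all three error types."""
    return 100.0 * (num_words - subs - ins - dels) / num_words


if __name__ == "__main__":
    # Iteration 15 of fb1: 408 words; 12 substitutions, 5 insertions and
    # 4 deletions are counts inferred from 2.94%, 1.23% and 0.98%.
    print(round(word_accuracy(408, 12, 5, 4), 2))
```

Because insertions count against the score, this measure can fall below the plain percentage of reference words recognized correctly.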
[Step 26] We then repeat the cycle
of forward-backward training one more time:
hmm_embed.tcl digit 1 fb1.22 digit.trainfb.info corpora digit.trainfb.pick \
digit.trainfb2.vec digit.trainfb.olddesc
shuffle.exe -W -r 88 -h 8 -s 544 digit.trainfb2.vec digit.trainfb2.svec
hnncheckvec.exe digit.trainfb2.svec
hnntrain.exe -l -sn 88 -sv 88 -f fb2 -a 3 130 200 161 30 digit.trainfb2.svec
find_best.tcl fb2 \tutorial\src\recog.tcl digit.dev.numbers.files digit.vocab \
digit.trainfb.olddesc fb2.summary -b 15 -g 10
Itr  #Snt  #Words  Sub%   Ins%   Del%   WrdAcc%  SntCorr
15   91    408     2.21%  1.23%  1.23%  95.34%   81.32%
16   91    408     1.47%  0.98%  1.23%  96.32%   84.62%
17   91    408     2.45%  1.23%  1.47%  94.85%   82.42%
18   91    408     2.45%  0.98%  1.23%  95.34%   82.42%
19   91    408     2.45%  0.98%  1.23%  95.34%   82.42%
20   91    408     2.45%  0.74%  1.23%  95.59%   83.52%
21   91    408     1.96%  0.98%  1.47%  95.59%   82.42%
22   91    408     2.21%  0.98%  1.23%  95.59%   83.52%
23   91    408     1.72%  0.98%  1.23%  96.08%   83.52%
24   91    408     2.21%  0.98%  1.23%  95.59%   84.62%
25   91    408     1.72%  0.98%  0.98%  96.32%   85.71%
26   91    408     2.45%  0.98%  1.23%  95.34%   82.42%
27   91    408     1.96%  0.74%  1.47%  95.83%   83.52%
28   91    408     1.96%  1.23%  1.23%  95.59%   82.42%
29   91    408     1.96%  1.23%  1.23%  95.59%   83.52%
30   91    408     2.45%  0.98%  1.23%  95.34%   83.52%
Best results (96.32, 85.71) with network fb2.25
Evaluated 16 networks using 91 files
In this case, the two forward-backward networks have the same performance,
so further training will probably not yield better results. Usually, two
cycles of forward-backward training are enough.
[Step 27] The resulting network
is the final network. The last step is to evaluate this network on the
test set:
find_best.tcl fb2 \tutorial\src\recog.tcl digit.test.numbers.files digit.vocab \
digit.trainfb.olddesc test.summary -o 25 -g 10
5. File Formats
In the following file formats, text in fixed-font bold
is a keyword that must be used verbatim. Italicized items in brackets <>
must be substituted with the proper values.
wav file
A wav file contains the speech waveform that is to be trained on or
recognized. The format for wav files in the CSLU Toolkit is (unfortunately,
sometimes) not the Microsoft .wav format; it is the NIST Sphere
ulaw format. This format is described at http://vision1.cs.umr.edu/~johns/links/music/audiofile2.html
and there is software available on the WWW for converting waveform files
into different formats.
txt file
A txt file contains a text transcription of the words in a speech waveform.
This file is simply an ASCII file containing the words separated by spaces,
and it can be created by any text editor that outputs ordinary .txt files.
label files (.phn, .cat, .wrd)
Label files, which usually have the extension .phn, .cat, or .wrd,
contain time-aligned labels of a waveform utterance. If the file has the
extension .phn, then the labels are phonetic labels; if the file has the
.cat extension, then the labels are neural-network output categories (context-dependent
sub-phone units); and if the file has the extension .wrd, then the labels
are words. A label file has the following format:
MillisecondsPerFrame: <value>
END OF HEADER
<begin_time_1> <end_time_1> <label_1>
<begin_time_2> <end_time_2> <label_2>
...
<begin_time_n> <end_time_n> <label_n>
where:
- <value> is the number of milliseconds in one frame of speech (usually
  this value is 1.0).
- <begin_time> is the time at which <label> starts.
- <end_time> is the time at which <label> ends.
- <label> is the word, phone, or category label for the segment of speech.
The values for <begin_time> and <end_time> are measured
in frames (so if <value> is 1.0, then time is measured in milliseconds;
if <value> is 10.0, then time is measured in centiseconds).
The <end_time> of one label is usually the same as the <begin_time>
of the next label.
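As an illustration of the label-file layout described above, here is a minimal Python sketch that reads such a file and converts frame counts to milliseconds. It is not part of the CSLU scripts; the function and class names are hypothetical, and the sample labels are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Label:
    begin_ms: float
    end_ms: float
    text: str


def parse_label_file(content: str):
    """Return (ms_per_frame, labels) for the text of a label file."""
    lines = iter(content.splitlines())
    ms_per_frame = None
    # Header: look for MillisecondsPerFrame, stop at END OF HEADER.
    for line in lines:
        line = line.strip()
        if line.startswith("MillisecondsPerFrame:"):
            ms_per_frame = float(line.split(":", 1)[1])
        elif line == "END OF HEADER":
            break
    if ms_per_frame is None:
        raise ValueError("missing MillisecondsPerFrame header")
    labels = []
    # Body: <begin_time> <end_time> <label>, with times given in frames.
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        begin, end, text = parts
        labels.append(Label(float(begin) * ms_per_frame,
                            float(end) * ms_per_frame,
                            text.strip()))
    return ms_per_frame, labels


if __name__ == "__main__":
    sample = "MillisecondsPerFrame: 10.0\nEND OF HEADER\n0 12 .pau\n12 20 f\n"
    frame_ms, segments = parse_label_file(sample)
    for seg in segments:
        print(seg.begin_ms, seg.end_ms, seg.text)
```

Multiplying by <value> puts every file on the same millisecond scale, regardless of the frame size it was labeled with.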
corpora file
The corpora file contains descriptions of all corpora:
<corpus1 description>
<corpus2 description>
<corpus3 description>
...
where a <corpus description> has the following format:
corpus: <corpus_name>
wav_path <path_to_wav_files>
phn_path <path_to_phn_files>
txt_path <path_to_txt_files>
format <regular_expression_for_parsing_filenames>
wav_ext <extension_for_wav_files>
phn_ext <extension_for_phn_files>
txt_ext <extension_for_txt_files>
cat_ext <extension_for_cat_files>
cull_file <location_of_cull_file>
ID:
<Tcl_code_for_determining_caller_ID>
where:
- <corpus_name> is a name used to describe the corpus. The format for
  <corpus_name> is the same as for any Tcl variable name.
- <path_to_wav_files> is the full path to the directory containing waveform
  files. It is assumed that this directory contains sub-directories, and
  that the actual files are in these sub-directories.
- <path_to_phn_files> is the full path to the directory containing
  time-aligned phonetic label files. It is assumed that this directory
  contains sub-directories, and that the actual files are in these
  sub-directories.
- <path_to_txt_files> is the full path to the directory containing text
  transcription files. It is assumed that this directory contains
  sub-directories, and that the actual files are in these sub-directories.
- <regular_expression_for_parsing_filenames> is a regular expression,
  enclosed in curly braces {}, that will succeed when used to parse the
  base name of a file that belongs in the corpus. It can also be used to
  extract the call number from the filename, for use in determining the
  caller ID.
- <extension_for_wav_files> is the filename extension for waveform files.
  Usually, the value is "wav".
- <extension_for_phn_files> is the filename extension for time-aligned
  phonetic label files. Usually, the value is "phn".
- <extension_for_txt_files> is the filename extension for text
  transcription files (without time alignment). Usually, the value is "txt".
- <extension_for_cat_files> is the filename extension for time-aligned
  category files, where the categories correspond to outputs of the neural
  network (such as $nas<E). These files are usually generated automatically.
- <location_of_cull_file> is the full path and filename for the cull file,
  if one exists for this corpus.
- <Tcl_code_for_determining_caller_ID> is Tcl code, enclosed in curly
  braces {}, for determining the caller ID. To make this possible, the code
  can reference two variables: $format, which is the regular expression
  given above, and $filename, which is the base name of a waveform file.
  The code must store its result in the variable "ID".
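The interplay between the filename format, the caller ID, and a partition rule such as {expr $ID % 5} {0 1 2} can be sketched in Python. This is an illustration only: the real corpora file uses Tcl, the regular expression below is a hypothetical pattern for base names like NU-25.zipcode, and the reading of the partition rule (a file belongs to the partition when the expression's value is in the list) is an assumption.

```python
import re

# Hypothetical pattern: "NU-", a call number, a dot, then a prompt name.
HYPOTHETICAL_FORMAT = r"NU-(\d+)\.[a-z]+"


def caller_id(filename: str, pattern: str = HYPOTHETICAL_FORMAT):
    """Return the call number as the caller ID, or None if the name
    does not belong to the corpus (the pattern fails to match)."""
    match = re.fullmatch(pattern, filename)
    if match is None:
        return None
    return int(match.group(1))


def in_partition(cid: int, modulus: int, wanted: set) -> bool:
    """Assumed meaning of a rule like {expr $ID % 5} {0 1 2}."""
    return cid % modulus in wanted


if __name__ == "__main__":
    for name in ["NU-25.zipcode", "NU-46.streetaddr", "NU-48.zipcode"]:
        cid = caller_id(name)
        print(name, cid, in_partition(cid, 5, {0, 1, 2}))
```

Deriving the partition from the caller ID rather than from the individual file keeps all utterances from one call in the same partition. Note that re.fullmatch requires the whole name to match, which may be stricter than the Tcl regexp used by the scripts.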
cull file
A cull file contains a list of waveform files that won't be used for
training, development, or testing. The purpose is to have a set of files
available for third-party evaluation. The format of a cull file is:
<wav_file_1>
<wav_file_2>
...
<wav_file_n>
where:
- <wav_file> is a waveform file that will not be used for training,
  development, or testing.
info file
An info file has the following format:
basename: <base_name> ;
partition: <partition_name> ;
vector_size: <vector_size> ;
min_samp: <minimum_number_of_samples> ;
corpus: <corpus1 information> ;
corpus: <corpus2 information> ;
corpus: <corpus3 information> ;
...
The number of corpora specified in an info file is theoretically unlimited,
but there must be at least one. Note that each field has a semicolon at
the end.
- <base_name> is the name of the recognizer and is the basename of many
  files associated with the recognizer, such as the "desc", "olddesc", and
  "rr" files. Typically, this name will indicate the task being trained on
  ("digits", for example).
- <partition_name> is a description of how the data will be used. Typical
  partition names are "train", "dev", "fa" (for forced alignment), "fb1"
  and "fb2" (for forward-backward 1 and 2), and "test".
- <vector_size> is an integer value for the number of inputs to the neural
  network. The typical value is 130.
- <minimum_number_of_samples> is the minimum number of samples (or vectors)
  that are requested for each category. This is only meaningful when more
  than one corpus is being used, because if there is only one corpus, then
  the scripts will automatically try to find the desired number of samples.
  If there is more than one corpus, the scripts will try to obtain at least
  the minimum number that is specified. If only one corpus is being used,
  this field may be omitted.
- <corpus information> contains information about how to use files in the
  particular corpus. This field has the following format:
name: <corpus_name>
cat_path: <path_to_cat_files>
require: <type_of_files_that_are_required>
wav_list: <list_of_wav_files_to_use>
phn_list: <list_of_phn_files_to_use>
txt_list: <list_of_txt_files_to_use>
partition: <Tcl_code_and_list>
filter: <filter_for_skipping_files>
vocab: <vocab_file>
force_phn: <script_for_forced_alignment_at_phonetic_level>
force_cat: <script_for_forced_alignment_at_category_level>
remap: <script_for_remapping_hand_labels>
want: <desired_number_of_samples_per_category>
where all fields are optional except for "name:" and "partition:".
- <corpus_name> is the name of the corpus to be trained on. This name must
  match one of the corpus names in the corpora file.
- <path_to_cat_files> is the full path to the directory where time-aligned
  categories will be stored. This directory will be created during the
  training process.
-
<type_of_files_that_are_required>