Johan Schalkwyk, Xintian Wu and Yonghong Yan
Center for Spoken Language Understanding (CSLU)
Oregon Graduate Institute of Science & Technology
December 17, 1997
The CSLU-HMM development environment (CSLU-HMM) is a collection of modular building blocks that aims to provide the user with an easy-to-use, powerful research and development environment for the construction of state-of-the-art HMM and neural network hybrid recognizers. Based on the CSLU shell [1] (CSLUsh), which uses Tcl/Tk to provide a scripting environment, CSLU-HMM provides a flexible environment in which the user can shape the existing modules to meet specific needs. CSLU-HMM is released as part of the CSLU Toolkit and may be downloaded for non-commercial use from http://www.cse.ogi.edu/CSLU/toolkit.
Figure 1.1 presents the software architecture of CSLU-HMM. Designed as an extension to CSLUsh, CSLU-HMM draws on a wide variety of pre-existing modules for distributed computing, speech signal processing, mathematical operations and miscellaneous utilities to provide a complete HMM development environment.
The foundation of CSLUsh (and CSLU-HMM) is a set of efficient C libraries (CSLU-C) for the functions which support the basic algorithmic operations and associated utilities. These libraries can be used directly with a documented C API to build applications.
The extensible scripting language Tcl is used to access the same functionality at a higher level, by creating core components (Tcl packages) which "glue" the basic operations together according to a well-defined API. The packages are dynamically loaded as needed, which keeps the footprint of an implementation small.
Some of the more advanced techniques under development are:
Parameter tying may be done at the model, state, mixture component, mean and/or covariance level. To facilitate the construction of HMM recognizers for medium to large vocabulary tasks, CSLU-HMM also supports decision tree state clustering [5].
CSLU-HMM also includes support for the design and optimization of neural-network hybrid recognizers. The embedded re-estimation algorithm provided with the main HMM library may be used to re-estimate neural network targets [6].
HMM models are created and configured using CSLU-HMM configuration scripts (hscript). Figure 1.2 presents an example. In this example mono-phone models are created for the phonemes /w/, /ah/, /n/, /sil/ and /sp/. The short pause model (/sp/) is then tied to the center state of the silence model (/sil/). The HMM configuration language provides a mechanism with which one can specify new HMM models and also edit existing HMM models. The collection of such scripts documents the process of building a recognizer in an easily readable and understandable format.

All the above-mentioned functionality is integrated within the CSLUsh environment. Stand-alone applications such as HMM embedded training are simply CSLUsh (extended Tcl/Tk) scripts which read a set of predefined input files and perform embedded training. Since the core HMM functionality is implemented as extensions to the base scripting language, the CSLU-HMM environment is fully programmable. The user can thus change these applications with relative ease to meet specific needs, rather than comply with a predefined interface.
Figure 1.3 shows the outline of the typical process of training an HMM model for speech recognition. Each of these steps and the associated CSLU-HMM scripts are discussed.
The arguments to the scripts are divided into command line parameters and command line options. A typical script would be invoked as follows:
system prompt% hmmtool.tcl param1 param2 -option1 x -option2 y
The command line options are used to configure the experimental setup. These configuration parameters may also be specified using the -config option. When the -config option is specified, the default values for the command line options and parameters are read from the configuration file. CSLU-HMM uses the Tcl associative array as an interface for specifying command line parameters. The following configuration file illustrates this process.
experiment.cfg
--------------
set config(tool,option1) x
set config(tool,option2) y
set param(tool,param1) param1
set param(tool,param2) param2
With these variables defined, the script would be invoked as follows for these to take effect:
system prompt% hmmtool.tcl - - -config experiment.cfg
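The lookup logic described above can be sketched in Python. This is only an illustration under stated assumptions: the real scripts presumably load the configuration file as Tcl, and the reading that a `-` on the command line means "take the value from the configuration file" is inferred from the invocation shown above. The function names `read_defaults` and `resolve` are hypothetical.

```python
import re

# Matches lines such as: set config(tool,option1) x
SET_LINE = re.compile(r"^set\s+(config|param)\((\w+),(\w+)\)\s+(\S+)$")


def read_defaults(text):
    """Collect {array: {(tool, name): value}} from Tcl-style 'set' lines."""
    defaults = {"config": {}, "param": {}}
    for line in text.splitlines():
        match = SET_LINE.match(line.strip())
        if match:
            array, tool, name, value = match.groups()
            defaults[array][(tool, name)] = value
    return defaults


def resolve(cli_value, default):
    """Assumption: '-' on the command line selects the configured default."""
    return default if cli_value == "-" else cli_value


CFG = """set config(tool,option1) x
set config(tool,option2) y
set param(tool,param1) param1
set param(tool,param2) param2"""

defaults = read_defaults(CFG)
# 'hmmtool.tcl - - -config experiment.cfg': both parameters fall back to the file.
param1 = resolve("-", defaults["param"][("tool", "param1")])
param2 = resolve("-", defaults["param"][("tool", "param2")])
```

A value given explicitly on the command line would simply win over the stored default in this sketch.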
One of the most important parts of building a good speech recognizer has to do with processing the data. Building HMM models requires a set of speech waveform files and their associated word and/or phoneme transcriptions.
The steps needed to process the data and transcriptions into the required formats for training and testing are inevitably specific to the layout and contents of the data. In the tutorial that follows (Chapter 2) some of these steps are illustrated using the Tcl scripting language to create a series of scripts which transform the transcriptions into the required formats.
Given the speech waveform files, one of the major data preparation steps is to parameterize the speech data. The CSLU-HMM script genfeat.tcl extracts the baseline features for training HMM models. genfeat.tcl reads a list of files from a master transcription file and computes the baseline features for each of the files specified. The computed features are stored in a disk cache and indexed for fast retrieval. With the exception of the model initialization and single model training scripts, all CSLU-HMM tools interface via the master transcription file.
For phoneme (sub-word) based speech recognizers it is often necessary to have phonetically labeled data in order to create initial seed models, which can in turn be used to label data that has not been phonetically aligned. For English a wide variety of resources is available, such as the TIMIT and NTIMIT corpora.
Figure 1.4 depicts a graphical overview of the model initialization.
Model initialization is done in a three-step process, starting with the alignments obtained from the initial labeled data. The CSLU-HMM script pickdata.tcl reads the alignment information and selects a subset of the available data for model initialization and model training.
With the data selected, the model parameters are initialized using the CSLU-HMM script hmminit.tcl. Data (segments) from a particular class are loaded into memory. Each segment is then cut into equal-sized sub-segments, depending on the total number of states for the model. The data allocated to a particular HMM state are then combined, and the initial mixture means are estimated from them using the vector quantization (VQ) algorithm. The mixture covariances are set to the pooled covariance of the data. During this first parameter initialization step, the mixture weights and state transition probabilities are not estimated.
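The equal segmentation step can be illustrated with a minimal Python sketch. It is not the hmminit.tcl implementation: the features are scalars rather than vectors, a single mean per state stands in for the VQ step, and the function names are hypothetical.

```python
def split_equal(frames, num_states):
    """Cut one segment into num_states nearly equal contiguous sub-segments."""
    n = len(frames)
    bounds = [(i * n) // num_states for i in range(num_states + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(num_states)]


def pool_by_state(segments, num_states):
    """Combine, per state, the frames that all segments allocate to that state."""
    pooled = [[] for _ in range(num_states)]
    for segment in segments:
        for state, part in enumerate(split_equal(segment, num_states)):
            pooled[state].extend(part)
    return pooled


def mean_and_variance(values):
    """Maximum-likelihood mean and variance of the pooled frames of one state."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


# Two toy segments of one class, to be shared among 3 emitting states.
segments = [[1.0, 1.0, 5.0, 5.0, 9.0, 9.0], [2.0, 6.0, 10.0]]
pooled = pool_by_state(segments, 3)
initial = [mean_and_variance(frames) for frames in pooled]
```

Each state receives the same fraction of every segment regardless of what was actually spoken in it, which is why the subsequent realignment step is needed.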
The parameter estimates are further improved upon by performing a Viterbi state realignment. During this phase the allocation of data to a particular model state is determined by the state alignment rather than the initial equal sized sub segments.
During model initialization the state parameter estimates assume a discrete allocation of data to each state in the HMM model. This hard allocation of data to states is often not optimal. Feature frames close to a state transition are difficult to associate with a single state; rather, there is a soft (probabilistic) association with states. The forward/backward algorithm, in contrast to the Viterbi state realignment, computes this probabilistic association between feature frames and HMM states.
These probabilistic associations are used in conjunction with the expectation/maximization (EM) algorithm to further improve upon the initial parameter estimates. For each segment the parameter accumulator variables are updated. The accumulator variables store the cumulative contribution of each data vector to each mixture of each state. From these statistics the model parameters are computed.
The CSLU-HMM script hmmtrain.tcl encapsulates this process. Together with model initialization, these two training techniques are used to form a "seed" model which is used as the basis for training the recognizer.
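The soft association can be made concrete with a small forward/backward sketch in Python. The two-state model, its transition matrix and the per-frame emission likelihoods are made-up toy values, and the code omits the scaling a real implementation needs for long utterances; it is not taken from CSLU-HMM.

```python
def state_occupancy(start, trans, emit):
    """Return gamma[t][j] = P(state j at frame t | all frames)."""
    num_frames, num_states = len(emit), len(start)
    alpha = [[0.0] * num_states for _ in range(num_frames)]
    beta = [[1.0] * num_states for _ in range(num_frames)]
    for j in range(num_states):
        alpha[0][j] = start[j] * emit[0][j]
    for t in range(1, num_frames):  # forward pass
        for j in range(num_states):
            reach = sum(alpha[t - 1][i] * trans[i][j] for i in range(num_states))
            alpha[t][j] = emit[t][j] * reach
    for t in range(num_frames - 2, -1, -1):  # backward pass
        for i in range(num_states):
            beta[t][i] = sum(trans[i][j] * emit[t + 1][j] * beta[t + 1][j]
                             for j in range(num_states))
    gamma = []
    for t in range(num_frames):
        total = sum(alpha[t][j] * beta[t][j] for j in range(num_states))
        gamma.append([alpha[t][j] * beta[t][j] / total for j in range(num_states)])
    return gamma


def weighted_mean(gamma, frames, state):
    """Accumulate each frame's soft contribution to one state's mean."""
    numerator = sum(g[state] * x for g, x in zip(gamma, frames))
    denominator = sum(g[state] for g in gamma)
    return numerator / denominator


START = [1.0, 0.0]                            # always begin in state 0
TRANS = [[0.6, 0.4], [0.0, 1.0]]              # left-to-right, no skips
EMIT = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]   # toy likelihoods per frame and state
FRAMES = [1.0, 2.0, 3.0]                      # toy scalar features

gamma = state_occupancy(START, TRANS, EMIT)
mean0 = weighted_mean(gamma, FRAMES, 0)
```

The middle frame, which both states explain equally well, is shared between them instead of being assigned wholly to one, which is the difference from a Viterbi alignment.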
The hand-labeled phonetic data are typically only sufficient to create "seed" models for a phoneme-based recognizer. Building a more accurate and more robust recognizer requires much more data. Most corpora contain word level transcriptions, which may be used to augment the existing training data, by
The CSLUsh script hmmscribe.tcl computes the word, phoneme and/or state alignments. The input word transcriptions are used to create a finite state grammar in which each node or state contains a word and its pronunciation variants, obtained from a pronunciation lexicon. The standard Viterbi algorithm is then used to find the best possible path through the grammar.
Because the pronunciation lexicon can have more than one pronunciation per word, transcriptions are derived from the known orthography by using the initial bootstrap models to do a forced recognition of each training utterance.
hmmscribe.tcl reads a standard master transcription file. For each word in the transcription the pronunciation variants are read from a pronunciation database. The CSLUsh script worddb.tcl reads a text file of word and pronunciation tuples, and creates the database.
Model initialization and single model training only use the data associated with the particular model. During these training steps it is assumed that the phonetic boundaries are defined and that there is no interaction between neighboring models. Embedded parameter estimation (hmmembed.tcl) addresses these problems by creating a composite model from the associated model transcriptions. The composite model is then used to compute the model and state alignments, from which the model parameters are updated.

The CSLUsh script genmodel.tcl converts a word transcription file to an associated mono-phone or triphone transcription file. Given the input pronunciation database, each word is expanded into either a mono-phone or triphone representation. The resulting model transcriptions are then used for embedded training.
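The word-to-model expansion can be sketched as follows. This is an illustration only, not genmodel.tcl: the two-entry lexicon is a hypothetical example, the `left-phone+right` naming of triphones is an HTK-style convention assumed here for readability, and the sketch takes the first pronunciation of each word.

```python
# Hypothetical lexicon: one pronunciation per word, phones as model names.
LEXICON = {"one": ["w", "ah", "n"], "six": ["s", "ih", "ks"]}


def to_monophones(words, lexicon):
    """Replace every word by the phone sequence of its pronunciation."""
    phones = []
    for word in words:
        phones.extend(lexicon[word])
    return phones


def to_triphones(phones):
    """Name each phone together with its left and right neighbours."""
    names = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i + 1 < len(phones) else ""
        names.append(left + phone + right)
    return names


mono = to_monophones(["one", "six"], LEXICON)
tri = to_triphones(mono)
```

The composite model used for embedded training is then the concatenation of the models named in such a transcription.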
The three training steps described above create a series of HMM models. The question which remains to be answered is which model will give the best performance when evaluated on previously unseen data.
To set up the evaluation process we first need to construct a task grammar and the associated search network used by the Viterbi decoder. Figure 1.5 graphically depicts the task grammar for a continuous spoken digit recognizer. This grammar can be used to recognize any number of spoken digits, with an optional silence between words.
The CSLUsh script buildsearch.tcl uses both the grammar specification and word pronunciations to create a search network.
With the search network in place, recognition is done using the CSLU-HMM script hmmsearch.tcl. hmmsearch.tcl supports full context dependent triphone modeling, with word, phoneme, or state level alignment. By default only the first best answer is returned. However, a word lattice which contains multiple hypotheses is also available. Once all files have been processed, the input transcriptions are compared to the recognition answers to evaluate the performance of the recognizer. The CSLU-HMM script hmmscore.tcl uses the NIST scoring algorithm to compare the recognition answers with the true answers provided.
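The comparison of reference and recognized word strings rests on a minimum edit distance alignment. The Python sketch below shows that general technique with unit costs for substitutions, deletions and insertions; it is not the hmmscore.tcl code, and the NIST tools may weight the three error types differently.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    # cost[i][j] = (errors, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)            # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)            # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)      # match, no new error
            else:
                diagonal = (e + 1, s + 1, d, n)
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            cost[i][j] = min(diagonal, deletion, insertion)
    return cost[-1][-1][1:]


deleted = align_counts("one two three".split(), "one three".split())
inserted = align_counts("one two".split(), "one oh two".split())
substituted = align_counts(["six"], ["five"])
```

Summing these counts over all utterances gives the totals from which the word-level scores are computed.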
This chapter shows how to use each of the HMM training utilities provided, by building a continuous digit recognizer. The tutorial files may be downloaded from .....
Figure 2.1 depicts the grammar for a spoken digit recognizer. This grammar can be used to recognize any number of spoken digits, with an optional silence between words.
The CSLU-HMM development environment uses a grammar definition language similar to the HTK definition language to specify task grammars. For our digit recognizer the following grammar is used:
$digit = one | two | three | four | five | six |
seven | eight | nine | zero | oh;
[sil] <$digit [sil]>;
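To make the meaning of the grammar concrete, the following Python sketch checks whether a word sequence is accepted by it. It assumes the HTK-like reading of the notation, in which `[...]` marks an optional item and `<...>` marks one or more repetitions; the regular expression is an illustration and plays no part in CSLU-HMM itself.

```python
import re

DIGITS = ["one", "two", "three", "four", "five", "six",
          "seven", "eight", "nine", "zero", "oh"]
DIGIT = "(?:" + "|".join(DIGITS) + ")"

# [sil] <$digit [sil]>  ->  optional sil, then one or more (digit, optional sil).
# Every word is followed by a single space so that words cannot run together.
PATTERN = re.compile(r"(?:sil )?(?:%s (?:sil )?)+" % DIGIT)


def accepts(words):
    """True if the word sequence is a sentence of the digit grammar."""
    return PATTERN.fullmatch(" ".join(words) + " ") is not None


ok = accepts(["sil", "one", "two", "sil", "oh"])
silence_only = accepts(["sil"])
```

Under this reading at least one digit is required, and two consecutive silences are not allowed.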
For this task we will use the 30k Numbers corpus. The Numbers corpus is a collection of spontaneous ordinal and cardinal numbers, continuous digit strings and isolated digit strings. The utterances were taken from numerous CSLU telephone speech data collection efforts. Release 1.0 of the 30,000 Numbers Corpus contains 15,000 files. Each file has an orthographic transcription; about 7,000 have a phonetic transcription. For the purpose of this tutorial only the utterances which contain digits are used.
Figure 2.2 depicts the directory structure of the Numbers corpus. The speech waveform files are stored in the speechfiles subdirectory. For each speech file there is an associated word transcription file which can be found in the txtfiles directory. These transcriptions are used to select the set of files which only contain digit strings. The phnfiles directory contains phonetic hand-labels for a subset (7000 files) of the data.
The word pronunciation models for each of the words defined by the grammar are given in Table 2.1. These are defined using the ARPA phonetic alphabet.
From the pronunciation models we define the first set of HMM models needed for the task. The configuration script digit.desc creates a set of standard 3-state left-to-right HMM models for each of the phones needed. For the word six (pronounced "s ih k s") the phonemes "k" and "s" are merged, resulting in a special "ks" model.
system prompt% pwd
TUTORIAL/digit/mono
system prompt% cat digit.desc
#!hscript
#
outputmodel "digit.0";
vecsize 39;
prototype mono numstate 5 mixtures 4 transp
0.000 1.000 0.000 0.000 0.000
0.000 0.600 0.400 0.000 0.000
0.000 0.000 0.500 0.500 0.000
0.000 0.000 0.000 0.600 0.400
0.000 0.000 0.000 0.000 0.000;
define mono <z> <ih> <r> <ow>
<w> <ah> <n> <t>
<uw> <th> <iy> <f>
<aor> <ay> <v> <s>
<ks> <eh> <ey> <sil>;
Executing the model configuration script results in the files digit.0, digit.list and digit.rr. These files describe the initial model configuration.
system prompt% hscript digit.desc
opening file: digit.desc
Input model:
Output model: digit.0 digit.list digit.rr
Checking HMM model
minimum mixute weight: 10000000000.000000
stateNum = 60 transpNum = 20 meanNum = 240 varNum: 240
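The counts reported by hscript follow from the prototype. The Python sketch below reproduces them under two assumptions that are not stated in the output itself: the first and last of the five prototype states are non-emitting entry and exit states (consistent with their rows in the transition matrix and with the description of the models as 3-state), and every emitting state carries one mean and one variance vector per mixture.

```python
def model_set_size(num_models, numstate, mixtures):
    """Count the parameter objects created for a set of identical prototypes."""
    emitting = numstate - 2          # assumption: entry and exit states do not emit
    states = num_models * emitting
    return {
        "stateNum": states,
        "transpNum": num_models,     # one transition matrix per model
        "meanNum": states * mixtures,
        "varNum": states * mixtures,
    }


# digit.desc defines 20 models with 'numstate 5 mixtures 4'.
digit_counts = model_set_size(num_models=20, numstate=5, mixtures=4)
```

With 39-dimensional feature vectors, each of the mean and variance objects counted here holds 39 values.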
This section presents the initial data preparation steps needed to build a continuous digit recognizer.
In this tutorial the training set consists of 3/5 of the total data, and the sum of all development sets is 1/5 of the total data. The test set is also 1/5 of the total data.
The "total data" is not actually all of the digit files in the Numbers corpus: 5% of all files (speaker-independent) have been removed (culled) from the available files to form a final test set. This final 5% test set is not to be used by the researcher in evaluation; it is to be used by an independent party and run only once on a given system (the normal test set is also to be run only once on a given system, but the researchers may do so themselves).
There are four development sub-sets taken from the available development data. The first sub-set was generated by finding all files in the development set that have phonetic labels. The other three development sub-sets were generated by splitting up all available development files (1/5 of all data) into three parts. Note that these three development sub-sets are speaker-independent, in that a call number will appear in only one of the sets. (In other words, you won't have NU-78.zipcode.wav in one sub-set and NU-78.other1.wav in another sub-set, because the call number 78 is from one person.)
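One way to obtain such a speaker-independent split is to group the files by call number before distributing them. The Python sketch below is an assumption about how this could be done, not the procedure actually used for the tutorial file lists; the round-robin assignment and the function names are hypothetical.

```python
import re
from collections import defaultdict


def call_number(filename):
    """Extract the call number, e.g. 78 from 'NU-78.zipcode.wav'."""
    return int(re.match(r"NU-(\d+)\.", filename).group(1))


def split_by_call(filenames, num_subsets):
    """Distribute files so that all files of one call land in the same sub-set."""
    by_call = defaultdict(list)
    for name in filenames:
        by_call[call_number(name)].append(name)
    subsets = [[] for _ in range(num_subsets)]
    for index, call in enumerate(sorted(by_call)):
        subsets[index % num_subsets].extend(by_call[call])
    return subsets


FILES = ["NU-78.zipcode.wav", "NU-78.other1.wav", "NU-25.zipcode.wav",
         "NU-30.zipcode.wav", "NU-31.zipcode.wav"]
subsets = split_by_call(FILES, 3)
```

Because whole calls are assigned, the sub-sets may differ slightly in size, which is the price of keeping each caller in a single sub-set.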
The corpus subdirectory of this tutorial contains the file listing for each of the subsets defined according to the procedure outlined above.
system prompt% pwd
TUTORIAL/digit/corpus
system prompt% ls
dev.files dev2.files test.files train.phn.files
dev1.files dev3.files train.all.files
system prompt% pwd
TUTORIAL/digit/labs
system prompt% ../script/transcript.tcl ../corpus/train.all.files train.all.wrd
system prompt% ../script/transcript.tcl ../corpus/train.phn.files train.phn.wrd
system prompt% ../script/transcript.tcl ../corpus/dev1.files dev1.wrd
system prompt% ../script/transcript.tcl ../corpus/test.files test.wrd
The script digitphn.tcl reads each phonetic transcription file and performs the necessary mappings between the two phonetic alphabets. Because our design creates a few variations like merging the "k" and "s" phonemes in the word six, these effects are also incorporated during the mapping process.
system prompt% ../script/digitphn.tcl corpus/train.phn.files
/u/johans/csluhmm/digit/lola/0/NU-25.zipcode.mono
/u/johans/csluhmm/digit/lola/0/NU-30.zipcode.mono
/u/johans/csluhmm/digit/lola/0/NU-31.zipcode.mono
.
.
.
With all the data in the required formats, we can now start the initial processing to train the recognizer. Table 2.2 shows the parameter settings for the feature extraction process. In this example we will be computing 13 MEL scale cepstral coefficients every 10 milliseconds.
system prompt% genfeat.tcl labs/train.all.wrd -config digit.cfg
processing MLF File: ../labs/train.all.wrd
.
.
.
system prompt% genfeat.tcl labs/dev1.wrd -config digit.cfg
system prompt% genfeat.tcl labs/test.wrd -config digit.cfg
The features calculated and stored in the cache file are typically only the base cepstral features. Extra features such as the first and second order time derivatives are computed on the fly. The script user.tcl creates a function UserFeat which is invoked by the feature cache interface function during training and testing. In the example below the user defined function performs cepstral mean subtraction and appends the first and second order time derivatives to the base cepstral coefficients. This results in a total of 39 values for each feature frame.
system prompt% pwd
TUTORIAL/digit/script
system prompt% cat user.tcl
package require Mx
package require Analysis
proc UserFeat wmf {
    # mean subtraction
    set mf [mx zeromean $wmf]
    nuke $wmf
    # delta
    set dd [analysis delta initialize -order 2 -sigmaT2 13]
    set mld [analysis delta $dd $mf -flush]; nuke $dd;
    lappend mf [set mlc [mx cut $mld :,0:12]]; nuke $mld
    # delta^2
    set dd [analysis delta initialize -order 2 -sigmaT2 13]
    set mld [analysis delta $dd $mlc -flush]; nuke $dd;
    lappend mf [mx cut $mld :,0:12]; nuke $mld
    set wfeat [mx join col $mf]
    nuke $mf
    return $wfeat
}
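The arithmetic behind the 39 values per frame (13 base coefficients plus 13 first and 13 second order derivatives) can be illustrated with a plain Python sketch. It mirrors only the overall flow of UserFeat; the derivative here is a simple central difference with repeated edge frames, which is an assumption and probably not the formula used by the toolkit's `analysis delta` command.

```python
def zero_mean(frames):
    """Cepstral mean subtraction: remove the per-coefficient mean over time."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def delta(frames):
    """Time derivative as a central difference, repeating the edge frames."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        before, after = frames[max(t - 1, 0)], frames[min(t + 1, last)]
        out.append([(a - b) / 2.0 for b, a in zip(before, after)])
    return out


def user_feat(cepstra):
    """Join base, delta and delta-delta coefficients frame by frame."""
    base = zero_mean(cepstra)
    first = delta(base)
    second = delta(first)
    return [b + d1 + d2 for b, d1, d2 in zip(base, first, second)]


# Five toy frames of 13 coefficients each, rising linearly in time.
cepstra = [[float(t)] * 13 for t in range(5)]
feats = user_feat(cepstra)
```

Computing the derivatives on the fly, as described above, means that only the 13 base coefficients per frame need to be kept in the feature cache.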
Table 2.3 depicts the configuration setup for segment selection. The CSLU-HMM script pickdata.tcl will read the mapped phonetic transcriptions and select 200 examples for each class defined by the initial model configuration script.
system prompt% pickdata.tcl - - - -config digit.cfg
{<z> 200} {<ih> 200} {<r> 200} {<ow> 200} {<w> 200} {<ah> 200} {<n> 200}
{<t> 200} {<uw> 200} {<th> 200} {<iy> 200} {<f> 200} {<aor> 192} {<ay> 200}
{<v> 200} {<s> 200} {<ks> 200} {<eh> 200} {<ey> 200} {<sil> 200}
system prompt% hmminit.tcl - -config digit.cfg
<z>: 199 examples loaded. total #frames = 2036
numiter: 10
avg logProb: -6.152841e+02 delta: 1.224415e+05
avg logProb: -6.023079e+02 delta: 2.582265e+03
.
.
.
The model parameters are stored as digit.0.
system prompt% hmmtrain.tcl - -config digit.cfg
<z>: 199 examples loaded. total #frames = 2036
avg logProb: -5.960176e+02 delta: 1.186075e+05
avg logProb: -5.949093e+02 delta: 2.205604e+02
avg logProb: -5.947156e+02 delta: 3.854687e+01
.
.
.
For the digits which end in a nasal (one/seven/nine), speakers often tend to emphasize the word-final phoneme. The effect is that these words are often pronounced one-a, seven-a or nine-a. These variations are captured by cloning the <ah> model. Table 2.6 lists the new set of pronunciations given the design considerations discussed.
#!hscript
#
inputmodel "digit.1";
outputmodel "allophone.0";
prototype skip numstate 5 mixtures 1 transp
0.000000 1.000000 0.000000 0.000000 0.000000
0.000000 0.500000 0.500000 0.000000 0.000000
0.000000 0.000000 0.333333 0.333333 0.333333
0.000000 0.000000 0.000000 0.500000 0.500000
0.000000 0.000000 0.000000 0.000000 0.000000;
prototype mono numstate 5 mixtures 1 transp
0.000000 1.000000 0.000000 0.000000 0.000000
0.000000 0.500000 0.500000 0.000000 0.000000
0.000000 0.000000 0.500000 0.500000 0.000000
0.000000 0.000000 0.000000 0.500000 0.500000
0.000000 0.000000 0.000000 0.000000 0.000000;
define skip <td>;
define mono <ah2> <ah3> <.garbage>;
clone <t>.state[1-3] <td>.state[1-3];
clone <ah>.state[1-3] {<ah2> <ah3>}.state[1-3];
clone <ah>.transp {<ah2> <ah3>}.transp;
clone <sil>.state[1-3] <.garbage>.state[1-3];
In this script we also create a cloned copy of the silence model which will be used to model non-speech (low energy) background noises such as line noise or breath noise. Executing the model configuration script results in the files allophone.0, allophone.list and allophone.rr. These files describe the new model set. Initially the parameters of this model set are identical to those of the "seed" model (digit.1). Further training will create distinct differences between the original and cloned models.
system prompt% hscript allophone.desc
opening file: allophone.desc
Input model: digit.1 digit.list digit.rr
Output model: allophone.0 allophone.list allophone.rr
Checking HMM model
minimum mixute weight: -5.947759
clone: <t> to <td> (1)->(1) (2)->(2) (3)->(3)
clone: <ah> to <ah2> (1)->(1) (2)->(2) (3)->(3)
clone: <ah> to <ah3> (1)->(1) (2)->(2) (3)->(3)
clone: <ah> to <ah2> transp
clone: <ah> to <ah3> transp
clone: <sil> to <.garbage> (1)->(1) (2)->(2) (3)->(3)
stateNum = 72 transpNum = 24 meanNum = 288 varNum: 288
system prompt% pwd
digit/dict
system prompt% worddb.tcl digit.dict digit.db
Before proceeding we need to create a triphone lookup table for the search algorithm. The search build procedure uses a triphone lookup table to determine which model to use during cross word expansion. The following CSLU-HMM configuration script creates this lookup table.
table.desc
----------
#!hscript
#
inputmodel "allophone.0";
$mono = z ih r ow w ah n ah3 t td uw th iy f aor ay v s ks eh ah2 ey sil .garbage;
outputtable $mono "allophone.tt";
Since each of our models is essentially a monophone model rather than the expected triphone model, each triphone model will be tied to its corresponding monophone model.
With all of these components in place we can now select the pronunciation which best matches the acoustics. Table 2.7 describes the configuration setup for the forced alignment procedure:
system prompt% hmmscribe.tcl - -config digit.cfg
processing MLF File: /u/johans/csluhmm/digit/labs/train.all.wrd
0 /projects/cslu/speech/corpora/CSLU/numbers/speechfiles/0/NU-25.zipcode.wav
1 /projects/cslu/speech/corpora/CSLU/numbers/speechfiles/0/NU-30.zipcode.wav
.
.
Table 2.8 describes the configuration setup used to generate the monophone model transcriptions. The following session reads the forced aligned word transcriptions (train.all.force) and generates the model transcriptions using the pronunciation database.
system prompt% genmodel.tcl - - -config digit.cfg
Table 2.9 lists the configuration setup for embedded parameter re-estimation. Starting from the initial cloned model the embedded re-estimation performs a total of nine passes over the data.
system prompt% hmmembed.tcl - -config digit.cfg
Iteration #1
processing MLF File: /u/johans/csluhmm/digit/labs/digit.all.mono
0 /projects/cslu/speech/corpora/CSLU/numbers/speechfiles/0/NU-25.zipcode.wav
= -62.2594
.
.
.
The resulting HMM models (allophone.1 through allophone.9) may now be evaluated to select the best model.
system prompt% pwd
TUTORIAL/digit/mono
system prompt% buildsearch.tcl allophone.0 ../search/digit.dict \
../search/digit.grammar digit.search
Table 2.10 lists the settings for the Viterbi decoder. In the following session each of the files listed in the word transcription file (dev.wrd) is processed. The first best answers are written to the file dev.answer.
system prompt% ~/2.0/script/hmm_1.0/hmmsearch.tcl allophone.6 - \
-config digit.cfg
processing MLF File: /u/johans/csluhmm/digit/labs/dev.wrd
0 0/NU-23.streetaddr.wav
1 0/NU-23.zipcode.wav
.
.
Using the true transcriptions (dev.wrd) and the hypothesized transcriptions (dev.answer), we can compute the performance of our models. Table 2.11 lists the settings for the NIST alignment procedure. In this example all extraneous speech labels are suppressed during the scoring.
system prompt% hmmscore.tcl - - -config digit.cfg
# words : 4152
# insertions : 62 (1.49325626204)
# deletions : 47 (1.13198458574)
# substitutions: 106 (2.55298651252)
Word Correct : 96.3150289017
Sentence Correct: 80.3894297636
Accuracy : 94.8217726397
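The word-level figures in this report can be reproduced from the error counts. The Python sketch below uses the usual definitions (word correct ignores insertions, accuracy penalizes them); the function name is hypothetical, and the formulas are inferred from the numbers shown rather than taken from the hmmscore.tcl source.

```python
def word_scores(words, insertions, deletions, substitutions):
    """Return (percent correct, percent accuracy) from the error counts."""
    hits = words - deletions - substitutions
    correct = 100.0 * hits / words
    accuracy = 100.0 * (hits - insertions) / words
    return correct, accuracy


# Counts taken from the scoring session above.
correct, accuracy = word_scores(words=4152, insertions=62,
                                deletions=47, substitutions=106)
insertion_rate = 100.0 * 62 / 4152
```

The sentence-level figure cannot be derived from these counts, since it depends on how the errors are distributed over the utterances.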
All models trained thus far have assumed an initial model complexity of 4 mixtures per state. Given the amount of data available for training, the total number of parameters can be increased substantially, which should in turn improve the performance (accuracy) of the recognizer. The configuration script split4.desc increases the number of mixtures per state from 4 to 8.
system prompt% pwd
TUTORIAL/digit/mono
system prompt% cat split4.desc
inputmodel "allophone.5";
loadstats "state.5";
outputmodel "allophone.10";
include "split.desc";
system prompt% hscript split4.desc
Input model: allophone.5 allophone.list allophone.rr
Output model: allophone.10 allophone.list allophone.rr
Checking HMM model
WARNING: DEFUNCT MIXTURE: <w> state: 1 mixture: 2
WARNING: DEFUNCT MIXTURE: <td> state: 1 mixture: 2
minimum mixute weight: -6.908567
split: 8
.
.
In this example we see that there are two defunct mixtures. The splitting procedure above will first remove the defunct mixtures before increasing the number of parameters until the desired number of mixtures is reached.
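The remove-then-split behavior can be sketched in Python for a single state with one-dimensional Gaussians. Everything specific here is an assumption made for illustration: a mixture is treated as defunct when its weight falls below a small floor, and splitting uses a common heuristic (halve the weight of the heaviest component and move the two copies 0.2 standard deviations apart). The contents of split.desc are not shown in this document and may work differently.

```python
def remove_defunct(mixtures, floor=1e-3):
    """Drop components with negligible weight and renormalize the rest."""
    kept = [m for m in mixtures if m["weight"] > floor]
    total = sum(m["weight"] for m in kept)
    return [dict(m, weight=m["weight"] / total) for m in kept]


def split_heaviest(mixtures, offset=0.2):
    """Replace the heaviest component by two perturbed copies of half its weight."""
    heavy = max(mixtures, key=lambda m: m["weight"])
    rest = [m for m in mixtures if m is not heavy]
    shift = offset * heavy["var"] ** 0.5
    halves = [dict(heavy, weight=heavy["weight"] / 2, mean=heavy["mean"] + s)
              for s in (-shift, shift)]
    return rest + halves


def grow(mixtures, target):
    """Remove defunct components, then split until the target count is reached."""
    mixtures = remove_defunct(mixtures)
    while len(mixtures) < target:
        mixtures = split_heaviest(mixtures)
    return mixtures


state = [
    {"weight": 0.5, "mean": 0.0, "var": 1.0},
    {"weight": 0.3, "mean": 2.0, "var": 4.0},
    {"weight": 0.2, "mean": -1.0, "var": 1.0},
    {"weight": 0.0, "mean": 9.0, "var": 1.0},   # defunct: carries no weight
]
grown = grow(state, target=8)
```

The split components start out nearly identical, so further embedded re-estimation passes are needed before they model different parts of the data.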
The translation was initiated by Johan Schalkwyk on 12/17/1997