

CSLU-HMM: The CSLU Hidden Markov Modeling Environment

Johan Schalkwyk, Xintian Wu and Yonghong Yan

Center for Spoken Language Understanding (cslu)
Oregon Graduate Institute of Science & Technology




December 17, 1997



Overview of CSLU-HMM

The CSLU-HMM development environment (CSLU-HMM) is a collection of modular building blocks that aims to provide an easy-to-use yet powerful research and development environment for constructing state-of-the-art HMM and neural network hybrid recognizers. Based on the CSLU shell [1] (CSLUsh), which uses Tcl/Tk to provide a scripting environment, CSLU-HMM lets the user adapt the existing modules to meet specific needs. CSLU-HMM is released as part of the CSLU Toolkit and may be downloaded for non-commercial use from http://www.cse.ogi.edu/CSLU/toolkit.

Architecture of CSLU-HMM.

Implemented in Tcl/Tk [2] and C, the CSLU-HMM environment supports a variety of modeling strategies. Great care was taken to design all of the core components to operate as efficiently and consistently as possible, with special attention given to modularity, portability and extensibility.


 
Figure 1.1: Software Architecture of CSLU-HMM.

Figure 1.1 presents the software architecture of CSLU-HMM. Designed as an extension to the CSLU shell (CSLUsh), CSLU-HMM uses a wide variety of pre-existing modules for distributed computing, speech signal processing, mathematical operations and various miscellaneous modules to provide a complete HMM development environment.

The foundation of CSLUsh (and CSLU-HMM) is a set of efficient C libraries (CSLU-C) for the functions which support the basic algorithmic operations and associated utilities. These libraries can be used directly with a documented C API to build applications.

The extensible scripting language Tcl is used to access the same functionality at a higher level, by creating core components (Tcl packages) which ``glue'' the basic operations together according to a well-defined API. The packages are dynamically loaded as needed, which keeps the memory footprint small.

HMM Core

The current version of CSLU-HMM supports both standard and advanced training methods. For basic model training, the standard techniques are:

Some of the more advanced techniques under development are:

Parameter tying may be done at either the model, state, mixture component, mean and/or covariance level. To facilitate the construction of HMM recognizers for medium to large vocabulary tasks, CSLU-HMM also supports decision tree state clustering [5].

CSLU-HMM also includes support for the design and optimization of neural-network hybrid recognizers. The embedded re-estimation algorithm provided with the main HMM library may be used to reestimate neural network targets [6].


 
Figure 1.2: HMM Configuration.

HMM models are created and configured using CSLU-HMM configuration scripts (hscript). Figure 1.2 presents an example. In this example mono-phone models are created for the phonemes /w/, /ah/, /n/, /sil/ and /sp/. The short pause model (/sp/) is then tied to the center state of the silence model (/sil/). The HMM configuration language provides a mechanism for specifying new HMM models and for editing existing ones. The collection of such scripts documents the process of building a recognizer in an easily readable format.

All of the functionality mentioned above is integrated within the CSLUsh environment. Stand-alone applications such as HMM embedded training are simply CSLUsh (extended Tcl/Tk) scripts which read a set of predefined input files and perform embedded training. Since the core HMM functionality is implemented as extensions to the base scripting language, the CSLU-HMM environment is fully programmable. The user can thus, with relative ease, change these applications to meet specific needs rather than comply with a predefined interface.

Decoders

HMM Tools


 
Figure 1.3: Outline of typical HMM training.

Figure 1.3 shows the outline of the typical process of training an HMM model for speech recognition. Each of these steps and the associated CSLU-HMM scripts are discussed below.

Generic properties of CSLU-HMM scripts.

The scripts provided with the CSLU-HMM development environment use a traditional command-line interface. Since all of them are plain Tcl scripts interfacing with the HMM modeling extensions, they can easily be tailored to the user's needs.

The arguments to the scripts are broken into command line parameters and command line options. A typical script would be invoked as follows:

system prompt% hmmtool.tcl param1 param2 -option1 x -option2 y

The command line options are used to configure the experimental setup. These settings may also be supplied through the -config option. When the -config option is specified, the default values for the command line options and parameters are read from the configuration file. CSLU-HMM uses Tcl associative arrays as the interface for specifying these values. The following configuration file illustrates this process.

experiment.cfg
--------------
set config(tool,option1) x
set config(tool,option2) y
set param(tool,param1) param1
set param(tool,param2) param2

With these variables defined, the script would be invoked as follows for these to take effect:

system prompt% hmmtool.tcl - - -config experiment.cfg
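The way configuration-file defaults and explicit command-line values might combine can be sketched outside the toolkit. The following Python fragment is only an illustration: the function names parse_config and resolve are hypothetical, the rule that an explicit command-line value overrides a file default is an assumption drawn from the description above, and Tcl variable substitution (such as $base) is deliberately not evaluated.

```python
import re

# Matches lines such as: set config(tool,option1) x
SET_LINE = re.compile(r"^set\s+(config|param)\((\w+),(\w+)\)\s*(.*)$")


def parse_config(text):
    """Collect 'set config(...)' and 'set param(...)' lines into nested dicts."""
    tables = {"config": {}, "param": {}}
    for line in text.splitlines():
        match = SET_LINE.match(line.strip())
        if match:
            array, tool, key, value = match.groups()
            tables[array].setdefault(tool, {})[key] = value.strip()
    return tables


def resolve(defaults, overrides):
    """Assumed rule: explicit command-line values win over file defaults."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


example = """
set config(tool,option1) x
set config(tool,option2) y
set param(tool,param1) param1
set param(tool,param2) param2
"""

tables = parse_config(example)
# Pretend the user also typed "-option2 z" on the command line.
options = resolve(tables["config"]["tool"], {"option2": "z"})
print(options)
```

Keeping the file values as defaults means one experiment.cfg can drive every tool in an experiment while individual runs still override single settings.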

Data preparation

One of the most important parts of building a good speech recognizer has to do with processing the data. Building HMM models requires a set of speech waveform files and their associated word and/or phoneme transcriptions.

The steps needed to process the data and transcriptions into the required formats for training and testing are inevitably specific to the layout and contents of the data. In the tutorial that follows (Chapter 2) some of these steps are illustrated using the Tcl scripting language to create a series of scripts which transform the transcriptions into the required formats.

Given the speech waveform files, one of the major data preparation steps is to parameterize the speech data. The CSLU-HMM script genfeat.tcl extracts the baseline features for training HMM models. It reads a list of files from a master transcription file and computes the baseline features for each of the files specified. The features are stored in a disk cache and indexed for fast retrieval. With the exception of the model initialization and single model training scripts, all CSLU-HMM tools interface via the master transcription file.

Model Initialization

For phoneme (sub-word) based speech recognizers it is often necessary to have phonetically labeled data in order to create initial seed models, which can in turn be used to label data that has not been phonetically aligned. For English a wide variety of resources is available, such as the TIMIT and NTIMIT corpora.

Figure 1.4 depicts a graphical overview of the model initialization.


 
Figure 1.4: HMM model initialization.

Model initialization is done in a three-step process, starting with the alignments obtained from the initial labeled data. The CSLU-HMM script pickdata.tcl reads the alignment information and selects a subset of the available data for model initialization and model training.

With the data selected, the model parameters are initialized using the CSLU-HMM script hmminit.tcl. Data (segments) from a particular class are loaded into memory. Each segment is then cut into equal-sized sub-segments, according to the number of states in the model. The data allocated to a particular HMM state are then pooled, and the initial mixture means are estimated from them using the vector quantization (VQ) algorithm. The mixture covariances are set to the pooled covariance of the data. During this first parameter initialization step, the mixture weights and state transition probabilities are not estimated.
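The equal-sized segmentation step can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the hmminit.tcl implementation: it fits a single diagonal Gaussian per state instead of running VQ to obtain several mixture means, it assumes every segment has at least as many frames as the model has states, and the names split_equal and init_states are hypothetical.

```python
def split_equal(frames, num_states):
    """Cut one segment into num_states contiguous, near-equal sub-segments."""
    n = len(frames)
    bounds = [round(i * n / num_states) for i in range(num_states + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(num_states)]


def mean(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def variance(vectors, mu):
    return [sum((v[d] - mu[d]) ** 2 for v in vectors) / len(vectors)
            for d in range(len(mu))]


def init_states(segments, num_states):
    """Pool the frames assigned to each state and fit one Gaussian per state."""
    pooled = [[] for _ in range(num_states)]
    for segment in segments:
        for state, part in enumerate(split_equal(segment, num_states)):
            pooled[state].extend(part)
    params = []
    for data in pooled:
        mu = mean(data)
        params.append((mu, variance(data, mu)))
    return params


# Two toy segments of one-dimensional feature frames for a 3-state model.
segments = [
    [[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]],
    [[0.1], [1.1], [2.1]],
]
params = init_states(segments, 3)
for state, (mu, var) in enumerate(params):
    print(state, mu, var)
```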

The parameter estimates are further improved upon by performing a Viterbi state realignment. During this phase the allocation of data to a particular model state is determined by the state alignment rather than by the initial equal-sized sub-segments.
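A Viterbi state alignment for a left-to-right model can be sketched as follows. This is a generic textbook formulation rather than the toolkit's code; the function name viterbi_align, the unit-variance Gaussian scores and the uniform transition probabilities in the example are all assumptions made for illustration.

```python
import math


def viterbi_align(log_emit, log_self, log_next):
    """Best state path through a left-to-right model.

    log_emit[t][s]: log-likelihood of frame t under state s.
    log_self[s], log_next[s]: log-probability of staying in s or moving to s + 1.
    The path is forced to start in state 0 and to end in the last state.
    """
    T, S = len(log_emit), len(log_emit[0])
    neg_inf = -math.inf
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_emit[0][0]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s] + log_self[s]
            move = score[t - 1][s - 1] + log_next[s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s], back[t][s] = move, s - 1
            else:
                score[t][s], back[t][s] = stay, s
            score[t][s] += log_emit[t][s]
    # Trace back from the final state at the last frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy example: three states with unit-variance Gaussians centred at 0, 1, 2.
means = [0.0, 1.0, 2.0]
frames = [0.0, 0.1, 0.9, 1.1, 2.0, 2.1]
log_emit = [[-0.5 * (x - m) ** 2 for m in means] for x in frames]
log_half = math.log(0.5)
path = viterbi_align(log_emit, [log_half] * 3, [log_half] * 3)
print(path)
```

The returned path gives the state index for each frame, which is the hard allocation that replaces the equal-sized split when the state parameters are re-estimated.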

Model Training.

During model initialization the state parameter estimates assume a discrete allocation of data to each state in the HMM model. This hard allocation of data to states is often not optimal: feature frames close to a state transition are difficult to associate with a single state, and are better described by a soft (probabilistic) association with several states. The forward/backward algorithm, in contrast to the Viterbi state realignment, computes this probabilistic association between feature frames and HMM states.
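The soft association can be made concrete with a minimal forward/backward sketch. It uses plain probabilities, which is only safe for short sequences (practical implementations work with scaling or in the log domain), and the function name state_posteriors and the toy numbers are assumptions for illustration.

```python
def state_posteriors(emit, trans, initial):
    """gamma[t][s] = P(state s at frame t | all frames), via forward/backward.

    emit[t][s]: likelihood of frame t under state s.
    trans[i][j]: probability of moving from state i to state j.
    initial[s]: probability of starting in state s.
    """
    T, S = len(emit), len(initial)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[1.0] * S for _ in range(T)]
    for s in range(S):
        alpha[0][s] = initial[s] * emit[0][s]
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = emit[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(S))
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(
                trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(S))
    total = sum(alpha[T - 1])
    return [[alpha[t][s] * beta[t][s] / total for s in range(S)]
            for t in range(T)]


# Two-state left-to-right toy model; the middle frame is ambiguous.
emit = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
trans = [[0.6, 0.4], [0.0, 1.0]]
gamma = state_posteriors(emit, trans, [1.0, 0.0])
for row in gamma:
    print(row)
```

In this example the middle frame receives a fractional weight for each state, whereas a Viterbi alignment would have to assign it entirely to one of them.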

These probabilistic associations are used in conjunction with the expectation/maximization (EM) algorithm to further improve upon the initial parameter estimates. For each segment the parameter accumulator variables are updated. The accumulator variables store the cumulative contribution of each data vector to each mixture of each state. From these statistics the model parameters are computed.

The CSLU-HMM script hmmtrain.tcl encapsulates this process. Together, model initialization and single model training are used to form a ``seed'' model which is used as the basis for training the recognizer.
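The role of the accumulator variables can be sketched as below. The class name GaussianAccumulator and its layout are hypothetical and cover only one diagonal-covariance Gaussian; the posterior weights in the example are made-up numbers standing in for the output of the forward/backward pass.

```python
class GaussianAccumulator:
    """Running soft-count statistics for one diagonal-covariance Gaussian."""

    def __init__(self, dim):
        self.weight = 0.0           # sum of posteriors
        self.first = [0.0] * dim    # sum of posterior * x
        self.second = [0.0] * dim   # sum of posterior * x^2

    def add(self, frame, posterior):
        """Add one frame, weighted by its posterior for this Gaussian."""
        self.weight += posterior
        for d, x in enumerate(frame):
            self.first[d] += posterior * x
            self.second[d] += posterior * x * x

    def estimate(self):
        """Turn the accumulated statistics into a mean and a variance."""
        mean = [f / self.weight for f in self.first]
        var = [s / self.weight - m * m for s, m in zip(self.second, mean)]
        return mean, var


acc = GaussianAccumulator(dim=1)
# Illustrative (frame, posterior) pairs; the posteriors are assumed values.
for frame, posterior in [([1.0], 0.5), ([3.0], 0.5), ([10.0], 0.0)]:
    acc.add(frame, posterior)
mean, var = acc.estimate()
print(mean, var)
```

Because only sums are stored, statistics from many segments can be accumulated before the parameters are recomputed once at the end of a pass.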

Transcription

The phonetic hand labeled data are typically only sufficient to create ``seed'' models for a phoneme-based recognizer. To build a more accurate and more robust recognizer requires much more data. Most corpora contain word level transcriptions, which may be used to augment the existing training data, by

The CSLUsh script hmmscribe.tcl computes the word, phoneme and/or state alignments. The input word transcriptions are used to create a finite state grammar in which each node or state contains a word and its pronunciation variants, obtained from a pronunciation lexicon. The standard Viterbi algorithm is then used to find the best possible path through the grammar.

Because the pronunciation lexicon can have more than one pronunciation per word, transcriptions are derived from the known orthography by using the initial bootstrap models to do a forced recognition of each training utterance.

hmmscribe.tcl reads a standard master transcription file. For each word in the transcription the pronunciation variants are read from a pronunciation database. The CSLUsh script worddb.tcl reads a text file of word and pronunciation tuples, and creates the database.

Embedded parameter re-estimation

Model initialization and single model training only use the data associated with the particular model. During these training steps it is assumed that the phonetic boundaries are known and that there is no interaction between neighboring models. Embedded parameter estimation (hmmembed.tcl) addresses these problems by creating a composite model from the associated model transcriptions. The composite model is then used to compute the model and state alignments, from which the model parameters are updated. The CSLUsh script genmodel.tcl converts a word transcription file to an associated mono-phone or triphone transcription file. Given the input pronunciation database, each word is expanded into either a mono-phone or triphone representation. The resulting model transcriptions are then used for embedded training.
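The expansion of a word transcription into a model transcription can be sketched in Python. The lexicon entries are taken from Table 2.1; everything else is an assumption for illustration: the function names are hypothetical, only one pronunciation per word is used, and the left-center+right triphone naming is borrowed from common practice and is not necessarily the notation genmodel.tcl writes.

```python
# Pronunciations from Table 2.1 (one pronunciation per word).
LEXICON = {
    "one": ["w", "ah", "n"], "two": ["t", "uw"], "three": ["th", "r", "iy"],
    "four": ["f", "aor"], "five": ["f", "ay", "v"], "six": ["s", "ih", "k", "s"],
    "seven": ["s", "eh", "v", "ah", "n"], "eight": ["ey", "t"],
    "nine": ["n", "ay", "n"], "zero": ["z", "ih", "r", "ow"], "oh": ["ow"],
    "sil": ["sil"],
}


def to_monophones(words, lexicon=LEXICON):
    """Replace every word by its phone sequence."""
    phones = []
    for word in words:
        phones.extend(lexicon[word])
    return phones


def to_triphones(phones):
    """Name each phone by its neighbours (assumed l-c+r notation)."""
    names = []
    for i, phone in enumerate(phones):
        name = phone
        if i > 0:
            name = phones[i - 1] + "-" + name
        if i + 1 < len(phones):
            name = name + "+" + phones[i + 1]
        names.append(name)
    return names


mono = to_monophones(["one", "two"])
tri = to_triphones(mono)
print(mono)
print(tri)
```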

Evaluation

The three training steps described above produce a series of HMM models. The question that remains is which model will give the best performance when evaluated on previously unseen data.


 
Figure 1.5: Continuous digit string grammar.

To set up the evaluation process we first need to construct a task grammar and the associated search network used by the Viterbi decoder. Figure 1.5 graphically depicts the task grammar for a continuous spoken digit recognizer. This grammar can be used to recognize any number of spoken digits, with an optional silence between words.

The CSLUsh script buildsearch.tcl uses both the grammar specification and word pronunciations to create a search network.

With the search network in place, recognition is done using the CSLU-HMM script hmmsearch.tcl, which supports full context-dependent triphone modeling with word, phoneme, or state level alignment. By default only the first best answer is returned. However, a word lattice containing multiple hypotheses is also available. Once all files have been processed, the input transcriptions are compared to the recognition answers to evaluate the performance of the recognizer. The CSLU-HMM script hmmscore.tcl uses the NIST scoring algorithm to compare the recognition answers with the true answers provided.
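The comparison of reference and hypothesis transcriptions rests on a minimum-edit alignment. The sketch below uses unit costs for substitutions, deletions and insertions; this is an assumption, the NIST scoring procedure used by hmmscore.tcl may weight errors differently, and the function name align_counts is hypothetical.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    R, H = len(reference), len(hypothesis)
    # cost[i][j] = (errors, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (H + 1) for _ in range(R + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        cost[i][0] = (i, 0, i, 0)
    for j in range(1, H + 1):
        cost[0][j] = (j, 0, 0, j)
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (e, s, d, n)           # match, no error
            else:
                diagonal = (e + 1, s + 1, d, n)   # substitution
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = cost[R][H]
    return subs, dels, ins


reference = "one two three four".split()
hypothesis = "one too three four five".split()
counts = align_counts(reference, hypothesis)
print(counts)
```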

Tutorial - Building a digit recognizer  

This chapter shows how to use each of the HMM training utilities provided, by building a continuous digit recognizer. The tutorial files may be downloaded from .....

Defining the task

Figure 2.1 depicts the grammar for a spoken digit recognizer. This grammar can be used to recognize any number of spoken digits, with an optional silence between words.


 
Figure 2.1: Continuous digit string grammar.

The CSLU-HMM development environment uses a grammar definition language similar to the HTK definition language to specify task grammars. For our digit recognizer the following grammar is used:

$digit = one | two | three | four | five | six | 
           seven | eight | nine | zero | oh;
 [sil] <$digit [sil]>;
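To make the meaning of the grammar concrete, the following Python sketch checks whether a word sequence is accepted by it. It assumes the HTK-style reading of the notation, in which square brackets mark an optional item and angle brackets mark one or more repetitions; the function name accepts is hypothetical and the toolkit itself compiles the grammar into a search network rather than testing strings this way.

```python
DIGITS = {"one", "two", "three", "four", "five", "six",
          "seven", "eight", "nine", "zero", "oh"}


def accepts(tokens):
    """True if tokens match: [sil] <$digit [sil]>  (assumed HTK-style semantics)."""
    i = 0
    if i < len(tokens) and tokens[i] == "sil":   # optional leading silence
        i += 1
    count = 0
    while i < len(tokens) and tokens[i] in DIGITS:
        i += 1
        count += 1
        if i < len(tokens) and tokens[i] == "sil":   # optional silence after a digit
            i += 1
    # At least one digit is required and every token must be consumed.
    return count >= 1 and i == len(tokens)


print(accepts("sil one sil two".split()))
print(accepts("one two three".split()))
print(accepts(["sil"]))
```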

The 30k Numbers corpus

For this task we will use the 30k Numbers corpus. The Numbers corpus is a collection of spontaneous ordinal and cardinal numbers, continuous digit strings and isolated digit strings. The utterances were taken from numerous CSLU telephone speech data collection efforts. Release 1.0 of the 30,000 Numbers Corpus contains 15,000 files. Each file has an orthographic transcription; about 7,000 have a phonetic transcription. For the purpose of this tutorial only the utterances which contain digits are used.

 
Figure 2.2: Layout of the OGI Numbers corpus.

Figure 2.2 depicts the directory structure of the Numbers corpus. The speech waveform files are stored in the speechfiles subdirectory. For each speech file there is an associated word transcription file which can be found in the txtfiles directory. These transcriptions are used to select the set of files which contain only digit strings. The phnfiles directory contains phonetic hand-labels for a subset (7000 files) of the data.

Model prototyping

The word pronunciation models for each of the words defined by the grammar are given in Table 2.1. These are defined using the ARPA phonetic alphabet.


 
Table 2.1:   Word pronunciations for the digit recognizer.
one w ah n
two t uw
three th r iy
four f aor
five f ay v
six s ih k s
seven s eh v ah n
eight ey t
nine n ay n
zero z ih r ow
oh ow
sil sil

From the pronunciation models we define the first set of HMM models needed for the task. The configuration script digit.desc creates a set of standard 3-state left-to-right HMM models, one for each of the phones needed. For the word six (pronounced ``s ih k s'') the phonemes ``k'' and ``s'' are merged, resulting in a special ``ks'' model.

system prompt% pwd
TUTORIAL/digit/mono
system prompt% cat digit.desc
#!hscript
#
outputmodel "digit.0";
 
vecsize 39;
 
prototype mono numstate 5 mixtures 4 transp
 0.000 1.000 0.000 0.000 0.000
 0.000 0.600 0.400 0.000 0.000
 0.000 0.000 0.500 0.500 0.000
 0.000 0.000 0.000 0.600 0.400
 0.000 0.000 0.000 0.000 0.000;
 
define mono <z>   <ih> <r>  <ow> 
            <w>   <ah> <n>  <t>
            <uw>  <th> <iy> <f>  
            <aor> <ay> <v>  <s>
            <ks>  <eh> <ey> <sil>;

Executing the model configuration script results in the files digit.0, digit.list and digit.rr. These files describe the initial model configuration.

system prompt% hscript digit.desc
opening file: digit.desc
Input model:   
Output model: digit.0 digit.list digit.rr
Checking HMM model
minimum mixute weight: 10000000000.000000
stateNum = 60
transpNum = 20
meanNum = 240
varNum: 240
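The counts reported by hscript can be checked against digit.desc with a little arithmetic. The Python sketch below assumes that the first and last of the five prototype states are non-emitting entry and exit states, an interpretation inferred from the transition matrix and from the description of the models as 3-state; the variable names are illustrative.

```python
# Model names listed in the "define mono" statement of digit.desc.
models = "z ih r ow w ah n t uw th iy f aor ay v s ks eh ey sil".split()

numstate = 5   # "numstate 5" in the prototype
mixtures = 4   # "mixtures 4" in the prototype

# Assumption: the first and last prototype states do not emit.
emitting_states = numstate - 2

state_num = len(models) * emitting_states   # one entry per emitting state
transp_num = len(models)                    # one transition matrix per model
mean_num = state_num * mixtures             # one mean vector per mixture
var_num = state_num * mixtures              # one variance vector per mixture

print(state_num, transp_num, mean_num, var_num)
```

Under that assumption the four values agree with the stateNum, transpNum, meanNum and varNum lines in the session output.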

Data preparation

This section presents the initial data preparation steps needed to build a continuous digit recognizer.

Step 1 - Dividing up a corpus

Each speaker in the digit corpus is identified by a unique caller identification number, encoded into the filename. Based on this identification number a speaker is allocated to either the training, development or test set.

In this tutorial the training set consists of 3/5 of the total data, and the sum of all development sets is 1/5 of the total data. The test set is also 1/5 of the total data.

The "total data" is actually not all of the digit files in the Numbers corpus: 5% of all files (speaker-independent) have been removed (culled) from the available files to form a final test set. This final 5% test set is not to be used by the researcher in evaluation; it is to be used by an independent party and run only once on a given system (the normal test set is also to be run only once on a given system, but the researchers may do it themselves).

There are four development sub-sets taken from the available development data. The first sub-set was generated by finding all files in the development set that have phonetic labels. The other three development sub-sets were generated by splitting up all available development files (1/5 of all data) into three parts. Note that these three development sub-sets are speaker-independent, in that a call number will appear in only one of the sets. (In other words, you won't have NU-78.zipcode.wav in one sub-set and NU-78.other1.wav in another sub-set, because the call number 78 is from one person.)
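A speaker-independent split keyed on the call number can be sketched as follows. The actual allocation rule used to build the tutorial file lists is not given, so the modulo-5 rule here is purely an assumption that happens to reproduce the 3/5, 1/5, 1/5 proportions; the function names are hypothetical.

```python
import re

# Call number embedded in names such as NU-78.zipcode.wav
CALL_ID = re.compile(r"NU-(\d+)\.")


def call_id(filename):
    match = CALL_ID.search(filename)
    if match is None:
        raise ValueError("no call number in " + filename)
    return int(match.group(1))


def assign(filename):
    """Assumed rule: buckets 0-2 train, bucket 3 development, bucket 4 test."""
    bucket = call_id(filename) % 5
    if bucket < 3:
        return "train"
    return "dev" if bucket == 3 else "test"


for name in ["0/NU-25.zipcode.wav", "NU-78.zipcode.wav", "NU-78.other1.wav"]:
    print(name, assign(name))
```

Because the decision depends only on the call number, every file from one caller lands in the same set.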

The corpus subdirectory of this tutorial contains the file listing for each of the subsets defined according to the procedure outlined above.

system prompt% pwd
TUTORIAL/digit/corpus
system prompt% ls
dev.files         dev2.files        test.files        train.phn.files
dev1.files        dev3.files        train.all.files

Word transcriptions

Next the corpus description files are used to create the associated word transcription files. The script transcript.tcl provided with this tutorial reads each text file (.txt) in the numbers corpus and saves the information according to the required master transcription file format.

system prompt% pwd 
TUTORIAL/digit/labs
system prompt% ../script/transcript.tcl ../corpus/train.all.files train.all.wrd
system prompt% ../script/transcript.tcl ../corpus/train.phn.files train.phn.wrd
system prompt% ../script/transcript.tcl ../corpus/dev1.files dev1.wrd
system prompt% ../script/transcript.tcl ../corpus/test.files test.wrd

Phonetic transcriptions

The phonetic transcriptions in the numbers corpus are defined using the World Bet phonetic alphabet.

The script digitphn.tcl reads each phonetic transcription file and performs the necessary mappings between the two phonetic alphabets. Because our design creates a few variations like merging the ``k'' and ``s'' phonemes in the word six, these effects are also incorporated during the mapping process.

system prompt% ../script/digitphn.tcl corpus/train.phn.files
/u/johans/csluhmm/digit/lola/0/NU-25.zipcode.mono
/u/johans/csluhmm/digit/lola/0/NU-30.zipcode.mono
/u/johans/csluhmm/digit/lola/0/NU-31.zipcode.mono
        .
        .
        .

Step 2 - Feature extraction

With all the data in the required formats, we can now start the initial processing to train the recognizer. Table 2.2 shows the parameter settings for the feature extraction process. In this example we will be computing 13 MEL scale cepstral coefficients every 10 milliseconds.


 
Table 2.2:   Feature generation configuration setup for digit recognizer (digit.cfg).
set config(feature,exponent) 0.6
set config(feature,features) 13
set config(feature,filters) 21
set config(feature,framesize) 10.0
set config(feature,order)  
set config(feature,preemphasis) 0.98
set config(feature,samplerate) 8000
set config(feature,type) MFCC
set config(feature,windowsize) 16.0
set config(feature,basedir) /u0/tmp
set config(feature,script) $base/script/user.tcl
set param(feature,mlffile) $base/labs/train.all.wrd

system prompt% genfeat.tcl labs/train.all.wrd -config digit.cfg
processing MLF File: ../labs/train.all.wrd
	.
	.
	.
system prompt% genfeat.tcl labs/dev1.wrd -config digit.cfg
system prompt% genfeat.tcl labs/test.wrd -config digit.cfg

The features calculated and stored in the cache file are typically only the base cepstral features. Extra features such as the first and second order time derivatives are computed on the fly. The script user.tcl creates a function UserFeat which is invoked by the feature cache interface function during training and testing. In the example below the user-defined function performs cepstral mean subtraction and appends the first and second order time derivatives to the base cepstral coefficients. This results in a total of 39 values for each feature frame.

system prompt% pwd
TUTORIAL/digit/script
system prompt% cat user.tcl
package require Mx
package require Analysis

proc UserFeat wmf {

 # mean subtraction
 set mf [mx zeromean $wmf]
 nuke $wmf
 
 # delta
 set dd  [analysis delta initialize -order 2 -sigmaT2 13]
 set mld [analysis delta $dd $mf -flush]; nuke $dd;
 lappend mf [set mlc [mx cut $mld :,0:12]]; nuke $mld
 
 # delta^2
 set dd  [analysis delta initialize -order 2 -sigmaT2 13]
 set mld [analysis delta $dd $mlc -flush]; nuke $dd;
 lappend mf [mx cut $mld :,0:12]; nuke $mld
 set wfeat [mx join col $mf]
 nuke $mf
 
 return $wfeat
}
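The processing performed by UserFeat, and the way 13 base coefficients become 39 values per frame, can be sketched without the toolkit. The Python version below is a stand-in under stated assumptions: it uses a simple central difference for the derivatives, whereas the analysis delta command with its -order and -sigmaT2 options very likely uses a different window, and the function names are hypothetical.

```python
def zero_mean(frames):
    """Cepstral mean subtraction: remove the per-coefficient average."""
    dim = len(frames[0])
    mu = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - mu[d] for d in range(dim)] for f in frames]


def delta(frames):
    """Central difference over time, repeating the edge frames (assumed form)."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(n - p) / 2.0 for n, p in zip(nxt, prev)])
    return out


def user_feat(frames):
    """Base coefficients + first derivative + second derivative per frame."""
    base = zero_mean(frames)
    first = delta(base)
    second = delta(first)
    return [b + d1 + d2 for b, d1, d2 in zip(base, first, second)]


# Five toy frames with 13 coefficients each.
frames = [[float(t + d) for d in range(13)] for t in range(5)]
feats = user_feat(frames)
print(len(feats), len(feats[0]))
```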

Step 3 - Data selection


 
Table 2.3:   Data selection configuration setup (digit.cfg).
set config(data,cdid) digit
set config(data,labelext) mono
set config(data,maxwant) 200
set config(data,output) $base/mono/digit.pick
set config(data,waveext) wav
set param(data,modelname) $base/mono/digit.0
set param(data,labeldir) $base/lola
set param(data,wavedir) $wavedir/numbers/speechfiles

Table 2.3 depicts the configuration setup for segment selection. The CSLU-HMM script pickdata.tcl will read the mapped phonetic transcriptions and select 200 examples for each class defined by the initial model configuration script.

system prompt% pickdata.tcl - - - -config digit.cfg
{<z> 200} {<ih> 200} {<r> 200} {<ow> 200} {<w> 200} {<ah> 200} {<n> 200}
{<t> 200} {<uw> 200} {<th> 200} {<iy> 200} {<f> 200} {<aor> 192} {<ay> 200}
{<v> 200} {<s> 200} {<ks> 200} {<eh> 200} {<ey> 200} {<sil> 200}

Model Training

Step 4 - Model Initialization


 
Table 2.4:   Model initialization configuration setup (digit.cfg).
set config(init,pickfile) $base/mono/digit.pick
set config(init,numiter) 10
set param(init,basename) $base/mono/digit

system prompt% hmminit.tcl - -config digit.cfg
<z>: 199 examples loaded. total #frames = 2036
numiter: 10
avg logProb: -6.152841e+02  delta: 1.224415e+05
avg logProb: -6.023079e+02  delta: 2.582265e+03
    .
    .
    .

The model parameters are stored as digit.0.

Step 5 - Single model training


 
Table 2.5:   Model training configuration setup (digit.cfg).
set config(train,pickfile) $config(data,output)
set config(train,numiter) 10
set config(train,updatestate) 1
set config(train,updatetransp) 1
set param(train,basename) $base/mono/digit

system prompt% hmmtrain.tcl - -config digit.cfg
<z>: 199 examples loaded. total #frames = 2036
avg logProb: -5.960176e+02  delta: 1.186075e+05
avg logProb: -5.949093e+02  delta: 2.205604e+02
avg logProb: -5.947156e+02  delta: 3.854687e+01
     .
     .
     .

Step 6 - Allophonic Variant

Part of designing a good recognizer is understanding the problem. The ``seed'' models trained from the phonetic hand labels in this example do not distinguish between instances of the same sound produced in different parts of a word. For example, the ``t'' in two will most probably be pronounced differently from the ``t'' in eight. In this case the burst of the sound ``t'' in eight is sometimes unreleased, and therefore needs to be modeled differently. The configuration script shown below creates a cloned version of the ``t'' model, but instead of using the standard left-to-right model, the third state (i.e. the hypothesized burst) is made optional.

For the digits which end in a nasal (one/seven/nine) speakers often tend to emphasize the word-final phoneme. The effect is that these words are often pronounced one-a, seven-a or nine-a. These variations are captured by cloning the <ah> model. Table 2.6 lists the new set of pronunciations given the design considerations discussed.


 
Table 2.6:   Word pronunciations for the digit recognizer. Pronunciation based on allophonic variations.
one w ah n [ah3]
two t uw
three th r iy
four f aor
five f ay v
six s ih ks
seven s eh v ah2 n [ah3]
eight ey td
nine n ay n
zero z ih r ow
oh ow

#!hscript
#

inputmodel "digit.1";
outputmodel "allophone.0";
 
prototype skip numstate 5 mixtures 1 transp
 0.000000 1.000000 0.000000 0.000000 0.000000 
 0.000000 0.500000 0.500000 0.000000 0.000000 
 0.000000 0.000000 0.333333 0.333333 0.333333
 0.000000 0.000000 0.000000 0.500000 0.500000
 0.000000 0.000000 0.000000 0.000000 0.000000;
 
prototype mono numstate 5 mixtures 1 transp
 0.000000 1.000000 0.000000 0.000000 0.000000 
 0.000000 0.500000 0.500000 0.000000 0.000000 
 0.000000 0.000000 0.500000 0.500000 0.000000
 0.000000 0.000000 0.000000 0.500000 0.500000
 0.000000 0.000000 0.000000 0.000000 0.000000;
 
define skip <td>;
define mono <ah2> <ah3> <.garbage>;
 
clone <t>.state[1-3]  <td>.state[1-3];
clone <ah>.state[1-3] {<ah2> <ah3>}.state[1-3];
clone <ah>.transp     {<ah2> <ah3>}.transp;
clone <sil>.state[1-3]  <.garbage>.state[1-3];

In this script we also create a cloned copy of the silence model which will be used to model non-speech (low energy) background noises such as line noise or breath noise. Executing the model configuration script results in the files allophone.0, allophone.list and allophone.rr. These files describe the new model set. Initially this model set will be parameter-equivalent to the ``seed'' model (digit.1). Further training will create distinctive differences between the original and cloned models.

system prompt% hscript allophone.desc
opening file: allophone.desc
Input model: digit.1 digit.list digit.rr
Output model: allophone.0 allophone.list allophone.rr
Checking HMM model
minimum mixute weight: -5.947759
clone: <t> to <td>  (1)->(1)  (2)->(2)  (3)->(3)  
clone: <ah> to <ah2>  (1)->(1)  (2)->(2)  (3)->(3)  
clone: <ah> to <ah3>  (1)->(1)  (2)->(2)  (3)->(3)  
clone: <ah> to <ah2>  transp 
clone: <ah> to <ah3>  transp 
clone: <sil> to <.garbage>  (1)->(1)  (2)->(2)  (3)->(3)  
stateNum = 72
transpNum = 24
meanNum = 288
varNum: 288

Step 7 - Transcription

system prompt% pwd
digit/dict
system prompt% worddb.tcl digit.dict digit.db

Before proceeding we need to create a triphone lookup table for the search algorithm. The search build procedure uses a triphone lookup table to determine which model to use during cross word expansion. The following CSLU-HMM configuration script creates this lookup table.

table.desc
----------
#!hscript
#

inputmodel "allophone.0";

$mono = z ih r ow w ah n ah3 t td uw th iy f aor ay v s ks eh ah2 ey 
	sil .garbage;

outputtable $mono "allophone.tt";

Since each of our models is essentially a monophone model rather than the expected triphone model, each triphone model will be tied to its corresponding monophone model.


 
Table 2.7:   Forced alignment configuration setup (digit.cfg).
set config(scribe,dictionary) $base/dict/digit.db
set config(scribe,input) $base/labs/train.all.wrd
set config(scribe,output) $base/labs/train.all.force
set param(scribe,modelname) allophone.0

With all of these components in place we can now select the pronunciation which best matches the acoustics. Table 2.7 describes the configuration setup for the forced alignment procedure:

system prompt% hmmscribe.tcl - -config digit.cfg
processing MLF File: /u/johans/csluhmm/digit/labs/train.all.wrd
0 /projects/cslu/speech/corpora/CSLU/numbers/speechfiles/0/NU-25.zipcode.wav 
1 /projects/cslu/speech/corpora/CSLU/numbers/speechfiles/0/NU-30.zipcode.wav 
	.
	.

Step 8 - Embedded parameter reestimation


 
Table 2.8:   Model transcription generation configuration setup (digit.cfg).
set config(genmodel,dictionary) $base/dict/digit.db
set config(genmodel,shortpause)  
set config(genmodel,type) mono
set param(genmodel,input) $base/labs/train.all.force
set param(genmodel,output) $base/labs/train.all.mono

Table 2.8 describes the configuration setup used to generate the monophone model transcriptions. The following session reads the forced aligned word transcriptions (train.all.force) and generates the model transcriptions using the pronunciation database.

system prompt% genmodel.tcl - - -config digit.cfg


 
Table 2.9:   Embedded reestimation configuration setup (digit.cfg).
set config(embed,trainfile) $base/labs/digit.embed.mono
set config(embed,numiters) 9
set config(embed,prune) 1600.0
set config(embed,type) NORMAL
set config(embed,dumpstate) 1
set param(embed,basemodel) allophone.0

Table 2.9 lists the configuration setup for embedded parameter re-estimation. Starting from the initial cloned model the embedded re-estimation performs a total of nine passes over the data.

system prompt% hmmembed.tcl - -config digit.cfg
Iteration #1
processing MLF File: /u/johans/csluhmm/digit/labs/digit.all.mono
0 /projects/cslu/speech/corpora/CSLU/numbers/speechfiles/0/NU-25.zipcode.wav 
= -62.2594 
        .
        .
        .

The resulting HMM models (allophone.1 through allophone.9) may now be evaluated to select the best model.

Evaluation

system prompt% pwd
TUTORIAL/digit/mono	
system prompt% buildsearch.tcl allophone.0 ../search/digit.dict \
  ../search/digit.grammar digit.search


 
Table 2.10:   Search configuration setup (digit.cfg).
set config(search,beam) 300.0
set config(search,wordendbeam) 100.0
set config(search,grammarscale) 10.0
set config(search,crossmodel) -30.0
set config(search,input) $base/labs/dev.wrd
set config(search,output) $base/labs/dev.answer
set param(search,modelname) $base/mono/allophone.2
set param(search,searchfile) $base/mono/digit.search

Table 2.10 lists the settings for the Viterbi decoder. In the following session each of the files listed in the word transcription file (dev.wrd) is processed. The first best answers are written to the file dev.answer.

system prompt% ~/2.0/script/hmm_1.0/hmmsearch.tcl allophone.6 - \
		-config digit.cfg
processing MLF File: /u/johans/csluhmm/digit/labs/dev.wrd
0 0/NU-23.streetaddr.wav 
1 0/NU-23.zipcode.wav 
	.
	.


 
Table 2.11:   Scoring configuration setup (digit.cfg).
set config(score,suppress) sil <bn> <pau> <bs>
  <br> <ln> <tc> <cough>
  <laugh> <ct> <uu>
set param(score,reference) $base/labs/dev.wrd
set param(score,hypothesis) $base/labs/dev.answer

Using the true transcriptions (dev.wrd) and the hypothesized transcriptions (dev.answer), we can compute the performance of our models. Table 2.11 lists the settings for the NIST alignment procedure. In this example all extraneous speech labels are suppressed during the scoring.

system prompt% hmmscore.tcl - - -config digit.cfg
# words        : 4152
# insertions   : 62 (1.49325626204)
# deletions    : 47 (1.13198458574)
# substitutions: 106 (2.55298651252)
Word Correct    : 96.3150289017
Sentence Correct: 80.3894297636
Accuracy        : 94.8217726397
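The percentages in the scoring summary follow directly from the error counts, as the short Python check below shows. The formulas are inferred from the reported numbers (word correct excludes deletions and substitutions, accuracy additionally subtracts insertions); the variable names are illustrative.

```python
# Counts taken from the hmmscore.tcl session output.
words = 4152
insertions = 62
deletions = 47
substitutions = 106


def percent(count):
    return 100.0 * count / words


# Words that were neither deleted nor substituted count as correct.
word_correct = percent(words - deletions - substitutions)
# Accuracy additionally penalizes inserted words.
accuracy = percent(words - deletions - substitutions - insertions)

print(percent(insertions), percent(deletions), percent(substitutions))
print(word_correct, accuracy)
```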

Iterative improvement

All models trained thus far have assumed an initial model complexity of 4 mixtures per state. Given the amount of data available for training, the total number of parameters can be increased substantially, which will in turn increase the performance (accuracy) of the recognizer. The configuration script split4.desc increases the number of mixtures per state from 4 to 8.

system prompt% pwd
TUTORIAL/digit/mono
system prompt% cat split4.desc
inputmodel "allophone.5";
loadstats "state.5";

outputmodel "allophone.10";
include "split.desc";

system prompt% hscript split4.desc
Input model: allophone.5 allophone.list allophone.rr
Output model: allophone.10 allophone.list allophone.rr
Checking HMM model
WARNING: DEFUNCT MIXTURE: <w> state: 1 mixture: 2
WARNING: DEFUNCT MIXTURE: <td> state: 1 mixture: 2
minimum mixute weight: -6.908567
split: 8
	.
	.

In this example we see that there are two defunct mixtures. The splitting procedure above will first remove the defunct mixtures before increasing the number of parameters until the desired number of mixtures is reached.
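The remove-then-split idea can be sketched in Python. The details are assumptions rather than the contents of split.desc: the weight floor used to declare a mixture defunct, the choice to split the heaviest mixture first, and the perturbation of the means by a fraction of a standard deviation are a common heuristic chosen for illustration, and the example is one-dimensional for brevity.

```python
def split_mixtures(mixtures, target, floor=1e-3, spread=0.2):
    """Drop defunct mixtures, then split the heaviest until target is reached.

    mixtures: list of (weight, mean, variance) with scalar mean and variance.
    floor and spread are assumed values, not toolkit defaults.
    """
    live = [m for m in mixtures if m[0] > floor]
    total = sum(weight for weight, _, _ in live)
    live = [(weight / total, mu, var) for weight, mu, var in live]
    while len(live) < target:
        live.sort(key=lambda m: m[0], reverse=True)
        weight, mu, var = live.pop(0)          # heaviest mixture
        offset = spread * var ** 0.5
        # Replace it by two copies with perturbed means and half the weight each.
        live.append((weight / 2, mu - offset, var))
        live.append((weight / 2, mu + offset, var))
    return live


# Four mixtures for one state; the second one is defunct (zero weight).
state = [(0.5, 0.0, 1.0), (0.0, 5.0, 1.0), (0.3, 1.0, 1.0), (0.2, 2.0, 1.0)]
result = split_mixtures(state, target=8)
print(len(result), sum(weight for weight, _, _ in result))
```

Removing the defunct mixture first means that no split is spent on a component that carries no data.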

References

[1] J. Schalkwyk, J. H. de Villiers, S. van Vuuren, and P. Vermeulen, ``CSLUsh: An extendible research environment,'' Eurospeech 97, September 1997.

[2] J. K. Ousterhout, Tcl and the Tk Toolkit. Addison-Wesley, 1994.

[3] G. Zavaliagkos, Maximum A Posteriori Adaptation Techniques for Speech Recognition. PhD thesis, Northeastern University, Boston, Massachusetts, October 1995.

[4] C. J. Leggetter and P. C. Woodland, ``Speaker adaptation of HMMs using linear regression,'' Tech. Rep. CUED/F-INFENG/TR.181, Cambridge University Engineering Department, Cambridge, England, 1994.

[5] S. J. Young and P. C. Woodland, ``Tree-based state-tying for high accuracy acoustic modeling,'' Proc. Human Language Technology Workshop, pp. 307-312, March 1994.

[6] Y. Yan, M. Fanty, and R. Cole, ``Speech recognition using neural networks with forward-backward probability generated targets,'' Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. IV, pp. 3241-3244, 1997.

HMM Configuration Script

HMM Utilities

HMM Tcl Extensions

Data Selection Tcl extensions

Word models and Grammar Tcl extensions

Viterbi Search Tcl extensions
