CSLU Toolkit Packages


Viterbi
Version: 1.0
Created:
25 May 2003
Modified:
10 October 2005

Overview
The Viterbi package performs a Viterbi search on a state network and observation probabilities, returning the most likely word, phoneme, and/or state sequence for a given speech utterance.


Synopsis
package require Viterbi
viterbi init  statenetObject   [-blockSize blockSize] [-beamThreshold beamThreshold] [-beamThresholdAdjust beamThresholdAdjust] [-maxBeamThreshold maxBeamThreshold] [-targetBeamSize targetBeamSize] [-targetBeamVariance targetBeamVariance] [-beamRatio beamRatio] [-durationModel durationModel] [-allowWordCutoff allowWordCutoff] [-durationShortPenalty durationShortPenalty] [-durationLongPenalty durationLongPenalty] [-wordBoundaryPenalty wordBoundaryPenalty] [-observationProbFloor observationProbFloor]
viterbi search  viterbiObject statenetObject  probabilityMatrix
viterbi answer  viterbiObject statenetObject  [-score returnScore] [-words returnWords] [-phonemes returnPhonemes] [-states returnStates] [-name returnName]
viterbi reset viterbiObject

Parameters
Parameter Name
Type
Description
allowWordCutoff
integer
If this value is 1, then the waveform may begin or end with a partial word (a word cutoff) anywhere in the grammar.  If this value is 0, then the waveform must begin and end as specified in the grammar.  The default value is 0.
beamRatio
float
If, at any point in time, the number of states that survive the beam divided by the total number of states in the network is less than beamRatio, then only those states that survive the beam are evaluated at the next time frame.  If the ratio is greater than or equal to beamRatio, then all states are evaluated at the next time frame (the beam is effectively set to all possible states).  This provides a simple method of making the beam width maximum (by setting beamRatio to 0.0), and in other cases the value of beamRatio affects processing speed.  The default value is 0.0, although a value of 0.5 should yield reasonable results.
beamThreshold
float
If the accumulated probability score of a state at a point in time is less than the maximum accumulated probability score at that time minus beamThreshold, then the state is removed from the beam search.  If the state score minus the maximum score is greater than or equal to beamThreshold, then the state remains active at the next time frame.  The default value is -500.0.
beamThresholdAdjust
float
If the number of states active at any point in time is greater than targetBeamSize + targetBeamVariance and targetBeamSize is greater than zero, then beamThreshold is decreased by beamThresholdAdjust. Or, if the number of states active at any point in time is less than targetBeamSize - targetBeamVariance and targetBeamSize is greater than zero, then beamThreshold is increased by beamThresholdAdjust.   The default value of beamThresholdAdjust is -5.0.
blockSize
integer
Memory is allocated in chunks of blockSize units (where a single unit may be an integer or an entire structure).  A smaller value of blockSize (with a  lower limit of 1) reduces total memory consumption by only allocating space that will probably be used, but causes an increase in processing time by repeated calls to malloc() and realloc().  The default value is 500.
durationLongPenalty
float
If the "minmax" duration model is employed, then a state's score is penalized by durationLongPenalty for every frame that the state is longer than its maximum allowed duration.   If the "minmax" duration model is not employed, this parameter has no effect.  The default value is -1.0; this value is in the log domain, and corresponds to a probability value of 0.368.
durationModel
string
This string specifies the duration model used in the Viterbi search.  Allowable values are "exponential", "gamma", and "minmax", although only the "minmax" duration model is currently implemented.  An exponential model corresponds to the model implicit in HMMs through the use of state transition probabilities.  The gamma model models state duration using a gamma distribution function.  The minmax model applies penalties to a state if its duration is longer or shorter than specified values.
durationShortPenalty
float
If the "minmax" duration model is employed, then a state's score is penalized by durationShortPenalty for every frame that the state is shorter than its minimum allowed duration. If the "minmax" duration model is not employed, this parameter has no effect.  The default value is -6.5; this value is in the log domain, and corresponds to a probability value of 0.0015.
maxBeamThreshold
float
The beam threshold value beamThreshold is guaranteed to never be greater (closer to zero) than maxBeamThreshold.  The default value is -50.0.  This prevents the beam from becoming too restrictive.
observationProbFloor
float
If a state observation probability is less than observationProbFloor, the probability value will be set to observationProbFloor.   The default value is 1.0E-5.
probabilityMatrix
2D float vector object
This two-dimensional array of floating-point values contains the observation probabilities for the utterance.  The number of frames is the first dimension, and the number of categories is the second dimension.  The exact structure is specified in the AfloatT type definition. This object is typically generated by the "nnet x" or "garbage median" commands.
returnName
integer
If this value is 1, then the viterbi answer command will return the name of the grammar as the last item in a list.  If this value is 0, then the name will not be returned by the viterbi answer command.  The default value is 1.
returnPhonemes
integer
If this value is 1, then the viterbi answer command will return the list of best phonemes (with time information) as determined by the search.  If this value is 0, then the phoneme information will not be returned by the viterbi answer command.  The default value is 0.
returnScore

integer

If this value is 1, then the viterbi answer command will return the highest log-probability score of the utterance as the first item in a list.  If this value is 0, then the score will not be returned by the viterbi answer command.  The default value is 1.
returnStates
integer

If this value is 1, then the viterbi answer command will return the list of best states (with time information) as determined by the search.  If this value is 0, then the state information will not be returned by the viterbi answer command.  The default value is 1.
returnWords
integer

If this value is 1, then the viterbi answer command will return the list of best words (with time information) as determined by the search.  If this value is 0, then the word information will not be returned by the viterbi answer command.  The default value is 1.
statenetObject
statenet
object

This is the object returned by statenet create and used or modified by other Statenet commands.  It is used by the Viterbi package to define the state network of the HMM.
targetBeamSize
integer
This is the "ideal" number of states that survive the beam at each time frame.  If targetBeamSize is -1, then all states will survive the beam.  If targetBeamSize is 0, then the number of states in the beam is not adjusted to reach any particular target value.
targetBeamVariance
integer
If the number of states active at any point in time is greater than targetBeamSize + targetBeamVariance and targetBeamSize is greater than zero, then beamThreshold is decreased by beamThresholdAdjust. Or, if the number of states active at any point in time is less than targetBeamSize - targetBeamVariance and targetBeamSize is greater than zero, then beamThreshold is increased by beamThresholdAdjust.   The default value of targetBeamVariance is 10.  NOTE:  The use of the word "variance" in the variable name indicates that the target beam size is allowed to vary before applying adjustments; it is not indicative of the statistical notion of variance.
viterbiObject
viterbi object
This is the object returned by viterbi init and used or modified by other Viterbi commands.  This object contains information relevant to the search process other than the state network structure.
wordBoundaryPenalty
float
At every word boundary, wordBoundaryPenalty is added to the accumulated log probability score.  This allows some control over the relative numbers of insertion and deletion errors, although it's not a very elegant technique.  The default value is -3.0; this value is in the log domain, and corresponds to a probability value of approximately 0.05.


Description
The Viterbi package performs a Viterbi search on a state network and observation probabilities, returning the most likely word, phoneme, and/or state sequence for a given speech utterance.  The state network is specified in a statenet object, created by the Statenet package.  The observation probabilities are specified in a probability matrix, with frames along one dimension and classifier categories along the second dimension.  The Viterbi package is therefore applicable to both standard GMM-based Hidden Markov Models as well as HMM/ANN hybrids.  A number of duration models are expected to be implemented, although currently only one is supported.  A fairly large number of parameters allows for a great deal of control over the beam size and other aspects of the search process.

viterbi init creates and initializes the Viterbi search object based on the statenet object and other parameter values.

viterbi search  performs the actual search for the most likely state/phoneme/word sequence of a given utterance.  The search function may be used in a pipelined fashion, in that observation probabilities may be passed to the vitertbi search function in several different calls to the function.  The viterbi search function assumes that time accumulates with each new invocation until the viterbi reset function is called.

viterbi answer  determines the most likely state, phoneme, and/or word sequence after all observation probabilities have been passed to the viterbi search command.  

viterbi reset  resets the viterbi object to begin a new utterance.


Example
The following example of Tcl code performs all steps in the recognition of an isolated word using the Viterbi package.  For purposes of illustration,  the search is done first in a pipelined fashion, then the search is reset, and then the search is re-done in a single pass. For this example, the matrix of observation probabilities has been divided into four equal-length segments for the pipelined search.  A children's speech recognizer is used, and the utterance is from a child about 10 years old.  The code and other data necessary to run this example are in the following files:

viterbiExample.tcl
the following Tcl code
kids.grammar
the grammar that allows any amount of silence, garbage, or breath noise, followed by a single word, followed by any amount of silence, garbage, or breath noise.
kids.lexicon
the specification of each of 207 words, plus the before- and after-word separator model, in terms of phonemes
digit.train_fa2.spec
the specification of all context-dependent states in the recognizer, as well as other information used during recognition
kids_fa2_net.30
the neural network used for recognition
test.wav
the file containing the test waveform

foreach package {Statenet Nnrun Wave TrainLibrary Garbage Feature \
    Encode Context Prep Mx Viterbi} {
    package require $package
    }

set nnet_file kids_fa2_net.30
set wav_file  test.wav
set garbage   100

set stateNet [statenet create "kids.grammar" "grammar"]
statenet add $stateNet "kids.lexicon" "word"
statenet addSpec $stateNet "kids.train_fa2.spec" "phoneme" -selfLoops 1
statenet specUpdate $stateNet "kids.train_fa2.spec"

# statenet print $stateNet

set sampling_rate [expr int([statenet info -sampFreq $stateNet])]
set nnet [nnet optload $nnet_file]
set vObj [viterbi init $stateNet -targetBeamSize 10 \
    -beamThreshold -50.0]

set wave [wave read $wav_file]
get_features $wave feat $sampling_rate
compose_feature_vector $wave $feat $sampling_rate myNetIn
set myNetOut [nnet x $nnet $myNetIn]
set myNetOutG [garbage median -N $garbage $myNetOut]

set numFr [lindex [mx dim -row $myNetOutG] 0]
set btime_list [list 0 $numFr/4 $numFr/2 3*$numFr/4]
set etime_list [list $numFr/4 $numFr/2 3*$numFr/4 $numFr]
for {set idx 0} {$idx < 4} {incr idx} {
    set btime [lindex $btime_list $idx]
    set etime [lindex $etime_list $idx]
    set probMatrix [mx cut $myNetOutG [expr $btime]:[expr $etime-1],:]
    viterbi search $vObj $stateNet $probMatrix
    nuke $probMatrix
    }

set answer [viterbi answer $vObj $stateNet]
puts "Answer obtained via viterbi search pipeline:"
puts "\t[lindex $answer 1]"

viterbi reset $vObj

viterbi search $vObj $stateNet $myNetOutG
set answer [viterbi answer $vObj $stateNet]
puts "Answer obtained from processing probability matrix at once:"
puts "\t[lindex $answer 1]"

nuke $wave
nuke $stateNet
nuke $vObj
nuke $feat $myNetIn $myNetOut $myNetOutG


Returns
viterbi init returns a viterbi object.
viterbi search does not return anything.
viterbi answer returns a Tcl list containing the most likely word, phoneme, and/or state sequence.
viterbi reset does not return anything.

See Also
The Statenet package

Author
John-Paul Hosom, hosom@{cslu, bme, cse}.ogi.edu