Johan Schalkwyk, Don Colton and Mark Fanty
Center for Spoken Language Understanding (CSLU)
Oregon Graduate Institute of Science & Technology
June 1, 1998
Welcome to CSLUsh, the programming shell from CSLU, the Center for Spoken Language Understanding at Oregon Graduate Institute.
CSLUsh (pronounced ``slush'') provides a Tcl-based environment for research and development of spoken language systems (SLS). CSLUsh provides powerful tools to manipulate wave files, extract features, train and utilize artificial neural networks and Hidden Markov models, recognize spoken utterances subject to grammar and word model constraints, and perform other related activities. These tools are bound together using Tcl (pronounced ``tickle''), a flexible and extensible command language ideal for scripting and composing larger aggregate tools.
Developers and researchers should study Tcl to understand its syntax and calling conventions. The balance of this document assumes the reader is generally familiar with Tcl, and can read example code and modify it as needed to meet programming objectives. For familiarity with Tcl, the primary reference is ``Tcl and the Tk Toolkit,'' by John K. Ousterhout, published by Addison-Wesley.
The chapters of this book present a series of lessons that will acquaint the reader with CSLUsh, and which lead through the experiences of building simple recognizers. The lessons show hands-on interaction between a user (yourself) and the system in processing spoken language for various purposes. The major command groups and most common commands are introduced in this way.
The appendices of this book present reference material, including details of command usage.
Building and using the recognizers is a two-stage process. The training is conducted in advance. Examples of speech data similar to those we want to recognize are collected and used to train a neural network. Once trained, the neural network ``retains knowledge'' of these examples. Later, during the recognition stage, we'll use this neural network to compare speech from live users with these examples to see if they are a probable match.
Let's assume for the moment that we already have a neural network trained and ready for use, and discuss the recognition process first.
First, the signal is captured by a microphone (or other transducer) and converted into an electrical signal, where the amplitude of the signal corresponds to the magnitude of the original pressure variation. Such signals can be played back through a loudspeaker system, or through headphones, to recreate pressure variations that can be recognized by humans as the original speech.
Second, the signal is sampled at some frequency, so that only a finite number of amplitudes are recorded, stored, or transmitted, for a given period of time. Common sampling rates include 8000 Hz for telephone speech and 44100 Hz for compact disk recordings.
The Nyquist-Shannon sampling theorem provides that frequencies below half the sampling rate can be reconstructed from the sampled signal, so an 8000 Hz telephone signal can reconstruct frequencies up to 4000 Hz. Higher frequencies are subject to aliasing, such that a frequency of 4010 Hz cannot be distinguished from a frequency of 3990 Hz. (This same aliasing makes spinning wheels on stagecoaches appear to spin backwards in old western movies and television shows.)
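The aliasing claim can be checked numerically. The following is a minimal sketch in plain Tcl (no CSLUsh commands); the procedure names sampleCosine and maxAbsDiff are hypothetical helpers introduced only for this illustration. It samples a 3990 Hz and a 4010 Hz cosine at 8000 Hz and reports the largest difference between the two sample sequences, which should be zero apart from floating-point rounding.

```tcl
# Sample a cosine of frequency freq (Hz) at rate fs (Hz), returning n samples.
proc sampleCosine {freq fs n} {
    set pi 3.141592653589793
    set samples {}
    for {set i 0} {$i < $n} {incr i} {
        lappend samples [expr {cos(2.0 * $pi * $freq * $i / $fs)}]
    }
    return $samples
}

# Largest absolute difference between two equal-length lists of numbers.
proc maxAbsDiff {a b} {
    set worst 0.0
    foreach x $a y $b {
        set d [expr {abs($x - $y)}]
        if {$d > $worst} { set worst $d }
    }
    return $worst
}

set fs 8000.0
set below [sampleCosine 3990.0 $fs 64]   ;# 10 Hz below half the sampling rate
set above [sampleCosine 4010.0 $fs 64]   ;# 10 Hz above half the sampling rate
puts [format "max difference: %.3e" [maxAbsDiff $below $above]]
```

Once sampled, nothing in the data distinguishes the two tones, which is why the signal must be band-limited before sampling.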
Third, the signal is quantized into one of a discrete number of ``bins,'' so that only a finite number of bits is required to represent each sample. This is called Analog-to-Digital (A-to-D) conversion. In telephone speech, there are commonly 2^14 = 16384 such bins in linear coding. Because variations in higher amplitudes are not usually discriminated by humans, a logarithmic scaling can be used to represent the same information in 2^8 = 256 bins, where each of the bin numbers is essentially an 8-bit representation of the 14-bit integer originally captured.
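The effect of logarithmic scaling can be illustrated with the continuous mu-law companding formula. This is a sketch under stated assumptions, not the telephone codec itself: it is not bit-exact with ITU-T G.711, the procedure names muLawEncode and muLawDecode are hypothetical, and the signed code in the range -127 to 127 is a simplification of the 256-bin format.

```tcl
# Compress a signed 14-bit linear sample to a signed code in -127..127
# using the continuous mu-law curve (assumption: mu = 255).
proc muLawEncode {sample} {
    set mu 255.0
    set peak 8191.0                       ;# largest 14-bit signed magnitude
    set x [expr {$sample / $peak}]
    if {$x > 1.0}  { set x 1.0 }
    if {$x < -1.0} { set x -1.0 }
    set mag [expr {log(1.0 + $mu * abs($x)) / log(1.0 + $mu)}]
    set code [expr {int(floor($mag * 127.0 + 0.5))}]
    if {$x < 0} { set code [expr {-$code}] }
    return $code
}

# Expand a code back to an approximate linear sample value.
proc muLawDecode {code} {
    set mu 255.0
    set y [expr {abs($code) / 127.0}]
    set mag [expr {(pow(1.0 + $mu, $y) - 1.0) / $mu}]
    set sample [expr {$mag * 8191.0}]
    if {$code < 0} { set sample [expr {-$sample}] }
    return $sample
}

# Quiet samples 10 apart get different codes; loud samples 5 apart share one.
foreach s {10 20 8000 8005} {
    puts "sample $s -> code [muLawEncode $s]"
}
```

The bins are narrow near zero and wide near full scale, which matches the observation that small differences between loud samples are rarely perceived.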
Thus, a telephone signal will typically carry 8000 speech samples per second, each represented by an 8-bit number, for a total of 64,000 bits per second. In contrast, cellular telephones may employ only 4800 or 2400 bits per second, by using linear predictive coding (LPC) [rabiner83] or other signal compression techniques.
/.pau/), the phoneme
/n/, the phoneme /oU/ and silence. Each frame ``belongs''
to one of these parts. Using a neural network, we will phonetically
classify each frame. For the utterance ``no'' we will find some
number of frames of silence followed by some number of frames of
/n/ followed by /oU/ followed by silence. If instead we
find silence, /j/, /E/, /s/, silence, then we
know the word ``yes'' was spoken.
In reality, things are not so clear-cut. Any particular 10 msec frame may happen to overlap two phonemes, and the exact boundary between the phonemes is not always clear in any case. The transition is often gradual. Still, we assume that this model is close enough to give a good result.
Phoneme classifiers sometimes make mistakes, unfortunately. We cannot
rely on getting a perfect sequence of /n oU/ or /j E s/.
Faced with all this uncertainty, we take a probabilistic approach.
The phoneme classifier is used as a probability estimator. At every
frame, it computes the probability of every phoneme. We make the
assumption that the probability of a word is the product of the
phoneme probabilities for each frame of that word. This is not really
true since the probabilities of phonemes in adjacent frames are not
independent. However, it is computationally simple and this is what is
usually done. Hereafter the product of phoneme probabilities will be
called the score for the word.
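The scoring rule described above can be sketched in plain Tcl. The per-frame probabilities below are made-up illustrative numbers, not output from a CSLUsh network, and wordLogScore is a hypothetical helper. The sketch sums logarithms rather than multiplying probabilities, a common choice because products of many small probabilities underflow on long utterances; the ranking of the words is the same either way.

```tcl
# Hypothetical helper: log-score of a word, given the probability that the
# classifier assigned at each frame to the phoneme expected at that frame.
proc wordLogScore {frameProbs} {
    set score 0.0
    foreach p $frameProbs {
        # The log of a product is the sum of the logs.
        set score [expr {$score + log($p)}]
    }
    return $score
}

# Illustrative (made-up) per-frame probabilities for two candidate words.
set probsNo  {0.9 0.8 0.7 0.9}
set probsYes {0.2 0.1 0.3 0.2}

puts [format "no:  %.3f" [wordLogScore $probsNo]]
puts [format "yes: %.3f" [wordLogScore $probsYes]]
```

The word with the larger (less negative) log-score is the better match.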
There are, however, many ways in which the original signal can be modified such that, when the result is played, it is still readily recognized by human listeners as the original. This raises the question of which aspects of the signal actually carry the information needed for speech recognition and understanding.
From studying speech signals we know that the signal contains enough redundancy that some of the frequencies can be blocked out, but the listener will still understand what was said.
Telephone transmission uses this fact to represent human speech, with
an audible frequency range from 20 Hz to 20 000 Hz, using instead the
much smaller range of 300 Hz to 3300 Hz. Telephone users can tell that
the conversation is not face-to-face, but generally have very little
difficulty understanding each other. The worst difficulties seem to be
loss of discrimination of obstruent consonants: humans have trouble
telling /s/ from /f/ on the telephone. But in
general this is not a terrible problem, as humans make up for
recognition difficulties by using other context information to
accurately decide which phoneme was spoken.
This signal processing method modifies the short-term spectrum of the speech by several psychophysically based transformations. These values therefore are robust to many kinds of variation in the speech waveform. That is, if a change in the waveform is not perceived by a human listener, the corresponding PLP values will be very similar.
The frame-based recognizer provided with the CSLUsh development environment uses a window of 80 samples (10 msec) to derive seven PLP coefficients and a measure of energy within the window. This process is repeated at 10 msec intervals, giving eight numbers every 10 msec by which to characterize speech.
Very briefly, a neural network consists of several simple units (analogous to neurons in the brain) which have a numerical output value. There are weighted connections between units. Positive weights are excitatory and negative weights are inhibitory. The networks we use have no feedback paths. The output values of certain units are set externally. These are the inputs to the net representing the frame of speech to be classified. These values are propagated across connections to other units. Each destination unit sums its net input and computes an output value in the range 0 to 1 from this sum. If there are additional layers of units, the process is repeated.
Eventually, the propagated activation reaches the final layer of the
net which has no further connections. These units are the output of
the net and, ideally, their values represent the probability that the
input is from various phonemes.
A snapshot of a neural network for
our yes/no example is shown in Figure 1.2.
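The propagation step described above can be sketched in plain Tcl. This is an illustration only: the text says that each unit maps its summed input to a value between 0 and 1 but does not name the function, so the logistic function used here is an assumption, as are the bias terms, the weights, and the procedure names unitOutput and layerOutput.

```tcl
# Output of one unit: squash the weighted input sum into the range 0..1.
# Assumption: logistic squashing function and an additive bias.
proc unitOutput {inputs weights bias} {
    set sum $bias
    foreach x $inputs w $weights {
        # Positive weights excite the unit, negative weights inhibit it.
        set sum [expr {$sum + $x * $w}]
    }
    return [expr {1.0 / (1.0 + exp(-$sum))}]
}

# One layer: every unit sees the same inputs but has its own weights and bias.
proc layerOutput {inputs weightRows biases} {
    set out {}
    foreach row $weightRows b $biases {
        lappend out [unitOutput $inputs $row $b]
    }
    return $out
}

# Three inputs, two hidden units, one output unit; all weights are made up.
set inputs {0.5 -1.0 2.0}
set hidden [layerOutput $inputs {{1.0 0.5 -0.5} {-1.0 1.0 0.25}} {0.0 0.1}]
set output [layerOutput $hidden {{2.0 -2.0}} {0.0}]
puts "hidden: $hidden"
puts "output: $output"
```

Because there are no feedback paths, one pass over the layers in order is enough to compute the outputs.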
In addition to the eight PLP coefficients representing the frame of speech we want to classify, we also present as input to the network PLP coefficients from nearby frames. This gives the net some context and greatly improves the classification accuracy.
While we want to find the highest scoring path through the matrix, we
also want the resulting word to represent something meaningful in our
vocabulary. The word ``yes'' has three phonemes: /j/,
/E/ and /s/, with a transition from /j/ to
/E/ and from /E/ to /s/. We'll constrain the
transitions between phonemes so that they reflect the legal
pronunciations of our words.
There are many possible paths through the matrix, but for any particular phoneme in any given word there is one path to the end that will maximize the score. We can take advantage of this insight to increase the efficiency of the search. If we see two paths come together at the same phoneme in the same word at the same time, we'll discard the path with the lower score. Since the two paths will behave identically from that point on, we know the lower-scoring path will never ``catch up''.
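A minimal sketch of this path-merging idea in plain Tcl, for a single word with a strictly left-to-right pronunciation. It is not the CSLUsh tree search: there is no background model, no duration penalty and no lexical tree, the procedure name viterbiScore is hypothetical, and the log-probabilities are made-up numbers.

```tcl
# Best log-score for aligning a left-to-right phoneme sequence to the frames.
# frames:   list of frames, each a flat list "phoneme logprob phoneme logprob ..."
# phonemes: the pronunciation of the word, e.g. {n oU}
proc viterbiScore {frames phonemes} {
    set negInf -1.0e30
    set n [llength $phonemes]
    # best(s) is the best score of any path currently in state s.
    # State -1 is a virtual start state, usable only for the first frame.
    set best(-1) 0.0
    for {set s 0} {$s < $n} {incr s} { set best($s) $negInf }

    foreach frame $frames {
        array set logp $frame
        # Go right to left so best(s-1) still holds the previous frame's value.
        for {set s [expr {$n - 1}]} {$s >= 0} {incr s -1} {
            set prev [expr {$s - 1}]
            set ph [lindex $phonemes $s]
            # Two paths meet here: stay in s, or advance from s-1.
            # Only the better one is kept; the other can never catch up.
            set stay  $best($s)
            set enter $best($prev)
            set keep  [expr {$stay > $enter ? $stay : $enter}]
            set best($s) [expr {$keep + $logp($ph)}]
        }
        set best(-1) $negInf
    }
    set last [expr {$n - 1}]
    return $best($last)
}

# Four frames of made-up log-probabilities for five phonemes.
set frames {
    {n -0.2 oU -2.0 j -2.5 E -3.0 s -3.0}
    {n -0.3 oU -1.5 j -2.5 E -3.0 s -3.0}
    {n -2.0 oU -0.2 j -2.5 E -3.0 s -3.0}
    {n -2.5 oU -0.1 j -2.5 E -3.0 s -3.0}
}
puts "no:  [viterbiScore $frames {n oU}]"
puts "yes: [viterbiScore $frames {j E s}]"
```

Only one score per state is stored at each frame, so the work grows with the number of frames times the number of states rather than with the number of possible paths.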
/th/ in the
middle: ``tweny''. Accurately representing real variation(s) in
pronunciation is very important. The Viterbi search will blindly score
words according to the pronunciation we have given it with no
flexibility. If a /th/ is required and none is found, the score
may be significantly lowered.
To alleviate this problem, our recognizers take as an input a word
pronunciation dictionary. The following example of a pronunciation
entry for the word ``alligators'' reflects differences in how the word
is pronounced by various regional and ethnic groups. See
Appendix
for an explanation of the symbol set
used.
Currently, getting good pronunciations is heavily dependent on manual intervention. Future research plans include using phonological rules to map dictionary pronunciations to the forms most likely to actually be said.
To alleviate this problem, we need to fine-tune our phoneme modeling.
In essence, we'll treat a single phoneme as consisting of two or three
parts, with each part depending either on the phoneme to the left
(left context) or the phoneme to the right (right context). For
example the word ``near'' has the phonetic pronunciation
/n i: 9r/. Breaking this into finer units which are called
segments, our recognizer models this as
$sil<n n>$fnt $nas<i: <i:> i:>$ret $fnt<9r 9r>$sil
The /n/ and the /9r/ are modeled as two separate
segments, each dependent on the context provided by its neighbors. The
/i:/ has an additional context-independent middle part. For example,
the symbol n>$fnt means ``the last part of n before a front vowel''. This ensures greater uniformity across multiple
instances of the segment. Unfortunately, it also complicates the
recognition process, since phonetic pronunciations must be expanded
into segments prior to the Viterbi search.
/f/ and /n/ at the
beginning and end of words can't be accurately distinguished from
background, we'll always need to leave a buffer around the end points
our algorithm has selected. There may also be brief pauses between
words.
To solve this problem, we treat background as if it were a phoneme with its own output on the neural net. It is trained on examples of background from the collected speech data. Unfortunately, if the background noise is loud, this may not work well (and may also fool our endpointing algorithm). So we supplement this approach by also using a garbage model.
The idea behind a garbage model is to give our recognizers the
flexibility to compensate for extraneous noise (if present) without
penalizing the recognition. Our background model is composed of two
``phonemes'': /.pau/, which models silence and has been trained
on collected speech data, and /.garbage/, which models
unexpected noises like coughs, unintelligible speech and the
like. However, unlike /.pau/, whose score is a direct output of
the neural net, the score for /.garbage/ is computed in terms
of the other phoneme scores. In our garbage model we assign the median
score of the N largest probabilities to /.garbage/, where N
is a function of the number of outputs in the net. This means that
/.garbage/ is roughly equal to the (N/2)-th highest probability that
the neural net comes up with for a given frame. We model our
background as the maximum of either /.pau/ or
/.garbage/, so that we can compensate for either case without
penalizing the quality of the recognition. The ideal garbage model
will have a better score while noise is occurring than any of the
vocabulary words, and a worse score during speech, provided the speech
is in our vocabulary. We can also use the garbage model to detect when
the word spoken is not in our vocabulary, because the score for
/.garbage/ will be ``relatively'' high. As you might expect, getting the
``relatively'' part just right requires some adjusting. (This is done
by including samples of both in-vocabulary and out-of-vocabulary
utterances among our training data.)
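The garbage score for one frame can be sketched in plain Tcl. This is an illustration under stated assumptions, not the CSLUsh implementation: the procedure names garbageScore and backgroundScore are hypothetical, the probabilities are made up, the value N = 5 is arbitrary, and averaging the two middle values when N is even is only one possible convention.

```tcl
# Hypothetical helper: median of the n largest values in probs.
proc garbageScore {probs n} {
    set sorted [lsort -real -decreasing $probs]
    set top [lrange $sorted 0 [expr {$n - 1}]]
    set count [llength $top]
    set mid [expr {$count / 2}]
    if {$count % 2 == 1} {
        return [lindex $top $mid]
    }
    # Even count: average the two middle values (an assumed convention).
    set lo [expr {$mid - 1}]
    return [expr {([lindex $top $lo] + [lindex $top $mid]) / 2.0}]
}

# Background is whichever of silence and garbage scores higher.
proc backgroundScore {pauProb garbageProb} {
    return [expr {$pauProb > $garbageProb ? $pauProb : $garbageProb}]
}

# Made-up phoneme probabilities for a single frame.
set probs {0.02 0.60 0.05 0.20 0.01 0.10 0.02}
set garbage [garbageScore $probs 5]
puts "garbage score: $garbage"
puts "background:    [backgroundScore 0.01 $garbage]"
```

In a frame where one phoneme clearly dominates, the median of the top N sits well below the winner, so a matching vocabulary word outscores garbage; in a noisy frame where no phoneme stands out, garbage does comparatively well.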
The garbage model allows us to do some simple word spotting as well. If the answer to a yes/no question is ``no, I would not'', the ``no'' would match as usual and the ``I would not'' would match the garbage model. Optimal word spotting performance requires a more sophisticated approach and is a subject of ongoing research.
Rather than apply absolute minimums or maximums, our recognizers impose a per-frame penalty on the score for segments which are too long or too short. This gives the recognizer some flexibility in overcoming poor word models or sloppy articulation where a segment may really be missing. These duration limits are derived from ``generic'' samples of the phonemes. The recognition improves if we fine-tune the duration limits to match a specific task.
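One way such a per-frame penalty could work is sketched below in plain Tcl. The exact rule used by the recognizers is not given in this chapter, so the linear form, the procedure name durationPenalty, and the limits and penalty values in the example are all assumptions made for illustration.

```tcl
# Hypothetical sketch: penalty added to a segment's log-score.
# Each frame short of minFrames costs shortPen; each frame beyond
# maxFrames costs longPen. Both penalties are negative log-domain values.
proc durationPenalty {frames minFrames maxFrames shortPen longPen} {
    if {$frames < $minFrames} {
        return [expr {($minFrames - $frames) * $shortPen}]
    }
    if {$frames > $maxFrames} {
        return [expr {($frames - $maxFrames) * $longPen}]
    }
    return 0.0
}

# Assumed limits of 3 to 10 frames, with made-up penalties.
foreach len {1 5 12} {
    puts "$len frames -> penalty [durationPenalty $len 3 10 -6.0 -2.0]"
}
```

Because the penalty grows gradually instead of forbidding a duration outright, a strongly matching segment can still win even when it is somewhat too short or too long.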
In a frame-based recognizer, the speech is processed a frame at a time anyway, so there is no reason to wait until the user has stopped talking to begin recognition. As the speech is recorded, phoneme probabilities can be estimated and the Viterbi search can begin its left-to-right processing. If the algorithm can keep up in real time (taking no more than a second to process a second of speech) then it does not matter how long the caller talks; recognition will be done when he or she is, with some small delay.
The only parts of the algorithm which are not inherently pipelined are the DC removal and the normalization of the PLP energy. For DC removal, we ideally compute the average of the signal over the whole utterance. In the pipelined system, we just use past information.
For the normalization of PLP energy, we ideally know the peak energy
in the call, so we can represent the energy at each frame as a
fraction of the peak. In the pipelined system, we use the peak energy
so far (beginning with an estimate of the expected peak) up to the
current frame plus 150 msec. This causes a 150 msec delay in the
processing of speech.
Feature computation requires an 80 msec delay since the classification of a frame uses features from 80 msec in the future (as well as 80 msec in the past).
On a DSP implementation, the speech will actually be processed as it arrives, in 10 msec chunks, with the 150 msec delay handled by buffering 15 frames of speech. In a workstation implementation, the speech will probably arrive in larger chunks. Each chunk will be processed and buffers will store enough context to begin the next chunk.
yes'' and ``no.'' For this
purpose we choose to use one of the general-purpose recognizers,
supplied with the CSLUsh development environment. All of the
information in this chapter will be revisited in more detail in later
chapters as well as in the reference guide. The purpose of this
chapter is to show the overall structure of CSLUsh and how it can be
used to build a simple speech recognition system.
tclsh will start up in interactive mode, accepting input from the tclsh prompt. For ease of presentation, we will show interactions as follows:
system prompt% tclsh
%
The session box indicates that you can type some commands and get some responses. In particular, this session box tells us that there is a system prompt (``system prompt%'') already present on your screen. You type ``tclsh'' and press enter. The system responds with ``% '' on the new line.
To start off, we first need to copy all files needed for the tutorial into our private workspace.
system prompt% mkdir foo
system prompt% cd foo
system prompt% cp $CSLUDIR/tutorial/cslush/* .
system prompt% ls
analysis.tcl month.wav taskdigit.tcl*
digit.desc months.tcl* yesno.wav
. .
system prompt% tclsh8.0
%

You will need most of these files. These include examples which will be explained in later chapters. The actual path name referred to above will depend on the location of your local CSLUsh installation. The transcript for the examples presented in this chapter can be found in the file chapter2.tcl.
% llength [info commands]
84
There are 84 commands active in tclsh. The set of packages in the CSLUsh installation adds to the base functionality of Tcl, providing among many others, an extended set of commands used to build speech recognizers.
CSLUsh uses the base Tcl package command to load the modules it needs. To maintain proper version control, all needed packages must be loaded explicitly, as is done in the following piece of Tcl code.
% foreach x {Context Wave Prep Analysis Opt Obtrain Garbage Word Tree Rtcl} {
package require $x 1.0
}
Typical setup commands like these may be added to your local
.tclshrc file, which is read when tclsh is started in interactive
mode. The optional version number (1.0) indicates that we are
requesting version 1.0 of each package specified.
% wave read yesno.wav
wave:0
Tcl operates exclusively on strings of ASCII characters. This provides a consistent and extensible programming interface, because there is only one data type.
But this oversimplifies the truth. Even Tcl, when it opens files, does not return the contents of the file. Instead a file handle, such as ``file3,'' is returned. Similarly, CSLUsh returns a handle which can be used later to access the wave object. The actual wave data is therefore never referred to directly.
To get some information on our new wave object, you may use the wave info command. The info sub-command will be referred to frequently in this book. It allows us to display information of previously created CSLUsh objects.
% wave info wave:0
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192
This indicates that the wave has 9812 samples. The samples are
encoded in linear format (as opposed to a scaled format such as
µ-law) using 16 bits for each sample. There are 8000 samples per
second. The entire utterance is 1226.5 milliseconds long (about 1.23
seconds). The maximum magnitude occurs at millisecond 565.375 (sample
point 4523), and is 2492. The average energy (average of squares of
magnitude) in the wave is 39151.1. The DC offset (average sample value)
for this wave is -10.0192. Figure 2.1 depicts the speech
waveform referenced as wave:0.
% nuke wave:0
% wave info wave:0
object "wave:0" does not exist
The nuke command tells CSLUsh that the memory required for the wave
object wave:0 is no longer needed. Therefore the memory can be
freed. During the course of this tutorial we will be creating many
similar objects. Some of these are used only during an intermediate
processing step and may be destroyed (nuked) at that time.
Another alternative is to destroy all objects when we are done. There
are advantages and disadvantages to both. It will become clearer
in later chapters when it is necessary to destroy objects no longer in
use. For the purpose of this tutorial we will destroy all objects
created when we are done building our first recognizer.
% set myWave [wave read yesno.wav]
wave:1
Tcl provides command substitution, which allows us to use the result of one command in an argument to another. We eventually plan to write our own CSLUsh scripts, so we might as well start using some variables.
% puts $myWave
wave:1
% prep
wrong # args: should be "prep option arg ?arg ...?"
The prep command requires that we specify an option and one or more arguments. In this case the options will specify which subcommand of the prep command should be used. The !dc subcommand removes the direct current offset using a simple first-order low-pass filter. Notice that by giving an incomplete (or incorrect) command, some (usually) helpful information is printed in response.
% prep foo
bad option "foo": should be !dc or fir
The result indicates that the prep command accepts two valid subcommands: !dc (for no direct current) and fir (for implementing finite impulse response filters).
% prep !dc
wrong # args:should be !dc initialize || reset || info || dcrmob(string) wave(string) ?outwave(string)
Note also that the prep !dc command takes a variable number of parameters depending on the task it is performing. Please refer to the reference pages for a detailed description of the prep command.
Before we can remove the direct current offset we first need to create a prep-operator object. This object will contain all information needed for the prep !dc command to perform its task. Almost all CSLUsh commands have associated operator-objects. These need to be created before the specific command can function or operate on its associated data.
The first step therefore is to create a !dc-operator. This is done by running the prep !dc command with the initialize option. The !dc-operator object returned specifies how to remove the direct current offset. Various parameters can be set with the initialize option, all of which will influence how the direct current offset will be removed.
% set prepInit [prep !dc initialize]
dcrm:2
% prep !dc info $prepInit
{-tau 300.0} {-rate 8000.0}
Here !dc refers to its functionality (removing direct current or
``no direct current''). You will probably recognize the other
mnemonics as we go along. The first part of the object's name tells us
what kind of object it is, and identifies the internal (hidden) data
structure of the object. The second part of the object's name
indicates the total number of objects created thus far. The actual
numbering is not important since we will be referring to the object
using variables instead. Since the initialize option was used
without any optional parameters the default parameters were chosen.
The default parameters can be overridden during
initialization using the -tau and -rate options of the
prep !dc initialize command.
We have now created a !dc-operator object, which we will reference through the variable prepInit. The next step is to use this operator to process our wave object, removing the direct current offset from the speech data object myWave.
% set myPrep [prep !dc $prepInit $myWave]
wave:3
% wave info $myPrep
9812 linear-16 8000.0 1226.5 {2488 4523 565.375} 39283.8 -0.131166
``wave:3'' is the handle for the result of the DC removal. Note here
that the direct current offset (-0.131166) is not equal to zero. This
residual is a result of approximating the DC-offset removal using a
low-pass filter, rather than computing the DC-offset over the whole
waveform.
Similarly to the prep command, we first need to create a plp-operator object. This is done using the initialize option.
% set plpInit [analysis plp initialize]
plp:4
% analysis plp info $plpInit
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 7}
{-feats 8} {-filters 17} {-exp 0.6} {-lograsta 1.0}
% set myPlp [analysis plp $plpInit $myPrep]
arrayF:6
The defaults are listed using the info option of the
analysis plp command. The above command therefore extracts 8 PLP
cepstral coefficients every 10 msec, based on an analysis window of 10
msec. The cepstral coefficients are exponentially liftered (weighted)
using a liftering factor set to 0.6. Figure 2.2 depicts the
spectrogram calculated using our PLP coefficients calculated for every
10 msec time slice. The spectrogram is calculated by converting the
time-domain cepstral parameters (PLP coefficients) back to frequency
components using a discrete Fourier transform (DFT).
% set enormInit [analysis energynorm initialize]
enorm:7
% analysis energynorm info $enormInit
{-lookahead 160.0} {-framesize 10.0} {-decay 0.999} {-filt 0.1}
{-maxest 4.0} {-coeffs 8}
% set myEnorm [analysis energynorm $enormInit $myPlp -flush]
arrayF:8
The zeroth element of our PLP feature represents an energy-like
feature. analysis energynorm normalizes this element by
dividing it by an estimate of the maximum power in the speech
signal. The estimate of the maximum is initialized to a constant (4.0)
and decays until another peak of higher power is detected in which
case the maximum estimate is re-initialized to this peak value.
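The decaying maximum estimate can be sketched in plain Tcl. This is a simplified illustration, not the analysis energynorm implementation: it ignores the -lookahead and -filt parameters, the order in which decay and peak detection are applied is an assumption, the procedure name normalizeEnergy is hypothetical, and the energy values are made up.

```tcl
# Divide each frame energy by a running estimate of the maximum energy.
# The estimate starts at maxEst, decays by the factor decay on every frame,
# and is reset whenever a frame exceeds it.
proc normalizeEnergy {energies maxEst decay} {
    set out {}
    foreach e $energies {
        set maxEst [expr {$maxEst * $decay}]
        if {$e > $maxEst} { set maxEst $e }
        lappend out [expr {$e / $maxEst}]
    }
    return $out
}

# Made-up energies, using the default initial estimate (4.0) and decay (0.999).
set normalized [normalizeEnergy {1.0 8.0 2.0} 4.0 0.999]
puts $normalized
```

Since the estimate is reset on every new peak, normalized values never exceed 1, and the frame that sets a new peak is normalized to exactly 1.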
% set contextInit [collect initialize]
context:9
% collect info $contextInit
{-frames {{-8 1} {-4 1} {-1 1} {0 1} {1 1} {4 1} {8 1}}} {-coeffs 8}
{-delay 90.0}
% set myFeat [collect $contextInit $myEnorm -flush]
arrayF:10
Figure 2.3 depicts graphically the feature collection
using a context window of 160 msec.
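The collection step can be sketched in plain Tcl. The frame offsets are taken from the -frames setting shown above; everything else is an assumption made for illustration: the procedure name collectContext is hypothetical, the second number in each {offset 1} pair is ignored because its meaning is not described in this chapter, and repeating the first or last frame at the edges of the utterance is only one possible convention.

```tcl
# Build the network input for frame t by concatenating the feature vectors
# found at the given frame offsets relative to t.
proc collectContext {features t offsets} {
    set last [expr {[llength $features] - 1}]
    set vector {}
    foreach off $offsets {
        set i [expr {$t + $off}]
        # Assumption: repeat the first or last frame beyond the utterance edges.
        if {$i < 0} { set i 0 }
        if {$i > $last} { set i $last }
        foreach value [lindex $features $i] { lappend vector $value }
    }
    return $vector
}

# Twenty dummy frames of eight coefficients; each value is its frame index.
set features {}
for {set f 0} {$f < 20} {incr f} {
    set frame {}
    for {set c 0} {$c < 8} {incr c} { lappend frame $f }
    lappend features $frame
}

set offsets {-8 -4 -1 0 1 4 8}
set input [collectContext $features 10 $offsets]
puts "inputs per frame: [llength $input]"
```

Seven frames of eight coefficients give 56 network inputs per frame, spanning from 80 msec before to 80 msec after the frame being classified.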
% package require Genrecog 0.0
0.0
% genrecog initialize recog
CREATE
% puts $recog(nnet)
nnet:13
% set netOut [nnet x $recog(nnet) $myFeat]
arrayF:22

The genrecog initialize function call creates an instance of the general purpose recognizer. This defines the specific signal processing needed for this recognizer, the neural network being used, and the functional description of the neural network output units (context-dependent phoneme models).
Our ultimate task is keyword spotting. We want to ignore words that do not match our target vocabulary. We do this by defining a ``garbage'' phoneme, whose probability is calculated as the median of the top N probabilities, frame by frame. The actual value of N is specific to the recognizer being used.
When the Viterbi search seeks the best path through the phonemes, our garbage phoneme should give a better score on average than the phonemes required for the target vocabulary, unless we are actually scanning a portion of a target vocabulary word.
% set netOutG [garbage median -N $recog(garbage) $netOut]
arrayF:23
Here $netOutG is just like $netOut, but with an extra value added (garbage score) for each frame of phoneme probabilities. Now we are ready with an array of probability estimates. When we have our search model ready, we will examine this array and identify the target vocabulary word that matches most closely.
% set Words {{"yes" "j E s"} {"no" "n oU"}}
{"yes" "j E s"} {"no" "n oU"}
The next step is to convert these word pronunciations into word model objects. Word model objects are CSLUsh objects, which describe the pronunciation of the words in a more machine-accessible form. For this purpose we use the word create command.
% set myWords [word create $Words]
word:24 word:25

Although these word pronunciations are now in a more machine-accessible form, they do not fully describe the pronunciation of the words with respect to our phoneme probability estimator. The aim is to build word pronunciations that are recognizable by our phoneme recognizer. This is known as adding context to our word pronunciation models. The word context command is used for this purpose.
% set contextWords [word context $myWords $recog(names)]
word:26 word:27
Here the variable $recog(names) refers to the functional description of the neural network output units, and is created when an instance of the recognizer is initialized (e.g., genrecog initialize). Chapter 5 discusses the description of the neural-network output units in greater detail. To find out how our general purpose recognizer would pronounce each of the words in the vocabulary, we can use the word pronun command.
% word pronun [lindex $contextWords 0] $recog(names)
{{$sil<j} {j>$mid} {$fntl<E} <E> {E>$obs} {$mid<s} {s>$sil}}
% word pronun [lindex $contextWords 1] $recog(names)
{{$sil<n} {n>$bckr} {$nas<oU} <oU> {oU>$sil}}
{"call-block" "kc kh A l [.pau] bc b l A kc kh"}
{"call-forwarding" "kc kh A l [.pau] f > 9r w 3r d( i: N"}
a lexical tree would combine the first section of each word, such that the search is described as depicted in Figure 2.4.
Using a lexical tree to direct the Viterbi search saves both memory and computation. The tree build command is used to build such a lexical tree.
% set myTree [tree build $contextWords $recog(any) $recog(names)]
treesearch:28
% tree info $myTree -nbest -shortpen -longpen -framesize
{-nbest 4} {-shortpen -5.99146} {-longpen -2.00248} {-framesize 10.0}
Like the $recog(names) object, the variable $recog(any), which specifies the default background speech modeling, is defined when the recognizer is created.
% tree update $myTree $netOutG
The last step of the recognition process is to retrieve the answer. This is done using the tree getbest command.
% tree getbest $myTree
{no -131.681} {yes -178.28}
By default, the lexical tree search will keep the four best
answers. In this case we can clearly see that the best answer was
``no.'' The second best word ``yes,'' has a much lower associated
score, and we therefore conclude that the file yesno.wav
contains the spoken word ``no.''
% prep !dc reset $prepInit
dcrm:2
% analysis plp reset $plpInit
plp:4
% analysis energynorm reset $enormInit
enorm:7
% collect reset $contextInit
context:9
% tree reset $myTree
% set oblist [object list]
context:9 arrayF:10 context:19 dcrm:2 enorm:7 btraceheap:20 word:24
enorm:18 word:25 arrayF:6 word:26 plp:17 wave:1 word:27 arrayF:8
wave:3 probname:12 treesearch:28 plp:4 arrayF:22 nnet:13 arrayF:23
garbage:21 dcrm:16
% nuke $oblist
% puts [object list]

The object list command provides a listing of all remaining objects. Since we are done with the first tutorial, none of these are needed, and we can therefore free all allocated memory with the nuke command. The nuke command operates on both single objects and lists of objects. In this case we are using object list to create a list of CSLUsh objects, which can then be freed using the nuke command.
% package require Genrecog 0.0
0.0
% genrecog initialize recog
CREATE
% set w [wave read yesno.wav]
wave:11
% set Words {{yes "j E s"} {no "n oU"}}
{yes "j E s"} {no "n oU"}
% genrecog tree recog search $Words
% genrecog pipe recog search $w
% genrecog result recog search
{no -131.681 .....
Together with the word scores, the genrecog result function call also returns the associated phoneme alignment. More on this in chapter 6.
This shows which key components are needed, and how they work together for building a fixed vocabulary recognizer.
Session:
% foreach x {Context Wave Prep Analysis Opt Obtrain Garbage Word Tree Rtcl} {
package require $x 1.0
}
% wave read yesno.wav
wave:0
% wave info wave:0
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192
% nuke wave:0
% wave info wave:0
object "wave:0" does not exist
% set myWave [wave read yesno.wav]
wave:1
% puts $myWave
wave:1
% set prepInit [prep !dc initialize]
dcrm:2
% prep !dc info $prepInit
{-tau 300.0} {-rate 8000.0}
% set myPrep [prep !dc $prepInit $myWave]
wave:3
% wave info $myPrep
9812 linear-16 8000.0 1226.5 {2488 4523 565.375} 39283.8 -0.131166
% set plpInit [analysis plp initialize]
plp:4
% analysis plp info $plpInit
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 7}
{-feats 8} {-filters 17} {-exp 0.6} {-lograsta 1.0}
% set myPlp [analysis plp $plpInit $myPrep]
arrayF:6
% set enormInit [analysis energynorm initialize]
enorm:7
% analysis energynorm info $enormInit
{-lookahead 160.0} {-framesize 10.0} {-decay 0.999} {-filt 0.1}
{-maxest 4.0} {-coeffs 8}
% set myEnorm [analysis energynorm $enormInit $myPlp]
arrayF:8
% set contextInit [collect initialize]
context:9
% collect info $contextInit
{-frames {{-8 1} {-4 1} {-1 1} {0 1} {1 1} {4 1} {8 1}}} {-coeffs 8}
{-delay 90.0}
% set myFeat [collect $contextInit $myEnorm -flush]
arrayF:10
% package require Genrecog 0.0
0.0
% genrecog initialize recog
CREATE
% puts $recog(nnet)
nnet:13
% set netOut [nnet x $recog(nnet) $myFeat]
arrayF:22
% set netOutG [garbage median -N $recog(garbage) $netOut]
arrayF:23
% set Words {{"yes" "j E s"} {"no" "n oU"}}
{"yes" "j E s"} {"no" "n oU"}
% set myWords [word create $Words]
word:24 word:25
% set contextWords [word context $myWords $recog(names)]
word:26 word:27
% word pronun [lindex $contextWords 0] $recog(names)
{{$sil<j} {j>$mid} {$fntl<E} <E> {E>$obs} {$mid<s} {s>$sil}}
% word pronun [lindex $contextWords 1] $recog(names)
{{$sil<n} {n>$bckr} {$nas<oU} <oU> {oU>$sil}}
% set myTree [tree build $contextWords $recog(any) $recog(names)]
treesearch:28
% tree info $myTree -nbest -shortpen -longpen -framesize
{-nbest 4} {-shortpen -5.99146} {-longpen -2.00248} {-framesize 10.0}
% tree update $myTree $netOutG
% tree getbest $myTree
{no -131.681} {yes -178.28}
% prep !dc reset $prepInit
dcrm:2
% analysis plp reset $plpInit
plp:4
% analysis energynorm reset $enormInit
enorm:7
% collect reset $contextInit
context:9
% tree reset $myTree
% set oblist [object list]
context:9 arrayF:10 context:19 dcrm:2 enorm:7 btraceheap:20 word:24
enorm:18 word:25 arrayF:6 word:26 plp:17 wave:1 word:27 arrayF:8
wave:3 probname:12 treesearch:28 plp:4 arrayF:22 nnet:13 arrayF:23
garbage:21 dcrm:16
% nuke $oblist
% puts [object list]
% package require Genrecog 0.0
0.0
% genrecog initialize recog
CREATE
% set w [wave read yesno.wav]
wave:11
% set Words {{yes "j E s"} {no "n oU"}}
{yes "j E s"} {no "n oU"}
% genrecog tree recog search $Words
% genrecog pipe recog search $w
% genrecog result recog search
{no -131.681 {{560 570 {$sil<n}} {570 580 {n>$bckr}} {580 590 {$nas<oU}}
{590 600 <oU>} {600 630 {oU>$sil}}} {{560 630 no}}}
{yes -178.28 {{20 30 {$sil<j}} {30 40 {j>$mid}} {40 50 {$fntl<E}}
{50 60 <E>} {60 70 {E>$obs}} {70 80 {$mid<s}} {80 100 {s>$sil}}}
{{20 100 yes}}} -1116.5933
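As mentioned in the previous chapter, one of the core functions of CSLUsh is interfacing with speech data. In this chapter we review the CSLUsh utilities used for the manipulation of CSLUsh speech wave objects. These consist of commands for reading and writing wave files (wave read, wave write), inspecting wave statistics (wave info), chopping silence (wave chopsil), and converting to and from vector objects (wave tovec, wave fromvec).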
The transcript for the examples presented can be found in the file chapter3.tcl.
% package require Wave 1.0
1.0
% set myWave [wave read yesno.wav]
wave:0
% wave info $myWave
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192
% wave write $myWave output.wav -encoding linear
wave:0
Both the wave read and wave write commands have options for reading or writing selected portions of the waveform. The following example shows how we can read only the first 1000 msec of the speech file and then write the latter half to disk.
% set newWave [wave read yesno.wav 0 1000]
wave:1
% wave info $newWave
8000 linear-16 8000.0 1000.0 {2492 4523 565.375} 48005.2 -11.9775
% wave write $newWave segment.wav 500 1000
wave:1
The wave info command returns statistics regarding our wave object. Our new wave object has 8000 samples. The samples are encoded in linear format using 16 bits for each sample. The sampling rate is 8000 Hz, meaning there are 8000 samples per second of speech. The entire utterance is 1000 milliseconds long (1.0 second). The maximum magnitude occurs at millisecond 565.375 (sample number 4523) and is equal to 2492. The average energy (the mean of the squared sample magnitudes) in the wave is 48005.2, and the DC offset is -11.9775.
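The relationships between these fields can be checked with a few lines of plain Tcl. The sketch below does not call CSLUsh; it copies the list printed by wave info above into a variable, and the variable names (samples, rate, and so on) are labels chosen here for illustration, not names defined by CSLUsh.

```tcl
# Output of "wave info $newWave", copied from the transcript above.
set waveInfo {8000 linear-16 8000.0 1000.0 {2492 4523 565.375} 48005.2 -11.9775}

# Unpack the fields by position (the names are illustrative only).
set samples    [lindex $waveInfo 0]
set rate       [lindex $waveInfo 2]
set lengthMs   [lindex $waveInfo 3]
set peak       [lindex $waveInfo 4]
set peakSample [lindex $peak 1]
set peakMs     [lindex $peak 2]

# Duration in msec = number of samples / sampling rate * 1000.
set computedMs [expr {$samples / $rate * 1000.0}]

# Position of the peak in msec = sample index / sampling rate * 1000.
set computedPeakMs [expr {$peakSample / $rate * 1000.0}]

puts "duration: $computedMs msec (wave info reports $lengthMs)"
puts "peak at:  $computedPeakMs msec (wave info reports $peakMs)"
```

The same arithmetic applied to the full file (9812 samples at 8000 Hz) gives the 1226.5 msec reported earlier.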
The wave chopsil command uses an estimate of the variance of the absolute amplitude of the signal. Together with state duration constraints it is able to filter short energy events and trigger on end-of-speech events.
% set chopWave [wave chopsil $myWave]
wave:2 400.625 1118.12 -10.0192 197.611
% wave info [lindex $chopWave 0]
5740 linear-16 8000.0 717.5 {2492 1318 164.75} 66873.4 -15.9645
The wave chopsil command returns a Tcl list containing a handle to the silence-chopped wave object (wave:2); the starting point of detected speech in milliseconds (400.625 msec); the end of utterance detected in milliseconds (1118.12 msec); the DC offset of the original waveform; and lastly the standard deviation of signal amplitude in the original waveform.
In some circumstances, it is possible that the speech detector might not detect any speech events. In this case it will not return a wave handle, thus indicating that no speech events were detected.
% set zeroWave [wave zero 100]
wave:3
% wave chopsil $zeroWave
{} 0.0 0.0 0.0 0.0
Here the first element in the returned Tcl list contains the handle to
the silence chopped wave object. Since the first element is an empty
list, we know that no speech events were detected in the input wave
object.
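A script that uses wave chopsil would normally test the first element before going further. The following is a minimal sketch in plain Tcl, operating on the two result lists copied from the transcripts above; speechDetected is a hypothetical helper name, not a CSLUsh command.

```tcl
# Return 1 if a "wave chopsil" result reports detected speech.
# Assumption: the first element is a wave handle, or an empty string
# when nothing was detected, as in the two transcripts above.
proc speechDetected {chopResult} {
    return [expr {[lindex $chopResult 0] ne ""}]
}

# Result lists copied from the transcripts above.
set withSpeech {wave:2 400.625 1118.12 -10.0192 197.611}
set noSpeech   {{} 0.0 0.0 0.0 0.0}

foreach result [list $withSpeech $noSpeech] {
    if {[speechDetected $result]} {
        set startMs [lindex $result 1]
        set endMs   [lindex $result 2]
        puts "speech detected from $startMs to $endMs msec"
    } else {
        puts "no speech detected"
    }
}
```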
Various parameters affect the silence chopping. These are described in the reference page for the wave chopsil command.
The wave tovec and wave fromvec commands convert to and from CSLUsh
vector objects (handles of type arrayF). More specifically, these
commands operate on a two-dimensional floating point array structure.
Since a wave object is only a one-dimensional data structure, the
zeroth dimension of the vector object is set to one (i.e., the matrix
has only one row). Within CSLUsh all signal processing routines work
on the floating point representation of the speech waveform. These
commands may be used to convert to the correct format if necessary.
% set floatWave [wave tovec $myWave]
arrayF:4
% set originalWave [wave fromvec $floatWave]
wave:5
% wave info $originalWave
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192
Session:
% package require Wave 1.0
1.0
% set myWave [wave read yesno.wav]
wave:0
% wave info $myWave
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192
% wave write $myWave output.wav -encoding linear
wave:0
% set newWave [wave read yesno.wav 0 1000]
wave:1
% wave info $newWave
8000 linear-16 8000.0 1000.0 {2492 4523 565.375} 48005.2 -11.9775
% wave write $newWave segment.wav 500 1000
wave:1
% set chopWave [wave chopsil $myWave]
wave:2 400.625 1118.12 -10.0192 197.611
% wave info [lindex $chopWave 0]
5740 linear-16 8000.0 717.5 {2492 1318 164.75} 66873.4 -15.9645
% set zeroWave [wave zero 100]
wave:3
% wave chopsil $zeroWave
{} 0.0 0.0 0.0 0.0
% set floatWave [wave tovec $myWave]
arrayF:4
% set originalWave [wave fromvec $floatWave]
wave:5
% wave info $originalWave
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192
The transcript for the examples presented can be found in the file chapter4.tcl.
The analysis realfft and analysis pspec commands are used for this purpose. The first step towards computing the power spectrum of the speech signal is to perform a Discrete Fourier Transform (DFT), which computes the frequency-domain representation of the time-domain signal. Since a speech signal contains only real values, we can use a real-valued Fast Fourier Transform (FFT) for increased efficiency. The resulting output contains both the magnitude and phase information of the original time-domain signal.
The analysis pspec command converts this output to contain only log-magnitude information, which can then be used to plot the familiar spectrogram.
% foreach x {Analysis Prep Wave} {
package require $x 1.0
}
% set w [wave read yesno.wav]
wave:0
% set nodc [prep !dc initialize]
dcrm:1
% set rfft [analysis realfft initialize -framesize 3.0 -hamming]
fft:2
% set wnodc [prep !dc $nodc $w]
wave:3
% set wfft [analysis realfft $rfft $wnodc]
arrayF:5
% set wspec [analysis pspec $rfft $wfft]
arrayF:6
Figure 4.1 depicts the spectrogram plotted using the power
spectral analysis of the speech waveform.
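The two steps described above (a Fourier transform followed by a reduction to log-magnitude values) can be sketched for a single frame in plain Tcl. This is an illustration of the idea only: it uses a naive DFT rather than an FFT, applies no Hamming window, and the exact scaling used by analysis pspec is not documented here, so logPowerSpectrum is a hypothetical helper and not the CSLUsh implementation.

```tcl
# Naive DFT of one frame of real samples. Returns the log power
# (log of squared magnitude) for bins 0 .. N/2. Cost is O(N^2).
proc logPowerSpectrum {samples} {
    set n  [llength $samples]
    set pi [expr {acos(-1.0)}]
    set result {}
    for {set k 0} {$k <= $n / 2} {incr k} {
        set re 0.0
        set im 0.0
        for {set t 0} {$t < $n} {incr t} {
            set x     [lindex $samples $t]
            set angle [expr {2.0 * $pi * $k * $t / $n}]
            set re    [expr {$re + $x * cos($angle)}]
            set im    [expr {$im - $x * sin($angle)}]
        }
        # A small floor keeps log() defined for empty bins.
        lappend result [expr {log($re * $re + $im * $im + 1e-12)}]
    }
    return $result
}

# Eight samples of a cosine that completes two cycles per frame:
# the energy should be concentrated in bin 2.
set frame {1 0 -1 0 1 0 -1 0}
set spectrum [logPowerSpectrum $frame]
set bin 0
foreach value $spectrum {
    puts "bin $bin: $value"
    incr bin
}
```

Because the input is real, bins above N/2 mirror those below it, which is why only N/2 + 1 values are kept; this symmetry is what a real-valued FFT exploits.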
The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of past speech samples. Through minimizing the sum of squared differences (over a finite interval) between the actual speech samples and predicted values, a unique set of parameters or predictor coefficients can be determined. These coefficients form the basis for linear predictive analysis of speech.
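To make the quantity being minimized concrete, the sketch below computes the sum of squared prediction errors for a given set of predictor coefficients. It is plain Tcl, it does not solve for the optimal coefficients, and predictionError is a hypothetical helper introduced for illustration.

```tcl
# Sum of squared differences between each sample and its prediction
# from the previous p samples:  predicted[n] = sum_k a_k * s[n-k]
proc predictionError {signal coeffs} {
    set p     [llength $coeffs]
    set total 0.0
    for {set n $p} {$n < [llength $signal]} {incr n} {
        set predicted 0.0
        for {set k 1} {$k <= $p} {incr k} {
            set a    [lindex $coeffs [expr {$k - 1}]]
            set past [lindex $signal [expr {$n - $k}]]
            set predicted [expr {$predicted + $a * $past}]
        }
        set diff  [expr {[lindex $signal $n] - $predicted}]
        set total [expr {$total + $diff * $diff}]
    }
    return $total
}

# A linear ramp satisfies s[n] = 2*s[n-1] - s[n-2] exactly,
# so the coefficients {2 -1} give zero error on it.
set ramp {1 2 3 4 5}
puts "coefficients {2 -1}: error [predictionError $ramp {2 -1}]"
puts "coefficients {1 0}:  error [predictionError $ramp {1 0}]"
```

Linear predictive analysis searches for the coefficient set that makes this error as small as possible over each analysis window.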
The analysis lpc command provides the capability for computing the linear prediction model of speech over time. In practice the predictor coefficients themselves are never used in recognition, since they typically show high variance. The predictor coefficients are therefore transformed to a more robust set of parameters known as cepstral coefficients.
% set lpc [analysis lpc initialize]
lpc:7
% analysis lpc info $lpc
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 12}
{-feats 12} {-exp 0.6} {-preemphasis 0.98}
% set wlpc [analysis lpc $lpc $wnodc]
arrayF:9
Figure 4.2 depicts the LPC spectrogram computed from a 12th
order LPC analysis of the speech waveform.
Just like most other short-term spectrum-based techniques this method is vulnerable when the short-term spectral values are modified by the frequency response of the communication channel. The analysis plp command provides limited capability for dealing with these distortions by employing a RASTA (Relative Spectral) filter which makes PLP analysis more robust to linear spectral distortions.
% set plp [analysis plp initialize]
plp:10
% analysis plp info $plp
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 7}
{-feats 8} {-filters 17} {-exp 0.6} {-lograsta 1.0}
% set wplp [analysis plp $plp $wnodc]
arrayF:12
The rasta coefficient (-lograsta) may vary between 0.0 (no RASTA
processing) and 1.0 (full RASTA processing); in the listing above it
is set to 1.0. For intermediate values, the output represents a
mixture of both RASTA filtered and unfiltered PLP cepstral
coefficients. Figure 4.3 depicts the PLP spectrogram
computed from a seventh-order PLP analysis of the speech waveform.
Similar to PLP, Mel scale analysis has the option of using a RASTA filter to compensate for linear channel distortions.
% set mel [analysis mel initialize]
mel:13
% analysis mel info $mel
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-feats 7}
{-filters 21} {-melres 100.0} {-preemphasis 0.98} {-exp 0.6}
{-lograsta 1.0}
% set wmel [analysis mel $mel $wnodc]
arrayF:15
Figure 4.4 depicts the MEL spectrogram computed from an eighth-order
MEL analysis of the speech waveform.
% set rasta [analysis rastafilter initialize 12 -rasta 1.0]
rasta:16
% set wrasta [analysis rastafilter $rasta $wlpc]
arrayF:17
% set delta [analysis delta initialize 12 -order 2]
delta:18
% set wdelta [analysis delta $delta $wlpc]
arrayF:19
The optional argument order in the initialize call specifies the order of the numerical approximation used for computing the time derivative of the input feature vector sequence.
Figure 4.5 depicts the magnitude of the first order derivative calculated from the LPC cepstrum obtained earlier in the chapter.
The energy coefficient is normalized using an automatic gain control filter (AGC), with a look-ahead buffer of 160 msec. The normalization is performed using a variable gain amplifier in which the gain is controlled by a peak detector on the energy feature. The peak detector has a decay factor of 0.999 and also includes a limiter to prevent excessive gain during silence. Currently the energy normalization leads to an inherent delay of 160 msec within a pipelined recognition process.
The analysis energynorm command implements the algorithm described above. This command operates on the zeroth coefficient of the input feature object.
% set enorm [analysis energynorm initialize -coeffs 12]
enorm:20
% analysis energynorm info $enorm
{-lookahead 160.0} {-framesize 10.0} {-decay 0.999} {-filt 0.1}
{-maxest 4.0} {-coeffs 12}
% set wenorm [analysis energynorm $enorm $wlpc]
arrayF:21
Session:
% foreach x {Analysis Prep Wave} {
package require $x 1.0
}
% set w [wave read yesno.wav]
wave:0
% set nodc [prep !dc initialize]
dcrm:1
% set rfft [analysis realfft initialize -framesize 3.0 -hamming]
fft:2
% set wnodc [prep !dc $nodc $w]
wave:3
% set wfft [analysis realfft $rfft $wnodc]
arrayF:5
% set wspec [analysis pspec $rfft $wfft]
arrayF:6
% set lpc [analysis lpc initialize]
lpc:7
% analysis lpc info $lpc
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 12}
{-feats 12} {-exp 0.6} {-preemphasis 0.98}
% set wlpc [analysis lpc $lpc $wnodc]
arrayF:9
% set plp [analysis plp initialize]
plp:10
% analysis plp info $plp
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 7}
{-feats 8} {-filters 17} {-exp 0.6} {-lograsta 1.0}
% set wplp [analysis plp $plp $wnodc]
arrayF:12
% set mel [analysis mel initialize]
mel:13
% analysis mel info $mel
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-feats 7}
{-filters 21} {-melres 100.0} {-preemphasis 0.98} {-exp 0.6}
{-lograsta 1.0}
% set wmel [analysis mel $mel $wnodc]
arrayF:15
% set rasta [analysis rastafilter initialize 12 -rasta 1.0]
rasta:16
% set wrasta [analysis rastafilter $rasta $wlpc]
arrayF:17
% set delta [analysis delta initialize 12 -order 3]
delta:18
% set wdelta [analysis delta $delta $wlpc]
arrayF:19
% set enorm [analysis energynorm initialize -coeffs 12]
enorm:20
% analysis energynorm info $enorm
{-lookahead 160.0} {-framesize 10.0} {-decay 0.999} {-filt 0.1}
{-maxest 4.0} {-coeffs 12}
% set wenorm [analysis energynorm $enorm $wlpc]
arrayF:21
The goal of a speech recognition algorithm is to map a speech signal to the words which were spoken. Many different approaches to this problem have been presented by the speech research community. CSLUsh recognizes phonemes (basic sounds of speech), and then models words as sequences of phonemes. In this chapter we will learn how to define phoneme probability estimators within the CSLUsh environment. Although we attempt to keep things simple, some knowledge regarding modeling of speech units will be useful when defining phoneme probability estimators. The transcript for the examples presented can be found in the file names.tcl.
When defining the model set, each model is defined by its nucleus (phoneme), and optional left or right context.
phoneme = char{char}
Here char represents any character except one of the meta
characters <> $ = ;. These may, however, be escaped using a single backslash. The syntax defined above indicates that a phoneme
may be either a single character or a concatenation of multiple
characters. Throughout the syntactical definition of the phoneme
probability estimator we use {} braces to indicate either a single
item or a concatenation of multiple similar items.
The left or right context of the model may be defined either as a single phoneme, or as a list of phonemes.
phoneme_list = phoneme{`` '' phoneme}
Quotes (``'') in the syntax description refer to a string literal. In this case phoneme_list may either be a single phoneme or a list of phonemes delimited by any white space character (indicated by `` '').
Variables are defined by a leading $ character. Variables define a list or grouping of phonemes which can be used to describe the left or right context of a particular biphone model.
name = char{char}
variable = ``$''name
context = variable `` = '' phoneme_list ``;''
A biphone model may therefore be dependent on a particular phoneme to
the left or right of the nucleus (center phoneme) or on a group of
phonemes defined using variables. The following syntax can then be
used to define a particular model.
model = ``<'' phoneme ``>'' |
phoneme ``<'' phoneme |
phoneme ``>'' phoneme |
variable ``<'' phoneme |
phoneme ``>'' variable
The angle brackets <> denote either the left or the right context. Context-independent models are defined with angle brackets on both sides of the center phoneme. Using this syntax a phoneme can be modeled as (a) a one-part model (context independent, left dependent or right dependent), (b) a two-part model (left biphone model followed by a right biphone) or (c) a three-part model (left biphone followed by a context-independent model followed by a right biphone).
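The model grammar above is small enough to check with regular expressions. The following plain-Tcl sketch classifies a model name as context independent, left dependent, or right dependent and extracts its nucleus. The helper name modelType is hypothetical, and the sketch deliberately ignores backslash-escaped meta characters, which a complete parser would have to handle.

```tcl
# Classify one model name according to the grammar above.
# Returns a list: type, nucleus, and (if any) the context.
# Simplification: escaped meta characters such as \> are not supported.
proc modelType {model} {
    if {[regexp {^<([^<>]+)>$} $model -> nucleus]} {
        return [list context-independent $nucleus]
    }
    if {[regexp {^([^<>]+)<([^<>]+)$} $model -> left nucleus]} {
        return [list left-dependent $nucleus $left]
    }
    if {[regexp {^([^<>]+)>([^<>]+)$} $model -> nucleus right]} {
        return [list right-dependent $nucleus $right]
    }
    error "not a valid model: $model"
}

# The context-expanded pronunciation of "yes" shown in chapter 2.
set pron {{$sil<j} {j>$mid} {$fntl<E} <E> {E>$obs} {$mid<s} {s>$sil}}
foreach model $pron {
    puts "$model -> [modelType $model]"
}
```

In this pronunciation /j/ and /s/ are two-part phonemes (a left-dependent part followed by a right-dependent part), while /E/ has three parts.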
The complete description of the outputs of the phoneme probability estimator is defined by a list of models, where the order of the list corresponds to the order of the outputs of the particular phoneme probability estimator.
model_list = model{`` '' model}
estimator = ``define '' model_list ``;''
Multiple define commands may be used throughout the probability estimator definition file. The final set of models will then be defined as the concatenation of each list of models defined by the separate instances of the define command.
Rather than apply absolute minimums or maximums, our recognizers impose a per-frame penalty on the score for segments which are too long or too short. This gives the recognizer some flexibility in overcoming poor word models or sloppy articulation where a segment may really be missing. These duration limits are derived from averaging multiple samples of the phonemes. The following syntax defines a duration model.
mindur = digit{digit}
maxdur = digit{digit}
duration = phoneme mindur maxdur |
model mindur maxdur
Minimum and maximum durations are specified in milliseconds. Model durations which are not specified are calculated from the base phoneme duration models using the transformation rules described in table 5.1. Each row in the table represents one of the possible combinations of the particular model type. For example, a two-part phoneme may be represented either as (a) a left-dependent part followed by a context-independent part, (b) a context-independent part followed by a right-dependent part, or (c) a left-dependent part followed by a right-dependent part.
The table reads as follows. For the two-part phoneme case, the first line describes a phoneme with the first part a left-dependent phoneme (x<A) and the second part a right-dependent phoneme (A>x). Based on table 5.1, the durations computed for these models would therefore be:
The following two models are a bit unconventional. The second line describes a phoneme with the first part a left-dependent phoneme (x<A) and the second part a context-independent phoneme (<A>). The durations computed for these models would therefore be:
Finally, the third line describes a phoneme, with the first part a context-independent phoneme (<A>) and the second part a right-dependent phoneme (A>x). The durations computed for these models would therefore be:
In order to fully specify the models described, a duration model needs to be specified at least for each of the unique phoneme nuclei. If minimum and maximum durations are specified for some of the predefined models, then these durations will override the default durations which are calculated from the base phoneme minimum and maximum durations. The list of duration models is specified using the duration command.
duration_list = duration{`` '' duration}
duration_models = ``duration '' duration_list ``;''
Multiple duration commands may be used throughout the phoneme probability description file. The final set of duration models will then be defined as the concatenation of each set of duration models defined.
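The idea of a per-frame penalty, as opposed to a hard minimum or maximum, can be illustrated with a small plain-Tcl sketch. The formula below is an assumption made for illustration (one penalty per frame that a segment falls outside its duration limits); the document does not give the exact rule used by the search. The penalty and frame-size values are the -shortpen, -longpen and -framesize defaults printed by tree info in chapter 2, and the limits for /j/ are taken from the duration example later in this chapter.

```tcl
# Hypothetical per-frame duration penalty. Assumption: every frame
# (or part of a frame) outside [minMs, maxMs] adds one penalty to the
# log score. This is not taken from the CSLUsh sources.
proc durationPenalty {durMs minMs maxMs frameMs shortPen longPen} {
    if {$durMs < $minMs} {
        set frames [expr {int(ceil(($minMs - $durMs) / double($frameMs)))}]
        return [expr {$frames * $shortPen}]
    }
    if {$durMs > $maxMs} {
        set frames [expr {int(ceil(($durMs - $maxMs) / double($frameMs)))}]
        return [expr {$frames * $longPen}]
    }
    return 0.0
}

# Duration limits for /j/ (18 to 177 msec), 10 msec frames.
foreach dur {10 100 200} {
    set pen [durationPenalty $dur 18 177 10.0 -5.99146 -2.00248]
    puts "segment of $dur msec: penalty $pen"
}
```

Under this assumption a segment that is slightly too short or too long is still allowed, only with a lower score, which is the flexibility the text describes.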
tied_models = ``tie '' model model_list ``;''
The list of models indicated by model_list will be tied to the singular model indicated by model. Similar to the define and duration commands, multiple tie commands may be used throughout the phoneme probability description file. The final set of tied models will then be defined as the concatenation of each.
For example, models defined using the phoneme ``I_x'' will have much
the same characteristics as models defined using the phoneme ``I'',
with the main difference relating to the underlying duration models
of each. Using the map command we only need to define models for one
of the symbols (let us say ``I''), and then all models can be
duplicated for ``I_x'' by mapping ``I_x'' to ``I''.
The map command maps either a phoneme to a phoneme, or a list of phonemes to a phoneme.
mappings = ``map '' phoneme phoneme_list ``;''
Multiple map commands may be used throughout the probability names description file. The final list of phoneme mappings is formed as the concatenation of all mappings defined by each of the individual map commands.
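The effect of a set of map commands can be pictured as a lookup table applied to a pronunciation. The sketch below is plain Tcl (it assumes Tcl 8.5 or later for the dict command) and assumes that the first phoneme of each map statement is the target and the remaining phonemes are the ones replaced, which matches the ``I_x'' example above. The helpers buildMap and applyMap are hypothetical names, and the second statement is taken from the genrecog.desc excerpt shown later in this chapter.

```tcl
# Build a lookup table from map statements. Each statement is a list:
# the target phoneme followed by the phonemes that are mapped to it.
proc buildMap {statements} {
    set table [dict create]
    foreach stmt $statements {
        set target [lindex $stmt 0]
        foreach source [lrange $stmt 1 end] {
            dict set table $source $target
        }
    }
    return $table
}

# Replace every mapped phoneme in a pronunciation; leave others as-is.
proc applyMap {table phonemes} {
    set out {}
    foreach p $phonemes {
        if {[dict exists $table $p]} {
            lappend out [dict get $table $p]
        } else {
            lappend out $p
        }
    }
    return $out
}

set table [buildMap {
    {I I_x}
    {vc bc dc gc dZc}
}]
puts [applyMap $table {bc b I_x tc}]
```

Later statements simply add entries to the same table, which mirrors the way multiple map commands are concatenated.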
/silence j E s n oU/. Since all of these phonemes are
context independent, each phoneme is uniquely described as consisting
of only one part. For context dependent modeling, phonemes are
typically described as having two or even three parts.
The recognizer describe command creates a CSLUsh object which
contains all the information needed for pronouncing words
according to the output categories of the specific phoneme classifier.
This object will be referred to as the probability estimator names
description object, or names object for short.
% package require Word 1.0
1.0
% set description {
define <.pau> <j> <E> <s> <n> <oU>;
duration .pau 49 5000
j 18 177
E 40 226
s 49 270
n 28 193
oU 59 397;
}
.
.
% set rr [recognizer describe $description]
probname:0
A complete listing of duration models for the base worldbet symbols
can be found in the general purpose recognizer description file
(genrecog.desc).
In the example above, the names object consists of the phonemes
/.pau j E s n oU/, with corresponding neural network category names
/<.pau> <j> <E> <s> <n> <oU>/. This object can be used to create
word models which are described in terms of the neural network output
classes rather than the base worldbet symbols.
% set Words {
{yes "j E s"}
{no "n oU"}
}
.
.
% set ww [word create $Words]
word:1 word:2
% set wx [word context $ww $rr]
word:3 word:4
% word pronun [lindex $wx 0] $rr
{<j> <E> <s>}
% word pronun [lindex $wx 1] $rr
{<n> <oU>}
The file genrecog.desc contains
the model definitions needed for the general purpose recognizer
supplied with CSLUsh. We will refer to this file to explain the
process involved.
The first phase consists of deciding how each phoneme will be described, namely which are one-part phonemes, which are two-part phonemes, and so on. This is very much dependent on the design of the phoneme probability estimator. For our purposes we have chosen to model all English phonemes as a collection of one-part (i.e. context-independent), two-part, three-part and also right-dependent phonemes. Right-dependent phonemes are phonemes which are influenced only by the phonemes or context to their right.
For each of the context-dependent phonemes, we need to decide upon the
left and right context of each phoneme. In the most specific case we
would have each phoneme dependent on each other phoneme. This is,
however, not necessary. Studying contextual influences of phonemes, we
find that phonemes can be grouped according to their influence on the
phoneme in question. For example, in the CSLUsh recognizer the
phoneme /A/ has three parts: a left context part, a context
independent part and a right context part. For both the left
and right contexts we find that the collection of phonemes
/m n N N= n=/ all have roughly the same contextual influence on
the base phoneme /A/. These can therefore be grouped together,
as is done in the general purpose CSLUsh recognizer.
$sil = .pau pc kc tc tSc dZc .garbage; /* silence models */
$bck = \> A aU oU w l l= U U_x; /* back vowels */
$mid = E @ ^ &; /* mid vowels */
$fnt = u i: I_x I ei aI \>i j; /* front vowels */
$ret = 9r 3r &r; /* retro flex */
$nas = m n N N= n=; /* nasals */
$obs = ph kh th h s z Z S T f ts dZ; /* obstruents */
$wev = dZc bc dc gc b d g D t( d( d_( th_( v; /* what ever */
.
.
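These groupings determine which context-dependent model is selected for a phoneme. As an illustration, the plain-Tcl sketch below looks up the group or groups that contain a given neighbouring phoneme. The group lists are copied from the excerpt above (the $bck and $fnt groups are left out because they contain escaped characters), and groupsOf is a hypothetical helper, not a CSLUsh command.

```tcl
# Context groups copied from the genrecog.desc excerpt above,
# stored as a flat list of name/members pairs.
set groups {
    sil {.pau pc kc tc tSc dZc .garbage}
    mid {E @ ^ &}
    ret {9r 3r &r}
    nas {m n N N= n=}
    obs {ph kh th h s z Z S T f ts dZ}
    wev {dZc bc dc gc b d g D t( d( d_( th_( v}
}

# Return the names of all groups that contain the given phoneme.
proc groupsOf {phoneme groups} {
    set found {}
    foreach {name members} $groups {
        if {[lsearch -exact $members $phoneme] >= 0} {
            lappend found $name
        }
    }
    return $found
}

# In "no" (/n oU/) the phoneme /oU/ is preceded by /n/, a nasal.
set left [lindex [groupsOf n $groups] 0]
puts "left-dependent model for oU after n: \$$left<oU"
```

This is consistent with the expansion of ``no'' in chapter 2, where the left-dependent part of /oU/ appears as $nas<oU. Note that in the excerpt as printed, /dZc/ is listed in both $sil and $wev, so a lookup may return more than one group.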
In many cases we find that some phonemes are very hard to recognize as
independent entities. They tend to have the same characteristics as
other phonemes which are very similar. For this reason the neural
network has a tough time distinguishing among very similar
phonemes. One of the design steps is to identify these phonemes and
then map them to a central representative phoneme. This information
is also needed when describing the word pronunciation models
according to the specific phoneme probability estimator. For example
all voiced closures /bc dc gc dZc/ can be mapped to some general
voiced closure /vc/.
.
.
.
/* map some phonemes to what we have */
map vc bc dc gc dZc; /* voiced closure */
map uc pc tc kc tSc; /* unvoiced closure */
map A \>;
map l l;
map n n= N N=;
map m m=;
map ^ &;
map 3r &r;
map s Z;
map A 5;
.
.
.
The next phase is to define the output probability names according to the syntax specified above. For our general purpose recognizer, the neural network output class names are defined as follows:
.
.
.
/* now define the outputs of the nnet */
define <.pau> <.br> <vc> <uc>;
define $obs<f $fnt<f $bck<f $sil<f $wev<f $nas<f $mid<f $ret<f
f>$fnt f>$bck f>$sil f>$wev f>$nas f>$mid f>$obs f>$ret;
define $obs<v $fnt<v $bck<v $sil<v $wev<v $nas<v $mid<v $ret<v
v>$fnt v>$bck v>$sil v>$wev v>$nas v>$mid v>$obs v>$ret;
define $obs<T $fnt<T $bck<T $sil<T $wev<T $nas<T $mid<T $ret<T
T>$fnt T>$bck T>$sil T>$wev T>$nas T>$mid T>$obs T>$ret;
define $obs<D $fnt<D $bck<D $sil<D $wev<D $nas<D $mid<D $ret<D
D>$fnt D>$bck D>$sil D>$wev D>$nas D>$mid D>$obs D>$ret;
.
.
.
We also need to define the set of duration models for each of our models defined above. In this case we define the duration models for the underlying base phonemes, and rely on the automatic table translation to compute the duration models for the models specified.
.
.
.
/* specify base duration models */
duration .garbage 49 5000
.pau 10 5000
dZc 20 261
bc 23 172
dc 15 153
gc 20 163
pc 30 163
tc 20 168
kc 25 159
tSc 20 168
f 37 247
v 26 172
.
.
.
Given all these information sources we are now ready to define the probability names description object.
% set fh [open genrecog.desc]; set desc [read $fh]; close $fh
% set genrecog [recognizer describe $desc]
probname:5
These data are then used when defining word pronunciation models in terms of the neural network output category names.
% set wx [word context $ww $genrecog]
word:6 word:7
% word pronun [lindex $wx 0] $genrecog
{{$sil<j} {j>$mid} {$fntl<E} <E> {E>$obs} {$mid<s} {s>$sil}}
The file digit.desc contains the model definitions needed for
the continuous digit recognizer supplied with the CSLUsh development
environment. With the exception of the following groupings
$sil = .pau tc .garbage;
$denl = s ks th;
$denr = s th;
the digit recognizer was designed using full context-dependent models for each of the phonemes which occur in continuous digit strings. Full context-dependent modeling consists of a set of biphones where each context of the center phoneme is a single phoneme, rather than a grouping of phonemes as found in the previous example.
.
.
.
define f<\>r <\>r> \>r>$sil \>r>T \>r>ei \>r>f \>r>w \>r>z \>r>oU
\>r>n \>r>$denr;
define v<^2 ^2>n n<^3 ^3>$sil ^3>ei ^3>f ^3>w ^3>oU;
.
.
Note that the models are defined using a combination of single
phonemes as either the left or right context as well as variables
denoting phoneme groupings.
Full context-dependent modeling, however, has the disadvantage that not all possible context-dependent models may be found in the training set. The tie command may therefore be used to tie models for which no data could be found in the training set to models which are similar enough and for which sufficient training data could be found.
.
.
/* define here all models which are not in nnet and tie them to
the silence context dependent model */
define ^3<z
^3<T
^3<s
^3<n;
define ^3>z
^3>T
^3>s
^3>n;
tie $sil<z ^3<z;
tie $sil<T ^3<T;
tie $sil<s ^3<s;
tie $sil<n ^3<n;
tie ^3>$sil ^3>z ^3>T ^3>s ^3>n;
.
.
In the digit recognizer we chose to tie all models which need to be
defined, but which do not have sufficient training data, to the
equivalent silence-dependent model.
Session:
% package require Word 1.0
1.0
% set description {
define <.pau> <j> <E> <s> <n> <oU>;
duration .pau 49 5000
j 18 177
E 40 226
s 49 270
n 28 193
oU 59 397;
}
.
.
% set rr [recognizer describe $description]
probname:0
% set Words {
{yes "j E s"}
{no "n oU"}
}
.
.
% set ww [word create $Words]
word:1 word:2
% set wx [word context $ww $rr]
word:3 word:4
% word pronun [lindex $wx 0] $rr
{<j> <E> <s>}
% word pronun [lindex $wx 1] $rr
{<n> <oU>}
% set fh [open genrecog.desc]; set desc [read $fh]; close $fh
% set genrecog [recognizer describe $desc]
probname:5
% set wx [word context $ww $genrecog]
word:6 word:7
% word pronun [lindex $wx 0] $genrecog
{{$sil<j} {j>$mid} {$fntl<E} <E> {E>$obs} {$mid<s} {s>$sil}}
genrecog. The transcript for the examples presented can be found
in the file chapter6.tcl.
For example, the word ``greeting'' would be pronounced phonemically as follows:
| greeting | [gc] g 9r i: (d( | tc th) i: N |
Here the pronunciation model takes the format
a [((b c) | d | e)] f
This represents a word pronounced with symbol a, optionally
followed by either (1) the symbols b c (grouped with ()'s), (2)
d or (3) e, followed by f. Thus [ ]
means optional, () groups symbols and | denotes
alternatives within the grouping specified. A phoneme may be formed
as a concatenation of any valid characters, except the meta characters
[ ] ( ) |. Meta characters may, however, be escaped using a
single backslash. The above example would therefore expand into the
following pronunciation models.
| greeting(1) | gc g 9r i: d( i: N |
| greeting(2) | gc g 9r i: tc th i: N |
| greeting(3) | g 9r i: d( i: N |
| greeting(4) | g 9r i: tc th i: N |
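The expansion above can be reproduced with a short recursive parser. The following plain-Tcl sketch (assuming Tcl 8.5 or later) is an illustration and not the word create implementation: it assumes that every token, including the meta characters [ ] ( ) and |, is separated by white space, and it does not handle backslash escapes. The procedure names are hypothetical.

```tcl
# Parse "seq | seq | ..." starting at index i.
# Returns a two-element list: all expanded sequences, and the next index.
proc parseAlternatives {tokens i} {
    set alts {}
    while {1} {
        lassign [parseSequence $tokens $i] seqs i
        lappend alts {*}$seqs
        if {$i < [llength $tokens] && [lindex $tokens $i] eq "|"} {
            incr i
        } else {
            break
        }
    }
    return [list $alts $i]
}

# Parse a run of symbols and groups, stopping at "|", "]", ")" or the end.
proc parseSequence {tokens i} {
    set seqs [list {}]
    set n [llength $tokens]
    while {$i < $n} {
        set tok [lindex $tokens $i]
        if {$tok in {| ] )}} {
            break
        }
        if {$tok in {[ (}} {
            if {$tok eq {[}} { set closer {]} } else { set closer {)} }
            lassign [parseAlternatives $tokens [expr {$i + 1}]] alts i
            if {$i >= $n || [lindex $tokens $i] ne $closer} {
                error "missing $closer in pronunciation"
            }
            incr i
            # An optional group may also expand to nothing.
            if {$tok eq {[}} { lappend alts {} }
        } else {
            set alts [list [list $tok]]
            incr i
        }
        # Append every alternative to every sequence built so far.
        set extended {}
        foreach s $seqs {
            foreach a $alts {
                lappend extended [list {*}$s {*}$a]
            }
        }
        set seqs $extended
    }
    return [list $seqs $i]
}

proc expandPronunciation {tokens} {
    lassign [parseAlternatives $tokens 0] alts i
    if {$i != [llength $tokens]} {
        error "unexpected token [lindex $tokens $i]"
    }
    return $alts
}

set greeting {[ gc ] g 9r i: ( d( | tc th ) i: N}
set count 0
foreach pron [expandPronunciation $greeting] {
    puts "greeting([incr count]): $pron"
}
```

Applied to the general format, written as {a [ ( ( b c ) | d | e ) ] f}, the same procedure yields the four sequences a b c f, a d f, a e f and a f.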
Given the word pronunciations for each of the words in our vocabulary we create word-model objects using the word create command. Word-model objects describe the word pronunciations in a more machine-accessible form.
% package require Word
1.0
% set Words {
{january {dZ @ n [j] u 3r i:}}
{february {f E [bc] b [9r] [j] u 3r i:}}
{march {m A 9r tSc tS}}
{april {ei [pc] ph 9r I l}}
{may {m ei}}
{june {dZ u n}}
{july {dZ u l aI}}
{august {A gc g ^ s tc th}}
{september {s E pc [ph] tc th E m bc b 3r}}
{october {A kc [kh] tc th oU bc b 3r}}
{november {n oU v E m bc b 3r}}
{december {dc d i: s E m bc b 3r}}
}
.
.
% set wordList [word create $Words]
word:0 word:1 word:2 word:3 word:4 word:5
word:6 word:7 word:8 word:9 word:10 word:11
Figure 6.1 graphically depicts the word model for the
word april. The word-model object created by the word
create command describes the pronunciation of the word. The
base symbols are typically drawn from the worldbet symbol set
(see the appendix). Any symbol may be used, as long
as the corresponding model names are described using the same symbol
set.
These word objects are, however, not yet useful for building a search engine such as a lexical tree or a grammar search. Before they can become useful, the word pronunciations need to be described in terms of the phoneme probability-estimator symbol set. The word context command is used for this purpose. The input to the word context command is a list of word-model objects created using word create, together with the recognizer-description object. The recognizer-description object (chapter 5) describes the relation between word pronunciations and its own symbol set. The following example shows how to generate a list of word-model objects in terms of the symbol set of the general purpose recognizer supplied with the CSLUsh development environment.
% package require Genrecog 3.0
% genrecog initialize recog
CREATE
% set expandList [word context $wordList $recog(names)]
word:22 word:23 word:24 word:25 word:26 word:27
word:28 word:29 word:30 word:31 word:32 word:33
These word models created with the word context command are often referred to as context-expanded word models, since the base word pronunciation models are expanded to include the phoneme probability-estimator phonemes, which are typically context-dependent.
Figure 6.2 depicts the expanded word-model object for the word ``april.'' Once the word models have been expanded such that the word pronunciations are in terms of phoneme-probability estimator symbols, they can be used to build either a lexical tree or a finite-state grammar search.
{"june" "dZ u n"}
{"july" "dZ u l aI"}
would combine the initial section shared by the two words (here /dZ u/), so that the search is structured as depicted in Figure 6.3.
Since the number of unique word-initial states in a lexical tree is limited to a closed set of phonemes, independent of the number of words in the vocabulary, sharing states in this way greatly reduces both memory and computational requirements.
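The saving from shared word-initial states can be illustrated with a small sketch in plain Tcl. It does not use the CSLUsh tree build command or reflect its internal data structures; the procedure name buildTree and the representation of a node by its phoneme path are assumptions made only for illustration.

```tcl
# Plain-Tcl sketch of prefix sharing in a lexical tree.
# Not the CSLUsh implementation; all names here are illustrative.
set words {
    {"june" "dZ u n"}
    {"july" "dZ u l aI"}
}

# Each tree node is identified by its phoneme path from the root,
# so two words that begin with the same phonemes reach the same nodes.
proc buildTree {words nodesVar} {
    upvar 1 $nodesVar nodes
    foreach entry $words {
        set phones [lindex $entry 1]
        set path {}
        foreach p $phones {
            lappend path $p
            set nodes([join $path " "]) 1
        }
    }
}

buildTree $words nodes
set shared [array size nodes]

# Without sharing, every phoneme of every word needs its own state.
set unshared 0
foreach entry $words {
    incr unshared [llength [lindex $entry 1]]
}

puts "tree nodes: $shared, unshared states: $unshared"
foreach key [lsort [array names nodes]] {
    puts "  $key"
}
```

For these two words the tree needs five nodes where separate word models would need seven states, because the /dZ u/ prefix is stored once.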
Because the current implementation of the lexical tree can only operate as a keyword spotter, it needs some way of distinguishing out-of-vocabulary speech events. Given the above vocabulary description, we would like to be able to recognize the spoken word even if it is embedded in a continuously spoken sentence. For example, if a person said ``I was born in January'', then we would expect to recognize the word ``january'' within the context of the rest of the words spoken. For this purpose we use the following two methods of background speech modelling.
``Any'' word-modeling
An any word is a special type of word in which each of the states or
phonemes that describe the word may transition to any of the states
or phonemes in the word. Figure 6.4 depicts an
any model consisting of the phonemes /.pau/, /s/ and
/n/. Since any state or phoneme can transition to any other
state or phoneme within the any model, this particular model
is effective in handling background silence, background hissing,
which typically sounds like an /s/ phoneme, and background
humming-type sounds, which are modelled using the /n/ phoneme.
% set AnyModel {
{<.pau> 1.0}
{<.any> 1.0}
}
The any model is specified by a list of phonemes and, for each phoneme, an associated state transition penalty. Since some phonemes in the any model may also be initial or final states of words in the vocabulary, it is necessary to penalize these transitions so that they do not receive the same score as when they enter or exit a word in the vocabulary.
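The fully connected structure of an any model can be sketched in plain Tcl as follows. The phonemes are those of Figure 6.4. The penalty values, the procedure name anyTransitions, and the choice to attach each penalty to the transition entering a phoneme are assumptions for illustration only, and do not describe how CSLUsh applies penalties internally.

```tcl
# Sketch only: enumerate the transitions implied by an any model.
# The phonemes follow Figure 6.4; the penalty values and the way a penalty
# is attached to a transition are assumptions for illustration.
set anyPhones {
    {.pau 1.0}
    {s    1.0}
    {n    1.0}
}

# Every phoneme may be followed by every phoneme in the model,
# including itself, so N phonemes give N*N transitions.
proc anyTransitions {model} {
    set result {}
    foreach from $model {
        foreach to $model {
            lappend result [list [lindex $from 0] [lindex $to 0] [lindex $to 1]]
        }
    }
    return $result
}

set transitions [anyTransitions $anyPhones]
puts "[llength $transitions] transitions"
foreach t $transitions {
    puts [format "  %-5s -> %-5s penalty %s" [lindex $t 0] [lindex $t 1] [lindex $t 2]]
}
```

With three phonemes the sketch lists nine transitions, which is why even a small any model can absorb arbitrary sequences of silence, hissing and humming.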
Robustness is also greatly increased by using a garbage
model. This means that the score for the background is more than just the
``silence'' output of our phoneme probability estimator. Two simple
garbage models currently used in CSLUsh are the median of the top
N sorted phoneme probabilities and
the maximum of a collection of phonemes likely to have high
probabilities in noise. The ideal garbage model will have a
better score in noise and out-of-vocabulary speech events than any of
the vocabulary words.
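The two garbage scores can be sketched for a single frame in plain Tcl. This is a minimal illustration rather than the CSLUsh implementation: the probability values, the procedure names medianOfTopN and maxOfSet, and the choice of /s/ and /n/ as the noise phonemes are all assumptions.

```tcl
# Sketch of the two garbage scores described above, for a single frame.
# The probability values and the choice of "noise" phonemes are made up.
array set frame {
    .pau 0.40  s 0.25  n 0.15  dZ 0.10  u 0.06  l 0.04
}

# Median of the N highest phoneme probabilities in the frame.
proc medianOfTopN {probs n} {
    set top [lrange [lsort -real -decreasing $probs] 0 [expr {$n - 1}]]
    set count [llength $top]
    set mid [expr {$count / 2}]
    if {$count % 2 == 1} {
        return [lindex $top $mid]
    }
    return [expr {([lindex $top [expr {$mid - 1}]] + [lindex $top $mid]) / 2.0}]
}

# Maximum probability over a chosen set of phonemes.
proc maxOfSet {frameVar phones} {
    upvar 1 $frameVar f
    set best 0.0
    foreach p $phones {
        if {$f($p) > $best} { set best $f($p) }
    }
    return $best
}

set probs {}
foreach p [array names frame] { lappend probs $frame($p) }

set medianScore [medianOfTopN $probs 3]
set maxScore [maxOfSet frame {s n}]
puts "median of top 3: $medianScore"
puts "max over {s n}: $maxScore"
```

In a real recognizer such a score would be computed for every frame and compared against the scores of the vocabulary words.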
Given the Any model and the list of expanded word models, we
are now ready to create our search structure, using the tree
build command.
% set tree [tree build $expandList $AnyModel $recog(names) -nbest 2]
treesearch:34
The lexical tree search functions only as a keyword spotter, with the object of finding the most likely word within a spoken utterance. Typically it is not necessary to know exactly where in time the word occurred, but rather whether a word in the specified vocabulary was spoken and, if so, which word it was. There are, however, cases where we wish to know exactly where the word occurred in time, and also what the corresponding phoneme alignment was. This is especially useful for automatic phoneme labeling or for word confidence measurements.
The lexical tree search has the capability of remembering the phoneme alignment of the requested N-best words recognized. Each time a state in a word is entered, this information can be stored on a backtrace heap. After recognition the best path is retrieved from the backtrace heap by tracing the state sequence back in time. A backtrace heap can be created as follows:
% set btheap [backtrace create 20000 20000]
btraceheap:35
Here the first value (20000) refers to the initial heap size and the second value to the heap growth size. Whenever the backtrace heap reaches full capacity it will grow according to the number of