

The CSLUsh Toolkit
for Automatic Speech Recognition

Johan Schalkwyk, Don Colton and Mark Fanty

Center for Spoken Language Understanding (CSLU)
Oregon Graduate Institute of Science & Technology




June 1, 1998



Preface

Welcome to CSLUsh, the programming shell from CSLU, the Center for Spoken Language Understanding at Oregon Graduate Institute.

CSLUsh (pronounced ``slush'') provides a Tcl-based environment for research and development of spoken language systems (SLS). CSLUsh provides powerful tools to manipulate wave files, extract features, train and utilize artificial neural networks and Hidden Markov models, recognize spoken utterances subject to grammar and word model constraints, and perform other related activities. These tools are bound together using Tcl (pronounced ``tickle''), a flexible and extensible command language ideal for scripting and composing larger aggregate tools.

Developers and researchers should study Tcl to understand its syntax and calling conventions. The balance of this document assumes the reader is generally familiar with Tcl, and can read example code and modify it as needed to meet programming objectives. For familiarity with Tcl, the primary reference is ``Tcl and the Tk Toolkit,'' by John K. Ousterhout, published by Addison-Wesley.

Organization of this Book

The chapters of this book present a series of lessons that will acquaint the reader with CSLUsh, and which lead through the experiences of building simple recognizers. The lessons show hands-on interaction between a user (yourself) and the system in processing spoken language for various purposes. The major command groups and most common commands are introduced in this way.

The appendices of this book present reference material, including details of command usage.

Overview

This document describes the overall architecture of the CSLU automatic speech recognition tools. It is intended to serve as a tutorial as well as a user's guide.

Building and using the recognizers is a two-stage process. The training is conducted in advance. Examples of speech data similar to those we want to recognize are collected and used to train a neural network. Once trained, the neural network ``retains knowledge'' of these examples. Later, during the recognition stage, we'll use this neural network to compare speech from live users with these examples to see if they are a probable match.

Let's assume for the moment that we already have a neural network trained and ready for use, and discuss the recognition process first.

Digital waveforms

      Speech typically occurs as pressure variations in air. These variations are continuous in time and in magnitude. For computer processing, several adaptations are made.

First, the signal is captured by a microphone (or other transducer) and converted into an electrical signal, where the amplitude of the signal corresponds to the magnitude of the original pressure variation. Such signals can be played back through a loudspeaker system, or through headphones, to recreate pressure variations that can be recognized by humans as the original speech.

Second, the signal is sampled at some frequency, so that only a finite number of amplitudes are recorded, stored, or transmitted for a given period of time. Common sampling rates include 8000 Hz for telephone speech and 44100 Hz for compact disc recordings.

The Nyquist-Shannon sampling theorem states that frequencies below half the sampling rate can be reconstructed from the sampled signal, so an 8000 Hz telephone signal can represent frequencies up to 4000 Hz. Higher frequencies are subject to aliasing: once sampled, a frequency of 4010 Hz cannot be distinguished from a frequency of 3990 Hz. (The same aliasing makes the spinning wheels of stagecoaches appear to turn backwards in old western movies and television shows.)
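The aliasing claim can be checked numerically. The following Python sketch is only an illustration (the helper name sample_cosine is hypothetical and not part of CSLUsh): it samples a 4010 Hz and a 3990 Hz cosine at 8000 Hz and measures how far apart the two sample sequences are.

```python
import math

def sample_cosine(freq_hz, rate_hz, count):
    """Return `count` samples of a unit cosine at freq_hz, sampled at rate_hz."""
    return [math.cos(2.0 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

above = sample_cosine(4010.0, 8000.0, 400)  # 10 Hz above half the sampling rate
below = sample_cosine(3990.0, 8000.0, 400)  # 10 Hz below half the sampling rate

# Largest sample-by-sample difference between the two sampled tones.
max_difference = max(abs(a - b) for a, b in zip(above, below))
print(max_difference)
```

The difference is at the level of floating-point rounding, so after sampling the two tones carry the same information.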

Third, the signal is quantized into one of a discrete number of ``bins,'' so that only a finite number of bits is required to represent each sample. This is called Analog-to-Digital (A-to-D) conversion. In telephone speech, there are commonly $2^{14}=16384$ such bins in linear coding. Because variations in higher amplitudes are not usually discriminated by humans, a logarithmic scaling can be used to represent the same information in $2^{8}=256$ bins, where each of the bin numbers is essentially an 8-bit representation of the 14-bit integer originally captured.
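As a rough illustration of logarithmic scaling, the Python sketch below maps a signed 14-bit sample to one of 256 codes and back using the continuous $\mu$-law formula with $\mu=255$. This is an assumption made for illustration: telephone equipment uses a segmented approximation of this curve, and the function names are hypothetical.

```python
import math

MU = 255.0
MAX14 = 8191  # largest magnitude of a signed 14-bit sample

def mulaw_encode(sample):
    """Map a 14-bit linear sample to an 8-bit code (0..255)."""
    x = max(-1.0, min(1.0, sample / MAX14))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255))

def mulaw_decode(code):
    """Map an 8-bit code back to an approximate 14-bit linear sample."""
    y = code / 255.0 * 2.0 - 1.0
    x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return int(round(x * MAX14))

for sample in (5, 100, 4000, 8000):
    code = mulaw_encode(sample)
    print(sample, code, mulaw_decode(code))
```

Small amplitudes are reproduced almost exactly, while large amplitudes are reproduced only to within a few percent, which is the trade-off described above.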

Thus, a telephone signal will typically carry 8000 speech samples per second, each represented by an 8-bit number, for a total of 64 000 bits per second. In contrast, a cellular telephone may employ only 4800 or 2400 bits per second, by using linear predictive coding (LPC) [rabiner83] or other signal compression techniques.

Phonemes

Phonemes are the basic sounds of a language. English has only about 40 phonemes, and any word can be described as a sequence of these sounds (just as in a pronunciation dictionary). CSLUsh currently supports phoneme-based recognition. Recognizing phonemes provides the basis for recognizing all possible words in a language. An appendix gives a complete listing of the phonemes used in English. An alternative would be to recognize words directly, without first recognizing the primitive sounds. This approach is typically adopted with HMMs (hidden Markov models). For small tasks with ample training data (e.g. letters and digits), this has been done with great success.

Frame-based recognition

Given a speech file, our first task will be to do phonetic classification, but we don't know where the phonemes are. Rather than try to find phoneme boundaries, we simply cut the utterance into equally spaced 10 msec frames. At the 8000 Hz telephone sampling rate, this means there are 100 frames per second and 80 speech samples per frame. We will classify each frame as to which phoneme it is part of. Figure 1.1 shows an utterance of ``no'' as 4 parts: silence (/.pau/), the phoneme /n/, the phoneme /oU/ and silence. Each frame ``belongs'' to one of these parts. Using a neural network, we will phonetically classify each frame. For the utterance ``no'' we will find some number of frames of silence followed by some number of frames of /n/ followed by /oU/ followed by silence. If instead we find silence, /j/, /E/, /s/, silence, then we know the word ``yes'' was spoken.
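The framing step can be sketched in a few lines of Python. The function name is hypothetical, and dropping a trailing partial frame is an assumption made for simplicity rather than documented CSLUsh behavior.

```python
def split_into_frames(samples, rate_hz=8000, frame_msec=10):
    """Cut a list of samples into consecutive, non-overlapping frames."""
    frame_len = rate_hz * frame_msec // 1000   # 80 samples at 8000 Hz
    n_frames = len(samples) // frame_len       # assumption: drop a trailing partial frame
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

one_second = [0] * 8000                        # one second of (silent) telephone speech
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))             # 100 80
```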


 
Figure 1.1: Speech is divided into equally-spaced, 10 msec frames, for an utterance of ``no''. Phoneme boundaries are also shown.

In reality, things are not so clearly cut. Any particular 10 msec frame may happen to overlap two phonemes and the exact boundary between the phonemes is not always clear in any case. The transition is often gradual. Still, we assume that this model is close enough to give a good result.

  Phoneme classifiers sometimes make mistakes, unfortunately. We cannot rely on getting a perfect sequence of /n oU/ or /j E s/. Faced with all this uncertainty, we take a probabilistic approach. The phoneme classifier is used as a probability estimator. At every frame, it computes the probability of every phoneme. We make the assumption that the probability of a word is the product of the phoneme probabilities for each frame of that word. This is not really true since the probabilities of phonemes in adjacent frames are not independent. However, it is computationally simple and this is what is usually done. Hereafter the product of phoneme probabilities will be called the score for the word.
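A minimal Python sketch of this score follows. The per-frame probabilities are made-up values, and the function name is hypothetical. Summing logarithms is used instead of multiplying directly because the product of many small probabilities quickly underflows; the two are equivalent for ranking words.

```python
import math

def word_log_score(frame_probs):
    """Sum of log probabilities, i.e. the log of the product of the per-frame values."""
    return sum(math.log(p) for p in frame_probs)

# Hypothetical probabilities of the expected phoneme at each of six frames of a word.
probs = [0.9, 0.8, 0.7, 0.9, 0.6, 0.8]
log_score = word_log_score(probs)

# Exponentiating recovers the plain product of the probabilities.
print(math.exp(log_score))
```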

Feature extraction

The raw waveform can be played through loudspeakers to generate an image of the original signal, and this image can be readily recognized as that of the original.

There are however many ways in which the original signal can be modified, such that when the result is played, it is still readily recognized by human listeners as the original. This raises the question of which aspects of the signal actually carry the information needed for speech recognition and understanding.

DC-offset removal

  A direct current offset in a speech wave is typically an artifact of the recording process. One of the first processing steps involved in speech recognition is to remove the direct current offset of the speech wave.

Frequency components

The original signal can be analyzed into component signals at different frequencies, and these components can then be added together to reconstruct the original signal. The Fourier Transform is used to separate out component signals at multiples of some basic frequency. Each resulting component is sinusoidal with fixed frequency, but with varying amplitude [stremler82].

From studying speech signals we know that the signal contains enough redundancy that some of the frequencies can be blocked out, but the listener will still understand what was said.

Telephone transmission uses this fact to represent human speech, with an audible frequency range from 20 Hz to 20 000 Hz, using instead the much smaller range of 300 Hz to 3300 Hz. Telephone users can tell that the conversation is not face-to-face, but generally have very little difficulty understanding each other. The worst difficulties seem to be loss of discrimination of obstruent consonants: humans have trouble telling /s/ from /f/ on the telephone. But in general this is not a terrible problem, as humans make up for recognition difficulties by using other context information to accurately decide which phoneme was spoken.

PLP encoding of frames

Perceptual linear predictive (PLP) [hermansky90] coding takes into account many limitations and characteristics of human speech production and hearing to reduce a number of direct waveform samples into a few numbers that represent the perceived frequency concentrations and widths.

This signal processing method modifies the short-term spectrum of the speech by several psychophysically based transformations. These values therefore are robust to many kinds of variation in the speech waveform. That is, if a change in the waveform is not perceived by a human listener, the corresponding PLP values will be very similar.

The frame-based recognizer provided with the CSLUsh development environment uses a window of 80 samples (10 msec) to derive seven PLP coefficients and a measure of energy within the window. This process is repeated at 10 msec intervals, giving eight numbers every 10 msec with which to characterize the speech.

Energy normalization

One of the features included in the feature extraction process describes the amount of energy in the window of speech. The energy in a speech signal typically has a great deal of variance. Its sources include variation in the loudness of the speaker and of the recording, as well as variation in signal energy between different phonemes. CSLUsh provides an energy normalization algorithm which is used to reduce this variance, so that loud or soft speech recordings will not affect the underlying recognition technology.

Computing phoneme probabilities

CSLUsh uses a feed-forward neural network--a very common architecture for the last 10 years--to compute phoneme probabilities. The net is usually thought of as a classifier, but in the theoretically optimal case, the outputs are probabilities.

  Very briefly, a neural network consists of several simple units (analogous to neurons in the brain) which have a numerical output value. There are weighted connections between units. Positive weights are excitatory and negative weights are inhibitory. The networks we use have no feedback paths. The output values of certain units are set externally. These are the inputs to the net representing the frame of speech to be classified. These values are propagated across connections to other units. Each destination unit sums its net input and computes an output value in the range 0 to 1 from this sum. If there are additional layers of units, the process is repeated.

Eventually, the propagated activation reaches the final layer of the net which has no further connections. These units are the output of the net and, ideally, their values represent the probability that the input is from various phonemes. A snapshot of a neural network for our yes/no example is shown in Figure 1.2.
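The propagation just described can be sketched as follows. This is a toy illustration, not the CSLUsh network: the logistic (sigmoid) squashing function is an assumption (the text only requires an output between 0 and 1), and the weights and layer sizes are arbitrary.

```python
import math

def sigmoid(x):
    """Squash a summed input into the range 0 to 1 (assumed logistic function)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights):
    """weights[j][i] is the connection weight from input unit i to destination unit j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def network_forward(inputs, layers):
    """Propagate activations layer by layer; there are no feedback paths."""
    activation = inputs
    for weights in layers:
        activation = layer_forward(activation, weights)
    return activation

# Hypothetical toy net: 3 inputs -> 2 hidden units -> 2 outputs.
layers = [
    [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]],   # positive weights excite, negative inhibit
    [[1.0, -1.0], [-1.0, 1.0]],
]
outputs = network_forward([0.2, 0.7, -0.1], layers)
print(outputs)
```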


 
Figure 1.2: Graphical depiction of a neural-network phoneme recognizer.

In addition to the eight features (seven PLP coefficients plus energy) representing the frame of speech we want to classify, we also present as input to the network the features from nearby frames. This gives the net some context and greatly improves the classification accuracy.

Search algorithms

    Once we have the phoneme probabilities, how do we find the word with the highest score? Most recognizers, including ours, use an algorithm called a Viterbi search. The phoneme probabilities for each successive frame are arranged in a matrix. We then find the path through the matrix that gives us the highest score. An example of such a path is shown conceptually in Figure 1.3.


 
Figure 1.3: Conceptual representation of a Viterbi search.

While we want to find the highest scoring path through the matrix, we also want the resulting word to represent something meaningful in our vocabulary. The word ``yes'' has three phonemes: /j/, /E/ and /s/, with a transition from /j/ to /E/ and from /E/ to /s/. We'll constrain the transitions between phonemes so that they reflect the legal pronunciations of our words.

There are many possible paths through the matrix, but for any particular phoneme in any given word there is one path to the end that will maximize the score. We can take advantage of this insight to increase the efficiency of the search. If we see two paths come together at the same phoneme in the same word at the same time, we'll discard the path with the lower score. Since the two paths will behave identically from that point on, we know the lower-scoring path will never ``catch up''.
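The search can be sketched in Python for the yes/no vocabulary. This is a simplified illustration under stated assumptions, not the CSLUsh implementation: each word is a single left-to-right phoneme sequence, every phoneme must cover at least one frame, scores are log probabilities, and the frame probabilities are made-up values.

```python
import math

def viterbi_word_score(frame_log_probs, phonemes):
    """Best log score for aligning all frames to `phonemes`, in order.

    frame_log_probs holds one dict per frame, mapping phoneme -> log probability.
    A path may stay in its phoneme or advance to the next one at each frame.
    """
    neg_inf = float("-inf")
    # best[k] = best score of any path that is in phonemes[k] at the current frame
    best = [neg_inf] * len(phonemes)
    best[0] = frame_log_probs[0][phonemes[0]]
    for frame in frame_log_probs[1:]:
        new_best = []
        for k, ph in enumerate(phonemes):
            stay = best[k]
            advance = best[k - 1] if k > 0 else neg_inf
            # Two paths meeting at the same phoneme and frame: keep only the better one.
            new_best.append(max(stay, advance) + frame[ph])
        best = new_best
    return best[-1]  # the path must end in the last phoneme of the word

PHONEMES = [".pau", "n", "oU", "j", "E", "s"]

def make_frame(likely):
    """Hypothetical classifier output: 0.7 for one phoneme, 0.06 for each of the others."""
    return {ph: math.log(0.7 if ph == likely else 0.06) for ph in PHONEMES}

frames = [make_frame(ph) for ph in
          [".pau", ".pau", "n", "n", "oU", "oU", "oU", ".pau"]]
vocabulary = {
    "yes": [".pau", "j", "E", "s", ".pau"],
    "no": [".pau", "n", "oU", ".pau"],
}
scores = {word: viterbi_word_score(frames, model)
          for word, model in vocabulary.items()}
best_word = max(scores, key=scores.get)
print(best_word)
```

Because only the better of two merging paths is kept, the work per frame grows with the number of phoneme positions rather than with the number of possible paths.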

Word pronunciation

People don't always pronounce a word strictly according to the dictionary. For example, ``twenty'' may not have a /th/ in the middle: ``tweny''. Accurately representing real variation in pronunciation is very important. The Viterbi search will blindly score words according to the pronunciation we have given it, with no flexibility. If a /th/ is required and none is found, the score may be significantly lowered.

To alleviate this problem, our recognizers take as an input a word pronunciation dictionary. The following example of a pronunciation entry for the word ``alligators'' reflects differences in how the word is pronounced by various regional and ethnic groups. See the appendix on the symbol set for an explanation of the symbols used.


 
Table 1.1: Regional pronunciation variations for the word ``alligators''.
alligators  /@ l & gc g ei tc th 3r s/
            /@ l & gc g ei d_( 3r z/
            /@ l I gc g E d_( 3r z/
            /@ l I gc g ei d_( &r Z/
            /@ l I gc g ei d_( &r z/
            /@ l I gc g ei d_( 3r z/
            /@ l I gc g I d_( &r z/

Currently, getting good pronunciations is heavily dependent on manual intervention. Future research plans include using phonological rules to map dictionary pronunciations to the forms most likely to actually be said.

Context-dependent phonetic modeling

    Up until now, we've been proceeding as though speech is broken up into discrete segments of uniform, independent phonemes. But the portions of the human vocal tract that produce sound are constantly in motion, and that causes the phonemes to be neither discrete, uniform, nor independent. Even though two instances of the same phoneme may be just as dissimilar as two distinct phonemes, we expect our phoneme classifier to accurately recognize all frames of a phoneme no matter if they are on the edge or in the middle, and no matter what the surrounding phonemes may be. This is difficult to achieve.

To alleviate this problem, we need to fine-tune our phoneme modeling. In essence, we'll treat a single phoneme as consisting of two or three parts, with each part depending either on the phoneme to the left (left context) or the phoneme to the right (right context). For example the word ``near'' has the phonetic pronunciation /n i: 9r/. Breaking this into finer units which are called segments, our recognizer models this as

       $sil<n  n>$fnt  $nas<i:  <i:>  i:>$ret  $fnt<9r  9r>$sil

The /n/ and the /9r/ are modeled as two separate segments, each dependent on the context provided by its neighbors. The /i:/ has an additional context-independent middle part. For example, the symbol n>$fnt means ``the last part of n before a front vowel''. This ensures greater uniformity across multiple instances of the segment. Unfortunately, it also complicates the recognition process, since phonetic pronunciations must be expanded into segments prior to the Viterbi search.
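The expansion of a pronunciation into segments can be sketched as follows. The two lookup tables are hypothetical and cover only the ``near'' example; the class names $nas, $ret and $sil are inferred from that example, and a real recognizer would need entries for the full phoneme set.

```python
# Hypothetical tables: broad context class and number of parts per phoneme.
BROAD_CLASS = {"n": "$nas", "i:": "$fnt", "9r": "$ret"}
PARTS = {"n": 2, "i:": 3, "9r": 2}

def expand_to_segments(phonemes):
    """Expand a phoneme sequence into context-dependent segments."""
    segments = []
    for k, ph in enumerate(phonemes):
        # Words are assumed to be surrounded by silence.
        left = BROAD_CLASS[phonemes[k - 1]] if k > 0 else "$sil"
        right = BROAD_CLASS[phonemes[k + 1]] if k + 1 < len(phonemes) else "$sil"
        segments.append(left + "<" + ph)      # part that depends on the left context
        if PARTS[ph] == 3:
            segments.append("<" + ph + ">")   # context-independent middle part
        segments.append(ph + ">" + right)     # part that depends on the right context
    return segments

print("  ".join(expand_to_segments(["n", "i:", "9r"])))
```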

Background modeling

  Typically a spoken language system uses some endpointing algorithm to decide when speech has stopped. Naively, we could just record whatever was said for a predetermined time, but if the interval we choose is too long the user will hear lengthy delays; if it is too short we may miss an important part of the utterance. Even with our endpointing algorithm, some background sound (ideally silence) will be recorded. Because weak phonemes like /f/ and /n/ at the beginning and end of words can't be accurately distinguished from background, we'll always need to leave a buffer around the end points our algorithm has selected. There may also be brief pauses between words.

To solve this problem, we treat background as if it were a phoneme with its own output on the neural net. It is trained on examples of background from the collected speech data. Unfortunately, if the background noise is loud, this may not work well (and may also fool our endpointing algorithm). So we supplement this approach by also using a garbage model.

The idea behind a garbage model is to give our recognizers the flexibility to compensate for extraneous noise (if present) without penalizing the recognition. Our background model is composed of two ``phonemes'': /.pau/, which models silence and has been trained on collected speech data, and /.garbage/, which models unexpected noises like coughs, unintelligible speech and the like. However, unlike /.pau/, whose score is a direct output of the neural net, the score for /.garbage/ is computed in terms of the other phoneme scores. In our garbage model we assign the median score of the N largest probabilities to /.garbage/, where N is a function of the number of outputs in the net. This means that /.garbage/ is equal to the $\frac{N}{2}$-th highest probability that the neural net comes up with for a given frame. We model our background as the maximum of /.pau/ and /.garbage/, so that we can compensate for either case without penalizing the quality of the recognition. The ideal garbage model will have a better score than any of the vocabulary words while noise is occurring, and a worse score during speech, provided the speech is in our vocabulary. We can also use the garbage model to detect when the word spoken is not in our vocabulary, because the score for /.garbage/ will be ``relatively'' high. As you might expect, getting the ``relatively'' part just right requires some adjusting. (This is done by including samples of both in- and out-of-vocabulary utterances among our training data.)
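A small Python sketch of the garbage and background scores for a single frame is given below. The function names, the choice of N and the probabilities are hypothetical, and whether /.pau/ itself is included among the ranked outputs is not stated in the text; here all values passed in are ranked.

```python
def garbage_score(phoneme_probs, n_largest):
    """Median of the n_largest highest probabilities in one frame."""
    top = sorted(phoneme_probs, reverse=True)[:n_largest]
    # For even N this index selects the (N/2)-th highest value.
    return top[(len(top) - 1) // 2]

def background_score(pau_prob, phoneme_probs, n_largest):
    """Background is whichever of /.pau/ and /.garbage/ scores higher."""
    return max(pau_prob, garbage_score(phoneme_probs, n_largest))

# Hypothetical net outputs for one noisy frame.
probs = [0.40, 0.25, 0.15, 0.10, 0.05, 0.05]
print(garbage_score(probs, 4))
print(background_score(0.05, probs, 4))
```

Because the garbage score sits in the middle of the top-ranked outputs, it loses to a confidently recognized phoneme but beats the vocabulary when no phoneme stands out.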

  The garbage model allows us to do some simple word spotting as well. If the answer to a yes/no question is ``no, I would not'', the ``no'' would match as usual and the ``I would not'' would match the garbage model. Optimal word spotting performance requires a more sophisticated approach and is a subject of ongoing research.

Duration modeling

  In the Viterbi search, the length of a particular phoneme determines how important it is to the overall score. Because they are only a short component of the path, very short phonemes influence the score less than long ones, which can lead to misrecognitions. A significant reduction in errors results if we impose minimum and maximum durations for phonemes, although too-long phonemes are a less frequent source of errors.

Rather than apply absolute minimums or maximums, our recognizers impose a per-frame penalty on the score for segments which are too long or too short. This gives the recognizer some flexibility in overcoming poor word models or sloppy articulation where a segment may really be missing. These duration limits are derived from ``generic'' samples of the phonemes. The recognition improves if we fine-tune the duration limits to match a specific task.
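One possible form of such a penalty is sketched below. The text does not give the exact formula used by the recognizers, so the linear per-frame penalty, the function name and the limits are assumptions for illustration only.

```python
def duration_penalty(n_frames, min_frames, max_frames, penalty_per_frame):
    """Penalty (to subtract from a log score) for a segment lasting n_frames."""
    if n_frames < min_frames:
        return penalty_per_frame * (min_frames - n_frames)   # too short
    if n_frames > max_frames:
        return penalty_per_frame * (n_frames - max_frames)   # too long
    return 0.0                                               # within limits

# Hypothetical limits of 4 to 20 frames with a penalty of 1.5 per frame.
for length in (2, 10, 23):
    print(length, duration_penalty(length, 4, 20, 1.5))
```

Unlike a hard limit, a segment that is slightly too short is still allowed, at a cost that grows with the size of the violation.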

A pipelined implementation

Ideally, spoken language systems provide quick responses to user input. Long delays will probably not be tolerated. Suppose that a one-second delay from the end of the caller's utterance is the maximum desired delay, and the caller's utterance is three seconds long. From the time the caller stops talking, the endpointing algorithm will have to wait about half a second to make sure the caller is done. If we start recognition at this point, we have only half a second to process three seconds of speech, find the words, formulate a reply and begin to produce it (e.g., play a recording). Since speech recognition is computationally expensive, this would probably not be possible.

In a frame-based recognizer, the speech is processed a frame at a time anyway, so there is no reason to wait until the user has stopped talking to begin recognition. As the speech is recorded, phoneme probabilities can be estimated and the Viterbi search can begin its left-to-right processing. If the algorithm can keep up in real time (taking no more than a second to process a second of speech) then it does not matter how long the caller talks; recognition will be done when he or she is, with some small delay.

The only parts of the algorithm which are not inherently pipelined are the DC removal and the normalization of the PLP energy. For DC removal, we ideally compute the average of the signal over the whole utterance. In the pipelined system, we just use past information.

For the normalization of PLP energy, we ideally know the peak energy in the call, so we can represent the energy at each frame as a fraction of the peak. In the pipelined system, we use the peak energy so far (beginning with an estimate of the expected peak) up to the current frame plus 150 msec. This causes a 150 msec delay in the processing of speech.
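The pipelined normalization can be sketched as a running peak with a 15-frame (150 msec) lookahead. The function name and the example energies are hypothetical, and the sketch assumes plain (not logarithmic) energy values.

```python
def normalize_energy(energies, expected_peak, lookahead_frames=15):
    """Divide each frame energy by the peak seen up to lookahead_frames ahead."""
    normalized = []
    peak = expected_peak        # start from an estimate of the expected peak
    seen = 0                    # number of frames already folded into the peak
    for t in range(len(energies)):
        limit = min(t + lookahead_frames + 1, len(energies))
        while seen < limit:     # fold in frames up to the current frame plus lookahead
            peak = max(peak, energies[seen])
            seen += 1
        normalized.append(energies[t] / peak)
    return normalized

# Hypothetical energies: a quiet stretch, one loud frame at index 20, then moderate frames.
energies = [1.0] * 20 + [4.0] + [2.0] * 5
out = normalize_energy(energies, expected_peak=2.0)
print(out[4], out[5], out[20])
```

Frame 5 is the first frame whose lookahead window reaches the loud frame, so its normalized value drops even though its own energy is unchanged.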

Feature computation requires an 80 msec delay since the classification of a frame uses features from 80 msec in the future (as well as 80 msec in the past).

On a DSP implementation, the speech will actually be processed as it arrives, in 10 msec chunks, with the 150 msec delay handled by buffering 15 frames of speech. In a workstation implementation, the speech will probably arrive in larger chunks. Each chunk will be processed and buffers will store enough context to begin the next chunk.




A Quick Tutorial  

This chapter introduces CSLUsh with a series of interactions that describe the main features of the CSLUsh environment. In particular we will be building a small system that identifies an utterance from a closed set of two phrases. The set we will be using for this example is ``yes'' and ``no.'' For this purpose we choose to use one of the general-purpose recognizers supplied with the CSLUsh development environment. All of the information in this chapter will be revisited in more detail in later chapters as well as in the reference guide. The purpose of this chapter is to show the overall structure of CSLUsh and how it can be used to build a simple speech recognition system.

Environment variables

Before reading further, please make sure that the environment variables for your local Toolkit installation (such as CSLUDIR, which is used below) are set.

Starting CSLUsh

To invoke CSLUsh scripts, you must have Tcl installed on your system. Furthermore, if CSLUsh is installed on your system, there should be a collection of Tcl packages installed which together form the CSLUsh development environment. These are needed for the examples presented in this chapter. To start CSLUsh, simply start tclsh (Tcl version 7.5 or later).

tclsh will start up in interactive mode, accepting input from the tclsh prompt. For ease of presentation, we will show interactions as follows:

  system prompt% tclsh
  %

The session box indicates that you can type some commands and get some responses. In particular, this session box tells us that there is a system prompt (``system prompt%'') already present on your screen. You type ``tclsh'' and press enter. The system responds with ``% '' on the new line.

To start off, we first need to copy all files needed for the tutorial into our private work space.

system prompt% mkdir foo
system prompt% cd foo
system prompt% cp $CSLUDIR/tutorial/cslush/* .
system prompt% ls
analysis.tcl          month.wav             taskdigit.tcl*
digit.desc            months.tcl*           yesno.wav
.
.
system prompt% tclsh8.0
%
You will need most of these files. These include examples which will be explained in later chapters. The actual path name referred to above will depend on the location of your local CSLUsh installation. The transcript for the examples presented in this chapter can be found in the file chapter2.tcl.

CSLUsh is Tcl plus a bit more

CSLUsh is built on top of Tcl, so all Tcl commands and behaviors are present. To efficiently do research using CSLUsh, you will need to learn Tcl.

  % llength [info commands]
  84

There are 84 commands active in tclsh. The set of packages in the CSLUsh installation adds to the base functionality of Tcl, providing, among many others, an extended set of commands used to build speech recognizers.

CSLUsh uses the standard Tcl package command to load the modules it needs. To maintain proper version control, all needed packages must be loaded explicitly, as is done in the following piece of Tcl code.

   

% foreach x {Context Wave Prep Analysis Opt Obtrain Garbage Word Tree Rtcl} {
  package require $x 1.0
}
Typical setup commands like these may be added to your local .tclshrc file, which is read when tclsh is started in interactive mode. The optional version number (1.0) indicates that we are requesting version 1.0 of each package specified.

Surfing waves in CSLUsh

  Before we can start building a simple recognizer, we need something to recognize. One of the core functions of CSLUsh is interfacing with speech data. The wave read command provides the capability of reading an existing waveform file into memory.

   

% wave read yesno.wav
wave:0

Tcl operates exclusively on strings of ASCII characters. This provides a consistent and extensible programming interface, because there is only one data type.

But this oversimplifies the truth. Even Tcl, when it opens files, does not return the contents of the file. Instead a file handle, such as ``file3,'' is returned. Similarly, CSLUsh returns a handle which can be used later to access the wave object. The actual wave data is therefore never referred to directly.

To get some information on our new wave object, you may use the wave info command. The info sub-command will be referred to frequently in this book. It allows us to display information of previously created CSLUsh objects.

 

% wave info wave:0
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192

This indicates that the wave has 9812 samples. The samples are encoded in linear format (as opposed to a scaled format such as $\mu$-law) using 16 bits for each sample. There are 8000 samples per second. The entire utterance is 1226.5 milliseconds long (1.2265 seconds). The maximum magnitude occurs at millisecond 565.375 (sample point 4523), and is 2492. The average energy (average of squares of magnitude) in the wave is 39151.1. The DC offset (average sample value) for this wave is -10.0192. Figure 2.1 depicts the speech waveform referenced as wave:0.
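To make these quantities concrete, the Python sketch below computes the same kinds of values from a list of samples. It is an illustration of the definitions given above, not the implementation of the wave info command, and the function name is hypothetical.

```python
def wave_summary(samples, rate_hz):
    """Return (count, duration in msec, (peak, index, msec), energy, DC offset)."""
    n = len(samples)
    duration_msec = 1000.0 * n / rate_hz
    peak_index = max(range(n), key=lambda i: abs(samples[i]))
    peak = (abs(samples[peak_index]), peak_index, 1000.0 * peak_index / rate_hz)
    energy = sum(s * s for s in samples) / n      # average of squared magnitudes
    dc_offset = sum(samples) / n                  # average sample value
    return n, duration_msec, peak, energy, dc_offset

print(wave_summary([0, 3, -4, 1], 8000.0))
```

With 9812 samples at 8000 samples per second, the same duration formula gives 1226.5 msec, and sample point 4523 falls at 565.375 msec, matching the output shown above.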


 
Figure 2.1: Speech Wave (yesno.wav).

Nuclear fallout

One result of using handles for objects is that you must explicitly destroy the wave object in order to reclaim the memory space it uses.

 

% nuke wave:0
% wave info wave:0
object "wave:0" does not exist

The nuke command tells CSLUsh that the memory required for the wave object wave:0 is no longer needed, so the memory can be freed. During the course of this tutorial we will be creating many similar objects. Some of these are used only during an intermediate processing step and may be destroyed (nuked) at that time. An alternative is to destroy all objects when we are done. There are advantages and disadvantages to both. Later chapters will make it clearer when it is necessary to destroy objects that are no longer in use. For the purpose of this tutorial we will destroy all objects created when we are done building our first recognizer.

Using variables in CSLUsh

Since CSLUsh is Tcl plus a bit more, we can take advantage of Tcl variables. The following command reads in the speech waveform file and assigns the object handle returned by the wave read command to the variable myWave.

% set myWave [wave read yesno.wav]
wave:1

Tcl provides command substitution, which allows us to use the result of one command in an argument to another. We eventually plan to write our own CSLUsh scripts, so we might as well start using some variables.

% puts $myWave
wave:1

The prep routine

  One of the first processing steps involved in speech recognition is to remove the DC-offset of the speech wave. For this purpose we use the prep command.

 

% prep
wrong # args: should be "prep option arg ?arg ...?"

The prep command requires that we specify an option and one or more arguments. In this case the options will specify which subcommand of the prep command should be used. The !dc subcommand removes the direct current offset using a simple first-order low-pass filter. Notice that by giving an incomplete (or incorrect) command, some (usually) helpful information is printed in response.

% prep foo
bad option "foo": should be !dc or fir

The result indicates that the prep command accepts two valid subcommands: !dc (for no direct current) and fir, which implements finite impulse response filters.

 

% prep !dc
wrong # args:should be !dc initialize || reset || info || 
dcrmob(string) wave(string) ?outwave(string)

Note also that the prep !dc command takes a variable number of parameters depending on the task it is performing. Please refer to the reference pages for a detailed description of the prep command.

Before we can remove the direct current offset we first need to create a prep-operator object. This object will contain all information needed for the prep !dc command to perform its task. Almost all CSLUsh commands have associated operator-objects. These need to be created before the specific command can function or operate on its associated data.

The first step therefore is to create a !dc-operator. This is done by running the prep !dc command with the initialize option. The !dc-operator object returned specifies how to remove the direct current offset. Various parameters can be set with the initialize option, all of which will influence how the direct current offset will be removed.

   

% set prepInit [prep !dc initialize]
dcrm:2
% prep !dc info $prepInit  
{-tau 300.0} {-rate 8000.0}
Here !dc refers to its functionality (removing direct current or ``no direct current''). You will probably recognize the other mnemonics as we go along. The first part of the object's name tells us what kind of object it is, and identifies the internal (hidden) data structure of the object. The second part of the object's name indicates the total number of objects created thus far. The actual numbering is not important since we will be referring to the object using variables instead. Since the initialize option was used without any optional parameters, the default parameters were chosen. These defaults can be overridden during initialization using the -tau and -rate options of the prep !dc initialize command.

We have now created a !dc-operator object, referenced through the variable prepInit. The next step is to use this operator to process our wave object, removing the direct current offset from the speech data object myWave.

 

% set myPrep [prep !dc $prepInit $myWave]
wave:3
% wave info $myPrep
9812 linear-16 8000.0 1226.5 {2488 4523 565.375} 39283.8 -0.131166

``wave:3'' is the handle for the result of the DC removal. Note that the remaining direct current offset (-0.131166) is not equal to zero. This residual arises because the DC offset is removed approximately, using a low-pass filter, rather than by computing the offset over the whole waveform.
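
The residual offset described above can be reproduced with a small pure-Tcl sketch. It assumes that a one-pole low-pass filter tracks the running mean, which is then subtracted from each sample. The filter actually used by prep !dc is not specified here, so the procedure names removeDc and average and the smoothing constant alpha are hypothetical.

```tcl
# Hypothetical sketch: remove a DC offset by subtracting a running mean
# that is tracked with a one-pole low-pass filter. This is NOT the
# CSLUsh implementation; alpha is an assumed smoothing constant.
proc removeDc {samples {alpha 0.01}} {
    set mean 0.0
    set out {}
    foreach s $samples {
        # Low-pass estimate of the offset, updated sample by sample.
        set mean [expr {$mean + $alpha * ($s - $mean)}]
        lappend out [expr {$s - $mean}]
    }
    return $out
}

# Average of a list of numbers.
proc average {values} {
    set sum 0.0
    foreach v $values { set sum [expr {$sum + $v}] }
    return [expr {$sum / [llength $values]}]
}

# A toy signal: alternating +/-100 riding on a constant offset of -10.
set signal {}
for {set i 0} {$i < 2000} {incr i} {
    lappend signal [expr {($i % 2 ? 100.0 : -100.0) - 10.0}]
}
set cleaned [removeDc $signal]
puts "offset before: [average $signal]"
puts "offset after:  [average $cleaned]"
```

Because the running mean needs time to settle, the offset of the filtered signal is much smaller than the original but not exactly zero, which mirrors the behaviour seen in the wave info output.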

Speech signal processing

Our recognizer uses features based on the perceptual linear prediction (PLP) model of speech. The analysis command provides, among other signal processing algorithms, the capability for extracting perceptual-based features from the speech waveform.

As with the prep command, we first need to create a plp-operator object. This is done using the initialize option.

     

% set plpInit [analysis plp initialize]
plp:4
% analysis plp info $plpInit
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 7} 
{-feats 8} {-filters 17} {-exp 0.6} {-lograsta 1.0}
% set myPlp [analysis plp $plpInit $myPrep]
arrayF:6

The defaults are listed using the info option of the analysis plp command. The above command therefore extracts 8 PLP cepstral coefficients every 10 msec, based on an analysis window of 10 msec. The cepstral coefficients are exponentially liftered (weighted) using a liftering factor of 0.6. Figure 2.2 depicts the spectrogram computed from the PLP coefficients of each 10 msec time slice. The spectrogram is obtained by converting the time-domain cepstral parameters (PLP coefficients) back to frequency components using a discrete Fourier transform (DFT).


 
Figure 2.2:   Spectrogram calculated from the PLP model of speech.

Energy normalization

The perceptual linear predictive model of speech also includes the energy of the speech. This energy measure gives an indication of the loudness of the speech over time. We may want to normalize the energy coefficient so that both loud speech and quiet speech have similar energy characteristics. This process is referred to as energy normalization.

     

% set enormInit [analysis energynorm initialize]
enorm:7
% analysis energynorm info $enormInit
{-lookahead 160.0} {-framesize 10.0} {-decay 0.999} {-filt 0.1}
{-maxest 4.0} {-coeffs 8}
% set myEnorm [analysis energynorm $enormInit $myPlp -flush] 
arrayF:8

The zeroth element of each PLP feature vector represents an energy-like feature. The analysis energynorm command normalizes this element by dividing it by an estimate of the maximum power in the speech signal. The estimate of the maximum is initialized to a constant (4.0) and decays until a peak of higher power is detected, in which case the estimate is re-initialized to this peak value.
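
The peak-tracking idea can be sketched in a few lines of pure Tcl. This is only an illustration under stated assumptions: the look-ahead buffer and the -filt smoothing are left out, and the procedure name normalizeEnergy is hypothetical. The default arguments simply reuse the -maxest and -decay values listed above.

```tcl
# Hypothetical sketch of peak-tracking normalization; not the CSLUsh code.
# maxest: initial estimate of the maximum; decay: per-frame decay factor.
proc normalizeEnergy {energies {maxest 4.0} {decay 0.999}} {
    set peak $maxest
    set out {}
    foreach e $energies {
        # Let the estimate decay, then re-initialize it on a higher peak.
        set peak [expr {$peak * $decay}]
        if {$e > $peak} { set peak $e }
        lappend out [expr {$e / $peak}]
    }
    return $out
}

set quiet {0.5 1.0 2.0 1.0 0.5}
set loud  {5.0 10.0 20.0 10.0 5.0}
puts [normalizeEnergy $quiet]
puts [normalizeEnergy $loud]
```

In this sketch the loud sequence raises the estimate as soon as it exceeds the initial constant, so its normalized values never exceed 1.0, while the quiet sequence is scaled by the slowly decaying initial estimate.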

Feature selection

The input to the phonemic classifier consists of 56 features: 8 PLP cepstral coefficients from each of seven distinct regions spanning a 160 msec window centered on the frame of speech to be classified. This 160 msec window provides contextual information about the current speech frame. The collect command is used for this purpose.

     

% set contextInit [collect initialize]
context:9
% collect info $contextInit
{-frames {{-8 1} {-4 1} {-1 1} {0 1} {1 1} {4 1} {8 1}}} {-coeffs 8} 
{-delay 90.0}
% set myFeat [collect $contextInit $myEnorm -flush]
arrayF:10
Figure 2.3 depicts graphically the feature collection using a context window of 160 msec.


 
Figure 2.3:   Graphical representation of the collect command.
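
The feature collection step can be illustrated with a pure-Tcl sketch that stacks the coefficients of the frames at offsets -8, -4, -1, 0, 1, 4 and 8 into a single row. The procedure name collectContext is hypothetical, and repeating the first or last frame at the utterance edges is an assumption, not documented CSLUsh behaviour.

```tcl
# Hypothetical sketch of context collection; not the CSLUsh implementation.
# frames:  list of feature vectors, one per 10 msec frame
# offsets: frame offsets relative to the frame being classified
proc collectContext {frames {offsets {-8 -4 -1 0 1 4 8}}} {
    set last [expr {[llength $frames] - 1}]
    set out {}
    for {set t 0} {$t <= $last} {incr t} {
        set row {}
        foreach off $offsets {
            # Assumption: clamp at the utterance edges by repeating
            # the first or last frame.
            set i [expr {$t + $off}]
            if {$i < 0} { set i 0 }
            if {$i > $last} { set i $last }
            foreach v [lindex $frames $i] { lappend row $v }
        }
        lappend out $row
    }
    return $out
}

# Twenty dummy frames with 8 coefficients each.
set frames {}
for {set t 0} {$t < 20} {incr t} {
    set f {}
    for {set c 0} {$c < 8} {incr c} { lappend f [expr {$t + 0.1 * $c}] }
    lappend frames $f
}
set feats [collectContext $frames]
puts "[llength $feats] frames, [llength [lindex $feats 0]] features each"
```

Seven offsets times 8 coefficients gives the 56 inputs per frame mentioned above.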

Neural network probability estimation

The nnet command converts the input feature vectors into probability estimates for the phonemes which describe the recognizer. The neural-network inputs are fed through the net, row by row, and the results are accumulated as rows of the output array $netOut. Before we can compute the actual phoneme probabilities for each frame of speech, we need to specify the recognizer we are going to use. CSLUsh provides two implementations of a general purpose recognizer, both supplied by the package Genrecog. Version 0.0 of the Genrecog package implements context-dependent phoneme models calculated from energy-normalized features, as computed in our current example. Version 1.2 of the Genrecog package has the same functional output units (context-dependent phoneme models), but uses RASTA processing of the speech signal for greater robustness. This issue is dealt with in greater detail in later chapters.

   

% package require Genrecog 0.0
0.0
% genrecog initialize recog
CREATE
% puts $recog(nnet)
nnet:13
% set netOut [nnet x $recog(nnet) $myFeat]
arrayF:22
The genrecog initialize function call creates an instance of the general purpose recognizer. This defines the specific signal processing needed for this recognizer, the neural network being used, and the functional description of the neural network output units (context-dependent phoneme models).

Our ultimate task is keyword spotting. We want to ignore words that do not match our target vocabulary. We do this by defining a ``garbage'' phoneme, whose probability is calculated as the median of the top N probabilities, frame by frame. The actual value of N is specific to the recognizer being used.

When the Viterbi search seeks the best path through the phonemes, our garbage phoneme should give a better score on average than the phonemes required for the target vocabulary, unless we are actually scanning a portion of a target vocabulary word.

   

% set netOutG [garbage median -N $recog(garbage) $netOut]
arrayF:23

Here $netOutG is just like $netOut, but with an extra value (the garbage score) added to each frame of phoneme probabilities. We now have an array of probability estimates. Once our search model is ready, we will examine this array and identify the target vocabulary word that matches most closely.
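
The garbage score for a single frame can be sketched in pure Tcl as follows. The procedure name addGarbage and the sample probabilities are hypothetical; the sketch only illustrates taking the median of the top N values and appending it to the frame.

```tcl
# Hypothetical sketch: append a "garbage" score to one frame of phoneme
# probabilities, computed as the median of the top N values.
proc addGarbage {probs n} {
    set sorted [lsort -real -decreasing $probs]
    set top [lrange $sorted 0 [expr {$n - 1}]]
    set k [llength $top]
    set mid [expr {$k / 2}]
    if {$k % 2} {
        set median [lindex $top $mid]
    } else {
        # Even count: average the two middle values.
        set median [expr {([lindex $top [expr {$mid - 1}]] + [lindex $top $mid]) / 2.0}]
    }
    return [concat $probs [list $median]]
}

set frame {0.02 0.40 0.05 0.30 0.10 0.08 0.05}
puts [addGarbage $frame 5]
```

With N set to 5, the top five values are 0.40, 0.30, 0.10, 0.08 and 0.05, so the appended garbage score is 0.10.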

Building our vocabulary

The goal is to build a simple recognizer which can recognize the words ``yes'' and ``no.'' First of all we need to define the pronunciation models for each of these words. Typically these are extracted automatically from a pronunciation dictionary; here, however, we will define them by hand.

% set Words {{"yes" "j E s"} {"no" "n oU"}}
{"yes" "j E s"} {"no" "n oU"}

The next step is to convert these word pronunciations into word model objects. Word model objects are CSLUsh objects, which describe the pronunciation of the words in a more machine-accessible form. For this purpose we use the word create command.

   

% set myWords [word create $Words]
word:24 word:25
Although these word pronunciations are now in a more machine-accessible form, they do not fully describe the pronunciation of the words with respect to our phoneme probability estimator. The aim is to build word pronunciations which are recognizable by our phoneme recognizer. This is known as adding context to our word pronunciation models. The word context command is used for this purpose.

   

% set contextWords [word context $myWords $recog(names)]
word:26 word:27

Here the variable $recog(names) refers to the functional description of the neural network output units, and is created when an instance of the recognizer is initialized (e.g., genrecog initialize). Chapter 5 discusses the description of the neural-network output units in greater detail. To find out how our general purpose recognizer would pronounce each of the words in the vocabulary, we can use the word pronun command.

 

% word pronun [lindex $contextWords 0] $recog(names)
{{$sil<j} {j>$mid} {$fntl<E} <E> {E>$obs} {$mid<s} {s>$sil}}
% word pronun [lindex $contextWords 1] $recog(names)
{{$sil<n} {n>$bckr} {$nas<oU} <oU> {oU>$sil}}

Building a lexical tree

Given the context-expanded word objects created in the previous section, we are now ready to build our Viterbi search engine. One of the more efficient ways of doing word spotting is to use a lexical tree to represent the Viterbi search. A lexical tree is a collection of word pronunciation models in which nodes or states in the Viterbi search are shared among words. For example, if we have the following two words

 {"call-block"            "kc kh A l [.pau] bc b l A kc kh"}
 {"call-forwarding"       "kc kh A l [.pau] f > 9r w 3r d( i: N"}

a lexical tree would combine the shared first section of the two words, so that the search takes the form depicted in Figure 2.4.


 
Figure 2.4:   Graphical representation of a lexical tree

Using a lexical tree to direct the Viterbi search saves both memory and computation. The tree build command is used to build such a lexical tree.

     

% set myTree [tree build $contextWords $recog(any) $recog(names)]
treesearch:28
% tree info $myTree -nbest -shortpen -longpen -framesize
{-nbest 4} {-shortpen -5.99146} {-longpen -2.00248} {-framesize 10.0}

Like the $recog(names) object, the variable $recog(any), which specifies the default background speech modeling, is defined when the recognizer is created.
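
The saving obtained by sharing states can be illustrated with a pure-Tcl sketch that stores the two ``call'' pronunciations from above in a nested dict keyed by phoneme. The procedure names addWord and countNodes and the ``=word'' leaf marker are hypothetical, and the optional pause [.pau] is treated as an ordinary symbol. This is not how tree build represents its search internally.

```tcl
# Hypothetical sketch of a lexical tree as a nested dict keyed by phoneme;
# not the CSLUsh tree build implementation.
proc addWord {treeVar word phonemes} {
    upvar 1 $treeVar tree
    # A leaf is marked with the key "=word" so it cannot clash with a phoneme.
    dict set tree {*}$phonemes =word $word
}

# Count the phoneme nodes in the tree (shared prefixes are counted once).
proc countNodes {tree} {
    set n 0
    dict for {key sub} $tree {
        if {$key eq "=word"} continue
        incr n [expr {1 + [countNodes $sub]}]
    }
    return $n
}

set words {
    call-block      {kc kh A l [.pau] bc b l A kc kh}
    call-forwarding {kc kh A l [.pau] f > 9r w 3r d( i: N}
}
set tree [dict create]
set separate 0
foreach {word phonemes} $words {
    addWord tree $word $phonemes
    incr separate [llength $phonemes]
}
puts "separate models: $separate nodes"
puts "lexical tree:    [countNodes $tree] nodes"
```

The two pronunciations hold 24 symbols in total, but the tree needs only 19 nodes because the five leading symbols are stored once.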

Doing a search

Now we are almost ready to do recognition. We have phoneme probability estimates for each time slice of our test utterance, and we are ready to do a Viterbi search, to find the best hypothesized word. Given the initialized lexical tree we can perform the Viterbi search with the tree update command.

 

% tree update $myTree $netOutG

The last step of the recognition process is to retrieve the answer. This is done using the tree getbest command.

 

% tree getbest $myTree
{no -131.681} {yes -178.28}

By default the lexical tree search keeps the four best answers. In this case we can clearly see that the best answer is ``no.'' The second-best word, ``yes,'' has a much lower score, and we therefore conclude that the file yesno.wav contains the spoken word ``no.''
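
Since the result is an ordinary Tcl list, it can be examined with standard list commands. The following sketch uses the result shown above as a literal value, so it runs without CSLUsh; the variable names are hypothetical.

```tcl
# Result list copied from the tree getbest output above.
set results {{no -131.681} {yes -178.28}}

# Sort by score, highest (least negative) first; -index 1 selects the score.
set ranked [lsort -real -decreasing -index 1 $results]
lassign [lindex $ranked 0] bestWord bestScore
lassign [lindex $ranked 1] nextWord nextScore
set margin [expr {$bestScore - $nextScore}]
puts "best: $bestWord ($bestScore)"
puts "margin over $nextWord: $margin"
```

The margin between the two scores gives a rough indication of how clearly the best word was separated from the runner-up.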

Resetting the recognizer

Once we are done with recognition, we can either destroy all objects and recreate them for each following recognition, or we could reset each core object to its original state. The reset sub-command allows us to do just that.

         

% prep !dc reset $prepInit
dcrm:2
% analysis plp reset $plpInit
plp:4
% analysis energynorm reset $enormInit
enorm:7
% collect reset $contextInit
context:9
% tree reset $myTree

Cleaning up

During this session we have created many CSLUsh objects. All of these consume memory. When building an application using CSLUsh it is necessary to make sure that objects which are no longer needed are destroyed. The nuke command is used for this purpose.

   

% set oblist [object list]
context:9 arrayF:10 context:19 dcrm:2 enorm:7 btraceheap:20 word:24
enorm:18 word:25 arrayF:6 word:26 plp:17 wave:1 word:27 arrayF:8
wave:3 probname:12 treesearch:28 plp:4 arrayF:22 nnet:13 arrayF:23
garbage:21 dcrm:16
% nuke $oblist 
% puts [object list]

The object list command provides a listing of all remaining objects. Since we are done with the first tutorial, none of these are needed, and we can therefore free all allocated memory with the nuke command. The nuke command operates on both single objects and lists of objects. In this case we use object list to create a list of CSLUsh objects, which can then be freed using the nuke command.

The quick and easy way

The steps presented in this chapter to build a general purpose recognizer are implemented in the Genrecog package. The following example shows how we can repeat our recognition experiment using the general purpose recognizer specified by version 0.0 of the Genrecog package. The exact meaning of each interface variable is described in greater detail in the reference pages.

       

% package require Genrecog 0.0
0.0
% genrecog initialize recog
CREATE
% set w [wave read yesno.wav]
wave:11
% set Words {{yes "j E s"} {no "n oU"}}
{yes "j E s"} {no "n oU"}
% genrecog tree recog search $Words
% genrecog pipe recog search $w
% genrecog result recog search
{no -131.681 .....

Together with the word scores, the genrecog result function call also returns the associated phoneme alignment. More on this in chapter 6.

A quick recap

In this chapter we presented a basic outline of the key components needed to build a speech recognizer, and showed how they work together to build a fixed-vocabulary recognizer.






Session:

% foreach x {Context Wave Prep Analysis Opt Obtrain Garbage Word Tree Rtcl} {
  package require $x 1.0
}
% wave read yesno.wav
wave:0
% wave info wave:0
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192
% nuke wave:0
% wave info wave:0
object "wave:0" does not exist
% set myWave [wave read yesno.wav]
wave:1
% puts $myWave
wave:1
% set prepInit [prep !dc initialize]
dcrm:2
% prep !dc info $prepInit  
{-tau 300.0} {-rate 8000.0}
% set myPrep [prep !dc $prepInit $myWave]
wave:3
% wave info $myPrep
9812 linear-16 8000.0 1226.5 {2488 4523 565.375} 39283.8 -0.131166
% set plpInit [analysis plp initialize]
plp:4
% analysis plp info $plpInit
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 7} 
{-feats 8} {-filters 17} {-exp 0.6} {-lograsta 1.0}
% set myPlp [analysis plp $plpInit $myPrep]
arrayF:6
% set enormInit [analysis energynorm initialize]
enorm:7
% analysis energynorm info $enormInit
{-lookahead 160.0} {-framesize 10.0} {-decay 0.999} {-filt 0.1}
{-maxest 4.0} {-coeffs 8}
% set myEnorm [analysis energynorm $enormInit $myPlp] 
arrayF:8
% set contextInit [collect initialize]
context:9
% collect info $contextInit
{-frames {{-8 1} {-4 1} {-1 1} {0 1} {1 1} {4 1} {8 1}}} {-coeffs 8} 
{-delay 90.0}
% set myFeat [collect $contextInit $myEnorm -flush]
arrayF:10
% package require Genrecog 0.0
0.0
% genrecog initialize recog
CREATE
% puts $recog(nnet)
nnet:13
% set netOut [nnet x $recog(nnet) $myFeat]
arrayF:22
% set netOutG [garbage median -N $recog(garbage) $netOut]
arrayF:23
% set Words {{"yes" "j E s"} {"no" "n oU"}}
{"yes" "j E s"} {"no" "n oU"}
% set myWords [word create $Words]
word:24 word:25
% set contextWords [word context $myWords $recog(names)]
word:26 word:27
% word pronun [lindex $contextWords 0] $recog(names)
{{$sil<j} {j>$mid} {$fntl<E} <E> {E>$obs} {$mid<s} {s>$sil}}
% word pronun [lindex $contextWords 1] $recog(names)
{{$sil<n} {n>$bckr} {$nas<oU} <oU> {oU>$sil}}
% set myTree [tree build $contextWords $recog(any) $recog(names)]
treesearch:28
% tree info $myTree -nbest -shortpen -longpen -framesize
{-nbest 4} {-shortpen -5.99146} {-longpen -2.00248} {-framesize 10.0}
% tree update $myTree $netOutG
% tree getbest $myTree
{no -131.681} {yes -178.28}
% prep !dc reset $prepInit
dcrm:2
% analysis plp reset $plpInit
plp:4
% analysis energynorm reset $enormInit
enorm:7
% collect reset $contextInit
context:9
% tree reset $myTree
% set oblist [object list]
context:9 arrayF:10 context:19 dcrm:2 enorm:7 btraceheap:20 word:24
enorm:18 word:25 arrayF:6 word:26 plp:17 wave:1 word:27 arrayF:8
wave:3 probname:12 treesearch:28 plp:4 arrayF:22 nnet:13 arrayF:23
garbage:21 dcrm:16
% nuke $oblist 
% puts [object list]

% package require Genrecog 0.0
0.0
% genrecog initialize recog
CREATE
% set w [wave read yesno.wav]
wave:11
% set Words {{yes "j E s"} {no "n oU"}}
{yes "j E s"} {no "n oU"}
% genrecog tree recog search $Words
% genrecog pipe recog search $w
% genrecog result recog search
{no -131.681 {{560 570 {$sil<n}} {570 580 {n>$bckr}} {580 590 {$nas<oU}}
{590 600 <oU>} {600 630 {oU>$sil}}} {{560 630 no}}}
{yes -178.28 {{20 30 {$sil<j}} {30 40 {j>$mid}} {40 50 {$fntl<E}}
{50 60 <E>} {60 70 {E>$obs}} {70 80 {$mid<s}} {80 100 {s>$sil}}}
{{20 100 yes}}} -1116.5933

Digital Waveforms

As mentioned in the previous chapter, one of the core functions of CSLUsh is interfacing with speech data. In this chapter we review the CSLUsh utilities used for manipulating speech wave objects: reading and writing wave objects, removing leading and trailing silence, and converting wave objects to and from vector objects.

The transcript for the examples presented can be found in the file chapter3.tcl.

Reading and writing wave objects

The wave read command reads an existing waveform file into memory. Currently only NIST sphere-headed files are supported.

       

% package require Wave 1.0
1.0
% set myWave [wave read yesno.wav]
wave:0
% wave info $myWave
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192
% wave write $myWave output.wav -encoding linear
wave:0

Both the wave read and wave write commands have options for reading or writing selected portions of the waveform. The following example shows how we can read only the first 1000 msec of the speech file and then write the latter half to disk.

% set newWave [wave read yesno.wav 0 1000]
wave:1
% wave info $newWave
8000 linear-16 8000.0 1000.0 {2492 4523 565.375} 48005.2 -11.9775
% wave write $newWave segment.wav 500 1000
wave:1

The wave info command returns statistics about a wave object. Our new wave object has 8000 samples. The samples are encoded in linear format using 16 bits per sample. The sampling rate is 8000 Hz, meaning there are 8000 samples per second of speech. The entire utterance is 1000 milliseconds long (1.0 second). The maximum magnitude occurs at millisecond 565.375 (sample number 4523) and is equal to 2492. The average energy (mean squared magnitude) of the wave is 48005.2, and the DC offset is -11.9775.
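
The fields described above can be unpacked with standard Tcl list commands. In the sketch below the wave info result is written as a literal list so that it runs without CSLUsh; the field names are hypothetical labels chosen for this example.

```tcl
# Field names are hypothetical labels for the values described above.
set waveInfo {8000 linear-16 8000.0 1000.0 {2492 4523 565.375} 48005.2 -11.9775}
lassign $waveInfo nSamples encoding rate durationMs peak energy dcOffset
lassign $peak peakValue peakSample peakMs

# The duration and the peak position follow from the sample counts and rate.
set computedDuration [expr {1000.0 * $nSamples / $rate}]
set computedPeakMs   [expr {1000.0 * $peakSample / $rate}]
puts "duration: $computedDuration msec"
puts "peak at:  $computedPeakMs msec"
```

Both computed values agree with the figures reported by wave info (1000.0 and 565.375 msec).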

Removing leading and trailing silence

When doing speech recognition it is advantageous not to have long portions of silence at the beginning and end of the waveforms. Removing them, however, is not an easy task. We wish to remove silence and start recognition only once the actual speech data occurs. While searching for speech, we also wish not to act on spurious high-energy events which may occur. The wave chopsil command provides this capability.

The wave chopsil command uses an estimate of the variance of the absolute amplitude of the signal. Together with state duration constraints it is able to filter short energy events and trigger on end-of-speech events.

 

% set chopWave [wave chopsil $myWave]
wave:2 400.625 1118.12 -10.0192 197.611
% wave info [lindex $chopWave 0]
5740 linear-16 8000.0 717.5 {2492 1318 164.75} 66873.4 -15.9645

The wave chopsil command returns a Tcl list containing a handle to the silence-chopped wave object (wave:2); the starting point of detected speech in milliseconds (400.625 msec); the end of utterance detected in milliseconds (1118.12 msec); the DC offset of the original waveform; and lastly the standard deviation of signal amplitude in the original waveform.

In some circumstances, it is possible that the speech detector might not detect any speech events. In this case it will not return a wave handle, thus indicating that no speech events were detected.

   

% set zeroWave [wave zero 100]
wave:3
% wave chopsil $zeroWave
{} 0.0 0.0 0.0 0.0

The first element of the returned Tcl list normally holds the handle to the silence-chopped wave object. Since here the first element is an empty list, we know that no speech events were detected in the input wave object.
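
A script would typically test for the empty handle before using the result. The sketch below works on literal copies of the two result lists shown in this section, so it runs without CSLUsh; the procedure name describeChop is hypothetical.

```tcl
# Hypothetical helper: report what a wave chopsil result list contains.
proc describeChop {result} {
    lassign $result handle startMs endMs dcOffset stdDev
    if {$handle eq ""} {
        return "no speech detected"
    }
    return "speech in $handle from $startMs to $endMs msec"
}

puts [describeChop {wave:2 400.625 1118.12 -10.0192 197.611}]
puts [describeChop {{} 0.0 0.0 0.0 0.0}]
```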

Various parameters affect the silence chopping. These are described in the reference page for the wave chopsil command.

Converting wave objects

In CSLUsh, wave objects are specialized objects that hold waveform data in memory using a 16-bit linear encoding. However, it is sometimes necessary to access the same data through a more general CSLUsh object such as the vector object ``arrayF''. The wave tovec and wave fromvec commands convert to and from CSLUsh vector objects. More specifically, these commands operate on a two-dimensional floating point array structure. Since a wave object is only a one-dimensional data structure, the zeroth dimension of the vector object is set to one (i.e., the matrix has only one row). Within CSLUsh all signal processing routines work on the floating point representation of the speech waveform. These commands may be used to convert to the correct format if necessary.

% set floatWave [wave tovec $myWave]
arrayF:4
% set originalWave [wave fromvec $floatWave]
wave:5
% wave info $originalWave
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192

Other useful things to do with waves

Other operations on waves are fairly self-explanatory. These include scaling a wave, adding waves together (sample by sample) and so forth. The reference pages present each of these separately with some examples and discussion.






Session:

% package require Wave 1.0
1.0
% set myWave [wave read yesno.wav]
wave:0
% wave info $myWave
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192
% wave write $myWave output.wav -encoding linear
wave:0
% set newWave [wave read yesno.wav 0 1000]
wave:1
% wave info $newWave
8000 linear-16 8000.0 1000.0 {2492 4523 565.375} 48005.2 -11.9775
% wave write $newWave segment.wav 500 1000
wave:1
% set chopWave [wave chopsil $myWave]
wave:2 400.625 1118.12 -10.0192 197.611
% wave info [lindex $chopWave 0]
5740 linear-16 8000.0 717.5 {2492 1318 164.75} 66873.4 -15.9645
% set zeroWave [wave zero 100]
wave:3
% wave chopsil $zeroWave
{} 0.0 0.0 0.0 0.0
% set floatWave [wave tovec $myWave]
arrayF:4
% set originalWave [wave fromvec $floatWave]
wave:5
% wave info $originalWave
9812 linear-16 8000.0 1226.5 {2492 4523 565.375} 39151.1 -10.0192

Feature Extraction

Speech recognition at its most elementary level comprises a collection of algorithms drawn from a wide variety of disciplines, including statistical pattern recognition, communication theory, signal processing and linguistics among others. Although each of these areas is fundamental to varying degrees in different recognizers, the most important common denominator of all recognition systems is the signal processing front-end, which converts the speech waveform to some type of parametric representation. This parametric representation is then used for further analysis and processing. This chapter is devoted to discussing the signal processing algorithms currently available within the CSLUsh environment: power spectral analysis, linear predictive analysis (LPC), perceptual linear prediction (PLP), Mel-scale cepstral analysis (MEL), relative spectra filtering (RASTA), first-order derivatives and energy normalization.

The transcript for the examples presented can be found in the file chapter4.tcl.

Power spectral analysis

One of the more common techniques of studying a speech signal is via the power spectrum. The power spectrum of a speech signal describes the frequency content of the signal over time.

The analysis realfft and analysis pspec commands are used for this purpose. The first step towards computing the power spectrum of the speech signal is to perform a Discrete Fourier Transform (DFT), which computes the frequency-domain equivalent of the time-domain signal. Since a speech signal contains only real values, we can exploit this fact and use a real-valued Fast Fourier Transform (FFT) for increased efficiency. The resulting output contains both the magnitude and phase information of the original time-domain signal.

The analysis pspec command converts this output to contain only log-magnitude information, which can then be used to plot the familiar spectrogram.

   

% foreach x {Analysis Prep Wave} {
  package require $x 1.0
}
% set w [wave read yesno.wav]
wave:0
% set nodc [prep !dc initialize]
dcrm:1
% set rfft [analysis realfft initialize -framesize 3.0 -hamming]
fft:2
% set wnodc [prep !dc $nodc $w]
wave:3
% set wfft [analysis realfft $rfft $wnodc]
arrayF:5
% set wspec [analysis pspec $rfft $wfft]
arrayF:6
Figure 4.1 depicts the spectrogram plotted using the power spectral analysis of the speech waveform.


 
Figure 4.1:   Power spectral analysis of speech.
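
To make the computation concrete, the following pure-Tcl sketch calculates the log power spectrum of a single frame with a direct DFT and a Hamming window. It is only an illustration: the procedure name logPowerSpectrum is hypothetical, the direct DFT is far slower than an FFT, and the frame length of 24 samples assumes that -framesize 3.0 means 3 msec at 8000 Hz.

```tcl
# Hypothetical sketch: log power spectrum of one real-valued frame using a
# direct DFT (an FFT gives the same values faster). Not the CSLUsh code.
proc logPowerSpectrum {frame} {
    set n [llength $frame]
    set pi [expr {acos(-1.0)}]
    # Apply a Hamming window to reduce spectral leakage.
    set windowed {}
    for {set i 0} {$i < $n} {incr i} {
        set w [expr {0.54 - 0.46 * cos(2.0 * $pi * $i / ($n - 1))}]
        lappend windowed [expr {$w * [lindex $frame $i]}]
    }
    # A real signal has a symmetric spectrum, so bins 0..n/2 suffice.
    set spectrum {}
    for {set k 0} {$k <= $n / 2} {incr k} {
        set re 0.0
        set im 0.0
        for {set i 0} {$i < $n} {incr i} {
            set x [lindex $windowed $i]
            set angle [expr {2.0 * $pi * $k * $i / $n}]
            set re [expr {$re + $x * cos($angle)}]
            set im [expr {$im - $x * sin($angle)}]
        }
        # Small floor avoids log(0) in empty bins.
        lappend spectrum [expr {log($re * $re + $im * $im + 1e-12)}]
    }
    return $spectrum
}

# A 24-sample frame holding a 2000 Hz sine wave sampled at 8000 Hz.
set pi [expr {acos(-1.0)}]
set frame {}
for {set i 0} {$i < 24} {incr i} {
    lappend frame [expr {sin(2.0 * $pi * 2000.0 * $i / 8000.0)}]
}
set spec [logPowerSpectrum $frame]

# Locate the strongest bin; bin k corresponds to k * 8000 / 24 Hz.
set peak 0
for {set k 1} {$k < [llength $spec]} {incr k} {
    if {[lindex $spec $k] > [lindex $spec $peak]} { set peak $k }
}
puts "strongest bin: $peak of [llength $spec]"
```

The strongest bin is bin 6, which corresponds to 6 x 8000 / 24 = 2000 Hz, the frequency of the test tone.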

Linear predictive analysis (LPC)

One of the more powerful analysis techniques is the method of linear prediction. Linear predictive analysis of speech has become the predominant technique for estimating the basic parameters of speech. It provides both an accurate estimate of the speech parameters and an efficient computational model of speech.

The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of past speech samples. By minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the predicted values, a unique set of parameters or predictor coefficients can be determined. These coefficients form the basis for linear predictive analysis of speech.
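
The least-squares idea can be shown in its simplest form with a first-order predictor, in which each sample is predicted from the one before it. The pure-Tcl sketch below is an illustration only; the procedure name firstOrderPredictor is hypothetical, and a real analysis such as the 12th-order one used in this chapter solves for several coefficients at once.

```tcl
# Hypothetical sketch: the first-order predictor s[n] ~ a * s[n-1] that
# minimizes the sum of squared prediction errors over the given samples.
proc firstOrderPredictor {samples} {
    set num 0.0
    set den 0.0
    for {set n 1} {$n < [llength $samples]} {incr n} {
        set cur  [lindex $samples $n]
        set prev [lindex $samples [expr {$n - 1}]]
        set num [expr {$num + $cur * $prev}]
        set den [expr {$den + $prev * $prev}]
    }
    # Setting d/da sum (s[n] - a*s[n-1])^2 = 0 gives a = num / den.
    return [expr {$num / $den}]
}

# A decaying sequence in which every sample is 0.9 times the previous one.
set samples {}
set s 1.0
for {set n 0} {$n < 50} {incr n} {
    lappend samples $s
    set s [expr {0.9 * $s}]
}
set a [firstOrderPredictor $samples]
puts "predictor coefficient: $a"
```

For this sequence the recovered coefficient is 0.9 up to floating point rounding, since the signal follows the first-order model exactly.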

The analysis lpc command provides the capability for computing the linear prediction model of speech over time. In practice the predictor coefficients themselves are never used in recognition, since they typically show high variance. The predictor coefficients are therefore transformed to a more robust set of parameters known as cepstral coefficients.

 

% set lpc [analysis lpc initialize]
lpc:7
% analysis lpc info $lpc
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 12} 
{-feats 12} {-exp 0.6} {-preemphasis 0.98}
% set wlpc [analysis lpc $lpc $wnodc] 
arrayF:9
Figure 4.2 depicts the LPC spectrogram computed from a 12th order LPC analysis of the speech waveform.


 
Figure 4.2:   Linear predictive (LP) analysis of speech.

Perceptual linear prediction (PLP)

Perceptual linear prediction, similar to LPC analysis, is based on the short-term spectrum of speech. In contrast to pure linear predictive analysis of speech, perceptual linear prediction (PLP) modifies the short-term spectrum of the speech by several psychophysically-based transformations. The PLP cepstral coefficients are computed using the analysis plp command.

Just like most other short-term spectrum-based techniques this method is vulnerable when the short-term spectral values are modified by the frequency response of the communication channel. The analysis plp command provides limited capability for dealing with these distortions by employing a RASTA (Relative Spectral) filter which makes PLP analysis more robust to linear spectral distortions.

 

% set plp [analysis plp initialize]
plp:10
% analysis plp info $plp
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 7} 
{-feats 8} {-filters 17} {-exp 0.6} {-lograsta 1.0}
% set wplp [analysis plp $plp $wnodc]
arrayF:12

The RASTA coefficient (-lograsta in the listing above, shown with a value of 1.0) controls how much RASTA processing is applied. This value may vary between 0.0 (no RASTA) and 1.0 (full RASTA). For intermediate values, the output represents a mixture of both RASTA-filtered and unfiltered PLP cepstral coefficients. Figure 4.3 depicts the PLP spectrogram computed from a seventh-order PLP analysis of the speech waveform.


 
Figure 4.3:   Perceptual linear predictive (PLP) analysis of speech.

Mel-scale cepstral analysis (MEL)

Mel-scale cepstral analysis is very similar to perceptual linear predictive analysis of speech, where the short-term spectrum is modified based on psychophysically-based spectral transformations. In this method, however, the spectrum is warped according to the Mel scale, whereas in PLP the spectrum is warped according to the Bark scale. The main difference between Mel scale cepstral analysis and PLP is related to the output cepstral coefficients. The PLP model discussed above uses an all-pole model to smooth the modified power spectrum. The output cepstral coefficients are then computed based on this model. In contrast Mel scale cepstral analysis uses cepstral smoothing to smooth the modified power spectrum. This is done by direct transformation of the log power spectrum to the cepstral domain using an inverse discrete cosine transform (DCT).

Similar to PLP, Mel scale analysis has the option of using a RASTA filter to compensate for linear channel distortions.

 

% set mel [analysis mel initialize]
mel:13
% analysis mel info $mel
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-feats 7} 
{-filters 21} {-melres 100.0} {-preemphasis 0.98} {-exp 0.6} 
{-lograsta 1.0}
% set wmel [analysis mel $mel $wnodc]
arrayF:15

Figure 4.4 depicts the MEL spectrogram computed from an eighth-order MEL analysis of the speech waveform.


 
Figure 4.4:   Mel scale cepstral analysis of speech.

Relative spectra filtering (RASTA)

To compensate for linear channel distortions CSLUsh provides the ability to perform RASTA filtering. The RASTA filter can be used in either the log spectral or cepstral domains. In effect the RASTA filter band-passes each feature coefficient. Linear channel distortions appear as an additive constant in both the log spectral and cepstral domains. The high-pass portion of the equivalent band-pass filter alleviates the effect of convolutional noise introduced in the channel. The low-pass filtering helps in smoothing frame-to-frame spectral changes. The analysis rastafilter command is used for this purpose.

   

% set rasta [analysis rastafilter initialize 12 -rasta 1.0]
rasta:16
% set wrasta [analysis rastafilter $rasta $wlpc]
arrayF:17
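
The effect on a constant channel offset can be demonstrated with a pure-Tcl sketch of a RASTA-style band-pass filter applied to a single feature trajectory. The coefficients below are the ones commonly quoted for the classic RASTA filter and are an assumption here; the filter implemented by analysis rastafilter may differ, and the procedure name rastaTrack is hypothetical.

```tcl
# Hypothetical sketch of a RASTA-style band-pass filter applied to one
# feature trajectory:
#   y[t] = 0.1*(2x[t] + x[t-1] - x[t-3] - 2x[t-4]) + pole*y[t-1]
# The coefficients are an assumption, not taken from CSLUsh.
proc rastaTrack {track {pole 0.98}} {
    set hist {0.0 0.0 0.0 0.0}   ;# x[t-1] x[t-2] x[t-3] x[t-4]
    set prevY 0.0
    set out {}
    foreach x $track {
        lassign $hist x1 x2 x3 x4
        set y [expr {0.1 * (2.0*$x + $x1 - $x3 - 2.0*$x4) + $pole * $prevY}]
        lappend out $y
        set prevY $y
        set hist [list $x $x1 $x2 $x3]
    }
    return $out
}

# A constant trajectory models a fixed channel offset in the log domain.
set constant {}
for {set t 0} {$t < 600} {incr t} { lappend constant 3.0 }
set filtered [rastaTrack $constant]
puts "first output: [lindex $filtered 0]"
puts "last output:  [lindex $filtered end]"
```

Because the numerator coefficients sum to zero, a constant input produces only a start-up transient that decays towards zero, which is how the high-pass portion suppresses a fixed additive offset.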

Computing the first-order derivative

Another useful signal processing technique when studying robust features is the time derivative of the feature vector. The analysis delta command computes the first-order time derivative of an input feature vector sequence. Higher-order derivatives can be obtained through successive calls of the analysis delta command.

   

% set delta [analysis delta initialize 12 -order 2]
delta:18
% set wdelta [analysis delta $delta $wlpc]
arrayF:19

The optional -order argument in the initialize call specifies the order of the numerical approximation used for computing the time derivative of the input feature vector sequence.
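
One common numerical approximation of the time derivative is a regression over a few neighbouring frames. The pure-Tcl sketch below illustrates it for a single feature track. It assumes that the order is the number of frames used on each side of the current frame, which may not match the meaning of -order in analysis delta, and the procedure name deltaTrack is hypothetical.

```tcl
# Hypothetical sketch: regression-based time derivative of one feature
# track. "order" is taken here to mean the number of frames used on each
# side, which is an assumption about the -order option.
proc deltaTrack {track {order 2}} {
    set last [expr {[llength $track] - 1}]
    set norm 0.0
    for {set k 1} {$k <= $order} {incr k} {
        set norm [expr {$norm + 2.0 * $k * $k}]
    }
    set out {}
    for {set t 0} {$t <= $last} {incr t} {
        set sum 0.0
        for {set k 1} {$k <= $order} {incr k} {
            # Clamp indices at the edges by repeating the end frames.
            set hi [expr {min($t + $k, $last)}]
            set lo [expr {max($t - $k, 0)}]
            set sum [expr {$sum + $k * ([lindex $track $hi] - [lindex $track $lo])}]
        }
        lappend out [expr {$sum / $norm}]
    }
    return $out
}

# A ramp that rises by 0.5 per frame.
set ramp {}
for {set t 0} {$t < 10} {incr t} { lappend ramp [expr {0.5 * $t}] }
set slopes [deltaTrack $ramp]
puts $slopes
```

Away from the edges the estimated derivative of the ramp equals its slope of 0.5 per frame; the first and last frames give smaller values because of the edge clamping.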

Figure 4.5 depicts the magnitude of the first order derivative calculated from the LPC cepstrum obtained earlier in the chapter.


 
Figure 4.5:   Magnitude of first order derivative, calculated from the LPC cepstrum.

Energy normalization

One of the problems of working with the energy in a speech signal is that it typically has a great deal of variance. This includes variance due to the loudness of the speaker and the recording, as well as variance in signal energy between different phoneme sounds. The energy normalization algorithm in CSLUsh is used to reduce this variance, so that neither loud nor soft speech recordings affect the underlying recognition.

The energy coefficient is normalized using an automatic gain control filter (AGC), with a look-ahead buffer of 160 msec. The normalization is performed using a variable gain amplifier in which the gain is controlled by a peak detector on the energy feature. The peak detector has a decay factor of 0.999 and also includes a limiter to prevent excessive gain during silence. Currently the energy normalization leads to an inherent delay of 160 msec within a pipelined recognition process.

The analysis energynorm command implements the algorithm described above. This command operates on the zeroth coefficient of the input feature object.

     

% set enorm [analysis energynorm initialize -coeffs 12]
enorm:20
% analysis energynorm info $enorm
{-lookahead 160.0} {-framesize 10.0} {-decay 0.999} {-filt 0.1} 
{-maxest 4.0} {-coeffs 12}
% set wenorm [analysis energynorm $enorm $wlpc]
arrayF:21
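
The idea of a peak detector with a decay factor and a limiter can be sketched in a few lines of plain Tcl. This is only an illustration of the principle described above: the procedure name agcSketch, the floor argument used as the limiter, and the division of each energy value by the tracked peak are all assumptions, and the look-ahead buffer is left out entirely. It is not the algorithm implemented by analysis energynorm.

```tcl
# Sketch only: peak-tracking gain control on a list of frame energies.
# agcSketch, its floor argument and the division by the peak are
# assumptions for illustration; look-ahead buffering is omitted.
proc agcSketch {energies {decay 0.999} {floor 1.0}} {
    set peak $floor
    set out {}
    foreach e $energies {
        # Peak detector: follow rises immediately, decay slowly otherwise.
        set peak [expr {max($e, $peak * $decay)}]
        # Limiter: the reference never drops below the floor, so that
        # silence is not amplified excessively.
        set ref [expr {max($peak, $floor)}]
        lappend out [expr {double($e) / $ref}]
    }
    return $out
}

# A fast decay (0.5) is used here only to make the effect visible.
puts [agcSketch {2.0 8.0 4.0 0.5} 0.5 1.0]
```

With a decay close to 1.0, as in the default of 0.999 reported by the info call, the tracked peak falls very slowly, so short pauses do not cause the gain to jump.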

A Quick recap






Session:

% foreach x {Analysis Prep Wave} {
  package require $x 1.0
}
% set w [wave read yesno.wav]
wave:0
% set nodc [prep !dc initialize]
dcrm:1
% set rfft [analysis realfft initialize -framesize 3.0 -hamming]
fft:2
% set wnodc [prep !dc $nodc $w]
wave:3
% set wfft [analysis realfft $rfft $wnodc]
arrayF:5
% set wspec [analysis pspec $rfft $wfft]
arrayF:6
% set lpc [analysis lpc initialize]
lpc:7
% analysis lpc info $lpc
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 12} 
{-feats 12} {-exp 0.6} {-preemphasis 0.98}
% set wlpc [analysis lpc $lpc $wnodc] 
arrayF:9
% set plp [analysis plp initialize]
plp:10
% analysis plp info $plp
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-order 7} 
{-feats 8} {-filters 17} {-exp 0.6} {-lograsta 1.0}
% set wplp [analysis plp $plp $wnodc]
arrayF:12
% set mel [analysis mel initialize]
mel:13
% analysis mel info $mel
{-framesize 10.0} {-windowsize 10.0} {-rate 8000.0} {-feats 7} 
{-filters 21} {-melres 100.0} {-preemphasis 0.98} {-exp 0.6} 
{-lograsta 1.0}
% set wmel [analysis mel $mel $wnodc]
arrayF:15
% set rasta [analysis rastafilter initialize 12 -rasta 1.0]
rasta:16
% set wrasta [analysis rastafilter $rasta $wlpc]
arrayF:17
% set delta [analysis delta initialize 12 -order 3]
delta:18
% set wdelta [analysis delta $delta $wlpc]
arrayF:19
% set enorm [analysis energynorm initialize -coeffs 12]
enorm:20
% analysis energynorm info $enorm
{-lookahead 160.0} {-framesize 10.0} {-decay 0.999} {-filt 0.1} 
{-maxest 4.0} {-coeffs 12}
% set wenorm [analysis energynorm $enorm $wlpc]
arrayF:21

Phoneme Probability Estimator  

The goal of a speech recognition algorithm is to map a speech signal to the words which were spoken. Many different approaches to this problem have been presented by the speech research community. CSLUsh recognizes phonemes (basic sounds of speech), and then models words as sequences of phonemes. In this chapter we will learn how to define phoneme probability estimators within the CSLUsh environment. Although we attempt to keep things simple, some knowledge regarding modeling of speech units will be useful when defining phoneme probability estimators. The transcript for the examples presented can be found in the file names.tcl.

Phoneme probability definition

Model definition

Each output of the phoneme probability estimator corresponds to a model. The collection of models describes the relation between the word pronunciation in terms of the base symbol set and the outputs of the particular phoneme probability estimator. In the simplest case each phoneme is represented by a corresponding context-independent model, or monophone. However, there can be a wide range of variation in the way that phonemes are realized, and much of this variation depends on context. As a result, monophones are poor discriminators. One way to improve discrimination is to use context-dependent models. Such models are called triphones if they account for both left and right context, or biphones if they account for either the left or the right context but not both. Currently CSLUsh supports only biphone modeling.

When defining the model set, each model is defined by its nucleus (phoneme), and optional left or right context.

          phoneme = char{char}

Here char represents any character except one of the meta characters <> $ = ;. These may, however, be escaped using a single backslash. The syntax above indicates that a phoneme is either a single character or a concatenation of multiple characters. Throughout the syntactical definition of the phoneme probability estimator, {} braces indicate that the enclosed item may be repeated, so that a construct is either a single item or a concatenation of multiple similar items.

The left or right context of the model may be defined either as a single phoneme, or as a list of phonemes.

          phoneme_list = phoneme{`` '' phoneme}

Quotes (``'') in the syntax description refer to a string literal. In this case phoneme_list may either be a single phoneme or a list of phonemes delimited by any white space character (indicated by `` '').

Variables are defined by a leading $ character. A variable defines a list or grouping of phonemes which can be used to describe the left or right context of a particular biphone model.

          name = char{char}
          variable = ``$''name
          context = variable `` = '' phoneme_list ``;''

A biphone model may therefore be dependent on a particular phoneme to the left or right of the nucleus (center phoneme) or on a group of phonemes defined using variables. The following syntax can then be used to define a particular model.

          model = ``<'' phoneme ``>''      |
                  phoneme ``<'' phoneme    |
                  phoneme ``>'' phoneme    |
                  variable ``<'' phoneme   |
                  phoneme ``>'' variable

The angle brackets <> denote either the left or the right context. Context-independent models are written with angle brackets on both sides of the center phoneme. Using this syntax a phoneme can be modeled as (a) a one-part model (context independent, left dependent or right dependent), (b) a two-part model (left biphone followed by a right biphone) or (c) a three-part model (left biphone followed by a context-independent model followed by a right biphone).
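
To make the three model forms concrete, the following plain-Tcl sketch classifies a model name as context independent, left dependent or right dependent and returns its nucleus and context. The procedure modelKind is a hypothetical helper written for this illustration, not a CSLUsh command, and it deliberately ignores escaped meta characters such as \>.

```tcl
# Sketch only: classify a model name written in the syntax above.
# modelKind is a hypothetical helper; escaped meta characters (such
# as \>) are not handled here.
proc modelKind {model} {
    # <A>   : context-independent model
    if {[regexp {^<([^<>]+)>$} $model -> nucleus]} {
        return [list independent $nucleus {}]
    }
    # x<A   : left context x, nucleus A
    if {[regexp {^([^<>]+)<([^<>]+)$} $model -> ctx nucleus]} {
        return [list left $nucleus $ctx]
    }
    # A>x   : nucleus A, right context x
    if {[regexp {^([^<>]+)>([^<>]+)$} $model -> nucleus ctx]} {
        return [list right $nucleus $ctx]
    }
    error "not a valid model: $model"
}

foreach m {<.pau> {$sil<f} {f>$fnt} j>E} {
    puts "$m -> [modelKind $m]"
}
```

Each result is a three-element list of kind, nucleus and context, where the context may be a single phoneme or a variable such as $sil.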

The complete description of the outputs of the phoneme probability estimator is defined by a list of models, where the order of the list corresponds to the order of the outputs of the particular phoneme probability estimator.

         model_list  = model{`` '' model}
         estimator = ``define '' model_list ``;''

Multiple define commands may be used throughout the probability estimator definition file. The final set of models will then be defined as the concatenation of each list of models defined by the separate instances of the define command.

Duration modeling

  In the Viterbi search, the length of a particular phoneme determines how important it is to the overall score. Because they are only a short component of the path, very short phonemes influence the score less than long ones, which can lead to misrecognitions. A significant reduction in errors results if we impose minimum and maximum durations for phonemes, although too-long phonemes are a less frequent source of errors.

Rather than apply absolute minimums or maximums, our recognizers impose a per-frame penalty on the score for segments which are too long or too short. This gives the recognizer some flexibility in overcoming poor word models or sloppy articulation where a segment may really be missing. These duration limits are derived from averaging multiple samples of the phonemes. The following syntax defines a duration model.

          mindur = digit{digit}
          maxdur = digit{digit}
          duration = phoneme mindur maxdur   |
                     model   mindur maxdur

Minimum and maximum durations are specified in milliseconds. Model durations which are not specified are calculated from the base phoneme duration models using the transformation rules described in table 5.1. Each row in the table represents one of the possible combinations of parts for the particular model type. For example, a two-part phoneme may be represented as (a) a left-dependent part followed by a context-independent part, (b) a context-independent part followed by a right-dependent part, or (c) a left-dependent part followed by a right-dependent part.


 
Table 5.1:   Duration model rules for calculating duration models from base phoneme durations.

              left dependent   context independent   right dependent
model type    min     max      min     max           min     max
one part      -       -        1.0     1.0           -       -
              1.0     1.0      -       -             -       -
              -       -        -       -             1.0     1.0
two part      0.5     0.6      -       -             0.5     0.6
              0.5     0.5      0.5     0.5           -       -
              -       -        0.5     0.5           0.5     0.5
three part    0.5     0.4      0.5     0.5           0.5     0.4

The table reads as follows. For the two-part phoneme case, the first line describes a phoneme whose first part is a left-dependent model (x<A) and whose second part is a right-dependent model (A>x). Based on table 5.1, the durations computed for these models would therefore be:

          min(x<A) = 0.5 * min(A)      max(x<A) = 0.6 * max(A)
          min(A>x) = 0.5 * min(A)      max(A>x) = 0.6 * max(A)

The following two models are a bit unconventional. The second line describes a phoneme whose first part is a left-dependent model (x<A) and whose second part is a context-independent model (<A>). The duration models computed for these models would therefore be:

          min(x<A) = 0.5 * min(A)      max(x<A) = 0.5 * max(A)
          min(<A>) = 0.5 * min(A)      max(<A>) = 0.5 * max(A)

Finally, the third line describes a phoneme whose first part is a context-independent model (<A>) and whose second part is a right-dependent model (A>x). The durations computed for these models would therefore be:

          min(<A>) = 0.5 * min(A)      max(<A>) = 0.5 * max(A)
          min(A>x) = 0.5 * min(A)      max(A>x) = 0.5 * max(A)

In order to fully specify the models described, a duration model needs to be specified at least for each of the unique phoneme nuclei. If minimum and maximum durations are specified for some of the predefined models, then these durations override the default durations which are calculated from the base phoneme minimum and maximum durations. The list of duration models is specified using the duration command.

          duration_list    = duration{`` '' duration}
          duration_models  = ``duration '' duration_list ``;''

Multiple duration commands may be used throughout the phoneme probability description file. The final set of duration models will then be defined as the concatenation of each set of duration models defined.
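
The scaling rules of table 5.1 amount to multiplying the base minimum and maximum durations by a pair of factors. The following plain-Tcl sketch applies the first two-part row of the table (left-dependent part followed by right-dependent part) to one base phoneme. The procedure name scaleDuration and the dictionary layout are assumptions made for this illustration. The base values for E are the ones used in the yes/no recognizer description in this chapter. The sketch needs Tcl 8.5 or later for lassign and dict.

```tcl
# Sketch only: derive part durations from base phoneme durations using
# multipliers from Table 5.1. scaleDuration and the layout of the rule
# dictionary are assumptions for illustration.

# Two-part phoneme: left-dependent part followed by right-dependent part.
# Each entry holds {min-factor max-factor}.
set rule {left {0.5 0.6} right {0.5 0.6}}

proc scaleDuration {base factors} {
    lassign $base bmin bmax
    lassign $factors fmin fmax
    return [list [expr {$fmin * $bmin}] [expr {$fmax * $bmax}]]
}

# Base duration of E in milliseconds: minimum 40, maximum 226.
set baseE {40 226}
dict for {part factors} $rule {
    puts "$part part of E: [scaleDuration $baseE $factors]"
}
```

A duration given explicitly for a model in a duration command would take precedence over a value derived this way.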

Model tying

When designing phoneme probability estimators it sometimes happens that certain models need to be defined, but cannot be trained due to a lack of training data. The tie command can then be used to tie such a model to a model having similar characteristics. The tie command ties a model to a model, or a list of models to a model.

          tied_models = ``tie '' model  model_list ``;''

The list of models indicated by model_list will be tied to the single model indicated by model. As with the define and duration commands, multiple tie commands may be used throughout the phoneme probability description file. The final set of tied models is then the concatenation of the ties defined by each command.

Phoneme mapping

Phoneme mapping allows the complete sharing of models defined with the same nuclei. For example, all models defined using the phoneme ``I_x'' will have much the same characteristics as models defined using the phoneme ``I'', with the main difference relating to the underlying duration models of each. Using the map command we only need to define models for one of the symbols (say ``I''), and then all models can be duplicated for ``I_x'' by mapping ``I_x'' to ``I''.

The map command maps either a phoneme to a phoneme, or a list of phonemes to a phoneme.

         mappings = ``map '' phoneme phoneme_list ``;''

Multiple map commands may be used throughout the probability names description file. The final list of phoneme mappings is formed as the concatenation of all mappings defined by each of the individual map commands.

A simple yes/no recognizer

This section describes the process involved in defining the base word pronunciations for the words ``yes'' and ``no'' in terms of a simple context independent phoneme recognizer which can recognize only the phonemes /silence j E s n oU/. Since all of these phonemes are context independent, each phoneme is uniquely described as consisting of only one part. For context dependent modeling, phonemes are typically described as having two or even three parts.

The recognizer describe command creates a CSLUsh object which contains all the information needed to describe word pronunciations in terms of the output categories of the specific phoneme classifier. This object will be referred to as the probability estimator names description object, or names object for short.

 

% package require Word 1.0
1.0
% set description {
define <.pau> <j> <E> <s> <n> <oU>;
duration  .pau 49 5000    
          j    18 177
          E    40 226
          s    49 270
          n    28 193
          oU   59 397;
}
        .
        .
% set rr [recognizer describe $description]
probname:0

A complete listing of duration models for the base worldbet symbols can be found in the general purpose recognizer description file (genrecog.desc).

In the example above, the names object consists of the phonemes /.pau j E s n oU/, with corresponding neural network category names /<.pau> <j> <E> <s> <n> <oU>/. This object can be used to create word models which are described in terms of the neural network output classes rather than the base worldbet symbols.

     

% set Words {
{yes "j E s"}
{no  "n oU"}
}
        .
        .
% set ww [word create $Words]
word:1 word:2
% set wx [word context $ww $rr]
word:3 word:4
% word pronun [lindex $wx 0] $rr
{<j> <E> <s>}
% word pronun [lindex $wx 1] $rr
{<n> <oU>}

A bit more complicated example

Things become a little more complicated when dealing with context-dependent phonetic modelling. The file genrecog.desc contains the model definitions needed for the general purpose recognizer supplied with CSLUsh. We will refer to this file to explain the process involved.

The first phase consists of deciding how each phoneme will be described, namely which are one-part phonemes, which are two-part phonemes, and so on. This depends very much on the design of the phoneme probability estimator. For our purposes we have chosen to model all English phonemes as a collection of one-part (i.e. context-independent), two-part, three-part and also right-dependent phonemes. Right-dependent phonemes are phonemes which are influenced only by the phonemes or context to their right.


 
Table 5.2:   General purpose recognizer design.

one part          .pau .br bc dc gc tc pc kc tSc dZc
two part          f v T D s z Z S h m= n= N N= m n
                  l= d_( th_( l 9r j w
three part        I_x I i: E @ A > & ^ U_x U u u& &r
                  3r e& ei >i aI aU oU
right dependent   b d g th ph kh tS dZ

For each of the context-dependent phonemes, we need to decide upon the left and right context of each phoneme. In the most specific case we would have each phoneme dependent on each other phoneme. This is, however, not necessary. Studying the contextual influences of phonemes, we find that phonemes can be grouped according to their influence on the phoneme in question. For example, in the CSLUsh recognizer the phoneme /A/ has three parts: a left context part, a context-independent part and a right context part. For both the left and the right context we find that the phonemes /m n N N= n=/ all have roughly the same contextual influence on the base phoneme /A/. These can therefore be grouped together, as is done in the general purpose CSLUsh recognizer.

$sil = .pau pc kc tc tSc dZc .garbage;          /* silence models */
$bck = \>  A  aU oU w  l l= U U_x;              /* back vowels    */
$mid = E  @  ^  &;                              /* mid vowels     */
$fnt = u  i: I_x I ei aI \>i j;                 /* front vowels   */
$ret = 9r 3r &r;                                /* retro flex     */
$nas = m n N N= n=;                             /* nasals         */
$obs = ph kh th h s z Z S T f  ts dZ;           /* obstruents     */
$wev = dZc bc dc gc b d g D t( d( d_( th_( v;   /* what ever      */
        .
        .

In many cases we find that some phonemes are very hard to recognize as independent entities. They tend to have the same characteristics as other, very similar phonemes, and the neural network therefore has a tough time distinguishing among them. One of the design steps is to identify these phonemes and then map them to some central representative phoneme. This information is also needed when describing the word pronunciation models according to the specific phoneme probability estimator. For example, all voiced closures /bc dc gc dZc/ can be mapped to some general voiced closure /vc/.

        .
        .
        .
/* map some phonemes to what we have */
map vc  bc dc gc dZc;                           /* voiced closure   */
map uc  pc tc kc tSc;                           /* unvoiced closure */
map A   \>;
map l   l;
map n   n= N N=;
map m   m=;
map ^   &;
map 3r  &r;
map s   Z;
map A   5;
        .
        .
        .

The next phase is to define the output probability names according to the syntax specified above. For our general purpose recognizer, the neural network output class names are defined as follows:

        .
        .
        .
/* now define the outputs of the nnet */
define  <.pau>  <.br>  <vc>  <uc>;              
define  $obs<f  $fnt<f  $bck<f  $sil<f  $wev<f  $nas<f  $mid<f  $ret<f
        f>$fnt  f>$bck  f>$sil  f>$wev  f>$nas  f>$mid  f>$obs  f>$ret;

define  $obs<v  $fnt<v  $bck<v  $sil<v  $wev<v  $nas<v  $mid<v  $ret<v
        v>$fnt  v>$bck  v>$sil  v>$wev  v>$nas  v>$mid  v>$obs  v>$ret;

define  $obs<T  $fnt<T  $bck<T  $sil<T  $wev<T  $nas<T  $mid<T  $ret<T
        T>$fnt  T>$bck  T>$sil  T>$wev  T>$nas  T>$mid  T>$obs  T>$ret;

define  $obs<D  $fnt<D  $bck<D  $sil<D  $wev<D $nas<D  $mid<D  $ret<D
        D>$fnt  D>$bck  D>$sil  D>$wev  D>$nas  D>$mid  D>$obs  D>$ret;
        .
        .
        .

We also need to define the set of duration models for each of our models defined above. In this case we define the duration models for the underlying base phonemes, and rely on the automatic table translation to compute the duration models for the models specified.

        .
        .
        .
/* specify base duration models */
duration .garbage   49  5000
         .pau       10  5000
         dZc        20  261
         bc         23  172
         dc         15  153
         gc         20  163
         pc         30  163
         tc         20  168
         kc         25  159
         tSc        20  168 
         f          37  247
         v          26  172
        .
        .
        .

Given all these information sources we are now ready to define the probability names description object.

 

% set fh [open genrecog.desc]; set desc [read $fh]; close $fh
% set genrecog [recognizer describe $desc]
probname:5

These data are then used when defining word pronunciation models in terms of the neural network output category names.

     

% set wx [word context $ww $genrecog]
word:6 word:7
% word pronun [lindex $wx 0] $genrecog
{{$sil<j} {j>$mid} {$fntl<E} <E> {E>$obs} {$mid<s} {s>$sil}}

Model tying - the digit recognizer

The file digit.desc contains the model definitions needed for the continuous digit recognizer supplied with the CSLUsh development environment. With the exception of the following groupings

$sil  = .pau tc .garbage;
$denl = s ks th;
$denr = s th;
the digit recognizer was designed using full context-dependent models for each of the phonemes which occur in continuous digit strings. Full context-dependent modeling consists of a set of biphones where each context of the center phoneme is a single phoneme, rather than a grouping of phonemes as found in the previous example.

        .
        .
        .
define  f<\>r  <\>r>  \>r>$sil  \>r>T  \>r>ei  \>r>f  \>r>w  \>r>z  \>r>oU
        \>r>n  \>r>$denr;

define  v<^2  ^2>n  n<^3  ^3>$sil  ^3>ei  ^3>f  ^3>w  ^3>oU;
        .
        .

Note that the models are defined using a combination of single phonemes as either the left or right context as well as variables denoting phoneme groupings.

Full context-dependent modeling, however, has the disadvantage that not all possible context-dependent models may be found in the training set. The tie command may therefore be used to tie models for which no data could be found in the training set to models which are similar enough and for which sufficient training data could be found.

        .
        .
/* define here all models which are not in nnet and tie them to 
   the silence context dependent model */
define  ^3<z
        ^3<T
        ^3<s
        ^3<n;

define  ^3>z
        ^3>T
        ^3>s
        ^3>n;

tie $sil<z   ^3<z;
tie $sil<T   ^3<T;
tie $sil<s   ^3<s;
tie $sil<n   ^3<n;

tie ^3>$sil  ^3>z ^3>T ^3>s ^3>n;
        .
        .

In the digit recognizer we chose to tie all models which need to be defined, but which do not have sufficient training data, to the equivalent silence-dependent model.






Session:

% package require Word 1.0
1.0
% set description {
define <.pau> <j> <E> <s> <n> <oU>;
duration  .pau 49 5000    
          j    18 177
          E    40 226
          s    49 270
          n    28 193
          oU   59 397;
}
        .
        .
% set rr [recognizer describe $description]
probname:0
% set Words {
{yes "j E s"}
{no  "n oU"}
}
        .
        .
% set ww [word create $Words]
word:1 word:2
% set wx [word context $ww $rr]
word:3 word:4
% word pronun [lindex $wx 0] $rr
{<j> <E> <s>}
% word pronun [lindex $wx 1] $rr
{<n> <oU>}
% set fh [open genrecog.desc]; set desc [read $fh]; close $fh
% set genrecog [recognizer describe $desc]
probname:5
% set wx [word context $ww $genrecog]
word:6 word:7
% word pronun [lindex $wx 0] $genrecog
{{$sil<j} {j>$mid} {$fntl<E} <E> {E>$obs} {$mid<s} {s>$sil}}

Word Models, Lexical trees and Grammars  

In this chapter we explore various issues regarding word pronunciation models, lexical trees and grammar searches. We will be looking at keyword spotting using lexical trees and also continuous speech recognition using a finite-state-based grammar search. For the purpose of doing recognition we use our general purpose English recognizer genrecog. The transcript for the examples presented can be found in the file chapter6.tcl.

Word pronunciation models

To build a recognizer for a given set of words requires knowledge of how the words are pronounced. For example the word ``greeting'' would be pronounced phonemically as follows:


   
greeting [gc] g 9r i: (d( | (tc th)) i: N
   

Here the pronunciation model takes the format


a [((b c) | d | e)] f

This represents a word pronounced with symbol a, optionally followed by one of (1) b c (grouped with ()'s), (2) d or (3) e, followed by f. Thus [ ] means optional, () groups symbols and | denotes alternatives within the grouping specified. A phoneme may be formed as a concatenation of any valid characters, except the meta characters [ ] (). Meta characters may, however, be escaped using a single backslash. The above example would therefore expand into the following pronunciation models.


   
greeting(1) gc g 9r i: d( i: N
greeting(2) gc g 9r i: tc th i: N
greeting(3) g 9r i: d( i: N
greeting(4) g 9r i: tc th i: N
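
The expansion of optional symbols can be sketched in plain Tcl. The sketch below handles only the simplest case, a single symbol written as [x] and separated from its neighbours by white space. Grouping with () and alternatives with | are not handled, and expandOptional is a hypothetical helper for illustration, not part of the Word package.

```tcl
# Sketch only: expand optional symbols written as [x] (one symbol per
# bracket pair, separated from its neighbours by white space).
# Grouping with () and alternatives with | are not handled.
# expandOptional is a hypothetical helper, not a CSLUsh command.
proc expandOptional {pronun} {
    set variants [list {}]
    foreach sym [split [string trim $pronun]] {
        if {$sym eq ""} continue
        set next {}
        if {[regexp {^\[(.+)\]$} $sym -> inner]} {
            foreach v $variants {
                lappend next [concat $v [list $inner]]   ;# symbol present
                lappend next $v                          ;# symbol omitted
            }
        } else {
            foreach v $variants {
                lappend next [concat $v [list $sym]]
            }
        }
        set variants $next
    }
    return $variants
}

foreach v [expandOptional {ei [pc] ph 9r I l}] {
    puts $v
}
```

Each optional symbol doubles the number of variants, so a pronunciation with three optional symbols expands into eight pronunciation models.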
   

Given the word pronunciations for each of the words in our vocabulary, we create word-model objects using the word create command. Word-model objects describe the word pronunciations in a more machine-accessible form.

 

% package require Word 
1.0
% set Words {
 {january   {dZ @ n [j] u 3r i:}}
 {february  {f E [bc] b [9r] [j] u 3r i:}}
 {march     {m A 9r tSc tS}}
 {april     {ei [pc] ph 9r I l}}
 {may       {m ei}}
 {june      {dZ u n}}
 {july      {dZ u l aI}}
 {august    {A gc g ^ s tc th}}
 {september {s E pc [ph] tc th E m bc b 3r}}
 {october   {A kc [kh] tc th oU bc b 3r}}
 {november  {n oU v E m bc b 3r}}
 {december  {dc d i: s E m bc b 3r}}
}
	.
	.
% set wordList [word create $Words]
word:0 word:1 word:2 word:3 word:4 word:5 
word:6 word:7 word:8 word:9 word:10 word:11

 
Figure 6.1:   Graphical representation of a word model object for the word ``april''.

Figure 6.1 depicts graphically the word model for the word ``april''. The word-model object created by the word create command describes the pronunciation of the word. The base symbols are typically drawn from the worldbet symbol set (see the appendix). Any symbol is valid, as long as the corresponding model names are described using the same symbol set.

These word objects are, however, not directly usable when building a search engine such as a lexical tree or a grammar search. First the word pronunciations need to be described in terms of the phoneme probability-estimator symbol set. The word context command is used for this purpose. Its inputs are a list of word-model objects created using word create, and the recognizer-description object. The recognizer-description object (chapter 5) describes the relation between word pronunciations and its own symbol set. The following example shows how to generate a list of word-model objects in terms of the symbol set of the general purpose recognizer supplied with the CSLUsh development environment.

     

% package require Genrecog
3.0
% genrecog initialize recog
CREATE
% set expandList [word context $wordList $recog(names)]
word:22 word:23 word:24 word:25 word:26 word:27
word:28 word:29 word:30 word:31 word:32 word:33

These word models created with the word context command are often referred to as context-expanded word models, since the base word pronunciation models are expanded to include the phoneme probability-estimator phonemes, which are typically context-dependent.


 
Figure 6.2:   Graphical representation of the context expanded word model object for the word ``april.''

Figure 6.2 depicts the expanded word-model object for the word ``april.'' Once the word models have been expanded such that the word pronunciations are in terms of phoneme-probability estimator symbols, they can be used to build either a lexical tree or a finite-state grammar search.

Lexical trees

In a lexical tree, nodes or states in the Viterbi search are shared among words. A lexical tree consisting of the words

 {"june"       "dZ u n"}
 {"july"       "dZ u l aI"}

would combine the section that the words have in common (here the initial phonemes dZ u), such that the search is described as depicted in Figure 6.3.


 
Figure 6.3:   Graphical representation of a lexical tree.

Since the number of unique word-initial states in a lexical tree is limited to a closed set of phonemes, independent of the number of words in the vocabulary, this sharing greatly reduces both memory and computational requirements.
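
The sharing of word-initial states can be illustrated with a small prefix tree built in plain Tcl. This sketch is independent of the tree build command and says nothing about how CSLUsh stores its search structure. The helpers addWord and countNodes, and the use of the key =word to mark the end of a word, are assumptions for this example. It uses the dict command and {*} expansion, so it needs Tcl 8.5 or later.

```tcl
# Sketch only: share common prefixes of pronunciations in a nested dict.
# addWord and countNodes are hypothetical helpers; the key "=word" marks
# the end of a word and is an assumption of this example.
proc addWord {treeVar word pronun} {
    upvar 1 $treeVar tree
    # Each phoneme becomes one level of nesting; shared prefixes reuse
    # the levels that already exist.
    dict set tree {*}$pronun =word $word
}

proc countNodes {tree} {
    set n 0
    dict for {key sub} $tree {
        if {$key eq "=word"} continue
        incr n [expr {1 + [countNodes $sub]}]
    }
    return $n
}

set tree [dict create]
addWord tree june {dZ u n}
addWord tree july {dZ u l aI}
puts "nodes in tree: [countNodes $tree]"
```

The two pronunciations contain seven phonemes in total, but the tree holds only five nodes because the initial dZ u is stored once.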

Because the current implementation of the lexical tree can only operate as a keyword spotter, it does need some way of distinguishing out-of-vocabulary speech events. Given the above vocabulary description, we would like the ability to recognize the spoken word even if it was embedded in a continuously spoken sentence. For example if a person said ``I was born in January'', then we would expect to recognize the word ``january'' within the context of the rest of the words spoken. For this purpose we use the following two methods of background speech modelling.

An any word is a special type of word in which each state or phoneme may transition to any other state or phoneme in the word. Figure 6.4 depicts an any model consisting of the phonemes /.pau/, /s/ and /n/. Since any state or phoneme can transition to any other within the any model, this particular model is effective in handling background silence, background hissing, which typically sounds like an /s/ phoneme, and background humming sounds, which are modelled using the /n/ phoneme.


 
Figure 6.4:   Graphical representation of the any model used in a lexical tree search for keyword spotting.

 

% set AnyModel {
 {<.pau> 1.0}
 {<.any> 1.0}
}

The any model is specified by a list of phonemes and, for each phoneme, an associated state transition penalty. Since some phonemes in the any model may also be initial or final states in the vocabulary, it is necessary to penalize these transitions so they do not have the same score as when they enter or exit a word in the vocabulary.

Robustness is also greatly increased by using a garbage model. This means that the score for the background is more than the ``silence'' output of our phoneme probability estimator. Two simple garbage models currently used in CSLUsh are the median of the top N sorted phoneme probabilities and the maximum of a collection of phonemes likely to have high probabilities in noise. The ideal garbage model will have a better score in noise and out-of-vocabulary speech events than any of the vocabulary words.
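
The first of these garbage models, the median of the top N sorted probabilities, can be sketched in plain Tcl for a single frame. The procedure name garbageScore, the choice of N and the example probabilities are assumptions for illustration. How CSLUsh computes and applies the score internally is not shown here.

```tcl
# Sketch only: garbage score as the median of the top N phoneme
# probabilities of one frame. garbageScore is a hypothetical helper
# and N is an assumed parameter.
proc garbageScore {probs n} {
    set top [lrange [lsort -real -decreasing $probs] 0 [expr {$n - 1}]]
    set k [llength $top]
    if {$k == 0} { error "no probabilities given" }
    set mid [expr {$k / 2}]
    if {$k % 2} {
        # Odd count: the middle element is the median.
        return [lindex $top $mid]
    }
    # Even count: average the two middle values.
    set lo [lindex $top [expr {$mid - 1}]]
    set hi [lindex $top $mid]
    return [expr {($lo + $hi) / 2.0}]
}

# Made-up probabilities for one frame, N = 3.
puts [garbageScore {0.02 0.40 0.10 0.25 0.05 0.18} 3]
```

Because the median of the top N lies below the best-scoring phoneme, a clearly spoken vocabulary word can still outscore the garbage model, while in noise, where no phoneme stands out, the two scores are close.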

Given the Any model and the list of expanded word models, we are now ready to create our search structure, using the tree build command.

 

% set tree [tree build $expandList $AnyModel $recog(names) -nbest 2]
treesearch:34

The lexical tree search functions only as a keyword spotter, with the object of finding the most likely word within a spoken utterance. Typically it is not necessary to know exactly where in time the word occurred, but rather whether a word in the specified vocabulary was spoken or not, and which word it was. There are, however, cases where we wish to know exactly where the word occurred in time, and also what the corresponding phoneme alignment was. This is especially useful for automatic phoneme labeling or for word confidence measurements.

The lexical tree search has the capability of remembering the phoneme alignment of the requested N-best words recognized. Each time a state in a word is entered, this information can be stored on a backtrace heap. After recognition the best path is retrieved from the backtrace heap by tracing the state sequence back in time. A backtrace heap can be created as follows:

   

% set btheap [backtrace create 20000 20000]
btraceheap:35

Here the first value (20000) refers to the initial heap size and the second value to the heap growth size. Whenever the backtrace heap reaches full capacity it will grow according to the number of