There are five basic steps to performing recognition. Each step
will be explained briefly here and in more detail later in this
document.
First, we digitize the speech that we want to recognize; for
telephone speech the sampling rate is 8000 samples per second.
Second, features that represent the spectral-domain content of the
speech (regions of strong energy at particular frequencies) are
computed. These features are computed every 10 msec, with one
10-msec section being called a frame.
Third, a multi-layer perceptron (MLP, or neural network) is used
to classify a set of these features into phonetic-based categories
at each frame.
Fourth, a Viterbi search is used to match the neural-network scores
to the target words (words that may be contained in the input
speech) to determine the word that was most likely uttered.
Fifth, wordspotting is done and the confidence we have in the
top-scoring word is measured. The word is accepted if the confidence
is sufficiently high.
Now we'll go into a bit more detail.
The digitized waveform is converted into a spectral-domain
representation, and cepstral-mean-subtraction is done to remove
some of the effects of noise. Twelve mel-frequency cepstral
coefficients (MFCC coefficients) are computed, and then twelve
delta features that indicate the degree of change occuring at
this frame are also computed. The last features are energy and delta
energy, for a total of 12 + 12 + 2 = 26 features per frame.
We then would like to classify each frame into phonetic-based
categories, but before we do that we take a context window.
This means simply taking the frame of interest as well as frames
that are -60, -30, 30, and 60 msec away from the frame of interest.
This is done to take into consideration the dynamic nature of
speech: the identity of a phoneme will often depend not only on
the spectral features at one point in time, but it will also
depend on how the features change over time.
We then send the features in a context window to a neural network
for classification. The output of the neural network is a
classification of the input frame, measured in terms of the
probabilities of phoneme-based categories. By sending context
windows for all frames of speech to the neural network, we can
build a matrix of the probabilities of phoneme-based categories
over time. In this example of neural-network output, the word to be
recognized is two, and the darker regions in the t, t<u,
and u categories indicate the greater probabilities of those classes
at the indicated times.
The target words are then combined with the grammar (if any)
to produce a list of legal strings of categories. A Viterbi
search is used to find the best path through the matrix of
probabilities for each legal string. The string with the greatest
probability is the output of the Viterbi search.
After checking the confidence that this best-scoring word
or phrase is actually what was spoken, we output this word or
phrase.
So what categories should we have the neural network classify?
The obvious choice is one phoneme per category, so a word such as
"yes" would have three cateogries: /y/, /E/, and /s/. Early recognizers
were built in this way, but it was found that phonemes have such
a great influence on neighboring phonemes that the /E/ that follows
a /s/ may look different from the /E/ that follows, say, a /b/.
In order to account for these coarticulatory effects, the current
strategy is to split each phoneme into one, two, or three parts,
depending on how much that phoneme will be influenced by surrounding
phonemes. The phoneme /E/, for example, can be influenced by the
phonemes on both the right and left side, but it can also be long
enough in duration that the middle part is not much influenced by
surrounding phonemes. As a result, we split /E/ into three parts.
If we were to model every phoneme in the left and right contexts
of every other phoneme, there would be on the order of 5000 different
categories we need to classify. In order to simplify this situation,
we group categories that are similar into one of eight broad
contexts. For example, /y/ is spectrally similar to a front vowel,
and so it is assigned (along with other front vowels such as /iy/ and
/ih/) to the broad context "$front". The phoneme /s/ is a fricative,
and so it is assigned, along with /f/ and /sh/ to the broad context
"$fric". The dollar-sign notation is used when we are referring to
a broad context instead of a single phoneme. So, our model for the
/E/ in "yes" then becomes $front<E (/E/ in the context of a
preceding front vowel), <E> (/E/ in the middle with no contextual
effects), and E>$fric (/E/ in the context of a following fricative).
For a vocabulary-independent system, /E/ will be split into 17
categories: 8 categories for each preceding broad context, 8
categories for each following broad context, and one
context-indpendent category. This method results in 543 categories
for all American English phonemes, which is more computationally
feasible.
The neural network has 130 input nodes, one for each feature in the
5-frame context window. There are 200 hidden nodes, and 545 output
nodes (one output node for each phonetic-based category, and one
extra node called "garbage" that will be explained later).
The general-purpose network used at OGI was trained using
1500 samples from each category, with the data taken from the OGI
Numbers, Yes/No, Apple, and Stories corpora, as well as the NYNEX
PhoneBook corpus. One important thing to
note is that the outputs of the neural network are used as
estimates of posterior probability; in other words, we don't just
take the category with the highest score and say, "OK, at frame
42 we found the category y>$mid." Instead, the network is used
to estimate the probabilities of each category, so that we might
say, "OK, at frame 42 we found an 82% proability of y>$mid, a 7%
probability of y>$front, a 7% probability of iy>$mid, a 3%
probability of <iy>, and a close to zero probability for all other
categories." It has been shown by Bourlard and Wellekens (NIPS 89) and
Richard and Lippmann (Neural Computation 1991) that neural networks
can classify in this way given enough training data and hidden
nodes.
Once we have a matrix of how the phonetic probabilities change
with time, we want to search for the best word. Before doing that,
we need to compute the set of legal strings of phonetic categories.
This set is dependent on the words we want to recognize and
the possible order of words, so we combine pronunciation models
for each of our words (for example, "yes" = $sil<y, y>$mid,
$front<E, <E>, E>$fric, $mid<s, s>$sil) with a grammar
if we allow multiple
words to come in more than one order. In the example shown here,
we have a simple search path that can recognize only either "yes"
or "no" which have to be preceded and followed by silence. In
searching, when we look at a new frame, we transition to a new state
if the probability of the new category is greater than the
probability of the current category. We also perform this search
through all branches of the search path, finding the paths through
the matrix that maximize the score for each word in our vocabulary.
At the end of searching, we have scores for all words in our
vocabulary, and the word with the highest score is the word with
the best fit to the input data, and it is therefore the word with the
highest probability of being the word that was uttered.
This figure traces the search paths for "yes" and "no" through
a hypothetical matrix of probabilities. Even though the score
for "no" is very low, we still find the most probable path for
this word and compare it to the score for "yes" at the end of
searching.
Word-spotting allows us to find our target word or phrase
somewhere in the utterance, even if the person said, "Well,
let me think, ummm.... <target word> is what I'd like."
The default grammar for the recognition of a single word or phrase is
ANY -> <target word or phrase> -> ANYwhich means that we recognize "anything," followed by one of the words from our vocabulary, followed by "anything." "Anything" at some frame is defined as being the score for silence or the score of the 16th-best phonetic category at the current frame, whichever is greater. In effect, we are saying that we will consider the target word to begin when it's score is better than the score for all 16-th best frames. This provides a measurement that is not dependent on a fixed threshold (if the signal is noisy, the scores for the target words AND the 16th-best frames will decrease) or grammar (no assumptions are made about what the extraneous speech really is).
The value 16 was determined empirically and can be changed by
the user in the cslurp or cslush packages.
We have recently implemented confidence and rejection in our
general-purpose recognizer and cslurp.
The way it works is this:
The threshold has been determined by measuring the probabilities of
a Type I error (rejecting a correct word) and a Type II error
(accepting an incorrect word) in a large training corpus. The word
score for which the two error rates are equal is the threshold,
and it minimizes the probability of both types of errors. This
threshold can be adjusted in cslurp, and must be set by the user
in cslush. (This work in confidence was done by Don Colton for
his thesis.)