Speech Recognition Using Neural Networks

Speech Recognition Using Neural Networks

at the Center for Spoken Language Understanding

John-Paul Hosom, Ron Cole, Mark Fanty, and Don Colton

Center for Spoken Language Understanding (cslu )
Oregon Graduate Institute of Science and Technology
June 6, 1997


The Recognition Process


There are five basic steps to performing recognition. Each step will be explained briefly here and in more detail later in this document.

First, we digitize the speech that we want to recognize; for telephone speech the sampling rate is 8000 samples per second. Second, features that represent the spectral-domain content of the speech (regions of strong energy at particular frequencies) are computed. These features are computed every 10 msec, with one 10-msec section being called a frame. Third, a multi-layer perceptron (MLP, or neural network) is used to classify a set of these features into phonetic-based categories at each frame. Fourth, a Viterbi search is used to match the neural-network scores to the target words (words that may be contained in the input speech) to determine the word that was most likely uttered. Fifth, wordspotting is done and the confidence we have in the top-scoring word is measured. The word is accepted if the confidence is sufficiently high.


Overview


Now we'll go into a bit more detail.

The digitized waveform is converted into a spectral-domain representation, and cepstral-mean-subtraction is done to remove some of the effects of noise. Twelve mel-frequency cepstral coefficients (MFCC coefficients) are computed, and then twelve delta features that indicate the degree of change occuring at this frame are also computed. The last features are energy and delta energy, for a total of 12 + 12 + 2 = 26 features per frame.

We then would like to classify each frame into phonetic-based categories, but before we do that we take a context window. This means simply taking the frame of interest as well as frames that are -60, -30, 30, and 60 msec away from the frame of interest. This is done to take into consideration the dynamic nature of speech: the identity of a phoneme will often depend not only on the spectral features at one point in time, but it will also depend on how the features change over time.

We then send the features in a context window to a neural network for classification. The output of the neural network is a classification of the input frame, measured in terms of the probabilities of phoneme-based categories. By sending context windows for all frames of speech to the neural network, we can build a matrix of the probabilities of phoneme-based categories over time. In this example of neural-network output, the word to be recognized is two, and the darker regions in the t, t<u, and u categories indicate the greater probabilities of those classes at the indicated times.

The target words are then combined with the grammar (if any) to produce a list of legal strings of categories. A Viterbi search is used to find the best path through the matrix of probabilities for each legal string. The string with the greatest probability is the output of the Viterbi search.

After checking the confidence that this best-scoring word or phrase is actually what was spoken, we output this word or phrase.


Context-Dependent Modeling


So what categories should we have the neural network classify? The obvious choice is one phoneme per category, so a word such as "yes" would have three cateogries: /y/, /E/, and /s/. Early recognizers were built in this way, but it was found that phonemes have such a great influence on neighboring phonemes that the /E/ that follows a /s/ may look different from the /E/ that follows, say, a /b/. In order to account for these coarticulatory effects, the current strategy is to split each phoneme into one, two, or three parts, depending on how much that phoneme will be influenced by surrounding phonemes. The phoneme /E/, for example, can be influenced by the phonemes on both the right and left side, but it can also be long enough in duration that the middle part is not much influenced by surrounding phonemes. As a result, we split /E/ into three parts.

If we were to model every phoneme in the left and right contexts of every other phoneme, there would be on the order of 5000 different categories we need to classify. In order to simplify this situation, we group categories that are similar into one of eight broad contexts. For example, /y/ is spectrally similar to a front vowel, and so it is assigned (along with other front vowels such as /iy/ and /ih/) to the broad context "$front". The phoneme /s/ is a fricative, and so it is assigned, along with /f/ and /sh/ to the broad context "$fric". The dollar-sign notation is used when we are referring to a broad context instead of a single phoneme. So, our model for the /E/ in "yes" then becomes $front<E (/E/ in the context of a preceding front vowel), <E> (/E/ in the middle with no contextual effects), and E>$fric (/E/ in the context of a following fricative). For a vocabulary-independent system, /E/ will be split into 17 categories: 8 categories for each preceding broad context, 8 categories for each following broad context, and one context-indpendent category. This method results in 543 categories for all American English phonemes, which is more computationally feasible.


Neural Network


The neural network has 130 input nodes, one for each feature in the 5-frame context window. There are 200 hidden nodes, and 545 output nodes (one output node for each phonetic-based category, and one extra node called "garbage" that will be explained later). The general-purpose network used at OGI was trained using 1500 samples from each category, with the data taken from the OGI Numbers, Yes/No, Apple, and Stories corpora, as well as the NYNEX PhoneBook corpus. One important thing to note is that the outputs of the neural network are used as estimates of posterior probability; in other words, we don't just take the category with the highest score and say, "OK, at frame 42 we found the category y>$mid." Instead, the network is used to estimate the probabilities of each category, so that we might say, "OK, at frame 42 we found an 82% proability of y>$mid, a 7% probability of y>$front, a 7% probability of iy>$mid, a 3% probability of <iy>, and a close to zero probability for all other categories." It has been shown by Bourlard and Wellekens (NIPS 89) and Richard and Lippmann (Neural Computation 1991) that neural networks can classify in this way given enough training data and hidden nodes.


Viterbi Search


Once we have a matrix of how the phonetic probabilities change with time, we want to search for the best word. Before doing that, we need to compute the set of legal strings of phonetic categories. This set is dependent on the words we want to recognize and the possible order of words, so we combine pronunciation models for each of our words (for example, "yes" = $sil<y, y>$mid, $front<E, <E>, E>$fric, $mid<s, s>$sil) with a grammar if we allow multiple words to come in more than one order. In the example shown here, we have a simple search path that can recognize only either "yes" or "no" which have to be preceded and followed by silence. In searching, when we look at a new frame, we transition to a new state if the probability of the new category is greater than the probability of the current category. We also perform this search through all branches of the search path, finding the paths through the matrix that maximize the score for each word in our vocabulary. At the end of searching, we have scores for all words in our vocabulary, and the word with the highest score is the word with the best fit to the input data, and it is therefore the word with the highest probability of being the word that was uttered.


Viterbi Search Example


This figure traces the search paths for "yes" and "no" through a hypothetical matrix of probabilities. Even though the score for "no" is very low, we still find the most probable path for this word and compare it to the score for "yes" at the end of searching.


Word Spotting


Word-spotting allows us to find our target word or phrase somewhere in the utterance, even if the person said, "Well, let me think, ummm.... <target word> is what I'd like."

The default grammar for the recognition of a single word or phrase is

	ANY -> <target word or phrase> -> ANY
which means that we recognize "anything," followed by one of the words from our vocabulary, followed by "anything." "Anything" at some frame is defined as being the score for silence or the score of the 16th-best phonetic category at the current frame, whichever is greater. In effect, we are saying that we will consider the target word to begin when it's score is better than the score for all 16-th best frames. This provides a measurement that is not dependent on a fixed threshold (if the signal is noisy, the scores for the target words AND the 16th-best frames will decrease) or grammar (no assumptions are made about what the extraneous speech really is).

The value 16 was determined empirically and can be changed by the user in the cslurp or cslush packages.


Confidence and Rejection


We have recently implemented confidence and rejection in our general-purpose recognizer and cslurp. The way it works is this:

  1. Word scores that reflect relative merit are computed. Specifically, normalization of each frame score is done based on the rank of the category in the current frame, and these normalized scores are averaged over segments and words to arrive at an overall word score.
  2. This word score is compared to a threshold. If the score is better than the threshold, then we accept the word. If worse, then we reject it.

The threshold has been determined by measuring the probabilities of a Type I error (rejecting a correct word) and a Type II error (accepting an incorrect word) in a large training corpus. The word score for which the two error rates are equal is the threshold, and it minimizes the probability of both types of errors. This threshold can be adjusted in cslurp, and must be set by the user in cslush. (This work in confidence was done by Don Colton for his thesis.)


Questions or comments? Please send mail to the first author, John-Paul Hosom.