Frederick Jelinek, Eric Brill, Sanjeev Khudanpur and David Yarowsky
Johns Hopkins University
A statistical model of language is a crucial component of systems that convert between speech, handwriting and text, and of statistical machine translation systems. Most current algorithms for language modeling exhibit an acute myopia, basing their predictions of the next word on only the rigidly fixed region of the few immediately preceding words. When humans are faced with a comparable task, they easily outperform these models by using the richer linguistic information available to them from more complete context.
Researchers at CLSP propose to investigate and develop novel language modeling techniques that exploit richer contextual information. They are exploring models that capture a variety of syntactic dependencies and models that capture longer-distance dependencies through dynamic, hierarchical models of topic, and are combining both kinds of model with the best current models of local word sequence using the maximum entropy principle.
In order to evaluate the potential gain of using syntactic structure for making a more accurate prediction, we need to describe a framework in which syntactic structure is developed in a left-to-right manner, as part of the language model. This framework was described in [2] and is summarized below.
Consider the sentence:
the dog I have n't seen barked again
The complete parse of the sentence would be:
(( the ( dog ( I ( ( have n't HAVE) seen HAVE) HAVE) DOG) DOG) ( barked again BARKED) BARKED)
The words which label internal nodes in the tree are called headwords and are written in uppercase characters in the above example.
The partial parse -- going left to right -- when predicting barked is
( the ( dog ( I ( ( have n't HAVE) seen HAVE) HAVE) DOG) DOG) .
Consider the prediction of barked: in a standard bigram approach it would be predicted from seen, whereas in our framework barked is predicted from DOG, since DOG is the most recent exposed headword in the past. The most recent exposed headword is the closest word/headword to the left of the predicted word in our representation of the parse tree. A similar situation arises when predicting seen: the most recent exposed headword is HAVE, and thus seen is going to be predicted from HAVE instead of n't. This mechanism allows the model to use predictors which are arbitrarily distant in the past, and it is thus able to model the long-distance dependencies in language.
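The notion of an exposed headword can be illustrated with a short Python sketch. It assumes the bracketed notation used in the examples above, in which every closed constituent ends with its uppercase headword, and it treats any word that is not enclosed in a constituent as its own head. The function name exposed_heads and the partial parse shown for the prediction of seen are illustrative assumptions, not part of the framework described in [2].

```python
import re


def exposed_heads(partial_parse):
    """Return the exposed headwords of a bracketed partial parse, left to right.

    Assumption: each constituent is written "( ... HEAD)", so the token just
    before a closing parenthesis is that constituent's headword.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", partial_parse)
    heads = []
    depth = 0
    previous = None
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth == 0:
                # A top-level constituent just closed: its headword is exposed.
                heads.append(previous)
        elif depth == 0:
            # A word outside any constituent is exposed as its own head.
            heads.append(token)
        previous = token
    return heads


# Partial parse from the text, used when predicting "barked".
before_barked = "( the ( dog ( I ( ( have n't HAVE) seen HAVE) HAVE) DOG) DOG)"
# Hypothetical partial parse when predicting "seen" (assumed for illustration).
before_seen = "the dog I ( have n't HAVE)"

print(exposed_heads(before_barked)[-1])  # DOG
print(exposed_heads(before_seen)[-1])    # HAVE
```

The last element of the returned list is the most recent exposed headword, which is the predictor used by the h model; a bigram model would instead use the last word of the string.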
We have carried out a few experiments in which we calculate the improvement in perplexity obtained by conditioning on the parse tree. The perplexity results are summarized in the following table. The h model uses the most recent exposed headword to predict the next word, while the w model uses the previous word (the standard bigram approach). In the W and H models we allowed back-off to part-of-speech tags and non-terminal labels, respectively.
Language Model    Perplexity        Language Model    Perplexity
--------------    ----------        --------------    ----------
w (baseline)      419               W                 352
h                 410               H                 292
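To make the size of these gains easier to compare, the perplexities in the table can be converted into relative reductions. The following Python sketch does only that arithmetic; the helper name relative_reduction is an assumption introduced here, and the pairing of each headword model with its word-based counterpart (h with w, H with W) is our reading of the table.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers perplexity relative to `baseline`."""
    return (baseline - improved) / baseline


# Perplexities copied from the table above.
perplexity = {"w": 419, "h": 410, "W": 352, "H": 292}

for word_model, head_model in (("w", "h"), ("W", "H")):
    gain = relative_reduction(perplexity[word_model], perplexity[head_model])
    print(f"{head_model} vs {word_model}: {100 * gain:.1f}% lower perplexity")
# h vs w: 2.1% lower perplexity
# H vs W: 17.0% lower perplexity
```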
In the context of using rich structural information to improve the accuracy of speech recognition systems, we have begun to carefully analyze the errors made by systems on both Switchboard and Wall Street Journal, to help identify what sorts of syntactic and semantic information could correct the errors currently being made. We are building an initial system that learns to map from the errorful recognizer output into a form closer to the truth, using rich contextual cues to indicate likely errors and to guess what the replacement string should be. Since it is difficult to assign syntactic structure to a nonsensical string that may be output by the recognizer, we are first doing an experiment in which we take a syntactic analysis of the truth and then map it onto the errorful string. If this results in error reduction, we will next turn to the problem of actually annotating (tagging, parsing, etc.) errorful strings.
In our efforts to capture longer-distance dependencies through dynamic topic models, we have begun by conducting extensive empirical studies into the statistical distributions of different word types conditional on various parameterizations of topic. We have also implemented the hierarchical smoothing procedure for topic-sensitive probabilities that we proposed in June 1996, using Broadcast News and NSF abstract corpora, with preliminary results yielding a 17% perplexity reduction over a baseline bigram model. We are currently investigating richer dynamic topic parameterizations optimized to reduce word error rate.
In the widely used mathematical formulation of the problem, the task of the recognizer is to find a word string \hat{W} satisfying

\hat{W} = \arg\max_{W} P(A|W) P(W)    (1)

where A denotes the acoustic evidence (data) and W = w_1, w_2, ..., w_n denotes a string of n words, each belonging to a fixed and known vocabulary V. In (1), P(A|W) is the probability of observing the acoustics A given that the word string W was uttered, and P(W) is the a priori probability that the speaker will wish to utter W.
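As a rough illustration of this formulation, the Python sketch below picks the best word string from a small list of candidates by maximizing the sum of the acoustic and language model log-probabilities, which is equivalent to maximizing the product P(A|W) P(W). The function name best_hypothesis, the restriction to an explicit candidate list, and the toy scores are all assumptions made for this example; a real recognizer searches a far larger hypothesis space.

```python
def best_hypothesis(candidates):
    """Return the word string W that maximizes log P(A|W) + log P(W).

    `candidates` is a list of (words, log_p_acoustic, log_p_language) tuples.
    """
    words, _, _ = max(candidates, key=lambda c: c[1] + c[2])
    return words


# Toy, made-up log-probabilities for two competing hypotheses.
candidates = [
    (("the", "dog", "barked"), -12.0, -4.0),   # total -16.0
    (("the", "dock", "barked"), -11.5, -7.0),  # total -18.5
]

print(best_hypothesis(candidates))  # ('the', 'dog', 'barked')
```

In this toy case the second hypothesis has the better acoustic score, but the language model score tips the decision toward the first, which is the role P(W) plays in (1).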
A recognizer has (a) an acoustic model, which can supply an estimate of the value of P(A|W) for every combination of A and W, and (b) a language model, which can supply an estimate of the probability P(W) for every conceivable word string W. Our research concerns the development of sophisticated language models that can provide more accurate estimates of P(W) than the current state of the art, which is based on the trigram approximation

P(W) \approx \prod_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})
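For comparison with the structured models discussed here, the trigram approximation can be sketched in a few lines of Python. This is a minimal, unsmoothed maximum-likelihood version built from raw counts; the function names, the "<s>" padding symbol and the two-sentence toy corpus are assumptions made for illustration, and practical systems add smoothing or back-off so that unseen trigrams do not receive zero probability.

```python
import math
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    trigrams, contexts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts


def trigram_logprob(words, trigrams, contexts):
    """Sum of log P(w_i | w_{i-2}, w_{i-1}) using unsmoothed relative frequencies."""
    padded = ["<s>", "<s>"] + list(words)
    total = 0.0
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        count = trigrams[context + (padded[i],)]
        if count == 0:
            return float("-inf")  # unseen trigram: zero probability without smoothing
        total += math.log(count / contexts[context])
    return total


# Toy corpus, made up for this example.
corpus = [["the", "dog", "barked"], ["the", "dog", "slept"]]
trigrams, contexts = train_trigram(corpus)
print(math.exp(trigram_logprob(["the", "dog", "barked"], trigrams, contexts)))
```

Each word is predicted from only the two words before it, which is exactly the confinement that the syntactic approach aims to remove.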
The basic idea of our approach is to estimate P(W) by taking advantage of the grammatical structure of the hypothesized utterance W. This should allow the prediction of w_i to be based on salient information contained in the unlimited history w_1, w_2, ..., w_{i-1} and thereby free the model's prediction from its unnatural confinement to the preceding two words.