Exploiting Nonlocal and Syntactic Word Relationships in Language Models for Conversational Speech Recognition

Frederick Jelinek, Eric Brill, Sanjeev Khudanpur and David Yarowsky

Johns Hopkins University

CONTACT INFORMATION

The Center for Language and Speech Processing
Johns Hopkins University
3400 North Charles St.
Baltimore, MD 21218
Phone: (410) 516-4237
Fax: (410) 516-5050
Email: jelinek@jhu.edu, brill@jhu.edu, khudanpur@jhu.edu, yarowsky@jhu.edu

WWW PAGE

http://www.clsp.jhu.edu

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Language Modeling, Speech Recognition, Natural Language Processing

PROJECT SUMMARY

A statistical model of language is a crucial component of systems that convert between speech, handwriting and text, and of statistical machine translation systems. Most current language modeling algorithms are acutely myopic: they base their prediction of the next word on a rigidly fixed window of the few immediately preceding words. Faced with a comparable task, humans easily outperform these models by using the richer linguistic information available from a more complete context.

Researchers at CLSP propose to investigate and develop novel language modeling techniques that exploit richer contextual information. They are exploring models that capture a variety of syntactic dependencies, as well as models that capture longer-distance dependencies through dynamic, hierarchical models of topic, and are combining both with the best current models of local word sequence using the maximum entropy principle.

To evaluate the potential gain from using syntactic structure to make more accurate predictions, we need a framework in which syntactic structure is built in a left-to-right manner, as part of the language model. This framework was described in [2] and is summarized below.

Consider the sentence:

the dog I have n't seen barked again

The complete parse of the sentence would be:

(( the ( dog ( I ( ( have n't HAVE) seen HAVE) HAVE) DOG) DOG) ( barked again BARKED) BARKED)

The words which label internal nodes in the tree are called headwords and are written in uppercase characters in the above example.

The partial parse -- going left to right -- when predicting barked is

( the ( dog ( I ( ( have n't HAVE) seen HAVE) HAVE) DOG) DOG) .

Consider the prediction of barked. In a standard bigram approach it would be predicted from seen, whereas in our framework it is predicted from DOG, since DOG is the most recent exposed headword in the past. The most recent exposed headword is the closest word or headword to the left of the predicted word in our representation of the parse tree. A similar situation arises when predicting seen: the most recent exposed headword is HAVE, so seen is predicted from HAVE instead of n't. This mechanism allows the model to use predictors that are arbitrarily distant in the past, and thus to model the long-distance dependencies in language.
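The lookup of the most recent exposed headword can be illustrated with a minimal sketch. It assumes the partial parse is given in the bracketed notation used above, where the label written just before each closing parenthesis is the headword of that constituent. The function name and the stack-based reading of the brackets are illustrative assumptions, not the implementation described in [2].

```python
def most_recent_exposed_headword(partial_parse: str) -> str:
    """Return the closest word/headword to the left of the word being predicted.

    Assumes the bracketed notation of this summary: "( child child HEAD)",
    where the token immediately before ")" labels the finished constituent.
    """
    tokens = partial_parse.replace("(", " ( ").replace(")", " ) ").split()
    stack = []
    for tok in tokens:
        if tok == ")":
            head = stack.pop()           # headword label of the constituent
            while stack.pop() != "(":    # children are now covered by the head
                pass
            stack.append(head)           # only the headword stays exposed
        else:
            stack.append(tok)            # open bracket or word
    for item in reversed(stack):
        if item != "(":                  # skip brackets that are still open
            return item
    raise ValueError("no word has been seen yet")


# History available when predicting "barked" (from the example above).
before_barked = "( the ( dog ( I ( ( have n't HAVE) seen HAVE) HAVE) DOG) DOG)"
# History available when predicting "seen".
before_seen = "( the ( dog ( I ( ( have n't HAVE)"

print(most_recent_exposed_headword(before_barked))
print(most_recent_exposed_headword(before_seen))
```

For the two histories in the running example, this sketch returns DOG and HAVE, whereas a bigram model would condition on seen and n't respectively.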

We have carried out a few experiments in which we calculate the improvement in perplexity when conditioning on the parse tree. The perplexity results are summarized in the following table. The h model uses the most recent exposed headword to predict the next word, while the w model uses the previous word (the standard bigram approach). In the W and H models we allowed back-off to part-of-speech tags and non-terminal labels, respectively.

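  Language model                                    Perplexity
  ------------------------------------------------------------
   w (baseline, previous word)                         419
   h (most recent exposed headword)                    410
   W (w with back-off to part-of-speech tags)          352
   H (h with back-off to non-terminal labels)          292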
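For readers who want to compare the table entries, the sketch below shows the standard definition of perplexity (the exponential of the average negative log probability per word) and the relative reductions implied by the four reported values. The function names are illustrative, and the probabilities in the usage lines are made up; only the four perplexity figures come from the table.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test text, given the model probability of each word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)


def relative_reduction(baseline, improved):
    """Fractional perplexity reduction of `improved` over `baseline`."""
    return (baseline - improved) / baseline


# Perplexities reported in the table above.
reported = {"w": 419, "h": 410, "W": 352, "H": 292}

for base, new in [("w", "h"), ("W", "H"), ("w", "H")]:
    gain = relative_reduction(reported[base], reported[new])
    print(f"{base} -> {new}: {100 * gain:.1f}% lower perplexity")

# Toy check of the definition: a model that assigns 1/4 to every word
# is as uncertain as a uniform choice among 4 words.
print(perplexity([0.25] * 8))
```

The headword predictor alone gives a small gain over the bigram baseline, while the gain is considerably larger once back-off is allowed in both models.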

In the context of using rich structural information to improve the accuracy of speech recognition systems, we have begun to carefully analyze the errors made by systems on both Switchboard and Wall Street Journal, to help determine which kinds of syntactic and semantic information can correct the errors currently being made. We are building an initial system that learns to map the errorful recognizer output into a form closer to the truth, using rich contextual cues to indicate likely errors and to guess what the replacement string should be. Since it is difficult to assign syntactic structure to a nonsensical string that may be output by the recognizer, we are first doing an experiment in which we take a syntactic analysis of the truth and map it onto the errorful string. If this results in error reduction, we will next turn to the problem of actually annotating (tagging, parsing, etc.) errorful strings.

In our efforts to capture longer distance dependencies through dynamic topic models, we have begun by conducting extensive empirical studies into the statistical distributions of different word types conditional on various parameterizations of topic. We have also implemented the hierarchical smoothing procedure for topic sensitive probabilities that we proposed in June 1996, using Broadcast News and NSF abstract corpora, with preliminary results yielding a 17% perplexity reduction over a baseline bigram model. We are currently investigating richer dynamic topic parameterizations optimized to reduce word error rate.

PROJECT REFERENCES

[1] Charniak, Eugene. Parsing with context-free grammars and word statistics. Technical Report CS-95-28, Department of Computer Science, Brown University, 1995.
[2] Chelba, Ciprian. A Structured Language Model. In Proceedings of ACL 97.
[3] Collins, Michael. A New Statistical Parser Based on Bigram Lexical Dependencies. In Proceedings of ACL 96.
[4] Stolcke et al. Structure and Performance of a Dependency Language Model. In Proceedings of Eurospeech 97.

AREA BACKGROUND

A speech recognizer is a device which automatically transcribes speech into text. It can be thought of as a voice-actuated "typewriter" in which the transcription is carried out by a computer program and the transcribed text appears on a workstation display.

In the widely used mathematical formulation of the problem, the task of the recognizer is to find a word string $\hat{W}$ satisfying

$$\hat{W} = \arg\max_{W} P(A \mid W)\, P(W) \qquad (1)$$

where $A$ denotes the acoustic evidence (data) and

$$W = w_1, w_2, \ldots, w_n, \qquad w_i \in \mathcal{V} \qquad (2)$$

denotes a string of $n$ words, each belonging to a fixed and known vocabulary $\mathcal{V}$. In (1), $P(A \mid W)$ is the probability of observing the acoustics $A$ given that the word string $W$ was uttered, and $P(W)$ is the a priori probability that the speaker will wish to utter $W$.
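The decision rule (1) can be sketched in a few lines, assuming a small explicit list of candidate word strings and working with log probabilities so that the product becomes a sum. The candidate strings and all scores below are made-up values for illustration; a real recognizer searches a far larger hypothesis space.

```python
def decode(hypotheses, acoustic_logprob, lm_logprob):
    """Pick the word string W maximizing log P(A|W) + log P(W), as in (1)."""
    return max(hypotheses, key=lambda w: acoustic_logprob[w] + lm_logprob[w])


# Hypothetical candidates for one utterance, with invented scores.
candidates = [
    ("the", "dog", "barked"),
    ("the", "dog", "parked"),
]
acoustic = {candidates[0]: -10.0, candidates[1]: -9.5}   # log P(A|W)
language = {candidates[0]: -4.0, candidates[1]: -7.0}    # log P(W)

best = decode(candidates, acoustic, language)
print(" ".join(best))
```

In this toy case the second candidate matches the acoustics slightly better, but the language model score outweighs that difference, so the first candidate is selected.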

A recognizer has at its disposal (a) an acoustic model, which supplies an estimate of $P(A \mid W)$ for every combination of $A$ and $W$, and (b) a language model, which supplies an estimate of the probability $P(W)$ for every conceivable word string $W$. Our research concerns the development of sophisticated language models that can provide more accurate estimates of $P(W)$ than the current state of the art, which is based on the trigram approximation

$$P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$$
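As a rough illustration of the trigram approximation, the sketch below estimates the conditional probabilities by relative frequency from a toy corpus and multiplies them (in log space) over a sentence. The start-of-sentence padding symbol, the function names and the corpus are assumptions made for the example. No smoothing is applied, so any unseen trigram gives probability zero, which practical systems avoid.

```python
import math
from collections import Counter

BOS = "<s>"  # assumed padding symbol standing in for w_0 and w_{-1}


def train_trigram(sentences):
    """Count trigrams and their two-word contexts."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        padded = [BOS, BOS] + list(words)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            tri[context + (padded[i],)] += 1
            ctx[context] += 1
    return tri, ctx


def trigram_logprob(words, tri, ctx):
    """log P(W) as the sum over i of log P(w_i | w_{i-2}, w_{i-1})."""
    padded = [BOS, BOS] + list(words)
    total = 0.0
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        count = tri[context + (padded[i],)]
        if count == 0:
            return float("-inf")  # unseen trigram: zero probability without smoothing
        total += math.log(count / ctx[context])
    return total


corpus = [["the", "dog", "barked"], ["the", "dog", "slept"]]
tri, ctx = train_trigram(corpus)
print(math.exp(trigram_logprob(["the", "dog", "barked"], tri, ctx)))
```

Each factor looks only at the two preceding words, which is the confinement that the structured approach is meant to remove.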

The basic idea of our approach is to estimate $P(W)$ by taking advantage of the grammatical structure of the hypothesized utterance $W$. This should allow the prediction of $w_i$ to be based on salient information contained in the unlimited history $w_1, w_2, \ldots, w_{i-1}$, and thereby free the model's prediction from its unnatural confinement to the preceding two words.

AREA REFERENCES

[1] Charniak, Eugene. Statistical Language Learning. MIT Press, 1993.
[2] Jelinek, Frederick. Information Extraction from Speech and Text. MIT Press, 1997.
[3] Rosenfeld, Ronald. A Maximum Entropy Approach to Adaptive Statistical Language Modeling. Computer Speech and Language, 1996.

RELATED PROGRAM AREAS

Other Communication Modalities, Adaptive Human Interfaces, Intelligent Interactive Systems for Persons with Disabilities

POTENTIAL RELATED PROJECTS

Statistical language models are important components in the recognition of other communication modalities, including handwriting recognition, OCR, sign-language recognition and machine translation. Our research in improved language modeling should therefore have direct applications to each of these modalities. Potential related projects include incorporating our algorithms into such systems and working with experts in these areas to overcome the language modeling challenges unique to each. In addition, several of these communication modalities (including speech) are important components of intelligent interactive systems for persons with disabilities, so our improved language modeling algorithms could be incorporated into such systems as well.