EXPERIMENTS IN INTEGRATING SPEECH RECOGNITION AND NATURAL LANGUAGE PROCESSING

Mary P. Harper and Leah H. Jamieson

School of Electrical and Computer Engineering
Purdue University

CONTACT INFORMATION

1285 The Electrical Engineering Building
School of Electrical and Computer Engineering
Purdue University
West Lafayette, Indiana 47907-1285

Mary P. Harper
Phone: (765) 494-3652
Fax: (765) 494-6440
Email: harper@ecn.purdue.edu

Leah H. Jamieson
Phone: (765) 494-3653
Fax: (765) 494-3371
Email: lhj@purdue.edu

WWW PAGE

WWW: http://yara.ecn.purdue.edu/~harper
WWW: http://www.ece.purdue.edu/~lhj

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Spoken Language Processing, Speech Recognition, Natural Language Processing, Integrating Speech and Natural Language Processing, Parsing, Hidden Markov Models.

PROJECT SUMMARY

This project (IRI-9704358) addresses two basic problems in spoken language processing: (1) how to integrate the speech recognition and natural language processing components of a system, and (2) how to use domain-specific information to select the correct meaning of an utterance.

Most current speech recognizers use word-level hidden Markov models (HMMs) in conjunction with unigram, bigram, or trigram language models. Additional flexibility can be achieved by using phones as the basic recognition unit. This research proposes a new technique, the Stochastic Observation Hidden Markov Model (SOHMM), for recognizing words from phone candidates. The output distributions of a recurrent neural network phone recognizer are passed to a SOHMM, which forms word hypotheses by modeling the distribution of the observations rather than the observations themselves. This approach has achieved very high accuracy in preliminary tests.

Most current spoken language systems use probabilistic language models to model syntactic and semantic patterns. This project integrates a probabilistic speech recognition component with a flexible and fast natural language component. Prior work has demonstrated that constraint-based parsing is a powerful technique both in terms of its expressivity (the set of languages accepted by a Constraint-Dependency Grammar (CDG) is a superset of the set of languages that can be accepted by Context-Free Grammars) and its computational complexity. Moreover, the approach allows the application of constraints from multiple knowledge sources, including lexical information, prosody, syntax, and semantics in a uniform modular framework. We have also developed a mechanism for handling domain-specific constraints in the CDG framework. This additional knowledge source should help to prune the search space for the correct meaning of an utterance.
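The flavor of constraint-based parsing can be sketched with a toy example in which each word starts with every possible role value, a (label, modifiee position) pair, and unary constraints then eliminate the values that are inconsistent with the word's category. The lexicon, label set, constraints, and function names below are hypothetical and greatly simplified; a full CDG parser also applies binary constraints over pairs of role values, which this sketch omits.

```python
from itertools import product

# Hypothetical toy lexicon and label set; not the grammar used in the project.
CATEGORY = {"the": "det", "dog": "noun", "barks": "verb"}
LABELS = ["DET", "SUBJ", "ROOT"]

def initial_role_values(words):
    """Give every word all (label, modifiee) pairs; modifiee 0 means none."""
    n = len(words)
    return {
        pos: {(lab, mod) for lab, mod in product(LABELS, range(n + 1)) if mod != pos}
        for pos in range(1, n + 1)
    }

def satisfies_unary(words, pos, value):
    """Check the constraints that inspect a single role value."""
    label, mod = value
    cat = CATEGORY[words[pos - 1]]
    if cat == "det":   # a determiner modifies a noun to its right
        return label == "DET" and mod > pos and CATEGORY[words[mod - 1]] == "noun"
    if cat == "noun":  # a subject noun modifies a verb to its right
        return label == "SUBJ" and mod > pos and CATEGORY[words[mod - 1]] == "verb"
    if cat == "verb":  # the main verb is the root and modifies nothing
        return label == "ROOT" and mod == 0
    return False

def filter_role_values(words):
    """Apply the unary constraints to every word's candidate role values."""
    domains = initial_role_values(words)
    return {
        pos: {v for v in values if satisfies_unary(words, pos, v)}
        for pos, values in domains.items()
    }

sentence = ["the", "dog", "barks"]
before = initial_role_values(sentence)   # 9 candidate role values per word
remaining = filter_role_values(sentence)  # one role value left per word
```

Because constraints are just predicates over role values, further knowledge sources (for example prosodic or domain-specific restrictions) could be added as extra checks without changing the filtering loop.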

Word graphs act as the interface between the SOHMM-based speech recognizer and the CDG-based natural language component. We will refine the model for the SOHMM-CDG spoken language processing system, and address the problem of building and pruning word graphs. Preliminary work suggests that this division of labor should be quite effective. We are currently interfacing the two systems to demonstrate the potential utility of this approach.
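As a minimal sketch of a word graph and of one possible pruning rule, the code below stores word hypotheses as scored edges between time-ordered nodes, finds the best-scoring path, and discards every edge that does not lie on a path within a fixed beam of that best path. The edge format, the beam criterion, the function names, and the scores are assumptions made for illustration, not the project's method.

```python
from collections import defaultdict

# Each edge is (start_node, end_node, word, log_score). Nodes are integers
# ordered in time, so every edge runs from a smaller node to a larger one.

def best_scores(edges, start, end):
    """Best log score from start to each node, and from each node to end."""
    fwd = defaultdict(lambda: float("-inf"))
    bwd = defaultdict(lambda: float("-inf"))
    fwd[start] = 0.0
    bwd[end] = 0.0
    for s, e, _, score in sorted(edges, key=lambda x: x[0]):
        fwd[e] = max(fwd[e], fwd[s] + score)
    for s, e, _, score in sorted(edges, key=lambda x: x[1], reverse=True):
        bwd[s] = max(bwd[s], bwd[e] + score)
    return fwd, bwd

def prune(edges, start, end, beam):
    """Keep edges lying on some path within `beam` of the best full path."""
    fwd, bwd = best_scores(edges, start, end)
    threshold = fwd[end] - beam
    return [x for x in edges if fwd[x[0]] + x[3] + bwd[x[1]] >= threshold]

def best_path(edges, start, end):
    """Read off the word sequence of a best-scoring path."""
    _, bwd = best_scores(edges, start, end)
    words, node = [], start
    while node != end:
        # Follow the outgoing edge with the best score to the end node.
        _, nxt, word, _ = max(
            (x for x in edges if x[0] == node),
            key=lambda x: x[3] + bwd[x[1]],
        )
        words.append(word)
        node = nxt
    return words

# Invented word hypotheses for a three-word utterance.
edges = [
    (0, 1, "show", -1.0), (0, 1, "so", -2.5),
    (1, 2, "me", -0.5), (1, 3, "meat", -4.0),
    (2, 3, "flights", -1.0), (2, 3, "lights", -1.5),
]
top = best_path(edges, 0, 3)
kept = prune(edges, 0, 3, beam=1.0)
```

Here the pruned graph keeps "lights" as a close competitor to "flights" and drops "so" and "meat", leaving a smaller graph for the natural language component to parse.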

PROJECT REFERENCES

Ruxin Chen and L. H. Jamieson, ``Experiments on the Implementation of Recurrent Neural Networks for Speech Phone Recognition,'' Proceedings of the Thirtieth Annual Asilomar Conference on Signals, Systems and Computers, Pacific Grove, California, November 1996, pp. 779-782.

M. P. Harper, L. H. Jamieson, C. D. Mitchell, G. Ying, S. Potisuk, P. N. Srinivasan, R. Chen, C. B. Zoltowski, L. L. McPheters, B. Pellom, and R. A. Helzerman, ``Integrating Language Models with Speech Recognition,'' Proceedings of the 1994 American Association for Artificial Intelligence Workshop on the Integration of Natural Language and Speech Processing, Seattle, WA, August 1994, pp. 139-146.

M. P. Harper, R. A. Helzerman, C. B. Zoltowski, B. L. Yeo, Y. Chan, T. Stewart, and B. L. Pellom, ``Implementation Issues in the Development of the PARSEC Parser,'' SOFTWARE - Practice and Experience, Vol. 25, No. 8, August 1995, pp. 831-862.

M. P. Harper and R. A. Helzerman, ``Extensions to Constraint Dependency Parsing for Spoken Language Processing,'' Computer Speech and Language, Vol. 9, No. 3, July 1995, pp. 187-234.

M. P. Harper and R. A. Helzerman, ``Managing Multiple Knowledge Sources in Constraint-Based Parsing of Spoken Language,'' Fundamenta Informaticae, Special Issue on ``Context: Theory and Practice,'' Vol. 23, No. 2-4, June-August 1995, pp. 303-353.

C. D. Mitchell, M. P. Harper, and L. H. Jamieson, ``Stochastic Observation Hidden Markov Models,'' The 1996 International Conference on Acoustics, Speech, and Signal Processing, May 1996, pp. II-617-II-620.

AREA BACKGROUND

State-of-the-art speech recognition systems achieve high recognition accuracy only on tasks that have low perplexity. The perplexity of a task is, roughly speaking, the average number of choices at any decision point. It is at a minimum when the true language model is known and correctly modeled; a poor language model increases perplexity and lowers performance. To achieve higher recognition accuracy for a given perplexity, it is necessary to improve the acoustic model or to utilize more high-level knowledge, such as syntax, prosody, and semantics. Although determining the true language model is often very difficult or impossible, approximate models can often be found whose perplexity is low enough that accurate recognition is feasible. A second, potentially more difficult problem is how to integrate an acoustic model with a language model that includes syntactic, semantic, and domain-specific knowledge sources. The question of how to integrate language models with speech recognition systems is becoming more important as speech recognition technology matures.
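To make the notion of perplexity concrete, the following sketch computes it for a test word sequence from the probability the language model assigns to each word: perplexity is 2 raised to the average negative log2 probability per word. The function name `perplexity` and the toy probabilities are assumptions made for illustration, not values from the project.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sequence given the model probability of each word.

    word_probs: probabilities P(w_i | history) assigned by the model,
    one per word of the test sequence (each must be greater than 0).
    """
    if not word_probs:
        raise ValueError("need at least one word probability")
    # Average negative log2 probability per word (cross-entropy in bits).
    cross_entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    # Perplexity is 2 raised to the cross-entropy.
    return 2.0 ** cross_entropy

# A model that always chooses uniformly among 8 words.
uniform = perplexity([1 / 8] * 5)
# A sharper model that gives the observed words higher probability.
sharper = perplexity([0.5, 0.25, 0.5, 0.5, 0.25])
```

The uniform model has a perplexity equal to its number of equally likely choices, which matches the "average number of choices" reading, while the sharper model scores lower on the same five-word sequence.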

The most successful automatic speech recognition systems are those that utilize higher level knowledge sources such as syntax and semantics, in addition to acoustic and lexical knowledge. Hidden Markov modeling has been one of the most successful strategies for acoustic pattern matching. However, this method is generally difficult to integrate with sophisticated language models. Approaches that jointly model the grammar and the acoustic signal have been applied successfully to small problems, but their widespread use on larger problems has been limited by computational costs, insufficient training data, or an inadequate language model. N-gram models have been the most widely used approach to integrating a language model with an acoustic model. However, even for small N (2, 3, or 4), millions of words of text are required to estimate the N-grams for moderate to large vocabularies. Even so, many of the N-grams are undertrained and extensive smoothing is required. Moreover, N-gram models do not provide a parse or semantic representation, which speech understanding applications need.
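The need for smoothing can be seen in a small sketch of bigram estimation. The code below counts bigrams in a toy corpus and uses add-k smoothing so that bigrams never seen in training still receive a non-zero probability. Add-k is chosen here only because it is the simplest scheme to show; the corpus, the function name `train_bigram`, and the sentence boundary markers are assumptions made for illustration.

```python
from collections import Counter

def train_bigram(sentences, k=1.0):
    """Return prob(prev, word) estimated from add-k smoothed bigram counts."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1        # times prev occurs as a history
            bigram_counts[(prev, word)] += 1
    vocab_size = len(vocab)

    def prob(prev, word):
        # Add-k smoothing: unseen bigrams get a small non-zero probability.
        return (bigram_counts[(prev, word)] + k) / (
            history_counts[prev] + k * vocab_size
        )

    return prob

corpus = ["show me flights", "show me fares", "list flights"]
prob = train_bigram(corpus)
seen = prob("show", "me")     # bigram observed twice in the corpus
unseen = prob("me", "list")   # bigram never observed, still greater than 0
```

Even in this seven-token vocabulary most of the 49 possible bigrams are unobserved, which hints at why realistic vocabularies require very large text corpora and heavier smoothing.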

By separating the language model from the acoustic model, it should be possible to use a more accurate language model without increasing computational costs or the amount of acoustic training data required. Decoupling these knowledge sources is possible only if the language model is conditionally independent of the acoustic model given some intermediate knowledge source. N-best sentences or word graphs/lattices can be used to interface the two components. Several modern systems utilize a language model that operates as a postprocessor to a speech recognizer. Decoupling the acoustic and lexical processors also adds flexibility. A variety of language models can be tried with a single acoustic model.
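The flexibility gained by decoupling can be sketched with N-best rescoring: the recognizer's hypotheses and acoustic scores stay fixed, and any language model, passed in as a function, reranks them. The hypotheses, scores, weight parameter, and both stand-in language models below are invented for illustration.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """Pick the best hypothesis from an N-best list.

    nbest: list of (sentence, acoustic_logprob) pairs from the recognizer.
    lm_logprob: any function mapping a sentence to a log probability.
    """
    def total(item):
        sentence, acoustic = item
        return acoustic + lm_weight * lm_logprob(sentence)
    return max(nbest, key=total)[0]

# Hypothetical recognizer output: the acoustically best string comes first.
nbest = [
    ("wreck a nice beach", -10.0),
    ("recognize speech", -10.5),
]

def toy_lm(sentence):
    # Stand-in language model: a lookup table of invented scores.
    scores = {"recognize speech": -2.0, "wreck a nice beach": -6.0}
    return scores.get(sentence, -20.0)

def flat_lm(sentence):
    # A model with no preferences leaves the acoustic ranking unchanged.
    return 0.0

best_with_lm = rescore_nbest(nbest, toy_lm)
best_without_lm = rescore_nbest(nbest, flat_lm)
```

Swapping `toy_lm` for another scoring function changes the selected hypothesis without retraining or rerunning the acoustic model.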

AREA REFERENCES

X. Aubert and H. Ney, ``Large Vocabulary Continuous Speech Recognition Using Word Graphs,'' Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol I, 1995, pp. 49-52.

M. P. Harper, L. H. Jamieson, C. D. Mitchell, G. Ying, S. Potisuk, P. N. Srinivasan, R. Chen, C. B. Zoltowski, L. L. McPheters, B. Pellom, and R. A. Helzerman, ``Integrating Language Models with Speech Recognition,'' Proceedings of the 1994 American Association for Artificial Intelligence Workshop on the Integration of Natural Language and Speech Processing, Seattle, WA, August 1994, pp. 139-146.

C.-H. Lee, E. Giachin, L. R. Rabiner, R. Pieraccini, and A. E. Rosenberg, ``Improved Acoustic Modeling For Large Vocabulary Continuous Speech Recognition,'' Computer Speech and Language, Vol. 6, No. 2, 1992, pp. 103-127.

L. Nguyen, R. Schwartz, Y. Zhao, and G. Zavaliagkos, ``Is N-Best Dead?,'' Proceedings of the ARPA Human Language Technology Workshop, March 1994, pp. 411-414.

P. Placeway, R. Schwartz, P. Fung, and L. Nguyen, ``The Estimation of Powerful Language Models from Small and Large Corpora,'' The 1993 International Conference on Acoustics, Speech, and Signal Processing, 1993, pp. II-33-II-36.

S. Young, ``A Review of Large-Vocabulary Continuous-Speech Recognition,'' IEEE Signal Processing Magazine, Vol. 13, No. 5, September 1996, pp. 45-57.

RELATED PROGRAM AREAS

Other Communication Modalities, Usability and User-Centered Design, Intelligent Interactive Systems for Persons with Disabilities.

POTENTIAL RELATED PROJECTS

This research could also be applied to character recognition, which would benefit from a similar approach.