Knowledge Acquisition for Natural Language Understanding

Claire Cardie

Department of Computer Science
Cornell University

CONTACT INFORMATION

4142 Upson Hall
Cornell University
Ithaca, NY 14853-7501
Phone: (607) 255-9206
Fax: (607) 255-4428
Email: cardie@cs.cornell.edu

WWW PAGE

http://www.cs.cornell.edu/home/cardie/cardie.html

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

natural language learning, machine learning of natural language, knowledge acquisition, case-based learning

PROJECT SUMMARY

A major obstacle to building robust systems that can read, summarize, and extract information from text is the need for large amounts of linguistic knowledge to handle the myriad syntactic, semantic, and pragmatic ambiguities that pervade virtually all aspects of text analysis. The objective of this research is to address this knowledge engineering bottleneck for natural language processing (NLP) systems. To this end, we are extending a general knowledge acquisition framework, called Kenmore, that allows an NLP system to bootstrap its own knowledge bases directly from text using standard inductive machine learning techniques in conjunction with an annotated corpus and robust sentence analysis. Kenmore has been used with corpora from two real-world domains to learn solutions to a number of problems in natural language processing including part-of-speech tagging, semantic feature tagging, case frame triggering, and relative pronoun disambiguation. In current work, we are extending the framework to handle additional problems in lexical and structural ambiguity resolution and for use with very large text corpora. The research is of both theoretical and practical significance. First, we will begin to determine the conditions under which machine learning techniques can be expected to offer a cost-effective approach to knowledge acquisition for NLP systems, especially in comparison to existing statistical techniques. Second, the work will expand the current system into an integrated tool that relies on machine learning techniques to guide NLP system development.
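As a rough illustration of the case-based learning idea named in the keywords, the sketch below stores annotated examples as feature-value cases and labels a new case by majority vote among its most similar stored cases. It is a minimal sketch under stated assumptions, not Kenmore's actual implementation: the feature names (prev_tag, next_tag, capitalized), the toy cases, the overlap similarity measure, and the function names are all hypothetical.

```python
from collections import Counter


def overlap(case_a, case_b):
    """Count the features on which two cases have the same value."""
    return sum(1 for name in case_a if case_a[name] == case_b.get(name))


def classify(case_base, new_case, k=3):
    """Return the majority label among the k stored cases most similar to new_case."""
    # sorted() is stable, so ties keep their original case-base order.
    ranked = sorted(case_base, key=lambda item: overlap(item[0], new_case), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented training cases: context features of an ambiguous word, plus its
# annotated part of speech. A real case base would be derived from a corpus.
CASE_BASE = [
    ({"prev_tag": "det", "next_tag": "verb", "capitalized": False}, "noun"),
    ({"prev_tag": "det", "next_tag": "prep", "capitalized": False}, "noun"),
    ({"prev_tag": "adj", "next_tag": "verb", "capitalized": False}, "noun"),
    ({"prev_tag": "pron", "next_tag": "det", "capitalized": False}, "verb"),
    ({"prev_tag": "noun", "next_tag": "det", "capitalized": False}, "verb"),
    ({"prev_tag": "modal", "next_tag": "noun", "capitalized": False}, "verb"),
]

if __name__ == "__main__":
    unseen = {"prev_tag": "det", "next_tag": "verb", "capitalized": False}
    print(classify(CASE_BASE, unseen))
```

The same loop applies to any of the tasks listed above once each training instance is encoded as a set of features with an annotated answer; only the features and labels change.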

PROJECT REFERENCES

Improving Minority Class Prediction Using Case-Specific Feature Weights. C. Cardie and N. Howe. Proceedings of the Fourteenth International Conference on Machine Learning, to appear.

Examining Locally Varying Weights for Nearest Neighbor Algorithms. N. Howe and C. Cardie. Proceedings of the Second International Conference on Case-Based Reasoning, to appear.

An Analysis of Statistical and Syntactic Phrases. M. Mitra, C. Buckley, A. Singhal, and C. Cardie. 5th RIAO Conference, Computer-Assisted Information Searching on the Internet, to appear.

Automating Feature Set Selection for Case-Based Learning of Linguistic Knowledge. C. Cardie. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 113-126, University of Pennsylvania, 1996.

Embedded Machine Learning Systems for Natural Language Processing: A General Framework. C. Cardie. In S. Wermter, E. Riloff, and G. Scheler (eds.), Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing, Lecture Notes in Artificial Intelligence, 315-328, Springer, 1996.

Domain-Specific Knowledge Acquisition for Conceptual Sentence Analysis. C. Cardie. Ph.D. Thesis, University of Massachusetts, Amherst, MA, 1994. Available as University of Massachusetts, CMPSCI Technical Report 94-74.

University of Massachusetts/Hughes: Description of the CIRCUS System as Used for MUC-5. W. Lehnert, J. McCarthy, S. Soderland, E. Riloff, C. Cardie, J. Peterson, and F. Feng; C. Dolan, and S. Goldman. Proceedings of the Fifth Message Understanding Conference (MUC-5), Baltimore, MD, Morgan Kaufmann, 1994.

A Case-Based Approach to Knowledge Acquisition for Domain-Specific Sentence Analysis. C. Cardie. Proceedings of the Eleventh National Conference on Artificial Intelligence, 798-803, Washington, DC, AAAI Press / MIT Press, 1993.

Using Decision Trees to Improve Case-Based Learning. C. Cardie. Proceedings of the Tenth International Conference on Machine Learning, 25-32, Amherst, MA, Morgan Kaufmann, 1993.

Corpus-Based Acquisition of Relative Pronoun Disambiguation Heuristics. C. Cardie. Proceedings of the 30th Annual Conference of the Association for Computational Linguistics, 216-223, Newark, DE, Association for Computational Linguistics, 1992.

Learning to Disambiguate Relative Pronouns. C. Cardie. Proceedings of the Tenth National Conference on Artificial Intelligence, 38-43, San Jose, CA, AAAI Press / MIT Press, 1992.

AREA BACKGROUND

Among the most successful and robust systems for reading, summarizing, and extracting information from real-world text are knowledge-based natural language processing (NLP) systems. Knowledge-based NLP systems rely heavily on domain-specific, generally handcrafted knowledge to handle the myriad syntactic, semantic, and pragmatic ambiguities that pervade virtually all aspects of sentence analysis. Not surprisingly, however, generating this knowledge for new domains is time-consuming, difficult, and error-prone, and requires the expertise of computational linguists familiar with the underlying NLP system. This knowledge engineering bottleneck remains one of the biggest problems in designing and building natural language systems and promises only to become worse as natural language systems attempt to understand a wider variety of texts, to produce more complex summaries of the text, and to extract knowledge directly from text in a variety of forms. On the other hand, much of human knowledge is described in written documents, and current NLP technologies can perform limited understanding of relatively complicated texts. Machine learning techniques for inductive learning have also become increasingly available and offer powerful mechanisms for simplifying the knowledge acquisition process. The objective of this research is to address this knowledge engineering bottleneck for natural language processing (NLP) systems by developing algorithms that use inductive learning techniques to allow NLP systems to bootstrap their own knowledge bases directly from text.

AREA REFERENCES

Proceedings of the Conferences on Empirical Methods in Natural Language Processing, 1996 and 1997. Available through the Association for Computational Linguistics (ACL).

Proceedings of the Workshops on Very Large Corpora, 1993-1997. Available through the Association for Computational Linguistics (ACL).

Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. S. Wermter, E. Riloff, and G. Scheler (eds.), Lecture Notes in Artificial Intelligence, Springer, 1996.

RELATED PROGRAM AREAS

Adaptive Human Interfaces, Usability and User-Centered Design.

POTENTIAL RELATED PROJECTS

The state of affairs for many end-users of existing information retrieval (IR) systems, Web search engines, and natural language interfaces to document collections is far from optimal. To maintain general-purpose retrieval capabilities, current IR systems attempt to balance precision against recall. In response to a user query, for example, the system will return as many useful documents as possible, intermingling them with numerous non-useful documents. Often, however, users would prefer to see a small set of documents, all of which are deemed useful. This scenario requires a retrieval mechanism that emphasizes precision over recall. Unfortunately, the frustration of end-users does not end once a relevant document is found: existing text retrieval systems provide only the simplest methods for browsing the document (e.g., page by page) and provide no automated means for extracting pertinent information from the text in a usable form.
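The trade-off between precision and recall can be made concrete with the standard definitions: precision is the fraction of retrieved documents that are useful, and recall is the fraction of useful documents that are retrieved. The sketch below is illustrative only; the function name and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    relevant = ["d1", "d2", "d3", "d4"]
    # A broad result list finds every useful document but buries them in noise.
    broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
    # A narrow result list returns only useful documents but misses half of them.
    narrow = ["d1", "d2"]
    print(precision_recall(broad, relevant))
    print(precision_recall(narrow, relevant))
```

In this toy setting the broad list has perfect recall but low precision, while the narrow list has perfect precision but lower recall, which is the behavior the high-precision scenario above calls for.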

There are (at least) two ways that natural language learning techniques can be used to improve a user's ability to find and extract information from on-line text. First, we can combine our machine learning approach to natural language understanding with traditional statistical approaches to IR to improve the precision of state-of-the-art IR systems. An IR system locates a relevant text by measuring the degree of vocabulary overlap between the user's information request (i.e., the query) and each document in the collection. In theory, a linguistic analysis of the query and documents should be able to provide additional constraints on a high-precision search --- constraints that would be unavailable to a purely statistical text analyzer. It is one of our goals to use the natural language learning techniques developed in Kenmore to create a trainable high-precision partial parser that can recognize selected linguistic relationships for the IR system with high reliability.
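To make the notion of vocabulary overlap concrete, the sketch below ranks documents by the Jaccard overlap between the query's word set and each document's word set. This is a deliberate simplification and an assumption on our part: production IR systems typically use weighted term vectors rather than raw set overlap, and the function names and sample texts here are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, document):
    """Jaccard overlap between the vocabularies of a query and a document."""
    q, d = tokens(query), tokens(document)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def rank(query, documents):
    """Return document names ordered from highest to lowest overlap with the query."""
    return sorted(documents, key=lambda name: overlap_score(query, documents[name]), reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "Machine learning for part-of-speech tagging",
        "b": "Cooking pasta at home",
        "c": "Learning to tag text with machine learning methods",
    }
    print(rank("machine learning tagging", docs))
    # Word order is invisible to a bag-of-words measure.
    print(overlap_score("dog bites man", "man bites dog"))
```

The last call shows the limitation that motivates linguistic analysis: "dog bites man" and "man bites dog" share exactly the same vocabulary, so an overlap measure cannot tell them apart, whereas a parser that recognizes subject and object relationships could.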

A second direction of research is to develop user-trainable information extraction systems. Once a document is found, users need automated methods for extracting the relevant information from the text in a usable form. While such domain-dependent information extraction systems can be built, tailoring the system for each new domain is difficult, time-consuming, and error-prone, and invariably requires the expertise of computational linguists familiar with the underlying language-processing system. We plan to use natural language learning techniques as the central component in a system that will allow end-users to train information extraction systems for their own domains in a matter of days and without the intervention of NLP system developers.