
A Core Lexical Engine: The Contextual Determination of Word Sense

James Pustejovsky and Branimir Boguraev

Department of Computer Science
Brandeis University
Advanced Technologies Group
Apple Computer, Inc.

CONTACT INFORMATION

James Pustejovsky
Department of Computer Science
258 Volen Center for Complex Systems
Waltham, MA 02254-9110
Phone: (617) 736-2709
Fax: (617) 736-2741
Email: jamesp@cs.brandeis.edu

WWW PAGE

http://www.cs.brandeis.edu/~rllc/rllc.html

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

PROJECT SUMMARY

The goal of this research has been the robust contextual determination of word sense for natural language applications. This involves two related subgoals:
  1. the construction of a Core Lexical Engine for diverse applications, domains, and languages.
  2. the acquisition of new lexical entries and the refinement of existing ones for a particular domain or application, through statistically-based corpus acquisition methods.

The Core Lexical Engine consists of three major components:

The theoretical aspects of the work have focused on consolidating these three components into a scalable formalization. This research has demonstrated that a model of lexical knowledge can be applied to the semi-automatic construction of a Core Lexical Engine, making extensive use of text corpora. The information obtained from machine-readable dictionaries and from large text corpora is equally rich, representative, and important for populating a lexical knowledge base for natural language processing. We are developing a framework for semi-automated lexical acquisition that makes complementary use of both types of lexical resource.

The initial thrust of the project was both on the theoretical foundations of the lexical representation language used in the project, known as Generative Lexicon (GL) Theory (Pustejovsky, 1991, 1995), and on the development of acquisition tools for data mining over corpora. A further central component of our efforts was how the theoretical work ties together with strategies for automatic lexical acquisition from "closed" text corpora.

Subsequent research under the grant focused on the development of the semantic database, CoreLex. CoreLex is a semantic lexicon structured so that it reflects the lexical semantics of a language in systematic and predictable ways, as expressed in the syntax. It embodies most of the principles of Generative Lexicon Theory (Pustejovsky, 1991, 1995), both by representing how senses are related to one another and by operating with underspecified semantic types.

These assumptions distinguish CoreLex fundamentally from existing sources such as Roget and WordNet. Both are useful and extensive semantic lexicons, but neither accounts for regularities between senses or relates semantic information to syntactic form. Roget and WordNet were nevertheless both used in the construction of CoreLex, since they are vast resources of lexical semantic knowledge that can be mined, restructured, and extended (for discussion of this process, cf. Buitelaar, 1996).

For purposes of semantic tagging, CoreLex can be viewed as a Lexical Knowledge Markup Language (LKML). This type language consists of approximately 400 base types, organized in a subsumption type lattice. The top lattice structure of LKML is similar in many respects to other semantic ontologies (e.g., ACQUILEX, ONTOS), with two important distinctions: (1) formal classifications are distinct from functional classifications, and (2) limited multiple typing is allowed. Both of these strategies are linguistically motivated decisions (cf. Pustejovsky, 1995) and have important consequences for how the text is semantically marked up. of CoreLex have been simplified for tagging purposes.
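As a rough illustration of a subsumption type lattice with limited multiple typing, the following minimal Python sketch may help. Everything in it is an assumption made for illustration: the type names, the `PARENTS` table, the `MAX_PARENTS` bound, and the helper functions are hypothetical and are not drawn from the actual CoreLex or LKML inventory. The sketch also does not model the separation of formal from functional classifications.

```python
# Hypothetical fragment of a subsumption lattice. The type names are
# illustrative only and are not taken from the CoreLex/LKML inventory.
PARENTS = {
    "top": [],
    "entity": ["top"],
    "physical_object": ["entity"],
    "information": ["entity"],
    "artifact": ["physical_object"],
    "book": ["artifact", "information"],  # limited multiple typing
}

MAX_PARENTS = 2  # assumed bound standing in for "limited" multiple typing


def ancestors(type_name):
    """Return every type that properly subsumes type_name."""
    seen = set()
    stack = [type_name]
    while stack:
        current = stack.pop()
        for parent in PARENTS[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subsumes(general, specific):
    """True if general is specific itself or one of its ancestors."""
    return general == specific or general in ancestors(specific)


def is_well_formed():
    """Check that parents are declared and no type exceeds MAX_PARENTS."""
    return all(
        len(parents) <= MAX_PARENTS and all(p in PARENTS for p in parents)
        for parents in PARENTS.values()
    )
```

Under these assumptions, `subsumes("information", "book")` and `subsumes("physical_object", "book")` both hold, which is the practical effect of multiple typing: a single entry can be retrieved through more than one branch of the lattice.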

Unlike other recent attempts to construct semantic tag sets, the LKML incorporates compositional semantic rules used for identifying the recursively rich semantic components of the sentence, much as shallow parsing techniques attempt to identify partial fragments of syntactic structure. These rules can be seen as providing a "shallow semantic analysis" of text fragments. The result of base semantic typing and recursive typing (identifying larger semantic tags) over a text is a "semi-structured" text, with less structure than a database file but significantly more information than a plain text file of the kind used in text-based information retrieval. LKML markup is the first step towards delivering automated content-based retrieval over text databases.
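The two passes described above, base typing followed by recursive typing, can be sketched in Python as follows. This is only a toy under stated assumptions: the `LEXICON` and `RULES` tables, the tag names, and the greedy leftmost reduction strategy are all hypothetical and are not the project's actual rule set or algorithm.

```python
# Hypothetical base lexicon and composition rules, for illustration only.
LEXICON = {
    "paid": "transfer_event",
    "fifty": "number",
    "dollars": "currency_unit",
}

# Each rule maps a pair of adjacent tags to a larger semantic tag.
RULES = {
    ("number", "currency_unit"): "money_amount",
    ("transfer_event", "money_amount"): "payment",
}


def base_type(tokens):
    """Pass 1: attach a base tag (or None) to every token."""
    return [(LEXICON.get(token.lower()), token) for token in tokens]


def compose(nodes):
    """Pass 2: merge adjacent tagged nodes until no rule applies."""
    nodes = list(nodes)
    changed = True
    while changed:
        changed = False
        for i in range(len(nodes) - 1):
            pair = (nodes[i][0], nodes[i + 1][0])
            if pair in RULES:
                # Replace the two nodes with one node that nests them.
                nodes[i:i + 2] = [(RULES[pair], [nodes[i], nodes[i + 1]])]
                changed = True
                break
    return nodes


def markup(sentence):
    """Return a semi-structured view of a whitespace-tokenized sentence."""
    return compose(base_type(sentence.split()))
```

Calling `markup("She paid fifty dollars")` leaves the untyped token `She` as it is and groups the remaining three tokens under a single nested node. That is the sense in which the output is "semi-structured": part of the text is typed and nested, and the rest stays plain.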

PROJECT REFERENCES

B. Boguraev and J. Pustejovsky (eds), Corpus Processing for Lexical Acquisition, MIT Press, 1996.

J. Pustejovsky, The Generative Lexicon, MIT Press, 1995.

J. Pustejovsky and B. Boguraev, "Lexical Semantics in Context", in Journal of Semantics, Special Issues on "Lexical Semantics", edited by J. Pustejovsky and B. Boguraev, 1995.

J. Pustejovsky and P. Bouillon, "Aspectual Coercion and Logical Polysemy", in Journal of Semantics, 1995.

J. Pustejovsky and B. Boguraev (eds), Lexical Semantics and the Problem of Polysemy, Oxford University Press, Oxford, 1996.

J. Pustejovsky, B. Boguraev, M. Verhagen, and P. Buitelaar, "Semantic Tagging and Typed Hyperlinking", AAAI Symposium on NLP and the WWW, Stanford, April 1997.

AREA BACKGROUND

One of the most pervasive phenomena in natural language is that of lexical ambiguity. This problem confronts language learners and natural language processing systems alike. Both theoretical and computational linguists are aware of the daunting prospect of accounting for ambiguity. The notion of context enforcing a certain reading of a word, traditionally viewed as selecting for a particular word sense, is central both to global lexical knowledge base design (the issue of breaking a word into word senses) and to the local composition of individual sense definitions. However, current lexicons reflect a particular "static" approach to dealing with this problem: the number of senses within an entry, and the distinctions between them, are "frozen" into a fixed system's lexicon. Furthermore, definitions make no provision for the notion that boundaries between word senses may shift with context, and no lexicon really accounts for any of a range of lexical transfer phenomena.

Traditionally, semantic information in computational lexicons is limited to notions such as selectional restrictions or domain-specific constraints, encoded in a "static" representation. This information is typically used in NLP by a simple knowledge manipulation mechanism limited to the ability to match valences of structurally related words. The most advanced device for imposing structure on lexical information is that of inheritance, both at the object (lexical items) and meta (lexical concepts) levels of the lexicon. We argue that this is an impoverished view of a computational lexicon and that, for all its advantages, simple inheritance lacks the descriptive power necessary for characterizing fine-grained distinctions in the lexical semantics of words.

Our goal is to develop a semantics that offers a richer and more expressive vocabulary for lexical information. In particular, by performing specialized inference over the ways in which aspects of the knowledge structures of words in context can be composed, mutually compatible and contextually relevant lexical components of words and phrases are highlighted. We demonstrate how lexical ambiguity resolution, now an integral part of the same procedure that creates the semantic interpretation of a sentence itself, becomes a process not of selecting from a pre-determined set of senses, but of highlighting certain lexical properties brought forth by, and relevant to, the current context.
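The contrast between selecting a sense and highlighting a property can be made concrete with a small Python sketch. It is a heavy simplification, loosely inspired by the facet-based entries of Generative Lexicon Theory, and it is not the project's implementation: the facet names, the `NOUNS` and `VERB_HIGHLIGHTS` tables, and the `interpret` function are all hypothetical.

```python
# One underspecified entry per noun, instead of a list of enumerated senses.
# Facet names and values are hypothetical and chosen for illustration.
NOUNS = {
    "book": {
        "formal": "physical_object",   # what kind of thing it is
        "content": "information",      # what it carries
        "telic": "read",               # what it is typically used for
        "agentive": "write",           # how it typically comes about
    },
}

# Assumed mapping from a governing verb to the facet it brings forward.
VERB_HIGHLIGHTS = {
    "burn": "formal",
    "believe": "content",
    "enjoy": "telic",
    "finish": "agentive",
}


def interpret(verb, noun):
    """Return the facet of the noun that the verb highlights, with its value."""
    facet = VERB_HIGHLIGHTS[verb]
    return facet, NOUNS[noun][facet]
```

With these tables, `interpret("burn", "book")` brings forward the physical object, while `interpret("enjoy", "book")` brings forward the reading activity. No second "sense" of the noun is stored anywhere: the reading follows from the single entry and the context. In practice a verb such as "finish" could just as well highlight the telic facet, which is why a fixed table like `VERB_HIGHLIGHTS` is only a stand-in for the inference described above.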

AREA REFERENCES

A. Cruse, Lexical Semantics, Cambridge University Press, 1986.

G. Hirst, Semantic Interpretation and the Resolution of Ambiguity, Cambridge University Press, 1987.

J. Pustejovsky, The Generative Lexicon, MIT Press, 1995.

RELATED PROGRAM AREAS

Adaptive Human Interfaces, Usability and User-Centered Design