Antonio Sanfilippo
Sharp Industries of Europe, Oxford, UK
The intelligent processing of natural language for real world applications requires lexicons which provide rich information about morphological, syntactic and semantic properties of words, are well structured and can be efficiently implemented [Bri92]. These objectives can be achieved by developing tools which facilitate the acquisition of lexical information from machine-readable dictionaries and text corpora, together with database technologies and theories of word knowledge which encode the acquired information in a form suitable for NLP purposes. In the last decade, there has been a growing tendency to use unification-based grammar formalisms [Kay79,KB82,PS87,PS94,ZKC87] to carry out the task of building such lexicons. These grammar formalisms encode lexical descriptions as feature structures, with inheritance and unification as the two basic operations relating these structures to one another. The use of inheritance and unification is appealing from both engineering and linguistic points of view, as these operations can be formalized in terms of lattice-theoretic notions [Car92b] which are amenable to efficient implementation and well suited to expressing the hierarchical nature of lexical structure. Likewise, feature structures have a clear mathematical and computational interpretation and provide an ideal data structure for encoding complex word knowledge.
Informally, a feature structure is a set of attribute-value pairs, where values can be atomic or feature structures themselves, providing a partial specification of words, affixes and phrases. Inheritance makes it possible to arrange feature structures into a subsumption hierarchy so that information which is repeated across sets of word entries need only be specified once [Fli87,PS87,San93]. For example, properties which are common to all verbs (e.g., part of speech, presence of a subject) or to subsets of the verb class (e.g., presence of a direct object for verbs such as amuse and put; presence of an indirect object for verbs such as go and put) can be defined as templates. Unification provides the means for integrating inherent and inherited specifications of feature structure descriptions.
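The notions of feature structure, template and subsumption can be illustrated with a minimal sketch, assuming feature structures are modeled as nested Python dictionaries. The attribute names (cat, args, orth), the templates VERB and TRANSITIVE, and the helper make_entry are hypothetical and chosen only for illustration; they are not taken from any of the formalisms cited above, and reentrancy (structure sharing) is not modeled.

```python
class UnificationError(Exception):
    """Raised when two feature structures carry incompatible values."""


def unify(a, b):
    """Combine all the information in a and b, failing on a clash of atoms."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for attr, value in b.items():
            result[attr] = unify(result[attr], value) if attr in result else value
        return result
    if a == b:
        return a
    raise UnificationError(f"cannot unify {a!r} with {b!r}")


def subsumes(general, specific):
    """True if every piece of information in `general` is also in `specific`."""
    if isinstance(general, dict):
        return isinstance(specific, dict) and all(
            attr in specific and subsumes(value, specific[attr])
            for attr, value in general.items()
        )
    return general == specific


# Hypothetical templates: information shared by all verbs is stated once in
# VERB; TRANSITIVE inherits it and adds a direct object.
VERB = {"cat": "verb", "args": {"subject": "np"}}
TRANSITIVE = unify(VERB, {"args": {"direct_object": "np"}})


def make_entry(template, **inherent):
    """Integrate inherited (template) and inherent (entry-specific) information."""
    return unify(template, inherent)


amuse = make_entry(TRANSITIVE, orth="amuse")
```

In this sketch the entry for amuse states only its orthography; its part of speech and argument structure come from the templates, and subsumes(VERB, amuse) holds because the entry is at least as specific as the verb template.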
In general, unification is monotonic: all information, whether inherently specified or inherited, is preserved. Consequently, a valid lexical entry can never contain conflicting values. Unification thus provides a way to perform a consistency check on lexical descriptions. For example, the danger of inadvertently assigning distinct orthographies or parts of speech to the same word entry is easily avoided, as the unification of incompatible information leads to failure. An even more stringent regime of grammar checking has recently been made available through the introduction of typed feature structures [Car92b]. Through typing, feature structures can be arranged into a closed hierarchy so that two feature structures unify only if their types have a common subtype. Typing is also used to specify exactly which attributes are appropriate for a given feature structure, so that arbitrary extensions of feature structures are ruled out.
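The two checks contributed by typing can be sketched as follows, under the assumption that the hierarchy is a small finite table. The type names, the attributes and both tables (SUPERTYPES, APPROPRIATE) are hypothetical; the sketch tests only type compatibility and attribute appropriateness and does not implement typed unification as defined in [Car92b].

```python
# Hypothetical closed type hierarchy: each type lists its immediate supertypes.
SUPERTYPES = {
    "sign": [],
    "word": ["sign"],
    "phrase": ["sign"],
    "verb": ["word"],
    "noun": ["word"],
    "trans-verb": ["verb"],
}

# Hypothetical appropriateness conditions: attributes introduced at each type
# and inherited by all of its subtypes.
APPROPRIATE = {
    "sign": {"orth"},
    "word": {"cat"},
    "verb": {"subject"},
    "trans-verb": {"direct_object"},
}


def ancestors(t):
    """Return t together with all of its supertypes."""
    seen = {t}
    for parent in SUPERTYPES[t]:
        seen |= ancestors(parent)
    return seen


def common_subtypes(t1, t2):
    """All types that are subtypes of both t1 and t2 (a type counts as its own subtype)."""
    return {t for t in SUPERTYPES if {t1, t2} <= ancestors(t)}


def types_compatible(t1, t2):
    """Structures of types t1 and t2 can unify only if a common subtype exists."""
    return bool(common_subtypes(t1, t2))


def appropriate_attrs(t):
    """Attributes a structure of type t may carry."""
    attrs = set()
    for a in ancestors(t):
        attrs |= APPROPRIATE.get(a, set())
    return attrs


def well_formed(t, fs):
    """Reject structures that carry attributes not appropriate for their type."""
    return set(fs) <= appropriate_attrs(t)
```

Under these assumed tables, verb and noun have no common subtype and so could never unify, while a structure of type verb that carries a direct_object attribute is rejected because that attribute is introduced only at trans-verb.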
A relaxation of monotonicity, however, is sometimes useful in order to capture regularities across the lexicon. For example, most irregular verbs in English follow the same inflectional patterns as regular verbs with respect to present and gerundive forms, while differing in the simple past and/or past participle. It would therefore be convenient to state that all verbs inherit the same regular morphological paradigm by default and then let the idiosyncratic specifications of irregular verbs override inherited information which is incompatible.
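Default inheritance of a morphological paradigm can be sketched as below. This is an illustration under simplifying assumptions: the regular forms are built by naive suffixation (spelling adjustments such as e-deletion or consonant doubling are ignored), and the attribute names and functions are hypothetical.

```python
def regular_paradigm(base):
    """Default inflected forms, built by naive suffixation of the base form."""
    return {
        "base": base,
        "third_sg_present": base + "s",
        "gerund": base + "ing",
        "past": base + "ed",
        "past_participle": base + "ed",
    }


def inherit_by_default(defaults, inherent):
    """Inherent specifications override incompatible inherited defaults."""
    return {**defaults, **inherent}


def verb_forms(base, **irregular):
    """Build a paradigm from the regular defaults plus idiosyncratic forms."""
    return inherit_by_default(regular_paradigm(base), irregular)


walk = verb_forms("walk")
sing = verb_forms("sing", past="sang", past_participle="sung")
```

The entry for sing specifies only its two irregular forms; the present and gerundive forms are inherited from the regular paradigm, exactly as for walk.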
Default inheritance in the lexicon is desirable to achieve compactness and simplicity in expressing generalizations about various aspects of word knowledge [Fli87,Gaz87], but it can be problematic if used in an unconstrained manner. For example, it is well known that multiple default inheritance can lead to situations which can only be resolved ad hoc or nondeterministically when conflicting values are inherited from the parent nodes [THT87]. Although a number of proposals have been made to solve these problems, a general solution is still not available, so the use of default inheritance must be tailored to specific applications.
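The conflict that arises with multiple default inheritance can be made concrete with a small sketch. The hierarchy, the node names and the plural_suffix attribute are invented for illustration only; the code detects the conflict and reports it, and does not implement any of the resolution strategies proposed in the literature.

```python
# Hypothetical default values stated at each node of the hierarchy.
DEFAULTS = {
    "noun": {"cat": "noun"},
    "s-plural": {"plural_suffix": "s"},
    "en-plural": {"plural_suffix": "en"},
}

# Hypothetical hierarchy: the last entry inherits from two parent nodes.
PARENTS = {
    "noun": [],
    "s-plural": ["noun"],
    "en-plural": ["noun"],
    "ambiguous-entry": ["s-plural", "en-plural"],
}


def inherited_values(node, attr):
    """Collect the nearest value for attr along every inheritance path."""
    if attr in DEFAULTS.get(node, {}):
        return {DEFAULTS[node][attr]}
    values = set()
    for parent in PARENTS[node]:
        values |= inherited_values(parent, attr)
    return values


def resolve(node, attr):
    """Return the inherited value, or fail if the parent nodes disagree."""
    values = inherited_values(node, attr)
    if len(values) > 1:
        raise ValueError(
            f"{node}: conflicting defaults for {attr}: {sorted(values)}"
        )
    return next(iter(values), None)
```

Here resolve("ambiguous-entry", "cat") succeeds because both paths agree, whereas resolve("ambiguous-entry", "plural_suffix") fails: nothing in the hierarchy itself favors one parent over the other, so any choice would have to be stipulated.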
Another difficult task in lexicon implementation, perhaps the most important with regard to grammar processing, concerns the treatment of lexical ambiguity. Lexical ambiguity can be largely related to our ability to generate appropriate uses of words in context by manipulation of semantic and/or syntactic properties of words. For example, accord is synonymous with either agree or give/grant depending on its valency, move can also be interpreted as a psychological predicate when used transitively with a sentient direct object, and enjoy can take either a noun or verb phrase complement when used in the experience sense:
Examples of this sort show that our ability to extend word use in context is often systematic or conventionalized. Traditional approaches to lexical representation assume that word use extensibility can be modeled by exhaustively describing the meaning of a word through closed enumeration of its senses. Word sense enumeration provides highly specialized lexical entries, but