Chapter 3: Language Analysis and Understanding
Annie Zaenen & Hans Uszkoreit
Rank Xerox Research Centre, Grenoble, France
Deutsches Forschungszentrum für Künstliche Intelligenz
and Universität des Saarlandes, Saarbrücken, Germany
We understand larger textual units by combining our understanding of smaller ones. The main aim of linguistic theory is to show how these larger units of meaning arise out of the combination of the smaller ones. This combination is modeled by means of a grammar. Computational linguistics then tries to implement the process in an efficient way. It is traditional to subdivide the task into syntax and semantics, where syntax describes how the different formal elements of a textual unit, most often the sentence, can be combined, and semantics describes how the interpretation is calculated.
In most language technology applications the encoded linguistic knowledge, i.e., the grammar, is separated from the processing components. The grammar consists of a lexicon and of rules that syntactically and semantically combine words and phrases into larger phrases and sentences. A variety of representation languages have been developed for the encoding of linguistic knowledge. Some of these languages are geared more towards conformity with formal linguistic theories; others are designed to facilitate certain processing models or specialized applications.
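The separation of grammar from processing described above can be illustrated with a minimal sketch. It assumes a toy lexicon and three binary rules invented for this example (none of them come from an actual system), and it uses a plain CYK recognizer as a stand-in for the processing component. The names `LEXICON`, `RULES`, and `recognize` are hypothetical.

```python
from itertools import product

# Declarative grammar: pure data, no processing logic (toy, hypothetical).
LEXICON = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "sees": {"V"},
}
# Binary rules: (left child, right child) -> set of parent categories.
RULES = {
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}

def recognize(words, lexicon, rules, start="S"):
    """Generic CYK recognizer; it contains no grammar-specific knowledge."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the categories that span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= rules.get((left, right), set())
    return start in chart[0][n]
```

Because `recognize` takes the lexicon and rules as arguments, the same processing code can be reused with a different grammar, for instance `recognize("the dog sees the cat".split(), LEXICON, RULES)`.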
Several language technology products that are on the market today employ annotated phrase-structure grammars, that is, grammars with several hundred or several thousand rules describing different phrase types. Each of these rules is annotated with features and sometimes also with expressions in a programming language. When such grammars reach a certain size they become difficult to maintain, to extend and to reuse. The resulting systems might be sufficiently efficient for some applications, but they lack the speed of processing needed for interactive systems (such as applications involving spoken input) or for systems that have to process large volumes of text (as in machine translation).
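To make the idea of a feature-annotated phrase-structure rule concrete, here is a small sketch. It is not taken from any of the products mentioned. The categories, the single `num` feature, the rule functions, and the three-word sentence format are all assumptions chosen for illustration.

```python
# Toy lexicon: each word maps to (category, features). All entries hypothetical.
LEXICON = {
    "the": ("Det", {}),
    "dog": ("N", {"num": "sg"}),
    "dogs": ("N", {"num": "pl"}),
    "barks": ("V", {"num": "sg"}),
    "bark": ("V", {"num": "pl"}),
}

def unify(f, g):
    """Merge two flat feature dicts; return None if a shared feature conflicts."""
    merged = dict(f)
    for key, value in g.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged

def np_rule(det, noun):
    """NP -> Det N, annotated: the NP carries the unified features of Det and N."""
    if det[0] != "Det" or noun[0] != "N":
        return None
    feats = unify(det[1], noun[1])
    return None if feats is None else ("NP", feats)

def s_rule(np, verb):
    """S -> NP V (intransitive), annotated: subject and verb must agree."""
    if np is None or np[0] != "NP" or verb[0] != "V":
        return None
    feats = unify(np[1], verb[1])
    return None if feats is None else ("S", feats)

def parse(sentence):
    """Analyze a three-word 'Det N V' sentence; returns None on failure."""
    det, noun, verb = (LEXICON[w] for w in sentence.split())
    return s_rule(np_rule(det, noun), verb)
```

With these assumptions, `parse("the dog barks")` succeeds while `parse("the dogs barks")` fails on the number feature. With hundreds or thousands of such rules, each carrying its own annotations, the maintenance problem noted above becomes easy to imagine.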
In current research, a certain polarization has taken place. At one end of the scale, very simple grammar models are employed, e.g., different kinds of finite-state grammars that support highly efficient processing. Some approaches do away with grammars altogether and use statistical methods to find basic linguistic patterns. These approaches are discussed in a separate section.
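As a rough illustration of why finite-state grammars are attractive for efficient processing, the sketch below finds simple noun phrases with a regular pattern over part-of-speech tags. The tag lexicon, the one-letter tag set, and the function name `chunk_noun_phrases` are assumptions made for this example; Python's `re` module stands in for a proper finite-state toolkit, and the pattern uses only regular operators.

```python
import re

# Toy part-of-speech lexicon (hypothetical). Each tag is a single letter so
# that a tag sequence can be matched as a plain string.
TAGS = {
    "the": "D", "a": "D",
    "old": "A", "big": "A",
    "dog": "N", "cat": "N",
    "sees": "V", "chases": "V",
}

# Regular pattern for a simple noun phrase: an optional determiner,
# any number of adjectives, then one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")

def chunk_noun_phrases(sentence):
    """Return the noun-phrase chunks found in a lowercase sentence."""
    words = sentence.split()
    # Unknown words get the tag "X", which the pattern never matches.
    tag_string = "".join(TAGS.get(word, "X") for word in words)
    # One tag character per word, so match offsets are word indices.
    return [
        " ".join(words[m.start():m.end()])
        for m in NP_PATTERN.finditer(tag_string)
    ]
```

For example, `chunk_noun_phrases("the old dog chases a cat")` yields the two chunks "the old dog" and "a cat". The pattern cannot describe nested or recursive structure, which is the price paid for this kind of simplicity.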
At the other end of the scale, we find a variety of powerful, linguistically sophisticated representation formalisms that facilitate grammar engineering. An exhaustive description of the current work in that area would be well beyond the scope of this overview. The most prevalent family of grammar formalisms currently used in computational linguistics, constraint-based formalisms, is described briefly in a separate section. Approaches to lexicon construction inspired by the same view are described in a section of their own.
Recent developments in the formalization of semantics are discussed in a further section, as are the computational issues related to different types of sentence grammars. Another section evaluates how successful the different techniques are in providing robust parsing results, and yet another addresses the issues raised when units smaller than sentences need to be parsed.