Fernando Pereira
AT&T Bell Labs, Murray Hill, New Jersey, USA
The complex hidden structure of natural-language sentences is manifested in two different ways: predictively, in that not every constituent (for example, word) is equally likely in every context, and evidentially, in that the information carried by a sentence depends on the relationships among the constituents of the sentence. Depending on the application, one or the other of those two facets may play a dominant role. For instance, in language modeling for large-vocabulary connected speech recognition, it is crucial to distinguish the relative likelihoods of possible continuations of a sentence prefix, since the acoustic component of the recognizer may be unable to distinguish reliably between those possibilities just from acoustic evidence. On the other hand, in applications such as machine translation or text summarization, relationships between sentence constituents, such as that a certain noun phrase is the direct object of a certain verb occurrence, are crucial evidence in determining the correct translation or summary. Parsing is the process of discovering analyses of sentences, that is, consistent sets of relationships between constituents that are judged to hold in a given sentence, and, concurrently, what the constituents are, since constituents are typically defined inductively in terms of the relationships that hold between their parts.
It would not be possible to model or parse sentences without mechanisms to compute the properties of larger constituents from the properties of their parts, appropriately defined, since the properties of new sentences, which are unlikely to have been seen before, can only be inferred from knowledge of how their parts participate in the sentences we have observed previously. While this point may seem obvious, it has deep consequences both in language modeling and parsing. Any language model or parser must include a generative mechanism or grammar that specifies how sentences are built from their parts, and how the information associated to the sentence derives from the information associated to its parts. Furthermore, to be able to cope with previously unseen sentences, any such system must involve generalization with respect to the data from which the language model or parser was developed.
It is useful to think of the grammar in a language model or parser as the specification of a configuration space in which the configurations represent stages of constituent combination, and transitions between configurations describe how constituents are combined in deriving larger constituents. For instance, the configurations may be the states of a finite-state machine and the transitions represent how words may be appended to the end of a sentence prefix. In the more complex case of phrase-structure grammars, configurations represent sequences of phrases ( sentential forms), and transitions the possible combinations of adjacent phrases into larger phrases. A derivation of a sentence according to the grammar is a path in the configuration space of the grammar from an initial configuration to a final configuration in which all the elementary constituents are consumed. (We will call such elementary constituents words in what follows, although the informal notion of word may not correspond to the appropriate technical definition, especially in languages with complex morphology.)
Even with respect to procedural parsers and language models which are not normally described as containing a grammar,
such as certain deterministic [Mar80,Hin93] and
probabilistic parsers
[BJL
93,MM91], it is useful
to identify the implicit grammar defined by the possible
derivation moves which the parser can use under the control of its
control automaton. For instance, in a parser based on a
pushdown automaton such as a shift-reduce parser [Shi83,Per85], the grammar
corresponds to the possible transitions between stack and input
configurations, while the automaton's finite-state control determines
which transitions are actually used in a derivation.
The choice of a grammar for a particular parsing or language modeling application involves two conflicting requirements: precision and coverage. By precision we mean how well the grammar encodes constraints on possible sentences and possible meaningful relationships carried by those sentences. By coverage we mean what proportion of actual sentences have a reasonable derivation in the grammar. We are interested in precision because a more precise grammar is better able to rule out bad sentence hypotheses in predictive tasks and bad meaningful relationship hypotheses in evidential tasks. We are interested in coverage so that our systems will handle appropriately a wide range of actual spoken or written language. But as we increase precision by encoding more constraints in the grammar, we tend to lose coverage of those actual sentences that violate some of the constraints while being still acceptable to language users. The reason for the problem is that the most powerful constraints are idealizations of the actual performance of language users. The tension between precision and coverage is central to the design tradeoffs we will now survey.
We see thus a sentence parser or language model as consisting of a grammar and a search procedure which, given an input sentence, will apply transitions specified by the grammar to construct derivations of the sentence and associated analyses. In cases where the input sentence is uncertain, such as speech recognition, we may further generalize the above picture to a simultaneous search of the configuration space for the grammar and of a space of sentence hypotheses, represented for instance as a word lattice [MBDW93].
The computational properties of parsers and language models depend on two main factors: the structure of the search space induced by the grammar, and the exhaustiveness of the search procedure.
Under the above definition of grammar, transitions from a
configuration may have to take into account the whole
configuration. However, most useful grammar classes have some degree
of locality in that transitions involve only on a
bounded portion of a configuration. In that case, derivations can be
factored into sub-derivations concerning independent parts of
configurations, allowing independent sub-derivations to be shared
among derivations, for potentially exponential reductions in the size
of the search space. The search algorithm can then tabulate
each sub-derivation and reuse it in building any derivation that
shares that sub-derivation.
Such tabular algorithms are
widely used in parsing and language modeling with appropriate kinds of
grammars
[You67,Kay86,Ear70,Lan74,GHR80,Tom87],
because they support exhaustive search algorithms with
polynomial space and time with respect to sentence
length. Furthermore, tabular algorithms can be readily
extended to dynamic programming algorithms to search
for optimal derivations with respect to appropriate evaluation
functions on derivations, as we will see below.
Finite-state grammars have a straightforward tabular algorithm in which table entries consist of a state and an input position (such a table is called a trellis in the speech recognition literature). Context-free grammars are the standard example of a phrase structure grammar class whose derivations can be tabulated. In a bottom-up (from words to sentences) derivation for a context-free grammar, the portion of the derivation that corresponds to the recognition of a constituent labeled by a given nonterminal can be simply represented by a table entry giving the nonterminal as a possible label of a substring of the input [You67]. Although this information leaves out the sequence of steps of the actual derivation, all derivations of that phrase can be easily reconstructed from the set of all table entries derivable for a given input string (Pointers can also be used to keep track of what table entries are used in deriving other entries.) [You67,Ear68].
Other grammar classes such as the mildly-context-sensitive grammars [JVSW91] and some constraint-based grammars are also suitable for tabular procedures [Shi92].
Even if the grammar allows tabular search, it may not be
computationally feasible to explore the entire search space because of
the effect of grammar size on search space size. For example, the
simple finite-state language models used in large-vocabulary speech recognition may have millions of states and transitions. Since
each state is potentially considered at each input position, the
computation per word recognized is too large for real-time
performance. Many techniques have been explored in speech recognition
to deal with this problem
[BJM83,KHG
93,Pau92,MBDW93,NSKP93].
In general, the techniques to avoid exploring the entire grammar search space fall into two main classes, pruning and admissible search. In pruning, an evaluation function applied to configurations determines whether they will be expanded further. Since the evaluation function cannot predict the future (to do so accurately it would have to explore the entire search space), pruning may in fact block the correct derivation. The choice of evaluation function is thus a tradeoff between reduced search space (and thus reduced runtime and memory requirements) and the risk of missing the correct analysis (or even every analysis). Currently there is no theory of pruning tradeoffs relating bounds on risk of error to the form of the evaluation function, so the design of evaluation functions is an empirical art.
Although pruning away the correct derivation is as a problem in practical applications, in psycholinguistic modeling it may in fact correspond to failures of human sentence processing, for instance garden paths. Deterministic parsers [Mar80,Hin93] take pruning to an extreme in using elaborate evaluation functions to select exactly one course of derivation. Dead ends are then supposed to model the situations in which human subjects are forced to recover from parsing failures. Other models, particularly those based on neuronal notions of activation and lateral inhibition, may allow a local race between alternative expansions of a configuration but inhibit all but one of the alternatives within a bounded number of steps [Ste93,MH90,Pri88].
Admissible search procedures do not block potential derivations. Instead, they order sub-derivations so that the ones that are more likely to be expanded to the best complete derivations will be considered before less promising ones. The difficulty is of course to define ordering criteria with high probability of reaching the best derivations before exploring a large set of useless configurations. Of particular interest here are A*-type algorithms [Nil80] which expand the configuration with lowest cost estimate, where the estimate is required to be a lower bound of the true cost (under a cost model appropriate for the task, see below) and identical to the true cost for complete derivations. The first complete derivation reached by an A* algorithm is then guaranteed to have the lowest cost. However, since it is difficult to choose cost estimates that narrow the search sufficiently, more aggressive estimates that may overshoot the true cost are often used, with the result that the first complete derivation may not be the best one.
The selection of evaluation functions for pruning or admissible search is clearly closely tied to the precision-coverage tradeoff.
A wide range of grammar classes have been investigated in
parsing and language modeling, depending on the nature of the application and on particular insights on
language structure and sentence distribution. Grammar classes have been characterized along many different
theoretical dimensions. What is known in those areas about certain
important grammar classes is described elsewhere in this document
.
Here, we consider a more informal and empirical dimension of variation that has great impact in the development of parsers and language models: how much of the required predictive and evidential power belongs to the grammar itself and how much resides in the search procedure controlling the use of the grammar. Choices along this dimension often involve philosophical disagreements on whether language is fundamentally governed by an innate system of rules (the rationalist position most closely identified with Chomsky) or rather a system of statistical regularities, associations and constructions derived by learning (the empiricist position informing much work in statistical language modeling). But they also relate to different choices with respect to the coverage/precision tradeoff.
At one end of the spectrum, which is often associated with empiricist work, extremely unconstraining grammars are controlled by search evaluation functions automatically learned from language data. An extreme example are finite-state n-gram grammars, in which states encode information on the last n-1 observed words, have been used with practical success in speech recognition [JMR92]. In these grammars every sequence of words is considered a possible sentence, but probabilities are assigned to state transitions to model the relative likelihoods of different strings. As we will see, the association of probabilities to transitions is a useful technique in a wide range of grammatical settings.
While n-gram grammars have proven very useful for language modeling, derivation steps do not correspond in any direct way to possible meaningful relations in the sentence, for instance, those between a main verb and its arguments. Parsing requires more complex grammars, in which derivation steps are associated to possible relations of interest. Even in language modeling, distributional regularities associated to meaningful relationships may be an important source of additional predictive power [Hin90,HR91,DMM93,LST92].
Grammatical representations of meaningful relationships may be usefully classified into three main classes: linguistic grammars, task-oriented grammars and data-oriented grammars. Linguistic grammars and task-oriented grammars have been in use since the beginning of computational linguistics. Data-oriented grammars, in their finite-state form discussed above, go back to the beginning of statistical studies of language by Markov, but data-oriented grammars capable of representing meaningful relationships have only recently started being investigated.
Most formal linguistic theories have been used at some time or other as the basis for computational grammars. The main issues in applying linguistic theory to the development of computational grammars: coverage, predictive power and computational requirements.
Linguistic theories are typically developed to explain puzzling aspects of linguistic competence, such as the relationships between active and passive sentences, the constraints on use of anaphoric elements, or the possible scopes of quantifying elements such as determiners and adverbs. However, actual language involves a wide range of other phenomena and constructions, such as idioms, coordination, ellipsis, apposition and extraposition, which may not be germane to the issues addressed by a particular linguistic theory or which may offer unresolved challenges to the theory. Therefore, a practical grammar will have to go far beyond the proposals of any given theory to cover a substantial proportion of observed language. Even then, coverage gaps are relatively frequent and difficult to fill, as they involve laborious design of new grammar rules and representations.
Linguistic grammars, being oriented towards the description of linguistic competence, are not intended to model distributional regularities arising from pragmatics, discourse and conventional use that manifest themselves in word and construction choice. Yet those are the regularities that appear to contribute most to the estimation of relatively likelihoods of sentences or analyses. The encoding of distributional predictions must be left thus to the search procedure, in the form of an appropriate evaluation function. However, the configurations generated by a grammar may not carry the most useful information in evaluating them. For example, whether a particular prepositional phrase modifies a direct object noun phrase or the main verb depends heavily on the actual verb, noun, preposition and prepositional object [HR91], but a traditional phrase-structure grammar does not make that information available in the syntactic categories of noun phrase, verb phrase and prepositional phrase. Therefore, in a phrase-structure setting whole derivations rather than individual configurations would have to be evaluated. But this would preclude the factorization of derivations that leads to tractable search as noted above. These considerations explain in part the recent growing interest in lexicalized grammatical frameworks such as dependency grammar [Mel88,Hud90,ST91], slot grammar [McC80,McC89], categorial grammar [Lam58,AS82,Moo88], Head-Driven Phrase-Structure Grammar (HPSG) [PS87] and lexicalized tree-adjoining grammar [Sch90], all of which lead to configurations made up of lexical items and direct relationships between them.
The best formal explanation of a particular aspect of linguistic competence has no necessary correlation with computational efficiency. For instance, modern versions of transformational grammar based on the theory of government and binding or its more recent developments involve either very complex search procedures or very complex compilation procedures into formalisms with better search properties [Sta92,Fon92,Joh92]. Similar problems have been noted with respect to HPSG and certain varieties of categorial grammar.
While direct use of formalized linguistic theories for parsing and language modeling seems computationally problematic, much progress has been made in the development of tractable grammatical formalisms capable of encoding important aspects of linguistic theory. The class of mildly-context sensitive formalisms [JVSW91], of which tree-adjoining grammars [JLT75,Jos85] and combinatory categorial grammar [AS82] are two notable instances, has polynomial-time and space parsing algorithms, and can encode important aspects of transformational and categorial linguistic analysis. Constraint-based grammar formalisms can be intractable or even undecidable in general [Car92b], but special cases of interest are often efficiently parsable [Als92]. For instance, lexical-functional grammar combines a context-free skeleton with constraints describing non-constituent syntactic properties. Although the combination is intractable in general, a carefully designed constraint-application schedule can make it possible to parse with linguistically-plausible grammars in such a way that the intractability does not arise [MK89].
However, even polynomial-time algorithms may not be sufficiently fast
for practical applications, given effect of grammar size on parsing
time. Search reduction techniques like those described in section
would then be needed to keep performance
within reasonable bounds, at the risk of worse coverage.
For most current applications in text summarization,
information retrieval and speech understanding, the predictive and evidential power of a general-purpose grammar and a
general control mechanism are insufficient for reasonable performance
in the task. Furthermore, even when parameters of the grammar and
control mechanism can be learned automatically from training corpora, the required corpora do not exist or are too small for
proper training. The alternative is then to devise grammars that
specify directly how relationships relevant to the task may be
expressed in natural language. For instance, one may use a
phrase-structure grammar in which nonterminals stand for task concepts
and relationships (for example, flight or leave in an airline
reservation task) and rules specify possible expressions of those
concepts and relationships [Sen92,War91b]. Such
semantic grammars have often been used for
database access tasks. More generally, a
knowledge-representation language (for instance, a
frame language) can be used to specify the possible
relationships between concepts, and relatively low-power grammatical
descriptions (often finite-state) describe natural-language
expressions that give strong evidence for concepts and relationships
[JR93,HAB
93].
Task-oriented grammars provide very strong guidance to a parser, but
that guidance is bought at the expense of generality and coverage,
since the detailed specifications they rely on may often fail to fit
naturally-occurring language. Therefore, parsing algorithms for
task-oriented grammars are usually allowed to relax the grammar
by ignoring portions of the input that do not fit the given grammar
[War91a,JAB
91]. This can increase coverage
usefully in applications such as limited-domain speech understanding
and text-summarization, where there are very strong
expectations of what are the relevant inputs, but the increase of
coverage is in general at the expense of precision.
In so far as linguistic grammars and task-oriented grammars provide strong constraints for modeling and parsing, they risk low coverage because the constraints limit the transitions between configurations, and thus the availability of derivations for strings. In a task-oriented setting this problem can be alleviated, as we have seen, but as far as we know relaxation is a sensible policy only for highly-constrained tasks. An alternative way of increasing coverage is to start with less constraining grammars, and rely on an evaluation function to select the most likely derivations in the more densely connected search space that results from the less constraining grammar. However, this requires finding an appropriate evaluation function. In a data-oriented framework, a learning or training procedure tries to determine the evaluation function that produces best results on an appropriate training corpus. For example, an n-gram grammar allows any word sequence, but transitions are given probabilities derived from how often states were reached and transitions crossed running the grammar over a training corpus [JMR92]. As another example, frequencies of rule and nonterminal use can be used to estimate rule probabilities for an underconstrained context-free grammar [Bak79,LY90,PS92].
Although there have been some successes in training evaluation functions for previously-designed grammars
[FJC
89,BLR92], training with respect to
a fixed grammar has the problem that either the grammar allows many
transitions that are never observed in reality, forcing the evaluation
function to be more complex to rule them out effectively, or it is too
restrictive and does not allow transitions that actually occur. That
difficulty has motivated the investigation of grammatical frameworks
and learning algorithms that will concurrently learn a grammar and an
appropriate evaluation function
[SHA89,Bod93,Hin92,SO93].
One particular class of such procedures constructs a
dictionary of commonly observed substrings or sub-analyses
that can be combined by a small set of rules to yield the observed
sentences or analyses, with the evaluation function discriminating
between alternatives ways of reconstructing a sentence or analysis
from the fragments in the dictionary
[Bod93,Hin92]. A variety of predictive power and
grammar size criteria (for example, Bayesian,
minimum-description length) may then used to find good tradeoffs
between grammar (dictionary) size, prediction of the training set, and
generalization to new material.
In the overall view of parsing and language modeling given above, a parser or language model searches the configuration space defined by a grammar for possible derivations of the sentence(s) under analysis. Since the grammar by itself is unlikely to encode all the semantic, pragmatic and discourse information relevant to distinguishing plausible analyses from implausible ones, the search needs to be guided by an evaluation function that assigns plausibility scores to derivations. An especially important case is that of probabilistic grammars, which associate to each transition the conditional probability of taking that transition from a configuration given that the configuration was reached. Such grammars are based on a Markovian or conditional independence assumption that the probability of a (partial) derivation depends just on its penultimate configuration and the transition taken from it. Then the probability of a derivation is just the product of the probability of its initial configuration by the product of the probabilities of the transitions in the derivation.
When transitions are directly associated to observable events (for example, extension of a partial sentence by one word in a finite-state model), transition probabilities can be estimated by simply counting the number of times the transition is taken for all possible derivations of all sentences in a training corpus. In general, however, the transition probabilities are not associated to directly observable events. In that case iterative procedures may be used to find a the transition probabilities that maximize the probability that the training corpus was observed [DLR77,BP66,Bak79]. For language modeling, the training corpus may just be a set of sentences, while for parsing a set of sentences tagged with constraints on possible grammatical relationships (for example, phrase boundaries) is often preferable [BLR92,PS92].
While probabilistic evaluation functions dominate in language modeling, where they are used to estimate the likelihood that a certain word sequence was uttered, other types of evaluation function are often used in parsing, especially those based on the degree of agreement of the best scoring analyses and analyses in a training corpus [AC94].
Computationally, the critical property of an evaluation function is whether it is compatible with tabular algorithms for searching the derivation space, in the sense that the score of a derivation is determined by the scores of the subderivations into which the derivation is factored by tabulation. For probabilistic functions, this amounts to a strengthened Markovian condition for derivations, which for instance is satisfied by stochastic context-free grammars [BT73,Bak79], certain kinds of parsers for constraint-based grammars [BC93] and stochastic tree-adjoining grammars [Sch92]. In such cases, the tabular search algorithms can converted into dynamic programming algorithms [Tei73,Bak79,Lan89,JLM90,LST92,Sch92] to search efficiently for best-scoring derivations.
The issue that dominates current work in parsing and language modeling is to design parsers and evaluation functions with high coverage and precision with respect to naturally occurring linguistic material (for example, news stories, spontaneous speech interactions). Simple high-coverage methods such as n-gram models miss the higher-order regularities required for better prediction and for reliable identification of meaningful relationships, while complex hand-built grammars often lack coverage of the tail of individually rare but collectively frequent sentence structures (cf. Zipf's law). Automated methods for grammar and evaluation function acquisition appear to be the only practical way to create accurate parsers with much better coverage. The challenge is to discover how to use linguistic knowledge to constrain that acquisition process.