next up previous contents index
Next: 3.8 References Up: 3 Language Analysis and Previous: 3.6 Sentence Modeling and

3.7 Robust Parsing

Ted Briscoe
Computer Laboratory, Cambridge University, Cambridge, UK

Despite over three decades of research effort, no practical domain-independent parser of unrestricted text has been developed. Such a parser should return the correct or a useful close analysis for 90% or more of input sentences. It would need to solve at least the following three problems, which create severe difficulties for conventional parsers utilizing standard parsing algorithms with a generative grammar:

  1. chunking, that is, appropriate segmentation of text into syntactically parsable units;
  2. disambiguation, that is, selecting the unique semantically and pragmatically correct analysis from the potentially large number of syntactically legitimate ones returned; and
  3. undergeneration, or dealing with cases of input outside the systems' lexical or syntactic coverage.
Conventional parsers typically fail to return any useful information when faced with problems of undergeneration or chunking and rely on domain-specific detailed semantic information for disambiguation.

The problem of chunking is best exemplified by text sentences (beginning with a capital letter and ending with a period) which---and this sentence is an example---contain text adjuncts delimited by dashes, brackets or commas which may not always stand in a syntactic relation with surrounding material. There has been very limited work on this issue---[Hin83a] describes a system which copes with related problems, such as false starts and restarts in transcribed spontaneous speech, whilst [Jon94] describes a parser which makes limited use of punctuation to constrain syntactic interpretation. Nevertheless, an analysis of the 150K word balanced Susanne Corpus [Sam94] reveals that over 60% of sentences contain internal punctuation marks and of these around 30% contain text-medial adjuncts. Thus the problem is significant, and further research is required building on linguistic accounts of punctuation [Nun90].

Disambiguation using knowledge-based techniques requires the specification of too much detailed semantic information to yield a robust domain-independent parser. Yet analysis of the Susanne Corpus with a crude parser suggests that over 80% of sentences are structurally ambiguous. Several parsers yielding a single canonical parse have been developed [Mar80,Hin83b,dM90]. These are often applied to a (partially) disambiguated sequence of lexical syntactic categories. Simplifying the input to the parser in this way circumvents many problems of lexical coverage suffered by systems which require rich sets of syntactic subcategories encoding for example valency of verbs [Jen91] as well as capitalizing on the relative success and practicality of lexical category disambiguation. Canonical parsers often represent many ambiguities implicitly [MHF83], rather than enumerating possible analyses, and use heuristic disambiguation rules [Hin89]. Such techniques have yielded useful parsers for limited domains but their development is labour intensive and few general principles for their construction have emerged. In recent attempts to manually construct large treebanks of parsed texts, canonical parsing has been used as a first but small step of disputed merit [MHF83,LG91].

The availability of treebanks and, more generally, large bodies of machine-readable textual data has provided impetus to statistical approaches to disambiguation. Some approaches use stochastic language modeling inspired by the success of HMM-based lexical category disambiguation. For example, probabilities for a probabilistic version of a context-free grammar (PCFG) can be (re-)estimated from treebanks or plain text [FJC89,SJM90] and used to efficiently rank analyses produced by minimally-modified tabular parsing algorithms. These techniques yielded promising results but have been largely supplanted by statistical parse decision techniques in which the probabilistic model is sensitive to details of parse context [MW92,BC93,BLR92] and integrated more closely with the parsing algorithm than the grammar. These systems have yielded results of around 75% accuracy in assigning analyses to (unseen) test sentences from the same source as the unambiguous training material. The barrier to improvement of such results currently lies in the need to use more discriminating models of context, requiring more annotated training material to adequate estimate the parameters of such models. This approach may yield a robust automatic method for disambiguation of acceptable accuracy, but the grammars utilized still suffer from undergeneration, and are labour-intensive to develop.

Undergeneration is a significant problem, in one project, a grammar for sentences from computer manuals containing words drawn from a restricted vocabulary of 3000 words which was developed over three years still failed to analyze 4% of unseen examples [BLR92]. This probably represents an upper bound using manual development of generative grammars; most more general grammars have far higher failure rates in this type of test. Early work on undergeneration focussed on knowledge-based manual specification of error rules or rule relaxation strategies [KS81,JH83]. This approach, similar to the canonical parse approach to ambiguity, is labour-intensive and suffers from the difficulty of predicting the types of error or extragrammaticality liable to occur. More recently, attempts have been made to use statistical induction to learn the correct grammar for a given corpus of data, using generalizations of HMM maximum-likelihood re-estimation techniques to PCFG [LY90]. This extends the application of stochastic language modeling from disambiguation to undergeneration by assuming the weakest grammar for a given category set---that is, the one which contains all possible rules that can be formed for that category set---and using iterative re-estimation of the rule probabilities to converge on the subset of these rules most appropriate to the description of the training corpus.

There are several inherent problems with these statistical techniques which have been partially addressed by recent work. Re-estimation involves considering all possible analyses of each sentence of the training corpus given an (initially) weak grammar, the search space is large and the likelihood of convergence on a useful grammar is low. [PS92,SRO93] show that constraining the analyses considered during re-estimation to those consistent with manual parses of a treebank reduce computational complexity and leads to a useful grammar. [BW93,Bri94] demonstrate that similar results can be obtained by imposing general linguistic constraints on the initial grammar and biasing initial probabilities to favour linguistically motivated core rules, while still training on plain text. Nevertheless, such techniques are currently limited to simple grammars with category sets of a dozen or so non-terminals or to training on manually parsed data. The induced PCFG can also be used to rank parses and results of around 80% fit between correct and automatically-generated analyses have been obtained. It is not possible to directly compare these results with those from pure disambiguation experiments, but there is no doubt that although these systems are achieving 100% or very close grammatical coverage, the use of the resulting PCFG language model for disambiguation only yields fully correct analyses in around 30% of cases.



next up previous contents
Next: 3.8 References Up: 3 Language Analysis and Previous: 3.6 Sentence Modeling and