Postscript Version

Improved Statistical Language Models

Eugene Charniak

Department of Computer Science
Brown University

CONTACT INFORMATION

Department of Computer Science
Brown University Box 1910
Providence RI 02912
Phone: (401) 863-7636
Fax : (401) 863-7657
Email: ec@cs.brown.edu

WWW PAGE

www.cs.brown.edu/~ec

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Statistical Language Processing, Parsing, Corpus-Based Language Learning, Natural Language Processing, Computational Linguistics

PROJECT SUMMARY

Our project is concerned with using and extending Probabilistic Context-Free Grammars (PCFGs) to learn how to parse English. (For some background on PCFGs, see the "Area Background" section.) In this summary we concentrate on three interesting results from the last two years:
  1. a simple technique for grammar induction,
  2. techniques for speeding up PCFG parsing, and
  3. moving beyond standard PCFGs by incorporating word statistics into the probability model.
We look at each of these in turn.

One of the more unexpected results from our research over the last few years is a particularly simple way of learning a context-free grammar from a tree-bank - a corpus of hand-parsed sentences. The basic idea is that one can simply read off the grammar from the parsed sentences to produce a tree-bank grammar.

For example, suppose the corpus included the parse

(S (NP (DET The) (NOUN dog))
   (VP (VERB ate)))
Looking at the top level, we see that since the topmost s sentence expands into a np followed by a vp, there must be the grammar rule s -> np vp. In exactly the same way, there must be a grammar rule np -> det noun for expanding the noun-phrase "the dog." Grammar learning consists of going through the tree-bank reading off the rules, recording them, and counting how often each is used. At the end the counts are turned into the probabilities for each rule.

This is so simple that many people have thought of it before, but all have assumed that grammars induced in this fashion would would work very poorly because they would have very bad coverage, since there would be many rules that do not occur in the tree-bank and thus would be missing from the grammar. Indeed, unbeknownst to us at the time another researcher reported on experiments in which tree-bank grammars perform very poorly. In Charniak96 we showed that, though such grammars do miss many uncommon rules, such rules are so uncommon that they have little effect on the grammar's ability to parse, and that parsing using such grammars is at least as effective as more complicated schemes. We have no explanation of the disparity of these results with the negative results reported by others.

We have also been looking at ways to improve the speed of parsing by using the probabilistic information provided by PCFG. Chart-parsers, a standard mechanism for CFG parsing, use an agenda of partial (or completed) constituents that have been constructed but have not yet been incorporated into larger constituents. For example, we might have constructed the noun-phrase "the dog" but not yet put it into the larger sentence constituent. Several researchers have noted that we could speed up chart-parsing by removing constituents from the agenda according to some figure of merit, an estimate of how likely this constituent is to be part of the correct parse. One standard figure of merit is the constituent's inside probability which is defined as the probability of the words dominated by the constituent given the type of the constituent (e.g., its an np). Another figure of merit is the normalized inside probability. This is the nth root of the inside probability, where n is the number of words in the constituent. This is considered superior because the inside probability by itself tends to get small for long constituents, simply because any long sequence of words is very improbable.

In Caraballo96 and Caraballo-forthcoming we suggest using a metric we call the boundary trigram model. This significantly outperforms earlier metrics, as shown in the following table:
Model % Edges
Inside probability 97.5
Normalized inside 31.6
Boundary trigram 13.9
Here we parsed sentences of length 1-30 to completion and noted how many edges were used by our chart parser. We then used each of the figures of merit, and found how many edges were required for the parser to find parses corresponding to 95% of the total probability for the parse. As can be seen, the boundary trigram model required less than half the number of edges used by the normalized inside method and, while more complicated, it did not require significantly more time per edge, so the time savings were comparable. Furthermore, these numbers were for a grammar that was comparatively easy to parse. The tree-bank grammar described above is much harder - so hard that it proved impossible to parse most sentences to completion, so the above statistics were unavailable. To compare methods on the tree-bank grammar we looked at how many edges were required to find the first parse for average-length sentences (length 18-26). The boundary-trigram method needs about 1800 where the normalized inside requires about ten times more.

The boundary trigram model consists of three components: (1) the inside probability of the constituent (as in the other two models), (2) boundary statistics that indicate the probability that this kind of constituent appears given the words just before and after it; e.g., what is the probability that an np is starting, given that the previous constituent is, say, an aux (reasonably high), or is a det (very low), and (3) an independent estimate of how difficult it should be to find a parse for this sequence of words, based upon a trigram model of the parts of speech for the words.

Lastly we looked at parsing techniques that improve upon the simple PCFG model by collecting and exploiting probabilities that relate syntactic structure to the particular words in the sentence. The basic idea here stems from the common linguistic intuition that constituents have heads - the most important lexical item of the constituent. The head of a noun phrase is the main noun (typically the rightmost), the head of a verb phrase is the main verb, etc. As these examples indicate, heads can, to a first approximation, be found deterministically for each possible structure for a constituent (though a constituent can have more than one head if it can be given several different structures). Heads are inherited from one constituent to the next. For example, the head of a sentence is inherited from the head of the verb phrase that makes up the sentence.

The use of heads allows us to make the probabilities dependent on them. For example, in a standard PCFG each grammar rule has a probability for that rule given only that we are expanding a constituent of the correct type. In our new version the probability is also conditioned on the head word of the constituent. For example, consider the rule vp -> verb np np (as in the sentence "Sue gave Ann the slippers."). The probability of this rule is rather low, .0015 in one of our current grammars. But consider the probability of this rule given one knows that the head of the verb phrase is "gave." Since "give" is a verb that can have two noun-phrase objects, one would expect the conditional probability to be much higher, and it is, .315.

In Charniak97 we show how this and similar techniques allow considerably improved parsing results and compare these results to those of other researchers exploring similar ideas. (We note that Charniak97 was one of four papers selected as best paper for the 1997 AAAI conference.

PROJECT REFERENCES

Caraballo, Sharon, and Charniak, Eugene. Figures of merit for best-first probabilistic chart parsing Proceedings of the Conference on Empirical Methods in Natural Language Processing (1996)

Caraballo, Sharon, and Charniak, Eugene. Figures of merit for best-first probabilistic chart parsing Computational Linguistics (Forthcoming)

Charniak, Eugene et. al. Taggers for parsers Artificial Intelligence, 85 No. 1-2 (1996)

Charniak, Eugene. Tree-bank grammars Proceedings of the Thirteenth National Conference on Artificial Intelligence, AAAI Press/MIT Press (1996)

Charniak, Eugene. Expected-frequency interpolation Technical Report CS96-37, Department of Computer Science, Brown University (1996)

Charniak, Eugene. Statistical parsing with a context-free grammar and word statistics Proceedings of the Fourteenth National Conference on Artificial Intelligence (1997)

Ersan, Murat and Charniak, Eugene. A statistical syntactic disambiguation program and what it learns Symbolic, Connectionist, and Statistical Approaches to Learning for Natural Language Processing, Stefan Wermter, Ellen Riloff, and Gabriele Scheler (Eds) Springer, New York (1996)

AREA BACKGROUND

In this section we give a brief introduction to statistical parsing, as these techniques underlie the research described in the previous section.

Syntactic parsing is the process of assigning a "phrase marker" to a sentence. For example, a parse of "The dog ate" might look like this:

(S (NP (DET The) (NOUN dog))
   (VP (VERB ate)))
Here np stands for "noun phrase," vp for "verb phrase" and det for "determiner." It is generally accepted that finding this sort of structure is useful in determining the meaning of a sentence. For example, consider a sentence like "Salespeople sold the dog biscuits." Here are two possible parses.
(S (NP The salespeople)
   (VP sold (NP the dog biscuits)))

(S (NP The salespeople)
   (VP sold (NP the dog)
	    (NP biscuits)))
Note that the two have different meanings: in the first the salespeople are selling dog biscuits, while in the second they are selling biscuits to dogs. Thus finding the correct parse is analogous to deciding on the correct meaning.

This example also exemplifies a major problem in parsing, syntactic ambiguity - a sentence having two or more parses. In such cases it is necessary for the parser (or the understanding system in which the parser is embedded) to choose the correct one among the possible parses.

This example is, however, misleading in one fundamental respect: it implies that we can assign a meaning to each of the possible parses. For most grammars (certainly for the ones we have been dealing with), this is not the case. Our grammars would assign dozens, possibly hundreds of parses to this sentence, ranging from the reasonable to the uninterpretable, with the majority at the uninterpretable end of things. To take but one example, our grammar has the rule np -> np np. This would be used in the analysis of a noun phrase like "10 dollars a share" where the two nps "10 dollars" and "a share" are part of the same np. The point here is that this rule would allow the third parse of the sentence shown here:

(S (NP The salespeople)
   (VP sold (NP (NP the dog)
                (NP biscuits))))
Note that this parse does not have an obvious meaning associated with it. In fact, most of the parses that wide-coverage grammars find are like this one - literally senseless.

Within the statistical parsing community, the general position is that syntactic ambiguity is rampant and essentially inseparable from the problem of parsing. It is the responsibility of the parser to come up with the best parse and if it makes mistakes, well, future research will show us how to do better. More specifically, statistical parsers work by assigning a probability to all possible parses to a sentence, locating the most probable parse, and then presenting that parse as the answer. Thus to construct a statistical parser one must figure out how to (a) find possible parses and (b) assign probabilities to them.

One of the simplest mechanisms for this is based upon probabilistic context-free grammars or PCFGs which is simply a context-free grammar in which every rule is assigned a probability. These probabilities are to be interpreted as the probability of expanding a constituent, say an np, using this particular rule, as opposed to any of the other rules that could be used to expand this kind of constituent. For example, here is a toy PCFG that will generate the above three parses for our ambiguous sentence:

s  -> np vp      (1.0)   np -> det noun      (0.5)
vp -> verb np    (0.8)   np -> noun          (0.3) 
vp -> verb np np (0.2)   np -> det noun noun (0.15) 
                         np -> np np         (0.05) 
Note, for example, how the probabilities of all of the rules for np sum to one, since that an np is expanded using some rule of the grammar has probability one.

Given the probability of individual rules, we calculate the probability of an entire parses would be ranked higher.

PCFGs have many virtues. First and foremost, they are the obvious extension to the statistical domain of the ubiquitous context-free grammars with which most computer scientists and linguists are already familiar. Second, the parsing algorithms used for context-free parsing carry over to PCFGs (in particular, it is possible to find all possible parses in n3 time where n is the length of the sentence). It is also the case that, given the standard compact representation of CFG parses, it is possible to find the most probable parse in n3 time as well, so PCFGs have the same time complexity as their non-probabilistic brethren.

AREA REFERENCES

Charniak, Eugene. Statistical Language Learning MIT Press, Cambridge (1993)

Church, Kenneth W. and Mercer, Robert L. Introduction to the special issue on computational linguistics using large corpora Computational Linguistics 19 (1993)

RELATED PROGRAM AREAS

Adaptive Human Interfaces, Intelligent Interactive Systems for Persons with Disabilities