SGER: Using Text Coherence and Verbal Valence in Long-Distance N-grams

Dan Jurafsky

Department of Linguistics and Institute of Cognitive Science
University of Colorado, Boulder

CONTACT INFORMATION

Dan Jurafsky
Department of Linguistics
308 Woodbury Hall
University of Colorado, Boulder, CO 80309-0295
Phone: 303-492-1300
Fax: 303-492-4416
Email: jurafsky@colorado.edu

WWW PAGE

http://stripe.Colorado.EDU/~jurafsky

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Speech Recognition, Stochastic Grammar, Probabilistic Grammar, Language Modeling, Statistical Parsing, Verbal Valence, Latent Semantic Analysis

PROJECT SUMMARY

As speech recognition advances to new domains like spontaneous human-to-human speech, our traditional n-gram language models have proved insufficiently powerful. But the major bottleneck in augmenting n-grams with more sophisticated linguistic knowledge is finding a way to build good stochastic models of this knowledge. In this 1-year exploratory project, my student colleagues Noah Coccaro and Doug Roland and I are studying probabilistic models of two important pieces of syntactic/semantic knowledge: the probabilistic relationships between verbs and their arguments (probabilistic valence), and semantic word association, the tendency of nearby words in a text to be semantically related. In previous work we have shown how to use stochastic context-free grammars directly as the language model of a speech recognizer, either alone or in combination with N-grams (Jurafsky et al. 1994, Jurafsky et al. 1995). In this project we are trying to categorize two other kinds of probabilistic knowledge and use them to augment N-gram language models. Although the project has been underway for only 5 months, we have a number of interesting preliminary results in both areas.

Verbs place strong constraints on the syntax and semantics of their arguments (and often even strongly select for particular words). Many modern statistical parsers (Lafferty 1992, Collins 1996, Charniak 1997) are based on such verb-argument or head-dependency probabilities, and models of human parsing rely on them as well (Clifton et al. 1984, Ford et al. 1982, Jurafsky 1996, Mitchell & Holmes 1985, Shapiro et al. 1993, Tanenhaus & Carlson 1988, Tanenhaus et al. 1990). We are examining how these probabilities can be computed from corpora and from psycholinguistic experiments, comparing the differences between these models, and producing a set of probability norms suitable for augmenting trigram language models or for use in probabilistic parsers.

In our first study, we compared two methods of computing verbal valence probabilities for 30 verbs: psycholinguistic single-sentence production experiments (Connine et al. 1984) and on-line corpora (the Penn TreeBank version of the Brown Corpus; Marcus et al. 1993). We found two strong results:

First, we found that there were major differences between corpus valence probabilities and experimental valence probabilities; the corpus displayed increased use of passives, ellipsis of arguments, and quotes, among other differences. But we found that these differences could be accounted for by a single factor: discourse context effects. When discourse effects were removed, the corpus and the experimental data were not statistically different. The following chart summarizes the differences between these two data sources:


Here are preliminary frequency results; we are currently using the 16 argument-structure types defined by Connine et al. (1984) (plus one extra: quotes) to make it easier to compare with psychological data, but we expect to recompute these frequencies with more fine-grained categories like those of COMLEX:

Data From Brown Corpus For 30 Verbs

Column key: 1 = [O]; 2 = [PP]; 3 = [inf-S]; 4 = [inf-S]/PP/; 5 = [wh-S]; 6 = [that-S]; 7 = [verb-ing]; 8 = [perception compl.]; 9 = [NP]; 10 = [NP][NP]; 11 = [NP][PP]; 12 = [NP][inf-S]; 13 = [NP][wh-S]; 14 = [NP][that-S]; 16 = passive; Q = Quotes

VERB        1    2    3    4    5    6    7    8    9   10   11   12   13   14   16    Q
ask        68   64    6    4   23    7    1    0   71    8   28   73   21    1   37  101
beg         3    4    1    1    0    0    0    0    7    0    0    8    0    1    0    3
buy         7    6    0    0    1    0    0    0   91    9   26    1    0    0   10    2
call       41   84    0    5    0    0    0    0   88  135   23    1    0    0  132   13
choose     12   12   12   18    1    0    0    0   54    1   11    3    0    0   27    0
clean       4    1    0    1    0    0    0    0   22    1    2    0    0    0    5    0
debate      1    1    0    0    0    0    0    0    3    0    0    0    0    0    4    0
drive      25   46    0    0    0    0    0    0   44    1   41    0    0    0   17    0
fight      21   32    1    1    1    0    0    0   40    0    8    0    0    0   12    0
fly        14   34    3    0    0    0    0    0    6    0    4    0    0    0    1    0
hear       14   44    0    0    9   13    0   77  181    6   27    0    0    0   31    1
help       26   19   16   20    1    0    6   69   83    2   19   15    0    0    5    0
hire        1    1    0    0    0    0    0    0   26    0    1    0    0    0   10    0
know      174   63    4    0  214  156    0    1  380    3   37    6    2    0   89    0
leave      84   36    0    3    4    1    1   12  254   14   97    8    0    0   48    0
order       2    1    0    0    0    3    0    4   20    0    6   18    0    0   16    1
paint      13   12    0    0    0    0    0    0   16    2    6    0    0    0    9    0
play       39   71    0    0    0    0    0    0  111    5   66    1    1    0    9    0
promise     4    0   10   14    0    2    0    0   11    4    0    1    0    1    4    3
race        2   14    0    1    0    0    0    0    1    0    0    0    0    0    0    0
read       23   18    0    0    2    2    0    0  134    2   21    0    0    0   16   12
refuse      8    0   29   31    0    0    0    0    8    1    1    1    0    0    4    0
signal      0    1    0    0    0    1    0    0    5    0    1    0    0    0    0    0
sing       26   17    0    0    0    0    0    1   32    3   10    0    0    0    6    1
study      11   11    1    0    1    0    0    1   90    0   13    0    0    0   20    0
teach      11    9    0    0    1    0    0    0   24   25    6   13    4    5   13    0
tell        7   23    0    0   15    4    0    0  142   77   54   25   44   79   63   69
visit       4    8    0    0    0    0    0    0   81    2   13    1    1    0    4    0
watch      13   18    0    1    2    1    0   54   70    1   21    0    0    0    2    0
write      38   74    0    1    2   12    1    0  132   17   35    1    0    4   60   72
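
As an illustration of how counts like these could be turned into valence probabilities, the sketch below normalizes the counts for two verbs from the table into relative frequencies P(frame | verb). The relative-frequency estimator and the names FRAMES, COUNTS, and valence_probabilities are assumptions made for this example; the text does not say which estimator the project uses.

```python
# Frame labels follow the column order of the table above.
FRAMES = ["[O]", "[PP]", "[inf-S]", "[inf-S]/PP/", "[wh-S]", "[that-S]",
          "[verb-ing]", "[perception compl.]", "[NP]", "[NP][NP]", "[NP][PP]",
          "[NP][inf-S]", "[NP][wh-S]", "[NP][that-S]", "passive", "quotes"]

# Counts copied from the table rows for two verbs.
COUNTS = {
    "refuse": [8, 0, 29, 31, 0, 0, 0, 0, 8, 1, 1, 1, 0, 0, 4, 0],
    "hire":   [1, 1, 0, 0, 0, 0, 0, 0, 26, 0, 1, 0, 0, 0, 10, 0],
}

def valence_probabilities(counts):
    """Relative-frequency estimate of P(frame | verb) from raw counts."""
    total = sum(counts)
    return {frame: c / total for frame, c in zip(FRAMES, counts)}

for verb, counts in COUNTS.items():
    probs = valence_probabilities(counts)
    best = max(probs, key=probs.get)
    print(f"{verb}: most frequent frame {best}, P = {probs[best]:.3f}")
```

Unsmoothed relative frequencies assign zero probability to every frame that was not observed, so a real set of norms would probably need some form of smoothing for low-frequency verbs such as "signal" or "debate".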

The second area we are examining is semantic text coherence. Texts and discourses tend to be semantically coherent; in particular, the words that occur in a text tend to be semantically related to each other. Thus the meanings of previous words place probabilistic constraints on the meaning of future words. We are applying a linear associative model of word meaning called Latent Semantic Analysis (LSA), which has been used successfully for many years in the domain of information retrieval. In short, a term-by-document (word-by-document) matrix is created from a corpus of individual documents. Each entry in the matrix records the number of times that word occurred in that document. Documents can be whole articles, paragraphs, or individual sentences. Each document is then represented by a vector of probabilities of word occurrences, or conversely each word is represented by a vector of probabilities of document occurrences.

These vectors can be used as rough semantic similarity indices. The cosine of the angle between the vectors corresponding to two words is a rough metric of the semantic similarity of the words. But the raw vectors are very subject to noise, and to the particularities of the input corpus. The essential step in the LSA algorithm is to use Singular Value Decomposition (SVD) to generalize the matrix by maintaining only the most significant dimensions. Empirical studies have shown that for information retrieval, keeping the most significant 300 dimensions and ignoring the rest produces a set of generalized vectors which give a 30% improvement over standard vector methods.
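
The steps just described (count matrix, truncated SVD, cosine comparison) can be sketched in a few lines of NumPy. This is a minimal illustration and not the project's implementation: the four-sentence corpus is invented, raw counts are used without any term weighting, and k = 2 stands in for the roughly 300 dimensions mentioned above.

```python
import numpy as np

# Toy corpus (hypothetical); each string is one "document".
docs = [
    "the computer ran the program",
    "the workstation ran the program",
    "the chef cooked the dinner",
    "the chef ate the dinner",
]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-by-document count matrix: rows are words, columns are documents.
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[index[w], j] += 1

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
word_vecs = U[:, :k] * s[:k]  # one k-dimensional vector per word

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity(w1, w2):
    """Cosine similarity of two words in the reduced space."""
    return cosine(word_vecs[index[w1]], word_vecs[index[w2]])

print(similarity("computer", "workstation"))
print(similarity("computer", "dinner"))
```

In this toy corpus "computer" and "workstation" never share a document, so the cosine of their raw count rows is zero; in the reduced space they end up much closer to each other than "computer" is to "dinner".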

For speech recognition, we apply the algorithm by adding together all the vectors from each word of the previously spoken context to create a "pseudo-document", against which next-word hypotheses are compared. The cosine of the angle between the next word and the average of the previous words in the context is a measure of the semantic similarity. The benefit of using LSA over word trigger pairs comes from its reduced dimensionality. Words that might never actually co-occur in a training text would have a fairly close vector representation if the words they co-occur with are similar. For example, the words "computer" and "workstation" might never co-occur in the training text. But they will appear in similar contexts, and thus will appear in a similar place in the n-dimensional space of word contexts. Then through the SVD they will have a similar representation. We are working on integrating this information with trigram language models (exploring various mixture-of-experts algorithms).
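
A minimal sketch of the pseudo-document comparison follows. The two-dimensional word vectors are invented for illustration (real ones would come from the SVD), and the function names are hypothetical. Summing and averaging the context vectors point in the same direction, so either gives the same cosine.

```python
import numpy as np

# Hypothetical 2-dimensional LSA word vectors; real ones would come from an SVD.
word_vecs = {
    "computer":    np.array([0.9, 0.1]),
    "workstation": np.array([0.8, 0.2]),
    "program":     np.array([0.7, 0.3]),
    "dinner":      np.array([0.1, 0.9]),
    "chef":        np.array([0.2, 0.8]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pseudo_document(context):
    """Sum the vectors of the words spoken so far."""
    return np.sum([word_vecs[w] for w in context], axis=0)

def rank_candidates(context, candidates):
    """Order next-word hypotheses by cosine similarity to the context."""
    ctx = pseudo_document(context)
    scores = {w: cosine(ctx, word_vecs[w]) for w in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_candidates(["computer", "program"], ["workstation", "dinner"])
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

A cosine is a similarity score, not a probability, so combining it with trigram probabilities requires a further step that this sketch does not attempt.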

In our baseline experiments we have also been using the LSA method of vector smoothing as a bigram-smoothing method. First, a square matrix of bigram frequencies from the training data is created. Then, we perform an SVD on this matrix, which produces a left singular matrix, a diagonal matrix of singular values, and a right singular matrix. The left and right singular matrices are cut off at an appropriate dimension, such as 300. This dimension reduction removes noise from the matrix. The left, diagonal, and right matrices are then multiplied to recreate an approximation of the original matrix. The effect of this is to smooth the bigrams associated with a word with bigrams from words with similar distributions. For example, in the training data (Wall Street Journal), 'yen' did not occur after some words, such as 'million', but it did occur after 'million' in the test data. 'dollar', which occurred in many of the same places as 'yen', did follow 'million' in the training data. By doing the smoothing, the frequency for the 'million' 'yen' bigram was increased, since 'yen' followed the behavior of 'dollar'.
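
The reconstruction step can be sketched as below. The five-word vocabulary and its counts are invented for illustration, svd_smooth is a hypothetical name, and k = 1 stands in for the much larger cutoff a real vocabulary would use.

```python
import numpy as np

def svd_smooth(counts, k):
    """Rebuild a bigram count matrix from its k largest singular values."""
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Hypothetical vocabulary and counts; rows are the first word of the bigram,
# columns the second word, so the matrix is square as in the text.
vocab = ["million", "billion", "japanese", "dollar", "yen"]
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
counts[idx["million"], idx["dollar"]] = 10   # "million yen" is unseen
counts[idx["billion"], idx["dollar"]] = 8
counts[idx["billion"], idx["yen"]] = 6
counts[idx["japanese"], idx["yen"]] = 4      # "japanese dollar" is unseen

smoothed = svd_smooth(counts, k=1)

for first, second in [("million", "yen"), ("japanese", "yen")]:
    before = counts[idx[first], idx[second]]
    after = smoothed[idx[first], idx[second]]
    print(f"{first} {second}: {before:.1f} -> {after:.2f}")
```

In this toy matrix the unseen "million yen" cell rises above zero because "yen" and "dollar" share the preceding word "billion". The same reconstruction also lowers the "japanese yen" cell well below its training count.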

In our first experiments, we found that this smoothing could also cause problems. For example, "japanese" often preceded "yen", but never preceded "dollar". After smoothing, the frequency count for "japanese" "yen" was greatly reduced. In response to this, we used the smoothing only to increase counts above what occurred in the training data. This limits the effect of the smoothing to cases where we haven't seen enough of something, but avoids using the smoothing to make predictions about having seen something too often. This increased performance significantly.
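
One simple way to express this increase-only rule is an element-wise maximum of the training counts and the smoothed counts. The sketch below assumes that reading; the function name and the numbers are hypothetical.

```python
import numpy as np

def increase_only(original, smoothed):
    """Keep a smoothed count only where it exceeds the training count."""
    return np.maximum(original, smoothed)

# Hypothetical counts: rows = preceding word, columns = ("dollar", "yen").
original = np.array([[10.0, 0.0],    # million
                     [ 0.0, 4.0]])   # japanese
# Hypothetical output of an SVD smoothing pass over the same cells.
smoothed = np.array([[ 8.8, 3.3],
                     [ 1.3, 0.5]])

adjusted = increase_only(original, smoothed)
print(adjusted)
```

Here "japanese yen" keeps its training count of 4 while "million yen" receives the smoothed value. Because a truncated SVD reconstruction can contain negative entries, taking the maximum with the non-negative training counts also removes them.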

Current experiments show only a very slight perplexity improvement over not doing our form of smoothing. We are currently examining ways to get more significant reductions; we suspect the Good-Turing smoothing we are doing is masking the effects of the SVD smoothing. We also plan to try to restrict the effects of smoothing for bigrams where we are confident the zero is a true zero (such as bigrams involving two fairly common words). We also might expect to see more improvement for domains with less homogeneous writing/speaking styles, such as Switchboard.

PROJECT REFERENCES

Jurafsky, Daniel. 1996a. A Probabilistic Model of Lexical and Syntactic Access and Disambiguation. Cognitive Science 20, 137-194.

Jurafsky, Daniel, Chuck Wooters, Gary Tajchman, Jonathan Segal, Andreas Stolcke, Eric Fosler, and Nelson Morgan. 1995. Using a Stochastic Context-Free Grammar as a Language Model for Speech Recognition. In Proceedings of ICASSP-95, Detroit, MI, pp. 189-192.

Jurafsky, Daniel, Chuck Wooters, Gary Tajchman, Jonathan Segal, Andreas Stolcke, Eric Fosler, and Nelson Morgan. 1994. Integrating Experimental Models of Syntax, Phonology, and Accent/Dialect in a Speech Recognizer. In AAAI-94 workshop.

Roland, Doug and Daniel Jurafsky. 1997. "Computing Verbal Valence Frequencies: Corpora Versus Norming Studies". Presented at the CUNY conference on Human Sentence Processing.

AREA BACKGROUND

The general area of this project is speech recognition language models. We often view the problem of speech recognition as follows: given some acoustic input, consider all of the possible sentences of English and pick the one with maximum posterior probability given the acoustic input. One of the factors in computing the probability of a sentence is the sentence's prior probability, which we consider to be the syntactic probability of all the words appearing, and in just the order they appeared. The standard way to compute these probabilities is with N-gram grammars. In a bigram, the simplest form of an N-gram grammar, we compute the probability of a sentence by making the (false) simplifying assumption that the probability of each word occurring is only dependent on the immediately preceding word, and thus that the probability of a sentence can be computed by multiplying each of these bigram probabilities:

P(I eat dinner) = P(I|start) P(eat|I) P(dinner|eat)
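
The bigram product above can be computed from counts as in the following sketch. It uses unsmoothed relative frequencies on an invented three-sentence corpus; the start symbol "<s>" and the function names are assumptions made for the example.

```python
from collections import Counter

START = "<s>"  # hypothetical start-of-sentence symbol

def train_bigrams(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = [START] + sentence.split()
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Product of unsmoothed relative-frequency bigram probabilities."""
    words = [START] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob

corpus = ["I eat dinner", "I eat lunch", "you eat dinner"]  # toy data
ctx, bi = train_bigrams(corpus)
p = sentence_probability("I eat dinner", ctx, bi)
print(p)
```

Any sentence containing a bigram absent from the training data receives probability zero under this estimator, which is why practical systems smooth their counts.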

State-of-the-art speech recognition systems generally use trigrams or class-based trigrams. (In class trigrams, we modify the standard trigram algorithm by computing the probability of a certain part of speech following another part of speech.) There are also a number of modern augmentations to trigrams, such as cache language models and trigger language models.

AREA REFERENCES

The journal Computer Speech and Language

The journal Computational Linguistics

RELATED PROGRAM AREAS

Adaptive Human Interfaces, Intelligent Interactive Systems for Persons with Disabilities

POTENTIAL RELATED PROJECTS

This project is linked with the FrameNet project (NSF IRI-9618838, "Tools for Lexicon Building", PI Charles Fillmore, March 1997-February 2000). The FrameNet project is building a database of semantic frame information and semantic and syntactic argument structure for 5,000 lexical units of English. As a subcontract, CU Boulder will be providing frame-element probabilities for FrameNet's entries. Our work on the TreeBank for this SGER has already bootstrapped a number of the regular expressions for tagged corpora that we will be using to compute FrameNet probabilities.

Statistical Language Models like N-grams are also used in Augmentative Communication (AC) (related to the Intelligent Interactive Systems for Persons with Disabilities program area); the syntactic and semantic augmentations we are working on for speech recognition should apply equally well to AC.