
Intonational Correlates of Discourse Structures

Barbara J. Grosz

Division of Engineering and Applied Sciences
Harvard University

CONTACT INFORMATION

Engineering Sciences Laboratory
40 Oxford Street, Room 413
Cambridge, MA 02138
Phone: (617) 495-3673
Fax: (617) 496-1066
Email: grosz@eecs.harvard.edu

WWW PAGE

http://www.eecs.harvard.edu/grosz

PROGRAM AREA

KEYWORDS

Intonation
Discourse Structure
Accent
Attentional State
Speech Synthesis

PROJECT SUMMARY

In a theory of discourse structure, developed with Sidner (Grosz and Sidner 1986), we distinguish among three components of discourse structure: linguistic structure, intentional structure, and attentional state. Linguistic structure groups utterances into discourse segments. Intentional structure consists of discourse segment purposes and the relations between them. Attentional state, an abstraction of the discourse participants' focus of attention, records the objects, properties, and relations that are salient at a given point in the discourse. Our current research addresses open problems in all three components; this project concerns interactions among them, but is particularly focused on linguistic structure.
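As an illustration only, the three components can be sketched as simple data structures. The sketch assumes the stack-of-focus-spaces model of attentional state from Grosz and Sidner (1986). All class names, method names, and the example route below are hypothetical and are not taken from any project software.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Segment:
    """A discourse segment: a group of utterances (linguistic structure)
    together with its discourse segment purpose (intentional structure)."""
    purpose: str
    utterances: List[str] = field(default_factory=list)
    subsegments: List["Segment"] = field(default_factory=list)


@dataclass
class FocusSpace:
    """Objects, properties, and relations salient while a segment is open."""
    segment: Segment
    salient: Set[str] = field(default_factory=set)


class AttentionalState:
    """Stack of focus spaces; the top space belongs to the current segment."""

    def __init__(self) -> None:
        self._stack: List[FocusSpace] = []

    def open_segment(self, segment: Segment) -> FocusSpace:
        space = FocusSpace(segment)
        self._stack.append(space)
        return space

    def close_segment(self) -> FocusSpace:
        return self._stack.pop()

    def most_salient(self) -> Set[str]:
        # Entities in the focus space of the innermost open segment.
        return set(self._stack[-1].salient) if self._stack else set()

    def accessible(self) -> Set[str]:
        # Entities in any open focus space, including enclosing segments.
        result: Set[str] = set()
        for space in self._stack:
            result |= space.salient
        return result


# Hypothetical direction-giving example (invented for illustration).
trip = Segment(purpose="explain the whole route")
first_leg = Segment(purpose="explain the first leg")
trip.subsegments.append(first_leg)

state = AttentionalState()
state.open_segment(trip).salient.add("Harvard Square")
state.open_segment(first_leg).salient.add("Red Line")
```

While the embedded segment is open, `state.most_salient()` contains only its entities, whereas `state.accessible()` also includes entities from the enclosing segment; closing the embedded segment restores the enclosing focus space as the top of the stack.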

At the linguistic level, we are investigating several fundamental problems in the production of appropriate intonational variation in computer synthesized speech. We are studying the reliability of discourse structural analyses based on the Grosz and Sidner theory of discourse structure and analyzing a corpus of spoken language to determine the intonational indicators of discourse structure. We expect the results to extend the state of the art in speech synthesis by enabling control of intonational variation based on a more substantial discourse model and by providing empirical studies of the relationship between intonation and discourse structure.

The project aims to provide the basis for systems to produce utterances with intonation appropriate to discourse context. The inability of systems to do so is a major impediment to the construction of computer systems that can employ speech to communicate with users. This lack does not stem from the inability of speech synthesis systems to produce natural-sounding intonation; input to such systems as DEC-Talk and the AT&T synthesizer can be hand annotated to produce quite natural speech. Rather, no system currently represents the discourse-level information necessary to assign appropriate intonational features automatically, either from text analysis (text-to-speech) or from an abstract representation of the message to be conveyed (message-to-speech). Nor do algorithms exist that could make use of such information to associate intonational features such as pitch range, pausal duration, and speaking rate appropriately with words and phrases. These system deficiencies result primarily from the absence of fundamental research relating intonational variation and discourse structure. A major goal of this project is to begin to fill this gap in our understanding.

In particular, we have investigated two central questions of discourse processing for spoken language: (1) Can discourses, whether written or spoken, be segmented reliably and consistently into units appropriate for determining discourse meaning, and if so on what basis? (2) What are the intonational indicators of discourse structure? Three types of empirical investigation have been undertaken:
(1) studies of the reliability of discourse structural analyses based on the Grosz and Sidner (1986) theory of discourse structure; (2) studies of the intonational indicators of discourse structure; (3) studies of intonational prominence (accent) and its interaction with syntactic structure and attentional state.

Three major components of this project were the collection of a corpus of direction-giving monologues, the development of a set of instructions for annotation of discourse structure, and several analyses of the segmented and labeled corpus.

  • The Boston Directions Corpus (Nakatani, Hirschberg, Grosz, 1995) is a set of elicited monologues produced by multiple non-professional speakers. Each speaker was given written instructions to perform a series of increasingly complex direction-giving tasks; these tasks range from getting between stations on a single line to planning a tourist journey with stops at multiple Boston sights. The speech was transcribed and speech errors were removed; the speakers then returned to read the transcribed speech. Spontaneous and read versions of the same task by the same speaker could then be analyzed and compared.

  • The annotation instructions (Nakatani, Grosz, Ahn, and Hirschberg, 1995) were designed for use by "naive" segmenters, i.e., people who have not studied discourse theory or discourse processing methods. Harvard undergraduates with no prior discourse theory background used them to label both spontaneous and read versions of the directions monologues.

  • Analyses of intonational correlates of discourse structure: acoustic-prosodic features of segment-initial, segment-medial, and segment-final utterances have been investigated. Two kinds of studies have been done. The first used segmentations by expert segmenters (people who had studied discourse theory); results of these analyses are reported in two papers (Nakatani, Hirschberg, Grosz, 1995; Hirschberg & Nakatani, 1996). A second analysis is currently underway using the larger set of segmentations by naive segmenters.

    A pilot study showed that accent information could be tightly integrated with other factors involved in local and global attentional modeling, namely form of expression and grammatical function. Building on that study, we conducted word-based and constituent-based machine learning experiments on accent prediction in the Boston Directions Corpus. These experiments demonstrated that the higher-level features could be combined with a wide range of lower-level linguistic features, such as lexical category and word lemma information, to improve significantly on citation-form accent assignment using an automatically trainable system (Nakatani, 1997).

    Our results indicate the following:

  • Segmentation instructions can be developed that are useful for "naive" segmenters; however, agreement among such labelers is lower than agreement among expert labelers.
  • The availability of speech significantly affects the reliability of discourse segmentation (Nakatani, Hirschberg, Grosz, 1995).
  • Spontaneous speech can be as reliably segmented as read speech (Hirschberg & Nakatani, 1996).
  • Intonation is used by speakers to convey information at the discourse level, in particular to indicate segment beginnings and endings (Nakatani, Hirschberg, Grosz, 1995; Hirschberg & Nakatani, 1996).
  • Computational models of attentional state provide a good basis on which to explain the use of prominence (accent) in natural speech; our original theoretical proposals concerning the discourse focusing nature of prominence and its interactions with lexical, semantic, syntactic and other linguistic factors were shown to be supported by distributional analysis, and in turn, supported the development of algorithms integrating prominence into discourse processing (Nakatani, 1997).
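Agreement among labelers, as in the first result above, can be quantified in several ways. The summary does not say which measure was used in these studies, so the following is only a minimal sketch that assumes Cohen's kappa over per-utterance boundary decisions (1 = the utterance begins a new segment, 0 = it does not). The function name and the label sequences are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers.

    Each argument is a sequence of categorical labels, one per utterance,
    in the same order for both labelers.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)
    # Proportion of utterances on which the two labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each labeler's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both labelers used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented boundary decisions for eight utterances (not corpus data).
labeler_a = [1, 0, 0, 1, 0, 0, 0, 1]
labeler_b = [1, 0, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa(labeler_a, labeler_b)
```

Here the labelers agree on 6 of 8 utterances (0.75), chance agreement is 34/64, and kappa is therefore 7/15, roughly 0.47. A comparison of naive and expert labelers would compute such a score for each group on the same material.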

    PROJECT REFERENCES

    Christine Nakatani, Julia Hirschberg and Barbara Grosz, "Discourse Structure in Spoken Language: Studies on Speech Corpora." Working Notes of the AAAI-95 Spring Symposium on Empirical Methods in Discourse Interpretation [American Association for Artificial Intelligence, Menlo Park, CA] (1995) 106-112.

    Christine H. Nakatani, Barbara J. Grosz, David D. Ahn and Julia Hirschberg, "Instructions for Annotating Discourse." Harvard University (1995) TR-21-95.

    Julia Hirschberg and Christine H. Nakatani, "A Prosodic Analysis of Discourse Segments in Direction-Giving Monologues." Proceedings of the Annual Meeting of the Association for Computational Linguistics (1996).

    Christine Nakatani, "Integrating Prosodic and Discourse Modeling." Computing Prosody [Springer-Verlag, Sagisaka, Campbell and Higuchi, eds.] (1997).

    Christine Nakatani, "The Computational Processing of Intonational Prominence: A Functional Prosody Perspective." Ph.D. Thesis, Harvard University (1997).

    AREA BACKGROUND

    Better understanding of intonational characteristics at the discourse level is important both for identifying discourse structure from speech and for enhancing the naturalness of synthetic speech. Early work in this area (Hirschberg & Pierrehumbert, 1986) provides several examples of the interaction between intonation and discourse structure. Numerous findings have shown that discourse-level meaning can be conveyed by acoustic-prosodic properties such as pitch range and pausal duration. (Nakatani, Hirschberg, Grosz, 1995 contains an extensive list of references.) Early studies relied on intuitive notions of topic structure and the like, but more recent studies have used an independent definition of discourse structure (Grosz & Hirschberg, 1992 inter alia).

    AREA REFERENCES

    J. Allen, Natural Language Understanding [Benjamin/Cummings Publishing Company, Inc., Reading, MA] 1995.

    B.J. Grosz, M. Pollack and C. Sidner, "Computational Models of Discourse." Foundations of Cognitive Science, Michael Posner, ed. [MIT Press, Bradford Books] 1989.

    Barbara J. Grosz, Karen Sparck Jones and Bonnie Lynn Webber, Readings in Natural Language Processing [Morgan Kaufmann Publishers, Inc., Los Altos, CA] 1986.

    Barbara Grosz and Candace Sidner, "Attention, Intentions, and the Structure of Discourse." Computational Linguistics, 12 (1986) 3, 175-204.

    Julia Hirschberg and Jane Pierrehumbert, "The Intonational Structuring of Discourse." Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics (1986).

    Barbara Grosz and Julia Hirschberg, "Some Intonational Characteristics of Discourse Structure." Proceedings of the 1992 International Conference on Spoken Language Processing (ICSLP-92), John Ohala, et al. eds. [Personal Publishing Ltd., Edmonton, Canada] (1992) 429-432.

    F.C. Pereira and B.J. Grosz, eds., Natural Language Processing [The MIT Press, Cambridge] 1994.

    RELATED PROGRAM AREAS

    Other Communication Modalities
    Adaptive Human Interfaces
    Intelligent Interactive Systems for Persons with Disabilities