Evaluating the Use of Prosodic Information in Speech Recognition and Understanding

Mari Ostendorf

Electrical and Computer Engineering Department
Boston University

CONTACT INFORMATION

8 St. Mary's St.
Boston, MA 02215
Phone: (617) 353-5430
Fax: (617) 353-8437
Email: mo@bu.edu

WWW PAGE

Lab: http://raven.bu.edu
Project: http://raven.bu.edu/projects/prosody.html

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Prosody, speech understanding, speech recognition, disfluencies.

PROJECT SUMMARY

The goal of this project was to investigate the use of different levels of prosodic information in speech recognition and understanding. Our approach was multi-disciplinary, combining linguistic theory, speech knowledge and statistical modeling techniques. The research involved: 1) determining a representation of prosodic information suitable for use in speech understanding systems, 2) conducting distributional and acoustic analyses of speech corpora to better understand prosodic phenomena and define the structure of the computational models, 3) developing reliable algorithms for detection of the prosodic markers in speech, 4) investigating architectures for integrating prosodic cues in speech understanding systems, and 5) assessing potential performance improvements by evaluating prosody algorithms in an actual spoken language system (SLS). The project investigated three different aspects of prosody: the marking of prominent syllables and phrase boundaries and the relationship of these cues to syntactic structure, the association of prosodic features with disfluencies in spontaneous speech, and the use of prosody as a cue to higher level dialog structure. Specific contributions of this project include:

PROJECT REFERENCES

Selected publications supported in whole or in part by this grant:

N. Veilleux and M. Ostendorf, ``Prosody/parse scoring and its application in ATIS,'' Proc. ARPA HLT Workshop, pp. 335-340, 1993.

C. W. Wightman and M. Ostendorf, ``Automatic labeling of prosodic patterns,'' IEEE Trans. Speech and Audio Processing, 2(4) 469-481, 1994.

S. Shattuck-Hufnagel, M. Ostendorf and K. Ross, ``Pitch accent placement within lexical items in American English,'' J. Phonetics, 22 357-388, 1995.

E. Shriberg, 1994. ``Preliminaries to a theory of speech disfluencies,'' UC Berkeley Ph.D. thesis.

M. Ostendorf, P. J. Price and S. Shattuck-Hufnagel, 1995. ``The Boston University radio news corpus,'' Boston University Technical Report No. ECS-95-001. (available by anonymous ftp from raven.bu.edu)

P. Price and M. Ostendorf, 1996. ``Combining linguistic with statistical methods in modeling prosody,'' in J. L. Morgan and K. Demuth (Eds.), Signal to syntax: Bootstrapping from speech to grammar in early acquisition, Hillsdale, NJ: Lawrence Erlbaum Associates.

L. Dilley, S. Shattuck-Hufnagel and M. Ostendorf, ``Glottalization of Vowel-Initial Syllables as a Function of Prosodic Structure,'' J. Phonetics, 24 423-444, 1996.

A. Stolcke and E. Shriberg, ``Statistical language modeling for speech disfluencies,'' Proc. ICASSP, I:405-408, 1996.

M. Ostendorf and K. Ross, ``A Multi-Level Model for Recognition of Intonation Labels,'' in Computing Prosody, Y. Sagisaka, N. Campbell and N. Higuchi (Eds.), 291-308, Springer-Verlag, NY: 1997.

M. Swerts and M. Ostendorf, ``Prosodic Indications of Discourse Structure in Human-Machine Interactions,'' Speech Communication, v. 22, 1997.

M. Ostendorf, ``Linking Speech Recognition and Language Processing Through Prosody,'' CC-AI, to appear.

E. E. Shriberg, R.A. Bates and A. Stolcke, ``A prosody-only decision-tree model for disfluency detection,'' Proc. EUROSPEECH, 1997.

M. Siu and M. Ostendorf, ``Variable N-gram Language Modeling and Extensions for Conversational Speech,'' Proc. EUROSPEECH, 1997.

AREA BACKGROUND

Spoken language is the primary mode of communication between humans for interactive problem solving, and therefore spoken language understanding is a vital technology for making computers accessible to a broad range of the population. As human listeners, we bring many sources of information to bear on the problem of interpreting an utterance, including syntax, semantics, our knowledge of the world and conversational context, as well as prosody. Prosodic phrase structure and prominence patterns often provide the link between acoustic realization and linguistic interpretation of a word, giving clues as to how to parse a word string, which element is in focus, whether a point is in question, and whether there has been a change in topic. Despite the fact that prosody provides such important information, it has been little used so far in spoken language understanding systems. One reason is simply that prosody modeling is a difficult problem, with acoustic cues depending on prosodic structures operating at many different time scales. A second reason may be that current speech systems handle only constrained domains, where the information provided in prosody is often redundant with semantic cues in speech understanding systems or is less useful because the speech recognition task involves read speech. However, as systems move towards less constrained and more natural interaction, the additional information provided by prosody will become increasingly important.

The key problems that must be solved to make effective use of prosody in human-computer communication include: identification of the important abstract prosodic patterns and their acoustic correlates, automatic prosodic pattern recognition, analysis and computational modeling of the mapping between prosodic patterns and linguistic constructs, and integration of the detected prosodic information with existing speech recognition and understanding systems. Solving these problems, which span both technical and linguistic disciplines, benefits from a multi-disciplinary approach. Progress has been made in all of these areas. Through a multi-site collaborative effort, a standard prosodic labeling system has been proposed for American English (Pitrelli et al., 1994), which makes it possible for different researchers to share data and results. Much of the automatic prosody recognition algorithm development has built on statistical models from speech recognition, including (Wightman and Ostendorf, 1994; Ostendorf and Ross, 1997). There have been numerous studies of the mappings between acoustic correlates, abstract prosodic structure and linguistic structure. For prosody and discourse structure in particular, a good summary can be found in (Hirschberg, 1993). A variety of architectures have been explored for integrating prosody into speech understanding systems, as overviewed in (Ostendorf, 1997); notable success stories include (Kompe et al., 1993; Batliner et al., 1996). In speech recognition studies, there have been several efforts at improving on HMM duration models, as well as more recent work on characterizing disfluencies in language modeling (Stolcke and Shriberg, 1996) and using prosody to detect disfluencies (Shriberg et al., 1997). Although more work is still needed on all fronts, much of the groundwork has been laid in the field, opening up numerous possibilities for application of prosody modeling in actual speech understanding systems.

AREA REFERENCES

(Excluding those listed above)

J. Pitrelli, M. Beckman, and J. Hirschberg, ``Evaluation of prosodic transcription labeling reliability in the ToBI framework,'' Proc. ICSLP, pp. 123-126, 1994.

J. Hirschberg, ``Studies of intonation and discourse,'' Proc. ECSA Workshop on Prosody, pp. 90-95, 1993.

R. Kompe et al., ``Prosody takes over: a prosodically guided dialog system,'' Proc. Eurospeech, pp. 2003-2006, 1993.

A. Batliner et al., ``Prosody, empty categories and parsing -- a success story,'' Proc. ICSLP, 2:1169-1172, 1996.

RELATED PROGRAM AREAS

Other Communication Modalities.
Adaptive Human Interfaces.
Intelligent Interactive Systems for Persons with Disabilities.