Evaluating the Use of Prosodic Information in Speech Recognition and Understanding
Mari Ostendorf
Electrical and Computer Engineering Department
Boston University
CONTACT INFORMATION
8 St. Mary's St.
Boston, MA 02215
Phone: (617) 353-5430
Fax: (617) 353-8437
Email: mo@bu.edu
WWW PAGE
Lab:
http://raven.bu.edu
Project:
http://raven.bu.edu/projects/prosody.html
PROGRAM AREA
Speech and Natural Language Understanding
KEYWORDS
Prosody, speech understanding, speech recognition, disfluencies.
PROJECT SUMMARY
The goal of this project was to investigate the use of different
levels of prosodic information in speech recognition and
understanding. Our approach was multi-disciplinary, combining
linguistic theory, speech knowledge and statistical modeling
techniques. The research involved: 1) determining a representation of
prosodic information suitable for use in speech understanding systems,
2) conducting distributional and acoustic analyses of speech corpora
to better understand prosodic phenomena and define the structure of
the computational models, 3) developing reliable algorithms for
detection of the prosodic markers in speech, 4) investigating
architectures for integrating prosodic cues in speech understanding
systems, and 5) assessing potential performance improvements by
evaluating prosody algorithms in an actual spoken language system
(SLS). The project investigated three aspects of prosody: the marking
of prominent syllables and phrase boundaries, and the relationship of
these cues to syntactic structure; the association of prosodic
features with disfluencies in spontaneous speech; and the use of
prosody as a cue to higher-level dialog structure.
Specific contributions of this project include:
- Transcription systems: Developed and documented a system
for prosodic transcription and a system
for disfluency transcription -- both of
which have influenced other transcription efforts -- and used
these systems to label various corpora.
- Prosodic phrases and prominences: Analyzed the relationship
between symbolic prosodic events (phrases and prominences) and
syntactic structure, as well as the acoustic cues to these events;
developed algorithms for detecting such prosodic events and
architectures for using them to improve parsing accuracy and/or
speed; and developed a model of duration for use in speech
recognition that combines prosodic and phonetic contextual effects.
- Disfluencies: Conducted distributional analyses to
determine important classes of disfluencies;
determined acoustic cues to some of these classes; developed algorithms
for detecting disfluencies from acoustic and textual cues;
and investigated mechanisms for
accounting for the presence of disfluencies in language modeling for
speech recognition.
- High-level structure: Studied acoustic and textual cues
to discourse structure in human-computer dialogs;
and analyzed the acoustic cues to speaking ``style'' with the goal
of systematically modeling regions of phonetic reduction in pronunciation.
PROJECT REFERENCES
Selected publications supported in whole or in part by this grant:
N. Veilleux and M. Ostendorf, ``Prosody/parse scoring and its
application in ATIS,'' Proc. ARPA HLT Workshop, pp. 335-340, 1993.
C. W. Wightman and M. Ostendorf, ``Automatic labeling of
prosodic patterns,'' IEEE Trans. Speech and Audio Processing,
2(4) 469-481, 1994.
S. Shattuck-Hufnagel, M. Ostendorf and K. Ross, ``Pitch accent
placement within lexical items in American English,'' J. Phonetics,
22 357-388, 1995.
E. Shriberg, ``Preliminaries to a theory of speech
disfluencies,'' Ph.D. thesis, UC Berkeley, 1994.
M. Ostendorf, P. J. Price and S. Shattuck-Hufnagel, ``The Boston
University radio news corpus,'' Boston University Technical Report
No. ECS-95-001, 1995. (available by anonymous ftp from raven.bu.edu)
P. Price and M. Ostendorf, ``Combining linguistic with
statistical methods in modeling prosody,'' in J. L. Morgan and
K. Demuth (Eds.), Signal to syntax: Bootstrapping from speech to
grammar in early acquisition, Hillsdale, NJ: Lawrence Erlbaum
Associates, 1996.
L. Dilley, S. Shattuck-Hufnagel and M. Ostendorf, ``Glottalization of
Vowel-Initial Syllables as a Function of Prosodic Structure,''
J. Phonetics, 24 423-444, 1996.
A. Stolcke and E. Shriberg, ``Statistical language modeling for speech
disfluencies,'' Proc. ICASSP, I:405-408, 1996.
M. Ostendorf and K. Ross, ``A Multi-Level Model for Recognition of Intonation
Labels,'' in Computing Prosody, Y. Sagisaka,
N. Campbell and N. Higuchi (Eds.), 291-308, Springer-Verlag, NY: 1997.
M. Swerts and M. Ostendorf, ``Prosodic Indications of Discourse Structure
in Human-Machine Interactions,'' Speech Communication, v. 22, 1997.
M. Ostendorf, ``Linking Speech Recognition and Language Processing
Through Prosody,'' CC-AI, to appear.
E. E. Shriberg, R. A. Bates and A. Stolcke, ``A prosody-only decision-tree model
for disfluency detection,'' Proc. EUROSPEECH, 1997.
M. Siu and M. Ostendorf, ``Variable N-gram Language Modeling and
Extensions for Conversational Speech,'' Proc. EUROSPEECH, 1997.
AREA BACKGROUND
Spoken language is the primary mode of
communication between humans for interactive problem solving, and
therefore spoken language understanding is a vital technology for
making computers accessible to a broad range of the population. As
human listeners, we bring many sources of information to bear on the
problem of interpreting an utterance, including syntax, semantics, our
knowledge of the world and conversational context, as well as
prosody. Prosodic phrase structure and prominence patterns often
provide the link between acoustic realization and linguistic
interpretation of a word, giving clues as to how to parse a word
string, which element is in focus, whether a point is in question, and
whether there has been a change in topic. Despite the fact that
prosody provides such important information, it has been little used
so far in spoken language understanding systems. One reason is simply
that prosody modeling is a difficult problem, with acoustic cues
depending on prosodic structures operating at many different time
scales. A second reason may be that current speech systems handle
only constrained domains, where the information provided in prosody is
often redundant with semantic cues in speech understanding systems or
is less useful because the speech recognition task involves read
speech. However, as systems move towards less constrained and more
natural interaction, the additional information provided by prosody
will become increasingly important.
The key problems that must be solved to make effective use of prosody
in human-computer communication include: identification of the
important abstract prosodic patterns and their acoustic correlates,
automatic prosodic pattern recognition, analysis and computational
modeling of the mapping between prosodic patterns and linguistic
constructs, and integrating the detected prosodic information with
existing speech recognition and understanding systems. Solving these
problems, which span both technical and linguistic disciplines,
benefits from a multi-disciplinary approach. Progress has been made in
all of these areas. Through a multi-site collaborative effort, a
standard prosodic labeling system has been proposed for American
English (Pitrelli et al., 1994), which makes it possible for different
researchers to share data and results. Much of the work on automatic
prosody recognition has built on statistical models from speech
recognition (e.g., Wightman and Ostendorf, 1994; Ostendorf and Ross,
1997). There have been numerous studies of the mappings
between acoustic correlates, abstract prosodic structure and
linguistic structure. For prosody and discourse structure in
particular, a good summary can be found in (Hirschberg, 1993). A
variety of architectures have been explored for integrating prosody
into speech understanding systems, as surveyed in (Ostendorf, 1997);
notable success stories include (Kompe et al., 1993; Batliner et al., 1996).
In speech recognition studies, there have been several efforts at
improving on HMM duration models, as well as more recent work on
characterizing disfluencies in language modeling (Stolcke & Shriberg,
1996) and using prosody to detect disfluencies (Shriberg et al.,
1997). Although more work is still needed on all fronts, much of the
groundwork has been laid in the field, opening up numerous
possibilities for application of prosody modeling in actual speech
understanding systems.
AREA REFERENCES
(Excluding those listed above)
J. Pitrelli, M. Beckman, and J. Hirschberg, ``Evaluation of prosodic transcription labeling
reliability in the ToBI framework,'' Proc. ICSLP, pp. 123-126, 1994.
J. Hirschberg, ``Studies of intonation and discourse,'' Proc. ECSA Workshop on Prosody,
pp. 90-95, 1993.
R. Kompe et al., ``Prosody takes over: a prosodically guided dialog system,''
Proc. EUROSPEECH, pp. 2003-2006, 1993.
A. Batliner et al., ``Prosody, empty categories and parsing -- a
success story,'' Proc. ICSLP, 2:1169-1172, 1996.
RELATED PROGRAM AREAS
Other Communication Modalities.
Adaptive Human Interfaces.
Intelligent Interactive Systems for Persons with Disabilities.