Electrical and Computer Engineering Department
Boston University
The research will investigate both utterance-level and dialog-level control of prosody, developing models and associated automatic training algorithms aimed at portability to different task domains and different generators. With the dual objectives of advancing the state of the art and providing general software tools, the effort will include linguistic inquiry and statistical modeling research as well as a software engineering component. Working with a commercially available synthesizer and building on existing prosody synthesis and recognition algorithms, the research will involve: 1) collection of read and spontaneous speech corresponding to task-specific responses; 2) improvement of automatic labeling of prosodic patterns and training of prediction modules; 3) use of syntactic, semantic and discourse annotation available from text generation systems to drive prosodic control modules and thereby improve the quality of the synthesized computer speech response; and 4) investigation of the role and effectiveness of prosody in computer response for guiding the dialog, e.g., for marking clarification subdialogs and other types of system initiative. To ensure that the goal of portability is achieved, the synthesized responses will be evaluated with multiple generators and on at least two different task domains; thus an important component of the work is the development of evaluation protocols for assessing speech generation quality and its impact on human-computer interaction.
By making use of the rich linguistic information available from text generation, the research will benefit spoken language technology that currently uses synthesis in a text-to-speech generation mode. In addition, it will provide a new capability in systems that currently have no spoken response generation, opening up application areas such as telephone-based computer access and potentially changing the face of multi-media interactions. Moreover, the results of the investigations of prosodic marking of dialog and information structure, together with lessons learned from system evaluation work, will have implications for improving text generation and dialog management technology, as well as prosody and synthesis research.
``A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location,'' M. Ostendorf and N. Veilleux, Computational Linguistics, 20 (1) 27-54, 1994.
``Prediction of Abstract Prosodic Labels for Speech Synthesis,'' K. Ross and M. Ostendorf, Computer Speech and Language, 10 (3) 155-185, 1996.
Spoken response generation involves two main tasks: natural language (NL) text generation and speech synthesis. NL generation involves planning the semantic information to be communicated in a particular utterance and its high-level organization, and selecting appropriate words and syntactic structures for expressing that information. Speech synthesis involves generating an intermediate phonological representation of the text, followed by generation of the acoustic parameters used in synthesizing a speech waveform. In this project, we address the problem of speech synthesis. One could generate speech responses simply by putting a text generation system in series with a speech synthesis system, but much better quality speech can be generated with a tightly coupled system. In particular, producing speech from a natural language generator has two advantages over text-to-speech: (1) Since the generator builds the syntactic structure, there is no need to hypothesize phrase boundaries; they are already there, and there is no syntactic ambiguity. (2) Since the generator has an underlying model of the situation from which it is generating the text (the dialog/discourse context), the generator chooses, rather than having to hypothesize, which parts of the message are in focus, which parts should be emphasized or contrasted with other information, and what is given vs. new information.
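As a rough illustration of the two advantages above, the following minimal Python sketch shows how annotations that a generator already has (phrase boundaries, focus, given vs. new status) could drive prosodic labels directly. Everything in it is an assumption made for illustration: the Word record, the assign_prosody function, the hand-written rules, and the ToBI-style label strings are hypothetical placeholders. The project itself proposes statistical models trained on labeled speech, not fixed rules like these.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """One word of generator output with the annotations the generator knows.

    All fields are hypothetical; a real generator would expose richer structure.
    """
    text: str
    is_content: bool    # content word (noun, verb, ...) vs. function word
    is_new: bool        # new information (True) vs. given information (False)
    in_focus: bool      # marked as focused or contrastive by the generator
    phrase_final: bool  # last word of a phrase built by the generator


def assign_prosody(words: List[Word]) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Map generator annotations to (word, accent, boundary) triples.

    Toy hand-written rules, assumed for illustration only.
    """
    labels = []
    for w in words:
        if w.in_focus:
            accent = "L+H*"   # assumed accent choice for focused/contrastive items
        elif w.is_content and w.is_new:
            accent = "H*"     # assumed accent choice for new content words
        else:
            accent = None     # given words and function words stay unaccented
        # The boundary comes straight from the generator's phrase structure,
        # so nothing has to be hypothesized from raw text.
        boundary = "L-L%" if w.phrase_final else None
        labels.append((w.text, accent, boundary))
    return labels


# Hypothetical system response: "the flight leaves at nine"
utterance = [
    Word("the", False, False, False, False),
    Word("flight", True, False, False, False),   # given: already under discussion
    Word("leaves", True, True, False, False),    # new information
    Word("at", False, False, False, False),
    Word("nine", True, True, True, True),        # focused answer, ends the phrase
]

result = assign_prosody(utterance)
for text, accent, boundary in result:
    print(text, accent, boundary)
```

In a text-to-speech setting, the is_new, in_focus and phrase_final fields would have to be guessed from the text alone; here they are simply read off the generator's own representation.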
Speech generation papers:
N.J. Youd and J. House, ``Generating intonation in a voice dialog system,'' Proc. Eurospeech, 1287-1290, 1991.
J. House and N. Youd, ``Evaluating the prosody of synthesized utterances within a dialogue system,'' Proc. ICSLP, 1175-1178, 1992.
Y. Yamashita and R. Mizoguchi, ``SOCS: A speech output system from concept representation,'' Proc. ICASSP, II:69-72, 1992.
A. Monaghan, ``Intonation accent placement in a concept-to-dialogue system,'' ESCA/IEEE Workshop on Speech Synthesis, 171-174, 1994.
S. Prevost and M. Steedman, ``Information based intonation synthesis,'' ARPA Workshop on Human Language Technology, 193-198, 1994.
B. Grote, E. Hagen and E. Teich, ``Matchmaking: dialogue modelling and speech generation meet,'' Proc. INLG, 1996.