
Speech Generation for Human-Computer Interaction

Mari Ostendorf

Electrical and Computer Engineering Department
Boston University

CONTACT INFORMATION

8 St. Mary's St.
Boston, MA 02215
Phone: (617) 353-5430
Fax: (617) 353-8437
Email: mo@bu.edu

WWW PAGE

Lab: http://raven.bu.edu
Project: http://raven.bu.edu/projects/generate.html

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Prosody, intonation, speech synthesis, speech generation, concept-to-speech, dialog systems.

PROJECT SUMMARY

Speech synthesis systems have long been commercially available, but the quality is not sufficiently natural for widespread use as a computer output modality. Moreover, most existing systems have been designed for unrestricted text-to-speech synthesis and so have a very generic speaking style that may not be well matched to a particular task domain. It is generally agreed that prosody - the phrase and accent structure of speech that provides information about sentence meaning - is one of the most critical aspects of synthesis technology to improve, and that it is the main carrier of speaking "style". Thus, this project addresses the problem of computer speech generation for human-computer interaction using spoken language, with the goal of improving speech synthesis quality by controlling prosodic parameters based on text generation outputs.

The research will investigate both utterance-level and dialog-level control of prosody, developing models and associated automatic training algorithms aimed at portability to different task domains and different generators. With the dual objectives of advancing the state of the art and providing general software tools, the effort will include linguistic inquiry and statistical modeling research as well as a software engineering component. Working with a commercially available synthesizer and building on existing prosody synthesis and recognition algorithms, the research will involve: 1) collection of read and spontaneous speech corresponding to task-specific responses; 2) improvement of automatic labeling of prosodic patterns and training of prediction modules; 3) use of syntactic, semantic and discourse annotation available from text generation systems to drive prosodic control modules and thereby improve the quality of the synthesized computer speech response; and 4) investigation of the role and effectiveness of prosody in computer response for guiding the dialog, e.g., for marking clarification subdialogs and other types of system initiative. To ensure that the goal of portability is achieved, the synthesized responses will be evaluated with multiple generators and on at least two different task domains; thus an important component of the work is the development of evaluation protocols for assessing speech generation quality and its impact on human-computer interaction.

By making use of the rich linguistic information available from text generation, the research will benefit spoken language technology that currently uses synthesis in a text-to-speech generation mode. In addition, it will provide a new capability in systems that use no spoken response generation, opening up application areas such as telephone-based computer access and potentially changing the face of multi-media interactions. Moreover, the results of the investigations of prosodic marking of dialog and information structure, together with lessons learned from system evaluation work, will have implications for improving text generation and dialog management technology, as well as prosody and synthesis research.

PROJECT REFERENCES

Our focus is on automatically trainable algorithms for portability to different generators and different task domain speaking styles. Examples of our past work in this area include:

``A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location,'' M. Ostendorf and N. Veilleux, Computational Linguistics, 20 (1) 27-54, 1994.

``Prediction of Abstract Prosodic Labels for Speech Synthesis,'' K. Ross and M. Ostendorf, Computer Speech and Language, 10 (3) 155-185, 1996.

AREA BACKGROUND

Human interaction with a computer, like human-human communication, is a two-way dialogue where information from the human is input to the computer (e.g., via keyboard, mouse or voice) and a computer response is generated and communicated to the human in return (e.g., via text or graphical display, or audio output). Speech input and output have not played a major role in human-computer interaction in the past, largely because the technology was not sufficiently advanced. Significant gains have since been achieved in the area of speech understanding, so that speech is now becoming a viable input modality. However, there has been little work on the other half of the problem: automatic speech generation. Most computer interfaces, including those for current spoken language systems, rely heavily on tabular and graphical displays for system output. Although these types of displays are often effective, they are not appropriate for all types of information (e.g., explanations), and they do not allow the system to take initiative and respond in a helpful way (e.g., guiding the user with clarification queries). Moreover, for telephone applications that provide remote computer services, visual displays are not currently an option. Thus, speech generation is an important capability for facilitating human-computer interaction in future generations of computer services.

Spoken response generation involves two main tasks: natural language (NL) text generation and speech synthesis. NL generation involves planning the semantic information that is to be communicated in a particular utterance and its high level organization, and the selection of appropriate words and syntactic structures for expressing the information. Speech synthesis involves generating an intermediate phonological representation of the text, followed by generation of acoustic parameters used in synthesizing a speech waveform. In this project, we address the problem of speech synthesis. One could generate speech responses simply by putting a text generation system in series with a speech synthesis system, but much better quality speech can be generated with a tightly coupled system. In particular, producing speech from a natural language generator has two advantages over text-to-speech: (1) Since the generator builds the syntactic structure, there is no need to hypothesize phrase boundaries---they are already there, and there is no syntactic ambiguity. (2) Since the generator has an underlying model of the situation from which it is generating the text (the dialog/discourse context), the generator chooses, rather than having to hypothesize, what parts of the message are in focus, which parts should be emphasized or contrasted with other information, and what is given vs. new information.
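The two advantages above can be illustrated with a minimal sketch. It is not the project's model: the Word fields, the three accent labels, and the "|" boundary mark are all hypothetical names chosen for this example, and the hand-written rules stand in for the automatically trained prediction modules that the project proposes. The sketch only shows the direction of information flow: the generator hands over phrase boundaries and given/new/focus status, so the prosody module reads them rather than guessing them from raw text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One word as emitted by a (hypothetical) text generator."""
    text: str
    is_new: bool = False        # new vs. given information
    in_focus: bool = False      # emphasized or contrasted by the generator
    phrase_final: bool = False  # last word of a generator-built phrase


def assign_prosody(words: List[Word]) -> List[Tuple[str, str, bool]]:
    """Map generator annotations to illustrative prosodic labels.

    Assumed rules for this sketch only: focused words get an emphatic
    accent, other new words get a plain accent, given words get none.
    Phrase boundaries are copied directly from the generator's structure.
    """
    labeled = []
    for w in words:
        if w.in_focus:
            accent = "emphatic"
        elif w.is_new:
            accent = "accent"
        else:
            accent = "none"
        labeled.append((w.text, accent, w.phrase_final))
    return labeled


def render(labeled: List[Tuple[str, str, bool]]) -> str:
    """Render labels as a marked-up string for inspection."""
    parts = []
    for text, accent, boundary in labeled:
        parts.append(text if accent == "none" else f"{text}[{accent}]")
        if boundary:
            parts.append("|")
    return " ".join(parts)


# Example response in a travel-information dialog (invented for illustration).
utterance = [
    Word("the"), Word("flight"), Word("to"),
    Word("Boston", is_new=True, phrase_final=True),
    Word("leaves"), Word("at"),
    Word("nine", is_new=True, in_focus=True, phrase_final=True),
]
marked_up = render(assign_prosody(utterance))
print(marked_up)  # the flight to Boston[accent] | leaves at nine[emphatic] |
```

In a text-to-speech setting, the three boolean fields would have to be hypothesized from the word string alone; here they are inputs, which is the point of coupling the generator and the synthesizer.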

AREA REFERENCES

General overview:

K. McKeown and J. Moore, ``Spoken Language Generation,'' Chapter 5.4 in Survey of the State of the Art in Human Language Technology, R. Cole et al. (eds.), 1996.

Speech generation papers:

N.J. Youd and J. House, ``Generating intonation in a voice dialog system,'' Proc. Eurospeech, 1287-1290, 1991.

J. House and N. Youd, ``Evaluating the prosody of synthesized utterances within a dialogue system,'' Proc. ICSLP, 1175-1178, 1992.

Y. Yamashita and R. Mizoguchi, ``SOCS: A speech output system from concept representation,'' Proc. ICASSP, II:69-72, 1992.

A. Monaghan, ``Intonation accent placement in a concept-to-dialogue system,'' ESCA/IEEE Workshop on Speech Synthesis, 171-174, 1994.

S. Prevost and M. Steedman, ``Information based intonation synthesis,'' ARPA Workshop on Human Language Technology, 193-198, 1994.

B. Grote, E. Hagen and E. Teich, ``Matchmaking: dialogue modelling and speech generation meet,'' Proc. INLG, 1996.

RELATED PROGRAM AREAS

Other Communication Modalities.
Adaptive Human Interfaces.
Intelligent Interactive Systems for Persons with Disabilities.