Center for Spoken Language Understanding
Oregon Graduate Institute of Science and Technology
The goal of this international collaboration, funded by NSF in the United States and CONACyT in Mexico, is to develop tools and technologies for researching and developing Spanish spoken language systems. Our collaborators in Mexico are Professor Ofelia Cervantes and her students at UDLA, the Universidad de las Americas en Puebla.
The starting point for this work is the CSLU Toolkit, an integrated set of tools and technologies supporting research and development of spoken language systems.
The graphical authoring tools in the CSLU Toolkit allow spoken language systems to be designed in minutes, hours or days, depending upon their complexity, and then compiled and tested immediately, which supports an iterative design-and-test process. The toolkit currently supports structured dialogues, such as the pizza ordering system shown in Figure 1, which first-time users can build and test in about 30 minutes. More complex applications, such as web browsers or voice messaging systems, can be developed with more practice. The system automatically performs word spotting, rejection of low-scoring words, and conversational repair. Detailed information about the Toolkit is available at http://www.cse.ogi.edu/CSLU/toolkit/toolkit.html.
Figure 1 - The CSLU Toolkit
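The rejection of low-scoring words and the conversational repair mentioned above can be illustrated with a small sketch. This is not the Toolkit's implementation or API: the names Hypothesis, accept and ask_with_repair, the 0-to-1 score range, the 0.5 threshold and the wording of the repair prompt are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    """One recognition result (hypothetical structure, not the Toolkit's)."""
    word: str     # vocabulary word spotted in the utterance
    score: float  # recognizer confidence, assumed to lie in [0, 1]


def accept(hyp: Hypothesis, threshold: float = 0.5) -> Optional[str]:
    """Return the word if its score clears the threshold, else None (rejection)."""
    return hyp.word if hyp.score >= threshold else None


def ask_with_repair(
    prompt: str,
    recognize: Callable[[], Hypothesis],
    say: Callable[[str], None],
    max_attempts: int = 3,
    threshold: float = 0.5,
) -> Optional[str]:
    """Prompt the user, reject low-scoring results, and re-prompt as repair."""
    for attempt in range(max_attempts):
        # On the first attempt play the plain prompt; afterwards, apologize first.
        say(prompt if attempt == 0 else "Sorry, I did not understand. " + prompt)
        word = accept(recognize(), threshold)
        if word is not None:
            return word
    return None  # give up after max_attempts rejections


if __name__ == "__main__":
    # Scripted recognizer: the first result is rejected, the second accepted.
    scripted = iter([Hypothesis("pepperoni", 0.2), Hypothesis("pepperoni", 0.9)])
    choice = ask_with_repair("What topping would you like?",
                             lambda: next(scripted), print)
    print("Recognized:", choice)
```

Passing the recognizer and the prompt player in as functions keeps the sketch testable without audio hardware; in a real system they would wrap the speech recognizer and the prompt playback.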
In structured dialogues, the system produces prompts, recognizes utterances produced in response to the prompts, and performs actions based on the recognition results. The system designer creates prompts either by recording them or by typing the desired words, which are then spoken by a text-to-speech (TTS) synthesizer. Actions, such as visiting a web site, parsing text on the site and playing it as speech, are programmed with Tcl scripts. The recognition vocabulary is likewise specified by typing the words to be recognized; their pronunciations are either looked up in a large pronunciation dictionary or, when an entry cannot be found, generated automatically by the TTS system.
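The pronunciation lookup with automatic fallback can be sketched as follows. This is a minimal illustration under stated assumptions, not the Toolkit's code: build_lexicon and naive_letter_to_sound are hypothetical names, the phone symbols are invented, and the one-symbol-per-letter fallback merely stands in for the letter-to-sound rules of a real TTS system.

```python
from typing import Callable, Dict, Iterable, List

Pronunciation = List[str]  # a sequence of phone symbols


def naive_letter_to_sound(word: str) -> Pronunciation:
    """Placeholder for TTS letter-to-sound rules: one symbol per letter."""
    return [ch for ch in word.lower() if ch.isalpha()]


def build_lexicon(
    vocabulary: Iterable[str],
    dictionary: Dict[str, Pronunciation],
    letter_to_sound: Callable[[str], Pronunciation] = naive_letter_to_sound,
) -> Dict[str, Pronunciation]:
    """Map each vocabulary word to a pronunciation.

    Words found in the dictionary use the stored entry; all other words
    fall back to an automatically generated pronunciation.
    """
    lexicon: Dict[str, Pronunciation] = {}
    for word in vocabulary:
        key = word.lower()
        if key in dictionary:
            lexicon[key] = dictionary[key]
        else:
            lexicon[key] = letter_to_sound(key)
    return lexicon


if __name__ == "__main__":
    # Tiny made-up dictionary; the phone symbols are illustrative only.
    dictionary = {"pizza": ["p", "i", "t", "s", "a"]}
    lexicon = build_lexicon(["Pizza", "queso"], dictionary)
    for word, phones in lexicon.items():
        print(word, " ".join(phones))
```

Keeping the fallback as a replaceable function means the same lookup logic could serve English or Spanish, provided a suitable dictionary and letter-to-sound function are supplied for each language.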
In addition to providing an environment for designing and testing spoken language systems, the Toolkit supports a wide range of research activities in spoken language technology, including speech recognition with hidden Markov models and neural nets, text-to-speech synthesis using the FESTIVAL system, natural language understanding, and dialogue modeling.
To develop a Spanish version of the Toolkit, it is necessary to:
The CSLU Toolkit was ported to UDLA by Ron Cole and Stephen Sutton during a visit to UDLA in March, 1997. The Spanish version of the CSLU Toolkit was developed during a month of intensive effort at CSLU by Alejandro Barbosa of UDLA, in collaboration with several people. Alejandro developed the TTS system with Helen Van Scoi during and following a TTS short course taught by Alan Black and Mike Macon at OGI during June 1997. This effort involved:
Work is now underway to improve the intelligibility and naturalness of the TTS, and the accuracy and robustness of the recognizer. In August, 1997, three additional students will visit OGI to take the Spoken Dialogue Systems short course using the CSLU Toolkit. Upon their return to UDLA, they will conduct research projects using and improving the Spanish version of the Toolkit. By the end of the second (and final) year of the project, we hope to have an improved version of the Spanish toolkit in daily use at UDLA for both education and research.
Sutton, S., Kaiser, E., Schalkwyk, J., de Villiers, J., Cole, R., Carlson, D., Cronk, A., "Bringing Spoken Language Systems to the Classroom," EuroSpeech97.
Schalkwyk, J., de Villiers, J., van Vuuren, S., Vermeulen, P., "CSLUsh: An Extendible Research Environment," EuroSpeech97.
Sutton, S., Novick, D.G., Cole, R., Vermeulen, P., de Villiers, J., Schalkwyk, J., Fanty, M., "Building 10,000 Spoken-Dialogue Systems," Proceedings of the 1996 International Conference on Spoken Language Processing, Philadelphia, PA, 709-712, October 1996.
Colton, D., Cole, R., Novick, D., Sutton, S., "A Laboratory Course for Designing and Testing Spoken Dialogue Systems," Proceedings of the 1996 International Conference on Acoustics, Speech and Signal Processing, Atlanta, GA, 1129-1132, May 1996.
Progress in the area of spoken language systems, and speech recognition technology in particular, is linked directly to projects sponsored by DARPA, the Defense Department's Advanced Research Projects Agency, and to large industry efforts, mainly at IBM and AT&T. Between 1971 and 1976, the first ARPA speech program resulted in speaker-dependent recognition of fluent speech for a 1000-word, low-perplexity task; CMU's HARPY system met or exceeded all of the goals established at the beginning of the five-year project. Since 1984, DARPA has funded the development of large vocabulary continuous speech recognition (LVCSR) systems, with annual evaluation of these systems on benchmark tasks. This work has led to the current state of the art, in which systems attempt to transcribe words in news broadcasts and telephone conversations with no constraints on vocabulary size. There is no question that dramatic progress has been made in recognizing words from natural continuous speech, although machine performance is still about an order of magnitude worse than human performance.
Unfortunately, while progress in transcription tasks has been steady and impressive, progress in developing spoken language systems has lagged far behind. The defining feature of a spoken language system is the interaction between human and machine. It follows that progress in developing these systems requires the continued study of how people interact with machines using speech. Such studies will highlight the limitations of recognition technology in the context of system use, and focus research efforts on ways to overcome these limitations. If spoken language systems are to become a reality, it is essential to shift the focus of spoken language research from transcription tasks to human-computer interaction. For example, research in LVCSR does not address issues such as how to phrase a system prompt, how to determine whether a recognition error has occurred, or how to engage in conversational repair once an error has been detected.
A second barrier to progress in spoken language systems is the lack of easily accessible and inexpensive tools to support research and technology transfer. The development of spoken language systems is a complex activity, requiring significant computer resources, integration of sophisticated signal processing, training and recognition algorithms, and language resources such as speech corpora and pronunciation dictionaries. Because of the resources and expertise required, spoken language systems research is localized in a few specialized laboratories, which produce only five or six Ph.D. students each year. The result is that all but a few of the most fortunate students are denied the opportunity to participate in this exciting area of research, and we are not training enough researchers in an area of great strategic importance.
Without tools to create and manipulate spoken dialogue systems and support technology transfer, progress will be limited to the efforts of relatively few researchers at elite laboratories. For progress in spoken language systems to occur, researchers need tools to rapidly design working systems and manipulate system parameters to test experimental hypotheses. The development of the CSLU Toolkit is intended to help fill this need.
Survey of the State of the Art in Human Language Technology:
We invite all researchers to examine the CSLU Toolkit, available free of charge via the CSLU web site. Our hope is that the toolkit will support and stimulate research in interactive systems. We will be grateful to receive feedback on how to improve the toolkit to meet your research needs.