Center for Spoken Language Understanding
Oregon Graduate Institute of Science and Technology
The objective of our work is to develop spoken language resources to support research in:
Excellent progress has been made in the development of this corpus. With the exception of Swahili, at least 100 calls were collected in each language. Each of the more than 3000 calls was verified by two native speakers of the language, who were trained to verify every utterance in the database. The native speakers made judgments about the speaker's accent and dialect, gender, age, and intelligibility, and about connection quality. They also coded every individual file that had background noise or background speech. Poor-quality files, such as those containing no useful speech, were discarded. The corpus has been documented and released via the CSLU Web site.
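The verification pass described above can be pictured as a small filtering step over per-file judgments. The following is only a minimal sketch, not the procedure actually used at CSLU: the record layout, the field names, and the rule that a file is kept only when every verifier finds useful speech in it are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Verification:
    """One native speaker's judgment of one recorded file (hypothetical schema)."""
    file_id: str
    verifier: str
    has_useful_speech: bool
    background_noise: bool = False   # coded per file, as in the text above
    background_speech: bool = False  # coded per file, as in the text above


def keep_files(judgments):
    """Return sorted IDs of files that every verifier judged to contain useful speech.

    Assumption: a single "no useful speech" verdict is enough to discard a file.
    """
    verdicts = {}
    for j in judgments:
        verdicts.setdefault(j.file_id, []).append(j.has_useful_speech)
    return sorted(fid for fid, votes in verdicts.items() if all(votes))
```

Under these assumptions, files flagged for background noise or background speech are still kept; only files that a verifier marks as having no useful speech are dropped.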
The "Publications," "Corpus" and "Toolkit" areas of the CSLU Web site provide detailed descriptions of the speech corpora, software tools and technology available from CSLU free of charge for non-commercial use. The "Publications" link provides references to, and on-line versions of, all of our articles.
The development of spoken language systems, which allow people to interact with machines using speech, requires large amounts of annotated speech data. These collections of data, called speech corpora, are used to study and understand the sources of variability in the signal, to develop recognition algorithms, and to evaluate their performance.
In the area of computer speech recognition, the development of speech corpora has revolutionized the field by allowing a rigorous methodology to be applied to the evaluation of recognition systems. Specifying the data used for developing and evaluating systems, and the metrics used to measure performance, made it possible to compare systems produced in different laboratories. The development of speech corpora also made it possible to measure progress in computer speech recognition, speaker recognition, language identification, and other areas of human language technology.
In the future, the development of interactive systems will allow people to interact with machines using natural communication skills to accomplish an unlimited number of tasks. New language resources and evaluation methodologies will be needed to evaluate research advances and measure progress in these systems. How specific research advances can be evaluated across a wide variety of systems is an important and exciting challenge for the future of interactive systems.
Chapter 12 of the Survey of the State of the Art of Human Language Technology contains six different contributions on language resources (Overview, Written Language Corpora, Spoken Language Corpora, Lexicons, Terminology, and Addresses for Language Resources), and many references to other sources.
The CSLU web site area on speech corpora describes all corpora available from CSLU, including the protocols used during data collection, and provides pointers to publications that describe the different corpora.
The Linguistic Data Consortium (LDC) Web Site provides information about spoken and written language resources.
The Perceptual Systems Laboratory at the University of California, Santa Cruz, maintains a gateway to global information on recognition and production of visual speech (e.g., animated faces, automatic speechreading).
Human language resources provide the foundation for research in each of the six areas of interactive systems, as currently defined by NSF: Virtual Environments; Speech and Natural Language Understanding; Other Communication Modalities; Adaptive Human Interfaces; Usability and User-Centered Design; and Intelligent Interactive Systems for Persons with Disabilities.