Human Language Resources for Research in Multilanguage Systems, Robust Recognition and Speaker Identification

Ron Cole, Mark Fanty, Beatrice Oshika

Center for Spoken Language Understanding
Oregon Graduate Institute of Science and Technology

CONTACT INFORMATION

P.O. Box 91000
Portland, OR 97291-1000
Phone: 503 690 1159
Fax: 503 690 1306
Email: cole@cse.ogi.edu

WWW PAGE

http://www.cse.ogi.edu/CSLU/

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Language resources, multi-language resources, speech corpora, speech recognition, speaker recognition, language recognition.

PROJECT SUMMARY

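The objective of our work is to develop spoken language resources to support research in multilanguage systems, robust recognition and speaker identification. The corpora under development are described below.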

National Cellular Corpus.

The goal of the National Cellular data collection is to collect and transcribe cellular speech data from about 100 cellular phone users in each major dialect area of the United States. Thus far, we have collected over 100 calls in each of four cities: Seattle, Dallas, Kansas City and Las Vegas. We plan to continue collection and transcription of cellular calls at the rate of 100 callers from a different city every two months. A first release of this corpus, with data and transcriptions from the four cities, is available via the CSLU Web site.

Speaker Recognition Corpus.

The goal of the Speaker Recognition corpus is to collect speech data from a large number of people calling from different telephones and locations at different times of the day over a period of two years. Currently, there are over 1500 participants, divided into twelve groups, with people in each group calling twelve different times each year. We are amazed (and pleased) at the responsiveness of the public to this data collection: the attrition rate is low, and more people have volunteered to be a part of the effort than we are able to accommodate. Orthographic transcription of this corpus has begun.
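
As a rough illustration of the scale this design implies, the figures above can be combined in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes exactly 1500 participants (the text says "over 1500"), equal-sized groups and no attrition, and the function names are hypothetical, not part of any CSLU tool.

```python
def calls_per_participant(calls_per_year: int, years: int) -> int:
    """Calls one participant is asked to make over the whole collection."""
    return calls_per_year * years


def planned_calls(participants: int, calls_per_year: int, years: int) -> int:
    """Total planned calls, assuming every participant completes every call."""
    return participants * calls_per_participant(calls_per_year, years)


def group_size(participants: int, groups: int) -> int:
    """Participants per group, assuming groups are equal in size."""
    return participants // groups


# Figures taken from the text; 1500 is a lower bound ("over 1500").
PARTICIPANTS = 1500
GROUPS = 12
CALLS_PER_YEAR = 12
YEARS = 2

print(calls_per_participant(CALLS_PER_YEAR, YEARS))        # 24
print(planned_calls(PARTICIPANTS, CALLS_PER_YEAR, YEARS))  # 36000
print(group_size(PARTICIPANTS, GROUPS))                    # 125
```

Under these assumptions the design calls for at least 36,000 recordings; the actual number will differ with attrition and with the true participant count.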

22 Language Corpus.

The goal of the 22 language corpus is to collect and verify utterances in each of 22 languages to support research in automatic language identification, multilanguage speech and speaker recognition, and detection and modeling of foreign accents of English. For each corpus, callers respond to prompts in their native language. They are also asked to speak on a selected topic in English for 20 seconds.

Excellent progress has been made in development of this corpus. With the exception of Swahili, at least 100 calls were collected in each language. Each of the more than 3000 calls was verified by two native speakers, who were trained in each language to check every utterance in the database. The native speakers made judgments about the speaker's accent and dialect, gender, age, intelligibility and connection quality. Further, they coded every individual file that had background noise or background speech. Poor-quality files, such as those containing no useful speech, were discarded. The corpus has been documented and released via the CSLU Web site.
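
The verification step described above might be represented along the following lines. This is a minimal sketch in Python, not the format actually used by CSLU: the class names, field names and the rule that a file is dropped when any verifier finds no useful speech are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Judgment:
    """One native speaker's assessment of one recorded file (hypothetical schema)."""
    verifier_id: str
    has_useful_speech: bool
    background_noise: bool = False
    background_speech: bool = False


@dataclass
class RecordedFile:
    """A single recorded file together with the judgments made about it."""
    name: str
    judgments: List[Judgment] = field(default_factory=list)

    def is_verified(self) -> bool:
        # The text states that two native speakers verified each call.
        return len({j.verifier_id for j in self.judgments}) >= 2

    def keep(self) -> bool:
        # Assumed rule: discard if any verifier found no useful speech.
        return self.is_verified() and all(j.has_useful_speech for j in self.judgments)

    def flags(self) -> List[str]:
        # Collect the background conditions coded by either verifier.
        found = []
        if any(j.background_noise for j in self.judgments):
            found.append("background_noise")
        if any(j.background_speech for j in self.judgments):
            found.append("background_speech")
        return found


def filter_corpus(files: List[RecordedFile]) -> List[RecordedFile]:
    """Return only the files that pass verification and contain useful speech."""
    return [f for f in files if f.keep()]
```

A file judged usable by both verifiers is retained along with its noise flags, while a file that either verifier marks as having no useful speech is dropped. Speaker-level judgments such as accent, dialect, gender and age are left out of the sketch for brevity.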

Foreign Accents of English.

We are in the process of creating a new corpus, entitled Foreign Accents of English, that will support research on accent detection and speaker modeling, leading to more accurate and robust recognition systems. Each caller in the 22 language corpus produced about 30 seconds of extemporaneous speech in English. We are currently rating the degree of accent, and will release this corpus when we have completed the rating and documented it.

PROJECT REFERENCES

The "Publications," "Corpus" and "Toolkit" areas of the CSLU Web site provide detailed descriptions of the speech corpora, software tools and technology available from CSLU free of charge for non-commercial use. The "Publications" link provides references and on-line versions of all of our articles.

AREA BACKGROUND

The development of spoken language systems, which allow people to interact with machines using speech, requires large amounts of annotated speech data. These collections of data, called speech corpora, are used to study and understand the sources of variability in the signal, to develop recognition algorithms, and to evaluate their performance.

In the area of computer speech recognition, the development of speech corpora has revolutionized the field by allowing a rigorous methodology to be applied to the evaluation of recognition systems. By specifying the data used for developing and evaluating systems, and the metrics used to measure performance, it became possible to compare systems produced in different laboratories. The development of speech corpora made it possible to measure progress in computer speech recognition, speaker recognition, language identification and other areas of human language technology.

In the future, the development of interactive systems will allow people to interact with machines using natural communication skills to accomplish an unlimited number of tasks. New language resources and evaluation methodologies will be needed to evaluate research advances and measure progress in these systems. How specific research advances can be evaluated across a wide variety of systems is an important and exciting challenge for the future of interactive systems.

AREA REFERENCES

Chapter 12 of the Survey of the State of the Art of Human Language Technology contains six different contributions on language resources (Overview, Written Language Corpora, Spoken Language Corpora, Lexicons, Terminology, and Addresses for Language Resources), and many references to other sources.

The CSLU web site area on speech corpora describes all corpora available from CSLU, including the protocols used during data collection, and provides pointers to publications that describe the different corpora.

The Linguistic Data Consortium (LDC) Web Site provides information about spoken and written language resources.

The Perceptual Systems Laboratory at the University of California, Santa Cruz, maintains a gateway to global information on recognition and production of visual speech (e.g., animated faces, automatic speechreading).

RELATED PROGRAM AREAS

Human language resources provide the foundation for research in each of the six areas of interactive systems, as currently defined by NSF: Virtual Environments; Speech and Natural Language Understanding; Other Communication Modalities; Adaptive Human Interfaces; Usability and User-Centered Design; and Intelligent Interactive Systems for Persons with Disabilities.