Lori Lamel
& Ronald Cole
LIMSI-CNRS, Orsay, France
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Spoken language is central to human communication and has significant links to both national identity and individual existence. The structure of spoken language is shaped by many factors: by the phonological, syntactic and prosodic structure of the language being spoken; by the acoustic environment and context in which it is produced (for example, people speak differently in noisy and quiet environments); and by the communication channel through which it travels.
Speech is produced differently by each speaker. Each utterance is produced by a unique vocal tract which assigns its own signature to the signal. Speakers of the same language have different dialects, accents and speaking rates. Their speech patterns are influenced by the physical environment, social context, the perceived social status of the participants, and their emotional and physical state.
Large amounts of annotated speech data are needed to model the effects of these different sources of variability on linguistic units such as phonemes, words, and sequences of words. An axiom of speech research is that there are no data like more data. Annotated speech corpora are essential for progress in all areas of spoken language technology. Current recognition techniques require large amounts of training data to perform well on a given task. Speech synthesis systems require the study of large corpora to model natural intonation. Spoken language systems require large corpora of human-machine conversations to model interactive dialogue.
In response to this need, there are major efforts underway worldwide to collect, annotate and distribute speech corpora in many languages. These corpora allow scientists to study, understand, and model the different sources of variability, and to develop, evaluate and compare speech technologies on a common basis.
Recent advances in speech and language recognition are due in part to the availability of large public domain speech corpora, which have enabled comparative system evaluation using shared testing protocols. The use of common corpora for developing and evaluating speech recognition algorithms is a fairly recent development. One of the first corpora used for common evaluation, the TI-DIGITS corpus, recorded in 1984, has been (and still is) widely used as a test base for isolated and connected digit recognition [Leo84].
In the United States, the development of speech corpora has been funded mainly by agencies of the Department of Defense (DoD). Such DoD support produced two early corpora: Road Rally, for studying word spotting, and the King Corpus, for studying speaker recognition. As part of its human language technology program, the Advanced Research Projects Agency (ARPA) of the DoD has funded TIMIT [GLF+93,FDGM86,LKS86], a phonetically transcribed corpus of read sentences used for modeling phonetic variability and for evaluating phonetic recognition algorithms; task-related corpora such as Resource Management (RM) [PFBP88] and Wall Street Journal (WSJ) [PB92] for research on continuous speech recognition; and ATIS (Air Travel Information Service) [Pri90,Hir92] for research on spontaneous speech and natural language understanding.
Recognition of the need for shared resources led to the creation of the Linguistic Data Consortium (LDC) in the U.S. in 1992 to promote and support the widespread development and sharing of resources for human language technology (see the following section for contact addresses). The LDC supports various corpus development activities, and distributes corpora obtained from a variety of sources. Currently, the LDC distributes about twenty different speech corpora, including those cited above, comprising many hundreds of hours of speech. Information about the LDC, as well as contact information for most of the corpora mentioned below, is listed in the next subsection.
The Center for Spoken Language Understanding (CSLU) at the Oregon Graduate Institute collects, annotates and distributes telephone speech corpora. The Center's activities are supported by its industrial affiliates, but the corpora are made available to universities worldwide free of charge. Overviews of the speech corpora available from the Center, and of current corpus development activities, can be found in [CNB+94,CFNL94]. CSLU's Multi-Language Corpus (also available through the LDC) is the NIST standard for evaluating language identification algorithms, and consists of spontaneous speech in eleven different languages [MCO92].
Europe is by nature multilingual, with each country having its own language(s), as well as dialectal variations and lesser used languages. Corpus development in Europe is thus the result of both national efforts and efforts sponsored by the European Union, typically under the ESPRIT (European Strategic Programme for Research and Development in Information Technology), LRE (Linguistic Research and Engineering), and TIDE (Technology Initiative for Disabled and Elderly People) programs, and now, for Eastern Europe, under the PECO (Pays d'Europe Centrale et Orientale)/Copernicus programs.
In February 1995 the European Language Resources Association (ELRA) was established to provide a basis for central coordination of corpora creation, management and distribution in Europe. ELRA is the outcome of the combined efforts of partners in the LRE Relator project and the LE MLAP (Language Engineering Multilingual Action Plan) projects SPEECHDAT, PAROLE and POINTER. These projects are responsible, respectively, for the infrastructure for spoken resources, written resources, and terminology within Europe. ELRA will work in close coordination with the Network of Excellence, ELSNET (European Network in Language and Speech), whose Reusable Resources Task Group initiated the Relator project.
Several ESPRIT projects have attempted to create multilingual speech corpora in some or all of the official European languages. The first multilingual speech collection action in Europe took place in 1989 and consisted of comparable speech material recorded in five languages: Danish, Dutch, English, French, and Italian. The entire corpus, now known as EUROM0, includes eight languages [FHBH89].
Other European projects producing corpora which may be available for distribution include: ACCOR (multisensor recordings, seven languages [MH93]); ARS; EUROM1 (eleven languages); POLYGLOT (seven languages [LIM94]); ROARS; SPELL; SUNDIAL; and SUNSTAR. The LRE ONOMASTICA project [Tra95] is producing large dictionaries of proper names and place names for eleven European languages. While some of these corpora are widely available, others have remained the property of the project consortia that created them. The LE SPEECHDAT project is recording comparable telephone data from 1000 speakers in eight European languages. A portion of the data will be validated and made publicly available for distribution by ELRA.
Some of the more important corpora in Europe resulting from national efforts are: British English: WSJCAM0 [RFP+95], Bramshill, SCRIBE, and Normal Speech Corpus; Scottish English: HCRC Map Task [ABB+91,TAB+93]; Dutch: Groningen; French: BDSONS [CDE+84], BREF [LGE91,GLE90,GL93]; German: PHONDAT1 and PHONDAT2, ERBA and VERBMOBIL; Italian: APASCI [ABF+93,ABF+94]; Spanish: ALBAYZIN [MPB+93,DRP+93]; Swedish: CAR and Waxholm.
Some of these corpora are readily available (see the following section for contact information on corpora mentioned in this section), and efforts are underway to make others available.
There have also been some recent efforts to record the everyday speech of typical citizens. One such effort is part of the British National Corpus, in which about 1500 hours of speech, representing a demographic sampling of the population and a wide range of materials, have been recorded, ensuring coverage of four contextual categories: educational, business, public/institutional, and leisure. The entire corpus is in the process of being orthographically transcribed, with annotations for non-speech events. A similar corpus for Dutch is currently under discussion in the Netherlands, and the Institute of Phonetics and Verbal Communication of the University of Munich has begun collecting a very large database of spoken German.
The Translanguage English Database (TED) [LSF+94] is a corpus of recordings of oral presentations given at Eurospeech'93 in Berlin, in multi-dialect English and non-native English. TEDspeeches contains data ranging in style from read to spontaneous, under varying degrees of stress. An associated text corpus, TEDtexts, contains written versions of the proceedings articles, which can be used to define vocabulary items and to construct language models. Two auxiliary sets of recordings were made: one consisting of speakers recorded with a laryngograph (TEDlaryngo) in addition to the standard microphone, and the other a set of Polyphone-like recordings (TEDphone) made by the speakers in English and in their mother language. This corpus was partially funded by the LRE project EuroCocosda.
Major corpus collection efforts have also been undertaken in other parts of the world. These include: Polyphone, a multilingual, multinational, application-oriented telephone speech corpus (co-sponsored by the LDC); the Australian National Database of Spoken Language (ANDOSL) project, a national effort to create a database of spoken language, sponsored by the Australian Speech Science and Technology Association Inc. and funded by a research infrastructure grant from the Australian Research Council; the Chinese National Speech Corpus, supported by the National Science Foundation of China and designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of speech processing systems; and corpora from Japan, such as those publicly available from ATR, ETL and JEIDA.
Spoken language corpora pose many challenges. One basic challenge is design methodology: how to design compact corpora that can be used in a variety of applications; how to design comparable corpora in a variety of languages; how to select (or sample) speakers so as to obtain a population that is representative with regard to many factors, including accent, dialect, and speaking style; how to create generic dialogue corpora so as to minimize the need for task- or application-specific data; and how to select statistically representative test data for system evaluation. Another major challenge centers on developing standards for transcribing speech data at different levels and across languages: establishing symbol sets and alignment conventions; defining levels of transcription (acoustic, phonetic, phonemic, word and other levels); and agreeing on conventions for prosody and tone and for quality control (such as having independent labelers transcribe the same speech data to obtain reliability statistics). Quality control of the speech data itself is also an important issue that needs to be addressed, as are methods for dissemination. While CD-ROM has become the de facto standard for disseminating large corpora, other potential means, such as very high speed fiber optic networks, also need to be considered.
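The reliability statistics mentioned above, obtained by having independent labelers transcribe the same data, can be illustrated with a simple agreement measure. The following is a minimal sketch only: it assumes that two labelers have each assigned one label to the same sequence of pre-aligned segments, and it uses Cohen's kappa, which is one common choice and not one prescribed by the text. The function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers.

    Assumes both sequences label the same, already aligned segments
    (an assumption made for this sketch).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: fraction of segments given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each labeler chose labels independently,
    # following their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both labelers used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented phone labels for four segments, for illustration only.
labeler_1 = ["s", "iy", "s", "iy"]
labeler_2 = ["s", "iy", "iy", "iy"]
print(cohen_kappa(labeler_1, labeler_2))
```

A value near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance. Real transcriptions would first need to be aligned, since labelers may segment the same speech differently.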