Speech Enhancement and Assessment Resource (SpEAR)
SpEAR consists of a speech database and toolkit for assessing
performance of speech enhancement algorithms.
This project is currently pending NSF funding. Companies and corporate
members interested in providing input and/or financial support for this project should
contact Professor Eric Wan.
The objective of this project is to develop standardized data, tools, and
evaluation guidelines to enable the consistent benchmarking of speech
enhancement (noise removal) algorithms. CSLU is developing a Speech
Enhancement and Assessment Resource (SpEAR), consisting of a speech database
particularly suited for enhancement tasks, along with a toolkit specifically
designed for assessing the performance of different speech enhancement
algorithms.
SpEAR will be provided for educational purposes as a service to the speech
enhancement research community.
SpEAR Database
Most speech databases today have been designed for speech
recognition experiments. They contain many utterances spoken by a large number
of subjects, often in a single noise-free environment. In contrast, the SpEAR
Database will be created to examine and assess the performance of speech
enhancement systems in adverse/noisy environments. The SpEAR Database
will contain carefully selected samples of noise-corrupted speech that have
been recorded by acoustically combining clean speech and noise (a
clock-synchronous procedure will be used to provide a time-aligned reference to
the clean speech). Three separate sub-corpora will be designed:
- ASR corpus (for measuring effects on ASR performance). Two
sub-corpora will be created: regular text and digit sequences. For regular
text, a subset of the text of the Wall Street Journal Corpus will be
used. For digit sequences, 16-digit pseudo credit card numbers will be
generated.
- Human/intelligibility corpus (for measuring effects on
intelligibility). Two tests will be used: the MPIT and the Semantically
Unpredictable Sentences (SUS) test. The sentences in the MPIT are
constructed to form minimal pairs, such as "The saving/shaving ate the
leaf". The sentences in the SUS test have fixed syntactic structures (e.g.,
subject-verb-direct object).
- Human/quality corpus (for measuring effects on perceived
naturalness or pleasantness). We will record the Harvard Sentences, which is a
set of 100 meaningful, syntactically varied, phonetically balanced sentences.
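As an illustration of the digit-sequence prompts mentioned for the ASR corpus, the sketch below generates 16-digit pseudo credit card numbers. It assumes the digits are simply drawn uniformly at random and grouped in fours for reading; the function name `pseudo_card_number`, the grouping, and the fixed seed are hypothetical choices, not part of the SpEAR specification.

```python
import random


def pseudo_card_number(rng, n_digits=16, group=4):
    """Return n_digits random decimal digits, grouped for use as a reading prompt.

    Assumption: digits are independent and uniform; no checksum is enforced.
    """
    digits = [str(rng.randrange(10)) for _ in range(n_digits)]
    groups = ["".join(digits[i:i + group]) for i in range(0, n_digits, group)]
    return " ".join(groups)


# A fixed seed keeps the prompt list reproducible between runs.
rng = random.Random(0)
prompts = [pseudo_card_number(rng) for _ in range(3)]
for prompt in prompts:
    print(prompt)
```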
Each sub-corpus will be recorded in the following environments:
- Acoustically controlled room environment. These recordings
will be done in a quiet room environment using the acoustic addition
technique developed for the SpEAR Beta release. Clean speech
samples will either be spoken by a human (for Lombard recordings) or played
over a monitor-quality loudspeaker from the pre-recorded corpus. Noise
samples will be synchronously played from another loudspeaker and acoustically
combined with the clean speech reference.
- Cellular phone car environment. These recordings will be done in a
moving car to simulate a cellular phone environment with real ambient noise
conditions. The arrangement of microphones will allow us to isolate (or
include) both the effects of the cellular microphone and the coding/channel
effects.
- Artificial environment. This "environment" consists of the
standard technique of artificially (digitally) combining noise samples with
the pre-recorded corpora. While the acoustically controlled room
environment is capable of the same task, the artificial approach allows
for a larger number of combinations to be reasonably implemented.
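The artificial environment described above amounts to scaling a noise recording and adding it sample by sample to the clean speech. A minimal sketch follows, assuming both signals are equal-length lists of float samples at the same sampling rate and that the mixing level is specified as a global SNR in dB. The names `mix_at_snr` and `mean_power`, and the synthetic test signals, are illustrative assumptions rather than SpEAR routines.

```python
import math
import random


def mean_power(signal):
    """Average squared amplitude of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale noise so the clean-to-noise power ratio equals snr_db, then add it.

    Returns the noisy mixture and the scaled noise, so the clean reference
    and the noise source both stay available for later scoring.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = mean_power(clean)
    p_noise = mean_power(noise)
    if p_clean == 0.0 or p_noise == 0.0:
        raise ValueError("signals must have non-zero power")
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    noisy = [c + s for c, s in zip(clean, scaled)]
    return noisy, scaled


# Synthetic stand-ins: a 440 Hz tone for "speech" and Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy, scaled_noise = mix_at_snr(clean, noise, snr_db=5.0)
achieved = 10 * math.log10(mean_power(clean) / mean_power(scaled_noise))
print(round(achieved, 3))
```

Because the mixture is computed digitally, the same clean utterance can be combined with many noise types and SNR levels at negligible cost, which is the advantage the text attributes to this environment.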
SpEAR Toolkit
The SpEAR Toolkit will be a collection of routines and a
convenient graphical user interface for examining and assessing the performance
of speech enhancement algorithms. The toolkit will contain components for both
objective and subjective evaluations, with a clear set of guidelines
on how the resources are to be used, what experiments are to be conducted,
and how evaluation results are to be reported.
Objective Quality Assessment Measures
The SpEAR toolkit will contain a baseline set of standard objective speech
quality measures used within the speech enhancement community. These include:
- Segmental SNR (SNRseg)
- Itakura-Saito distortion (IS)
- Log likelihood ratio (LLR)
- Log area ratio (LAR)
- Weighted spectral slope (WSS)
- Perceptually weighted segmental SNR (PWSNRseg)
- Bark spectral distortion (BSD)
- Perceptual speech quality measure (PSQM)
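To make the first measure in the list concrete, here is a minimal sketch of segmental SNR as it is commonly defined: the per-frame SNR between the clean reference and the processed signal, averaged over frames. The frame length of 256 samples, the non-overlapping frames, and the clamping range of -10 to 35 dB are conventional choices assumed here, not values taken from the SpEAR toolkit.

```python
import math


def segmental_snr(clean, enhanced, frame_len=256, lo_db=-10.0, hi_db=35.0):
    """Mean of per-frame SNRs (dB) between a clean reference and a processed signal.

    Assumptions: non-overlapping frames, time-aligned inputs, and per-frame
    values clamped to [lo_db, hi_db] so that silent or perfectly
    reconstructed frames do not dominate the average.
    """
    n = min(len(clean), len(enhanced))
    frame_snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        est = enhanced[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        error_energy = sum((x - y) ** 2 for x, y in zip(ref, est))
        if signal_energy == 0.0:
            continue  # skip frames with no reference energy
        if error_energy == 0.0:
            snr = hi_db
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        frame_snrs.append(min(max(snr, lo_db), hi_db))
    if not frame_snrs:
        raise ValueError("no usable frames")
    return sum(frame_snrs) / len(frame_snrs)


# Toy check: attenuating the signal by 10% leaves an error of 0.1 * clean,
# i.e. an error energy 100 times smaller than the signal energy per frame.
clean = [math.sin(2 * math.pi * n / 64) for n in range(1024)]
attenuated = [0.9 * x for x in clean]
score = segmental_snr(clean, attenuated)
print(round(score, 3))
```

The time-aligned clean reference provided by the database is what makes a frame-by-frame comparison like this possible.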
The toolkit will contain scripts and interfaces specifically designed to
evaluate performance on the SpEAR Human/quality sub-corpora. A GUI will allow
the user to view clean, noisy, and enhanced speech along with noise sources,
phonetic labels, and objective quality measures in a time-aligned
representation. This will allow the user to investigate the performance of a
speech enhancement algorithm in detail, especially with respect to different
components of speech (phonemes, phones, syllables, etc.). Scripts will allow
the user, for example, to automatically compute objective performance measures
broken down by class (e.g., voiced and unvoiced) or cumulatively over the
entire SpEAR corpus.
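The per-class reporting described above can be sketched as a simple grouping step. The example assumes one objective score and one class label per frame, both hypothetical inputs; the function name `scores_by_class` and the label strings are illustrative only.

```python
from collections import defaultdict


def scores_by_class(frame_scores, frame_labels):
    """Average frame-level scores within each label class and over all frames."""
    if len(frame_scores) != len(frame_labels):
        raise ValueError("need exactly one label per score")
    if not frame_scores:
        raise ValueError("no frames to summarize")
    buckets = defaultdict(list)
    for score, label in zip(frame_scores, frame_labels):
        buckets[label].append(score)
    summary = {label: sum(vals) / len(vals) for label, vals in buckets.items()}
    summary["all"] = sum(frame_scores) / len(frame_scores)
    return summary


# Made-up per-frame SNR values (dB) with voiced/unvoiced labels.
scores = [12.0, 8.0, 3.0, 5.0]
labels = ["voiced", "voiced", "unvoiced", "unvoiced"]
summary = scores_by_class(scores, labels)
print(summary)
```

The same grouping works for finer label sets, such as phoneme identities, as long as the labels are time-aligned with the frames being scored.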
Subjective Evaluation
Five experimental procedures will be implemented: two for measuring
intelligibility and three for measuring quality. These experiments
will be performed on the raw SpEAR sub-corpora as well as after processing
with a number of standard enhancement algorithms. The results will
provide a base reference for subjective performance
evaluation. In addition, the protocols, scripts, and interfaces developed
for performing our evaluations will be made available to the academic community
to aid in standardizing outside evaluation.
- Intelligibility. (For use with the SpEAR Human/intelligibility corpus.)
Two different tests will be performed:
  - In the MPIT method, an utterance is presented once, followed by a visual
    presentation of the minimal pair. The listener has to make a binary
    choice between the words in the pair.
  - In the SUS method, the listener has to write down the answer.
    Since the words used in this test are relatively common and have few
    spelling-to-sound ambiguities, scoring can be based on a direct orthographic
    match.
- Quality. (For use with the SpEAR Human/quality corpus.) The recordings
used will be the Harvard sentence recordings. Three different tests will
be performed:
  - The Degradation Mean Opinion Score (DMOS) paradigm requires
  - The Diagnostic Acceptability Measure (DAM).
  - The Diagnostic Acceptability Measure (DAM) modified to measure
    enhancement qualities.
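The two intelligibility procedures described earlier in this section lend themselves to simple automatic scoring. The sketch below is one possible reading: the MPIT score is the fraction of correct binary choices, and the SUS score is the fraction of reference words reproduced exactly. The position-by-position word comparison, the case folding, and the punctuation stripping are assumptions made for illustration, as are the function names and the example SUS sentence.

```python
import string


def mpit_score(chosen, correct):
    """Fraction of trials in which the listener picked the word actually spoken."""
    if len(chosen) != len(correct) or not correct:
        raise ValueError("need one response per trial")
    hits = sum(1 for c, t in zip(chosen, correct) if c == t)
    return hits / len(correct)


def normalize(sentence):
    """Lower-case, drop punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()


def sus_word_score(response, reference):
    """Fraction of reference words matched orthographically, position by position.

    Assumption: responses are not re-aligned, so an inserted or dropped word
    shifts every later comparison. A real protocol may align words first.
    """
    ref_words = normalize(reference)
    resp_words = normalize(response)
    if not ref_words:
        raise ValueError("reference sentence is empty")
    hits = sum(1 for r, t in zip(resp_words, ref_words) if r == t)
    return hits / len(ref_words)


# MPIT: three trials built on the saving/shaving pair from the text.
mpit = mpit_score(["saving", "shaving", "saving"], ["saving", "saving", "saving"])

# SUS: a made-up subject-verb-direct object sentence and a listener response.
sus = sus_word_score("the chair eight the window", "The chair ate the window.")
print(mpit, sus)
```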