David S. Pallett
& Adrian Fourcin
National Institute of Standards and Technology, Gaithersburg, Maryland, USA
University College London, London, UK
Assessment and evaluation
are concerned with the
global quantification and detailed measurement of system
performance. Disciplined procedures of this type are at the heart of
progress in any field of engineering. They not only make it possible
to monitor change over time in a given system and meaningfully
compare one approach with another; they also usefully extend basic
knowledge.
Within the past several years, there has been widespread and growing
international interest in a number of issues involved in
speech input system performance assessment. In
Europe, the SAM Projects (ESPRIT Projects 2589
and 6819) addressed ``Multi-Lingual Speech Input/Output Assessment,
Methodology and Standardization'' [F
92].
In the United States, the ARPA Spoken Language Program has
made extensive use of periodic benchmark tests to gauge progress
and to serve as a focal point for discussions at a number of
ARPA-sponsored workshops [DAR89,DAR90,DAR91b,DAR92b,ARP93a,ARP94].
There have also been a number of international workshops, such as those held in conjunction with the Eurospeech Conferences and the International Conferences on Spoken Language Processing [JM92]. The present contribution focuses on three sub-areas: speech recognition; speech understanding; and speaker recognition.
To a first approximation, the task of speech recognition may be regarded as producing a hypothesized orthographic transcription from a spoken language input. The most commonly cited output is in the form of words in ASCII characters, although other units (e.g., syllables or phonemes) are sometimes found.
Assessment methods developed for speech recognition involve a complementary combination of system-based approaches with performance-based techniques. System-based approaches either deal with the recognition system as a whole (black box methods) or provide access to individual modules within the complete recognizer (glass box methods). For each of these approaches, quantitative appraisals of performance may range from the use of application-related (non-diagnostic) training and test data to highly diagnostic techniques oriented toward detailed evaluation, with test data ranging from, for example, phonetically controlled speech to language-independent data derived from artificial speech generation. These extremes of performance measurement define a continuum within which methods of global (benchmarking) assessment and detailed evaluation may be categorized into the following groups:
The most frequently used methods (e.g., those used within the ARPA programme) belong to group (a). Much of the data used for speech recognizer performance assessment consists of read speech, not spontaneous, goal-directed speech. Some of the data used for large-scale performance assessment efforts is openly available (see section 12.6).
Automatic scoring methods are used in most cases, relying on dynamic programming to align the reference string with the system's hypothesis string. Results are typically reported as word or sentence error percentages, where word errors are categorized as substitutions, insertions, or deletions.
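As an illustration of the alignment step, the following minimal sketch aligns a reference and a hypothesis word string by dynamic programming and counts substitutions, insertions, and deletions. The unit edit costs, the function name `align_and_count`, and the `ErrorCounts` container are assumptions made for this example, not the scoring procedure of any particular benchmark; operational scoring software may use different cost weights and text normalization.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    """Hypothetical container for the outcome of one alignment."""
    substitutions: int
    insertions: int
    deletions: int
    reference_length: int

    @property
    def word_error_rate(self) -> float:
        """Word error percentage relative to the reference length."""
        if self.reference_length == 0:
            raise ValueError("reference must contain at least one word")
        errors = self.substitutions + self.insertions + self.deletions
        return 100.0 * errors / self.reference_length


def align_and_count(reference: list[str], hypothesis: list[str]) -> ErrorCounts:
    """Align two word lists with unit-cost edit distance and count error types."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = j  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )

    # Trace one minimum-cost path back from the end to classify the errors.
    subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                subs += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ErrorCounts(subs, ins, dels, n)


counts = align_and_count(
    "show me flights to boston".split(),
    "show flights to the boston".split(),
)
print(counts, counts.word_error_rate)
```

For this reference and hypothesis the sketch finds one deletion ("me") and one insertion ("the") against a five-word reference, a word error rate of 40%. When several alignments share the minimum cost, the split between error types depends on the tie-breaking order in the traceback, although the total error count does not.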
The statistical validity of assessment tests for recognizers has been studied [CCD91], and a number of well-known statistical measures are in use, using both parametric and non-parametric techniques.
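The text does not say which tests are used. As one example of a non-parametric technique, the sketch below applies an exact two-sided McNemar (sign) test to the paired sentence-level outcomes of two recognizers run on the same test set. The choice of this test, the function names, and the input format (one boolean per sentence, true if the sentence was recognized without error) are assumptions made for illustration.

```python
from math import comb


def count_discordant(correct_a: list[bool], correct_b: list[bool]) -> tuple[int, int]:
    """Count sentences on which exactly one of the two systems is in error."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same sentences")
    only_a_wrong = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    only_b_wrong = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    return only_a_wrong, only_b_wrong


def mcnemar_exact_p(only_a_wrong: int, only_b_wrong: int) -> float:
    """Two-sided exact p-value under the null hypothesis that each
    discordant sentence is equally likely to be an error of either system."""
    n = only_a_wrong + only_b_wrong
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence of a difference
    k = min(only_a_wrong, only_b_wrong)
    # Probability of k or fewer successes in n fair coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


print(mcnemar_exact_p(2, 10))
```

With 2 sentences wrong only for the first system and 10 wrong only for the second, the two-sided p-value is about 0.039. Sentences on which both systems agree carry no information for this test, which is why only the discordant counts are used.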
In speech understanding some semantic analysis or interpretation of the speech recognizer's output is implicitly or explicitly required---for example where the process of automatic speech recognition is intended as input to a command/control application.
Performance assessment for speech, or more generally spoken language, understanding systems is substantially more complex and problematic than for speech recognition systems. Procedures for performance assessment of natural language processing systems, in general, are not yet well established, but many relevant issues have been identified and addressed in increasing detail at workshops in Pennsylvania in 1988, Berkeley in 1991, Edinburgh and Trento in 1992, as well as at the ARPA Human Language Technology Workshops.
For spoken language understanding systems, the use of reference speech databases as system input is not so clearly appropriate, because issues involving human behavior and human-computer interactivity become complicating factors. ``It is particularly difficult to engage in speech evaluation where the entire system design assumes a high degree of interaction between user and system, and makes explicit allowance for [dialogue] clarification and recovery, as in the VODIS telephone train inquiry case'' [GSJ93].
Nonetheless, this procedure has been implemented extensively, for example within the ARPA Spoken Language Program in the U.S., in the Air Travel Information Service (ATIS) domain, a spoken natural language (air travel information) database query task. ``The evaluation methodology is black box and implemented using an automatic evaluation system. It is performance related; only the content of an answer retrieved from the database is evaluated'' [GSJ93].
A variety of procedures have been suggested for accommodating interactive systems with dialogue management and/or clarification. So-called end-to-end assessment methods---in which measures of system-user efficiency in task completion and/or subjective measures of satisfaction are derived---are frequently complicated by large subject-to-subject or task-to-task variabilities, and their attendant statistical considerations. It is clear that these complications will be relevant to the assessment and benchmarking of commercial technology for real applications, as well as to their detailed evaluation and future development.
Speaker recognition technology is conventionally discussed in terms of two different areas: speaker identification and speaker verification (see section 1.7). Speaker identification can often be thought of as a closed set problem, where the system's task is to identify an unidentified voice as coming from one of a set of N reference speakers. In practical applications, open set speaker identification permits a rejection response corresponding to the possibility that the unidentified voice does not belong to any of the reference speakers. The task of a speaker verification system is to decide whether the unlabeled voice belongs to the specific genuine speaker whose identity has been claimed to the system, or to an impostor.
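The three decision tasks described above can be contrasted in a short sketch. It assumes that some speaker model has already produced a similarity score for each reference speaker (higher meaning more similar); the scores, the thresholds, and the function names are hypothetical and are not taken from any particular system.

```python
from typing import Optional


def identify_closed_set(scores: dict[str, float]) -> str:
    """Closed set: always answer with the best-scoring reference speaker."""
    return max(scores, key=scores.get)


def identify_open_set(scores: dict[str, float], threshold: float) -> Optional[str]:
    """Open set: return the best-scoring speaker, or None (rejection)
    when even the best score falls below the threshold."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None


def verify(score_for_claimed_speaker: float, threshold: float) -> bool:
    """Verification: accept the identity claim only if the score against
    the claimed speaker's model reaches the threshold."""
    return score_for_claimed_speaker >= threshold


# Hypothetical similarity scores for one unidentified voice.
scores = {"speaker_1": 0.41, "speaker_2": 0.78, "speaker_3": 0.15}

print(identify_closed_set(scores))      # best match among the N reference speakers
print(identify_open_set(scores, 0.90))  # None: no reference speaker is close enough
print(verify(scores["speaker_2"], 0.70))
```

Identification compares the voice against all N reference models, whereas verification involves a single comparison against the claimed speaker's model and a binary accept or reject decision.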
The state of the art in the evaluation of speaker identification and
verification systems can be found in the Proceedings of the Automatic
Speaker Recognition, Identification and
Verification ETRW Workshop [CBP94],
as a summary of the initial efforts of ESPRIT
Project 6819, Speech Technology Assessment Methodology in
Multilingual Applications (SAM-A) [B 94b].
In the shorter term, provision should be made for more accurate speech recognition scoring procedures making use of time-marked reference transcriptions and system outputs. Such procedures may prove essential when conducting multi-lingual performance assessment, to facilitate cross comparison and, for example, because of increased ambiguity concerning word boundaries for some languages. The adequate provision of these facilities will involve quite new approaches to the large scale accurate labeling of speech databases.
The increasingly wide range of applications of speech recognition technology introduces new needs and new problems. The need to support fluent dialogue interaction with a range of speakers, accents, dialects and conditions of health increases the complexity of assessment and evaluation for developer and user alike. For truly spontaneous speech input collected in operational environments, the presence of disfluencies (e.g., pause-fillers, word fragments, false starts and restarts) and noise artifacts provides additional complicating factors.
The associated need for systems that can be trained to work with a range of language inputs similarly imposes a much greater burden on the organization and collection of appropriate spoken language corpora. This in turn should lead to the gradual use of more analytic and language-independent techniques (glass box techniques) and to an increasingly close association of work in speech input with work in speech output/synthesis and natural language processing.
The increasing complexity of processing associated with the development and application of spoken language processing systems is necessarily tied in with an increasing need for precision, both in the methods employed for the appraisal of performance and in our use of the description of these methods.
Assessment is the process of system appraisal which leads to a global, overall quantification of performance. Assessment is related conceptually to black box methods, in which the detailed mechanisms of processing are not considered. (The word itself has its origin in the Latin assidere---to sit by---and relates to the levying of tax on the gross production of an enterprise.)
Evaluation involves the analytic description of system performance in terms of defined factors; it is concerned with detailed measurement. Evaluation is conceptually related to the glass box approach, in which the objective is, for example, to gain a greater understanding of system performance from the use of precision diagnostic techniques based on special purpose phonetic databases. (The word itself has its origins in the French word evaluer---to calculate from a mathematical expression or to express in terms of something already known.)