Chapter 13: Evaluation
Lynette Hirschman & Henry S. Thompson
MITRE Corporation, Bedford, Massachusetts, USA
University of Edinburgh, Scotland
Evaluation plays a crucial role in speech and natural language processing, both for system developers and for technology users. In this section we will introduce the terminology of evaluation for speech and natural language processing and provide a brief survey of areas where it has proved particularly useful, before passing on to more detailed case studies in the subsequent sections.
We can broadly distinguish three kinds of evaluation, appropriate to three different goals: adequacy evaluation, diagnostic evaluation, and performance evaluation.
When systems have a number of identifiable components associated with stages in the processing they perform, it is important to be clear as to whether we approach the system as a whole, or try to evaluate each component independently. When considering individual components, a further distinction between intrinsic and extrinsic evaluation must be respected: do we look at how a particular component works in its own terms (intrinsic), or at how it contributes to the overall performance of the system (extrinsic)? At the whole-system level, this distinction approximates the one between performance evaluation and adequacy evaluation: intrinsic is to extrinsic as performance evaluation is to adequacy evaluation.
A distinction is often drawn between so-called glass box and black box evaluation. Sometimes this appears to distinguish component-wise from whole-system evaluation, and sometimes it appears to mark a less clear-cut difference between a qualitative/descriptive approach (How does it do what it does?) and a quantitative/analytic approach (How well does it do what it does?).
As speech and natural language processing systems move out of the laboratory and into the market, it is becoming increasingly important to address the legitimate needs of potential users in determining whether any of the products on offer in a given application domain are adequate for their particular task, and if so, whether any of them are obviously more suited than the others. If we reflect on the way similar tasks are approached in other fields, we observe what we can call the Consumer Reports paradigm, which does not necessarily aim at actually identifying the best system, but rather at providing comparative information which allows the user to make an informed choice. Techniques from both diagnostic and performance evaluation may be called on to achieve this aim, but are unlikely to be sufficient in themselves---for example, assessing customisability may be of fundamental importance in determining adequacy to a particular user's needs, but is unlikely to be addressed by existing diagnostic or performance evaluation methodologies.
The term formative evaluation is used in the field of human-computer interaction to refer to a collection of evaluation methodologies closely related to both adequacy evaluation and diagnostic evaluation in our terms. The goal of formative evaluation is to provide diagnostic information about where a given system succeeds or needs improvement, relative to its intended users and use. The role of formative evaluation is to influence and guide system design, as opposed to performance evaluation or summative evaluation, which rate systems relative to each other, or relative to some gold standard such as human performance. During system development, user trials of system prototypes or alternative assessments of user interface functionality are conducted, in which more or less formal measurements of usability are recorded (e.g., via study and measurement of user actions performing some representative set of tasks, possibly coupled with interviews). We see considerable potential for importing some of these techniques into adequacy evaluation of speech and natural language processing applications.
In speech and natural language processing application areas where coverage is important, for example in machine translation or language understanding systems with explicit grammars, a common development methodology employs a large test suite of exemplary input, whose goal is to enumerate all the elementary linguistic phenomena in the input domain, and their most likely and/or important combinations. A large, mature test suite will be structured into a number of dimensions of elementary phenomena and contexts, and may include invalid as well as valid inputs, tagged as such. [NND93] describes a recent state-of-the-art example of this.
Test suites are particularly valuable to system developers and maintainers, allowing automated regression testing to ensure that system changes have the intended effect and no others. Raw profiles of system coverage vis-à-vis some test suite, however, are unlikely to be of use as such in either adequacy or performance evaluation. One reason is that such test suites may not reflect the distribution of linguistic phenomena in actual application domains. Another is that the value of good coverage at one point versus bad coverage at another is not in itself indicative of fitness to a user's purpose.
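The regression-testing use of a test suite can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular test suite tool: the item identifiers, the (id, text, is_valid) layout, and the function names run_suite and regressions are all hypothetical, and the system under test is reduced to a single accept/reject decision per input.

```python
def run_suite(accepts, suite):
    """Run a system over a tagged test suite.

    `accepts` is a hypothetical callable: it takes an input string and
    returns True if the system accepts (e.g. parses) it.
    `suite` is a list of (item_id, text, is_valid) triples, where is_valid
    is the tag saying whether the input should be accepted.
    Returns {item_id: passed}, where passed means the system's decision
    agrees with the tag.
    """
    return {item_id: accepts(text) == is_valid
            for item_id, text, is_valid in suite}


def regressions(baseline, current):
    """List the items that passed before a system change but fail after it."""
    return sorted(item_id for item_id, passed in baseline.items()
                  if passed and not current.get(item_id, False))


# Toy suite: two valid inputs and one invalid input, tagged as such.
suite = [
    ("agr-1", "the dog barks", True),
    ("agr-2", "the dog bark", False),
    ("agr-3", "the dogs bark", True),
]

# Two stand-in "systems": the old one rejects the invalid input,
# the new one over-generates and accepts everything.
old_system = lambda text: text != "the dog bark"
new_system = lambda text: True

print(regressions(run_suite(old_system, suite), run_suite(new_system, suite)))
```

Note that the sketch reports only which items changed status. As the paragraph above points out, the resulting pass rate says nothing by itself about how often each phenomenon occurs in a real application domain.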
There is a long tradition of quantitative performance evaluation in information retrieval, and many of its concepts have been usefully imported into the development of evaluation methodologies for speech and natural language processing. In particular, in considering any attempt at performance evaluation, we can usefully distinguish between three levels of specificity: criterion, measure, and method.
For example, in information retrieval itself, a classic criterion is precision, the extent to which the set of documents retrieved by a formal query satisfies the need which provoked the query. One measure for this is the percentage of documents retrieved which are in fact relevant. One method for computing this, which applies only if the extensions of some set of needs over some test collection are known in advance, is simply to average, over some number of test queries, the ratio achieved by the system under test.
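The measure and method just described can be sketched in a few lines. This is an illustrative sketch only: the function names and the representation of documents as string identifiers are assumptions, and the convention of scoring an empty retrieval as 0.0 is a choice made here, not something the text prescribes.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are in fact relevant.

    `retrieved` and `relevant` are collections of document identifiers.
    An empty retrieval is scored 0.0 (an assumed convention).
    """
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def mean_precision_over_queries(runs):
    """Average the precision ratio over a number of test queries.

    `runs` is a list of (retrieved, relevant) pairs, one per query, where
    `relevant` is the set of documents known in advance to satisfy the need.
    """
    if not runs:
        raise ValueError("at least one test query is required")
    return sum(precision(ret, rel) for ret, rel in runs) / len(runs)


# Two hypothetical test queries with known relevant sets.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # 2 of 4 retrieved are relevant
    (["d5"], {"d5", "d6"}),                    # 1 of 1 retrieved is relevant
]
print(mean_precision_over_queries(runs))
```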
For speech recognition, where the criterion is recognition accuracy, one measure is word error rate, and the method used in the current ARPA speech recognition evaluation involves comparing system transcription of the input speech to the truth (i.e., transcription by a human expert), using a mutually agreed upon dynamic programming algorithm to score agreement at the word level.
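Word-level scoring by dynamic programming can be illustrated with a standard minimum edit distance computation. The sketch below is not the scoring algorithm used in the ARPA evaluations: it assumes whitespace tokenization and unit costs for substitutions, deletions, and insertions, and the function name word_error_rate is introduced here for illustration. The agreed-upon scoring procedure may differ in its costs and text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate of a hypothesis transcription against a reference.

    Computes the minimum number of word substitutions, deletions, and
    insertions needed to turn the reference into the hypothesis
    (unit costs assumed), divided by the number of reference words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions align i reference words with nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the rate computed this way can exceed 1.0 when the hypothesis is much longer than the reference.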
It should be clear from this that the distinction between criterion, measure and method is not hard and fast, and that in any given case the three are interdependent---see [SJ94] for a more detailed discussion of these issues.
As the previous discussion illustrates, evaluation plays an important role for system developers (to tell if their system is improving), for system integrators (to determine which approaches should be used where) and for consumers (to identify which system will best meet a specific set of needs). Beyond this, evaluation plays a critical role in guiding and focusing research.
Periodic performance evaluations have been used successfully in the U.S. to focus attention on specific hard problems: robust information extraction from text, large vocabulary continuous speech recognition, spoken language interfaces, large scale information retrieval, machine translation. These common evaluations have motivated researchers both to compete in building advanced systems, and to share information to solve these hard problems. This paradigm has contributed to increased visibility for these areas, rapid technical progress, and increased communication among researchers working on these common evaluations as a result of the community of effort which arises from working on a common task using common data.
A major side-effect of performance evaluation has been to increase support for infrastructure. Performance evaluation itself requires significant investment to create annotated corpora and test sets, to create well-documented test procedures and programs, to implement and debug these procedures, and to distribute these to the appropriate parties.
Of course, the focus on performance evaluation comes at a price: periodic evaluations divert effort from research on the underlying technologies, the evaluations may emphasize some aspects of development at the expense of other aspects (e.g., increased accuracy at the expense of real-time interaction), and performance evaluation across systems can be misleading, depending on level of effort in developing the systems under comparison, use of innovative vs. proven technologies, and so on.
The common evaluations referred to above have all relied on performance evaluation, in part because some of them have received funding through ARPA, which focused on technology rather than on applications. Increasing emphasis on adequacy evaluation may become appropriate if, as seems likely on both sides of the Atlantic, users and their needs come more to the forefront of funding priorities. There is a difficulty in Europe, however, in that the basic performance evaluation technologies for languages other than English are developed only to a limited extent, with considerable variation across languages.
As noted above, evaluation has contributed some major successes to the development of speech and natural language processing technology; among these we can count:
As noted above, current evaluation technology also has some significant shortcomings and gaps:
It is clear from both the successes and the shortcomings that evaluation methodologies will continue to evolve and to improve. Evaluation has become so central to progress in the speech and natural language area that it should become a research area in its own right, so that we can correct the problems that have become increasingly evident, while continuing to reap the benefits that evaluation provides.