Junichi Kanai
University of Nevada, Las Vegas, Nevada, USA
The variables that affect the performance of an optical character recognition (OCR) system include variations in the clarity of printed documents as well as their layout style. These factors contribute to the number of performance metrics needed, to the need for large quantities of test data, and to the necessity of automating the evaluation task.
Traditionally, the evaluation of OCR algorithms and systems is based on the recognition of isolated characters. When a system classifies an individual character, its output is typically a character label or a reject marker that corresponds to an unrecognized character. By comparing output labels with the correct labels, the numbers of correct recognitions, substitution errors (misrecognized characters), and rejects (unrecognized characters) are determined.
The standard display of the results of classifying individual characters is the confusion matrix, such as the one in the accompanying figure.
The character accuracy is computed from the numbers of substitution errors and rejects, weighted by the costs associated with correcting a substitution error and a reject, respectively.
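The bookkeeping described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the text: accuracy is computed as the fraction of correctly labeled characters, the correction cost is a simple weighted sum of substitution errors and rejects, and the names `REJECT`, `score_isolated_characters`, `sub_cost`, and `reject_cost` are hypothetical.

```python
from collections import Counter

REJECT = None  # hypothetical reject marker for an unrecognized character


def score_isolated_characters(truth, output, sub_cost=1.0, reject_cost=1.0):
    """Compare output labels with correct labels, one label per character.

    sub_cost and reject_cost are assumed per-error correction costs.
    """
    if len(truth) != len(output):
        raise ValueError("truth and output must contain the same number of labels")

    confusion = Counter()  # (correct label, output label) -> count
    correct = substitutions = rejects = 0
    for true_label, out_label in zip(truth, output):
        confusion[(true_label, out_label)] += 1
        if out_label is REJECT:
            rejects += 1
        elif out_label == true_label:
            correct += 1
        else:
            substitutions += 1

    total = len(truth)
    return {
        "confusion": confusion,
        "correct": correct,
        "substitutions": substitutions,
        "rejects": rejects,
        # Assumption: accuracy as the fraction of correctly labeled characters.
        "accuracy": correct / total if total else 0.0,
        # Assumption: total correction cost as a weighted sum of error counts.
        "correction_cost": sub_cost * substitutions + reject_cost * rejects,
    }


if __name__ == "__main__":
    result = score_isolated_characters(
        list("hello"), ["h", "c", "l", REJECT, "o"], sub_cost=2.0, reject_cost=1.0
    )
    print(result["accuracy"], result["correction_cost"])
```

Keeping the confusion counts in a sparse mapping rather than a dense matrix is only a convenience here; large character sets, such as those of Chinese or Japanese, make most cells of a full matrix empty.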
Many OCR systems use morphological (n-gram) and lexical techniques to correct recognition errors. To evaluate the performance of such systems, word, sentence, or paragraph images are needed. Since linguistic characteristics, such as n-gram statistics and word frequency, depend on document class (or domain), standard lexicons or corpora for training and testing extracted from a variety of document classes are needed. As OCR systems employ other natural language processing techniques to improve accuracy, appropriate training and test databases must be developed.
OCR and document analysis systems recognize not only text but also other features of documents; for example, they extract articles from a page and recognize the logical structure of an article. New metrics and appropriate resources, such as document-based test data, must be made available.
Since the notion of accuracy depends upon the specific application involved, application-specific metrics are also important. Such metrics can also help end users determine the feasibility of OCR in their tasks. Consider text retrieval applications. Users of text retrieval systems are interested in words and their correct reading order and almost never in individual characters. Thus, word accuracy is a more appropriate metric. Moreover, for these applications, discriminating between stopwords and non-stopwords is important. Stopwords are common words, such as "the," "but," and "which," that are normally not indexed because they have essentially no retrieval value. Therefore, correct recognition of words that are not stopwords is an even more important metric for these applications [RKN93].
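As a rough illustration of these two metrics, the following Python sketch aligns the OCR word sequence with the correct word sequence and reports the fraction of correct words that were recovered, overall and for non-stopwords only. It is a sketch under stated assumptions, not the procedure of [RKN93]: the alignment uses the standard library's `difflib.SequenceMatcher`, words are split on whitespace and lowercased, and the stopword list and the function name `word_accuracies` are hypothetical.

```python
from difflib import SequenceMatcher

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = frozenset({"the", "but", "and", "which"})


def word_accuracies(truth_text, ocr_text, stopwords=STOPWORDS):
    """Return (word accuracy, non-stopword accuracy) for one text."""
    truth = truth_text.lower().split()
    ocr = ocr_text.lower().split()

    # Align the two word sequences so that reading order matters.
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    matched = [False] * len(truth)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            matched[i] = True

    # Indices of the correct words that are not stopwords.
    content = [i for i, word in enumerate(truth) if word not in stopwords]

    word_acc = sum(matched) / len(truth) if truth else 0.0
    non_stop_acc = (
        sum(matched[i] for i in content) / len(content) if content else 0.0
    )
    return word_acc, non_stop_acc


if __name__ == "__main__":
    print(word_accuracies("the cat sat on the mat", "the cat sat on tbe rnat"))
```

In this example a misread stopword ("tbe") lowers word accuracy but not non-stopword accuracy, while a misread content word ("rnat") lowers both, which is the distinction that matters for retrieval.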
Machine translation, document filtering, and other applications require different measures of accuracy. Many new application-specific metrics are needed to objectively assess progress made in OCR research. Examples of existing and needed metrics are described in [RKN94,KRNN93].
Since a variety of factors affect the performance of OCR systems, a large amount of input test data must be used in the evaluation process. Consider testing recognition of text printed in a variety of fonts. Over 3,000 combinations of typefaces and type styles are available for laser printers. If ten type sizes are used, over 30,000 test samples are required just to examine one instance of output for each input. Thus, automating both the measurement tasks and the analysis of data is essential. Aside from eliminating human error, automated experiments have the following benefits:
There are different ways to prepare test data. Example sets of real-world document images with the associated truth representation are an ideal form of input test data. The truth representation and attributes of the input images must be manually prepared. Our experience shows that it takes an average of 2 man-hours to prepare basic page-based data from a page, including a truth representation that is almost 100% accurate. Therefore, such data are extremely expensive.
It is also possible to generate simulated data. It is customary to perturb ideal images or sample hand-written characters by adding noise. Examples of distortion models are given in [Ish83,Bai92,KHP93]. This approach eliminates expensive truth preparation and allows researchers to control individual noise variables.
In spite of the appeal of generating large test databases this way, their value in predicting the behavior of OCR systems under field conditions has not been established. The evaluation and comparison of real-world distortion (example sets) and simulated distortion are important new research tasks. Validation methods have been proposed in [Nag94,LLT94].
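To make the idea of perturbing ideal images concrete, here is a minimal Python sketch. It is deliberately much simpler than the distortion models cited above and is not drawn from any of them: it assumes a binary image stored as a list of rows of 0/1 pixels and flips each pixel independently with a fixed probability. The names `add_pixel_noise`, `make_test_set`, and `flip_prob` are hypothetical.

```python
import random


def add_pixel_noise(image, flip_prob, seed=None):
    """Return a copy of a binary image with each pixel flipped with
    probability flip_prob (an assumed, very simple noise model)."""
    rng = random.Random(seed)
    return [
        [1 - pixel if rng.random() < flip_prob else pixel for pixel in row]
        for row in image
    ]


def make_test_set(ideal_image, label, flip_probs, seed=0):
    """Generate one degraded sample per noise level.

    The truth label is known by construction, so no manual truth
    preparation is needed, and the noise level is a controlled variable.
    """
    return [
        {
            "label": label,
            "flip_prob": p,
            "image": add_pixel_noise(ideal_image, p, seed=seed + k),
        }
        for k, p in enumerate(flip_probs)
    ]


if __name__ == "__main__":
    ideal = [
        [0, 1, 1, 0],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [1, 0, 0, 1],
    ]
    for sample in make_test_set(ideal, "A", [0.0, 0.05, 0.2]):
        print(sample["flip_prob"], sample["image"])
```

Fixing the seed makes each degraded sample reproducible, which matters when the same simulated data are used to compare systems.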
Currently, most of the available databases are character-based. The ETL Character Database (see section 12.6 for contact addresses) mainly contains hand-printed segmented Japanese characters. The U.S. Postal Service released a database containing hand-written characters extracted from envelope address blocks. The National Institute of Standards and Technology (NIST) distributes a large number of hand-written segmented characters and hand-printed segmented characters. The University of Washington has released a database (UW-I) that contains 1,147 page images from scientific and technical journals with the corresponding truth representation. It also includes image degradation models and performance evaluation tools. The UW-II data set contains 43 complete articles in English and other data.
To objectively measure progress in character recognition technology and to identify research problems, two kinds of evaluation are needed: internal evaluation and independent evaluation. In internal evaluation, researchers' own test data sets or standard (public) test databases are used to measure and compare their progress. The creation and distribution of a variety of standard test databases is an important task in the OCR research community.
Since character recognition systems can be customized or trained to accurately recognize a given set of data, independent evaluation is also required for objective final assessment. In independent evaluation, test databases are hidden from the development process.
In 1991, the Chinese government evaluated Chinese OCR systems developed under the State Plan 863 [CCW91]. Tests were strictly conducted using standardized data sets. The best machine-printed character recognition rates with and without context were 97.84% and 97.80%, respectively. The best hand-written character recognition rate without adapting to a particular user was 80%.
In 1992, the U.S. Census Bureau and NIST determined the state of the art in recognition of hand-written segmented characters [WGJ+92]. Twenty-six organizations from North America and Europe participated in this test program. About half of the systems correctly recognized over 95% of the digits, over 90% of the upper-case letters, and over 80% of the lower-case letters in the tests.
In 1992, the Institute for Posts and Telecommunications Policy in Japan evaluated OCR technology for recognizing postal codes [MNY+93]. Hand-written segmented character images were used to test the systems. Five universities and eight OCR vendors submitted their systems. The highest recognition rate was 96.22%, with a substitution error rate of 0.37%.
Since 1992, the Information Science Research Institute at the University of Nevada, Las Vegas, has been conducting evaluations of OCR technology for recognition of machine-printed documents. In the 1994 study, six pre-release systems developed by commercial OCR vendors were tested using two sets of page images [RKN94]. These systems correctly recognized over 99% of the characters in good quality pages. However, accuracy was significantly lower on poor quality pages. The study also included other metrics, such as word accuracy, non-stopword accuracy, and automatic page segmentation.
In this rapidly evolving information age, automated data entry systems are essential. To expedite progress in this field, large quantities of both test and training data are needed. This need is likely to persist until the resources required to provide such data are made available.