Nowadays, there is much motivation to provide computerized document analysis systems. Giant steps have been made in the last decade, both in technological support and in software products. Character recognition (OCR) contributes to this progress by providing techniques to convert large volumes of data automatically. Many papers and patents advertise recognition rates as high as 99.99%, which gives the impression that the automation problems have been solved. However, the failure of some real applications shows that performance problems persist on composite and degraded documents (e.g., noisy characters, tilt, mixing of fonts) and that there is still room for progress. Various methods have been proposed to increase the accuracy of optical character recognizers. In fact, at various research laboratories, the challenge is to develop robust methods that remove the typographical and noise restrictions as far as possible while maintaining rates similar to those provided by limited-font commercial machines.
There is a parallel between the various stages of evolution of OCR systems and those of pattern recognition. To overcome the recognition deficiency, the classical approach focusing on isolated characters has been replaced with more contextual techniques. The opening of the OCR domain to document recognition has led to the combination of many strategies, such as document layout handling, dictionary checking, font identification, word recognition, and the integration of several recognition approaches with consensual voting.
The rest of this section is devoted to a summary of the state of the art in the domain of printed OCR (similar to the presentations in [IOO91,GS90,Nad84,Man86]), by focussing attention essentially on the new orientations of OCR in the document recognition area.
Characters are arranged in document lines following typesetting conventions, which we can use to locate characters and find their style. Typesetting rules can help in distinguishing such characters as s from 5, h from n, and g from 9, which are often confused in a multifont context [KPB87]. They can also limit the search area according to characters' relative positions and heights with respect to the baseline [LG91a,LG91b,Kan90]. The role of typesetting cues in aiding document understanding is discussed by [HIT91].
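As a rough illustration of how position relative to the baseline can separate such pairs, the sketch below classifies a character's bounding box into a vertical zone and uses the zone to choose between two confusable labels. The function names, the tolerance value, and the zone table are assumptions made for this example; it also assumes lining (non-descending) digits and a known baseline and x-height, and it is not the method of any of the cited systems.

```python
def vertical_class(top, bottom, baseline, x_height, tolerance=2):
    """Classify a bounding box by its extent relative to the baseline.

    Coordinates are image rows, so y grows downward.
    """
    descends = bottom > baseline + tolerance
    ascends = top < baseline - x_height - tolerance
    if ascends and descends:
        return "full"
    if ascends:
        return "ascender"
    if descends:
        return "descender"
    return "x-height"


# Hypothetical table: which label each zone favours for a confusable pair.
# Assumes lining digits, which are as tall as ascenders and do not descend.
CONFUSABLE = {
    ("s", "5"): {"x-height": "s", "ascender": "5"},
    ("n", "h"): {"x-height": "n", "ascender": "h"},
    ("g", "9"): {"descender": "g", "ascender": "9"},
}


def disambiguate(pair, zone):
    """Return the label favoured by the zone, or None if the zone is no help."""
    return CONFUSABLE[pair].get(zone)


if __name__ == "__main__":
    baseline, x_height = 30, 10
    zone = vertical_class(top=14, bottom=30, baseline=baseline, x_height=x_height)
    print(zone, disambiguate(("s", "5"), zone))
```

A real system would estimate the baseline and x-height from the whole text line before applying such a rule, and would treat the result as one piece of evidence rather than a final decision.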
Location of characters in a document is always preceded by a layout analysis of the document image. The layout analysis involves several operations such as determining the skew, separating pictures from text, and partitioning the text into columns, lines, words, and connected components. The partitioning of text is effected through a process known as segmentation. A survey of segmentation techniques is given in [Nad84].
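One common way to partition text into lines is a horizontal projection profile: count the ink pixels in each row and cut wherever the count falls to zero. The sketch below is a minimal version of that idea, assuming a binary image stored as lists of 0/1 values that has already been deskewed and contains a single column. The function names and the `min_ink` parameter are illustrative only, and this is not presented as the technique of any cited work.

```python
def horizontal_profile(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image, min_ink=1):
    """Return (top, bottom) row ranges of text lines, bottom exclusive."""
    lines = []
    start = None
    for y, ink in enumerate(horizontal_profile(image)):
        if ink >= min_ink and start is None:
            start = y                    # first inked row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))     # blank row closes the line
            start = None
    if start is not None:                # line runs to the bottom edge
        lines.append((start, len(image)))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 0, 1, 1],
    ]
    print(segment_lines(page))
```

The same profile computed over columns instead of rows gives a first cut at column and word boundaries, although skewed or noisy pages need the skew correction mentioned above before the gaps become visible.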
In building character images, one is often confronted with touching or broken characters that occur in degraded documents (such as faxes and photocopies). It is still challenging to develop techniques for properly segmenting words into their characters.
[KPB87] detected touching characters by evaluating the vertical pixel projection. They executed a branch-and-bound search of alternative splittings and merges of symbols, pruned by word-confidence scores derived from symbol confidence. [TA91] used a decision tree for resolving ambiguities. [CN82] proposed a recursive segmentation algorithm. [LAS93] added contextual information and a spelling checker to this algorithm to correct errors caused by incorrect segmentation. [Bay87] proposed a hypothesis approach for merging and splitting characters. The hypotheses are tested by several experts to see whether they represent a valid character. The search is controlled by the A* algorithm, which handles the backtracking. The experts comprise the character classifier and a set of algorithms for context processing.
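The vertical pixel projection mentioned above can be sketched as follows: count the ink pixels in each column of a connected blob and treat thin valleys in that profile as candidate cut points between touching characters. This is only the projection step, under the assumption of a binary image stored as lists of 0/1 values. The `max_ink` threshold and function names are illustrative, and the search over alternative splits and merges used in [KPB87] is not reproduced here.

```python
def vertical_profile(image):
    """Count ink pixels (value 1) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def candidate_cuts(image, max_ink=1):
    """Return interior columns that are thin local minima of the profile."""
    profile = vertical_profile(image)
    cuts = []
    for x in range(1, len(profile) - 1):
        is_valley = profile[x] <= profile[x - 1] and profile[x] < profile[x + 1]
        if is_valley and profile[x] <= max_ink:
            cuts.append(x)
    return cuts


if __name__ == "__main__":
    # Two box-like glyphs joined by a one-pixel bridge in column 3.
    blob = [
        [1, 1, 1, 0, 1, 1, 1],
        [1, 0, 1, 0, 1, 0, 1],
        [1, 1, 1, 1, 1, 1, 1],
        [1, 0, 1, 0, 1, 0, 1],
        [1, 1, 1, 0, 1, 1, 1],
    ]
    print(vertical_profile(blob))
    print(candidate_cuts(blob))
```

Each candidate cut is only a hypothesis; in the approaches surveyed above, the pieces on either side would be passed to the classifier and the split kept or rejected according to the resulting confidence.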
A document reader must cope with many sources of variation, notably the font and size of the text. In commercial devices, the multifont aspect was for a long time neglected for the benefit of speed and accuracy, and substitution solutions were proposed. At first, to cater for some institutions, the solution was to work on customized fonts (such as OCR-A and OCR-B) or on a selected font from a trained library to minimize the confusion between similar-looking characters. The accuracy was quite good, even on degraded images, on the condition that the font was carefully selected. However, recognition scores drop rapidly when fonts or sizes are changed. This is because the limitation to one font naturally promotes the use of simple and sensitive pattern recognition algorithms, such as template matching [DH73].
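To make the sensitivity of template matching concrete, here is a minimal sketch that scores a glyph against stored templates by the fraction of agreeing pixels and picks the best match. The 3x3 templates, the scoring rule, and the function names are toy assumptions for illustration, not a description of any commercial device.

```python
def match_score(glyph, template):
    """Fraction of pixels on which two equally sized binary images agree."""
    total = 0
    agree = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            agree += int(g == t)
    return agree / total


def classify(glyph, templates):
    """Return the (label, score) pair of the best-matching template."""
    scored = ((label, match_score(glyph, t)) for label, t in templates.items())
    return max(scored, key=lambda pair: pair[1])


# Toy single-font template library (hypothetical 3x3 glyphs).
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

if __name__ == "__main__":
    noisy_t = [[1, 1, 1], [0, 1, 0], [0, 1, 1]]  # a "T" with one flipped pixel
    print(classify(noisy_t, TEMPLATES))
```

Because the score is computed pixel by pixel, a change of font, size, or stroke width shifts many pixels at once and lowers the score against every template, which is consistent with the rapid drop in recognition described above.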
In parallel with commercial investigations, the literature proposed multifont recognition systems that are based on typographical features. Font information is inherent in the constituent characters [Rub88], and feature-based methods are less font sensitive [Sri84,Ull73,KPB87]. Two research paths were taken with multifont machines. One is geared towards the office environment. This introduced systems which can be trained by the user to read any given font [Sch78,Shl88,BA91,AB91a,AB91b]. Such a system is only able to recognize a font from among those it has learned. The other path tries to be font independent. The training is based on pattern differentiation rather than on font differentiation [LB87,BKP86,BF91].
This step is crucial in the context of document analysis where several variations may be caused by a number of different sources: geometric transformation because of low data quality, slant and stroke width variation because of font changing, etc. It seems reasonable to look for features which are invariant and which capture the characteristics of the character by filtering out all attributes which make the same character assume different appearances. The classifier could store a single prototype per character.
[SBB+92] applies normalizing transformations to reduce certain well-defined variations as far as possible. The inevitably remaining variations are left for learning by statistical adaptation of the classifier.
The keys to printed character learning are essentially the training set and the adaptation of the classification to new characters and new fonts. The training set can either be given by the user or extracted directly from document samples. In the first case, the user selects the fonts and the samples to represent each character in each font and then guides the system to create models, as in [AB91b]. Here, the user must use a sufficient number of samples in each font according to the difficulty of its recognition. However, it is difficult in an omnifont context to collect a training set of characters having the expected distribution of noise and pitch size. [Bai90] suggested parameterized models for imaging defects, based on a variety of theoretical arguments and empirical evidence. In the second case, the idea is to generate the training set directly from document images chosen from a wide variety of fonts and image qualities, so as to reflect the variability expected by the system [Bok92]. The problem here is that one cannot be sure that all valid characters are present.
Contextual processing attempts to overcome the shortcomings of decisions made on the basis of local properties and to extend the perception to relationships between characters within a word. Most of the techniques try to combine geometric information with linguistic information. See [SH85] for an overview of these techniques. [AB91a,AB91b,BA91] used hidden Markov models for character and word modeling. Characters are merged into groups which are matched against words in a dictionary using the Ratcliff/Obershelp pattern matching method. When no acceptable word is found, the list of confused characters is passed through a Viterbi net and the output is taken as the most likely word. The bigram and character position-dependent probabilities used for this purpose were constructed from a French dictionary of some 190,000 words. The word-level recognition rate stands at over 98%.
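The two-stage idea described above (dictionary matching first, Viterbi decoding as a fallback) can be sketched as follows. This is an illustration under stated assumptions rather than the cited authors' implementation: Python's `difflib`, whose documentation describes its matcher as similar to the Ratcliff/Obershelp approach, stands in for the dictionary matching step. The lexicon, the `cutoff` value, the recognizer scores, and the bigram probabilities are invented toy values, and the position-dependent probabilities mentioned above are omitted.

```python
import difflib
import math


def dictionary_lookup(word, lexicon, cutoff=0.75):
    """Return the closest lexicon word by difflib similarity, or None."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def viterbi(candidates, bigram, floor=1e-6):
    """Most likely string given per-position character hypotheses.

    candidates: one dict per position, {character: recognizer probability}.
    bigram: {(previous, current): probability}; unseen pairs get `floor`.
    """
    # best[c] = (log probability, string) of the best path ending in c
    best = {c: (math.log(p), c) for c, p in candidates[0].items()}
    for position in candidates[1:]:
        new_best = {}
        for c, p in position.items():
            score, path = max(
                (prev_score + math.log(bigram.get((prev, c), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[c] = (score + math.log(p), path + c)
        best = new_best
    return max(best.values())[1]


if __name__ == "__main__":
    lexicon = ["chat", "chien", "maison"]       # toy lexicon
    print(dictionary_lookup("chlen", lexicon))  # one misread character
    print(dictionary_lookup("xqzw", lexicon))   # nothing acceptable

    # Toy hypotheses: the recognizer slightly prefers "n" over "h".
    candidates = [
        {"t": 0.9, "l": 0.1},
        {"n": 0.55, "h": 0.45},
        {"e": 0.5, "c": 0.5},
    ]
    bigram = {("t", "h"): 0.3, ("t", "n"): 0.01,
              ("h", "e"): 0.3, ("n", "e"): 0.1}
    print(viterbi(candidates, bigram))
```

In the toy run, the bigram statistics outweigh the recognizer's slight preference for the wrong character in the second position, which is the kind of correction that linguistic context is meant to supply.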
Commercial OCR machines appeared at the beginning of the 1950s and have evolved in parallel with research investigations. The first series of products relied heavily on customized fonts, good printing quality, and very restricted document layout. Nowadays, we can find a vast range of products that are more powerful than the previous ones. Among these are hand-held scanners, page readers, and integrated flat-bed and document readers. The tendency is to use the fax machine as an image sensor. Instead of printing the fax message on paper, it is taken directly as input to an OCR system. It should be noted that the images obtained in this way are of poor quality. The challenge in this area is the development of high-performance tools that treat degraded text with results as good as those of classical OCR systems.
OCR is used in three main domains: the banking environment for data entry and checking, office automation for text entry, and the post office for mail sorting. Many surveys of commercial products can be found in [MSY92,Man86,Bok92,Nag92]. Recently, the Information Science Research Institute was charged with testing technologies for OCR of machine-printed documents. A complete review has been published [NRK94], giving a benchmark of different products in use in the U.S. market.
We have attempted to show that OCR is an essential part of the document analysis domain. Character recognition cannot be achieved without typesetting cues to help the segmentation in a multifont environment. We have also shown the unavoidable recourse to linguistic context; the analysis must be extended to this domain. Training remains the weak side of OCR for now, as it is difficult to generate a training set of characters which includes all the variability the system will be expected to handle. Finally, it appears more and more that in real-world OCR many different techniques must be combined to yield high recognition scores [AB91b,Ho92]. For this reason, the tendency is to combine the results of many OCR systems in order to obtain the best possible performance.
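The simplest form of such a combination is a character-wise plurality vote over the outputs of several recognizers. The sketch below assumes the outputs have already been aligned to the same length, which is itself a nontrivial step in practice. The function name and sample strings are illustrative, and the combination schemes in the cited work are more elaborate than this.

```python
from collections import Counter


def vote(outputs):
    """Character-wise plurality vote over equal-length OCR outputs."""
    if len({len(text) for text in outputs}) != 1:
        raise ValueError("outputs must be aligned to the same length")
    result = []
    for chars in zip(*outputs):
        # most_common(1) returns the most frequent character at this
        # position; ties go to the character seen first.
        result.append(Counter(chars).most_common(1)[0][0])
    return "".join(result)


if __name__ == "__main__":
    readings = ["5cience", "science", "sc1ence"]  # three hypothetical systems
    print(vote(readings))
```

Voting helps most when the systems make different kinds of errors; recognizers that share the same weakness will simply outvote the correct reading.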