Document analysis, or more precisely document image analysis, is the process that performs the overall interpretation of document images. This process is the answer to the question, "How is everything that is known about language, document formatting, image processing and character recognition combined in order to deal with a particular application?" Thus document analysis is concerned with the global issues involved in recognition of written language in images. It adds to OCR a superstructure that establishes the organization of the document and applies outside knowledge in interpreting it.
The process of determining document structure may be viewed as guided by a model, explicit or implicit, of the class of documents of interest. The model describes the physical appearance of the entities that make up the document and the relationships between them. OCR is often at the final level of this process, i.e., it provides a final encoding of the symbols contained in a logical entity such as a paragraph or table, once the latter has been isolated by other stages. However, it is important to realize that OCR can also participate in determining document layout. For example, as part of the process of extracting a newspaper article, the system may have to recognize the character string "continued on page 5" at the bottom of a page image in order to locate the entire text.
In practice, then, a document analysis system performs the basic tasks of image segmentation, layout understanding, symbol recognition and application of contextual rules in an integrated manner [WCW82,NSS85]. Current work in this area can be summarized under four main classes of applications.
The ultimate goal for text systems can be termed inverse formatting or completion of the Gutenberg loop,
meaning that a scanned printed document is translated back into a
document description language from which it could be accurately
reprinted if desired. At the research level this has been pursued in
domains such as technical papers, business letters
and chemical structure diagrams [TA92,S+92,NSS85].
Some commercial OCR systems provide limited inverse
formatting, producing codes for elementary structures such as
paragraphs, columns, and tables
[Bok92].
Current OCRs will detect, but not encode, halftones
and line drawings.
In certain applications less than total interpretation of the document is required. A system for indexing and retrieving text documents may perform only a partial recognition. For example, a commercially available retrieval system for technical articles contains a model of various journal styles, assisting it to locate and recognize the title, author, and abstract of each article, and to extract keywords. Users conduct searches using the encoded material, but retrieve the scanned image of desired articles for reading.
Forms are the printed counterparts of relations in a database. A
typical form consists of an n-tuple of data items each of which can be
represented as an ordered pair (item name, item value). OCR is
used to recognize the item value; more general document analysis
operations may be needed in order to identify the item name
[C+92].
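The relational view of a form described above can be sketched as a small data structure. This is only an illustrative sketch: the names `FormItem`, `Form`, `form_to_record` and the sample field names are hypothetical and do not come from any particular form-reading system.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FormItem:
    """One data item on a form, as an ordered pair (item name, item value)."""
    name: str   # assumed to be established by document analysis
    value: str  # assumed to be the OCR result for the filled-in field


# A form is modeled as an ordered n-tuple of items.
Form = Tuple[FormItem, ...]


def form_to_record(form: Form) -> Dict[str, str]:
    """Turn a form into a database-style record keyed by item name."""
    record: Dict[str, str] = {}
    for item in form:
        if item.name in record:
            # Two items with the same name cannot map to one record field.
            raise ValueError(f"duplicate item name: {item.name}")
        record[item.name] = item.value
    return record


# Hypothetical example form with three items.
example_form: Form = (
    FormItem("surname", "Smith"),
    FormItem("postal_code", "14260"),
    FormItem("amount", "125.00"),
)
```

Calling `form_to_record(example_form)` yields one record per form, which is the sense in which a stack of filled-in forms corresponds to the rows of a relation.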
The capability for locating items on a form, establishing their name class, and encoding the accompanying data values has many applications in business and government. Form documents within a single enterprise and single application are highly repetitive in structure from one example to the next. In such a case the model for the document can consist largely of physical parameters whose values are estimated from sample documents. Such systems for gathering form data are commercially available. The Internal Revenue Service of the U.S. has recently granted a large contract to automate processing of scanned income tax forms. This will require extraction of data from a large variety of forms, as well as adaptation to perturbations of a single form resulting from different printing systems.
Postal address and check reading applications are characterized by a well-defined logical format (but a highly variable physical layout), and a high degree of contextual constraint on the symbolic data [Sri92]. The latter is potentially very useful in the attainment of high accuracy. Contextual rules can modify OCR results to force agreement of city names and postal codes, for example, or to reconcile numeric dollar amounts on checks with the written entry in the legal amount field. Contextual constraints can also assist in the detection of misrecognized documents, so that these can be handled by manual or other processes. While mailpieces and checks are actually a subclass of form documents, the large amount of effort invested in these problems justifies listing them separately.
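As a minimal sketch of such a contextual rule, the following reconciles the numeric amount on a check with the legal (written) amount. It assumes that each field's recognizer returns several candidate readings with confidences in [0, 1]; the function name `reconcile_amounts`, the candidate format, and the product scoring are assumptions made for illustration, not a description of deployed equipment.

```python
from typing import List, Optional, Tuple

# (amount in cents, recognizer confidence in [0, 1]); a hypothetical format.
Candidate = Tuple[int, float]


def reconcile_amounts(numeric: List[Candidate],
                      legal: List[Candidate]) -> Optional[int]:
    """Return the amount both fields agree on with the highest joint
    confidence, or None if no pair of candidates agrees (reject)."""
    best_amount: Optional[int] = None
    best_score = 0.0
    for n_amount, n_conf in numeric:
        for l_amount, l_conf in legal:
            if n_amount != l_amount:
                continue  # the contextual rule: both fields must agree
            score = n_conf * l_conf
            if score > best_score:
                best_amount, best_score = n_amount, score
    return best_amount


# The top numeric reading (175.00) is overruled because only 125.00
# is supported by both fields.
numeric_field = [(17500, 0.55), (12500, 0.40)]
legal_field = [(12500, 0.80), (72500, 0.15)]
agreed = reconcile_amounts(numeric_field, legal_field)
```

A return value of `None` corresponds to the rejection case mentioned above, where the document is routed to manual or other processing.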
Current equipment in use for these applications makes limited use of contextual information, and is limited to reading postal codes in the case of handwritten addresses, or numeric amounts for checks. Postal machines now in development will read the complete address field and obtain greater accuracy by applying contextual constraints. At the same time they will provide a finer granularity in the sorting of mail. In the U.S., for example, new machines are planned to arrange mailpieces into delivery order for the route of individual postmen.
Much of the activity in line drawing interpretation centers on entry of engineering drawings to Computer-Aided Design / Computer-Aided Manufacturing (CAD/CAM) systems [KSO90,VT92]. A project for input of integrated circuit diagrams has reported cost-efficient conversion of drawings compared with conventional manual input. This project offers evidence that new circuits can most efficiently be created on paper and then encoded by recognition processes. The claim is that this is better than direct input at a terminal, due to the small screen sizes on present-day equipment. A commercial version of such a system is available. Other research in progress aims at obtaining 3-D models from multiple views in drawings of manufactured parts. Research progress has also been reported in conversion of land-use maps.
One source of motivation for work in document analysis has been the great increase in image systems for business and government. These systems provide fast storage, recall and distribution of documents in workflow processing and other applications. Document analysis can help with the indexing for storage and recall, and can partition the image into subregions of interest for convenient access by users.
In the near future such capabilities will be extended to the creation
of electronic libraries which will likewise benefit from
automatic indexing and formatting services. In the longer
range, efforts will increase to interpret more of the information
represented in the stored images, in order to provide more flexible
retrieval and manipulation facilities
[D+92].
How will document analysis capabilities have to improve to meet future needs? There is a strong need to incorporate context, particularly language context, into the models that govern document analysis systems. Over 35 years of research and development have still not been able to produce OCR based on shape that has the accuracy of human vision. Contextual knowledge must be invoked in order both to minimize errors and to reject documents that cannot be interpreted automatically. An important research issue here is how to define such constraints in a generic way, such that they can easily be redefined for different applications. Beyond this, how are such rules to be converted to software that integrates with recognition processes in order to optimize performance?
Linguistic analysis may not simply be a postprocessing stage in future document analysis systems. Modern recognition processes often perform trial segmentation of character images and choose the best segmentation from a set of alternatives using recognition confidence as a guide. Such an operation might be performed most reliably if it were implemented as a sequential process, with contextual rules governing the choice of the sequence.
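One way to picture this integration is a left-to-right search over trial segmentations in which a lexicon prunes partial readings as they are built. The sketch below is only an illustration under stated assumptions: the lattice format (cut points along the word image, with a character hypothesis and confidence per segment), the function name `best_reading`, and the use of a plain word list as the contextual rule are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical lattice: start cut point -> list of
# (end cut point, hypothesized character, recognizer confidence).
Lattice = Dict[int, List[Tuple[int, str, float]]]


def best_reading(lattice: Lattice, end: int,
                 lexicon: Set[str]) -> Optional[Tuple[str, float]]:
    """Search segmentations from cut point 0 to `end`, keeping only partial
    readings that are prefixes of a lexicon word. Returns the lexicon word
    with the highest product of confidences, or None if none fits."""
    prefixes = {w[:i] for w in lexicon for i in range(len(w) + 1)}
    best: Optional[Tuple[str, float]] = None
    stack = [(0, "", 1.0)]  # (current cut point, text so far, score so far)
    while stack:
        pos, text, score = stack.pop()
        if pos == end:
            if text in lexicon and (best is None or score > best[1]):
                best = (text, score)
            continue
        for nxt, char, conf in lattice.get(pos, []):
            candidate = text + char
            if candidate in prefixes:  # contextual rule applied during search
                stack.append((nxt, candidate, score * conf))
    return best


# The span between cut points 2 and 4 can be read as one wide "m"
# or as "r" followed by "n".
word_lattice: Lattice = {
    0: [(1, "c", 0.9)],
    1: [(2, "o", 0.9)],
    2: [(4, "m", 0.8), (3, "r", 0.6)],
    3: [(4, "n", 0.7)],
}
reading = best_reading(word_lattice, 4, {"corn", "core"})
```

On shape confidence alone the single-segment "m" would win, but with this lexicon the "m" branch is abandoned as soon as it is proposed, so the two-segment reading is chosen. Here the context acts during segmentation rather than as a postprocessing step.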
In order to facilitate future progress in document analysis, there is a need for a number of scanned document databases, each representative of a different class of documents: text, engineering drawings, addresses, forms, handwritten manuscripts, etc. Currently such collections are limited to text-oriented documents. With access to common research material, different researchers will be able to compare results and gain greater benefit from each other's efforts.