
12.2 Written Language Corpora

Eva Ejerhed & Ken Church
University of Umeå, Sweden
AT&T Bell Labs, Murray Hill, New Jersey, USA

12.2.1 Review of the State of the Art in Written Language Corpora

Written language corpora, collections of text in electronic form, are being collected for research and commercial applications in natural language processing (NLP). Written language corpora have been used to improve spelling correctors, hyphenation routines and grammar checkers, which are being integrated into commercial word-processing packages. Lexicographers have used corpora to study word use and to associate uses with meanings. Statistical methods have been used to find interesting associations among words (collocations). Language teachers are now using on-line corpora in the classroom to help learners distinguish central and typical uses of words from mannered, poetic, and erroneous uses. Terminologists are using corpora to build glossaries to ensure consistent and correct translations of difficult terms such as dialog box, which is translated as finestra 'window' in Italian and as boîte 'box' in French. Eurolang is currently integrating glossary tools, translation memories of recurrent expressions, and more traditional machine translation systems into Microsoft's Word-for-Windows and other popular word-processing applications. The general belief is that there is a significant commercial market for multilingual text processing software, especially in a multilingual setting such as the European Community. Researchers in Information Retrieval and Computational Linguistics are using corpora to evaluate the performance of their systems. Numerous examples can be found in the proceedings of recent conferences like the Third Message Understanding Conference [DAR91], and the Speech and Natural Language Workshops sponsored by the Defense Advanced Research Projects Agency (DARPA) [DAR92,ARP93,ARP94].

Written language corpora provide a spectrum of resources for language processing, ranging from the raw material of the corpora themselves to finished components like computational grammars and lexicons. Between these two extremes are intermediate resources: annotated corpora (also called tagged corpora, in which words are tagged with part-of-speech tags and other information), tree banks (in which sentences are analyzed syntactically), part-of-speech taggers, partial parsers of various kinds, and lexical materials such as specialized word lists and listings of the constructional properties of verbs.

The corpus-based approach has produced significant improvements in part-of-speech tagging. [FK82] enabled research in the U.S. by tagging the Brown Corpus and making it available to the research community. Similar efforts were underway within the International Computer Archive of Modern English (ICAME) community in the UK and Scandinavia around the same time. A number of researchers developed and tested the statistical n-gram methods that ultimately became the method of choice. These methods used corpora to train parameters and evaluate performance. The results were replicated in a number of different laboratories. Advocates of alternative methods were challenged to match the improvements in performance that had been achieved by n-gram methods. Many did, often by using corpus-based empirical approaches to develop and test their solutions, if not to train the parameters explicitly. More and more data collection efforts were initiated as the community began to appreciate the value of the tagged Brown Corpus.
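The general idea of training n-gram parameters from a tagged corpus can be illustrated with a minimal sketch. It is not the method of any particular laboratory mentioned above: it assumes a bigram tag model decoded with the Viterbi algorithm, add-one smoothing, a three-sentence toy corpus invented for illustration, and a hypothetical four-tag set (N, V, P, D). The names TRAIN, train and viterbi are likewise illustrative.

```python
import math
from collections import Counter

# Tiny hand-tagged corpus, invented for illustration (not the Brown Corpus).
# Hypothetical tag set: N noun, V verb, P preposition, D determiner.
TRAIN = [
    [("time", "N"), ("flies", "V"), ("like", "P"), ("an", "D"), ("arrow", "N")],
    [("fruit", "N"), ("flies", "N"), ("like", "V"), ("a", "D"), ("banana", "N")],
    [("the", "D"), ("arrow", "N"), ("flies", "V")],
]

def train(sentences):
    """Collect tag-bigram counts and word-given-tag counts."""
    trans, emit, tag_total = Counter(), Counter(), Counter()
    for sent in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            tag_total[tag] += 1
            prev = tag
    return trans, emit, tag_total

def viterbi(words, trans, emit, tag_total):
    """Return the most probable tag sequence under the bigram tag model."""
    tags = sorted(tag_total)
    vocab = len({w for (_, w) in emit}) + 1  # +1 slot for unseen words
    prev_total = Counter()
    for (p, _), c in trans.items():
        prev_total[p] += c

    def log_t(p, t):  # add-one smoothed transition log-probability
        return math.log((trans[(p, t)] + 1) / (prev_total[p] + len(tags)))

    def log_e(t, w):  # add-one smoothed emission log-probability
        return math.log((emit[(t, w)] + 1) / (tag_total[t] + vocab))

    # best[t] = (log-probability, tag path) of the best path ending in tag t
    best = {t: (log_t("<s>", t) + log_e(t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        step = {}
        for t in tags:
            score, path = max(
                (best[p][0] + log_t(p, t), best[p][1]) for p in tags
            )
            step[t] = (score + log_e(t, w), path + [t])
        best = step
    return max(best.values())[1]

model = train(TRAIN)
tags = viterbi("an arrow flies".split(), *model)
print(tags)  # ['D', 'N', 'V']
```

With so little training data the output is only a toy result. Taggers of the kind described in the text estimate the same sort of parameters from corpora of a million words or more, and are evaluated on held-out tagged text.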

Of course, corpus analysis is not new. There has been a long empirical tradition within descriptive linguistics. Linguists have been counting words and studying concordances for hundreds of years. There have been corpora, libraries and archives for as long as there has been written language. Text has been stored in electronic form for as long as there have been computers. Many of the analysis techniques are based on Information Theory, which predates computers.

So why so much interest, and why now? The role of computers in society has changed radically in recent years. We used to be embarrassed that we were using a million dollar computer to emulate an ordinary typewriter. Computers were so expensive that applications were supposed to target exclusive and unusual needs. Users were often expected to write their own programs. It was hard to imagine a computer without a compiler. Apple Computer Inc. was one of the first to realize that computers were becoming so cheap that users could no longer afford to customize their own special-purpose applications. Apple took a radical step and began to sell a computer without a compiler or a development environment, abandoning the traditional user-base and targeting the general public by developing user-friendly human-machine interfaces that anyone could use. The emphasis moved to so-called killer applications like word-processing that everyone just had to have. Many PCs now have email, fax and a modem. The emphasis on human-machine interfaces is now giving way to the information super-highway cliche. Computers are rapidly becoming a vehicle for communicating with other people, not very different from a telephone.

"Phones marry computers: new killer applications arrive."
-- cover of Byte magazine, July 1994

Now that so many people are using computers to communicate with one another, vast quantities of text are becoming available in electronic form, ranging from published documents (e.g., electronic dictionaries, encyclopedias, libraries and archives for information retrieval services), to private databases (e.g., marketing information, legal records, medical histories), to personal email and faxes. Just ten years ago, the one-million word Brown Corpus [FK82] was considered large. Today, many laboratories have hundreds of millions or even billions of words. These collections are becoming widely available, thanks to data collection efforts such as the following:

The Association for Computational Linguistics' Data Collection Initiative (ACL/DCI), the Linguistic Data Consortium (LDC) (see the section listing contact addresses), the Consortium for Lexical Research (CLR), the Japanese Electronic Dictionary Research (EDR), the European Corpus Initiative (ECI), the International Computer Archive of Modern English (ICAME), the British National Corpus (BNC), the French corpus Frantext of Institut National de la Langue Française (INaLF-CNRS), the German Institut für deutsche Sprache (IDS), the Dutch Instituut voor Nederlandse Lexicologie (INL), the Danish Dansk Korpus (DK), the Italian Istituto di Linguistica Computazionale (ILC-CNR), the Spanish Reference Corpus Project of Sociedad Estatal del V Centenario, Norwegian corpora of Norsk Tekstarkiv, the Swedish Stockholm-Umeå Corpus (SUC) and corpora at Språkdata, and Finnish corpora of the University of Helsinki Language Corpus Server. This list is not exhaustive; it is meant to illustrate the breadth of data collections and data collection efforts. Data collections exist for many languages in addition to these, and new data collection efforts are being initiated. There are also standardization efforts for the encoding and exchange of corpora, such as the Text Encoding Initiative (TEI).

12.2.2 Identification of Significant Gaps in Knowledge and/or Limitations of Current Technology

The renaissance of interest in corpus-based statistical methods has rekindled old controversies: rationalist vs. empiricist philosophies, theory-driven vs. data-driven methodologies, symbolic vs. statistical techniques. The field will ultimately adopt an inclusive strategy that combines the strengths of as many of these positions as possible.

In the long term, the field is expected to produce significant scientific insights into language. These insights would hopefully be accompanied by corresponding accomplishments in language engineering: better parsers, information retrieval and extraction engines, word processing interfaces with robust grammar/style checking, etc. Parsing technology is currently too fragile, especially on unrestricted text. Text extraction systems ought to determine who did what to whom, but it can be difficult to simply extract names, dates, places, etc. Most information retrieval systems still treat text as merely a bag of words with little or no linguistic structure. There have been numerous attempts to make use of richer linguistic structures such as phrases, predicate argument relations, and even morphology, but, thus far, most of these attempts have not resulted in significant improvements in retrieval performance.
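The limitation of the bag-of-words view can be made concrete with a small sketch. It assumes the simplest possible representation, a multiset of lowercased, whitespace-separated tokens. The function name bag_of_words and the two example sentences are invented for illustration and are not taken from any retrieval system discussed here.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its multiset of lowercased, whitespace-split tokens."""
    return Counter(text.lower().split())

# Two sentences that differ in who did what to whom.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

same_bag = (a == b)
print(same_bag)  # True: word order, and with it the argument structure, is lost
```

A system that sees only these bags cannot tell the two sentences apart, which is why richer structures such as predicate argument relations have been tried, so far without large gains in retrieval performance.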

Current natural language processing systems lack lexical and grammatical resources with sufficient coverage for unrestricted text. Consider the following famous pair of utterances:

Time flies like an arrow.
Fruit flies like a banana.

It would be useful for many applications to know that fruit flies is a phrase and time flies is not. Most systems currently do not have access to this kind of information. Parsers currently operate at the level of parts of speech, without looking at the words. Ultimately, parsers and other natural language applications will have to make greater use of collocational constraints and other constraints on words. The grammar/lexicon will have to be very large, at least as large as an 1800-page book [QGLS85]. The task may require a monumental effort like Murray's Oxford English Dictionary project.

Corpus-based methods may help speed up the lexical acquisition process by refining huge masses of corpus evidence into more manageable piles of high-grade ore. In Grolier's encyclopedia [Gro91], for example, there are 21 instances of fruit fly or fruit flies, and not one instance of time fly or time flies. This kind of evidence is suggestive of the desired distinction, though far from conclusive.
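Counts of this kind are easy to gather once a corpus is available in electronic form. The following is a minimal sketch, assuming a crude tokenizer that keeps only runs of ASCII letters. The toy text is invented for illustration and is not the encyclopedia data, and the names bigram_counts and phrase_count are hypothetical.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs in lowercased text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(words, words[1:]))

def phrase_count(counts, first, seconds):
    """Sum the counts of (first, s) over every variant s in seconds."""
    return sum(counts[(first, s)] for s in seconds)

# Toy stand-in text, invented for illustration.
toy_text = (
    "The fruit fly is a common laboratory animal. "
    "Fruit flies breed quickly. Time passes, and time flies by."
)

counts = bigram_counts(toy_text)
fruit_n = phrase_count(counts, "fruit", ["fly", "flies"])
time_n = phrase_count(counts, "time", ["fly", "flies"])
print(fruit_n, time_n)  # 2 1
```

The toy text deliberately contains one occurrence of time flies, which shows why raw counts are suggestive rather than conclusive: a word pair can be adjacent without forming a phrase.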

12.2.3 Future Directions

The long-term research challenge is to derive lexicons and grammars for broad coverage natural language processing applications from corpus evidence.

A problem with attaining this long-term goal is that it is unclear whether the community of researchers can agree that a particular design of lexicons and grammars is appropriate, and that a large scale effort to implement that design will converge on results of fairly general utility [Lib92].

In the short term, progress can be achieved by improving the infrastructure, i.e., the stock of intermediate resources mentioned in section 12.1. Data collection and dissemination efforts have been extremely successful. Efforts should now be focused on principles, procedures and tools for analyzing these data. There is a need for manual, semi-automatic and automatic methods that help produce linguistically motivated analyses, from which further facts and generalizations can be derived that are useful in improving the performance of language processors.

While there is wide agreement in the research community on these general points, there seems to be no shared vision of what exactly to do with text corpora once you have them. A way to proceed in the short and intermediate term is for data collection efforts to achieve a consensus within the research community by identifying a set of fruitful problems for research (e.g., word sense disambiguation, anaphoric reference, predicate argument structure) and collecting, analyzing and distributing relevant data in a timely and cost-effective manner. Funding agencies can contribute to the consensus building effort by encouraging work on common tasks and sharing of common data and common components.
