Representation and Manipulation of Linguistic Corpora

Nancy Ide

Department of Computer Science
Vassar College

CONTACT INFORMATION

Department of Computer Science
Vassar College
Poughkeepsie, New York 12604-0520 USA
Phone : (+1) 914 437 5988
Fax : (+1) 914 437 7498
Email : ide@cs.vassar.edu

WWW PAGE

http://www.cs.vassar.edu/~ide/research/

PROGRAM AREA

KEYWORDS

Linguistic corpora, text encoding, SGML, data architectures for corpora, tools for corpus annotation and corpus handling

PROJECT SUMMARY

This project is intended to provide a theoretical background and develop coherent methodologies for the representation, access, and manipulation of corpora intended for use in corpus-based natural language processing (NLP) research. The project builds on and continues a program of collaborative research, established in 1988, between Vassar College's Department of Computer Science and the Laboratoire Parole et Langage (LP&L) of the Centre National de la Recherche Scientifique (CNRS) in Aix-en-Provence, France. The collaborative project is supported by a grant from the National Science Foundation (NSF RUI grant IRI-9413451). The project has also received support under the European Commission projects MULTEXT, MULTEXT-EAST, and EAGLES (in particular, the EAGLES Text Representation subgroup).

Our goal is to develop a sound basis and methodology for corpus representation as well as for the design of corpus-handling tools. The two are clearly interdependent, so they must be developed hand in hand. The task involves: (1) analysis of the needs of corpus-based NLP research, both in terms of the kinds and degree of annotation required and the requirements for efficient processing, accessibility, etc.; (2) analysis of general properties and configuration of corpora, analysis of relevant structural and logical features of component text types, and the design of encoding mechanisms that can represent all required elements and features while accommodating the requirements determined in (1); and (3) specifications for text software design, coordinated with (2), with the aim of avoiding redundancy and maximizing modifiability, extensibility, and reusability.

The project has developed a Corpus Encoding Standard, which provides SGML encoding conventions especially suited to encoding linguistic corpora and the annotation documents associated with them (in particular, part-of-speech encoding and alignment of parallel translations), as well as a data architecture for SGML documents containing this data. We have also developed a suite of text-handling tools for extraction and manipulation of SGML documents, segmentation, and lexical annotation, along with some tools for handling speech data. During the past year, we have tested our tools and standards by creating, encoding, and annotating lexicons and corpora in six Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene) and English. The centerpiece consists of seven parallel translations of Orwell's Nineteen Eighty-Four, all tagged for part of speech and sentence-aligned. A description of this corpus can be found at http://nl.ijs.si/ME/.
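
As a rough illustration of what part-of-speech encoding and sentence alignment of parallel texts involve, the following minimal Python sketch reads two small tagged texts and a separate alignment document. It is not the Corpus Encoding Standard itself: the element and attribute names (text, s, w, pos, align, link, source, target), the tag labels, and the two-sentence sample texts are all assumptions invented for this example, and the markup is written as well-formed XML so that the Python standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this sketch (not the actual CES tag set).
# Each sentence <s> has an id; each word <w> carries a part-of-speech label.
ENGLISH = """
<text lang="en">
  <s id="en.1"><w pos="PRON">It</w> <w pos="VERB">was</w> <w pos="DET">a</w> <w pos="ADJ">clear</w> <w pos="NOUN">day</w></s>
  <s id="en.2"><w pos="PRON">It</w> <w pos="VERB">was</w> <w pos="VERB">raining</w></s>
</text>
"""

SLOVENE = """
<text lang="sl">
  <s id="sl.1"><w pos="VERB">Bil</w> <w pos="VERB">je</w> <w pos="ADJ">jasen</w> <w pos="NOUN">dan</w></s>
  <s id="sl.2"><w pos="VERB">Deževalo</w> <w pos="VERB">je</w></s>
</text>
"""

# The alignment is kept in its own document and refers to sentences by id,
# so the texts themselves do not need to be modified to record it.
ALIGNMENT = """
<align>
  <link source="en.1" target="sl.1"/>
  <link source="en.2" target="sl.2"/>
</align>
"""


def read_sentences(markup):
    """Return a dict mapping sentence id to a list of (token, pos) pairs."""
    root = ET.fromstring(markup.strip())
    return {
        s.get("id"): [(w.text, w.get("pos")) for w in s.iter("w")]
        for s in root.iter("s")
    }


def aligned_pairs(alignment, source, target):
    """Resolve each link in the alignment document to a pair of sentences."""
    root = ET.fromstring(alignment.strip())
    return [
        (source[link.get("source")], target[link.get("target")])
        for link in root.iter("link")
    ]


if __name__ == "__main__":
    en = read_sentences(ENGLISH)
    sl = read_sentences(SLOVENE)
    for en_sent, sl_sent in aligned_pairs(ALIGNMENT, en, sl):
        print(" ".join(tok for tok, _ in en_sent), "<->",
              " ".join(tok for tok, _ in sl_sent))
```

Keeping the alignment in a separate document that points into the texts by identifier is one way to let several annotation documents share the same base text; whether a given corpus is organized this way depends on the encoding conventions actually adopted.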

PROJECT REFERENCES

Barnard, D. and Ide, N. (in press) The Text Encoding Initiative: Flexible and Extensible Document Encoding. Journal of the American Society for Information Science.

Erjavec, T., Ide, N., Petkevic, V., Véronis, J. (1996) Multext-East: Multilingual Text, Tools and Corpora for Central and Eastern European Languages. Corpora Proceedings of the First TELRI European Seminar, 87-98.

Erjavec, T., Ide, N., Tufis, D. (1997) Encoding and Parallel alignment of linguistic corpora in six Central and Eastern European Languages. Proceedings of the Joint International ACH/ALLC Conference, Kingston, Ontario, 41-43.

Ide, N. et al. Corpus Encoding Standard. Internal Project Technical Report, available from Laboratoire Parole et Langage, Université de Provence, Aix-en-Provence, France, 100p. Also available on the World Wide Web: <URL:http://www.cs.vassar.edu/CES/>

Ide, N. (1994) Encoding standards for large text resources. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 574-78.

Ide, N. (1996) Encoding Standards for Linguistic Corpora. Proceedings of the First TELRI European Seminar, 65-78.

Ide, N. (1996). Representation schemes for language data. Proceedings of Langues Situées Technologies, et Communication, Rabat, Morocco, April 1996.

Ide, N. (1996) A Standard for Encoding Linguistic Corpora. Proceedings of the Joint International ALLC/ACH Conference, 152-55.

Ide, N., Klavans, J. (1996) The Text Encoding Initiative Guidelines and Their Application to Building Digital Libraries. Proceedings of the First Association for Computing Machinery Conference on Digital Libraries, 182-84.

Ide, N. (1996) EAGLES Final Report: Text Representation Working Group. Internal Project Technical Report, available from Istituto di Linguistica, CNR, Pisa, 100pp. Also available on the World Wide Web: http://www.ilc.pi.cnr.it/EAGLES/.

Ide, N. (1996) F1.1. Specifications for tools and data. Internal Project Technical Report, available from Laboratoire Parole et Langage, Université de Provence, Aix-en-Provence, France, 17pp.

Ide, N. (1996) F1.2. Testing, refinement, and final integration of tools and data in the MULTEXT Project. Internal Project Technical Report, available from Laboratoire Parole et Langage, Université de Provence, Aix-en-Provence, France, 20pp.

Ide, N. (ed.) 1996. Multext-East Language-specific Resources. Deliverable D1.2. Multext-East Project COP 106.

Ide, N., Véronis, J. (1995) The Text Encoding Initiative: Background and Context. Dordrecht: Kluwer Academic Publishers.

Ide, N., Véronis, J. (1996). Une application de la TEI aux industries de la langue: le Corpus Encoding Standard. Cahiers GUTenberg no 24, 166-169.

Ide, N., Véronis, J. (1996). Codage TEI des dictionnaires électroniques. Cahiers GUTenberg no 24, 170-176.

Ide, N., Véronis, J. (1994) MULTEXT: Multilingual Text Tools and Corpora. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 588-92.

AREA BACKGROUND

The increasing interest in the use of large-scale textual resources for natural language processing research has led to the rapid proliferation of both massive amounts of textual data and text-handling tools. Much of the currently available data is marked and annotated using ad hoc formats, most of which are entirely inconsistent with one another, and almost none of which has been developed on the basis of a sound model of text and text categories or with any serious consideration of the needs of corpus-based NLP research. Similarly, and for related reasons, there is enormous redundancy in the functionality of much existing corpus-handling software (part-of-speech taggers, statistics-gathering programs, etc.), because the same systems need to be re-invented over and over again to accommodate specific input and output formats and platforms. Because such software is typically instantiated in large, monolithic systems, the ability to modify it and re-use relevant pieces in other applications is severely limited. Again, the lack of a principled basis for text software design is the cause of this redundancy and limited reusability.

More generally, large-scale text databases are being developed for use in other disciplines, such as the humanities. These texts require much richer encoding than many texts currently available via, for example, the World Wide Web, if they are to be usable for intelligent retrieval and scholarly research. It is especially important to allow for multiple views of such data, since they may be accessed from a variety of perspectives: as a logically structured document, as a linguistic object, as a rhetorical object, as a database of information, etc.

It is imperative that standardized, flexible markup systems be devised that can encode the full range of information of potential interest in these texts. This in turn will enable the development of generally usable software to access and manipulate these texts.

AREA REFERENCES

Coombs, J.H., Renear, A.H., and DeRose, S.J. (1987). Markup systems and the future of scholarly text processing. Communications of the ACM, 30, 11, 933-947.

Cunningham, H., Humphreys, K., Gaizauskas, H., Wilks, Y. (1997) Software Infrastructure for Natural Language Processing. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C.

Ide, N., Véronis, J. (Eds.) (1995a). The Text Encoding Initiative: Background and Context. Kluwer Academic Publishers, Dordrecht, 342p. [reprinted from triple special issue of Computers and the Humanities, 29, no 1/2/3, with an original bibliography]

McKelvie, D., Brew, C., Thompson, H. (1997) Using SGML as a Basis for Data-Intensive NLP. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C.

Sperberg-McQueen, C.M., Burnard, L. (Eds.) (1994). Guidelines for Electronic Text Encoding and Interchange, Text Encoding Initiative, Chicago and Oxford. <URL:http://etext.virginia.edu/TEI.html>

RELATED PROGRAM AREAS

POTENTIAL RELATED PROJECTS

Design of user interfaces for access and manipulation of large and complex text databases.