Department of Computer Science
Vassar College
Our goal is to develop a sound basis and methodology for corpus representation as well as for the design of corpus-handling tools. The two are clearly interdependent, which demands that they be developed hand in hand. The task involves: (1) analysis of the needs of corpus-based NLP research, both in terms of the kinds and degree of annotation required and the requirements for efficient processing, accessibility, etc.; (2) analysis of the general properties and configuration of corpora, analysis of the relevant structural and logical features of component text types, and the design of encoding mechanisms that can represent all required elements and features while accommodating the requirements determined in (1); and (3) specifications for text software design, coordinated with (2), with the aim of avoiding redundancy and maximizing modifiability, extensibility, and reusability.
The project has developed a Corpus Encoding Standard, which provides SGML encoding conventions especially suited to linguistic corpora and the annotation documents associated with them (in particular part-of-speech annotation and alignment of parallel translations), as well as a data architecture for SGML documents containing this data. We have also developed a suite of text-handling tools for extraction and manipulation of SGML documents, segmentation, and lexical annotation, along with some tools for handling speech data. During the past year, we have tested our tools and standards by creating, encoding, and annotating lexicons and corpora in six Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene) and English. The centerpiece consists of seven parallel versions of Orwell's Nineteen Eighty-Four (the English original and six translations), all tagged for part of speech and sentence-aligned. A description of this corpus can be found at http://nl.ijs.si/ME/.
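The idea of keeping annotation documents separate from, but linked to, the texts they describe can be illustrated with a minimal Python sketch. This is not the Corpus Encoding Standard itself and does not reproduce its SGML elements: the class names, the identifier scheme ("en.1", "fr.1"), the part-of-speech tags, and the toy sentences are all assumptions made for this illustration. It only shows how a sentence alignment might refer to part-of-speech-tagged sentences by identifier instead of being embedded in the texts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str  # surface word form
    pos: str   # part-of-speech tag (illustrative names, not a real tagset)


@dataclass(frozen=True)
class Sentence:
    sid: str       # sentence identifier, unique within its text (hypothetical scheme)
    tokens: tuple  # tuple of Token objects


def index_by_id(sentences):
    """Map sentence identifiers to Sentence objects for one text."""
    return {s.sid: s for s in sentences}


def surface(sentence):
    """Join the token forms of a sentence back into a plain string."""
    return " ".join(t.form for t in sentence.tokens)


def aligned_pairs(source, target, links):
    """Resolve alignment links into (source, target) string pairs.

    Each link is a pair of identifier lists, so 1:1, 1:2, and 2:1
    alignments can be expressed without modifying the annotated texts.
    """
    pairs = []
    for src_ids, tgt_ids in links:
        src = " ".join(surface(source[i]) for i in src_ids)
        tgt = " ".join(surface(target[i]) for i in tgt_ids)
        pairs.append((src, tgt))
    return pairs


# Toy data invented for this sketch; it is not taken from the project corpus.
en = index_by_id([
    Sentence("en.1", (Token("She", "PRON"), Token("sings", "VERB"))),
    Sentence("en.2", (Token("He", "PRON"), Token("sleeps", "VERB"))),
])
fr = index_by_id([
    Sentence("fr.1", (Token("Elle", "PRON"), Token("chante", "VERB"))),
    Sentence("fr.2", (Token("Il", "PRON"), Token("dort", "VERB"))),
])

# The alignment is a separate structure that refers to sentences by id only.
links = [(["en.1"], ["fr.1"]), (["en.2"], ["fr.2"])]

pairs = aligned_pairs(en, fr, links)
for src, tgt in pairs:
    print(src, "|", tgt)
```

Because the alignment holds only identifiers, a further translation or a revised alignment can be added in this sketch without re-encoding the tagged texts.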
Barnard, D. and Ide, N. (in press) The Text Encoding Initiative: Flexible and Extensible Document Encoding. Journal of the American Society for Information Science.
Erjavec, T., Ide, N., Petkevic, V., Véronis, J. (1996) Multext-East: Multilingual Text, Tools and Corpora for Central and Eastern European Languages. Corpora Proceedings of the First TELRI European Seminar, 87-98.
Erjavec, T., Ide, N., Tufis, D. (1997) Encoding and Parallel alignment of linguistic corpora in six Central and Eastern European Languages. Proceedings of the Joint International ACH/ALLC Conference, Kingston, Ontario, 41-43.
Ide, N. et al. Corpus Encoding Standard. Internal Project Technical Report, available from Laboratoire Parole et Langage, Université de Provence, Aix-en-Provence, France, 100p. Also available on the World Wide Web: <URL:http://www.cs.vassar.edu/CES/>
Ide, N. (1994) Encoding standards for large text resources. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 574-78.
Ide, N. (1996) Encoding Standards for Linguistic Corpora. Proceedings of the First TELRI European Seminar, 65-78.
Ide, N. (1996). Representation schemes for language data. Proceedings of Langues Situées Technologies, et Communication, Rabat, Morocco, April 1996.
Ide, N. (1996) A Standard for Encoding Linguistic Corpora. Proceedings of the Joint International ALLC/ACH Conference, 152-55.
Ide, N., Klavans, J. (1996) The Text Encoding Initiative Guidelines and Their Application to Building Digital Libraries. Proceedings of the First Association for Computing Machinery Conference on Digital Libraries, 182-84.
Ide, N. (1996) EAGLES Final Report: Text Representation Working Group. Internal Project Technical Report, available from Istituto di Linguistica, CNR, Pisa, 100pp. Also available on the World Wide Web: http://www.ilc.pi.cnr.it/EAGLES/.
Ide, N. (1996) F1.1. Specifications for tools and data. Internal Project Technical Report, available from Laboratoire Parole et Langage, Université de Provence, Aix-en-Provence, France, 17pp.
Ide, N. (1996) F1.2. Testing, refinement, and final integration of tools and data in the MULTEXT Project. Internal Project Technical Report, available from Laboratoire Parole et Langage, Université de Provence, Aix-en-Provence, France, 20pp.
Ide, N. (ed.) (1996) Multext-East Language-specific Resources. Deliverable D1.2. Multext-East Project COP 106.
Ide, N., Véronis, J. (1995) The Text Encoding Initiative: Background and Context. Dordrecht: Kluwer Academic Publishers.
Ide, N., Véronis, J. (1996) Une application de la TEI aux industries de la langue: le Corpus Encoding Standard. Cahiers GUTenberg no 24, 166-169.
Ide, N., Véronis, J. (1996) Codage TEI des dictionnaires électroniques. Cahiers GUTenberg no 24, 170-176.
Ide, N., Véronis, J. (1994) MULTEXT: Multilingual Text Tools and Corpora. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 588-92.
More generally, large-scale text databases are being developed for use
in other disciplines, such as the humanities. These texts require much
richer encoding than many texts currently available via, for example,
the World Wide Web, if they are to be usable for intelligent retrieval
and scholarly research. It is especially important to allow for
multiple views of such data, since they may be accessed from a
variety of perspectives: as a logically structured document, as a
linguistic object, as a rhetorical object, as a database of
information, etc.
It is imperative that standardized, flexible markup systems be devised
which can provide for encoding the full range of information which is
of potential interest in these texts. This in turn will enable the
development of generally usable software to access and manipulate
these texts.
Coombs, J.H., Renear, A.H., and DeRose, S.J. (1987). Markup systems
and the future of scholarly text processing. Communications of the
ACM, 30, 11, 933-947.
Cunningham, H., Humphreys, K., Gaizauskas, R., Wilks, Y. (1997)
Software Infrastructure for Natural Language Processing.
Proceedings of the Fifth Conference on Applied Natural Language
Processing, Washington, D.C.
Ide, N., Véronis, J. (Eds.) (1995a). The Text Encoding Initiative:
Background and Context. Kluwer Academic Publishers, Dordrecht, 342p.
[reprinted from triple special issue of Computers and the
Humanities, 29, no 1/2/3, with an original bibliography]
McKelvie, D., Brew, C., Thompson, H. (1997) Using SGML as a Basis for
Data-Intensive NLP. Proceedings of the Fifth Conference on Applied
Natural Language Processing, Washington, D.C.
Sperberg-McQueen, C.M., Burnard, L. (Eds.) (1994). Guidelines for
Electronic Text Encoding and Interchange, Text Encoding Initiative,
Chicago and Oxford. <URL:http://etext.virginia.edu/TEI.html>
Design of user interfaces for access and manipulation of large and
complex text databases.
AREA BACKGROUND
The increasing interest in the use of large-scale textual resources
for natural language processing research has led to the rapid
proliferation of both textual data and text-handling tools. Much of
the currently available data is marked up and annotated using ad hoc
formats, most of which are inconsistent with one another, and almost
none of which has been developed on the basis of a sound model of
text and text categories or with serious consideration of the needs
of corpus-based NLP research. Similarly, and for related reasons,
there is enormous redundancy in the functionality of much existing
corpus-handling software (part-of-speech taggers, statistics-gathering
programs, etc.), because the same systems must be reinvented over and
over to accommodate specific input and output formats and platforms.
Because such software is typically implemented as large, monolithic
systems, the ability to modify it and reuse relevant pieces in other
applications is severely limited. Again, the lack of a principled
basis for text software design is the cause of this redundancy and
limited reusability.
AREA REFERENCES
RELATED PROGRAM AREAS
POTENTIAL RELATED PROJECTS