Department of Computer Science
(*)Columbia University
(**)University of Rochester
(***)University of Pittsburgh
CARD will consist of three components. The first will be a Discourse Annotation Language (DAL) to encode information pertaining to language use directly within discourse corpora. DAL will be a theory-neutral encoding scheme defined to capture properties of discourse independent of modality (e.g., spoken vs. written), number of participants, domain, genre and so on. That is, procedures for identifying a given DAL element, such as an utterance, might differ between spoken and written discourse, but the symbolic function of such units will be constant. The vocabulary and syntax of DAL will be documented, along with procedural rules for annotating different types of corpora. From a corpus annotated with DAL, researchers will be able to determine the relative frequency of phenomena of interest, and investigate their interdependence. Some DAL features will be annotated automatically, and others will require human coders. In either case, reliability measures of the degree of variability in DAL annotations will be provided as the second CARD component, thus ensuring the integrity of an annotated corpus. The third, and perhaps most immediately useful, CARD component will be a library of DAL-annotated corpora. The selected corpora will vary in modality, number of participants, domain, and communicative task.
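The reliability measures mentioned above can be illustrated with a small sketch. The summary does not say which statistic CARD will use; the example below assumes Cohen's kappa, one common chance-corrected agreement measure for two coders, and the function name, label names, and sample data are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of a match given each coder's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: two coders marking whether each utterance starts a segment.
coder_1 = ["boundary", "none", "none", "boundary", "none", "none"]
coder_2 = ["boundary", "none", "boundary", "boundary", "none", "none"]
kappa = cohen_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.3f}")
```

A value near 1 indicates agreement well above chance, and a value near 0 indicates agreement no better than chance. Other statistics may suit annotations with more than two coders.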
The collaborators on the proposed work have independently collected or begun collecting six corpora from which to begin constructing the CARD corpus library. The set includes three spoken-language corpora: a 14,000-word corpus of narrative monologues; a 33,000-word corpus of career-counseling interviews; and a 62,000-word corpus of human problem-solving dialogues. Three corpora currently being collected include spoken and written human/computer dialogues, and mixed written/spoken dialogues between humans. One entirely new corpus will be collected, potentially from a Multi-User Dialogue (MUD) application. The existing corpora have been partially annotated, but may need to be re-annotated to achieve consistency with the new DAL definitions. DAL will be a modular language with five layers of linguistic representation: morpho-syntactic, prosodic, anaphoric, lexical, and segmental. Modularity will allow investigators to select specific features of interest, and to analyze alternative classifications of the same data. DAL will be implemented in SGML (Standard Generalized Markup Language), following the Text Encoding Initiative (TEI) Guidelines. Adherence to these guidelines should simplify machine processing, encourage document sharing, and facilitate the use of common authoring and editing utilities. The project will end with the design and testing of an effective distribution mechanism for CARD.
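As a rough illustration of how a layered annotation scheme lets an investigator select one layer and compute relative frequencies, the sketch below parses a small markup fragment. The element and attribute names (`discourse`, `u`, `ann`, `layer`, `type`) and the label values are invented for this example; they are not DAL or TEI definitions. The fragment is written as well-formed XML so that Python's standard library can parse it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: each utterance carries annotations from several layers.
SAMPLE = """
<discourse>
  <u id="u1">
    <ann layer="segmental" type="segment-start"/>
    <ann layer="prosodic" type="phrase-final-fall"/>
    <ann layer="anaphoric" type="pronoun"/>
  </u>
  <u id="u2">
    <ann layer="prosodic" type="phrase-final-rise"/>
    <ann layer="anaphoric" type="pronoun"/>
  </u>
  <u id="u3">
    <ann layer="segmental" type="segment-start"/>
    <ann layer="anaphoric" type="full-np"/>
  </u>
</discourse>
"""


def relative_frequencies(xml_text, layer):
    """Return the relative frequency of each annotation type within one layer."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        ann.get("type")
        for ann in root.iter("ann")
        if ann.get("layer") == layer
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Select only the anaphoric layer and ignore the others.
anaphoric = relative_frequencies(SAMPLE, "anaphoric")
for label, share in sorted(anaphoric.items()):
    print(f"{label}: {share:.2f}")
```

Because each annotation names its layer, the same corpus can be queried layer by layer without touching the other annotations.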
Over the last year, the Rochester team has led the development of DAMSL, a "standard" annotation scheme for capturing the different levels of action and interaction in dialog. The initial scheme arose from a meeting of the Discourse Research Initiative at the University of Pennsylvania in 1996. We subsequently developed a coding manual and a GUI tool for annotating dialogues, which was used by more than twenty-five researchers worldwide as the "homework" for a second meeting in Dagstuhl, Germany, in February 1997. A revised scheme was developed there, which formed the basis for a new manual and a revised tool developed at Rochester.
We are currently engaged in a concerted effort to annotate a substantial part of the TRAINS-93 human-human dialogs as a test of the scheme, and will be performing reliability experiments.
James Allen and Mark Core. 1996. Draft of DAMSL: Dialog Act Markup in Several Layers. University of Rochester, Department of Computer Science. December 10.
Hongyan Jing, Vasileios Hatzivassiloglou, Rebecca Passonneau, and Kathleen McKeown. 1997. Investigating Complementary Methods for Verb Sense Pruning. In Proceedings of the ANLP97 workshop "Tagging Text with Lexical Semantics: Why, What, and How?", April 4-5, Washington, D.C.
Rebecca J. Passonneau and Diane Litman. 1997. Discourse Segmentation by Human and Automated Means. Computational Linguistics, 23(1): 103-140, March. Special Issue on Empirical Studies in Discourse Interpretation and Generation.
Rebecca Passonneau. 1996. Instructions for Applying Discourse Reference Annotation for Multiple Applications (DRAMA). Draft. Columbia University, Department of Computer Science. December 13.
Rebecca Passonneau. 1996. Interacting Constraints on Reference in Spoken Language. Presented at the International Symposium on Spoken Dialogue, October 1-3, Philadelphia, PA.
Rebecca Passonneau. 1997. Applying Reliability Metrics to Co-Reference Annotation. Columbia University, Technical Report, CUCS-017-97.
Experience has shown real benefits for the human language community when corpus resources are annotated and shared. Examples include performance evaluation of systems, such as the Message Understanding Conference (MUC) evaluations of systems that train and test on the same materials. Individual system modules can also be profitably trained and tested on the same annotated data, e.g., for syntactic parsing or for reference resolution. Often, the creation of shareable tagged data requires the research community to agree upon common methods of annotation, as in the University of Pennsylvania's hand-annotated Treebank corpus of parsed Wall Street Journal text, the MUC standards for tagging coreference relations, or the development of the ToBI (TOnes and Break Indices) labeling system for prosody. However, development of a common annotation language for discourse corpora is problematic, in part because of the wide variety of technology goals, and in part because of variation across types of corpora. In our work, we are developing and implementing a Discourse Annotation Language (DAL) that addresses these problems, and we will use it to create a heterogeneous library of annotated corpora.
Mitch Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.
Nancy A. Chinchor and Beth Sundheim. 1995. Message Understanding Conference (MUC) Test of Discourse Processing. In AAAI Spring Symposium: Empirical Methods in Discourse Interpretation and Generation, pages 21-27.
K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg. 1992. ToBI: A Standard for Labeling English Prosody. In Proceedings of the International Conference on Spoken Language Processing, pages 867-870.