
Large-Scale Interlingual Machine Translation (NYI) and Development of a Framework for Large-Scale Translation, Tutoring, and Information Filtering (PFF/PECASE)

Bonnie J. Dorr

University of Maryland
Department of Computer Science
and Institute for Advanced Computer Studies
A.V. Williams Building
College Park, MD 20742

CONTACT INFORMATION

Email: bonnie@cs.umd.edu
Phone: 301-405-6768
Fax: 301-314-9658

WWW PAGE

http://www.umiacs.umd.edu/~bonnie

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

PROJECT SUMMARY

The main goal of the research plan for the NYI and PFF/PECASE projects (1993-1997, 1997-1999) is to investigate the applicability of a lexical-based framework to large-scale natural language processing (NLP) tasks such as interlingual machine translation (MT), foreign language tutoring (FLT), and information filtering and retrieval. Specifically, the projects aim to systematize the relation between syntax and semantics in lexical representations and to examine the problem of lexicon construction in multilingual NLP applications.

This research has four expected results: (1) a set of novel lexical representations that apply uniformly to languages as diverse as Arabic, English, French, Korean, and Spanish; (2) translation and tutoring systems designed to use these novel lexical representations; (3) a general approach to automatic construction of large-scale lexicons for NLP applications; and (4) techniques for accurate selection of texts from a multilingual information collection. The primary benefit derived from this research is the validation of a common semantic representation that may be used by different researchers to test long-standing hypotheses about computerized translation, tutoring, and information filtering.

Currently, NLP systems are plagued with problems concerning extensibility to new languages or domains; these problems are exacerbated when designers attempt to scale up their systems so that they have broader coverage. The most significant bottleneck in this regard is the construction of lexicons that support multiple languages. To date, computational lexicons have been built through laborious word-by-word recoding of existing on-line dictionaries. This problem is addressed by demonstrating that a lexical-based framework accommodates automatic lexicon construction and supports a broader range of cross-linguistic phenomena; the end result will be a significant reduction in development time for large-scale NLP systems.

Earlier work supports the view that language divergences (i.e., syntactic and semantic differences that result in translation mismatches across languages) may be finitely specified, and that these divergences may be resolved by means of a handful of lexical-semantic parameters associated with entries in the lexicon. The following results have been achieved as part of this research:

As part of this earlier research, a set of basic templates was developed to represent the meaning of words; from this a set of correspondence rules was devised in order to form an association between the word meanings and their realization properties (i.e., the range of possible syntactic structures associated with each word in a sentence). While this approach resolves many types of translation divergences that arise between languages, more recent work [14,15,16,17,18,19,24,25] has demonstrated that the lexical-semantic representations and correspondence rules alone do not suffice for resolving more complex mismatches.
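The relation between a meaning template and its correspondence rule can be illustrated with a minimal Python sketch. This is not the project's implementation: the names `LexicalEntry` and `realize`, the template string, and the role labels are all hypothetical, and the English/Spanish pair like/gustar is used only as a familiar case in which one meaning is linked to different syntactic positions in two languages.

```python
from dataclasses import dataclass


@dataclass
class LexicalEntry:
    """Hypothetical lexical entry: a language-independent meaning template
    plus a language-specific correspondence rule."""
    word: str
    language: str
    template: str   # shared meaning template (illustrative notation)
    linking: dict   # correspondence rule: thematic role -> syntactic position


def realize(entry, fillers):
    """Apply the entry's correspondence rule to the role fillers and
    return a mapping from syntactic position to filler."""
    return {position: fillers[role] for role, position in entry.linking.items()}


# Both entries share one template; only the linking differs (assumed values).
LIKE = LexicalEntry("like", "English", "BE-PLEASED(experiencer, theme)",
                    {"experiencer": "subject", "theme": "object"})
GUSTAR = LexicalEntry("gustar", "Spanish", "BE-PLEASED(experiencer, theme)",
                      {"experiencer": "indirect_object", "theme": "subject"})

fillers = {"experiencer": "I", "theme": "Mary"}
english = realize(LIKE, fillers)
spanish = realize(GUSTAR, fillers)
print(english)
print(spanish)
```

In this sketch the divergence lives entirely in the `linking` table of each entry, so the shared template is untouched when a new language is added.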

The current research aims to extend earlier work in lexical-semantics by building conceptual descriptions for well-defined subclasses of predicates and to identify and formalize different mechanisms required for resolution of complex linguistic mismatches such as aspectual distinctions. This work will contribute toward the development of representations that capture enough meaning to perform accurate lexical selection during the generation of an output sentence in applications such as MT and FLT. One important goal is to develop a well-defined semantic representation that provides a direct, natural description of the structure and content of linguistic information. The semantic representation serves as the basis of experiments for testing the validity of the following hypotheses:

Testing these hypotheses will involve the development of procedures for the automatic construction of multilingual lexicons that support a large-scale effort in MT, FLT, and information filtering and retrieval. The main innovation of this research is the potential to acquire lexical-semantic representations through the application of linguistically motivated constraints. Although a number of earlier studies have addressed the problem of lexical acquisition, they have focused primarily on the acquisition of language-specific syntactic representations with no underlying linguistic framework.

A recent investigation [4] has demonstrated that linguistic constraints may be used to provide a semantic analysis of on-line dictionaries, thesauri, and textual data; these results will be used for the automatic generation of appropriate lexical-semantic representations for new words. The resulting framework provides the basis for the construction of an English lexicon. Specifically, the plan is to develop an acquisition program that exploits the compositional nature of the semantic representation to build new lexical definitions automatically. An additional goal is to demonstrate that the resulting definitions are easily ported to other languages, e.g., Arabic, French, Korean, and Spanish. The results of this research will be used in two systems currently under development, one for MT [10,11] and the other for FLT [3,6].
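As a rough illustration of how a compositional representation could support automatic construction of definitions, the sketch below builds an entry for a new word by filling a word-specific constant into the template of its semantic class. The class names, template notation, and the functions `build_definition` and `build_lexicon` are assumptions made for this example; the actual acquisition program and its representations are not specified here.

```python
# Hypothetical class templates; "{constant}" marks the word-specific part.
CLASS_TEMPLATES = {
    "manner_of_motion": "GO(theme, TOWARD(goal), MANNER({constant}))",
    "change_of_state": "BECOME(theme, STATE({constant}))",
}


def build_definition(word, semantic_class):
    """Compose a definition for a new word from the template of its class."""
    if semantic_class not in CLASS_TEMPLATES:
        raise ValueError(f"no template for class {semantic_class!r}")
    return CLASS_TEMPLATES[semantic_class].format(constant=word.upper())


def build_lexicon(assignments):
    """Build definitions for every (word, class) pair in the assignment table."""
    return {word: build_definition(word, cls) for word, cls in assignments.items()}


# Class assignments would come from dictionary or corpus analysis;
# here they are supplied by hand for illustration.
lexicon = build_lexicon({"jog": "manner_of_motion", "melt": "change_of_state"})
for word, definition in lexicon.items():
    print(word, "->", definition)
```

Because the class template carries the shared structure, adding a word under this assumption requires only a class assignment rather than a hand-written definition.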

As the representations and techniques for MT and FLT are enhanced, it becomes feasible to consider their application to other areas such as multilingual information filtering, i.e., the evaluation of articles from an information stream and the selection of those which are relevant to a user's interest regardless of their source language. An initial investigation [20,21,22] has demonstrated that it is possible to apply MT techniques to the problem of monitoring and disseminating information taken from multi-language news groups. The current goal is to extract word senses from an interlingual analysis of text in order to add a natural language component to the filtering process. Such an approach would allow a variety of languages, both source (i.e., input from a news group) and target (i.e., output in the user's native tongue), to be incorporated seamlessly into an information filtering system.
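The filtering idea can be sketched in a few lines of Python, under strong simplifying assumptions: a small hand-built table (`CONCEPTS`, hypothetical) stands in for the interlingual analysis that would map words to language-neutral senses, and an article is selected when its senses overlap the user's profile. The function names, the table contents, and the threshold are illustrative only.

```python
# Hypothetical (language, word) -> concept table standing in for
# the word senses extracted by an interlingual analysis.
CONCEPTS = {
    ("en", "election"): "ELECTION",
    ("es", "elecciones"): "ELECTION",
    ("fr", "élection"): "ELECTION",
    ("en", "market"): "MARKET",
    ("es", "mercado"): "MARKET",
}


def concepts_of(language, text):
    """Return the set of language-neutral concepts found in the text."""
    tokens = text.lower().split()
    return {CONCEPTS[(language, t)] for t in tokens if (language, t) in CONCEPTS}


def filter_stream(articles, profile, threshold=1):
    """Keep articles whose concepts overlap the profile in at least
    `threshold` concepts, regardless of the source language."""
    selected = []
    for language, text in articles:
        if len(concepts_of(language, text) & profile) >= threshold:
            selected.append((language, text))
    return selected


stream = [
    ("es", "las elecciones de hoy"),
    ("en", "the market fell"),
    ("fr", "une élection anticipée"),
]
selected = filter_stream(stream, profile={"ELECTION"})
print(selected)
```

Matching on concepts rather than surface words is what lets the Spanish and French articles satisfy the same profile in this sketch; translating the selected articles into the user's language would be a separate step.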

Focusing on these research areas will result in the national availability of: (1) a general framework that allows the same programs to be used for processing multiple languages in different applications (e.g., MT, FLT, and information filtering and retrieval); (2) an extensible lexical-semantic representation that serves as a broad-coverage interlingua and provides a means for natural language researchers to test new ideas by making minor modifications; and (3) implemented techniques for reducing the development time of large-scale NLP systems. These results will contribute toward the realization of NLP models that serve as a research community standard, enabling researchers to make incremental changes to test long-standing hypotheses about computerized translation, tutoring, and information filtering and retrieval.

PROJECT REFERENCES

[1] Bonnie J. Dorr. Interlingual Machine Translation: A Parameterized Approach. Artificial Intelligence, 63(1&2):429-492, 1993.

[2] Bonnie J. Dorr. Machine Translation Divergences: A Formal Description and Proposed Solution. Computational Linguistics, 20(4):597-633, 1994.

[3] Bonnie J. Dorr. Large-Scale Acquisition of LCS-Based Lexicons for Foreign Language Tutoring. In Proceedings of the ACL Fifth Conference on Applied Natural Language Processing (ANLP), pages 139-146, Washington, DC, 1997.

[4] Bonnie J. Dorr. Large-Scale Dictionary Construction for Foreign Language Tutoring and Interlingual Machine Translation. Machine Translation, 12(1), To appear.

[5] Bonnie J. Dorr, Joseph Garman, and Amy Weinberg. From Syntactic Encodings to Thematic Roles: Building Lexical Entries for Interlingual MT. Machine Translation, 9:71-100, 1995.

[6] Bonnie J. Dorr, Jim Hendler, Scott Blanksteen, and Barrie Migdalof. Use of LCS and Discourse for Intelligent Tutoring: On Beyond Syntax. In Melissa Holland, Jonathan Kaplan, and Michelle Sams, editors, Intelligent Language Tutors: Theory Shaping Technology, pages 289-309. Lawrence Erlbaum Associates, Hillsdale, NJ, 1995.

[7] Bonnie J. Dorr and Douglas Jones. Acquisition of Semantic Lexicons: Using Word Sense Disambiguation to Improve Precision. In Proceedings of the Workshop on Breadth and Depth of Semantic Lexicons, 34th Annual Conference of the Association for Computational Linguistics, pages 42-50, Santa Cruz, CA, 1996.

[8] Bonnie J. Dorr and Douglas Jones. Role of Word Sense Disambiguation in Lexical Acquisition: Predicting Semantics from Syntactic Cues. In Proceedings of the International Conference on Computational Linguistics, pages 322-333, Copenhagen, Denmark, 1996.

[9] Bonnie J. Dorr and Douglas Jones. Use of Syntactic and Semantic Filters for Lexical Acquisition: Using WordNet to Increase Precision. In Proceedings of the Workshop on Predicative Forms in Natural Language and Lexical Knowledge Bases, pages 81-88, Toulouse, France, 1996.

[10] Bonnie J. Dorr, Dekang Lin, Jye-hoon Lee, and Sungki Suh. A Paradigm for Non-head-driven Parsing: Parameterized Message-Passing. In Proceedings of the International Conference on New Methods in Language Processing, pages 174-181, Manchester, UK, 1994.

[11] Bonnie J. Dorr, Dekang Lin, Jye-hoon Lee, and Sungki Suh. Efficient Parsing for Korean and English: A Parameterized Message Passing Approach. Computational Linguistics, 21(2):255-263, 1995.

[12] Bonnie J. Dorr and Mari Broman Olsen. Multilingual Generation: The Role of Telicity in Lexical Choice and Syntactic Realization. Machine Translation, 11(1-3):37-74, 1996.

[13] Bonnie J. Dorr and Mari Broman Olsen. Deriving Verbal and Compositional Lexical Aspect for NLP Applications. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), Madrid, Spain, July 7-12, 1997.

[14] Bonnie J. Dorr and Martha Palmer. Building a LCS-Based Lexicon from TAGs. In Proceedings of the AAAI-95 Spring Symposium Series, Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pages 33-38, Stanford, CA, March 1995.

[15] Bonnie J. Dorr and Clare Voss. Machine Translation of Spatial Expressions: Defining the Relation between an Interlingua and a Knowledge Representation System. In Proceedings of Twelfth Conference of the American Association for Artificial Intelligence, pages 374-379, Washington, DC, 1993.

[16] Bonnie J. Dorr and Clare Voss. A Multi-Level Approach to Interlingual MT: Defining the Interface between Representational Languages. International Journal of Expert Systems, 9(1):15-51, 1996.

[17] Bonnie J. Dorr, Clare Voss, Eric Peterson, and Michael Kiker. Concept Based Lexical Selection. In Proceedings of the AAAI-94 Fall Symposium on Knowledge Representation for Natural Language Processing in Implemented Systems, pages 21-30, New Orleans, LA, 1994.

[18] Bonnie J. Dorr and Clare R. Voss. Constraints on the Space of MT Divergences. Technical Report AAAI TR, SS-93-02, Building Lexicons for Machine Translation, AAAI-93, Spring Symposium, pages 43-53, Stanford, CA, 1993.

[19] Bonnie J. Dorr and Clare R. Voss. The Case for a MT Developers' Tool with a Two-Component View of the Interlingua. In Proceedings of the First Annual Association for MT in the Americas Conference on Partnerships in Translation Technology, pages 40-47, Columbia, MD, 1994.

[20] Douglas W. Oard, Nicholas DeClaris, Bonnie J. Dorr, and Christos Faloutsos. On Automatic Filtering of Multilingual Texts. In Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, pages 1645-1650, San Antonio, TX, 1994.

[21] Douglas W. Oard, Nicholas DeClaris, Bonnie J. Dorr, and Christos Faloutsos. Experimental Investigation of High Performance Cognitive and Interactive Text Filtering. In Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, pages 4398-4403, Vancouver, Canada, 1995.

[22] Douglas W. Oard and Bonnie J. Dorr. Evaluating Cross-Language Text Filtering Effectiveness. In Proceedings of the Cross-Linguistic Multilingual Information Retrieval Workshop, pages 8-14, Zurich, Switzerland, 1996.

[23] Clare Voss, Bonnie Dorr, and Mine Ulku Sencan. Lexical Allocation in Interlingua-Based Machine Translation of Spatial Expressions. In Proceedings of IJCAI-95 Workshop on Representation and Processing of Spatial Expressions, Montreal, Canada, 1995.

[24] Clare Voss and Bonnie J. Dorr. Toward a Lexicalized Grammar for Interlinguas. Machine Translation, 10(1-2):139-180, 1995.

[25] Clare R. Voss and Bonnie J. Dorr. Defining the Lexical Component in Interlinguas. In Proceedings of AAAI spring symposium on Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pages 168-173, Stanford, CA, 1995.

[26] Clare R. Voss, Bonnie J. Dorr, and Mine Ulku Sencan. The Problem of Lexical Allocation in Interlingua-based Machine Translation of Spatial Expressions. In Patrick Olivier, editor, Representation and Processing of Spatial Expressions. Springer-Verlag, To appear in 1997.

AREA BACKGROUND

NLP technology provides ways of programming the computer with enough information about language that it can build representations for such tasks as machine translation [2,5], foreign language tutoring [3], and multilingual information filtering [6]. Currently, NL systems are plagued with problems concerning extensibility, particularly when designers attempt to scale up their systems so that they have broader coverage. The most significant bottleneck in this regard is the construction of machine-tractable lexicons (i.e., large NL databases that relate words to their corresponding meanings) [1]. To date, designers have been forced to build such lexicons through laborious word-by-word recoding of already existing on-line dictionaries. More recently, researchers have examined linguistically motivated frameworks, such as that of Levin [4], which accommodate automatic lexicon construction and cover a wider range of cross-linguistic phenomena. The ultimate goal is to develop techniques that provide a substantial reduction in development time for large-scale NL systems.

AREA REFERENCES

[1] Branimir Boguraev and Ted Briscoe, editors. Computational Lexicography for Natural Language Processing. Longman, London, 1989.

[2] Bonnie J. Dorr. Machine Translation: A View from the Lexicon. The MIT Press, Cambridge, MA, 1993.

[3] Melissa Holland, Jonathan Kaplan, and Michelle Sams, editors. Intelligent Language Tutors: Theory Shaping Technology. Lawrence Erlbaum Associates, Hillsdale, NJ, 1995.

[4] Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL, 1993.

[5] Sergei Nirenburg, Jaime Carbonell, Masaru Tomita, and Kenneth Goodman, editors. Machine Translation: A Knowledge-Based Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1992.

[6] Douglas W. Oard. The State of the Art in Text Filtering. User Modeling and User-Adapted Interaction, 1997. To appear.

RELATED PROGRAM AREAS

1. Virtual Environments.

3. Other Communication Modalities.

4. Adaptive Human Interfaces.

POTENTIAL RELATED PROJECTS

We are currently investigating the applicability of the linguistically-motivated framework described above to large-scale tasks in different languages and applications. Future directions will involve the use of our text-based NLP techniques in systems that support man/machine interaction in a virtual visual environment. We are also investigating other communication modalities (such as speech) and human-computer interaction in applications such as foreign language tutoring and multilingual information filtering. We plan to investigate the development of: (1) a set of novel machine-tractable lexical representations that apply uniformly to languages as diverse as Arabic, English, French, Korean, and Spanish; (2) human-language tutoring systems designed to use our interlingual representations; and (3) linguistically-motivated techniques for accurate selection of texts from a multilingual information stream.