Generalized Example-based Machine Translation

Jaime Carbonell

Language Technologies Institute
Carnegie Mellon University

CONTACT INFORMATION

Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Phone: (412) 268-7279
Fax: (412) 268-6298
Email: jgc@cs.cmu.edu

WWW PAGE

Home page under construction

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Machine translation, natural language processing, case-based reasoning, example-based machine translation, computational linguistics.

PROJECT SUMMARY

Example-Based Machine Translation (EBMT) holds promise as a rapid-deployment technology for new language pairs, especially as the key engine in a multi-engine "plug-and-play" MT system, requiring minimal development to extend to new languages. However, EBMT is still very much a "pre-competitive" technology, requiring a focused R&D effort to realize its potential as probably the most cost-effective technology for accurate MT in multi-lingual assimilation environments such as intelligence gathering and analysis. In essence, EBMT operates by finding multiple matches to each sentence being translated in the source-language side of a bilingual corpus. Sections of the corresponding target-language sentences are then combined, both compositionally and through deformations guided to recover from the measured deviations in the original match.
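
As a rough illustration of the matching step described above, the following minimal sketch finds contiguous word sequences that an input sentence shares with the source side of a toy bilingual corpus. The corpus contents, the function name find_matches, the min_len threshold, and the use of Python's difflib.SequenceMatcher are all assumptions made for illustration, not the project's actual method. The recombination of target-language fragments, which would need word-level alignment, is not shown.

```python
from difflib import SequenceMatcher

# Toy bilingual corpus (hypothetical data): (source, target) sentence pairs.
CORPUS = [
    ("el gato negro duerme", "the black cat sleeps"),
    ("la casa es grande", "the house is big"),
    ("el perro duerme en la casa", "the dog sleeps in the house"),
]

def find_matches(sentence, corpus, min_len=2):
    """Return (matched_words, source, target) for every corpus pair whose
    source side shares a contiguous run of at least min_len words with
    the input sentence."""
    words = sentence.split()
    matches = []
    for source, target in corpus:
        src_words = source.split()
        # Compare word lists rather than characters; autojunk is disabled
        # so that frequent words are never silently ignored.
        matcher = SequenceMatcher(None, words, src_words, autojunk=False)
        for block in matcher.get_matching_blocks():
            if block.size >= min_len:
                chunk = words[block.a:block.a + block.size]
                matches.append((chunk, source, target))
    return matches

if __name__ == "__main__":
    for chunk, source, target in find_matches("el gato duerme en la casa", CORPUS):
        print(" ".join(chunk), "|", source, "|", target)
```

Each returned tuple pairs a matched source fragment with the full target sentence it came from; a fuller system would go on to extract and combine only the aligned target-language sections.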

This project addresses the primary scientific issues and technology gaps in EBMT, namely: principled partial matching of new text to the bilingual corpus, judicious infusion of encapsulated linguistic knowledge to increase manyfold the utility of a bilingual corpus, improved systematic search, and self-tuning scoring metrics. The EBMT method will be tested on Spanish-English MT, Korean-English MT, and/or any other selected language pair of interest for which there are significant bilingual corpora available. Thus far we have developed partial matching methods and taken the first steps towards linguistic generalization.
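
One way to picture how linguistic knowledge can stretch a bilingual corpus is to replace members of a word class with a class token, so that a single stored example matches many new inputs. The sketch below is only an illustration under that assumption: the class lexicon WORD_CLASSES, the token names, and the functions generalize and matches_example are hypothetical and are not taken from the project.

```python
# Hypothetical word-class lexicon; a real system would draw on far richer
# linguistic knowledge than this hand-made table.
WORD_CLASSES = {
    "lunes": "<DAY>", "martes": "<DAY>",
    "madrid": "<CITY>", "lima": "<CITY>",
}

def generalize(sentence, classes=WORD_CLASSES):
    """Lowercase the sentence and replace each word that belongs to a
    known class with its class token."""
    return " ".join(classes.get(w.lower(), w.lower()) for w in sentence.split())

def matches_example(new_sentence, stored_example):
    """True if the two sentences are identical after generalization."""
    return generalize(new_sentence) == generalize(stored_example)

if __name__ == "__main__":
    print(generalize("Llego a Madrid el lunes"))
    print(matches_example("Llego a Lima el martes", "Llego a Madrid el lunes"))
```

Under this scheme the stored example "Llego a Madrid el lunes" also covers "Llego a Lima el martes", although the class members would still need to be translated separately and substituted into the target side.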

Measurements of translation accuracy will be conducted, as will measurements of efficiency gains in human translator time in the context of a human-aided MT system such as the multi-engine MEMT system or its plug-and-play application (DIPLOMAT) for speech-to-speech MT. The benefits of developing principled generalized text matching methods with infusion of linguistic and statistical knowledge go beyond the immediate EBMT focus. Information retrieval, automated text summarization, and interactive textual and multi-media information navigation would all stand to gain from significantly better textual similarity measures.

PROJECT REFERENCES

The project started recently, and therefore has no publications as yet.

AREA BACKGROUND

Machine Translation (MT) is occasionally labeled the "paradigm task" for natural language processing, because fully accurate MT will require in-depth text comprehension, as Bar-Hillel forecast in the 1960s. Bar-Hillel and others based their arguments on the need for full disambiguation: lexical, syntactic, referential, intentional, etc. Rather than being discouraged by MT being an AI-complete task, researchers sought to build approximations to accurate MT requiring much less comprehension, or to circumscribe the task to well-defined domains and semi-controlled inputs. The KANT system at CMU is a good example of the latter.

With respect to general-purpose MT, researchers attempted two general paradigms: transfer-based MT and direct MT. The former includes syntactic parsing of the source language, structural transfer to a syntactic structure corresponding to the target language, and target-language generation. Most commercial MT systems are based on this paradigm, requiring years of careful hand-crafting of grammars, dictionaries and transfer rules. Both example-based MT (Kyoto U., ATR, CMU) and statistical MT (IBM: Candide project) attempted a more direct mapping, bypassing the need for grammars or hand-coded rules. These direct mappings are far less general than linguistically motivated ones, but have the advantage that they are learned automatically from large aligned bilingual text corpora. Generalized EBMT provides a hefty measure of generality to automatically learned translation mappings. The crux of this project is how to accomplish effective and principled generalized EBMT.

AREA REFERENCES

M. Nagao, "A Framework of a Mechanical Translation between Japanese and English by Analogy Principle". In Artificial and Human Intelligence, A. Elithorn and R. Banerji (eds). NATO Publications, 1984.

P. Brown, J. Cocke, S. DellaPietra, V. DellaPietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin. "A Statistical Approach to Machine Translation." In Computational Linguistics 16(2), 1990.

Carbonell, J. G., Cullingford, R. E. and Gershman, A. G. "Steps Towards Knowledge-Based Machine Translation", IEEE PAMI, Vol 3, Num 4, 1981.

Carbonell, J., T. Mitamura and E. Nyberg, "The KANT Perspective: A Critique of Pure Transfer (and Pure Interlingua, Pure Statistics, ...)", Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, Montreal, 1992.

R. Frederking, S. Nirenburg, D. Farwell, S. Helmreich, E. Hovy, K. Knight, S. Beale, C. Domashnev, D. Attardo, D. Grannes, R. Brown. "Integrating Translations from Multiple Sources within the Pangloss Mark III Machine Translation." Proceedings of the First Conference of the Association for Machine Translation in the Americas, AMTA-94, Columbia, MD, 1994.

RELATED PROGRAM AREAS

None of the other areas are a close match.

POTENTIAL RELATED PROJECTS

Textual similarity measures are crucial for context setting (e.g. selection of language model in speech recognition, improving information retrieval), beyond the current MT application. Collaborative projects in this direction would be of significant interest.