Language Technologies Institute
Carnegie Mellon University
This project addresses the primary scientific issues and technology gaps in EBMT, namely: principled partial matching of new text to the bilingual corpus, judicious infusion of encapsulated linguistic knowledge to increase manyfold the utility of a bilingual corpus, improved systematic search, and self-tuning scoring metrics. The EBMT method will be tested on Spanish-English MT, Korean-English MT, and/or any other selected language pair of interest for which there are significant bilingual corpora available. Thus far we have developed partial matching methods and taken the first steps towards linguistic generalization.
Measurements of translation accuracy will be conducted, as will measurements of efficiency gains in human translator time in the context of a human-aided MT system such as the multi-engine MEMT system or its plug-and-play application (DIPLOMAT) for speech-to-speech MT. The benefits of developing principled, generalized text matching methods infused with linguistic and statistical knowledge go beyond the immediate EBMT focus. Information retrieval, automated text summarization, and interactive textual and multimedia information navigation would all stand to gain from significantly better textual similarity measures.
With respect to general-purpose MT, researchers have pursued two general paradigms: transfer-based MT and direct MT. The former comprises syntactic parsing of the source language, structural transfer to a syntactic structure corresponding to the target language, and target-language generation. Most commercial MT systems are based on this paradigm, which requires years of careful hand-crafting of grammars, dictionaries and transfer rules. Both example-based MT (Kyoto U., ATR, CMU) and statistical MT (IBM: Candide project) attempted a more direct mapping, bypassing the need for grammars or hand-coded rules. These direct mappings are far less general than linguistically motivated ones, but have the advantage that they are learned automatically from large aligned bilingual text corpora. Generalized EBMT adds a substantial measure of generality to automatically learned translation mappings. The crux of this project is how to accomplish effective and principled generalized EBMT.
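As a rough illustration of what partial matching of new text against a bilingual corpus can look like, the sketch below greedily covers an input sentence, left to right, with the longest word sequences that also occur on the source side of a toy aligned corpus. It is not the project's matching method: the corpus, the function names (`contains`, `partial_matches`) and the `min_len` parameter are all hypothetical, and a real system would also need word alignment to extract the corresponding target fragments, plus the linguistic generalization and scoring discussed above.

```python
from typing import List, Tuple

# Toy aligned corpus (hypothetical data, for illustration only).
CORPUS: List[Tuple[str, str]] = [
    ("el gato negro duerme", "the black cat sleeps"),
    ("la casa es grande", "the house is big"),
    ("el perro duerme en la casa", "the dog sleeps in the house"),
]


def contains(haystack: List[str], needle: List[str]) -> bool:
    """Return True if `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def partial_matches(sentence: str,
                    corpus: List[Tuple[str, str]],
                    min_len: int = 2) -> List[Tuple[str, List[int]]]:
    """Greedy left-to-right cover of `sentence` by corpus fragments.

    Returns (matched fragment, indices of corpus examples containing it).
    Words that cannot start a match of at least `min_len` words are skipped.
    """
    words = sentence.split()
    matches: List[Tuple[str, List[int]]] = []
    start = 0
    while start < len(words):
        found = False
        # Try the longest candidate fragment first, then shrink it.
        for end in range(len(words), start + min_len - 1, -1):
            chunk = words[start:end]
            hits = [i for i, (src, _) in enumerate(corpus)
                    if contains(src.split(), chunk)]
            if hits:
                matches.append((" ".join(chunk), hits))
                start = end  # continue after the matched fragment
                found = True
                break
        if not found:
            start += 1  # no usable fragment starts here
    return matches


if __name__ == "__main__":
    for fragment, examples in partial_matches(
            "el perro duerme en la casa es grande", CORPUS):
        print(fragment, "->", [CORPUS[i][1] for i in examples])
```

Exact word matching of this kind fails as soon as a single word differs from the corpus, which is the gap that generalization (for example, matching on word classes rather than surface forms) is meant to close.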
M. Nagao, "A Framework of a Mechanical Translation between Japanese and English by Analogy Principle". In Artificial and Human Intelligence, A. Elithorn and R. Banerji (eds). NATO Publications, 1984.
P. Brown, J. Cocke, S. DellaPietra, V. DellaPietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin. "A Statistical Approach to Machine Translation". In Computational Linguistics 16(2), 1990.
Carbonell, J. G., Cullingford, R. E. and Gershman, A. G. "Steps Towards Knowledge-Based Machine Translation", IEEE PAMI, Vol 3, Num 4, 1981.
Carbonell, J., T. Mitamura and E. Nyberg, "The KANT Perspective: A Critique of Pure Transfer (and Pure Interlingua, Pure Statistics, ...)", Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, Montreal, 1992.
R. Frederking, S. Nirenburg, D. Farwell, S. Helmreich, E. Hovy, K. Knight, S. Beale, C. Domashnev, D. Attardo, D. Grannes, R. Brown. "Integrating Translations from Multiple Sources within the Pangloss Mark III Machine Translation System". Proceedings of the First Conference of the Association for Machine Translation in the Americas, AMTA-94, Columbia, MD, 1994.