Logical and Statistical Approaches to Mismatch Resolution in Machine Translation

Jean Mark Gawron and Megumi Kameyama

SRI International, EK282
333 Ravenswood Avenue
Menlo Park, CA 94025

CONTACT INFORMATION

Dr. Jean Mark Gawron, SRI International, EK282, 333 Ravenswood Avenue, Menlo Park, CA 94025

Email: gawron@ai.sri.com

phone: (415)859-5089

fax: (415)859-3735

WWW PAGE

http://www.ai.sri.com/~megumi/mt.html

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Machine Translation, natural language processing, translation mismatch, analysis, semantics transfer, generation

PROJECT SUMMARY

A major bottleneck in present-day machine translation (MT) is identifying appropriate approximations when no exact translation exists between the source and target languages. An MT system must resolve such mismatches either by incorporating implicit information from context or by leaving out some of the information in the source text.

The prototype MT system implemented here translates texts of news articles about joint commercial ventures from Japanese to English. It essentially adopts the transfer approach, incorporating all the pieces of a classic transfer system: Analysis (analyzing Japanese sources into a Japanese-oriented semantic representation), Transfer (transferring the Japanese-oriented semantics into an English-oriented semantics), and Generation (building English targets from the transferred semantics). The problem of mismatch is addressed through the addition of a novel Mismatch Resolution Module (MRM), which is called when Generation fails.
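
The control flow described above can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the dictionary-based semantic representation, the one-word lexicon, the function names, and the default values the toy MRM supplies are hypothetical and are not taken from the actual system.

```python
from typing import Optional

# Hypothetical toy representation: a "semantics" is a dict of features.
Semantics = dict


def analyze(japanese: str) -> Semantics:
    """Analysis: toy stand-in that ignores its input and returns a fixed
    Japanese-oriented semantics with definiteness and number unspecified."""
    return {"pred": "kaisha", "definite": None, "number": None}


def transfer(sem: Semantics) -> Semantics:
    """Transfer: map Japanese-oriented predicates to English-oriented ones."""
    lexicon = {"kaisha": "company"}  # hypothetical bilingual entry
    return {**sem, "pred": lexicon[sem["pred"]]}


def generate(sem: Semantics) -> Optional[str]:
    """Generation: return an English noun phrase, or None on failure."""
    if sem["definite"] is None or sem["number"] is None:
        return None  # English needs these features spelled out
    det = "the" if sem["definite"] else ("a" if sem["number"] == "sg" else "")
    noun = sem["pred"] + ("s" if sem["number"] == "pl" else "")
    return f"{det} {noun}".strip()


def resolve_mismatch(sem: Semantics) -> Semantics:
    """MRM: supply one missing feature, using assumed default values."""
    defaults = {"definite": False, "number": "sg"}  # illustrative only
    for feature, value in defaults.items():
        if sem[feature] is None:
            return {**sem, feature: value}
    return sem


def translate(japanese: str, max_rounds: int = 5) -> Optional[str]:
    """Analysis, then Transfer, then a bounded Generation-MRM loop."""
    sem = transfer(analyze(japanese))
    for _ in range(max_rounds):
        output = generate(sem)
        if output is not None:
            return output
        sem = resolve_mismatch(sem)  # called only when Generation fails
    return None


print(translate("kaisha"))
```

The point of the sketch is only the division of labor: Transfer stays simple, and the repair step runs on demand, after Generation has reported a failure.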

In this architecture, Transfer rules are overgeneral and simplified, so that they can fire with minimal contextual information. The semantic representations of Japanese sentences are also underspecified in a variety of ways, leaving unresolved various components that may not need resolution for translation to be successful, including scope, anaphora resolution, and word-sense disambiguation. As a result, the Transfer module is nondeterministic, in general producing a set of candidate English-oriented semantic representations. The bulk of the disambiguation and selection is done in two ways: first by consulting English-specific models of collocational information, and second within the Generation-MRM loop, where the MRM offers solutions to the problems encountered by Generation.

To illustrate the collocational model at work, we can take the example of English eat translating to either German fressen or essen. On the classic approach, the transfer rule consults linguistic context, mapping eat to essen when the subject is human, and to fressen when the subject is an animal. On the approach advocated here, there are simply unconstrained transfer rules licensing both mappings, essentially the information that can be obtained from a bilingual dictionary, and we rely on the (statistical) collocational model of German to disprefer a human subject of fressen or an animal subject of essen. The Transfer model is simplified, and the cross-translations are correctly characterized as unlikely, not impossible.

To illustrate the Mismatch Resolution Module at work, consider various features that Japanese may leave unspecified but that need to be spelled out in constructing an English sentence, such as the definiteness and number of Noun Phrases, or the gender for an English pronominal form of a Japanese zero. It is the task of the MRM to specify these where they are necessary for successful generation.

We explore two kinds of constraints that can be incorporated into an MRM, logical and statistical, with the design goal of a single MRM combining the advantages of both approaches.
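
The eat/essen/fressen illustration can be made concrete with a small Python sketch. It is a sketch under stated assumptions, not the project's collocational model: the co-occurrence counts are invented, the add-one smoothing is one simple choice among many, and the names TRANSFER, COUNTS, log_prob, and rank_translations are hypothetical.

```python
import math
from typing import List, Tuple

# Unconstrained transfer rules: roughly what a bilingual dictionary gives.
TRANSFER = {"eat": ["essen", "fressen"]}

# Hypothetical subject-verb co-occurrence counts for German (invented numbers).
COUNTS = {
    ("Mensch", "essen"): 95, ("Mensch", "fressen"): 5,
    ("Hund", "essen"): 3, ("Hund", "fressen"): 97,
}


def log_prob(subject: str, verb: str, smoothing: float = 1.0) -> float:
    """Smoothed log P(verb | subject) estimated from the toy counts."""
    verbs = {v for (_, v) in COUNTS}
    total = sum(COUNTS.get((subject, v), 0) for v in verbs)
    count = COUNTS.get((subject, verb), 0)
    return math.log((count + smoothing) / (total + smoothing * len(verbs)))


def rank_translations(source_verb: str, german_subject: str) -> List[Tuple[str, float]]:
    """Let Transfer propose every candidate; let the target-language model rank."""
    candidates = TRANSFER[source_verb]
    scored = [(verb, log_prob(german_subject, verb)) for verb in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(rank_translations("eat", "Mensch"))  # essen ranked first
print(rank_translations("eat", "Hund"))    # fressen ranked first
```

Because of the smoothing, every candidate keeps a finite score, which mirrors the claim that the cross-translations are unlikely rather than impossible.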

The system will use the KNP parser (Kurohashi-Nagao Parser) from Kyoto University for Japanese analysis, a semantic Transfer module, and an English grammar using the Gemini formalism descended from the CLE. The Generator will use a version of head-driven generation. Work so far has focused on the translation and analysis of a set of twenty example articles, and on the implementation of a set of preliminary transfer rules. Translations will be evaluated by monolingual English speakers applying evaluation measures modeled on those of the DARPA MT program. The focus will be on translating the Japanese joint venture business articles in the MUC-5 corpus into English.

PROJECT REFERENCES

Alshawi, H., D. Carter, M. Rayner, and B. Gamback. (1991) "Translation by Quasi Logical Form Transfer." Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Morristown, N.J., Association for Computational Linguistics.

Alshawi, H. (Ed.) (1992) The Core Language Engine, Cambridge, MA: The MIT Press.

Andry, F., J. M. Gawron, J. Dowding, and R. Moore. (1994) "A tool for collecting domain dependent sortal constraints from corpora," In Proceedings of COLING-94.

Dowding, J., J. M. Gawron, D. Appelt, L. Cherny, R. Moore, and D. Moran. (1993) "Gemini: A Natural Language System for Spoken Language Understanding," In Proceedings of the Thirty-First Annual Meeting of the ACL, Association for Computational Linguistics.

Hobbs, J. R., and M. Kameyama. (1990) "Translation by Abduction," In Proceedings of COLING-90.

Kameyama, M., R. Ochitani, and S. Peters. (1991) "Resolving Translation Mismatches With Information Flow," Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics.

Shieber, S. M., G. van Noord, F. C. N. Pereira, and R. C. Moore. (1990) "Semantic-Head-Driven Generation." Computational Linguistics, 16:1.

AREA BACKGROUND

Within natural language processing, the two areas this project most directly addresses are Machine Translation and Generation. Machine Translation in its most ambitious form is translation of text or speech from one language into another (possibly unrelated) language by a computer without human intervention. In practice it has proved forbiddingly difficult to build computational systems that produce fluent output without human intervention, and all machine translation systems in use today rely on human editing of the machine output (post-editing), human editing of the input (pre-editing), or both. Such systems are often called human-assisted machine translation systems. The field of Machine Translation has also attacked the problem from the other direction, expanding to include systems called machine-assisted translation systems or translator's workbenches, where the computational system plays the role of an assistant, mustering various online knowledge resources and proposing candidate translations under human supervision.

What makes unassisted quality translation so difficult is that it requires simultaneous perfect performance on a number of difficult natural language processing tasks. The input in the source language must be correctly analyzed to at least partially specify its intended meaning. This analysis must then be manipulated in some way that makes it appropriate for the target language. There are two general approaches to this. One is to further analyze the input into some abstract (or "deep") representation common to all languages. This is called the Interlingua approach, and it requires the existence of some interlingua representation rich enough to map all distinctions in all languages, and a deep analysis module that builds representations rich enough to exploit it. The other approach is called the Transfer approach. A Transfer MT system transfers a source language analysis into a target language analysis. The Transfer approach requires a Transfer module that is unnecessary on the Interlingua approach, but it makes up for this by setting itself an easier analysis task: analysis may end with a representation specifically tailored to the source language, and so may be less abstract. Whichever approach is adopted, the final step is to input a representation into a Generation module, whose task is to put together all the necessary words in all the correct forms to create an understandable target language sentence that represents the meaning of the input. Success requires perfect performance in all three phases: (a) initial source Analysis; (b) Deep Analysis or Transfer; and (c) Generation. All these tasks are beset by the problems that make natural language systems in general misbehave (ambiguity, vagueness, complexity), and each has the potential of introducing new problems or worsening problems introduced by previous modules.

Much of the complexity of MT arises in representing information specific to language pairs, for example, representing all the numerous possibilities for translating the French preposition de into English, some of which don't require a preposition. Even a moderately useful system that succeeds only with heavy post-editing requires that an enormous amount of information about both languages be encoded and maintained. It is thus extremely important for MT systems to be able to leverage existing resources, for example, by separating out as much as possible the information that is particular to a single language and used for analysis or generation from information about the mapping between a particular language pair. The former can be assembled from more commonly available monolingual resources and systems. The latter requires rarer and more expensive bilingual information.

One of the goals of the MRM architecture in this project is to simplify the core of bilingual information an MT system needs by factoring more into monolingual collocational models and a flexible Generation-MRM loop. Factoring information out of Transfer and into the Generation-MRM loop separates out those mismatch strategies (for example, strategies involving particular kinds of paraphrase or sense resolution) that are largely independent of the source language. In effect, this is an attempt to partially preserve, within a transfer approach, one of the chief advantages of an Interlingua system. Interlingua systems have always held out the appeal of minimal reliance on language-specific information, with no module devoted to a particular language pair; systems with n languages in principle require only a Generator and Analyzer for each, or 2n modules. In an MRM system, a somewhat different trade-off is explored: where the Interlingua system reduces reliance on bilingual information at the cost of more complex analysis, the MRM system does it at the cost of more complex generation, through a Generator able to adjust to semantic representations that are less than perfect fits for the target language.
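
The 2n figure can be set against a transfer architecture with a short Python calculation. The interlingua count follows the text; the transfer count rests on an assumption not stated above, namely that a transfer system needs one Transfer module per ordered language pair in addition to an Analyzer and a Generator for each language. The function names are hypothetical.

```python
def interlingua_modules(n: int) -> int:
    """One Analyzer and one Generator per language: 2n modules."""
    return 2 * n


def transfer_modules(n: int) -> int:
    """2n monolingual modules plus one Transfer module per ordered pair
    of distinct languages (an assumption of this sketch): 2n + n(n - 1)."""
    return 2 * n + n * (n - 1)


for n in (2, 3, 10):
    print(n, interlingua_modules(n), transfer_modules(n))
```

Under that assumption the bilingual component grows quadratically with the number of languages, which is why keeping each Transfer module small matters.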

AREA REFERENCES

Carbonell, J., E. Rich, D. Johnson, M. Tomita, M. Vasconcellos, and Y. Wilks. (1992) JTEC Panel Report on Machine Translation in Japan. Technical report, Japanese Technology Evaluation Center, Loyola College.

Dorr, B. (1993) Machine Translation: A View from the Lexicon. The MIT Press, Cambridge, MA.

Kay, M., J. M. Gawron, and P. Norvig. (1994) Verbmobil: A Translation System for Face-to-Face Dialog, Stanford, CA: CSLI Publications.

Nirenburg, S., J. Carbonell, M. Tomita, and K. Goodman. (Eds.) (1992) Machine Translation: A Knowledge-Based Approach. Morgan Kaufmann Publishers, San Mateo, CA.