
8.5 Multilingual Information Retrieval

Christian Fluhr
CEA-INSTN, Saclay, France

8.5.1 State of the Art

The problem of multilingual access to text databases can be seen as an extension of the general information retrieval (IR) problem corresponding to paraphrase. How does one retrieve documents containing expressions which do not exactly match those found in the query?

The most traditional approach to IR in general, and to multilingual retrieval in particular, uses a controlled vocabulary for indexing and retrieval. In this approach, a documentalist (or a computer program) selects for each document a few descriptors taken from a closed list of authorized terms. Semantic relations (synonyms, related terms, narrower terms, broader terms) can be used to help choose the right descriptors and to resolve the sense problems raised by synonyms and homographs. The list of authorized terms and the semantic relations between them are contained in a thesaurus.

To implement multilingual querying using this approach, it is necessary to supply a translation of each thesaurus term for each new language supported. This work is facilitated by the fact that each descriptor is not chosen randomly but in order to express a precise, unambiguous concept. The CIENTEC term bank [VL89] is one of many multilingual projects adopting this approach.

A problem remains, however, since a concept expressed by a single term in one language is sometimes expressed by distinct terms in another. For example, the common French term mouton corresponds to two different concepts in English, mutton and sheep. One solution, provided that such distinctions between the implemented languages are known, is to create pseudo-words such as mouton (alimentation) for mutton and mouton (animal) for sheep. These domain semantic tags (such as animal and alimentation), as well as the choice of transfer terms, depend on the final use of the multilingual thesaurus, and it is therefore sometimes easier to build a multilingual thesaurus from scratch than to adapt a monolingual one.
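The pseudo-word device can be illustrated with a small data structure. The following is only a sketch: the concept identifiers, the `THESAURUS` table and both helper functions are hypothetical names chosen for illustration, not part of any thesaurus cited here.

```python
# Hypothetical multilingual thesaurus: one authorized descriptor per concept and language.
# Sense tags such as "(animal)" split an ambiguous word into unambiguous pseudo-words.
THESAURUS = {
    "C001": {"fr": "mouton (animal)", "en": "sheep"},
    "C002": {"fr": "mouton (alimentation)", "en": "mutton"},
}


def translate_descriptor(term, source, target):
    """Return the target-language descriptor for an authorized source-language term."""
    for labels in THESAURUS.values():
        if labels.get(source) == term:
            return labels[target]
    raise KeyError(f"{term!r} is not an authorized {source} descriptor")


def candidate_descriptors(word, lang):
    """List authorized descriptors whose head word (before any sense tag) is `word`."""
    return sorted(
        labels[lang]
        for labels in THESAURUS.values()
        if labels[lang].split(" (")[0] == word
    )


print(translate_descriptor("mouton (animal)", "fr", "en"))
print(candidate_descriptors("mouton", "fr"))
```

The bare word mouton yields two candidate descriptors, so the indexer or the user has to pick a sense before the translation becomes unambiguous.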

This controlled vocabulary approach gives acceptable results but prohibits precise queries that cannot be expressed with these authorized keywords. It is however a common approach in well-delimited fields for which multilingual thesauri already exist (legal domain, energy, etc.) as well as in multinational organizations or countries with several official languages, which contain lexicographical units familiar with problems of terminological translation.

Automation of such methods consists in deducing, during indexing, the key-words that would be assigned to a text from the terms contained in its full text or summary. Links between full-text words and controlled descriptors can be constructed either manually or by an automatic learning process from previously indexed documents. During interrogation, the same process can deduce the key-words from the terms used in the query to produce a search request. If links between text words and key-words are pre-established for different languages, it is possible to interrogate texts that are not in the same language as the query, using the key-words as a pivot language. See the figure below.


Figure: Multilingual Interrogation using interlingual pivot concepts.
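A minimal sketch of interrogation through pivot key-words follows. The word-to-key-word links, the key-word names and the toy documents are all invented for illustration; in a real system the links would be built manually or learned from previously indexed documents, as described above.

```python
# Hypothetical links from text words to language-neutral key-words (the pivot).
WORD_TO_KEYWORD = {
    "fr": {"énergie": "ENERGY", "nucléaire": "NUCLEAR_POWER", "réacteur": "REACTOR"},
    "en": {"energy": "ENERGY", "nuclear": "NUCLEAR_POWER", "reactor": "REACTOR"},
}


def keywords(text, lang):
    """Deduce the set of pivot key-words from the words of a text or query."""
    links = WORD_TO_KEYWORD[lang]
    return {links[w] for w in text.lower().split() if w in links}


def build_index(documents):
    """Map each document id to its pivot key-words; documents are (id, lang, text)."""
    return {doc_id: keywords(text, lang) for doc_id, lang, text in documents}


def search(query, lang, index):
    """Return ids of documents sharing key-words with the query, best match first."""
    wanted = keywords(query, lang)
    scored = [(len(wanted & kws), doc_id) for doc_id, kws in index.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


docs = [
    ("d1", "en", "Safety of the nuclear reactor"),
    ("d2", "en", "Sheep farming report"),
    ("d3", "fr", "énergie et climat"),
]
index = build_index(docs)
print(search("réacteur nucléaire", "fr", index))  # French query, English document
print(search("nuclear energy", "en", index))      # English query, both languages
```

Because documents and queries meet only at the key-word level, the language of the query never needs to match the language of the text.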

Generally, the controlled vocabulary approach means that queries can only be as precise as the predefined key-words (i.e., concepts) present in the thesaurus, posing an upper limit on query precision.

A third approach to multilingual interrogation is to use existing machine translation (MT) systems to automatically translate the queries, or even the entire textual database, from one language to another. When only queries are translated from a source to a target language, text can be searched in the target language and the results can be translated back into the source language as they are displayed after the search.

This kind of method would be satisfactory if current MT systems did not make errors. A certain amount of syntactic error can be accepted without perturbing the results of information retrieval systems, but MT errors in translating concepts can prevent relevant documents, indexed on the missing concepts, from being found. For example, if the French word traitement is translated as processing instead of salary, the retrieval process will yield wrong results.

This drawback is limited in MT systems that use huge transfer lexicons of noun phrases, like the RETRANS system developed by [BKK93] at VINITI, Moscow. But in any collection of text, ambiguous nouns will still appear as isolated noun phrases untouched by this approach.

A fourth approach to multilingual information retrieval is based on Salton's vector space model [SM83]. This model represents documents in an n-dimensional space (n being the number of different words in the text database). If some documents are translated into a second language, these documents can be observed both in the subspace related to the first language and in the subspace related to the second one. Using a query expressed in the second language, the most relevant documents in the translated subset are extracted (usually using a cosine measure of proximity). These relevant documents are in turn used to extract close untranslated documents in the subspace of the first language.
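The two-step retrieval just described can be sketched in a few lines. This is an illustration under simplifying assumptions (raw word counts, no term weighting or stop-word removal, a four-document toy collection); the function names and the data are hypothetical.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words vector: one dimension per distinct word."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse word vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy database in French (first language); only d1 and d2 also exist in English.
french = {
    "d1": "élevage du mouton en montagne",
    "d2": "centrale nucléaire et énergie",
    "d3": "laine et élevage du mouton",
    "d4": "sûreté de la centrale nucléaire",
}
english = {
    "d1": "sheep farming in the mountains",
    "d2": "nuclear power plant and energy",
}


def cross_language_search(query, top_translated=1):
    """Rank untranslated French documents for a query written in English."""
    q = vector(query)
    # Step 1: rank the translated subset in the English subspace.
    ranked = sorted(english, key=lambda d: cosine(q, vector(english[d])), reverse=True)
    seeds = ranked[:top_translated]
    # Step 2: use the French side of the best matches to find close untranslated documents.
    untranslated = [d for d in french if d not in english]
    scores = {
        d: max(cosine(vector(french[s]), vector(french[d])) for s in seeds)
        for d in untranslated
    }
    return sorted(untranslated, key=lambda d: scores[d], reverse=True)


print(cross_language_search("sheep farming"))
print(cross_language_search("nuclear energy"))
```

The translated documents act as a bridge: the query never touches French words directly, so the quality of the result depends on how well the translated subset covers the topics of the database.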

An improvement to this approach, using existing translations of a part of the database, has been investigated by a team at Bellcore [LL90]. Their information retrieval is based on latent semantic indexing. They approximate the full word-document matrix by a product of three lower-dimensionality matrices of orthogonal factors derived by singular value decomposition. This transformation enables them to make comparisons not on individual words but on sets of semantically related words. This approach uses implicit dependency links and co-occurrences that better approximate the notion of concept.
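In symbols, the decomposition can be written as below. The notation is the usual one for latent semantic indexing and is an assumption here, not taken from [LL90]: X is the word-document matrix with t words and d documents, and k is the number of retained factors. The second line gives the standard way of projecting a query or a new document, with word-count vector q, into the reduced space, where it is compared with the rows of D_k (for instance by cosine).

```latex
X \approx \hat{X} = T_k \, S_k \, D_k^{\mathsf{T}}, \qquad
T_k \in \mathbb{R}^{t \times k},\quad
S_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
D_k \in \mathbb{R}^{d \times k},\quad
T_k^{\mathsf{T}} T_k = D_k^{\mathsf{T}} D_k = I_k

\hat{q} = S_k^{-1} \, T_k^{\mathsf{T}} \, q, \qquad q \in \mathbb{R}^{t}
```

When the training documents contain both English and French words, the rows of T_k cover both vocabularies, so a vector q built from words of either language lands in the same k-dimensional space.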

The method has been tested with some success on the English-French language pair using a sample of the Canadian Parliament bilingual corpus. Of the 2482 paragraphs selected, 900 were used for training, using both the English and the French words in the documents to build the matrices. The 1582 remaining documents were added to the matrices in their French version only. The English versions of these 1582 documents were then used as queries, relying on the 900 documents of the training set to relate the French and English words in the latent semantic indexing. For 92% of the English documents, the closest document returned by the method was its correct French translation.

Such an approach presupposes that the sample used for training is really representative of the full database. Translation of the sample remains a huge undertaking that must be done for each new database.

Still another approach consists in combining machine translation methods with information retrieval methods. This approach has been developed by a European ESPRIT consortium (French, Belgian, German) in the project EMIR (European Multilingual Information Retrieval) [EMI94]. Experiments have been performed on French, English and German. This system uses 3 main tools:

The [EMI94] system uses large monolingual and bilingual dictionaries, enabling it to process full-text databases in any domain. This means that all possible ambiguities in the language, both syntactic and semantic, are taken into account. A few additions to the monolingual and bilingual dictionaries are needed for unseen technical domains, especially in the bilingual dictionaries of multiterms.

Database texts are processed by linguistic processors which normalize single words and compounds. A weight is computed for all normalized words using a statistical model [DFR89]. During interrogation, the text used as a query undergoes the same linguistic processing. The result of this processing is passed to the reformulation process, which infers new terms using monolingual reformulation rules (on the source language and/or the target language) and bilingual reformulation rules (transfer) [Flu90]. Compounds that are translated word for word are restructured by transformational rules. This approach differs significantly from the MT approach, where only one translation of each query word is used: EMIR uses all possible translations in its database search.

In such an approach, training for each database is not needed. Experiments on different databases have shown that, in most cases, the translation ambiguities (often more than 10 for each word) are resolved by comparison with the database lexicon and by co-occurrence with the translations of the other concepts of the query. Implicit semantic information contained in the database text is used as a semantic filter to find the right translation in cases where current MT systems would not succeed.
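The filtering effect can be illustrated with a toy example. This is not the EMIR implementation: the transfer dictionary, the three-document database and the scoring (counting how many query concepts a document covers, with no statistical weighting and no linguistic normalization) are simplifying assumptions made for this sketch.

```python
# Hypothetical French -> English transfer dictionary: every translation is kept.
TRANSFER = {
    "traitement": ["processing", "treatment", "salary"],
    "signal": ["signal"],
    "fonctionnaire": ["civil servant", "official"],
}

# Toy English database; each document is reduced to a set of normalized words.
DATABASE = {
    "e1": {"digital", "signal", "processing", "filter"},
    "e2": {"salary", "scale", "official", "ministry"},
    "e3": {"water", "treatment", "plant"},
}


def expand(query_words):
    """Keep, for each query word, only the translations found in the database lexicon."""
    lexicon = set().union(*DATABASE.values())
    return {w: [t for t in TRANSFER.get(w, []) if t in lexicon] for w in query_words}


def search(query_words):
    """Rank documents by the number of distinct query concepts they cover."""
    candidates = expand(query_words)
    results = []
    for doc_id, words in DATABASE.items():
        matched = {}
        for w, translations in candidates.items():
            found = [t for t in translations if t in words]
            if found:
                matched[w] = found
        if matched:
            results.append((doc_id, matched))
    # Documents covering more concepts come first; ties are broken by document id.
    results.sort(key=lambda r: (-len(r[1]), r[0]))
    return results


print(search(["traitement", "signal"])[0])
print(search(["traitement", "fonctionnaire"])[0])
```

In the first query the best document pairs traitement with processing, in the second with salary: the co-occurring concept selects the translation without any explicit disambiguation step.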

In the framework of EMIR, tests have been performed on the English CRANFIELD information retrieval testbed. First, the original English-language queries were translated into French by domain experts. Then two approaches were tested: querying using the French-to-English SYSTRAN translation followed by a monolingual search was compared to querying using the first bilingual EMIR prototype, which accesses English text by expanding the French queries into English possibilities. The multilingual EMIR interrogation was 8% better than using SYSTRAN followed by monolingual interrogation. On the other hand, monolingual interrogation using the original English queries with monolingual EMIR was 12% better than the bilingual interrogation.

8.5.2 Future Directions

To continue research in the domain of multilingual information retrieval, it is necessary to develop tools and textual data resources whose construction will be costly. Apart from the tools that are needed in all or most areas of natural language research, we see the need for the following:

Large bilingual test corpora are urgently needed in order to evaluate and compare methods in an objective manner. Existing test databases are monolingual, mainly in English. Large-scale test databases which are truly multilingual (i.e., with texts which are strict translations of each other) are needed. It will then be necessary to elaborate a set of queries in the various languages tested, as well as to find all the relevant documents for each query. This is a huge task. Such an undertaking for English textual databases has begun in the TREC (Text REtrieval Conference) project [Har93]. A similar process needs to be put in motion for multilingual test databases.

Databases of lexical semantic relations, as general as possible, are needed in a variety of languages for monolingual reformulation. They should cover classical relations like synonyms, narrower terms and broader terms, and also more precise relations like part of, kind of, actor of the action, instrument of the action, etc., such as those being created for English in WordNet [Mil90]. Bilingual transfer dictionaries should also be as general as possible (general language as well as various specific domains).

To accelerate the construction of such lexicons, tools are needed for extracting terminology and for automatically constructing the semantic relations from corpora of texts. If bilingual corpora of texts are available in a domain, tools for computer-aided building of transfer dictionaries should be developed. This extraction is especially needed for recognizing translations of compounds.
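As one illustration of what such a tool might do, the sketch below proposes translation candidates from a sentence-aligned bilingual corpus by scoring word pairs with the Dice coefficient. The choice of Dice scoring, the toy corpus and the function names are assumptions made here, not a method described in this section, and the sketch handles single words only, not compounds.

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned corpus (French, English); purely illustrative.
ALIGNED = [
    ("le mouton mange", "the sheep eats"),
    ("le mouton dort", "the sheep sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le chien mange", "the dog eats"),
]


def dice_table(pairs):
    """Score every co-occurring (source word, target word) pair with the Dice coefficient."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_freq.update(s_words)
        tgt_freq.update(t_words)
        joint.update(product(s_words, t_words))
    return {
        (s, t): 2 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in joint.items()
    }


def best_translation(word, table):
    """Return the highest-scoring target word for `word` (ties broken alphabetically)."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    return max(sorted(candidates), key=candidates.get)


table = dice_table(ALIGNED)
for w in ("mouton", "chien", "dort"):
    print(w, "->", best_translation(w, table))
```

A lexicographer would still need to validate the proposed pairs; the point of such a tool is to rank candidates, not to replace the human decision.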


