Multitext Fusion, Tracking and Trend Detection

Jaime Carbonell

Language Technologies Institute
Carnegie Mellon University

CONTACT INFORMATION

Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Phone: (412) 268-7279
Fax: (412) 268-6298
Email: jgc@cs.cmu.edu

WWW PAGE

Home page under construction

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Text summarization, natural language processing, information fusion, topic tracking, information retrieval, relevance metrics

PROJECT SUMMARY

The primary research directions of this project have focused on developing novel ways to generate topic-relevant summaries of diverse texts, fusing multiple text summaries into one, and tracking topics over multiple texts. In the original proposal we expected to use as input frame-like templates extracted from the text by the ARPA/TIPSTER engines, because we believed it might be too difficult to work directly from the raw text. However, subsequent experiments showed otherwise, and we now use full text as input to our text summarization and multi-text fusion research.

The primary result to date of our research has been the creation of the Maximal Marginal Relevance (MMR) metric and its successful implementation as a means of improving document retrieval, single-text summary generation, and most recently multi-text summary fusion. In essence, MMR combines relevance with novelty in both document selection for IR, and passage selection for constructing synthetic multi-text summaries.

Relevance is established by the IR system that retrieved the document, e.g. by cosine similarity between query and document word vectors or by any other computable metric. We search for the document that is most different from those already scanned but that still scores high on query relevance. This method promotes a maximal-diversity search: it still provides report(s) of the most relevant event first, but then switches to reports of other relevant events. The method ranks the documents dynamically by the marginal query-relevant information gain per additional report. In this manner, both relevance to the query and diversity from already scanned information are considered, and a tunable function of the two is optimized when selecting the next document in the ranking. The objective is simply to provide the analyst with a maximal-diversity sampling of information pertinent to the query. The maximal marginal relevance (MMR) metric determines the differential relevance of a document with respect to documents already seen.
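
The re-ranking described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the whitespace tokenizer, the use of cosine similarity for both relevance and redundancy, the linear trade-off, and the names bow, cosine, mmr_rank and lam are all assumptions made for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts (illustrative whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def mmr_rank(query, docs, lam=0.7, k=None):
    """Greedy re-ranking. At each step, pick the remaining document that
    maximizes lam * relevance - (1 - lam) * redundancy, where redundancy is
    the highest similarity to any document already selected.
    Returns document indices in ranked order."""
    q = bow(query)
    vecs = [bow(d) for d in docs]
    relevance = [cosine(q, v) for v in vecs]
    remaining = list(range(len(docs)))
    selected = []
    k = len(docs) if k is None else min(k, len(docs))
    while len(selected) < k:
        def score(i):
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy collection: two near-duplicate reports and one report of a different event.
docs = [
    "earthquake hits city center",
    "earthquake hits city center again",
    "earthquake relief funds approved",
]
print(mmr_rank("earthquake city", docs, lam=1.0))  # [0, 1, 2]  relevance only
print(mmr_rank("earthquake city", docs, lam=0.5))  # [0, 2, 1]  diversity promoted
```

With lam set to 1.0 the sketch reduces to an ordinary relevance ranking; lowering lam moves the near-duplicate report below the report of a different event.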

Beyond single-document summarization, a synthesized summary of a set of documents -- such as those output by the retrieval engine with respect to an analyst's query -- often proves more desirable. We first apply the same method for localizing relevant document segments as in single-document summarization to each passage in each document in the relevant retrieved set. Then, we filter these segments for redundancy (noting, if desired, the various sources) using the MMR method developed for improved ranking. Finally, we assemble the non-redundant, query-relevant document segments by topic cohesion into a synthetic multi-document summary at the desired level of granularity.
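
The three steps above (locate query-relevant passages, filter them for redundancy, assemble the survivors) can be sketched in Python as follows. This is a hedged illustration only: treating sentences as passages, the min_relevance cutoff, the nearest-neighbour chaining used as a stand-in for topic cohesion, and the names fuse_summaries, max_passages and lam are assumptions, not the project's method.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts (illustrative tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def fuse_summaries(query, documents, max_passages=3, lam=0.6, min_relevance=0.1):
    """documents maps a source name to its text.
    Returns a list of (source, passage) pairs forming the fused summary."""
    q = bow(query)

    # Step 1: split every document into passages (sentences here, an
    # assumption) and keep only those that are relevant to the query.
    candidates = []
    for source, text in documents.items():
        for passage in re.split(r"(?<=[.!?])\s+", text.strip()):
            vec = bow(passage)
            rel = cosine(q, vec)
            if passage and rel >= min_relevance:
                candidates.append((source, passage, vec, rel))

    # Step 2: filter for redundancy with a greedy relevance/novelty trade-off,
    # remembering the source of each selected passage.
    selected = []
    while candidates and len(selected) < max_passages:
        def score(c):
            redundancy = max((cosine(c[2], s[2]) for s in selected), default=0.0)
            return lam * c[3] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        candidates.remove(best)
        selected.append(best)

    # Step 3: assemble. As a simple stand-in for topic cohesion, start from
    # the first (most relevant) passage and repeatedly append the passage
    # most similar to the one just placed.
    ordered = [selected.pop(0)] if selected else []
    while selected:
        nxt = max(selected, key=lambda c: cosine(ordered[-1][2], c[2]))
        selected.remove(nxt)
        ordered.append(nxt)
    return [(source, passage) for source, passage, _, _ in ordered]


documents = {
    "A": "The earthquake struck the city at dawn. Officials closed the airport.",
    "B": "The earthquake struck the city at dawn. "
         "Relief funds for earthquake victims were approved.",
}
summary = fuse_summaries("earthquake city", documents, max_passages=2)
for source, passage in summary:
    print(f"[{source}] {passage}")
```

In this toy run the sentence repeated verbatim in both sources appears only once in the fused summary, and the passage with no query terms is dropped in step 1. Raising max_passages corresponds to asking for a coarser or finer level of granularity.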

We are also extending the MMR method to search documents cross-linguistically and to produce summaries in languages other than English.

PROJECT REFERENCES

Carbonell, J. G., Geng, Y., and Goldstein, J., "Query-Relevant Document Summarization", Carnegie Mellon University Technical Report, 1997.

Carbonell, J. G., Yang, Y., Frederking, R. E., Brown, R. D., Geng, Y., and Lee, D., "Translingual Information Retrieval: A Comparative Evaluation", IJCAI-97, Nagoya, Japan, 1997. (Distinguished paper award.)

AREA BACKGROUND

Modern information retrieval (IR) search engines produce a list of retrieved documents ranked by declining relevance to the query, but they typically do not address the issue of avoiding massive redundancy. Consider a user faced with 100 retrieved documents who starts scanning down from the most relevant one and, after scanning the first 20, runs out of time or patience, having found them all to be different reports of the same event. It could turn out that the 31st, 48th and 66th documents in the ranked list of 100 each introduce a new event also relevant to the query, though marginally less relevant than the much-repeated first one. These events would have been missed entirely by the analyst, regardless of their potential significance. In essence, the user would have been swamped with relevant but massively redundant information.

Now consider a dynamic re-ranking method based on maximizing the new query-relevant information per document with respect to documents already scanned by the analyst. [By "query-relevant" documents we mean documents relevant to the topic of the search, whether this topic is expressed as an ad-hoc query or as a more complex user profile or topic descriptor. Relevance is established by the IR system that retrieved the document, e.g. by cosine similarity between query and document word vectors or by any other computable metric.] We search for the document that is most different from those already scanned but that still scores high on query relevance. This method promotes a maximal-diversity search: it still provides report(s) of the most relevant event first, but then switches to reports of other relevant events. The method ranks the documents dynamically by the marginal query-relevant information gain per additional report. In this manner, both relevance to the query and diversity from already scanned information are considered, and a tunable function of the two is optimized when selecting the next document in the ranking.
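
The text above does not spell out the tunable function. One way to write it, assuming a linear trade-off between relevance and redundancy, is given below. The symbols are our own notation for illustration: Q is the query, R the retrieved set, S the documents already scanned, Sim_1 and Sim_2 the relevance and inter-document similarity measures (which may be the same metric), and lambda a weight between 0 and 1.

```latex
\mathrm{MMR}(Q, R, S) \;=\; \arg\max_{D_i \in R \setminus S}
  \Bigl[\, \lambda \,\mathrm{Sim}_1(D_i, Q)
  \;-\; (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Bigr]
```

Under this assumed form, lambda = 1 gives the ordinary relevance ranking, lambda = 0 gives a pure maximal-diversity ordering, and intermediate values trade the two off. For the first selection S is empty and the second term is taken to be zero.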

Summarizing documents, sometimes called "abstracting", is a very labor-intensive procedure as currently practiced. For instance, the National Library of Medicine and professional services such as Chemical Abstracts engage armies of technically proficient abstractors to read and summarize each document for entry into their databases. These abstracts, while professionally created at significant cost (in time, people and money), are primarily one-size-fits-all summaries. There is only one fixed degree of abstraction, rather than a tunable parameter to express whether the reader wishes to read the headline, the one-paragraph abstract or the 2-page summary of the 40-page article. Moreover, a document could be important for different reasons, and the abstract is created without knowledge of which reasons the reader may deem important. This is necessarily the case because manual abstraction is performed only once, statically, before any reader enters the system to express his or her interests via queries or profiles.

In contrast, query-relevant abstraction extracts those passages of the document that are directly relevant to the user's profile or query. MMR techniques are then applied to reduce redundancy among the extracted passages, and the remaining passages form the document summary.

AREA REFERENCES

Salton, G. and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.

Kupiec, J. M., Pedersen, J. and Chen, F., "A Trainable Document Summarizer", Proceedings of the 18th Annual Int. ACM/SIGIR Conference on Research and Development in IR, Seattle, WA. Pages 68-73. 1995.

Dumais, S., Landauer, T. and Littman, M., "Automatic Cross-Linguistic Information Retrieval using Latent Semantic Indexing", Proceedings of the 19th Annual Int. ACM/SIGIR Conference on Research and Development in IR, Zurich, 1996.

RELATED PROGRAM AREAS

Adaptive Human Interfaces.

POTENTIAL RELATED PROJECTS

Information fusion and summarization across modalities beyond text (e.g. broadcast news and other video, numerical/graphical information)

Unifying data mining and summarization from text with statistical learning methods for trend detection and exploratory data analysis in data mining from structured (numerical or symbolic) data