Language Technologies Institute
Carnegie Mellon University
The primary result of our research to date has been the creation of the Maximal Marginal Relevance (MMR) metric and its successful application to improving document retrieval, single-text summary generation, and, most recently, multi-text summary fusion. In essence, MMR combines relevance with novelty, both in document selection for IR and in passage selection for constructing synthetic multi-text summaries.
The objective is simply to provide the analyst with a maximal-diversity sampling of information pertinent to the query. The maximal marginal relevance (MMR) metric determines the differential relevance of a document with respect to documents already seen.
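The page does not spell out the metric itself. As a point of reference, one commonly cited way to write MMR is shown below. All symbols are assumptions introduced here for illustration: Q is the query, R the retrieved set, S the subset of R already selected, Sim_1 and Sim_2 are similarity measures (possibly the same one), and lambda in [0, 1] is the tunable trade-off between relevance and novelty.

```latex
\[
\mathrm{MMR} \;=\; \arg\max_{D_i \in R \setminus S}
  \Big[\, \lambda \,\mathrm{Sim}_1(D_i, Q)
  \;-\; (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \,\Big]
\]
```

Under this form, lambda = 1 reduces to an ordinary relevance ranking, while smaller values increasingly penalize documents that resemble those already selected.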
Beyond single-document summarization, a synthesized summary of a set of documents -- such as those output by the retrieval engine with respect to an analyst's query -- often proves more desirable. We first localize the relevant document segments, applying the same method used for single-document summarization to each passage in each document in the relevant retrieved set. Then, we filter these segments for redundancy (noting, if desired, the various sources) using the MMR method developed for improved ranking (see previous section). Finally, we assemble the non-redundant, query-relevant document segments by topic cohesion into a synthetic multi-document summary at the desired level of granularity.
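The three steps above can be sketched roughly as follows. This is only an illustration under several assumptions that are not from the project: passages are taken to be sentences split on periods, similarity is plain Jaccard word overlap, topic cohesion is approximated by greedily chaining each passage to its most similar neighbour, and the names multi_doc_summary, lam, and max_passages are hypothetical.

```python
def words(text):
    """Lower-cased set of words, with commas and periods stripped."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())


def jaccard(a, b):
    """Word-overlap similarity (a stand-in for a real IR similarity metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def multi_doc_summary(query, docs, max_passages=3, lam=0.7):
    """docs maps a source name to its text; returns [(source, passage), ...]."""
    q = words(query)

    # Step 1: localize query-relevant passages in every document.
    candidates = []
    for source, text in docs.items():
        for sentence in text.split("."):
            sentence = sentence.strip()
            if not sentence:
                continue
            w = words(sentence)
            relevance = jaccard(q, w)
            if relevance > 0.0:
                candidates.append((source, sentence, w, relevance))

    # Step 2: filter for redundancy with a greedy MMR-style selection,
    # keeping the source of each passage that survives.
    selected = []
    while candidates and len(selected) < max_passages:
        def marginal(c):
            redundancy = max((jaccard(c[2], s[2]) for s in selected), default=0.0)
            return lam * c[3] - (1.0 - lam) * redundancy
        best = max(candidates, key=marginal)
        candidates.remove(best)
        selected.append(best)

    # Step 3: assemble by (approximate) topic cohesion: start from the
    # first selected passage and repeatedly append the most similar one left.
    if not selected:
        return []
    ordered = [selected.pop(0)]
    while selected:
        nxt = max(selected, key=lambda s: jaccard(s[2], ordered[-1][2]))
        selected.remove(nxt)
        ordered.append(nxt)
    return [(source, sentence) for source, sentence, _, _ in ordered]


docs = {
    "A": "The earthquake struck northern Japan. Markets fell on Monday.",
    "B": "The earthquake struck northern Japan. "
         "Relief funds for earthquake victims were announced.",
}
summary = multi_doc_summary("earthquake Japan", docs, max_passages=2)
for source, passage in summary:
    print(f"[{source}] {passage}")
```

In this toy input the sentence repeated in source B is passed over in favour of the relief-funds sentence, which is less relevant but adds new information.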
We are also extending the MMR method to search documents cross-linguistically and to produce summaries in languages other than English.
Carbonell, J. G., Geng, Y., and Goldstein, J., "Query-Relevant Document Summarization", Carnegie Mellon University Technical Report, 1997.
Carbonell, Jaime G.; Yang, Yiming; Frederking, Robert E.; Brown, Ralf D.; Geng, Yibing; Lee, Danny, "Translingual Information Retrieval: A Comparative Evaluation", IJCAI-97, Nagoya, Japan. 1997. (Distinguished paper award.)
Now consider a dynamic re-ranking method based on maximizing the new query-relevant information per document with respect to documents already scanned by the analyst. [By "query-relevant" documents we mean documents relevant to the topic of the search, whether this topic is expressed as an ad-hoc query or as a more complex user profile or topic descriptor. Relevance is established by the IR system that retrieved the document, e.g. cosine similarity between query and document word vectors, or any other computable metric.] In essence, we are searching for the document that is most different from those already scanned but that still scores high on query relevance. This method promotes a maximal-diversity search, still providing report(s) of the most relevant event first, but then switching to reports of other relevant events. The new method thus ranks the documents dynamically by the marginal query-relevant information gain per additional report. In this manner, both relevance to the query and diversity from already scanned information are considered, and a tunable function of the two is optimized when selecting the next document in the ranking.
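A minimal sketch of such a re-ranking loop is given below. It is not the project's implementation: it assumes raw term-frequency vectors with cosine similarity for both relevance and inter-document similarity, and a linear trade-off controlled by a parameter called lam here. The names tf_vector, cosine, mmr_rerank, and lam are all hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (a stand-in for real IR weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query, documents, lam=0.5):
    """Greedily order documents by marginal relevance.

    lam = 1.0 ranks by query relevance alone; smaller values give more
    weight to diversity from the documents already placed in the ranking.
    Returns indices into `documents` in the new order.
    """
    q_vec = tf_vector(query)
    doc_vecs = [tf_vector(d) for d in documents]
    relevance = [cosine(q_vec, v) for v in doc_vecs]

    remaining = list(range(len(documents)))
    selected = []
    while remaining:
        def marginal(i):
            # Similarity to the closest document already scanned.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = [
    "earthquake in japan damages buildings",
    "earthquake in japan damages many buildings",
    "earthquake relief funds announced",
]
print(mmr_rerank("earthquake japan", docs, lam=1.0))  # [0, 1, 2]
print(mmr_rerank("earthquake japan", docs, lam=0.5))  # [0, 2, 1]
```

With lam = 1.0 the near-duplicate report stays in second place; with lam = 0.5 it drops below the report on a different aspect of the event, which matches the behaviour described above of reporting the most relevant event first and then switching to other relevant ones.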
Summarizing documents, sometimes called "abstracting", is as currently practiced a very labor-intensive procedure. For instance, the National Library of Medicine and professional societies such as Chemical Abstracts engage armies of technically proficient abstractors to read and summarize each document for entry into their databases. These abstracts, while professionally created at significant cost (i.e. time, people and money), are primarily one-size-fits-all summaries. There is only one given degree of abstraction, rather than a tunable parameter to express whether the reader wishes to read the headline, the one-paragraph abstract or the 2-page summary of the 40-page article. Moreover, a document could be important for different reasons, and the abstract is created without knowledge of which reasons the reader may deem important. This is necessarily the case because manual abstraction is performed only once, statically, before any reader enters the system to express his or her interests via queries or profiles.
In contrast, query-relevant abstraction extracts those passages of the document that are directly relevant to the user's profile or query. Redundancy among the extracted passages is then reduced by applying MMR techniques, and a document summary is formed.
Kupiec, J. M., Pedersen, J. and Chen, F., "A Trainable Document Summarizer", Proceedings of the 18th Annual Int. ACM/SIGIR Conference on Research and Development in IR, Seattle, WA. Pages 68-73. 1995.
Dumais, S., Landauer, T. and Littman, M., "Automatic Cross-Linguistic Information Retrieval using Latent Semantic Indexing", Proceedings of the 18th Annual Int. ACM/SIGIR Conference on Research and Development in IR, Zurich, 1996.
Unifying data mining and summarization from text with statistical learning methods for trend detection and exploratory data analysis in data mining from structured (numerical or symbolic) data