Summarization is a hard problem in Natural Language Processing because, to do it properly, one has to really understand the point of a text. This requires semantic analysis, discourse processing, and inferential interpretation (grouping of the content using world knowledge). The last step, especially, is complex, because systems without a great deal of world knowledge simply cannot do it. Therefore, attempts so far at performing true abstraction (creating abstracts as summaries) have not been very successful.
Fortunately, however, an approximation called extraction is more feasible today. To create an extract, a system need only identify the most important/topical/central topic(s) of the text and return them to the reader. Although the summary is not necessarily coherent, the reader can form an opinion of the content of the original. Most automated summarization systems today produce extracts only.
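As a rough illustration of extraction in general (not of SUMMARIST's own method), the sketch below scores each sentence by the average document frequency of its content words and returns the top-scoring sentences in their original order. The stopword list, the naive sentence splitter, and the function names are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "on", "for", "was"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]

def extract(text, k=1):
    """Return the k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document.
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

if __name__ == "__main__":
    print(extract("Cats chase mice. Cats sleep. The weather was rainy.", k=2))
```

Because the sentences are returned verbatim, the result may not read as a coherent text, which is exactly the limitation of extracts noted above.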
SUMMARIST is an attempt to develop robust extraction technology as far as it can go and then continue research and development of techniques to perform abstraction. This work faces the depth vs. robustness tradeoff: either systems analyze/interpret the input deeply enough to produce good summaries (but are limited to small application domains), or they work robustly over more or less unrestricted text (but cannot analyze deeply enough to fuse the input into a true summary, and hence perform only topic extraction). In particular, symbolic techniques, using parsers, grammars, and semantic representations, do not scale up to real-world size, while Information Retrieval and other statistical techniques, being based on word counting and word clustering, cannot create true summaries because they operate at the word (surface) level instead of at the concept level.
To date, SUMMARIST produces extract summaries in five languages (and has been linked to translation engines for these languages in the MuST system). Work is underway both to extend the extract-based capabilities of SUMMARIST and to build up the large knowledge collection required for inference-based abstraction.
We are building SUMMARIST, a system that combines symbolic concept-level world knowledge (embodied in ISI's ontology SENSUS, dictionaries, and similar resources) with robust NLP processing (using techniques from Information Retrieval and elsewhere) to overcome the problems of the depth/robustness tradeoff. SUMMARIST is based on the following 'equation':
Summarization = Topic Identification + Interpretation + Generation
For each step, the system hybridizes techniques as follows:
1. Topic Identification:
Generalizing word-level IR techniques and adding further topic-spotting techniques, we use SENSUS and dictionaries to perform 'concept counting' and generalization in order to identify important topics in the text. English, Japanese, Spanish, Indonesian, and Arabic preprocessing modules and lexicons provide multilingual capabilities. This is the most developed stage of SUMMARIST at this time.
2. Interpretation:
Training on Wall Street Journal and other texts, we employ statistical techniques from IR (word clustering, tf.idf, chi-squared) and cognitive psychology (latent semantic analysis, WordNet, etc.), as well as lexicons and dictionaries, to perform 'concept-based' topic fusion (interpretation) to find true summarizing concepts. To achieve the robust performance required for general utility, we are building a large collection of 'concept families', organized in the SENSUS ontology.
3. Generation:
We will develop three alternatives: a keyword lister; a phrase template generator; and one of ISI's sentence planners and sentence generators (Penman, NITROGEN). All three will provide hyperlinks from the summary back into the source document.
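The 'concept counting' and generalization idea from the Topic Identification step can be sketched as follows. This is only an illustration under stated assumptions: the tiny `PARENT` hierarchy is a hypothetical stand-in for an ontology such as SENSUS, and the `ratio` threshold and the stopping rule in `generalize` are assumptions for the example, not SUMMARIST's documented algorithm.

```python
from collections import Counter

# Hypothetical toy hierarchy (child -> parent), standing in for an ontology.
PARENT = {
    "apple": "fruit",
    "pear": "fruit",
    "banana": "fruit",
    "fruit": "food",
    "bread": "food",
    "food": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}

def concept_counts(words):
    """Count each word and propagate that count to all of its ancestors."""
    counts = Counter()
    for word in words:
        node = word
        while node is not None:
            counts[node] += 1
            node = PARENT.get(node)  # None once we pass the root
    return counts

def children(concept):
    """Direct children of a concept in the toy hierarchy."""
    return [c for c, p in PARENT.items() if p == concept]

def generalize(concept, counts, ratio=0.7):
    """Descend while a single child holds more than `ratio` of the weight.

    The walk stops at the first concept whose weight is spread over
    several children; that concept is taken as the generalization.
    """
    while True:
        kids = [c for c in children(concept) if counts[c] > 0]
        if not kids:
            return concept
        best = max(kids, key=lambda c: counts[c])
        if counts[best] / counts[concept] <= ratio:
            return concept
        concept = best

if __name__ == "__main__":
    counts = concept_counts(["apple", "pear", "banana", "apple", "car"])
    print(generalize("entity", counts))
```

In this toy run no single word dominates, yet the counts accumulate at a shared ancestor, so a text that mentions apples, pears, and bananas can be reported as being about fruit even though the word "fruit" never occurs. Plain word counting cannot do this.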
Prototypes of each portion of the system have already been built and are being evaluated separately. A formal evaluation of 18 systems was performed under the auspices of the TIPSTER research funding program in February 1989. For information about this evaluation (SUMMAC), contact Eduard Hovy (email@example.com) for a forwarding address.
Multilinguality: The system uses no parser or grammar. For Machine Translation and other purposes at ISI, we have built lexicons of English (90,000 items), Japanese (220,000), Spanish (45,000), Arabic (60,000), Indonesian (110,000), and Korean (110,000), most of which have been partially linked to SENSUS. The system's design therefore makes it possible to provide English summaries or keyword extracts of documents written in any of these languages. Chin-Yew Lin's recent work on embedding SUMMARIST in a multilingual web access and information retrieval system called MuST, with the addition of a shallow Indonesian-to-English translator, illustrates this approach. Click here for the MuST system.
Discourse-level processing: In order to produce a coherent, fluent summary, and to determine the flow of the author's argument, it is necessary to determine the overall discourse structure of the text. Daniel Marcu's work shows how, by first performing automated discourse analysis and then removing sentences peripheral to the main message of the text, it is possible to construct coherent extracts. This module will be added to SUMMARIST in the near future.
Indicator phrases: Phrases such as "in conclusion" and "note that" in some genres indicate important content. The project of Hao Liu focused on developing techniques to learn useful indicator phrases automatically. Click here for more information about the package and toolkit.
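To make the indicator-phrase idea concrete, here is a minimal sketch that flags sentences containing a cue phrase. It uses a fixed, hand-written list (the two phrases quoted above) and does not attempt the automatic learning step; the function names are assumptions made for this example.

```python
# Cue phrases quoted in the text; a learned list would replace this tuple.
INDICATOR_PHRASES = ("in conclusion", "note that")

def indicator_hits(sentence):
    """Return the indicator phrases that occur in the sentence."""
    lowered = sentence.lower()
    return [p for p in INDICATOR_PHRASES if p in lowered]

def flag_sentences(sentences):
    """Pair each sentence that contains a cue phrase with the phrases found."""
    flagged = []
    for sentence in sentences:
        hits = indicator_hits(sentence)
        if hits:
            flagged.append((sentence, hits))
    return flagged

if __name__ == "__main__":
    doc = [
        "We ran three experiments.",
        "Note that the second one failed.",
        "In conclusion, the method is robust.",
    ]
    for sentence, hits in flag_sentences(doc):
        print(hits, sentence)
```

A topic identification module could treat such hits as one signal among several, since the phrases are reliable only in some genres.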
SUMMARIST Prototype Screen Shot
Full screen shot of version 0.1 Alpha
Eduard Hovy -- senior project leader
Chin-Yew Lin -- research scientist
Daniel Marcu -- research scientist