August 9, 1998
Université de Montréal
Montréal, Québec, Canada


Eduard Hovy and Daniel Marcu
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292-6601

After lying dormant for over two decades, automated text summarization has experienced a tremendous resurgence of interest in the past few years. Research is being conducted in China, Japan, Europe, and North America, and industry has brought to market more than 30 summarization systems; most recently, two specialized workshops were devoted to the topic (the ACL/EACL-97 Workshop on Intelligent Scalable Text Summarization and the AAAI-98 Spring Symposium on Intelligent Text Summarization).

In this tutorial, we will review the state of the art in automatic summarization and critically evaluate current approaches to the problem. We will first outline the major types of summary: indicative vs. informative; abstract vs. extract; generic vs. query-oriented; background vs. just-the-news; single-document vs. multi-document; and so on. We will describe the typical decomposition of summarization into three stages (topic identification, topic fusion, and summary generation) and explain in detail the major approaches to each. For topic identification, we will outline techniques based on stereotypical text structure, cue words, high-frequency indicator phrases, intratext connectivity, and discourse structure centrality. For topic fusion, we will outline some ideas that have been proposed, including concept generalization and semantic association. For summary generation, we will describe the problems of sentence planning to achieve information compaction.

How good is a summary? Evaluation is a difficult issue. We will describe various suggested measures and discuss the adequacy of current evaluation methods. Whenever possible, we will make explicit the relationships between computational and psycholinguistic research on summarization.

Throughout, we will highlight the strengths and weaknesses of statistical and symbolic/linguistic techniques in implementing efficient summarization systems. We will discuss ways in which summarization systems can interact with and/or complement natural language generation, discourse parsing, information extraction, and information retrieval systems.

Finally, we will present a set of open problems that we perceive as being crucial for immediate progress in automatic summarization.

The tutorial is intended for a wide audience; only a minimal level of familiarity with traditional computational linguistics issues is required.

Eduard Hovy directs the Natural Language Group at the Information Sciences Institute of the University of Southern California. He currently leads a project that is building the summarization system SUMMARIST. Related work in which he has been or is currently involved includes concept clustering, large ontology construction, machine translation, and text generation.

Daniel Marcu's recently completed PhD at the University of Toronto develops theoretical and algorithmic foundations for rhetorical parsing and discourse-based summarization of unrestricted texts. His work spans a wide range of computational linguistics topics, including text and discourse theories, knowledge representation for natural language, and natural language generation.