The tool forms part of the work performed in the SUMMARIST project that is building an automated text summarization system.
Author and code developer:
Hao Liu, USC graduate student, 1997-98.
Advisor and contact point:
Information Sciences Institute
of the University of Southern California
4676 Admiralty Way
Marina del Rey, CA 90292-6695
ISI's indicator phrase package provides utilities to generate indicator phrases and use them to generate text extracts.
Indicator phrases are those phrases that signal important sentences in a text. Such sentences can be used to create an extract-type summary of the text. Example indicator phrases are "it is important", "note that", "in conclusion".
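As a rough illustration of the idea, the following Python sketch flags sentences that contain one of the three example phrases above. It is not part of the package: the function names are hypothetical, and plain case-insensitive substring matching is an assumption made here for brevity.

```python
# Illustrative only: the phrase list reuses the three examples from the text.
INDICATOR_PHRASES = ["it is important", "note that", "in conclusion"]


def find_indicator_phrases(sentence, phrases=INDICATOR_PHRASES):
    """Return the indicator phrases occurring in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return [p for p in phrases if p in lowered]


def flag_sentences(sentences, phrases=INDICATOR_PHRASES):
    """Return the sentences that contain at least one indicator phrase."""
    return [s for s in sentences if find_indicator_phrases(s, phrases)]
```

The flagged sentences are the candidates for an extract-type summary; the package itself learns its phrase lists from training data rather than using a fixed list.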
For a detailed discussion of indicator phrases and their power in text summarization, see (Liu and Hovy, 1999). We strongly recommend that you read this paper for a better understanding and use of this package. For additional information, see (Teufel and Moens, 1998) and (Kupiec et al., 1995).
Copy all files from /nfs/isd/liu/distribution/cuephrase/code to your own directory. Note that the current path for the Perl interpreter is /nfs/isd/cyl/perl/bin/perl. This path appears on the first line of every Perl script; you may need to change it to match your installation.
The package requires training data to generate indicator phrases. Training data are documents with existing summaries.
Documents for testing are not required to have summaries. However, if you want to use the utility provided in the package to measure the performance of the indicator phrases obtained, you need testing data with summaries; see below.
Before using this package, you must run ISI's SUMMARIST system to preprocess the training documents. See the SUMMARIST documentation for how to install and run it. Run SUMMARIST in single mode as instructed in Section 5. You need to run only the first three steps given in part b: SUMpreprocess.pl, SUMnormalize.pl, SUMmorhper.pl.
You have to separate summaries from texts and put them into different directories, such as ~data/training/texts and ~data/training/summaries. The text and its corresponding summary should have the same name. For an example, see /nfs/isd/liu/distribution/cuephrase/training.
Run gendocword.pl, gendocphrase.pl, and gendoctriple.pl to generate the lists for single words, pairs, and triples, respectively. They have the same syntax. The syntax for gendocword.pl is:
gendocword.pl -c config
where config is the name of the configuration file that specifies parameters for the utility. It has the following format:
TXTDIR specifies the directory holding the texts; SUMDIR specifies the directory holding the summaries; EXT specifies the extension of the preprocessed training documents' names; OUTPUTEXT specifies the extension of the generated item lists' names.
These utilities will generate a list of items for each document. The list file name has the stem name of the document and the extension specified in the configuration file. The list files are placed in the directory holding the document texts, that is, the TXTDIR field in the configuration file.
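To make the notion of word, pair, and triple items concrete, here is a minimal Python sketch. It assumes that pairs and triples are sequences of adjacent words, which the text above does not state, and it does not reproduce the SUMMARIST preprocessing. The function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, as tuples, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_items(sentences, n):
    """Collect the n-token items of every sentence (n=1 words, 2 pairs, 3 triples).

    Items never span a sentence boundary; tokenization is lowercase whitespace
    splitting, which is an assumption of this sketch.
    """
    items = []
    for sentence in sentences:
        items.extend(ngrams(sentence.lower().split(), n))
    return items
```

Calling document_items with n set to 1, 2, and 3 loosely corresponds to the three utilities named above.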
Run gencue.pl to generate indicator words/pairs/triples. The syntax is:
gencue.pl -c config
where config is the name of the configuration file. It has the following format:
DIR specifies the directory holding the files of items generated in the above step; EXT specifies the extension of these files; OUTPUT specifies the file containing the generated indicator phrases; CUE_NUMBER specifies the number of indicator phrases to generate.
This utility will generate a file containing generated indicator phrases, which is put in the directory DIR, specified in the configuration file.
For new documents, run findimp_normal.pl or findimp_weight.pl to generate document extracts using the indicator phrases found above. The two programs differ in how they score indicator phrases. findimp_normal.pl gives a score of 1 to each single indicator word, 2 to each indicator pair, and 3 to each indicator triple. In contrast, findimp_weight.pl uses the scores obtained during the training phase. See (Liu and Hovy, 1999) for a more detailed explanation.
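The fixed 1/2/3 scoring described for findimp_normal.pl can be sketched in Python as follows. This is only an approximation under stated assumptions: a sentence's score is taken to be the sum over all matching items, pairs and triples are treated as adjacent-word sequences, and ties are broken by document order. None of these details is confirmed by the text, and all names are hypothetical.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def score_sentence(tokens, cue_words, cue_pairs, cue_triples):
    """Score 1 per indicator word, 2 per indicator pair, 3 per indicator triple."""
    score = 0
    for weight, cues in ((1, cue_words), (2, cue_pairs), (3, cue_triples)):
        score += weight * sum(1 for gram in ngrams(tokens, weight) if gram in cues)
    return score


def extract(sentences, cue_words, cue_pairs, cue_triples, count):
    """Return the indices of the `count` highest-scoring sentences, in document order."""
    scored = [
        (score_sentence(s.lower().split(), cue_words, cue_pairs, cue_triples), i)
        for i, s in enumerate(sentences)
    ]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:count]
    return sorted(i for _, i in top)
```

A weighted variant in the spirit of findimp_weight.pl would replace the fixed weights with per-item scores learned during training.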
The two utilities have the same syntax:
findimp_normal.pl -c config
where config is the name of the configuration file. It has the following format:
TXTDIR specifies the directory holding testing texts; EXT specifies the extension of testing texts' names; CUETRIPLEFILE specifies the file containing indicator triples; CUEPHRASEFILE specifies the file containing indicator pairs; CUEWORDFILE specifies the file containing single indicator words; OUTPUTEXT specifies the extension of the generated extract files, each of which consists of the paragraph and sentence numbers of the extracted sentences; COUNT specifies the number of sentences to extract; CUENUMBER specifies the number of indicator phrases to use when extracting sentences.
A # in the configuration file tells the utility to ignore the line beginning with it. In the above example, the utility will only use single indicator words and indicator triples to generate extracts. This option is used only for CUEFILE parameters.
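The comment convention can be illustrated with a small Python parser. It assumes one KEY=value setting per line, as the SOURCEDIR=~/text style used elsewhere in this document suggests; the parser itself and the sample values in the usage note are hypothetical and are not taken from the package.

```python
def parse_config(text):
    """Parse KEY=value lines, skipping blank lines and lines that begin with '#'."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # a commented-out parameter is simply not set
        key, sep, value = line.partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params
```

For instance, given a file in which the CUEPHRASEFILE line starts with #, the returned dictionary has no CUEPHRASEFILE entry, so a caller would skip indicator pairs and use only the cue files that remain.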
For each document, this utility will generate a file containing a list of sentences extracted from the document as its extract summary. The files have the stem names of their documents and the extension specified by the OUTPUTEXT field in the configuration file.
To evaluate the performance of the indicator phrases, you can calculate the precision and recall scores of the sentences extracted in step C. To do so, you need for each text a reference extract summary against which to compare the sentences extracted above. The comparison utility that computes recall and precision is genprecall.pl. Its syntax is:
genprecall.pl -c config
where config is the name of the configuration file. It has the following format:
TEXTDIR specifies the directory holding the test texts; REFEXT specifies the extension of the files containing the correct extract sentences; CMPEXT specifies the extension of the files generated in step C. OUTPUT specifies the generated file containing precision and recall. TOTAL specifies the number of text sentences, which is used to calculate the actual compression ratio.
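For reference, the standard definitions of precision and recall over sets of extracted sentences can be sketched in Python as below. Whether genprecall.pl averages per document or pools all sentences is not stated here, so this sketch handles a single document. Identifying sentences by (paragraph, sentence) pairs and defining the compression ratio as extracted sentences divided by TOTAL are assumptions.

```python
def precision_recall(extracted, reference):
    """Compare extracted sentence ids with reference ids.

    precision = correct / number extracted
    recall    = correct / number in the reference extract
    """
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


def compression_ratio(num_extracted, total_sentences):
    """Fraction of the text's sentences kept in the extract (assumed definition)."""
    return num_extracted / total_sentences
```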
Edmundson (1969) studied the use of indicator phrases to construct extract summaries. He used a different method of scoring words, based on the difference in probability of occurrence in the text and in its summary.
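A probability-difference score of this general kind might look like the following Python sketch. It is not Edmundson's actual formula: estimating probabilities by relative frequency and subtracting the text probability from the summary probability are both assumptions, and the function names are hypothetical.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Estimate each token's probability as its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def probability_difference_scores(text_tokens, summary_tokens):
    """Score each summary word by p(word | summary) - p(word | text).

    Positive scores mark words that are relatively more frequent in the summary.
    """
    p_text = relative_frequencies(text_tokens)
    p_summary = relative_frequencies(summary_tokens)
    return {w: p_summary[w] - p_text.get(w, 0.0) for w in p_summary}
```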
To construct a list of indicator phrases using Edmundson's scoring, the steps are the same as in Section 1. However, you must run a different set of utilities. Also, you need to run the tool getimpsens.pl to retrieve important sentences and use them as summaries.
The utilities for generating document item lists are gendocword_prob.pl, gendocphrase_prob.pl, and gentriple_prob.pl. The usage of these utilities is the same as in Section 1. Note that SUMDIR should refer to the directory used by getimpsens.pl.
The utility for generating indicator items is gencue_prob.pl, with the same usage as gencue.pl.
The utilities for generating document extracts are the same as those in Section 1.
The package also contains tools to preprocess data, separate summaries from texts, divide data into training set and testing set, etc. This section describes these tools, which can be categorized into tools dealing with Ziff-Davis (ZF) data and Computational Linguistics (CL) data on the one hand, and tools dealing with preprocessing data on the other. The Ziff-Davis corpus is a collection of newspaper articles announcing new products, available from the TREC conference series; the Computational Linguistics corpus is a set of approx. 90 research papers downloaded from the online repository of CL papers.
This tool adds tags to the Computational Linguistics documents so that other tools can process them. The syntax is:
addtag.pl -c config
The config file has the following format:
SOURCEDIR=~/text
TODIR=~/doc
where SOURCEDIR gives the location of the original texts and TODIR indicates where to place the tagged files.
This tool compares two indicator item lists generated in Section B and returns the overlap of the two lists. The order of the items is not important. Usage:
comparecues.pl -c1 cuelist1 -c2 cuelist2 -n cuenumber
where cuelist1 and cuelist2 are the files containing the two lists, and cuenumber specifies the number of cue items to compare. For example, if it is set to 100, the first 100 items of each list will be compared.
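The comparison amounts to an order-insensitive set intersection over the first cuenumber items of each list. A minimal Python sketch follows; the function name is hypothetical, and the in-memory lists stand in for the two cue files, whose on-disk format is not described here.

```python
def cue_overlap(list1, list2, n):
    """Return the items shared by the first n entries of both lists.

    Order within each list is ignored; the result is sorted only to make
    the output deterministic.
    """
    return sorted(set(list1[:n]) & set(list2[:n]))
```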
This tool counts the number of texts and sentences in a directory. The sentence count supplies the TOTAL parameter in the configuration file for step D. Usage:
countsen.pl -d directory -e ext-of output
where ext gives the extension of files to be counted, for example, mph.
This tool retrieves summary and text from Computational Linguistics files and places them into separate directories. This tool prepares data for another tool, CMP2ZF.pl. Usage:
divideCMP.pl -sd srcdir -td tardir
where srcdir gives the directory holding the original Computational Linguistics files. The summaries and texts retrieved will be placed into the directories SUM and TEXT under tardir.
This tool gathers information about important sentences of the texts and writes it into files. The texts must first be processed by Daniel Marcu's algorithm that determines the important sentences (see Step 11 below). Marcu's program delineates these important sentences by <I> and </I>. The utilities in this package require that these marks be converted to the words BEGIN-IMPORTANT and END-IMPORTANT. This conversion is done by another utility, makedata.pl; see Step 7 below. Also, the texts should be preprocessed as described above in step 2 of section "Prepare Data". Usage:
gatherimp.pl -c config
where config has the following format:
TEXTDIR=~/text
These items have the same meanings as in the above examples.
This tool retrieves all important sentences of a text and places them into a file. The requirement for this data is the same as for gatherimp.pl. Usage:
getimpsens.pl -c config
where config has the following format:
TEXTDIR=~/text
These items have the same meanings as in the above examples.
The original Ziff-Davis data are stored in several huge files, each of which contains several documents. This tool retrieves summaries and texts for individual documents in these huge files and puts them into separate directories for summary and text. Also, this tool replaces <I> and </I> tags with the words BEGIN-IMP and END-IMP. The reason for this is given in the explanation of the tool gatherimp.pl. This tool also works for all files having Ziff-Davis format, including Computational Linguistics data after being processed by CMP2ZF.pl. Usage:
makeZFdata.pl -c config
where config has the following format:
DATADIR=~/data
ZF specifies the prefix of the huge files in DATADIR.
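The tag replacement that makeZFdata.pl is described as performing can be sketched in Python as a plain string substitution. The function name is hypothetical, and the spacing placed around the marker words is an assumption; the actual script may format its output differently.

```python
def replace_importance_tags(text):
    """Replace <I> ... </I> markup with the words BEGIN-IMP ... END-IMP."""
    # A space is added next to each marker so that it stays a separate word.
    return text.replace("<I>", "BEGIN-IMP ").replace("</I>", " END-IMP")
```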
This tool divides the whole data set into a training set and a test set according to the given ratio. Usage:
preparedata.pl -d ~/data -e mph -p 0.7 -d1 ~/train -d2 ~/test
Option -d specifies the directory holding the data, -e gives the extension of the files, -p specifies the proportion of the data to use as the training set, -d1 gives the directory to hold the training data, and -d2 gives the directory to hold the test data.
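A ratio-based split of this kind can be sketched in Python as follows. How preparedata.pl actually selects documents is not documented here; the seeded random shuffle, the rounding of the cut point, and the function name are assumptions of this sketch.

```python
import random


def split_dataset(names, train_ratio, seed=0):
    """Split document names into (training, test) lists by the given ratio."""
    names = sorted(names)              # fixed starting order for reproducibility
    random.Random(seed).shuffle(names)  # assumption: documents are chosen at random
    cut = int(round(len(names) * train_ratio))
    return names[:cut], names[cut:]
```

With a ratio of 0.7, as in the command above, 10 documents would yield 7 training documents and 3 test documents.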
This tool retrieves individual Ziff-Davis documents from the huge files described in makeZFdata.pl. Usage:
separate.pl -p ZF -d ~/data
Option -p specifies the prefix of the huge files, and -d specifies the directory to hold the individual documents.
Note: This tool should be run in the directory holding the original Ziff-Davis files.
This tool converts Computational Linguistics data into Ziff-Davis format. This conversion is required to run Daniel Marcu's algorithm (step 11). Normally, it is run after divideCMP.pl. Usage:
CMP2ZF.pl -c config
where the config file has the following format:
This tool, developed by Daniel Marcu of ISI, finds all important sentences in a text that contribute to human-created abstracts. For its usage, please contact Daniel Marcu at firstname.lastname@example.org.
This page was created by Hao Liu and is maintained by Katya Shuldiner.
Please send formatting suggestions to email@example.com and technical queries to firstname.lastname@example.org.