MuST: Multilingual Summarization and Translation

MuST (Multilingual Summarization and Translation) combines web access (or retrieval from a fixed set of documents) with text summarization and, on request, translation into English. You can translate the retrieved texts, their summaries, or both.

MuST is being built as a prototype under DARPA funding in the Natural Language Group of the Information Sciences Institute of the University of Southern California in Los Angeles. Other research performed in the Group is embodied in the systems mentioned below. The team building MuST consists of Dr. Chin-Yew Lin (research scientist and principal builder) and Dr. Eduard Hovy (PI and research scientist). Please email comments to cyl@isi.edu and hovy@isi.edu.

MuST includes several subsystems:

  • Links to six web search engines (Yahoo!, AltaVista, Hotbot, etc.), for access to the web in the normal way.
  • A link to QuTE, a system that performs crude surface-level translation of Indonesian to English. Because Indonesian word order closely parallels that of English, and Indonesian does not have complex morphology, QuTE produces translations that are good enough to convey the sense of a document, although they are nowhere near proper translation.
  • Links to the MT engine GAZELLE, which translates texts written in Arabic, Spanish, and Japanese into English. Because GAZELLE is a research system, it is not always online.
  • A link to SUMMARIST, which performs pure text summarization by extracting the most important sentences (in English, Japanese, Arabic, Spanish, and Indonesian). Input: one or more texts in a subdirectory. Output: extracts of the most important part(s) of each text, translated to English if the source is Indonesian, written to files in that same subdirectory. You can specify how long the summary should be, and you can supply a query of interest to guide which content the summary keeps.


Running MuST

To run MuST, please point your browser (Microsoft Internet Explorer or Netscape) to http://www.isi.edu/~cyl/must/must_beta.htm.

Perform the following steps:

1. Select the source you want to retrieve from by clicking on the right-hand side menu:

  • a daily updated collection of web news articles in Indonesian, collected by MuST's own web spider;
  • a similar daily updated collection for Spanish or English;
  • our local collection of 5000 documents in Spanish, Arabic, Japanese, Indonesian, and English;
  • regular AltaVista or one of the other search engines (English only).

2. Type your search word(s) in the window at left. (If you have selected a foreign-language collection or source, type a search word in that language, or a name such as Clinton or Albright.)

3. Click Search.

4. When the documents are listed in the middle window, you can view them by clicking on them. Sometimes a document is no longer available from its source; in that case, please choose another.

5. To summarize, click on the appropriate button in the middle bar. You can summarize the original or the translation.

6. To translate, click on the appropriate button in the middle bar. You can translate the original or the summary. (The Indonesian translation is quick and dirty; the others are much slower, and may not always be running.)

Response should be fairly quick. If you wait for more than 45 seconds, something is wrong. Please try again.

Remember please that these are research systems! Your comments are very welcome--we hope that this can be of use to someone!

For information, please contact:

Dr. Eduard Hovy or Dr. Chin-Yew Lin
Information Sciences Institute of USC
hovy@isi.edu or cyl@isi.edu
http://www.isi.edu/natural-language/nlp-at-isi.html


Advanced Search Operators

For the Indonesian news retrieval, MuST supports the following operators besides AND and OR (default):

W/n proximity
ADJ adjacency
NEAR/n near

Searching for Words that are Near Each Other

Note: You can enter query operators in upper- or lower-case. They are capitalized in the following examples only for purposes of clarity.

The Proximity Operator

You can use the proximity operator to search for word pairs in which the pair's second term occurs within a specified number of words after the first.

Note: The proximity operator does not work across field boundaries; you cannot use it to search for a word pair in which the words occupy separate fields within a record.

Syntax:

word1 W/n word2

where n is the number of words within which word2 must occur after word1. The variable n can be any integer greater than 0 (the actual limit is between 1,000,000,000 and 2,000,000,000). Stopwords and punctuation do not count as words in the range specified by n. The proximity operator is unidirectional from left to right. It retrieves only those records in which word2 occurs within n words after word1. Occurrences of word1 within n words after word2 are not considered hits.

Example:

amphibian W/5 DNA

This query will retrieve records in which DNA occurs within five words after amphibian.
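
The left-to-right behaviour of W/n can be illustrated with a small sketch. This is not the search engine's own code: the stopword list, the tokenization rule, and the function names below are assumptions made only for illustration.

```python
import re

# Hypothetical stopword list; the engine's actual list is not documented here.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def content_words(text):
    """Lowercase the text, drop punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in tokens if w not in STOPWORDS]

def matches_within(text, word1, word2, n):
    """Return True if word2 occurs within n counted words AFTER word1 (W/n)."""
    words = content_words(text)
    w1, w2 = word1.lower(), word2.lower()
    for i, w in enumerate(words):
        # Only look to the right of word1: the operator is unidirectional.
        if w == w1 and w2 in words[i + 1 : i + 1 + n]:
            return True
    return False
```

Under these assumptions, `matches_within("The amphibian genome contains repetitive DNA", "amphibian", "DNA", 5)` is a hit, while a text in which DNA comes before amphibian is not.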

The Adjacency Operator

The adjacency operator--ADJ--is equivalent to a proximity operator with a defined range of one word (i.e., W/1). Certain punctuation marks--hyphen, apostrophe, comma, and period--function as adjacency operators when they appear in the middle of a character string; they do not function as such if immediately preceded or followed by a character not recognized by PLWeb Turbo (e.g., a space).

Note: The adjacency operator does not work across field boundaries; you cannot use it to search for a word pair in which the words occupy separate fields within a record.

Syntax:

word1 ADJ word2
str-ing
str'ing
str,ing
str.ing

The adjacency operator is unidirectional from left to right. It retrieves only records in which word2 follows word1. Occurrences of word1 that follow word2 are not considered hits.

Examples:

great ADJ white
337-81-4417
O'Hara
2,001
Carcharodon.carcharias
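
The punctuation rule can be pictured as a rewrite of the query term into explicit ADJ form. The sketch below is only an approximation: it assumes that "recognized" characters are ASCII letters and digits, which the text above does not state, and the function name is hypothetical.

```python
import re

# A hyphen, apostrophe, comma, or period counts only when it sits directly
# between two recognized characters (assumed here: ASCII letters and digits).
_INNER_PUNCT = re.compile(r"(?<=[A-Za-z0-9])[-',.](?=[A-Za-z0-9])")

def expand_adjacency(term):
    """Rewrite embedded punctuation as explicit ADJ operators."""
    return " ADJ ".join(_INNER_PUNCT.split(term))
```

With this sketch, `expand_adjacency("337-81-4417")` gives `337 ADJ 81 ADJ 4417`, while a period followed by a space is left untouched.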

The Near Operator

The near operator duplicates the functions of the proximity and adjacency operators, with one exception: it is bidirectional; you can use it to search for word pairs in which the second term occurs within a specified number of words before or after the first.

If you specify a word range with it, the near operator functions as a bidirectional proximity operator. If no word range is specified, it serves as a bidirectional adjacency operator.

Note: The near operator does not work across field boundaries; you cannot use it to search for a word pair in which the words occupy separate fields within a record.

Syntax:

word1 NEAR/n word2
word1 NEAR word2

where n is the number of words within which word1 must occur before or after word2. The variable n can be any integer greater than 0 (the actual limit is between 1,000,000,000 and 2,000,000,000). Stopwords and punctuation do not count as words in the range specified by n.

Examples:

Whitewater NEAR/5 indictment
tax NEAR increase

The first query will retrieve records in which Whitewater occurs within five words of indictment. The second query will retrieve records in which tax occurs immediately before or after increase.
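
As a rough illustration of the bidirectional check, the following sketch compares word positions in either direction. The stopword list, the tokenization, and the function name are assumptions for illustration and are not taken from the search engine itself.

```python
import re

# Hypothetical stopword list; the engine's actual list is not documented here.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def near(text, word1, word2, n=1):
    """Return True if word1 and word2 occur within n counted words of each
    other, in either order. With the default n=1 this is bidirectional
    adjacency (plain NEAR)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    words = [w for w in tokens if w not in STOPWORDS]
    pos1 = [i for i, w in enumerate(words) if w == word1.lower()]
    pos2 = [i for i, w in enumerate(words) if w == word2.lower()]
    # Distance is measured in counted words; direction does not matter.
    return any(0 < abs(i - j) <= n for i in pos1 for j in pos2)
```

Under these assumptions, `near("The indictment in the Whitewater case", "Whitewater", "indictment", 5)` is a hit even though indictment comes first, and `near("increase the tax", "tax", "increase")` is a hit because the stopword between the two terms is not counted.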

