Multilingual Web Access and Text Management

C*ST*RD


Objective

Information access systems typically overwhelm the user with information. Whether they are Information Retrieval systems operating on text collections or Web access systems over the Internet, a query for information usually results in hundreds of documents. Seldom are all of these relevant. Some systems try to help by ranking the documents by appropriateness or listing their titles. But such help is often not enough: there may still be too many documents, some in other languages, many containing graphics, pictures, and tables, and formatted in various ways. The user still has to wade through overwhelming amounts of data.

This project constructed a powerful information analysis environment to help English-speaking users manage multilingual, multimedia information in simple, intuitive ways. It contained modules that performed multilingual document retrieval, clustering, text summarization, translation into English, and display. Some of the more complex of these modules--summarization and translation--are separately funded projects at ISI; this project focused on their integration.


Approach

What is required is a powerful information analysis environment that helps the user manage multilingual, multimedia information in simple, intuitive ways. The C*ST*RD project's MuST prototype system is such an environment. It incorporated a number of new techniques while leveraging several other technologies being developed at USC/ISI, and provided the following key functions:

Cluster / filter documents according to their major topics: several new clustering techniques (from our work on text summarization) were combined with a number of established ones (from Information Retrieval), and a new package implementing Lattice Theory was included.

Summarize clusters and individual documents: C*ST*RD used and extended SUMMARIST, the automated Text Summarization system being built at USC/ISI, applying techniques to identify the most salient portions of documents and to summarize collections of documents. SUMMARIST employed technology to create summaries of texts robustly, following a three-stage approach: topic identification; topic interpretation (concept fusion); and output generation. A set of topic identification modules produced extract-type summaries for texts in English, Japanese, Arabic, Spanish, Italian, Chinese, Indonesian, and French.

Translate other languages into English: the system was linked to SYSTRAN's commercial translation site (also known as Babelfish) for translating European languages into English, under a collaboration agreement obtained with the SYSTRAN management. The system was also linked to ISI's GAZELLE (JAPANGLOSS) project, developed under separate funding, for translation from Arabic, Japanese, and Spanish into English. In addition, a fast Indonesian-to-English translation system called SHALT, which produced skimming-quality output, was implemented and linked to the system.

Support interactive display: a friendly interface allowed the user to request, retrieve, summarize, and translate documents, at arbitrary levels of detail. The system was configured to support users of both Information Retrieval systems and Web access systems.
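
To make the notion of an extract-type summary concrete, the following is a minimal Python sketch. It is not SUMMARIST's actual method: it assumes a simple word-frequency heuristic as a stand-in for topic identification, skips the topic interpretation stage entirely, and treats output generation as returning the selected sentences verbatim. The stopword list and all function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny English stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "was", "on", "for", "it"}


def split_sentences(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Return lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def extract_summary(text, max_sentences=2):
    """Return up to max_sentences of the highest-scoring sentences, in document order."""
    sentences = split_sentences(text)

    # Assumed stand-in for topic identification: frequent content words mark the topic.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average corpus frequency of the sentence's content words.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score (stable sort keeps earlier sentences first on ties).
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)

    # Output generation for an extract: emit the chosen sentences unchanged, in original order.
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

Calling `extract_summary(document_text, max_sentences=1)` returns a one-sentence extract. A sketch like this handles only English-style word and sentence boundaries; languages such as Chinese or Japanese would need a different tokenizer.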


Accomplishments

The prototype system, MuST (MUltilingual Summarization and Translation), was used in several settings:

MuST was used for an extended period by PACOM (Hawaii), to monitor developments in Indonesia during 1998-2000.

MuST was used in a scenario-based evaluation study performed by MITRE.

The SUMMARIST engine of MuST was linked to the website of the commercial machine translation company SYSTRAN (also known as Babelfish) to provide summaries and translations of French, Italian, and Spanish texts, in addition to its operation on Chinese, Japanese, and Arabic.

MuST provided experience and some software upon which the spinoff company Infosqueeze was started.


Project Members

Eduard Hovy -- senior project leader

Chin-Yew Lin -- research scientist

Ulf Hermjakob -- research scientist


Current Status

The C*ST*RD project ended in January 2001.


Associated Projects

Dictionary Parsing Project

