The Webclopedia is intended to answer questions posed to it in various languages, drawing its answers from text collections and/or the web in multiple languages.
Examples of Webclopedia in operation, on the TREC corpus of 1 million documents.
In the TREC-9 QA competition, Webclopedia tied for second place with a score of 31%.
The Webclopedia interface (still under development). The architecture includes the following stages:
- Question parsing. Webclopedia uses the CONTEX parser to parse the user's question and identify the question operator (called the Qtarget), the desired topic, and additional specification details (called Qargs and Qwords).
- Retrieval. Webclopedia then creates a query and retrieves documents from the source corpus using the MG IR engine. Several increasingly general queries can be created (using stemming, query expansion lists, etc.).
- Segmentation. One of several segmenters splits the documents into topically cohesive segments.
- Ranking. The Ranker module ranks the segments according to their likelihood of containing an answer.
- Answer parsing. The sentences in the top-ranked segments are parsed by CONTEX.
- Matching. The Matcher applies several independent heuristics to each candidate sentence in order to pinpoint the answer(s). One set employs general question-answer patterns that express how portions of questions and answers relate within a CONTEX parse tree. Another computes the degree of overlap between the question tree and each candidate answer tree, taking into account the Qtarget, Qargs, and Qwords. Others implement the fallback strategy of scoring a fixed-length window of words by its overlap with the important words in the question.
- Answer selection. Finally, the Answer module compares and rates the candidate answers' scores, and decides whether an acceptable answer (or set of answers) has been found.
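The fallback matching strategy described above (scoring a fixed-length window of words by its overlap with the important question words) can be sketched roughly as follows. This is an illustrative sketch, not Webclopedia's actual code; the function name, window size, and scoring details are assumptions.

```python
# Minimal sketch of the fallback window-scoring heuristic: slide a
# fixed-length window over a candidate sentence and score each window
# by its overlap with the important question words (Qwords).

def best_window(sentence, qwords, size=5):
    """Return the window of `size` words with the highest overlap score."""
    words = sentence.lower().split()
    best, best_score = None, -1
    for i in range(max(1, len(words) - size + 1)):
        window = words[i:i + size]
        score = len(set(window) & set(qwords))
        if score > best_score:
            best, best_score = window, score
    return " ".join(best), best_score

# Example: Qwords extracted from "Who discovered America?"
qwords = {"discovered", "america"}
window, score = best_window(
    "some say columbus discovered america in the year 1492", qwords, size=3)
```

In the real system this heuristic is only a fallback, used when the parse-tree-based matchers fail to pinpoint an answer.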
The Webclopedia is based on the theory that questions fall into a natural typology, based on the semantics of their desired answers. That is, "who discovered America", "what is the name of the person who discovered America?", "what was the discoverer of America called?" are all essentially the same question, and require a Named-Person as an answer. In contrast, "who was Columbus?" is a different type of question (one we call Why-Famous), and requires a different type of answer. We have developed a typology of over 140 question-answer types and a corresponding set of numerous answer patterns, based on an analysis of several thousand questions.
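To illustrate the typology idea, here is a toy Qtarget classifier that distinguishes the two question types discussed above via surface patterns. The patterns and type names below are illustrative only; the actual typology contains over 140 types and the real system assigns Qtargets from full CONTEX parses, not regular expressions.

```python
import re

# Toy sketch: assign a Qtarget (expected answer type) to a question
# using surface patterns. Real Webclopedia uses parse trees instead.
QTARGET_PATTERNS = [
    # "Who was Columbus?" -- asks about a person's identity/fame
    (r"^who (was|is) [A-Z]\w*\??$", "WHY-FAMOUS"),
    # "Who discovered America?" -- asks for a person's name
    (r"^who .*\b(discovered|invented|wrote)\b", "NAMED-PERSON"),
    (r"^what is the name of the person who", "NAMED-PERSON"),
]

def qtarget(question):
    """Return the first matching question type, or UNKNOWN."""
    for pattern, qtype in QTARGET_PATTERNS:
        if re.search(pattern, question, re.IGNORECASE):
            return qtype
    return "UNKNOWN"
```

Note that pattern order matters here: "Who was Columbus?" must be caught by the Why-Famous pattern before any Named-Person pattern fires.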
The CONTEX parser is used to parse both questions and candidate answers. The question parse yields a list of semantic types of the likely answer(s), the Qtargets, as defined in the QA Typology, which are then matched against the parsed answer candidates in order to pinpoint answers. See below for details.
Automated grammar learning and parsing
CONTEX is a parser that produces syntactic-semantic analyses of sentences. CONTEX consists of two major parts, a grammar learner and a parser. The grammar learner uses machine learning techniques to induce a grammar (represented as parsing actions) from a set of training examples (sentences with their trees, produced by a human).
By having a human supervisor assist with training, CONTEX cuts down on the number of training examples it requires. A grammar of Korean was learned from scratch over a three-month period with the help of two graduate students (one to create a training set of 1100 trees; the other to put in place a part-of-speech tagger and other auxiliary software). The system performs at approx. 86% labeled bracketed precision and recall, tested on unseen sentences. The Japanese version of CONTEX performs at approx. 91%. The English version of CONTEX currently performs at about 92% labeled bracketed precision and recall, when trained on 2048 sentences. This figure is a few percent lower than that of the best English parsers in the world today; however, those parsers require more than 100x as much training data.
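The notion of a grammar represented as parsing actions can be illustrated with a toy example: given a human-annotated training tree, derive the shift-reduce action sequence that rebuilds it. The tree encoding and action names below are invented for this sketch; CONTEX's actual action inventory and features are richer.

```python
# Toy illustration of "grammar as parsing actions": traverse a training
# tree bottom-up and emit the shift-reduce actions that reconstruct it.
# Trees are nested tuples: (label, child, child, ...); leaves are words.

def tree_to_actions(tree):
    """Post-order traversal: each leaf becomes a SHIFT action,
    each internal node a REDUCE(label, arity) action."""
    if isinstance(tree, str):                  # a word (leaf)
        return [("SHIFT", tree)]
    label, children = tree[0], tree[1:]
    actions = []
    for child in children:
        actions.extend(tree_to_actions(child))
    actions.append(("REDUCE", label, len(children)))
    return actions

# (S (NP Columbus) (VP discovered (NP America)))
tree = ("S", ("NP", "Columbus"), ("VP", "discovered", ("NP", "America")))
actions = tree_to_actions(tree)
```

A learner in this style generalizes from many such (context, action) pairs, so that at parse time it can predict the next action for unseen sentences.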
In Webclopedia, continued development of the CONTEX parser supports several goals:
Demo of CONTEX.
The Webclopedia project is developing technology to produce text summaries automatically. Research focuses on a number of subtasks, including:
Automated summarization evaluation. We have been analyzing IBM's BLEU scoring method for MT and developing a corresponding evaluation methodology for summarization, which we call RED. We have determined that simple recall-based unigram overlap scores perform better for summarization evaluation than BLEU's combination of n-gram scores plus brevity penalty, and have suggested the need for additional rank-based comparison coefficients.
Locating training data with which to learn summarization engine parameters. We have developed a new method that makes available from the web tens of thousands of articles of training material in a variety of domains. We have shown that treating a weekly article (from, say, Time Magazine) as in effect a summary of that week's daily (say, newspaper) articles, and a monthly article as a summary of that month's daily and weekly articles, does not result in significant degradation compared to using real news summaries as training data. Our paper describes the daily-weekly-monthly (D-W-M) text alignment required and identifies a Yahoo! collection that can be used by anyone for training.
Single-document headline summarization. We have developed a new agglomerative learning-based method and tested it on 31 combinations of models. Although reasonably good in content, as measured in DUC-03 Test 1, the algorithm still needs the addition of grammatical information to achieve fluency. We plan to investigate both ngram models (as in IBM's MT language model) and grammatical rules to improve this aspect.
Manual summarization evaluation interface (SEE). Last year we built an interface called SEE (Summarization Evaluation Environment) to support manual summary evaluation. This tool was used by NIST to score the DUC-02 evaluation entries. At NIST's request, and under joint funding from DARPA and ARDA, we developed a new version of SEE that is more or less production-quality. This version was used by NIST to perform DUC-03 evaluations and worked flawlessly. It is also available for general distribution.
Participation in DARPA's Surprise Language Exercises. We participated in both the practice and the real exercises, and developed multilingual summarization (both full- document and headline summarization) capabilities. We also developed a version of the MuST framework in which CLIR, summarization, and MT functionalities were unified (see Cebuano demo). We also developed an entirely new interface for the full Surprise Language Exercise.
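The recall-based unigram overlap idea behind RED, mentioned above, can be sketched as follows. This is only the general idea, not RED's actual formula, and the function name is ours.

```python
from collections import Counter

# Sketch of a recall-oriented unigram overlap score: what fraction of
# the reference summary's words (with counts) appear in the candidate?

def unigram_recall(reference, candidate):
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped overlap: each reference word counts at most as often
    # as it occurs in the candidate.
    overlap = sum(min(cand[w], n) for w, n in ref.items())
    return overlap / sum(ref.values())

score = unigram_recall("columbus discovered america in 1492",
                       "america was discovered by columbus")
```

Unlike BLEU, which is precision-oriented with a brevity penalty, a recall-oriented score rewards a summary for covering the reference content, which matches the finding reported above.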
Earlier research on summarization at ISI focused on the development of SUMMARIST, a single-document multilingual text summarizer. SUMMARIST can summarize texts in English, Chinese, Arabic, Spanish, French, Italian, Japanese, and Bahasa Indonesia, and was used in the MuST system by the Pacific Command (PACOM) to monitor events in Indonesia in 1998--2000.
We build on SUMMARIST in Webclopedia, focusing on multi-document summarization. In collaboration with Daniel Marcu of the Rewrite project at ISI, we are starting to investigate the range of types of multi-document summaries (event stories, object descriptions, biographies, etc.), as well as methods for producing summaries of the more tractable of them. A recent comparison of several methods has found that for newspaper text several simple baseline methods work about as well as more sophisticated methods involving sentence clustering and filtering.
We have also built an interface with which summaries can be evaluated. SEE allows an assessor to compare the system's summary to a human's, at any level of granularity, and to tabulate findings, which are then tallied and converted to recall and precision scores. SEE is likely to be used by NIST in assessing the quality of summaries in the new Document Understanding Conference (DUC).
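The tallying step described above, converting an assessor's per-unit judgments into recall and precision, can be sketched as follows. The content units and matching scheme here are invented for illustration; SEE's actual judgment granularity and scoring are more flexible.

```python
# Illustrative tally in the spirit of SEE: compare a system summary's
# content units against a human reference summary's units.

def recall_precision(reference_units, system_units):
    """Recall: covered reference units / all reference units.
    Precision: matching system units / all system units."""
    matched = reference_units & system_units
    recall = len(matched) / len(reference_units)
    precision = len(matched) / len(system_units)
    return recall, precision

# Hypothetical content units judged equivalent by an assessor.
ref = {"columbus sailed 1492", "reached caribbean", "funded by spain"}
sys_ = {"columbus sailed 1492", "reached caribbean", "ships were small"}
r, p = recall_precision(ref, sys_)
```

In practice the assessor decides which units match (including partial matches); the interface then only needs to aggregate those judgments.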
Demo of MuST.
Demo of SUMMARIST.
Demo of MuST for the Cebuano Surprise Language interface.
Rapid ramp-up of new languages
In order to support our focus on multilingual language processing, we continue to explore methods to incorporate new languages rapidly. We have recently collected a large amount of Chinese text, several Chinese dictionaries, and a treebank of clauses. We are working with both simplified (mainland) and traditional (Taiwan) character sets.
CONTEX has been used to automatically learn grammars of English, Japanese, and Korean. SUMMARIST can summarize texts in English, Chinese, Arabic, Spanish, French, Italian, Japanese, and Bahasa Indonesia. With postdoctoral visitors during 2001--2002, we are developing Korean named entity taggers, part-of-speech taggers, and other software tools.