
Evaluating the System

SRI International participated in the recent MUC-3 evaluation of text-understanding systems (Sundheim, 1991). The methodology chosen for this evaluation was to score a system's ability to fill in slots in templates summarizing the content of newspaper articles approximately one page in length on Latin American terrorism. The template-filling task required identifying, among other things, the perpetrators and victims of each terrorist act described in the articles, the occupation of the victims, the type of physical entity attacked or destroyed, the date, the location, and the effect on the targets. Frequently, articles described multiple incidents, while other texts were completely irrelevant.

The following excerpt, from a news report dated March 30, 1989, is an example of a relatively short terrorist report:

A cargo train running from Lima to Lorohia was derailed before dawn today after hitting a dynamite charge.

Inspector Eulogio Flores died in the explosion.
The police reported that the incident took place past midnight in the Carahuaichi-Jaurin area.

Some of the corresponding database entries are as follows:

   Incident: Date                  30 Mar 89
   Incident: Location              Peru: Carahuaichi-Jaurin (area)
   Incident: Type                  Bombing
   Physical Target: Description    ``cargo train''
   Physical Target: Effect         Some Damage: ``cargo train''
   Human Target: Name              ``Eulogio Flores''
   Human Target: Description       ``inspector'': ``Eulogio Flores''
   Human Target: Effect            Death: ``Eulogio Flores''
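
One way to picture these entries is as a record with one field per slot. The following is a minimal sketch in Python, not the actual MUC-3 template format: the class and field names are hypothetical, and only the slots shown above are modeled.

```python
from dataclasses import dataclass


@dataclass
class PhysicalTarget:
    # Hypothetical field names; values come from the example entries above.
    description: str
    effect: str


@dataclass
class HumanTarget:
    name: str
    description: str
    effect: str


@dataclass
class IncidentTemplate:
    # One template summarizes one relevant incident.
    date: str
    location: str
    incident_type: str
    physical_target: PhysicalTarget
    human_target: HumanTarget


# The derailment report, filled in slot by slot.
template = IncidentTemplate(
    date="30 Mar 89",
    location="Peru: Carahuaichi-Jaurin (area)",
    incident_type="Bombing",
    physical_target=PhysicalTarget(description="cargo train", effect="Some Damage"),
    human_target=HumanTarget(
        name="Eulogio Flores", description="inspector", effect="Death"
    ),
)
```

A real template would presumably allow several targets per incident and leave slots empty when the text gives no information; this sketch ignores both for brevity.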

The fifteen participating sites were given a development corpus of 1300 such texts in October 1990. In early February 1991, the systems were tested on 100 new messages (the TST1 corpus), and a workshop was held to debug the testing procedure. In May 1991 the systems were tested on a new corpus of 100 messages (TST2); this constituted the final evaluation. The results were reported at a workshop at NOSC in May 1991.

The principal measures in the MUC-3 evaluation were recall and precision. Recall is the number of answers the system got right divided by the number of possible right answers. It measures how comprehensive the system is in its extraction of relevant information. Precision is the number of answers the system got right divided by the number of answers the system gave. It measures the system's accuracy. For example, if there are 100 possible answers and the system gives 80 answers and gets 60 of them right, its recall is 60% and its precision is 75%.
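
The arithmetic in this example can be checked with a short sketch. The function names below are illustrative, and the code follows only the simplified definitions given in this paragraph; returning 0.0 for an empty denominator is an assumption made here, not part of the evaluation's definition.

```python
def recall(correct: int, possible: int) -> float:
    """Right answers divided by the number of possible right answers."""
    # Assumption: report 0.0 when there is nothing to find.
    return correct / possible if possible else 0.0


def precision(correct: int, given: int) -> float:
    """Right answers divided by the number of answers the system gave."""
    # Assumption: report 0.0 when the system gave no answers.
    return correct / given if given else 0.0


# The worked example from the text: 100 possible, 80 given, 60 right.
possible, given, correct = 100, 80, 60
print(f"recall    = {recall(correct, possible):.0%}")   # 60%
print(f"precision = {precision(correct, given):.0%}")  # 75%
```

A system that answers only when it is sure raises its precision at the cost of recall, which is why the two measures are reported together.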

The database entries are organized into templates, one for each relevant event. In an attempt to factor out some of the conditionality among the database entries, recall and precision scores were given, for each system, for three different sets of templates: Matched Templates, Matched/Missing, and All Templates.

The results for TACITUS on the TST2 corpus were as follows.

                     Recall   Precision
Matched Templates     44%       65%
Matched/Missing       25%       65%
All Templates         25%       48%

Our precision was the highest of any of the sites participating in the evaluation. Our recall was somewhere in the middle.

We also ran our system, configured identically to the TST2 run, on the first 100 messages of the development set. The results were as follows:

                     Recall   Precision
Matched Templates     46%       64%
Matched/Missing       37%       64%
All Templates         37%       53%

Here recall was considerably better, as would be expected since the messages were used for development.

Although we are pleased with these overall results, a subsequent detailed analysis of our performance on the first 20 messages of the 100-message test set is much more illuminating for evaluating the success of the particular robust processing strategies we have chosen. In the remainder of this paper, we discuss the impact of the robust processing methods in the light of this detailed analysis.

We will divide our discussion into four parts: handling unknown words, our statistical relevance filter, syntactic analysis, and pragmatic interpretation. The performance of each of these processes will be described for Message 99 of TST1 (given in the Appendix) or for Message 100 of the development set (given in Section 5). Their performance on the first 20 messages of TST2 will then be summarized.


Jerry Hobbs 2004-02-24