In most information extraction applications implemented so far, the set of events of interest has been narrowly constrained. The topics have included joint ventures, microelectronics, terrorist incidents, management succession events, and so on (Hobbs et al., 1997). By contrast, in SRI's Message Handler System, all substantive events mentioned in the messages must be reported. We call the latter case an ``open domain'', as opposed to a closed domain. Open domains give the user greater capability in retrieving information, but require greater effort to develop.
In this report, we describe our development of an open-domain system for business news. The capabilities provided by this system will give users a rich basis from which to tailor a system to a particular set of needs. Users can select the open-domain patterns that are of immediate interest, constrain their arguments in some fashion, and thereby produce a special-purpose information extraction system in very short order.
We have focused on events described by clause-level verbs and their nominalizations. The typical newspaper article contains primarily three sorts of verbs, illustrated by the following sentence:
Presidio Oil Co. said yesterday it completed the purchase of Kaiser Energy Inc. for $210 million.
First, there are epistemic and communication verbs, such as ``said''. Generally these are not of primary interest, although we often need to recognize them to provide information about the modality of the event and the reliability of the information. We are building patterns to recognize these verbs so that they are treated appropriately not as events but as modifiers on events.
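As a rough illustration of this treatment, such verbs can be handled by lexical lookup that contributes modality and information-source modifiers to the embedded event rather than producing an event of its own. The following Python sketch is not SRI's actual mechanism; the verb list and all modality labels other than FACTUAL are invented.

```python
# Minimal sketch: communication/epistemic verbs contribute modality and
# info-source modifiers to the embedded event instead of becoming
# events themselves.  The verb list and most modality labels are
# illustrative placeholders.
COMMUNICATION_VERBS = {
    "say": "FACTUAL",       # "X said it completed ..." -> FACTUAL, source X
    "report": "FACTUAL",
    "deny": "DENIED",       # hypothetical label
    "expect": "EXPECTED",   # hypothetical label
}

def mark_embedded_event(verb_lemma, subject, event):
    """Record the communication verb as a modifier on the embedded event."""
    modality = COMMUNICATION_VERBS.get(verb_lemma)
    if modality is not None:
        event["Modality"] = modality
        event["Info-Source"] = subject
    return event

purchase = {"Event": "purchase", "Agent": "Presidio Oil Co."}
mark_embedded_event("say", "Presidio Oil Co.", purchase)
# purchase now carries Modality FACTUAL and Info-Source "Presidio Oil Co."
```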
Second, there are support verbs, such as ``completed''. These do not correspond to a reportable event; there is no separate ``completing'' event, distinct from the purchase, in the above example. They function as carriers of predicate-argument relations between subjects and clausal or nominalization complements. In the example, it is Presidio Oil Co. that purchases. In the open-domain system, these verbs are combined with their complements as complex verbs. Users who specify an interest in purchases do not also have to specify an interest in the completing of purchases; that happens automatically.
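A minimal sketch of the support-verb combination, under simplifying assumptions and with hypothetical names: the support verb's subject is passed down as the agent of the nominalization complement, so ``completed the purchase'' yields a single purchase event.

```python
# Illustrative sketch: a support verb carries no reportable event of
# its own, so its subject is handed down as the agent of the
# nominalization complement.  The verb list is a placeholder.
SUPPORT_VERBS = {"complete", "make", "undertake"}

def combine_support_verb(verb_lemma, subject, complement_event):
    """Treat support verb + nominalization as one complex verb:
    the support verb's subject fills the complement's Agent slot."""
    if verb_lemma in SUPPORT_VERBS and complement_event.get("Agent") is None:
        complement_event["Agent"] = subject
    return complement_event

# "Presidio Oil Co. ... completed the purchase of Kaiser Energy Inc."
purchase = {"Event": "purchase", "Agent": None,
            "Object": "Kaiser Energy Inc."}
print(combine_support_verb("complete", "Presidio Oil Co.", purchase))
# {'Event': 'purchase', 'Agent': 'Presidio Oil Co.', 'Object': 'Kaiser Energy Inc.'}
```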
Finally, there are the content verbs and their nominalizations, like ``purchase''. Much of the work in constructing the open-domain system has consisted of developing a library of the most commonly occurring verbs and nominalizations in a Wall Street Journal corpus. Our work has been corpus-based: we determined the most common verbs and nominalizations, and for those words we searched the corpus to determine all of the case frames in which they were used, together with the constraints on the arguments. The open-domain system thus has patterns that match any of the various linguistic constructions in which these words can occur, and it builds the corresponding case-frame template.
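By way of illustration only, a library entry might pair a word's verbal and nominalized forms with its case-frame slots and argument constraints. This is not the library's actual format; the type constraints below are placeholders, while the slot names follow the template shown next.

```python
# Illustrative case-frame library entry (not the actual library format).
# Each entry records the word's forms and its case-frame slots with
# simplified type constraints on the arguments.
CASE_FRAMES = {
    "purchase": {
        "forms": ["purchase", "purchased", "purchasing",
                  "purchase of"],                  # verb + nominalization
        "slots": {"Agent": "ORGANIZATION|PERSON",  # who purchases
                  "Object": "ORGANIZATION|PRODUCT",
                  "For": "MONEY",
                  "Location": "LOCATION",
                  "Time": "TIME"},
    },
}

def blank_template(word):
    """Build an empty case-frame template for a recognized word."""
    template = {"Event": word}
    template.update({slot: None for slot in CASE_FRAMES[word]["slots"]})
    template.update({"Info-Source": None, "Modality": "FACTUAL"})
    return template
```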
For the above example, the following template is built:
    Event:        ``purchase''
    Agent:        ``Presidio Oil Co.''
    Object:       ``Kaiser Energy Inc.''
    For:          ``$210 million''
    Location:     --
    Time:         --
    Info-Source:  ``Presidio Oil Co.''
    Modality:     FACTUAL
For some verbs, a slot will be filled by a pointer to another event. For example, for ``agree(ment)'', in addition to the two parties involved, there is the activity they agree to do.
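For illustration, with invented slot names, such an event-valued slot might look like this:

```python
# Hypothetical sketch of an event-valued slot: the Activity slot of an
# ``agree'' template points at another event template, not at a string.
merger = {"Event": "merge", "Agent": "Company A", "Object": "Company B"}
agreement = {
    "Event": "agree",
    "Party1": "Company A",       # slot names are illustrative
    "Party2": "Company B",
    "Activity": merger,          # pointer to the agreed-to event
}
```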
From our previous work in information extraction, we have developed a basic ontology of business news, with the capability of recognizing and constructing objects to encode information about such entities as people, organizations, and locations. In developing the open-domain system, we have expanded this ontology to include quite a number of other kinds of entities.
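A fragment of such an ontology might be sketched as a subclass hierarchy; the classes beyond people, organizations, and locations below are placeholders, not the actual inventory.

```python
# Illustrative ontology fragment as a subclass table; most class names
# here are placeholders for the expanded set of entity kinds.
ONTOLOGY = {
    "ENTITY": ["PERSON", "ORGANIZATION", "LOCATION",
               "MONEY", "TIME", "POSITION"],
    "ORGANIZATION": ["COMPANY", "GOVERNMENT-BODY"],
}

def is_a(cls, ancestor):
    """True if cls equals ancestor or is transitively listed under it."""
    if cls == ancestor:
        return True
    return any(is_a(cls, sub) for sub in ONTOLOGY.get(ancestor, []))

print(is_a("COMPANY", "ENTITY"))   # True
```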
The utility of an open-domain system for developing special-purpose information extraction systems can be illustrated by our efforts in preparing for the MUC-6 evaluation in September 1995. When we learned what the event of interest in the new domain was, we scanned the training texts to determine the verbs and nominalizations that encoded those events, categorized them into classes of verbs with the same case frames, and wrote subject-verb-object patterns for each class. This was a sequential process, rather than an interleaved one.
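Schematically, a subject-verb-object pattern for one such class might be expressed as a regular expression over a pre-chunked input in which noun groups and verb groups have already been bracketed by earlier finite-state stages. The bracket notation, the verb class, and the names below are invented for illustration.

```python
import re

# Schematic SVO pattern over pre-chunked input.  [NG ...] and [VG ...]
# stand for noun groups and verb groups found by earlier stages; the
# notation is illustrative, not FASTUS's actual pattern language.
SVO = re.compile(
    r"\[NG (?P<subj>[^\]]+)\]\s+"                  # subject noun group
    r"\[VG (?P<verb>succeed\w*|replace\w*)\]\s+"   # verb group for the class
    r"\[NG (?P<obj>[^\]]+)\]"                      # object noun group
)

m = SVO.search("[NG John Smith] [VG succeeds] [NG Mary Jones]")
if m:
    print({"Event": "succeed",
           "Agent": m.group("subj"),      # "John Smith"
           "Object": m.group("obj")})     # "Mary Jones"
```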
With the open-domain system augmented by a module for learning patterns that we are developing, users will be able to scan a text, tag the events of interest, move strings of text into a template, and have the system hypothesize the pattern that this action instantiates. But the system, in analyzing the tagged portion of the text, will be able to do much more than a crude analysis into verb groups and noun groups. In many cases it will know the verb and its case frame, and thus the roles the noun groups play. In effect, users, without realizing it, will be constructing a mapping not directly from the text to their templates but from the system's case frames to the templates, and that is a much more reliable process.
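Under strong simplifying assumptions, the hypothesis step might amount to aligning the system's case-frame analysis of a sentence with the strings the user has moved into template slots. Everything named below is hypothetical; the learning module itself is still under development.

```python
# Hedged sketch of the envisioned learning step: given the system's own
# case-frame fill for a sentence and the user's tagged template for the
# same sentence, hypothesize which case-frame role feeds which slot.
def hypothesize_mapping(case_frame_fill, user_template):
    """Map case-frame roles to user template slots by matching strings."""
    mapping = {}
    for role, value in case_frame_fill.items():
        for slot, tagged in user_template.items():
            if value == tagged:
                mapping[role] = slot
    return mapping

system_analysis = {"Agent": "Presidio Oil Co.",
                   "Object": "Kaiser Energy Inc."}
user_tags = {"Buyer": "Presidio Oil Co.",        # user's own slot names
             "Acquired": "Kaiser Energy Inc."}
print(hypothesize_mapping(system_analysis, user_tags))
# {'Agent': 'Buyer', 'Object': 'Acquired'}
```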
In addition, users will have direct access to the case frames. If they are interested, for example, in promotions, they can call up the case frame for ``promote'', restrict the arguments as they please (e.g., to the position of CEO only, or to Japanese companies only), and thereby obtain a pattern.
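A minimal sketch of such a restriction, with invented names and an invented Position slot, might wrap the case frame in a filter:

```python
# Illustrative sketch: specialize the ``promote'' case frame by
# restricting its arguments, yielding a special-purpose pattern.
def specialize(frame_word, restrictions):
    """Return a test that accepts only events meeting the restrictions."""
    def accept(event):
        return (event.get("Event") == frame_word and
                all(test(event.get(slot))
                    for slot, test in restrictions.items()))
    return accept

ceo_promotions = specialize("promote", {"Position": lambda v: v == "CEO"})
print(ceo_promotions({"Event": "promote", "Agent": "Acme Corp.",
                      "Object": "J. Doe", "Position": "CEO"}))  # True
```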
Moreover, the system will automatically make use of whatever inferential relations have been encoded. If users specify an interest in successions, not only would the subject-verb-object pattern for ``succeed'' be built, but so would patterns for verbs such as ``promote'' and ``resign''. Of course, not all ways of encoding a concept will be captured like this, but the most common ones will, and the users' tasks will be made immensely easier.
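The effect can be sketched with a small table of inferential links (the table itself is illustrative, not the system's encoding): an interest in one concept pulls in the patterns for concepts that imply it.

```python
# Sketch of encoded inferential relations: specifying an interest in
# one concept also activates patterns for the concepts that imply it.
IMPLIES = {
    "promote": ["succeed"],   # a promotion implies a succession
    "resign": ["succeed"],    # a resignation implies a succession
}

def patterns_for(concept):
    """All concepts whose patterns should fire for an interest in concept."""
    return [c for c, targets in IMPLIES.items() if concept in targets] + [concept]

print(patterns_for("succeed"))   # ['promote', 'resign', 'succeed']
```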
The current technology has hit a ceiling at around 60% recall and precision, and we believe the reason is that this is roughly how much information a typical text ``wears on its sleeve''. Getting the rest of the information requires inference. In the past, however, inferencing has been inefficient and knowledge-intensive, requiring a huge investment of labor in building knowledge bases. The approach described here is a step toward getting much of the benefit of inferential relations among concepts without losing the advantages of the cascaded finite-state technology. In a way, it is to inferencing what FASTUS has been to traditional parsing.