Next: The Basic Ontology Up: Open-Domain Information Extraction from Previous: Open-Domain System

Methodology

Subject-verb-object patterns were developed for the most frequent content words in the Wall Street Journal. The corpus of texts used for this analysis was approximately one thousand articles from the Wall Street Journal, from several consecutive days in each of the years 1987, 1988, and 1989.1 The articles were grouped by date and SGML-tagged.

A frequency count was run over the corpus and the words were listed by descending frequency. Different morphological forms of verbs, including nominalizations, were grouped together in the frequency count. For instance, ``pay'' is the 141st most frequent root, and listed under it were the forms ``pays'', ``paid'', ``payment'', and ``payments''. High-frequency verbs were selected for analysis if they met one of two criteria. First, verbs were chosen whose arguments (subject, object, recipient) were likely to be fairly constrained in their usage. Thus, ``have'' was out, and ``appoint'' was in. Second, verbs were chosen whose usage in the business news corpus was likely to be substantially different from their use in ordinary English, written or spoken.
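The grouping of morphological forms under a root can be sketched roughly as follows. This is only an illustration: the source does not say how the forms were grouped, so the hand-built `ROOT_OF` table and the function name `root_frequencies` are assumptions.

```python
from collections import Counter

# Hypothetical table mapping surface forms (including nominalizations) to a root.
# The source does not describe the grouping mechanism; this table is an assumption.
ROOT_OF = {
    "pay": "pay", "pays": "pay", "paid": "pay",
    "payment": "pay", "payments": "pay",
    "appoint": "appoint", "appoints": "appoint",
    "appointed": "appoint", "appointment": "appoint",
}

def root_frequencies(tokens):
    """Count tokens by root and return (root, count) pairs, most frequent first."""
    counts = Counter(ROOT_OF.get(tok.lower(), tok.lower()) for tok in tokens)
    return counts.most_common()

tokens = "The client pays a fee and the bank paid two payments".split()
ranked = root_frequencies(tokens)
```

Words missing from the table are counted under their own lowercased form, so the ranking still covers the whole vocabulary.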

For each of the words selected, a list was generated of all the sentences in the corpus in which that word appears. For each word, a chart was constructed with the head words of the subject, object, and prepositional objects. Where the head word alone did not capture the concept, prenominal nouns and adjectives were sometimes included. The following are six examples:

Chrysler Corp. estimates that health costs add $700 to the price of each of its cars, about $300 to $500 more per car than foreign competitors pay for health.

In an interest-rate options contract, a client pays a fee to a bank for custom-tailored protection against adverse interest-rate swings for a specified period.

Last year, Du Pont agreed to pay $4.5 million for rights to superconductor work at the University of Houston.

Congress still is struggling to dismantle the unpopular Catastrophic Care Act of 1988, which boosted benefits for the elderly and taxed them to pay for the new coverage.

Manville, a forest and building products concern, has offered to pay the trust $500 million for a majority of Manville's convertible preferred stock.

The trust, which was created as part of Manville's bankruptcy-law reorganization to compensate victims of asbestos-related diseases, ultimately expects to receive $2.5 billion from Manville, but its cash flow from investments has so far lagged behind its payments to victims.

These yield the following six entries in the chart for ``pay'':

Subject                Object            Object of ``to''  Object of ``for''

``competitors''        ``$700''          --                ``health''
``client''             ``fee''           ``bank''          ``protection''
``Du Pont''            ``$4.5 million''  --                ``rights''
--                     --                --                ``coverage''
``Manville''           ``$500 million''  ``trust''         ``stock''
``it'' [``Manville'']  ``payment''       ``victims''       --
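One way to hold such a chart in memory, so that each argument column can be inspected on its own, is sketched below. The record type `ChartRow`, its field names, and the helper `column_fillers` are illustrative assumptions; the cell values are copied from the chart for ``pay''.

```python
from collections import defaultdict
from typing import NamedTuple, Optional

class ChartRow(NamedTuple):
    """One chart entry; None stands for an empty cell ("--")."""
    subject: Optional[str]
    obj: Optional[str]
    to_obj: Optional[str]
    for_obj: Optional[str]

PAY_CHART = [
    ChartRow("competitors", "$700", None, "health"),
    ChartRow("client", "fee", "bank", "protection"),
    ChartRow("Du Pont", "$4.5 million", None, "rights"),
    ChartRow(None, None, None, "coverage"),
    ChartRow("Manville", "$500 million", "trust", "stock"),
    ChartRow("it [Manville]", "payment", "victims", None),
]

def column_fillers(chart):
    """Collect the non-empty head words seen in each argument column."""
    fillers = defaultdict(list)
    for row in chart:
        for column, head in row._asdict().items():
            if head is not None:
                fillers[column].append(head)
    return dict(fillers)

fillers = column_fillers(PAY_CHART)
```

Listing the fillers column by column is the view needed for the next step, where the words in each column are assigned to classes.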

The words in each argument column were examined to determine what class or classes they fell into. In this manner, the elements of an ontology of basic and complex entities were hypothesized. Superclass-subclass relations among the classes were recognized by frequent occurrences of alternations in the rules, and a hierarchy was built up. For example, the above table suggests the classes of Company, Person, and Organization for the Subject and the Object of ``to'', and Money for the Object. The Object of ``for'' can take a wide variety of types of entities and events. Company is a subclass of Organization. The class of Country is also common as the Subject and Recipient of ``pay''. The alternation of Country, Person, and Organization is frequent enough in the patterns that a superclass subsuming all three was posited: ``Coperorg''.
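The fragment of the hierarchy described here can be written down as a table of parent links. Only the relations stated in the text are encoded (Company under Organization; Country, Person, and Organization under Coperorg). The dictionary representation and the helper `is_a` are assumptions made for illustration.

```python
# Parent links for the classes named in the text. Money is mentioned as a class
# but no superclass is given for it, so it has no entry here.
PARENT = {
    "Company": "Organization",
    "Organization": "Coperorg",
    "Person": "Coperorg",
    "Country": "Coperorg",
}

def is_a(cls, ancestor):
    """Return True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)  # None once the top of the hierarchy is passed
    return False
```

With this representation, `is_a("Company", "Coperorg")` holds by following two links, while `is_a("Money", "Coperorg")` does not.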

At the beginning of the effort there was no fixed ontology. Rather, the ontology was built in an iterative process in which classes were added if their instances appeared frequently in the texts, and subclass relations were added if justified by the examples in the corpus. Analysis of more verbs led to further modifications of the ontology, so that it has evolved over the course of the project.

Finite-state rules encoding the pattern of usage for each verb were then written, with the arguments specified in terms of the categories supplied by the ontology.
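A rule of this kind might look roughly like the small deterministic automaton below. It is a sketch under several assumptions, not the rule formalism actually used: the input is taken to be a list of (category, text) chunks produced by an earlier stage, verbs are taken to be reduced to their root, and all names (`PAY_RULE`, `match`, the slot names) are hypothetical.

```python
# Categories assumed to fall under the superclass Coperorg.
COPERORG = {"Company", "Person", "Organization", "Country"}

def is_coperorg(category, text):
    return category in COPERORG

def is_money(category, text):
    return category == "Money"

def anything(category, text):
    return True

def is_word(category_name, word):
    """Build a test that accepts one specific word of one category."""
    def test(category, text):
        return category == category_name and text == word
    return test

# state -> list of (test, next_state, slot_name or None)
PAY_RULE = {
    0: [(is_coperorg, 1, "subject")],
    1: [(is_word("Verb", "pay"), 2, None)],
    2: [(is_money, 3, "object")],
    3: [(is_word("Prep", "to"), 4, None), (is_word("Prep", "for"), 5, None)],
    4: [(is_coperorg, 3, "to_object")],
    5: [(anything, 3, "for_object")],
}
ACCEPTING = {3}

def match(rule, chunks):
    """Run the automaton over the chunks; return the filled slots or None."""
    state, slots = 0, {}
    for category, text in chunks:
        for test, next_state, slot in rule.get(state, []):
            if test(category, text):
                if slot is not None:
                    slots[slot] = text
                state = next_state
                break
        else:
            return None  # no transition applies, so the rule does not match
    return slots if state in ACCEPTING else None

chunks = [("Person", "client"), ("Verb", "pay"), ("Money", "fee"),
          ("Prep", "to"), ("Organization", "bank"),
          ("Prep", "for"), ("Event", "protection")]
result = match(PAY_RULE, chunks)
```

The ``to'' and ``for'' phrases are optional and may repeat because state 3 is both accepting and the source of their transitions; the Object of ``for'' is left unconstrained, in line with the observation that it takes a wide variety of entities and events.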

The final step for each verb was to specify what syntactic variations the verb could participate in. Most can occur in active and passive clauses, infinitives, relative clauses, and so on. In addition, some are ``middle verbs''. That is, the object of the verb, used transitively, can be the subject of the verb, used intransitively:

They resumed the talks.
The talks resumed.

Verbs can also be ``symmetric''; that is, a ``with'' complement can be conjoined with the subject:

The company met with the union.
The company and the union met.

Verbs can also be nominalized. Two varieties of nominalizations are used here. ``Act nominalizations'' are those in which the noun refers to the event itself:

John acted hostilely.
John committed a hostile act.

``Actor nominalizations'' are those in which the noun refers to one of the participants of the event:

John acts.
John is an actor.

We use the term ``actor nominalization'' to cover cases where the referent is not just the agent but any participant in the event.

Japan exports rice to Russia.
Rice is a Japanese export.

IBM priced the computer at $15,000.
The price of the computer was $15,000.

The purpose of noting these syntactic facts is to enable the automatic generation of linguistic variants from the base subject-verb-object patterns.
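As a rough illustration of how such flags could drive variant generation, the sketch below expands a verb entry into slot-order templates. The `VerbEntry` record, its flag names, and the template notation (SUBJ, OBJ, COMP) are assumptions; inflection, infinitives, relative clauses, and nominalizations are left out to keep the example short.

```python
from dataclasses import dataclass

@dataclass
class VerbEntry:
    """Syntactic facts recorded for one verb (hypothetical representation)."""
    root: str
    participle: str
    passive: bool = True     # most verbs allow the passive
    middle: bool = False     # "They resumed the talks." / "The talks resumed."
    symmetric: bool = False  # "met with the union" / "the company and the union met"

def variants(v):
    """Return the slot-order templates licensed by the verb's flags."""
    out = [f"SUBJ {v.root} OBJ"]                       # active
    if v.passive:
        out.append(f"OBJ be {v.participle} by SUBJ")   # passive
    if v.middle:
        out.append(f"OBJ {v.root}")                    # object as intransitive subject
    if v.symmetric:
        out.append(f"SUBJ {v.root} with COMP")         # "with" complement
        out.append(f"SUBJ and COMP {v.root}")          # complement conjoined with subject
    return out

resume = VerbEntry("resume", "resumed", middle=True)
meet = VerbEntry("meet", "met", symmetric=True)
```

Here `variants(resume)` includes the middle template `OBJ resume`, and `variants(meet)` includes both `SUBJ meet with COMP` and `SUBJ and COMP meet`.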


Jerry Hobbs 2004-02-24