Next: TEMPLATE GENERATION Up: The Generic Information Extraction Previous: LEXICAL DISAMBIGUATION

COREFERENCE RESOLUTION

This module turns a tree-like semantic structure, in which there may be separate nodes for a single entity, into a network-like structure in which these nodes are merged. It resolves coreference for basic entities such as pronouns, definite noun phrases, and ``one'' anaphora. It also resolves reference for more complex entities like events. That is, an event that is partially described in the text may be identified with an event that was found previously; or it may be a consequence of a previously found event, as a death is of an attack; or it may fill a role in a previous event, as an activity does in a joint venture.
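The merging of nodes can be pictured with a small sketch. It assumes, purely for illustration, that mentions are plain strings and that merge decisions have already been made elsewhere; the class name MergeSets and its methods are hypothetical and not part of any MUC system. It uses a standard union-find structure so that every mention merged into the same entity ends up with the same representative node.

```python
class MergeSets:
    """Union-find over mention labels (illustrative only)."""

    def __init__(self):
        self.parent = {}  # mention -> parent mention; roots point to themselves

    def find(self, node):
        """Return the representative of the entity that `node` belongs to."""
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            # Path halving: point the node at its grandparent as we walk up.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def merge(self, a, b):
        """Record that mentions `a` and `b` refer to the same entity."""
        self.parent[self.find(a)] = self.find(b)


entities = MergeSets()
entities.merge("Toyota Motor Corp.", "the Japanese automaker")
entities.merge("the Japanese automaker", "it")
same = entities.find("it") == entities.find("Toyota Motor Corp.")  # True
```

After the merges, the three separate tree nodes behave as a single node in the network.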

Three principal criteria are used in determining whether two entities can be merged. The first is semantic consistency, usually as specified by a sort hierarchy. Thus, ``the Japanese automaker'' can be merged with ``Toyota Motor Corp.'' For pronouns, semantic consistency consists of agreement on number and gender, and perhaps on whatever properties can be determined from the pronoun's context; for example, in ``its sales'', ``it'' probably refers to a company.
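A minimal sketch of a consistency check against a sort hierarchy follows. The hierarchy itself, the sort names, and the functions ancestors and consistent are all assumptions made up for this example; a real system would use a much larger, hand-built hierarchy. Two sorts are treated as consistent when one subsumes the other.

```python
# Hypothetical toy sort hierarchy: child sort -> parent sort.
SORT_PARENT = {
    "automaker": "company",
    "company": "organization",
    "organization": "entity",
    "person": "entity",
}


def ancestors(sort):
    """Return the sort followed by all of its ancestors, most specific first."""
    chain = [sort]
    while sort in SORT_PARENT:
        sort = SORT_PARENT[sort]
        chain.append(sort)
    return chain


def consistent(sort_a, sort_b):
    """Two sorts are consistent if one of them subsumes the other."""
    return sort_a in ancestors(sort_b) or sort_b in ancestors(sort_a)


# "the Japanese automaker" (sort: automaker) vs. "Toyota Motor Corp." (sort: company)
print(consistent("automaker", "company"))  # True
print(consistent("person", "company"))     # False
```

Under this sketch, sibling sorts such as person and company are rejected even though they share the ancestor entity, which is the behavior one would want for blocking a merge.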

Second, and more generally, there are various measures of compatibility between entities; for example, the merging of two events may be conditioned on the extent of overlap between their sets of known arguments, as well as on the compatibility of their types.
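One possible compatibility measure for events can be sketched as follows. It assumes that each event's known arguments are stored as a mapping from role name to filler; the function name argument_overlap, the role names, and the scoring rule are illustrative choices, not a description of any particular system. A conflict on a shared role blocks the merge; otherwise the score is the fraction of known roles on which the two events agree.

```python
def argument_overlap(args_a, args_b):
    """Score two events by the overlap of their known arguments.

    Returns None if any shared role has conflicting fillers; otherwise the
    fraction of all known roles (across both events) that are shared.
    """
    shared = set(args_a) & set(args_b)
    if any(args_a[role] != args_b[role] for role in shared):
        return None  # incompatible: the events disagree on a role
    all_roles = set(args_a) | set(args_b)
    return len(shared) / len(all_roles) if all_roles else 0.0


attack_1 = {"perpetrator": "guerrillas", "target": "embassy"}
attack_2 = {"target": "embassy", "date": "Tuesday"}
attack_3 = {"target": "bank"}

score = argument_overlap(attack_1, attack_2)     # one shared role out of three
conflict = argument_overlap(attack_1, attack_3)  # None: the targets differ
```

A system would then merge two events only when the score exceeds some threshold and their types are also compatible.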

The third criterion is nearness, as determined by some metric. For example, we may want to merge two events only if they occur within some fixed number of sentences of each other (unless they are in The Financial Times). The metric of nearness may be something other than simply the number of words or sentences between the items in the text. For example, in resolving pronouns, we should favor the Subject over the Object in the previous sentence; this is simply measuring nearness along a different path.
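The following sketch shows one way such a nearness metric might rank candidate antecedents for a pronoun. The Mention record, the role ranking, and the default window of two sentences are all assumptions chosen for illustration. Candidates are ordered first by sentence distance and then by grammatical role, so that a Subject is preferred over an Object in the same sentence.

```python
from dataclasses import dataclass

# Assumed preference order among grammatical roles; lower is preferred.
ROLE_RANK = {"subject": 0, "object": 1}


@dataclass
class Mention:
    text: str
    sentence: int  # index of the sentence containing the mention
    role: str      # grammatical role, e.g. "subject" or "object"


def rank_antecedents(pronoun_sentence, candidates, window=2):
    """Order candidates by sentence distance, then by grammatical role.

    Candidates that follow the pronoun's sentence, or that lie more than
    `window` sentences back, are dropped.
    """
    in_range = [
        m for m in candidates
        if 0 <= pronoun_sentence - m.sentence <= window
    ]
    return sorted(
        in_range,
        key=lambda m: (pronoun_sentence - m.sentence, ROLE_RANK.get(m.role, 2)),
    )


candidates = [
    Mention("the plant", 1, "object"),
    Mention("Toyota", 1, "subject"),
    Mention("Honda", 0, "subject"),
]
ranked = [m.text for m in rank_antecedents(2, candidates)]
print(ranked)  # ['Toyota', 'the plant', 'Honda']
```

Here the Subject of the previous sentence outranks the Object of the same sentence, and both outrank a Subject two sentences back.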

These rules have to be developed manually (and by ``manually'' I mean ``cerebrally''). The sort hierarchy used in consistency checking is usually developed manually, although it would be interesting to know if researchers have begun to use WordNet or other thesauri for sort hierarchy development, or have attempted to use statistical means to infer a sort hierarchy.

The term ``discourse processing'' as used by MUC sites almost always means simply coreference resolution of application-relevant entities and events. There have been no serious attempts to recognize or use the structure of the text, beyond simple segmenting on the basis of superficial discourse particles for use in nearness metrics in coreference resolution.


Jerry Hobbs 2004-02-24