
INTRODUCTION

An information extraction system is a cascade of transducers, or modules, that at each step add structure and often lose information (ideally only irrelevant information) by applying rules that are acquired manually, automatically, or both.

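Thus, to describe an information extraction system is to answer the following questions: What are the transducers or modules? What are their inputs and outputs? What structure is added? What information is lost? What form do the rules take? How are the rules applied? How are the rules acquired?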

As an example, consider the parsing module. The parser is the transducer. The input is the sequence of words or lexical items that constitute the sentence. The output is a parse tree of the sentence. This adds information about predicate-argument and modification relations. Generally, no information is lost. The rules might be in the form of a unification grammar and be applied by a chart parser. The rules are generally acquired manually.
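The parser description above can be recorded as a small data structure, with one field per point it covers. The following is only a minimal sketch: the class name `ModuleDescription`, its field names, and the helper `describe` are illustrative assumptions, not terminology from the text.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModuleDescription:
    """Describes one module of the cascade (field names are hypothetical)."""
    transducer: str
    input: str
    output: str
    structure_added: str
    information_lost: str
    rule_form: str
    rule_application: str
    rule_acquisition: str


# The parsing module, filled in from the prose description above.
PARSER = ModuleDescription(
    transducer="parser",
    input="sequence of words or lexical items constituting the sentence",
    output="parse tree of the sentence",
    structure_added="predicate-argument and modification relations",
    information_lost="generally none",
    rule_form="possibly a unification grammar",
    rule_application="possibly a chart parser",
    rule_acquisition="generally manual",
)


def describe(module: ModuleDescription) -> str:
    """Render the description as one 'field: value' line per field."""
    return "\n".join(f"{name}: {value}" for name, value in asdict(module).items())
```

Calling `describe(PARSER)` yields eight lines, one per field, which makes it easy to compare modules side by side.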

Any system will be characterized by its own set of modules, but generally they will come from the following set, and most systems will perform the functions of these modules somewhere.

  1. Text Zoner, which turns a text into a set of text segments.

  2. Preprocessor, which turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes.

  3. Filter, which turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones.

  4. Preparser, which takes a sequence of lexical items and tries to identify various reliably determinable, small-scale structures.

  5. Parser, whose input is a sequence of lexical items and perhaps small-scale structures (phrases) and whose output is a set of parse tree fragments, possibly complete.

  6. Fragment Combiner, which tries to turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence.

  7. Semantic Interpreter, which generates a semantic structure or logical form from a parse tree or from parse tree fragments.

  8. Lexical Disambiguation, which turns a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates.

  9. Coreference Resolution, or Discourse Processing, which turns a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text.

  10. Template Generator, which derives the templates from the semantic structures.
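The cascade structure can be sketched by chaining functions so that each module consumes the output of the one before it. The sketch below covers only the first three modules, and their implementations are deliberately naive stand-ins: splitting on blank lines, splitting on sentence punctuation, and keyword matching are assumptions made for illustration, not the methods any actual system uses. All function names are hypothetical.

```python
import re
from functools import reduce
from typing import Any, Callable, List, Set

Stage = Callable[[Any], Any]


def text_zoner(text: str) -> List[str]:
    """Toy zoner: treat blank-line-separated blocks as text segments."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]


def preprocessor(segments: List[str]) -> List[List[str]]:
    """Toy preprocessor: segments -> sentences -> lexical items.

    A real lexical item carries lexical attributes; here it is just the
    lowercased word.
    """
    sentences = []
    for segment in segments:
        for sentence in re.split(r"(?<=[.!?])\s+", segment):
            words = re.findall(r"[A-Za-z0-9']+", sentence.lower())
            if words:
                sentences.append(words)
    return sentences


def make_filter(keywords: Set[str]) -> Stage:
    """Toy filter: keep only sentences containing a relevant keyword."""
    def keep_relevant(sentences: List[List[str]]) -> List[List[str]]:
        return [s for s in sentences if keywords & set(s)]
    return keep_relevant


def run_cascade(stages: List[Stage], text: str) -> Any:
    """Feed the output of each stage to the next, in order."""
    return reduce(lambda data, stage: stage(data), stages, text)


TEXT = "HEADER\n\nThe company bought a plant. It rained on Tuesday."
STAGES = [text_zoner, preprocessor, make_filter({"bought", "acquired"})]
result = run_cascade(STAGES, TEXT)
print(result)  # [['the', 'company', 'bought', 'a', 'plant']]
```

The later modules (preparser through template generator) would be appended to `STAGES` in the same way; because each stage is an ordinary function, a system can omit, merge, or reorder modules without changing `run_cascade`.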

I will elaborate on each of these modules in turn.


Jerry Hobbs 2004-02-24