An information extraction system is a cascade of transducers or
modules, each of which adds structure and often discards information
(ideally only irrelevant information) by applying rules that are
acquired manually, automatically, or both.
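
The cascade idea can be sketched in a few lines, assuming each transducer is modeled as a function from one representation to the next. The names `cascade`, `normalize_whitespace`, and `tokenize` are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
from functools import reduce
from typing import Any, Callable, List

# Assumption: a transducer is any function mapping one representation to another.
Transducer = Callable[[Any], Any]

def cascade(stages: List[Transducer]) -> Transducer:
    """Compose the stages left to right into a single transducer."""
    def run(data: Any) -> Any:
        return reduce(lambda acc, stage: stage(acc), stages, data)
    return run

def normalize_whitespace(text: str) -> str:
    # Loses information: the original layout (line breaks, repeated spaces).
    return " ".join(text.split())

def tokenize(text: str) -> List[str]:
    # Adds structure: explicit word boundaries.
    return text.split(" ")

pipeline = cascade([normalize_whitespace, tokenize])
result = pipeline("Acme   hired\nSmith")
print(result)
```

Each stage here both adds something (word boundaries) and drops something (layout), which is the trade-off the definition describes.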
Thus, to describe an information extraction system is to answer the
following questions:
- What are the transducers or modules?
- What are their input and output? Specifically,
  - What structure is added?
  - What information is lost?
- What is the form of the rules?
- How are the rules applied?
- How are the rules acquired?
As an example, consider the parsing module. The parser is the
transducer. The input is the sequence of words or lexical items that
constitute the sentence. The output is a parse tree of the sentence.
The parse tree adds information about predicate-argument and modification
relations. Generally, no information is lost. The rules might be in
the form of a unification grammar and be applied by a chart parser.
The rules are generally acquired manually.
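
As an illustration, the answers to these questions for the parsing module could be recorded in a small data structure. This is only a sketch: the class `ModuleDescription` and its field names are hypothetical, and the field values simply restate the parser example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModuleDescription:
    """One record per module, with one field per descriptive question."""
    name: str
    input_form: str
    output_form: str
    structure_added: str
    information_lost: str
    rule_form: str
    rule_application: str
    rule_acquisition: str

# Values restate the parser example from the text.
parser_description = ModuleDescription(
    name="Parser",
    input_form="sequence of words or lexical items in the sentence",
    output_form="parse tree of the sentence",
    structure_added="predicate-argument and modification relations",
    information_lost="generally none",
    rule_form="unification grammar",
    rule_application="chart parser",
    rule_acquisition="generally manual",
)

print(parser_description.rule_application)
```

Filling in one such record per module gives a uniform way to compare systems that divide the work differently.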
Any system will be characterized by its own set of modules, but
generally they will come from the following set, and most systems will
perform the functions of these modules somewhere.
- Text Zoner, which turns a text into a set of text segments.
- Preprocessor, which turns a text or text segment into a sequence
of sentences, each of which is a sequence of lexical items, where a
lexical item is a word together with its lexical attributes.
- Filter, which turns a set of sentences into a smaller set of
sentences by filtering out the irrelevant ones.
- Preparser, which takes a sequence of lexical items and tries
to identify various reliably determinable, small-scale structures.
- Parser, whose input is a sequence of lexical items and perhaps
small-scale structures (phrases) and whose output is a set of parse
tree fragments, possibly complete.
- Fragment Combiner, which tries to turn a set of parse tree or
logical form fragments into a parse tree or logical form for the whole
sentence.
- Semantic Interpreter, which generates a semantic structure or
logical form from a parse tree or from parse tree fragments.
- Lexical Disambiguation, which turns a semantic structure with
general or ambiguous predicates into a semantic structure with
specific, unambiguous predicates.
- Coreference Resolution, or Discourse Processing, which turns a
tree-like structure into a network-like structure by identifying
different descriptions of the same entity in different parts of the
text.
- Template Generator, which derives the templates from the
semantic structures.
I will elaborate on each of these modules in turn.
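
To make the input and output types of the first three modules concrete, here is a toy sketch. Everything in it is an assumption made for illustration: blank lines delimit text segments, periods end sentences, capitalization is the only lexical attribute, and relevance is decided by a keyword set. Real systems use far richer rules; the function names `text_zoner`, `preprocessor`, and `sentence_filter` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class LexicalItem:
    """A word together with its lexical attributes."""
    word: str
    attributes: Dict[str, bool] = field(default_factory=dict)

Sentence = List[LexicalItem]

def text_zoner(text: str) -> List[str]:
    # Toy assumption: blank lines separate text segments.
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]

def preprocessor(segment: str) -> List[Sentence]:
    # Toy assumption: periods end sentences and whitespace separates words.
    sentences: List[Sentence] = []
    for raw in segment.split("."):
        words = raw.split()
        if words:
            sentences.append(
                [LexicalItem(w, {"capitalized": w[0].isupper()}) for w in words]
            )
    return sentences

def sentence_filter(sentences: List[Sentence], keywords: Set[str]) -> List[Sentence]:
    # Keep only sentences that mention at least one keyword.
    return [
        s for s in sentences
        if any(item.word.lower() in keywords for item in s)
    ]

text = "HEADER\n\nAcme hired Smith. It rained. Smith left Acme."
segments = text_zoner(text)
sentences = [s for seg in segments for s in preprocessor(seg)]
relevant = sentence_filter(sentences, {"hired", "left"})
relevant_words = [[item.word for item in s] for s in relevant]
print(relevant_words)
```

The later modules (preparser through template generator) would consume `relevant` in the same style, each taking the previous module's output type as its input.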
Jerry Hobbs
2004-02-24