
FILTER

This module uses superficial techniques to filter out sentences that are likely to be irrelevant, turning the text into a shorter one that can be processed more quickly. Two principal methods are used. The first is keyword matching. In any particular application, subsequent modules will be looking for patterns of words that signal relevant events. If a sentence contains none of these words, there is no reason to process it further, so this module may scan each sentence for these keywords. The set of keywords may be developed manually or, more rarely (if ever), generated automatically from the patterns.
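The keyword method can be illustrated with a minimal Python sketch. The keyword set, the tokenizer, and the function names here are hypothetical and chosen only for illustration; in a real application the keywords would come from the patterns used by the later modules.

```python
import re

# Hypothetical keyword set for an illustrative domain (an assumption, not
# taken from any actual system).
KEYWORDS = {"bomb", "bombed", "attack", "attacked", "kidnapped"}

def tokenize(sentence):
    """Lowercase the sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def keyword_filter(sentences, keywords=KEYWORDS):
    """Keep only the sentences that contain at least one keyword."""
    return [s for s in sentences if keywords.intersection(tokenize(s))]

sentences = [
    "The mayor opened a new library on Tuesday.",
    "Guerrillas attacked the power station at dawn.",
    "A bomb exploded near the embassy.",
]
relevant = keyword_filter(sentences)
```

Here the first sentence is dropped because it contains none of the keywords, and the other two are passed on to the later modules. Matching on exact word forms is the simplest choice; a fuller version might match on stems or include inflected forms in the keyword set.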

Alternatively, a statistical profile of the words or n-grams that characterize relevant sentences may be generated automatically. The current sentence is scored against this profile and processed further only if its score exceeds some threshold.
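One possible form of such a profile is sketched below in Python. The text does not say which statistic is used, so everything specific here is an assumption made for illustration: the smoothed log-odds weighting, the averaging over n-grams, the default threshold of 0.0, the tiny training sets, and all function names.

```python
import math
import re
from collections import Counter

def ngrams(sentence, n=1):
    """Return the word n-grams of a sentence as tuples of lowercase tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_profile(relevant, irrelevant, n=1):
    """Give each n-gram a log-odds weight (add-one smoothed) that is positive
    when the n-gram is more typical of relevant sentences."""
    rel = Counter(g for s in relevant for g in ngrams(s, n))
    irr = Counter(g for s in irrelevant for g in ngrams(s, n))
    vocab = set(rel) | set(irr)
    rel_total = sum(rel.values()) + len(vocab)
    irr_total = sum(irr.values()) + len(vocab)
    return {
        g: math.log((rel[g] + 1) / rel_total) - math.log((irr[g] + 1) / irr_total)
        for g in vocab
    }

def score(sentence, profile, n=1):
    """Average weight of the sentence's n-grams; unseen n-grams count as 0."""
    grams = ngrams(sentence, n)
    if not grams:
        return 0.0
    return sum(profile.get(g, 0.0) for g in grams) / len(grams)

def statistical_filter(sentences, profile, threshold=0.0, n=1):
    """Keep only the sentences whose score exceeds the threshold."""
    return [s for s in sentences if score(s, profile, n) > threshold]

# Hypothetical hand-labeled training sentences.
relevant_train = [
    "Guerrillas attacked the power station.",
    "A bomb exploded near the embassy.",
]
irrelevant_train = [
    "The mayor opened a new library.",
    "The weather was mild on Tuesday.",
]
profile = build_profile(relevant_train, irrelevant_train)

new_sentences = [
    "Guerrillas attacked the embassy.",
    "The mayor opened the library.",
]
kept = statistical_filter(new_sentences, profile)
```

With this toy data the first new sentence scores above the threshold and is kept, while the second scores below it and is discarded. In practice the threshold would be tuned on held-out data to trade missed relevant sentences against processing time.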



Jerry Hobbs 2004-02-24