Next: The PHRASE PARSER Up: System Architecture Previous: The TOKENIZER

The PREPROCESSOR

The PREPROCESSOR accepts the tokens produced by the TOKENIZER as input and produces lexical items as output. A lexical item is defined as a token or sequence of tokens that has an entry in the system's lexicon. During this phase, multiwords are recognized. Proper names of individuals, locations and corporations are considered lexical items, and the PREPROCESSOR makes the first attempt to recognize them.

Case is very important for disambiguating proper and common nouns in English. In texts with both upper and lower case characters, capitalization provides very useful information about which words can or cannot be parts of names, which is not available in upper-case-only texts. Therefore, the PREPROCESSOR uses separate transducers for recognizing personal and corporate names for mixed case and upper-case-only texts.
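The choice between the two families of transducers can be sketched as a simple check on the text's letters. This is only an illustration: the text does not say how the system decides which family to apply, and the function names and the returned labels below are hypothetical.

```python
def is_upper_case_only(text: str) -> bool:
    """Return True if the text contains letters and none of them is lower case."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and not any(ch.islower() for ch in letters)


def select_name_transducers(text: str) -> str:
    """Pick a (hypothetical) label for the family of name transducers to run."""
    return "upper-case-only" if is_upper_case_only(text) else "mixed-case"


print(select_name_transducers("BRIDGESTONE SPORTS CO. SAID FRIDAY"))
print(select_name_transducers("Bridgestone Sports Co. said Friday"))
```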

There are three basic transducers for corporate names. The first operates on both upper-case-only and mixed case texts and recognizes company names that do not appear with a standard suffix like ``Inc.'' or ``GmbH.'' The second is a recognizer for corporate names in mixed case texts; it basically accepts all capitalized words preceding a suffix like ``Inc.,'' with some heuristics to avoid including capitalized words at the beginning of a sentence that are not part of the name. The third handles upper-case-only texts, which present more of a problem, because the simple expedient of accepting any noun group preceding the corporate suffix leads to overgeneration of company names, particularly when the words involved are lexically ambiguous. For example, a sentence like ``ALBION IRON & METAL SAW AN INCREASE IN PROFITS THIS YEAR'' would probably result in ``ALBION IRON & METAL SAW'' as the name of the company, because ``saw'' can be a noun as well as a verb. To prevent this kind of overgeneration, we restrict the words that can combine to form company names to members of a list of product words that are likely to occur in names. ``Iron'' and ``metal'' occur on this list, while ``saw'' does not.
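The product-word restriction for upper-case-only texts can be illustrated with a minimal sketch. It is not the system's transducer: the word list is a hypothetical stand-in for the real one, the starting position of the name is assumed to be known already (for instance because the first word is in the lexicon as a name), and the treatment of ``&'' as a connective is an assumption.

```python
# Hypothetical list of product words; the system's actual list is not given.
PRODUCT_WORDS = {"IRON", "METAL", "STEEL", "OIL"}
# Assumed connectives that may join product words inside a name.
CONNECTIVES = {"&"}


def extend_company_name(tokens, start, product_words=PRODUCT_WORDS):
    """Extend a name that begins at tokens[start] over following product words.

    A connective is absorbed only when a product word follows it directly.
    """
    end = start + 1
    i = end
    while i < len(tokens):
        if tokens[i] in product_words:
            i += 1
        elif (tokens[i] in CONNECTIVES and i + 1 < len(tokens)
              and tokens[i + 1] in product_words):
            i += 2
        else:
            break
        end = i
    return tokens[start:end]


sentence = "ALBION IRON & METAL SAW AN INCREASE IN PROFITS THIS YEAR".split()
print(" ".join(extend_company_name(sentence, 0)))  # ALBION IRON & METAL
```

Because ``SAW'' is not on the list, the name stops before it, even though ``saw'' could be read as a noun.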

This heuristic for recognizing company names in upper-case-only texts caused the most serious problem we encountered in the walkthrough example. The first sentence of this example is

BRIDGESTONE SPORTS CO. SAID FRIDAY IT HAS SET UP A JOINT VENTURE IN
TAIWAN WITH A LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF
CLUBS TO BE SHIPPED TO JAPAN.

It turns out that ``BRIDGESTONE'' is known in the lexicon to be the name of a company, but ``SPORTS'' was not on the list of product words. Therefore, the system recognized ``BRIDGESTONE'' alone as a company name and as the subject of the sentence, and ignored ``SPORTS CO.'' as an appositive.

When a company name is recognized, it is entered into the lexicon for the duration of the text, together with any possible aliases that can be predetermined. The lexicon is restored to its initial state at the end of a text, so any mistakes or perverse company names will have no effect on subsequent processing. For example, if an article mentions ``Next, Inc.'' it is important to recognize ``Next'' as a company name for the duration of the text, but keeping that entry could obviously cause havoc with other texts.
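One way to picture this per-text behavior is a temporary overlay on top of the permanent lexicon. The sketch below is an assumption about how such a mechanism could look, not a description of the system's implementation; the class name `TextLexicon`, its methods, and the string category labels are all hypothetical.

```python
from collections import ChainMap


class TextLexicon:
    """Permanent lexicon plus entries that live only for the current text."""

    def __init__(self, base):
        self._overlay = {}                       # per-text entries
        self._view = ChainMap(self._overlay, base)  # overlay shadows base

    def add_company(self, name, aliases=()):
        # Enter the name and its predetermined aliases for this text only.
        for form in (name, *aliases):
            self._overlay[form] = "company"

    def lookup(self, form):
        return self._view.get(form)

    def reset(self):
        # Restore the initial state at the end of a text.
        self._overlay.clear()


base = {"Next": "adjective"}
lexicon = TextLexicon(base)
lexicon.add_company("Next, Inc.", aliases=["Next"])
print(lexicon.lookup("Next"))  # company
lexicon.reset()
print(lexicon.lookup("Next"))  # adjective
```

Since the permanent entries are never modified, a mistaken company name disappears as soon as the overlay is cleared.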

In summary, the preprocessor performs the following functions:

In case of ambiguity, the longest phrase beginning at the current point in the input string is selected.
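The longest-match rule can be sketched as a greedy lookup over the token sequence. This is a minimal illustration under stated assumptions: the lexicon is modeled as a plain set of space-joined strings, `max_len` is a hypothetical bound on multiword length, and tokens with no lexicon entry are passed through as single items, which the text does not specify.

```python
def longest_lexical_item(tokens, start, lexicon, max_len=5):
    """Return (item, length) for the longest lexicon entry starting at `start`."""
    for length in range(min(max_len, len(tokens) - start), 0, -1):
        candidate = " ".join(tokens[start:start + length])
        if candidate in lexicon:
            return candidate, length
    # Assumption: an unknown token becomes a one-token item.
    return tokens[start], 1


def to_lexical_items(tokens, lexicon, max_len=5):
    """Segment tokens into lexical items, preferring the longest phrase."""
    items = []
    i = 0
    while i < len(tokens):
        item, length = longest_lexical_item(tokens, i, lexicon, max_len)
        items.append(item)
        i += length
    return items


lexicon = {"set", "set up", "a", "joint", "venture", "joint venture"}
print(to_lexical_items("it has set up a joint venture".split(), lexicon))
# ['it', 'has', 'set up', 'a', 'joint venture']
```

Although ``set'' and ``joint'' have entries of their own, the multiwords ``set up'' and ``joint venture'' are chosen because they are the longest phrases beginning at those points.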


Jerry Hobbs 2004-02-24