Next: Statistical Relevance Filter Up: Robust Processing of Real-World Previous: Evaluating the System

Handling Unknown Words

When an unknown word is encountered, three processes are applied sequentially.

  1. Spelling Correction. A standard algorithm for spelling correction is applied, but only to words longer than four letters.

  2. Hispanic Name Recognition. A statistical trigram model for distinguishing between Hispanic surnames and English words was developed and is used to assign the category Last-Name to some of the words that are not spell-corrected.

  3. Morphological Category Assignment. Words that are neither spell-corrected nor classified as last names are assigned a category on the basis of morphology. Words ending in ``-ing'' or ``-ed'' are classified as verbs. Words ending in ``-ly'' are classified as adverbs. All other unknown words are taken to be nouns. This scheme misses adjectives entirely, but the omission is generally harmless, because adjectives incorrectly classified as nouns will still parse as prenominal nouns in compound nominals. The grammar will recognize an unknown noun as a name in the proper environment.
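The sequential application of the three heuristics can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system's actual code: the text does not say which spelling-correction algorithm or which trigram model was used, so both are passed in as hypothetical callables (`correct_spelling`, `is_hispanic_surname`), and the names `handle_unknown_word` and `morphological_category` are invented here. Only the suffix rules and the four-letter threshold are taken from the description above.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: the source does not specify these components.
SpellCorrector = Callable[[str], Optional[str]]  # returns a correction, or None
SurnameTest = Callable[[str], bool]              # True if judged a Hispanic surname


def morphological_category(word: str) -> str:
    """Assign a category from the word's suffix (step 3)."""
    lower = word.lower()
    if lower.endswith("ing") or lower.endswith("ed"):
        return "Verb"
    if lower.endswith("ly"):
        return "Adverb"
    return "Noun"  # default; adjectives also end up here


def handle_unknown_word(
    word: str,
    correct_spelling: SpellCorrector,
    is_hispanic_surname: SurnameTest,
) -> Tuple[str, str]:
    """Apply the three heuristics in order; return (heuristic, result)."""
    # Step 1: spelling correction, only for words longer than four letters.
    # Assumption: length is counted in characters, hyphens included.
    if len(word) > 4:
        corrected = correct_spelling(word)
        if corrected is not None:
            return ("Spelling", corrected)
    # Step 2: surname recognition for words that were not spell-corrected.
    if is_hispanic_surname(word):
        return ("Surname", "Last-Name")
    # Step 3: everything else gets a category from its morphology.
    return ("Morphological", morphological_category(word))
```

With stub components that never fire, for instance `handle_unknown_word("upriver", lambda w: None, lambda w: False)`, the word falls through to the third step and is assigned Noun, which is the default case of the suffix rules.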

There were no unknown words in Message 99, since all the words used in the TST1 set had been entered into the lexicon.

In the first 20 messages of TST2, there were 92 unknown words. Each heuristic either applied to a given word or did not; when it applied, the result was judged correct, harmless, or wrong. An example of a harmless spelling correction is the change of ``twin-engined'' to the adjective ``twin-engine''. A wrong spelling correction is the change of the verb ``nears'' to the preposition ``near''. An example of a harmless Hispanic surname assignment is the Japanese name ``Akihito''. A wrong assignment is the word ``panorama''. A harmless morphological assignment is the assignment of Verb to ``undispute'' and ``originat''. A wrong assignment is the assignment of Noun to ``upriver''.

The results were as follows:

               Unknown  Applied  Correct  Harmless  Wrong
Spelling            92       25        8        12      5
Surname             67       20        8        10      2
Morphological       47       47       29        11      7

If we look just at the Correct column, only the morphological assignment heuristic is at all effective: it is correct 62% of the time it applies, as opposed to 32% for spelling correction and 40% for Hispanic surname assignment. However, harmless assignments are often much better than merely harmless; they often allow a sentence to parse that otherwise would not, thereby making other information in the sentence available to pragmatic interpretation. If we count both the Correct and Harmless columns, then spelling correction is effective 80% of the time, Hispanic surname assignment 90% of the time, and morphological assignment 85% (40 of 47).

Using the three heuristics in sequence meant that 85% of the unknown words (78 of 92) were handled either correctly or harmlessly.
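The percentages quoted above can be rederived from the counts in the results table. The short sketch below does so; the counts are copied from the table, while the dictionary layout and the function names (`rates`, `overall_ok_rate`, `stages_consistent`) are illustrative choices made here. It also checks that each stage receives exactly the words the previous stage did not handle.

```python
# Counts from the results table: (unknown, applied, correct, harmless, wrong).
RESULTS = {
    "Spelling": (92, 25, 8, 12, 5),
    "Surname": (67, 20, 8, 10, 2),
    "Morphological": (47, 47, 29, 11, 7),
}


def rates(applied: int, correct: int, harmless: int) -> tuple:
    """Return (correct rate, correct-or-harmless rate) among applied cases."""
    return correct / applied, (correct + harmless) / applied


def overall_ok_rate(results: dict) -> float:
    """Share of all unknown words handled correctly or harmlessly."""
    total_unknown = next(iter(results.values()))[0]  # input to the first stage
    ok = sum(correct + harmless for _, _, correct, harmless, _ in results.values())
    return ok / total_unknown


def stages_consistent(results: dict) -> bool:
    """Check row sums, and that each stage passes on what it did not handle."""
    rows = list(results.values())
    for unknown, applied, correct, harmless, wrong in rows:
        if correct + harmless + wrong != applied:
            return False
    return all(
        rows[i][0] - rows[i][1] == rows[i + 1][0] for i in range(len(rows) - 1)
    )


if __name__ == "__main__":
    for name, (_, applied, correct, harmless, _) in RESULTS.items():
        c, ok = rates(applied, correct, harmless)
        print(f"{name}: correct {c:.1%}, correct or harmless {ok:.1%}")
    print(f"Overall: {overall_ok_rate(RESULTS):.1%}")
```

Rates here are taken over the Applied column rather than the Unknown column, which is the reading that matches the percentages in the text.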


Jerry Hobbs 2004-02-24