This module takes the text as a character sequence, locates the sentence boundaries, and produces for each sentence a sequence of lexical items. The lexical items are generally the words, together with the lexical attributes recorded for them in the lexicon. The module minimally determines the possible parts of speech for each word, and may choose a single part of speech; it makes the lexical attributes in the lexicon available to subsequent processing. It recognizes multiwords. It recognizes and normalizes certain basic types that occur in the genre, such as dates, times, personal and company names, locations, and currency amounts. It handles unknown words, minimally by ignoring them, or more generally by trying to guess as much information about them as possible from their morphology or their immediate context. Spelling correction is also done in this module.
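As an illustration of the kind of output such a module produces, the following sketch uses a hypothetical miniature lexicon and a deliberately crude sentence splitter, neither drawn from any particular system: sentence boundaries are located and each sentence is turned into a sequence of lexical items carrying the parts of speech listed in the lexicon, with unknown words passed through with no attributes.

```python
import re
from dataclasses import dataclass, field

# Hypothetical miniature lexicon: each entry lists the possible parts of
# speech and any other lexical attributes available for the word.
LEXICON = {
    "the":     {"pos": ["det"]},
    "board":   {"pos": ["noun", "verb"]},
    "met":     {"pos": ["verb"]},
    "on":      {"pos": ["prep"]},
    "tuesday": {"pos": ["noun"], "type": "day-of-week"},
}

@dataclass
class LexicalItem:
    token: str
    pos: list = field(default_factory=list)    # possible parts of speech
    attrs: dict = field(default_factory=dict)  # other lexical attributes

def split_sentences(text):
    # Crude sentence-boundary detection on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def lexical_items(sentence):
    items = []
    for token in re.findall(r"\w+|[^\w\s]", sentence):
        entry = LEXICON.get(token.lower(), {})
        # Unknown words fall through with empty attribute sets; a fuller
        # treatment would guess from morphology or context.
        items.append(LexicalItem(token, entry.get("pos", []), entry))
    return items

for sent in split_sentences("The board met on Tuesday. Garthwaite objected."):
    print([(item.token, item.pos) for item in lexical_items(sent)])
```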
The methods used here are lexical lookup, perhaps in conjunction with morphological analysis; possibly statistical part-of-speech tagging; finite-state pattern matching for recognizing and normalizing basic entities; standard spelling-correction techniques; and a variety of heuristics for handling unknown words.
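The finite-state pattern matching for basic types can be as simple as a set of regular expressions with associated normalization routines. The sketch below is illustrative only, assuming two made-up patterns for currency amounts and dates rather than any particular system's grammar.

```python
import re

MONTHS = {m: i for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Illustrative regular-expression patterns for two of the basic types
# mentioned above; a real system would use a much larger pattern set.
PATTERNS = [
    ("currency", re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|billion)?", re.I)),
    ("date", re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b")),
]

def normalize_basic_types(text):
    """Return (type, surface string, normalized value) triples for each match."""
    results = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            if label == "currency":
                amount = float(m.group(1).replace(",", ""))
                scale = {"million": 1e6, "billion": 1e9}.get((m.group(2) or "").lower(), 1)
                results.append((label, m.group(0), amount * scale))
            else:
                month, day, year = m.groups()
                results.append((label, m.group(0), f"{year}-{MONTHS[month]:02d}-{int(day):02d}"))
    return results

print(normalize_basic_types("They paid $4.5 million on October 3, 1995."))
# [('currency', '$4.5 million', 4500000.0), ('date', 'October 3, 1995', '1995-10-03')]
```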
A lexicon might have been developed manually or borrowed from another site, but increasingly lexicons are adapted from existing machine-readable dictionaries and augmented automatically by statistical techniques operating on the key templates and/or the corpus.
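One simple form of statistical augmentation, offered only as an illustration (the suffix-based model and the tiny tagged corpus below are invented for the example), is to estimate part-of-speech distributions for word endings from whatever tagged material is available and use them to propose entries for words the lexicon does not yet cover.

```python
from collections import Counter, defaultdict

# Tiny, invented tagged corpus standing in for the tagged material or key
# templates a real system would have available.
TAGGED_CORPUS = [
    ("acquisition", "noun"), ("negotiation", "noun"), ("completion", "noun"),
    ("acquired", "verb"), ("merged", "verb"), ("emerged", "verb"),
    ("quickly", "adv"), ("finally", "adv"),
]

def suffix_model(corpus, length=3):
    """Count how often each word-final character sequence carries each tag."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word[-length:]][tag] += 1
    return counts

def propose_entry(word, model, length=3):
    """Propose a lexicon entry for an unknown word from its suffix."""
    tags = model.get(word[-length:])
    if not tags:
        return {"pos": []}  # nothing to propose; leave the word unknown
    return {"pos": [tag for tag, _ in tags.most_common()]}

model = suffix_model(TAGGED_CORPUS)
print(propose_entry("liquidation", model))  # {'pos': ['noun']}
print(propose_entry("divested", model))     # {'pos': []} -- no matching suffix
```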