The TOKENIZER is a simple transducer that accepts ASCII
characters as input and produces a stream of tokens as output.
The tokenizer performs the following functions:
- Groups characters into "words".
- Computes the value of numeric tokens.
- Detects abbreviations and determines sentence boundaries.
- Normalizes corporate prefixes and suffixes such as P.T. and Inc.
Ambiguities are resolved in favor of the longest token that can
be formed starting at the current position in the input stream.
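The longest-match rule can be sketched as follows. This is a hypothetical illustration, not the original implementation: the token classes, patterns, and function names are all assumptions. Every pattern is tried at the current position, and the longest match wins, so a string like "P.T." becomes one abbreviation token rather than four separate tokens.

```python
import re

# Illustrative token classes; a real tokenizer would have more.
PATTERNS = [
    ("NUMBER", re.compile(r"\d+(?:\.\d+)?")),     # value is computed below
    ("ABBREV", re.compile(r"(?:[A-Za-z]+\.)+")),  # e.g. "P.T.", "Inc."
    ("WORD",   re.compile(r"[A-Za-z]+")),
    ("PUNCT",  re.compile(r"[^\sA-Za-z0-9]")),
]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try every pattern at position i and keep the longest match.
        best_kind, best_lex = None, ""
        for kind, pat in PATTERNS:
            m = pat.match(text, i)
            if m and len(m.group()) > len(best_lex):
                best_kind, best_lex = kind, m.group()
        # Compute the value of numeric tokens.
        value = None
        if best_kind == "NUMBER":
            value = int(best_lex) if best_lex.isdigit() else float(best_lex)
        tokens.append((best_kind, best_lex, value))
        i += len(best_lex)
    return tokens
```

Note that this sketch commits to any period-final letter run as an abbreviation; a real tokenizer would consult a known-abbreviation list before doing so, since an unrecognized final period may instead mark a sentence boundary.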
The walkthrough text does not present any unusual difficulties
for the TOKENIZER.
Jerry Hobbs
2004-02-24