In recent years there has been an explosion in research on text understanding and on extracting information from real-world texts such as newspaper articles, as exemplified by the MUC evaluations (Sundheim 1992, Sundheim 1993). Tables are very common in these corpora, and very often the tables are rich in the information of interest. For example, a typical recent edition of the Wall Street Journal had twenty tables on a wide variety of subjects. Moreover, unless the tables are tagged with SGML, recognizing them is a problem that cannot be avoided. If a system attempts to process a table as though it were ordinary text, serious mistakes will be made: the row and column structure of the table will not be respected, and spurious constituents spanning internal boundaries inside the table will be recognized. Ignoring tables is simply not an option.
Yet there has been relatively little work on recognizing and interpreting tables. In part, this has been because priorities were elsewhere, in part because the problem was not deemed of sufficient linguistic interest. In part, it may be because the problem was viewed as too hard.
In fact, the problem of recognizing and interpreting tables is of significant linguistic interest. It presents in a pure form some of the most important problems in local pragmatics and discourse structure. Consider Table 1:
Major athletes have endorsement contracts for over $1M as follows:
| Nancy Kerrigan | Disneyland |
| | Pizza Hut |
| Michael Jordan | Nike |
| Charles Barkley | Reebok |
When we encounter a table such as Table 1, we have to determine in what way each row of the table instantiates the summary statement that introduces it. This is, in a pure form, the problem of determining how successive segments of text all instantiate or exemplify one general or summary statement.
Discovering this relation can be critical in information extraction tasks. The pretabular sentence may characterize a set of events of interest while the table lists those entities that participated in such an event. Failing to recognize the relation between the pretabular sentence and the items in the table would mean failing to recognize one event for each row of the table.
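The relation between a pretabular sentence and the rows of a table can be illustrated with a minimal sketch in Python. It is not the method implemented in the Message Handler System; the names `EndorsementEvent` and `instantiate_rows` are hypothetical, and the sketch assumes that Pizza Hut in Table 1 is a second sponsor for Nancy Kerrigan rather than a separate entry. The summary statement is treated as a template, and each row fills it in to yield one event.

```python
from dataclasses import dataclass


@dataclass
class EndorsementEvent:
    """One event of the kind described by the pretabular sentence (hypothetical type)."""
    athlete: str
    sponsor: str
    value: str


def instantiate_rows(rows, value):
    """Yield one event per interpretable row of a two-column table.

    Assumption: a row whose first cell is empty continues the entry of the
    most recent row that named an athlete.
    """
    events = []
    current_athlete = None
    for athlete, sponsor in rows:
        if athlete.strip():
            current_athlete = athlete.strip()
        if current_athlete is None or not sponsor.strip():
            continue  # nothing to instantiate the summary statement with
        events.append(EndorsementEvent(current_athlete, sponsor.strip(), value))
    return events


# Rows of Table 1; "over $1M" comes from the pretabular sentence.
TABLE_1 = [
    ("Nancy Kerrigan", "Disneyland"),
    ("", "Pizza Hut"),
    ("Michael Jordan", "Nike"),
    ("Charles Barkley", "Reebok"),
]

events = instantiate_rows(TABLE_1, "over $1M")
for event in events:
    print(f"{event.athlete} has an endorsement contract with {event.sponsor} for {event.value}")
```

A system that ignored the pretabular sentence would see only pairs of names in these rows. Under the assumptions above, the sketch recovers four endorsement events, one for each row.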
In brief, the task of recognizing and interpreting tables is of immense importance in information extraction applications, and is a problem of substantial linguistic interest in its own right. It is an excellent locus for research on discourse structure, because the fundamental issues arise there with particular purity.
The methods described here were developed in connection with a project for extracting information from unformatted military messages, and have been implemented in the Message Handler System, based on FASTUS (Hobbs et al., 1997). But the problems that were encountered and the techniques developed generalize to other varieties of text. We are currently testing some of the techniques on business sections of American newspapers.
We discuss first the method for recognizing when there is a table embedded in the text. This can be thought of as specifying the syntax of tables. We then discuss how we interpret tables. This can be thought of as specifying the semantics of tables.