Next: Interpreting Tables Up: Recognizing and Interpreting Tables Previous: The Problem

Recognizing Tables

A table is a two-dimensional array. It consists of two or more records, frequently corresponding to the rows, where each record consists of two or more fields, frequently corresponding to the vertical columns. The information in a field will be referred to as an item. Records may span multiple lines of text. In the current implementation, fields are pieces of text vertically aligned over multiple records. Our corpus consists of military texts in plain ASCII (as opposed to typeset text), so vertical alignment is recognized by the columns (character offsets from the beginning of each line) occupied by the characters in the fields.

Consider the following table, with its pretabular sentence, headings and subheadings:

   FIELD EXERCISES WERE CONDUCTED BY THE FOLLOWING UNITS:

        UNIT             HOME BASE           LOCATION

   21 MAY 94:
      1ST MECH INF BN   FT SAM HOUSTON         LAFAYETTE
      2ND MECH INF BN     FT LEWIS      BATON ROUGE

   22 MAY 94:
      3RD MECH INF BN     MONTEREY          LAFAYETTE

Table 2

The first problem we face is recognizing this as a table in the stream of text, rather than as a long sentence with many spaces between words. This can be more difficult than it might first seem, because runs of spaces do sometimes occur between words in ordinary text, and spaces in successive lines often line up accidentally.

This text provides an example which could be misinterpreted.  There
are three fields.  The second field ends with this sentence.  One
field stops here.  The problem occurred often in our corpus.

We also must recognize that outline formats are not tables.

Moreover, the edges of the fields can be quite ragged, sometimes by accident, sometimes by design, as when a list of decimal numbers is justified on the decimal point. In our corpus, tables were sometimes hastily typed and the alignment is erratic.

Our algorithm first analyzes each line of the text. Lines are classified in numerous ways: blank, short, centered, colon- or dash-terminated, separator (e.g., a repeated string of the same characters), or possible outline subdivision line (e.g., ``A. Introduction''). The multiple-space gaps in each line are recorded in a data structure.
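The per-line analysis described above might be sketched as follows. This is only an illustration: the function names, the label names, the page width of 72, and the thresholds for ``short'' and ``centered'' are all assumptions, and a separator is simplified to a run of one repeated non-alphanumeric character.

```python
import re

def find_gaps(line):
    """Return (start, end) column spans of runs of two or more spaces between words."""
    stripped = line.rstrip()
    gaps = []
    for m in re.finditer(r" {2,}", stripped):
        if stripped[:m.start()].strip():  # skip leading indentation
            gaps.append((m.start(), m.end()))
    return gaps

def classify_line(line, width=72):
    """Return the set of labels that apply to a line.

    `width` and the numeric thresholds are illustrative assumptions.
    """
    text = line.strip()
    if not text:
        return {"blank"}
    labels = set()
    if len(text) < width // 2:
        labels.add("short")
    left = len(line) - len(line.lstrip())
    right = width - len(line.rstrip())
    if left > 0 and abs(left - right) <= 2:
        labels.add("centered")
    if text.endswith(":") or text.endswith("-"):
        labels.add("colon_or_dash_terminated")
    # Simplification: only a single repeated punctuation character counts.
    if len(text) >= 3 and len(set(text)) == 1 and not text[0].isalnum():
        labels.add("separator")
    # A letter, number, or roman numeral followed by "." or ")" and text.
    if re.match(r"^(?:[A-Z]|\d+|[ivxIVX]+)[.)]\s+\S", text):
        labels.add("outline")
    return labels
```

A line can carry several labels at once; for instance, a subheading such as ``21 MAY 94:'' would be both short and colon-terminated under these assumptions.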

We then scan the text (ignoring blank lines and always breaking at text section boundaries) looking for sets of consecutive lines of potential tables. We look for at least two successive lines for which the following conditions hold:

1. If two or more spaces occur between words in a line, that qualifies as a potential field boundary (a ``gap''). If the lines already identified as potentially part of a table have a gap, then the current line must have at least a single space that is within or abuts that gap. The gap from the previous lines is then intersected with the gap from this line.

2. In the set of lines, there must be no vertical overlap between fields. That is, different fields in different records may not have any column in common.

3. In the set of lines, if a line has any characters in a field, then those columns must intersect the columns occupied by that field in other lines. That is, the occurrences of the same field in different records must all have some vertical intersection. This can lead the system to recognize two fields where it originally recognized just one. In the table of presidents (Table 4), only one field (the entire line) would be recognized from the first line, but the subsequent lines allow the system to recognize that the first line could have been parsed as two fields.

4. There are at least two fields.
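The gap intersection of rule 1, together with rule 4, can be sketched roughly as below. This is a simplified illustration, not the implementation used in the system: the function names are hypothetical, gaps are seeded from the first line only, a space that merely abuts a gap is not accepted, and when several space runs of a line fall inside a running gap the widest overlap is kept (an assumed tie-break). A surviving common gap means every line is blank in those columns, so the fields on either side cannot overlap (rule 2); rule 3 is not checked here.

```python
import re

def space_runs(line):
    """All runs of spaces as (start, end) spans, including indentation.

    Columns past the end of the line are treated as blank.
    """
    text = line.rstrip()
    runs = [(m.start(), m.end()) for m in re.finditer(r" +", text)]
    runs.append((len(text), float("inf")))
    return runs

def seed_gaps(line):
    """Runs of two or more spaces between words (leading indentation excluded)."""
    text = line.rstrip()
    return [(m.start(), m.end())
            for m in re.finditer(r" {2,}", text) if m.start() > 0]

def narrow(gaps, line):
    """Intersect each running gap with the widest overlapping space run of `line`.

    Returns None if some gap has no blank column in this line.
    """
    runs = space_runs(line)
    narrowed = []
    for start, end in gaps:
        best = None
        for r_start, r_end in runs:
            lo, hi = max(start, r_start), min(end, r_end)
            if lo < hi and (best is None or hi - lo > best[1] - best[0]):
                best = (lo, hi)
        if best is None:
            return None
        narrowed.append(best)
    return narrowed

def common_gaps(lines):
    """Return the gaps shared by all lines, or None if they do not form a candidate."""
    if len(lines) < 2:
        return None
    gaps = seed_gaps(lines[0])
    if not gaps:  # rule 4: at least two fields are required
        return None
    for line in lines[1:]:
        gaps = narrow(gaps, line)
        if gaps is None:
            return None
    return gaps
```

Applied to the three data rows of Table 2, this sketch yields two common gaps and therefore three fields. Like the rule it illustrates, it does not by itself rule out accidental alignments in running text.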

Rules 2 and 3 are relaxed for the first line of a table. Headings are often longer than the data under them, which may result in a heading overlapping another field. Furthermore, headings may be centered while the data is left-justified, which can result in a short heading and short data failing to overlap. A heading line must have the same number of fields as the table below it, and each field of the heading must overlap the corresponding field of the table. The following table of vehicles demonstrates problems with recognizing headings as parts of tables.

             UNIT      VEHICLES      TYPES

         1ST MECH INF BN  10  TANKS
         2ND MECH INF BN  24  TANKS, APCS, TRUCKS
         3RD MECH INF BN   3  TRUCKS

Table 3

Headings sometimes are present, and sometimes are not. During interpretation, the first record may be recognized as a heading and treated differently than the data of the other records.
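The relaxed heading test (same number of fields, each overlapping the corresponding field of the table) could be checked along these lines. The function names and the representation of fields as (start, end) column spans are assumptions made for illustration.

```python
import re

def field_spans(line):
    """Column spans of text chunks separated by two or more spaces."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+(?: \S+)*", line)]

def is_heading_for(heading_line, table_fields):
    """True if the heading has one field per table field and each pair overlaps.

    `table_fields` holds the column extent of each field over the table body.
    """
    spans = field_spans(heading_line)
    if len(spans) != len(table_fields):
        return False
    return all(h_start < t_end and t_start < h_end
               for (h_start, h_end), (t_start, t_end) in zip(spans, table_fields))
```

In Table 3 the heading VEHICLES extends over columns that belong to the neighboring fields, which is exactly the overlap that rule 2 would reject for a data line; the test above only requires that it also overlap its own field.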

The limitations of this approach are obvious. We cannot recognize a vertical list of items as a table, since there is only one field. Tables with fields that are so ragged that they overlap will not be recognized. However, including these cases would have significantly degraded the precision of the algorithm, since there would have been a large number of false positives.

In addition, the algorithm will not recognize separate fields where there is only one space between them. This can happen especially when the items in one field have a fixed length. The following is an example in which we recognize two fields rather than three:

        William Henry Harrison 1841 1841
         Died of pneumonia in office
        John Tyler             1841 1845
        James Knox Polk        1845 1849

Table 4

However, we can often recover from such cases because we allow multiple items in a field during interpretation.

Once potential tables and their fields have been identified, there are several more problems. Rows of a table can be interrupted with both subheadings and remarks, as in the tables of exercises and of presidents.

We allow for these by splicing together potential tables with compatible fields that are separated by a single line that doesn't fit the fields of either table. This allows tables to span subheadings, single-line remarks, and single rows that are misaligned enough not to match the fields of the other rows.
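A minimal sketch of this splicing step is given below. The `Candidate` record, the compatibility test (same number of gaps, each pair overlapping), and the choice to keep the intersection of the gaps after merging are all assumptions for illustration; blank lines are ignored when counting the interrupting lines.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    first: int  # index of the first line of the potential table
    last: int   # index of the last line (inclusive)
    gaps: list  # (start, end) column spans separating the fields

def compatible(a, b):
    """Assumed test: same number of gaps and each corresponding pair overlaps."""
    return len(a.gaps) == len(b.gaps) and all(
        max(s1, s2) < min(e1, e2)
        for (s1, e1), (s2, e2) in zip(a.gaps, b.gaps))

def splice(candidates, lines):
    """Merge neighboring candidates separated by exactly one non-blank line."""
    merged = []
    for cand in candidates:
        if merged:
            prev = merged[-1]
            between = [i for i in range(prev.last + 1, cand.first)
                       if lines[i].strip()]
            if len(between) == 1 and compatible(prev, cand):
                gaps = [(max(s1, s2), min(e1, e2))
                        for (s1, e1), (s2, e2) in zip(prev.gaps, cand.gaps)]
                merged[-1] = Candidate(prev.first, cand.last, gaps)
                continue
        merged.append(cand)
    return merged
```

The interrupting line itself is left for the later classification as a subheading or a remark.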

The next problem is to distinguish subheadings from remarks. An interruption in a table is classified as a subheading if one of the following conditions holds:

  1. The line ends in a colon, dash, or multiple dashes.

  2. The line begins to the left of the first column of the table.

  3. The line is centered.

Otherwise the interruption is classified as a remark. As will be seen in Section 3, the interpretation of subheadings constitutes an important part of the interpretation of tables. At present, remarks are ignored, although more properly they should be interpreted as normal sentences, but in the context provided by the previous record in the table.
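The three subheading conditions translate into a short test such as the following sketch. The function name and parameters are hypothetical, and the centering check (roughly equal margins on an assumed 72-column page) is a guess, since the text does not say how centering is measured.

```python
def classify_interruption(line, table_left, width=72):
    """Label an interrupting line as 'subheading' or 'remark'.

    `table_left` is the first column occupied by the table;
    `width` is an assumed page width used for the centering check.
    """
    text = line.strip()
    indent = len(line) - len(line.lstrip())
    # Condition 1: ends in a colon, a dash, or multiple dashes.
    if text.endswith(":") or text.endswith("-"):
        return "subheading"
    # Condition 2: begins to the left of the first column of the table.
    if indent < table_left:
        return "subheading"
    # Condition 3: centered (left and right margins nearly equal).
    right = width - len(line.rstrip())
    if abs(indent - right) <= 2:
        return "subheading"
    return "remark"
```

Under these assumptions ``21 MAY 94:'' in Table 2 is a subheading by the first two conditions, while ``Died of pneumonia in office'' in Table 4 falls through to a remark.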

Tables often have multiple-line fields within a record. Sometimes these multiple lines represent lists of items and sometimes they constitute a single item whose representation was too long to fit in the field. To complicate matters, tables also often have empty fields within a record. Whether a line represents its own record or a continuation of the previous record is decided by noting how many fields the line contains.

            UNIT        VEHS EQUIPMENT   COMMENT

        1ST MECH INF BN  10  TANKS       
        2ND MECH INF BN  24  TANKS
                             APCS
                             TRUCKS
        3RD MECH INF BN   3  TRUCKS
        4TH MECH INF BN   0             EQUIPMENT WAS ALREADY
                                        COMMITED ELSEWHERE

Table 5

When our system finds a line with only a single field in the middle of a table with multiple fields, that field is considered to be a continuation of the field from the previous line. Thus the EQUIPMENT field of the record for the 2ND MECH INF BN will be a sequence consisting of ``TANKS'', ``APCS'', and ``TRUCKS'' while the COMMENT field of the record for the 4TH MECH INF BN will be a sequence consisting of ``EQUIPMENT WAS ALREADY'' and ``COMMITED ELSEWHERE''. The interpretation process will need to determine whether the sequences are separate items or a single item.

If there are multiple fields in a single line, then that line is considered to be its own record rather than a continuation line. While this worked well for our corpus, a more general solution would require interleaving recognition and interpretation.
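The continuation rule can be sketched as follows, assuming the field boundaries have already been found and are given as cut columns. The names `split_fields` and `build_records`, and the representation of a record as a list of item sequences, are illustrative choices only.

```python
def split_fields(line, boundaries):
    """Cut a line at the given column boundaries and strip each piece."""
    cuts = [0, *boundaries, None]
    return [line[a:b].strip() for a, b in zip(cuts, cuts[1:])]

def build_records(lines, boundaries):
    """Group lines into records; each field holds a sequence of strings.

    A line with exactly one non-empty field continues that field of the
    previous record; a line with several non-empty fields starts a new record.
    """
    records = []
    for line in lines:
        if not line.strip():
            continue  # blank lines are ignored
        cells = split_fields(line, boundaries)
        filled = [i for i, cell in enumerate(cells) if cell]
        if len(filled) == 1 and records:
            records[-1][filled[0]].append(cells[filled[0]])
        else:
            records.append([[cell] if cell else [] for cell in cells])
    return records
```

With cut columns 23, 27, and 40 for the data lines of Table 5, the EQUIPMENT field of the second record becomes a sequence of three strings, and the empty COMMENT fields come out as empty sequences.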

Missing fields in a table are sometimes intended to indicate that the information is unknown or irrelevant, as in the multiline table (Table 5). Other times fields (especially those on the left) are omitted as a kind of ellipsis or as a form of subtitle. We currently treat missing fields as missing data.

Not every sentence that immediately precedes a table is a pretabular sentence in the sense that it describes the contents of the table. During table recognition we attempt to identify those immediately preceding sentences that are pretabular sentences. An immediately preceding sentence is classified as pretabular if no paragraph break occurs between it and the table and one of the following conditions holds:

  1. It ends in a colon or a dash.

  2. It contains the words ``the following'' or ends with the words ``as follows''.
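The pretabular test can be written down almost directly. In this sketch the function name and the `paragraph_break` flag are hypothetical (the caller is assumed to know whether a paragraph break separates the sentence from the table), and matching is made case-insensitive as an assumption.

```python
def is_pretabular(sentence, paragraph_break=False):
    """True if the sentence immediately preceding a table looks pretabular."""
    if paragraph_break:
        return False
    text = sentence.strip()
    lowered = text.lower()
    # Condition 1: ends in a colon or a dash.
    if text.endswith(":") or text.endswith("-"):
        return True
    # Condition 2: contains "the following" or ends with "as follows".
    if "the following" in lowered:
        return True
    return lowered.rstrip(".:;").rstrip().endswith("as follows")
```

The sentence introducing Table 2 satisfies both conditions: it ends in a colon and contains ``THE FOLLOWING''.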

In the tables that were recognized, there were 11 with subheadings. The program recognized 8 of these as subheadings, with no false positives, for a recall of 73% and a precision of 100%. There were 9 tables with remarks or overflow lines.
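For reference, the subheading figures above follow from the usual definitions of recall and precision; the short sketch below reproduces the arithmetic with 8 true positives, 0 false positives, and 3 missed subheadings. The function name is illustrative.

```python
def recall_precision(true_positives, false_positives, false_negatives):
    """Standard recall and precision from raw counts."""
    recall = true_positives / (true_positives + false_negatives)
    precision = true_positives / (true_positives + false_positives)
    return recall, precision

# 11 tables with subheadings, 8 recognized, no false positives.
recall, precision = recall_precision(8, 0, 11 - 8)
print(f"recall {recall:.0%}, precision {precision:.0%}")
```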


Jerry Hobbs 2004-02-24