Wrappers facilitate access to Web-based information sources
by providing a uniform querying and data extraction capability. A wrapper for
a yellow pages source can take a query for a Mexican restaurant near Marina
del Rey, CA, for example, retrieve the Web page containing the results of the
query, and extract the restaurant’s name, address, and phone number. Below is a
fragment of a Web page returned by a yellow pages source for the example query
and the data tuple extracted by the wrapper.
[Figure: Web page fragment returned for the example query, with the data tuple extracted by the wrapper]
The wrappers used by our group are landmark-based, i.e., they rely on the layout of the Web page in order to extract data from it (Muslea et al. 1999, Kushmerick et al. 1997). The wrappers are created semi-automatically by a wrapper induction system, which learns extraction rules from labeled examples of the data.
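A landmark-based extraction rule can be illustrated with a minimal sketch. The page fragment, field name, and landmarks below are hypothetical; real landmark-based wrappers such as STALKER (Muslea et al. 1999) learn far richer rules, but the failure mode is the same: if a landmark disappears after a layout change, extraction fails.

```python
from typing import Optional

def extract_between(page: str, start_landmark: str, end_landmark: str) -> Optional[str]:
    """Return the text between two landmark strings, or None if a
    landmark is missing -- the failure a layout change typically causes."""
    start = page.find(start_landmark)
    if start == -1:
        return None
    start += len(start_landmark)
    end = page.find(end_landmark, start)
    if end == -1:
        return None
    return page[start:end].strip()

# Hypothetical yellow-pages page fragment:
fragment = "<b>Name:</b> Casa Sanchez <b>Phone:</b> (310) 555-0123"
extract_between(fragment, "<b>Name:</b>", "<b>")    # 'Casa Sanchez'
extract_between(fragment, "<td>Name:</td>", "<b>")  # None: landmark gone
```

The second call shows what happens after a layout change: the old landmark no longer occurs on the page, so the rule silently returns nothing.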
The chief drawback of such wrappers is that they depend on the page layout:
when a site changes its layout, the wrapper breaks. This is the major problem
for information integration applications. Even if each Web site changes its
layout only once a year, an application with on the order of a thousand
wrappers (a comparison shopping application, for example) has to fix several
wrappers every day.
Our research goal is to create a data representation scheme
and to develop efficient algorithms that allow us to learn the structure of
common data types. We can use this information for wrapper maintenance
applications. The first application, wrapper verification, detects when
the wrapper is not extracting data correctly from a Web site, possibly because
the layout of the site has changed in a way that breaks the extraction rules
learned by the wrapper induction system. If the wrapper is not working
correctly because the Web site has changed, the next task, wrapper
reinduction, is to automatically revise the wrapper and learn new extraction
rules. Although we
focus on Web applications, the learning technique can be used to learn about
data fields in general. We believe that it is a step towards interoperability
among information agents, where agents can exchange data without the need for a
shared, pre-agreed schema.
Our objective is to learn the structure of common data fields, such as street addresses. Here are two examples of a street address:
523 S Robertson Boulevard
4676 Admiralty Way
One way to represent this information is as a sequence of characters, e.g., by a regular expression (Goan et al. 1996). This description is powerful and expressive; however, many common data types do not have a well-defined grammar, or it may be computationally expensive to learn one. Another way to represent the information is as a collection of global features, such as the number of words and the density of numeric characters (Kushmerick 1999). These descriptions can be learned quickly and efficiently, but we believe that this level of representation is too coarse for some information management tasks.
We propose an intermediate token-level representation that balances the descriptive power and specificity of the character-level representation with the compactness and computational efficiency of the global representation. Tokens are strings made up of different types of characters, and we use a token's character types to assign it to one or more general syntactic types: numeric, alphabetic, alphanumeric, punctuation, etc. These types form a hierarchy (see figure below) that allows for multi-level generalization. Thus, the token “apple” belongs to the specific type representing the word “apple”, as well as to the general types of lowercased words, alphabetic words, and alphanumeric strings. This representation is flexible and may be extended to include new types and domain-specific information.
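The multi-level generalization can be sketched in a few lines. The type names below (_NUM, _CAPS, and so on) mirror the categories discussed here, but the exact type set in our system may differ, so treat this as an illustration of the hierarchy rather than the implemented one:

```python
def token_types(token):
    """Types for a token, most specific first, following a hierarchy of
    the form: literal -> _NUM | _CAPS | _LOWER -> _ALPHA -> _ALNUM."""
    types = [token]                      # most specific: the literal token
    if token.isdigit():
        types.append("_NUM")
    elif token.isalpha():
        types.append("_CAPS" if token[0].isupper() else "_LOWER")
        types.append("_ALPHA")
    if token.isalnum():
        types.append("_ALNUM")           # most general syntactic type
    elif not any(c.isalnum() for c in token):
        types.append("_PUNCT")
    return types

token_types("apple")  # ['apple', '_LOWER', '_ALPHA', '_ALNUM']
token_types("523")    # ['523', '_NUM', '_ALNUM']
token_types(",")      # [',', '_PUNCT']
```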
We claim that the structure of many common data types can be described by a sequence of specific and general tokens, or a pattern. For complex data fields it is sufficient to use only the starting and ending patterns as the description of the data field. The collection of starting and ending patterns is the data prototype. Thus, the data prototype for the street address examples above might be:

start: _NUM _CAPS
end: _CAPS

where _NUM represents the general category of numeric strings, and _CAPS is the general category representing all capitalized words.
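Matching a pattern against a token sequence then reduces to checking, position by position, whether each pattern element equals the token itself or one of the token's general types. The helper below is a simplified sketch (type assignment is reduced to a single type per token for brevity):

```python
def type_of(token):
    """Simplified: assign each token a single general type."""
    if token.isdigit():
        return "_NUM"
    if token.isalpha():
        return "_CAPS" if token[0].isupper() else "_LOWER"
    return "_ALNUM" if token.isalnum() else "_PUNCT"

def match_start(pattern, tokens):
    """Does the token sequence begin with the pattern?  Each pattern
    element must equal the token itself or the token's general type."""
    return len(tokens) >= len(pattern) and all(
        p in (t, type_of(t)) for p, t in zip(pattern, tokens))

def match_end(pattern, tokens):
    """Does the token sequence end with the pattern?"""
    return match_start(pattern, tokens[-len(pattern):])

addr = "4676 Admiralty Way".split()
match_start(["_NUM", "_CAPS"], addr)  # True: 4676 -> _NUM, Admiralty -> _CAPS
match_end(["_CAPS"], addr)            # True: Way -> _CAPS
```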
Because we do not have a source of explicit negative examples for each data
field, we chose to frame the learning problem as learning from positive
examples alone. The learning problem is to find the patterns that describe many
of the positive examples of data but are highly unlikely to describe a random
sequence of tokens. In this statement, all possible randomly generated
sequences of tokens are treated as implicit negative examples.
Stated this way, the learning problem is to find statistically significant patterns, i.e., those that describe a significant number of positive examples and are unlikely to have been generated by chance. We use hypothesis testing at the 5% significance level to determine whether a sequence of tokens is significant. DataPro, described in (Lerman & Minton 2000), is a greedy algorithm that finds significant patterns in a set of token sequences.
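The significance test can be sketched as a one-sided test using the normal approximation to the binomial: a pattern observed k times in n examples is significant if k exceeds what chance occurrence probability p would produce. DataPro's exact formulation may differ; this is only an illustration of a 5%-level test:

```python
import math

def is_significant(k, n, p):
    """One-sided 5% test via the normal approximation to the binomial:
    is seeing a pattern k times in n examples significantly more than
    expected if tokens occurred with chance probability p?"""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return (k - mean) / sd > 1.645  # z critical value, one-sided, alpha = 0.05

# 30 of 100 examples extend the pattern with a token whose chance
# probability is 0.1: z = (30 - 10) / 3 ~ 6.7, clearly significant.
is_significant(30, 100, 0.1)   # True
is_significant(12, 100, 0.1)   # False: z ~ 0.67
```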
As described above, Web wrappers use the layout of HTML pages in order to extract data, so if the layout of a page changes, the wrapper may stop extracting correctly. The wrapper verification task is to automatically determine whether the wrapper is correctly extracting data from an information source (Kushmerick 1999). The verification task can be stated in the following way: given a set of old examples that are known to be correct, and a set of test examples extracted by the wrapper from new pages, do the patterns learned from the old examples describe a similar proportion of the new examples? If yes, the wrapper is extracting correctly; otherwise, the source may have changed.
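A minimal sketch of this check: compare the fraction of newly extracted examples that the learned patterns describe against the fraction observed on known-good data. The patterns, examples, and fixed tolerance below are hypothetical; a real verifier would use the statistical test described earlier rather than a hard threshold.

```python
def verify(patterns, old_fraction, new_examples, tolerance=0.2):
    """Flag a possible source change by comparing pattern match rates.

    patterns:     starting patterns learned from known-good extractions
    old_fraction: fraction of known-good examples those patterns described
    tolerance:    hypothetical fixed threshold for illustration only
    """
    def type_of(t):
        if t.isdigit():
            return "_NUM"
        if t.isalpha() and t[0].isupper():
            return "_CAPS"
        return "_OTHER"

    def match_start(pattern, tokens):
        return len(tokens) >= len(pattern) and all(
            p == t or p == type_of(t) for p, t in zip(pattern, tokens))

    matched = sum(any(match_start(p, ex) for p in patterns)
                  for ex in new_examples)
    return abs(matched / len(new_examples) - old_fraction) <= tolerance

# Known-good addresses matched [_NUM, _CAPS] 90% of the time:
good = [["523", "S", "Robertson", "Boulevard"], ["4676", "Admiralty", "Way"]]
bad = [["Page", "not", "found"], ["Error", "404"]]
verify([["_NUM", "_CAPS"]], 0.9, good)  # True: wrapper looks healthy
verify([["_NUM", "_CAPS"]], 0.9, bad)   # False: source may have changed
```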
Once we have determined that the wrapper is not working correctly, the next task is to rebuild it automatically. This task involves two steps: locating examples of the data field on the changed pages using the learned patterns, and feeding those examples back to the wrapper induction system to learn new extraction rules.
Complex pages, like those that contain tables and lists, present a special challenge to the wrapper reinduction algorithm. This is because with the current wrapper induction method, the first and last elements of the list have to be unambiguously identified, in addition to at least three consecutive list elements. We are developing automatic methods to handle complex pages.
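Relocating likely examples of a data field on a changed page can be sketched as a scan for token spans that begin with a starting pattern and end with an ending pattern. The patterns and page text below are hypothetical; note how even a simple scan produces several overlapping candidate spans, which illustrates the kind of ambiguity that makes list pages hard:

```python
def find_candidates(tokens, start_pattern, end_pattern, max_len=6):
    """Scan a token stream for spans that begin with the start pattern
    and end with the end pattern: candidate occurrences of the field."""
    def type_of(t):
        if t.isdigit():
            return "_NUM"
        if t.isalpha() and t[0].isupper():
            return "_CAPS"
        return "_OTHER"

    def fits(p, t):                      # literal or general-type match
        return p == t or p == type_of(t)

    spans = []
    for i in range(len(tokens) - len(start_pattern) + 1):
        head = tokens[i:i + len(start_pattern)]
        if not all(fits(p, t) for p, t in zip(start_pattern, head)):
            continue
        for j in range(i + len(start_pattern),
                       min(len(tokens), i + max_len) + 1):
            tail = tokens[j - len(end_pattern):j]
            if len(tail) == len(end_pattern) and all(
                    fits(p, t) for p, t in zip(end_pattern, tail)):
                spans.append(tokens[i:j])
    return spans

tokens = "Visit us at 4676 Admiralty Way Suite 1001".split()
candidates = find_candidates(tokens, ["_NUM", "_CAPS"], ["_CAPS"])
# candidates include ['4676', 'Admiralty', 'Way'] along with shorter and
# longer spans that also fit the patterns -- the ambiguity noted above
```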
Copyright: USC Information Sciences Institute