Background

Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. A wrapper for a yellow pages source can take a query (for example, for a Mexican restaurant near Marina del Rey, CA), retrieve the Web page containing the results of the query, and extract the restaurant's name, address, and phone number. Below is a fragment of a Web page returned by a yellow pages source for the example query and the data tuple extracted by the wrapper.
 

[Web page fragment image omitted]

Extracted data:
 NAME    Casablanca Restaurant 
 STREET  220 Lincoln Boulevard
 CITY    Venice
 ZIP     90291
 PHONE   (310) 392-5751

The wrappers used by our group are landmark-based, i.e., they rely on the layout of the Web page in order to extract data from it (Muslea et al. 1999, Kushmerick et al. 1997). The wrappers are created semi-automatically in the following way:

  1. The user labels examples of the data to extract on several pages.
  2. These examples, along with the Web pages, are provided to the wrapper induction algorithm.
  3. The algorithm learns extraction rules for the data (a sketch of such a rule appears below).
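
To make this concrete, below is a minimal sketch of a landmark-based extraction rule in Python. The page fragment and the landmarks are invented for illustration; an induced wrapper such as STALKER (Muslea et al. 1999) learns more expressive rules from the labeled examples.

    import re

    def extract(page: str, start_landmark: str, end_landmark: str) -> str:
        """Return the text between two layout landmarks on a page.

        A landmark-based extraction rule is essentially a pair of
        landmarks: skip past the first, then consume up to the second.
        """
        pattern = re.escape(start_landmark) + r"(.*?)" + re.escape(end_landmark)
        match = re.search(pattern, page, re.DOTALL)
        return match.group(1).strip() if match else ""

    # Hypothetical page fragment; a real rule is induced from labeled pages.
    page = "<b>Name:</b> Casablanca Restaurant <br><b>Phone:</b> (310) 392-5751 <br>"
    print(extract(page, "<b>Name:</b>", "<br>"))   # Casablanca Restaurant
    print(extract(page, "<b>Phone:</b>", "<br>"))  # (310) 392-5751

Because the rule is anchored on the layout of the page rather than on the content itself, a layout change can silently break it, which motivates the maintenance work described below.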

The advantages and disadvantages of such wrappers are outlined below.

Pros:

  • data extraction is fast and efficient
  • domain independent
  • extraction rules can be learned from a small number of examples

Cons:

  • wrappers are prone to failure when layout of the Web page changes
  • few validation tools to verify the wrapper is extracting data correctly (Kushmerick 1999)
  • no wrapper maintenance tools

The first of these is the major problem for information integration applications. Even if a Web site changes its layout only once a year, an application that relies on the order of a thousand wrappers (a comparison shopping application, for example) will have several broken wrappers to fix every day.
 

Research Goals

Our research goal is to create a data representation scheme and to develop efficient algorithms that allow us to learn the structure of common data types. We can use this information for wrapper maintenance applications. The first application, wrapper verification, detects when the wrapper is not extracting data correctly from a Web site, possibly because the layout of the site has changed in a way that breaks the extraction rules learned by the wrapper induction system. If the wrapper is not working correctly because the Web site has changed, the next task, wrapper reinduction, is to automatically revise the wrapper and learn new extraction rules. Although we focus on Web applications, the learning technique can be used to learn about data fields in general. We believe it is a step towards interoperability among information agents, where agents can exchange data without the need for a pre-specified protocol.
 

Data Representation and Learning

Our objective is to learn the structure of common data fields, such as street addresses. Here are three examples of a street address:

220 Lincoln Boulevard
523 S Robertson Boulevard
4676 Admiralty Way

One way to represent this information is as a sequence of characters, e.g., with a regular expression (Goan et al. 1996). This description is powerful and expressive; however, many common data types do not have a well-defined grammar, or it may be computationally expensive to learn one. Another way to represent information is as a collection of global features, such as the number of words and the density of numeric characters (Kushmerick 1999). These descriptions can be learned quickly and efficiently, but we believe this level of representation is too coarse for some information management tasks.
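
For concreteness, the two global features named above can be computed in a few lines of Python. This sketch is ours; the actual feature set used in (Kushmerick 1999) is richer.

    def global_features(value: str) -> dict:
        """Coarse field-level features: word count and numeric density."""
        chars = value.replace(" ", "")
        return {
            "word_count": len(value.split()),
            "numeric_density": sum(c.isdigit() for c in chars) / max(len(chars), 1),
        }

    print(global_features("220 Lincoln Boulevard"))
    # {'word_count': 3, 'numeric_density': 0.158...}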

We propose an intermediate token-level representation that balances the descriptive power and specificity of the character-level representation with the compactness and computational efficiency of the global representation. A token is a word composed of characters of particular types, and we use those character types to assign the token to one or more general syntactic types: numeric, alphabetic, alphanumeric, punctuation, etc. These types form a hierarchy that allows for multi-level generalization. Thus, the token “apple” belongs to the specific type representing the word “apple”, as well as to the general types of lowercased words, alphabetic words, and alphanumeric strings. This representation is flexible and may be extended to include new types and domain-specific information.
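
The hierarchy can be sketched as a function mapping a token to its types, from most specific to most general. The type names _NUM and _CAPS below follow the ones used on this page; the others (_LOWER, _ALPHA, _ALNUM, _PUNCT) are our own placeholders.

    def token_types(token: str) -> list:
        """Map a token to its syntactic types, specific to general."""
        types = [token]                 # most specific: the literal token
        if token.isdigit():
            types.append("_NUM")        # numeric strings
        elif token.isalpha():
            types.append("_CAPS" if token[0].isupper() else "_LOWER")
            types.append("_ALPHA")      # alphabetic words
        if token.isalnum():
            types.append("_ALNUM")      # alphanumeric strings, most general
        else:
            types.append("_PUNCT")      # punctuation and other symbols
        return types

    print(token_types("apple"))  # ['apple', '_LOWER', '_ALPHA', '_ALNUM']
    print(token_types("220"))    # ['220', '_NUM', '_ALNUM']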

We claim that the structure of many common data types can be described by a sequence of specific and general tokens, or a pattern. For complex data fields, it is sufficient to use only the starting and ending patterns as the description of the field. The collection of starting and ending patterns is the data prototype. Thus, the data prototype for the street address examples above might be:

starting patterns:
_NUM _CAPS
ending patterns:
_CAPS Boulevard
_CAPS _CAPS

where _NUM is the general type representing numeric tokens, and _CAPS is the general type representing all capitalized words.
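
A data prototype can then serve as a recognizer for new field values. The following hedged sketch tokenizes a candidate string and checks its first and last tokens against the starting and ending patterns, reusing the token_types function sketched above.

    def matches(pattern, tokens):
        """True if each token is an instance of the corresponding type."""
        return len(tokens) >= len(pattern) and all(
            p in token_types(t) for p, t in zip(pattern, tokens)
        )

    def fits_prototype(value, starting_patterns, ending_patterns):
        """Check the start and end of the value against the prototype."""
        tokens = value.split()
        return any(matches(s, tokens) for s in starting_patterns) and any(
            matches(e, tokens[-len(e):]) for e in ending_patterns
        )

    starts = [["_NUM", "_CAPS"]]
    ends = [["_CAPS", "Boulevard"], ["_CAPS", "_CAPS"]]
    print(fits_prototype("4676 Admiralty Way", starts, ends))  # True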

Because we do not have a source of explicit negative examples for each data field, we chose to frame the learning problem as learning from positive examples alone. The learning problem is to find the patterns that describe many of the positive examples of data but are highly unlikely to describe a random sequence of tokens. In this statement, all possible randomly generated sequences of tokens are treated as implicit negative examples.
Stated this way, the learning problem is to find statistically significant patterns, i.e., those that describe a significant number of the positive examples and are unlikely to have been generated by chance. We use hypothesis testing at the 5% significance level to determine whether a sequence of tokens is significant. DataPro, described in Lerman & Minton (2000), is a greedy algorithm that finds significant patterns in a set of token sequences.
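
The significance test itself can be illustrated as follows. Under the null hypothesis, a pattern that would match a random token sequence with probability p should describe about n*p of n examples, and a pattern is kept only when its observed count is significantly higher. This is a sketch using a normal approximation to the binomial, not DataPro's exact procedure, and the counts below are hypothetical.

    import math

    def is_significant(k, n, p, z_crit=1.645):
        """One-sided test: do k of n examples match the pattern far more
        often than the chance probability p predicts? z_crit = 1.645
        corresponds to the 5% significance level used above (p > 0 assumed).
        """
        mean = n * p
        std = math.sqrt(n * p * (1 - p))
        return (k - mean) / std > z_crit

    # 40 of 100 addresses start with _NUM _CAPS; a random token pair
    # would match with probability 0.001 (both numbers hypothetical).
    print(is_significant(k=40, n=100, p=0.001))  # True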
 

Applications to Wrapper Maintenance

As described above, Web wrappers use the layout of HTML pages to extract data, so if the layout of a page changes, the wrapper may stop extracting correctly. The wrapper verification task is to automatically determine whether the wrapper is correctly extracting data from an information source (Kushmerick 1999). The task can be stated in the following way: given a set of old examples that are known to be correct, and a set of test examples extracted by the wrapper from new pages, do the patterns learned from the old examples describe a comparable number of the new examples? If so, the wrapper is extracting correctly; otherwise, the source may have changed.
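
A hedged sketch of this comparison, reusing the matches function above: compute each pattern's coverage over the old, known-good extracts and over the new ones, and flag the wrapper when coverage drops. The tolerance here is an invented placeholder; the actual verification algorithm compares pattern counts statistically.

    def coverage(pattern, examples):
        """Fraction of examples whose initial tokens match a starting pattern."""
        hits = sum(matches(pattern, ex.split()) for ex in examples)
        return hits / max(len(examples), 1)

    def wrapper_ok(patterns, old_examples, new_examples, tol=0.2):
        """Verify: each pattern should cover roughly the same fraction
        of the new extracts as of the known-correct old extracts."""
        return all(
            abs(coverage(p, old_examples) - coverage(p, new_examples)) <= tol
            for p in patterns
        )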

Once we have determined that the wrapper is not working correctly, the next task is to rebuild it automatically. This task involves two steps:

  1. Using the patterns learned from the old, correct examples, automatically identify examples of each data field on the changed pages.
  2. Provide these examples to the wrapper induction algorithm, which learns new extraction rules.

Wrapper reinduction

Complex pages, such as those that contain tables and lists, present a special challenge to the wrapper reinduction algorithm. This is because the current wrapper induction method requires that the first and last elements of a list be unambiguously identified, along with at least three consecutive list elements. We are developing automatic methods to handle complex pages.


Data sets

An archive containing sample pages retrieved by wrappers on different dates. This data was used in the experiments in the JAIR 2003 paper.

