Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. A wrapper for a yellow pages source can, for example, take a query for a Mexican restaurant near Marina del Rey, CA, retrieve the Web page containing the results of the query, and extract the restaurant's name, address, and phone number. Below is a fragment of a Web page returned by a yellow pages source for the example query and the data tuple extracted by the wrapper.
[Figure: fragment of the yellow pages result page and the (name, address, phone) tuple extracted by the wrapper]
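To make the idea concrete, here is a minimal sketch in Python of what the uniform query interface of such a wrapper might look like. The class name, the query method, and its parameters are hypothetical placeholders for illustration, not part of an actual system.

# Hypothetical sketch of a wrapper's uniform query interface.
from dataclasses import dataclass

@dataclass
class Listing:
    name: str      # restaurant name extracted from the result page
    address: str   # street address extracted from the result page
    phone: str     # phone number extracted from the result page

class YellowPagesWrapper:
    """Presents a uniform query interface over a yellow pages Web source."""

    def query(self, category: str, near: str) -> list[Listing]:
        # 1. Build the source-specific URL for the query.
        # 2. Fetch the result page.
        # 3. Apply the learned extraction rules to pull out each field.
        raise NotImplementedError("sketch only")

# Intended usage (illustrative):
#   results = YellowPagesWrapper().query(category="Mexican restaurant",
#                                        near="Marina del Rey, CA")
#   for r in results:
#       print(r.name, r.address, r.phone)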
The wrappers used by our group are landmark-based, i.e., they rely on the layout of the Web page in order to extract data from it (Muslea et al. 1999, Kushmerick et al. 1997). The wrappers are created semi-automatically in the following way:
- The user labels examples of the data to extract on several pages.
- These examples, along with the Web pages, are provided to the wrapper induction algorithm.
- The induction algorithm learns extraction rules for the data (a simplified sketch of such a rule follows the list).
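As an illustration only, the following sketch applies a landmark-style extraction rule: it returns the text that lies between a start landmark and an end landmark on the page. The landmarks and the page fragment are invented; real rules learned by landmark-based induction systems are considerably richer.

# Minimal, hypothetical sketch of landmark-based extraction.
def extract(html: str, start_landmark: str, end_landmark: str) -> str:
    """Return the text that appears between two layout landmarks."""
    start = html.index(start_landmark) + len(start_landmark)
    end = html.index(end_landmark, start)
    return html[start:end].strip()

# Invented page fragment for illustration:
page = "<b>Name:</b> Casa Example <br><b>Phone:</b> (310) 555-0100 <br>"
name = extract(page, "<b>Name:</b>", "<br>")    # -> "Casa Example"
phone = extract(page, "<b>Phone:</b>", "<br>")  # -> "(310) 555-0100"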
The advantages and disadvantages of such wrappers are outlined below.

Pros:
- Data extraction is fast and efficient.
- The approach is domain independent.
- Extraction rules can be learned from a small number of examples.

Cons:
- Wrappers are prone to failure when the layout of the Web page changes.
- There are few validation tools to verify that a wrapper is extracting data correctly (Kushmerick 1999).
- There are no wrapper maintenance tools.
We have applied our general framework for creating information assistants to build an example travel assistant. The resulting system helps a user plan a business trip from beginning to end. When you start the system, it first looks up the upcoming meetings in your calendar. After you have selected a meeting, it extracts the dates of the meeting, looks up the location, checks the weather, and even makes a recommendation about whether you should fly, drive, or take the train to the meeting.
The first drawback, sensitivity to layout changes, is the major problem for information integration applications. Even if a Web site changes its layout only once a year, an application that relies on the order of a thousand wrappers (a comparison shopping application, for example) will have to fix several wrappers every day.
We have developed efficient algorithms that allow us to learn the structure of common data types. We can use this information for wrapper maintenance applications. The first application, wrapper verification, detects when the wrapper is not extracting data correctly from a Web site, possibly because the layout of the site has changed in a way that breaks the extraction rules learned by the wrapper induction system. If the wrapper is not working correctly because the Web site has changed, the next task, wrapper reinduction, is to automatically revise the wrapper and learn new extraction rules. Although we focus on Web applications, the learning technique can be used to learn about data fields in general. We believe that it is a step towards interoperability among information agents, where agents can exchange data without the need for a pre-specified protocol.
Our objective is to learn the structure of common data fields, such as street addresses. Here are three examples of a street address:
220 Lincoln Boulevard
523 S Robertson Boulevard
4676 Admiralty Way
One way to represent this information is as a sequence of characters, such as a regular expression (Goan et al. 1996). This description is powerful and expressive; however, many common data types do not have a well-defined grammar, or it may be computationally expensive to learn one. Another way to represent the information is as a collection of global features, such as the number of words and the density of numeric characters (Kushmerick 1999). These descriptions can be learned quickly and efficiently, but we believe that this level of representation is too coarse for some information management tasks.
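As a rough illustration of the second, coarser representation, the sketch below computes a few global features for the address examples above. The particular feature set is invented for illustration and is not the one used in (Kushmerick 1999).

# Hypothetical global-feature description of a data field (illustration only).
def global_features(example: str) -> dict:
    tokens = example.split()
    chars = example.replace(" ", "")
    return {
        "word_count": len(tokens),
        "mean_word_length": len(chars) / len(tokens),
        "numeric_density": sum(c.isdigit() for c in chars) / len(chars),
    }

for addr in ["220 Lincoln Boulevard", "523 S Robertson Boulevard", "4676 Admiralty Way"]:
    print(addr, global_features(addr))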
We propose an intermediate token-level representation that balances the descriptive power and specificity of the character-level representation with the compactness and computational efficiency of the global representation. Tokens are words composed of different types of characters. We use the token's character types to assign it to one or more general syntactic types: numeric, alphabetic, alphanumeric, punctuation, etc. These types form a hierarchy (see figure below) that allows for multi-level generalization. Thus, the token 'apple' belongs to the specific type representing the word 'apple', as well as to the general types of lowercased words, alphabetic words, and alphanumeric strings. This representation is flexible and may be extended to include new types and domain-specific information.
[Figure: hierarchy of syntactic token types, from general types such as alphanumeric down to specific tokens]
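A minimal sketch of how a token could be mapped onto such a hierarchy is shown below. The type names and the exact shape of the hierarchy are assumptions for illustration; the actual type set is the one pictured in the figure.

# Hypothetical sketch: list the syntactic types a token belongs to,
# from most specific (the token itself) to most general.
def token_types(token: str) -> list[str]:
    types = [token]                       # most specific type: the token itself
    if token.isdigit():
        types.append("_NUM")
    elif token.isalpha():
        if token.islower():
            types.append("_LOWER")
        elif token[0].isupper():
            types.append("_CAPS")
        types.append("_ALPHA")
    elif not token.isalnum():
        types.append("_PUNCT")
    if token.isalnum():
        types.append("_ALNUM")            # most general type
    return types

print(token_types("apple"))    # ['apple', '_LOWER', '_ALPHA', '_ALNUM']
print(token_types("4676"))     # ['4676', '_NUM', '_ALNUM']
print(token_types("Lincoln"))  # ['Lincoln', '_CAPS', '_ALPHA', '_ALNUM']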
We claim that the structure of many common data types can be described by a sequence of specific and general tokens, or a pattern. For complex data fields, it is sufficient to use only the starting and ending patterns as the description of the data field. The collection of starting and ending patterns is the data prototype. Thus, the data prototype for the street address examples above might be:
starting patterns:
  _NUM _CAPS
ending patterns:
  _CAPS Boulevard
  _CAPS _CAPS
where _NUM is the general category representing numeric strings, and _CAPS is the general category representing all capitalized words.
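The sketch below shows how such a data prototype could be checked against an example, assuming the token_types helper from the earlier sketch. The matching logic is a simplification for illustration, not the DataPro algorithm itself.

# Hypothetical sketch: does a pattern describe the start (or end) of an example?
# Relies on token_types() from the earlier sketch.
def matches(pattern: list[str], tokens: list[str], from_end: bool = False) -> bool:
    """Each pattern element must be either the token itself or one of its types."""
    if len(tokens) < len(pattern):
        return False
    window = tokens[-len(pattern):] if from_end else tokens[:len(pattern)]
    return all(p in token_types(t) for p, t in zip(pattern, window))

prototype = {
    "start": [["_NUM", "_CAPS"]],
    "end":   [["_CAPS", "Boulevard"], ["_CAPS", "_CAPS"]],
}

tokens = "220 Lincoln Boulevard".split()
print(any(matches(p, tokens) for p in prototype["start"]))                # True
print(any(matches(p, tokens, from_end=True) for p in prototype["end"]))   # True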
Because we do not have a source of explicit negative examples for each data field, we chose to frame the learning problem as learning from positive examples alone. The learning problem is to find the patterns that describe many of the positive examples of data but are highly unlikely to describe a random sequence of tokens. In this statement, all possible randomly generated sequences of tokens are treated as implicit negative examples.
Stated this way, the learning problem is to find statistically significant patterns, i.e., those that describe a significant number of positive examples and are unlikely to have been generated by chance. We use hypothesis testing at the 5% significance level to determine whether a sequence of tokens is significant. DataPro (Lerman & Minton 2000) is a greedy algorithm that finds significant patterns in a set of token sequences.
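As a rough illustration of the kind of significance test involved, the sketch below asks whether a pattern occurs more often among the examples than would be expected by chance, using a normal approximation to the binomial distribution. The baseline probability and the test details are assumptions for illustration; they are not the exact formulation in (Lerman & Minton 2000).

import math

# Hypothetical one-sided significance test for a candidate pattern.
def is_significant(k: int, n: int, p_chance: float) -> bool:
    """k of n examples match the pattern; p_chance is the probability that a
    random token sequence would match it. Reject the null hypothesis (the
    pattern arose by chance) if the observed count is significantly larger."""
    mean = n * p_chance
    std = math.sqrt(n * p_chance * (1.0 - p_chance))
    if std == 0.0:
        return k > mean
    z = (k - mean) / std
    return z > 1.645   # one-sided z threshold for a 5% significance level

# e.g. 40 of 100 addresses start with "_NUM _CAPS", while a random two-token
# sequence would match with probability 0.05 (invented numbers):
print(is_significant(k=40, n=100, p_chance=0.05))  # True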
As described above, Web wrappers use the layout of HTML pages in order to extract data, so if the layout of the page changes, the wrapper may stop extracting correctly. The wrapper verification task is to automatically determine whether the wrapper is correctly extracting data from an information source (Kushmerick 1999). The task can be stated in the following way: given a set of old examples that are known to be correct and a set of test examples extracted by the wrapper from new pages, do the patterns learned from the old examples describe the new examples in the same proportion? If yes, the wrapper is extracting correctly; otherwise, the source may have changed. The verification procedure works as follows (a sketch appears after the list):
- Learn data prototypes from the old examples.
- Compute the proportion of the old examples described by each pattern.
- Compute the distribution of the new examples over the patterns.
- If the two distributions are statistically different (according to a goodness-of-fit test), the wrapper is not extracting correctly.
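A minimal sketch of this comparison, using Pearson's chi-square goodness-of-fit statistic, is given below. The helper names and the structure are assumptions for illustration; the actual verification system may use a different test.

# Hypothetical sketch of wrapper verification via a goodness-of-fit test.
# pattern_counts() relies on matches() from the earlier data prototype sketch.
def pattern_counts(examples: list[str], patterns: list[list[str]]) -> list[int]:
    """Count how many examples are described by each pattern (last bin: no match)."""
    counts = [0] * (len(patterns) + 1)
    for ex in examples:
        tokens = ex.split()
        for i, p in enumerate(patterns):
            if matches(p, tokens):
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts

def wrapper_ok(old: list[str], new: list[str], patterns: list[list[str]],
               critical_value: float) -> bool:
    """Pearson chi-square comparison of the old and new distributions over patterns.
    critical_value depends on the degrees of freedom (number of patterns)."""
    old_counts = pattern_counts(old, patterns)
    new_counts = pattern_counts(new, patterns)
    chi2 = 0.0
    for o, n in zip(old_counts, new_counts):
        expected = o / len(old) * len(new)   # expected count if nothing changed
        if expected > 0:
            chi2 += (n - expected) ** 2 / expected
    return chi2 <= critical_value            # small statistic: extraction looks correct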
Once we have determined that the wrapper is not working correctly, the next task is to rebuild it automatically. This task involves two steps (a sketch of the first step follows the list):
- Data extraction: identify correct examples of the data on the new pages.
- Wrapper induction: feed these examples, along with the new pages, to the wrapper induction algorithm to learn new extraction rules.
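As an illustration of the first step, the sketch below scans candidate token windows on a new page and keeps those that the learned data prototype describes; these become the labeled examples handed to the wrapper induction algorithm. The windowing scheme is an assumption for illustration, and matches() and the prototype structure come from the earlier sketches.

# Hypothetical sketch: find likely examples of a data field on a new page.
def find_candidates(page_text: str, prototype: dict,
                    min_len: int = 2, max_len: int = 6) -> list[str]:
    """Return token windows whose start and end match the data prototype."""
    tokens = page_text.split()
    candidates = []
    for i in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            window = tokens[i:i + length]
            if len(window) < length:
                break
            starts = any(matches(p, window) for p in prototype["start"])
            ends = any(matches(p, window, from_end=True) for p in prototype["end"])
            if starts and ends:
                candidates.append(" ".join(window))
    return candidates

# These candidates, together with the new pages, would then be fed to the
# wrapper induction algorithm to relearn the extraction rules.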
Complex pages, like those that contain tables and lists, present a special challenge to the wrapper reinduction algorithm. This is because with the current wrapper induction method, the first and last elements of the list have to be unambiguously identified, in addition to at least three consecutive list elements. We are developing automatic methods to handle complex pages.