
Data Extraction: Wrapper Learning

With the expansion of the Web, computer users have gained access to a large variety of comprehensive information repositories. However, the Web is based on a browsing paradigm that makes it difficult to retrieve and integrate data from multiple sources. The most recent generation of information agents addresses this problem by enabling information from pre-specified sets of Web sites to be accessed via database-like queries. For each such Web site, the agent generally relies on a wrapper that extracts the information from a collection of similar-looking Web pages.

Each wrapper consists of a set of extraction rules and the code required to apply those rules. Some systems depend on humans to write the necessary extraction rules, which is undesirable for several reasons. Writing extraction rules is tedious and time-consuming, and it requires a high level of expertise. These difficulties are multiplied when an application domain involves a large number of existing sources or when the format of the source documents changes over time.
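
To make the idea of "extraction rules plus the code that applies them" concrete, here is a minimal sketch of a hand-written wrapper. It assumes a very simple rule form (a field sits between two literal landmark strings); the class name, function names, and sample page are hypothetical and are not taken from our system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ExtractionRule:
    """Hypothetical rule: the field lies between two literal landmarks."""
    field: str
    start_landmark: str
    end_landmark: str


def apply_rule(rule: ExtractionRule, page: str) -> Optional[str]:
    """Return the text between the landmarks, or None if either is missing."""
    start = page.find(rule.start_landmark)
    if start == -1:
        return None
    start += len(rule.start_landmark)
    end = page.find(rule.end_landmark, start)
    if end == -1:
        return None
    return page[start:end].strip()


def apply_wrapper(rules: List[ExtractionRule], page: str) -> Dict[str, Optional[str]]:
    """A wrapper is just the rule set plus this code that applies it."""
    return {rule.field: apply_rule(rule, page) for rule in rules}


# Made-up page and hand-written rules, for illustration only.
page = "<html><b>Name:</b> Joe's Diner <br><b>Phone:</b> (310) 555-0100 <br></html>"
rules = [
    ExtractionRule("name", "<b>Name:</b>", "<br>"),
    ExtractionRule("phone", "<b>Phone:</b>", "<br>"),
]
print(apply_wrapper(rules, page))
```

Every landmark above had to be chosen by a person inspecting the page source, and each one breaks as soon as the site changes its markup, which is the maintenance burden described above.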

Wrapper induction algorithms address these problems by learning (inducing) the extraction rules from user-labeled examples of the data of interest. We created a new machine learning method for wrapper induction that enables unsophisticated users to painlessly turn Web pages into relational information sources. Based on just a few labeled examples, the system learns highly expressive, hierarchical extraction rules. An intuitive overview of how the system works is given in our JAAMAS-2001 paper. The 30 test domains from the JAAMAS paper can be obtained from the RISE repository.
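
As a rough illustration of inducing a rule from labeled examples, the sketch below learns a pair of landmarks from the text that surrounds each labeled field. This is a deliberately simplified stand-in (a flat left/right delimiter learner), not the hierarchical method described in the JAAMAS-2001 paper; the function names and the sample pages are assumptions made for the example.

```python
import os
from typing import List, Optional, Tuple

Rule = Tuple[str, str]  # (start_landmark, end_landmark)


def longest_common_suffix(strings: List[str]) -> str:
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def label(page: str, value: str) -> Tuple[str, int, int]:
    """Simulate a user highlighting `value` on `page`."""
    start = page.index(value)
    return page, start, start + len(value)


def induce_rule(examples: List[Tuple[str, int, int]]) -> Rule:
    """Generalize the context before and after each labeled field."""
    befores = [page[:start] for page, start, _ in examples]
    afters = [page[end:] for page, _, end in examples]
    return longest_common_suffix(befores), os.path.commonprefix(afters)


def extract(rule: Rule, page: str) -> Optional[str]:
    """Apply an induced rule to an unseen page."""
    start_landmark, end_landmark = rule
    start = page.find(start_landmark)
    if start == -1:
        return None
    start += len(start_landmark)
    end = page.find(end_landmark, start)
    return None if end == -1 else page[start:end]


# Two made-up labeled pages; the user marks only the phone number.
examples = [
    label("<p>Name: <b>Killer Shrimp</b><br>Phone: <i>555-1212</i></p>", "555-1212"),
    label("<p>Name: <b>Yala</b><br>Phone: <i>555-9876</i></p>", "555-9876"),
]
rule = induce_rule(examples)
new_page = "<p>Name: <b>Hop Li</b><br>Phone: <i>555-3434</i></p>"
print(rule)
print(extract(rule, new_page))
```

With only a few examples, a learner like this can overfit: if both labeled names happened to end in the same letter, that letter would be absorbed into the start landmark and the rule would fail on other pages. This is one reason the choice of training examples matters.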

After deploying the system, we quickly found that it is unrealistic to assume that a user is willing, and has the skills, to browse a large number of documents in order to identify a set of informative training examples. On difficult extraction tasks, where the Web pages contain various sorts of exceptions, the lack of informative examples leads to low extraction accuracy. To fix this problem, we created co-testing, which, given a few labeled examples and a pool of unlabeled ones, identifies the most informative unlabeled examples and asks the user to label them (see the AAAI-2000 paper for an overview of co-testing, and the ECAI-2000 paper for a detailed discussion of co-testing and wrapper induction).
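
The selection step can be sketched as follows, assuming the common formulation of co-testing in which two rules that look at a page from different "views" are applied to the unlabeled pool and the pages on which they disagree are sent to the user. In this toy version the two rules are fixed by hand rather than learned, and all names and pages are hypothetical.

```python
from typing import Callable, List, Optional

View = Callable[[str], Optional[int]]  # maps a page to the field's start offset


def forward_start(page: str) -> Optional[int]:
    """View 1: scan forward to the first 'Phone: <i>' landmark."""
    landmark = "Phone: <i>"
    i = page.find(landmark)
    return None if i == -1 else i + len(landmark)


def backward_start(page: str) -> Optional[int]:
    """View 2: scan backward from the end of the page to the last '<i>'."""
    landmark = "<i>"
    i = page.rfind(landmark)
    return None if i == -1 else i + len(landmark)


def contention_points(unlabeled: List[str], view_a: View, view_b: View) -> List[str]:
    """Pages on which the two views disagree; at least one rule is wrong there."""
    return [page for page in unlabeled if view_a(page) != view_b(page)]


# Made-up pool of unlabeled pages; the second one contains an exception.
unlabeled = [
    "Name: <b>Yala</b> Phone: <i>555-1212</i>",
    "Name: <b>Hop Li</b> Phone: <i>555-3434</i> Fax: <i>555-3435</i>",
    "Name: <b>Rex</b> Phone: <i>555-7878</i>",
]
for page in contention_points(unlabeled, forward_start, backward_start):
    print("ask the user to label:", page)
```

Only the page with the extra fax field is queried. The user does not have to browse the whole pool to find the exception, and labeling that page gives the learner exactly the example it was missing.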

Data Extraction: Unstructured Extraction

The wrapper methods provide extraction techniques for semi-structured sources, such as similar-looking Web pages, but a great deal of data on the World Wide Web exists in an unstructured and ungrammatical form. For example, a user selling an item on eBay will not necessarily use the same textual layout as other posts about the same item. Nor will this user necessarily conform to the rules of language in their post. This lack of structure and grammar makes it difficult to develop wrapper methods for extraction on this type of data.

To overcome this lack of characteristics to exploit for extraction, we infuse the extraction process with outside knowledge, which we call reference sets. A reference set is a collection of entities and their associated attributes. For example, a reference set of cars would include all known car makes, models, trim options and years. A reference set could come from a set of pages on the Web, a database, or a knowledge base or ontology from the Semantic Web.

Exploiting these reference sets is a two-step process. First, we take the record we are using for extraction, called the post, and match it to a member of the reference set. This yields a set of attributes we can look for in the post. Second, we extract the items from the post that are most similar to the attributes of the matching reference set member.

As an example, consider the posts from the website www.BiddingForTravel.com, shown in the table Example Posts. These posts contain information that a user typed in about a hotel, including useful attributes such as the hotel name and the hotel area. Once we extract these attributes, we can query the data set on them and derive useful conclusions. Now, suppose we have a reference set such as the one shown in the table Reference Set of Hotels, which contains the hotel name and the hotel area. By exploiting record linkage methods, we can find the record in this reference set that best matches each post, which in turn yields the attributes we can look for in the post. On the interactive version of this page, clicking a post in the Example Posts table highlights the best-matching member of the reference set in the Reference Set of Hotels table and shows the extracted results in the Extracted Information table.


Example Posts
Post from www.BiddingForTravel.com
$25 winning bid at holiday inn sel. univ. ctr.
4* Hyatt DT 8/18 $40 1 nite
Hol. Inn Greentree, $40 2/1

Reference Set of Hotels
Hotel Name | Hotel Area
Holiday Inn | Greentree
Holiday Inn Select | University Center
Hyatt Regency | Downtown

Extracted Information
Extracted Hotel Name | Extracted Hotel Area

By matching a post to a member of the reference set, we gain clues that help overcome the ungrammatical and unstructured nature of the posts. For more details on the algorithms, see either the IJCAI-2005 paper or Matthew Michelson's Master's Thesis.
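
The two-step process can be sketched on the example posts and hotel reference set above. The matching and similarity measures here are a simple stand-in heuristic (a token counts as a match if it is equal to, or looks like an abbreviation of, a reference token) chosen for illustration; they are not the record linkage and extraction algorithms from the IJCAI-2005 paper, and all function names are hypothetical.

```python
from typing import Dict, List

REFERENCE_SET = [
    {"name": "Holiday Inn", "area": "Greentree"},
    {"name": "Holiday Inn Select", "area": "University Center"},
    {"name": "Hyatt Regency", "area": "Downtown"},
]

POSTS = [
    "$25 winning bid at holiday inn sel. univ. ctr.",
    "4* Hyatt DT 8/18 $40 1 nite",
    "Hol. Inn Greentree, $40 2/1",
]


def tokens(text: str) -> List[str]:
    """Split on whitespace and drop surrounding periods and commas."""
    return [t.strip(".,") for t in text.split() if t.strip(".,")]


def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in order inside `long`."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def similar(a: str, b: str) -> bool:
    """Assumed heuristic: equal, or one token abbreviates the other."""
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    short, long = sorted((a, b), key=len)
    return short[0] == long[0] and is_subsequence(short, long)


def match_score(post: str, record: Dict[str, str]) -> float:
    """Fraction of the record's tokens that have a similar token in the post."""
    post_tokens = tokens(post)
    record_tokens = [t for value in record.values() for t in tokens(value)]
    hits = sum(any(similar(r, p) for p in post_tokens) for r in record_tokens)
    return hits / len(record_tokens)


def best_match(post: str, reference_set: List[Dict[str, str]]) -> Dict[str, str]:
    """Step 1: link the post to its most similar reference set member."""
    return max(reference_set, key=lambda record: match_score(post, record))


def extract(post: str, record: Dict[str, str]) -> Dict[str, str]:
    """Step 2: keep the post tokens similar to each attribute of the match."""
    post_tokens = tokens(post)
    return {
        attr: " ".join(p for p in post_tokens
                       if any(similar(p, r) for r in tokens(value)))
        for attr, value in record.items()
    }


for post in POSTS:
    record = best_match(post, REFERENCE_SET)
    print(post, "->", record, "->", extract(post, record))
```

Note how the match supplies the missing context: "DT" on its own is meaningless, but once the post is linked to the Hyatt Regency record, it can be recognized as the hotel area "Downtown".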