Each wrapper consists of a set of extraction rules and the code required to apply those rules. Some systems depend on humans to write the necessary grammar rules. However, this is undesirable for several reasons. Writing extraction rules is tedious and time-consuming, and it requires a high level of expertise. These difficulties are multiplied when an application domain involves a large number of existing sources or when the format of the source documents changes over time.
Wrapper induction algorithms address these problems by learning (inducing) the extraction rules from user-labeled examples of useful data. We created a new machine learning method for wrapper induction that enables unsophisticated users to painlessly turn Web pages into relational information sources. Based on just a few labeled examples, the system learns highly expressive, hierarchical extraction rules. An intuitive overview of how the system works is given in our JAAMAS-2001 paper. The 30 test domains from the JAAMAS paper can be obtained from the RISE repository.
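To make the idea of inducing a rule from labeled examples concrete, here is a deliberately simplified sketch. It learns a flat left/right delimiter pair, not the hierarchical rules described above, and the function names (`induce_rule`, `apply_rule`) and sample pages are hypothetical illustrations rather than part of the system.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def induce_rule(examples):
    """Learn (left, right) delimiters from (page, labeled_value) pairs."""
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)  # first occurrence of the labeled value
        lefts.append(page[:start])
        rights.append(page[start + len(target):])
    left = common_suffix(lefts)          # text that always precedes the value
    right = os.path.commonprefix(rights)  # text that always follows the value
    if not left or not right:
        raise ValueError("examples share no common delimiters")
    return left, right


def apply_rule(rule, page):
    """Extract the text between the learned delimiters."""
    left, right = rule
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Two user-labeled examples (made-up pages for illustration).
labeled = [
    ("<b>Name:</b> Joe's Diner<br>Phone: 555-1234", "Joe's Diner"),
    ("<b>Name:</b> Thai Palace<br>Phone: 213-9876", "Thai Palace"),
]
rule = induce_rule(labeled)
extracted = apply_rule(rule, "<b>Name:</b> Blue Bayou<br>Phone: 310-0000")
print(rule, extracted)
```

With only these two examples the learned delimiters are as specific as the examples allow, so a page that deviates from both (for instance, one with no phone line) would make `apply_rule` fail. This is the kind of exception that motivates choosing informative training examples.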
After deploying the system, we quickly found that it is unrealistic to assume that a user is willing and able to browse a large number of documents in order to identify a set of informative training examples. On difficult extraction tasks, where the Web pages contain various sorts of exceptions, the lack of informative examples leads to low extraction accuracy. To address this problem, we created co-testing, which, given a few labeled examples and a pool of unlabeled ones, identifies the most informative unlabeled examples and asks the user to label them (see the AAAI-2000 paper for an overview of co-testing, and the ECAI-2000 paper for a detailed discussion of co-testing and wrapper induction).
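The selection step can be illustrated with a small sketch. It assumes that an unlabeled page is informative when two independently built extraction rules disagree on it, a criterion the paragraph above does not spell out. The two hand-written rules, the function `contention_points`, and the sample pages are hypothetical stand-ins for learned rules and real data.

```python
def forward_rule(page):
    """Scan from the start: take the text after the first 'Phone: '."""
    start = page.find("Phone: ") + len("Phone: ")
    end = page.find("<", start)
    return page[start:end]


def backward_rule(page):
    """Scan from the end: take the last word before the final '</p>'."""
    end = page.rfind("</p>")
    start = page.rfind(" ", 0, end) + 1
    return page[start:end]


def contention_points(rules, unlabeled_pool):
    """Return the pages on which the two rules extract different strings."""
    first, second = rules
    return [page for page in unlabeled_pool if first(page) != second(page)]


# Unlabeled pool (made-up pages): one regular page and two exceptions.
pool = [
    "<p>Name: Joe's</p><p>Phone: 555-1234</p>",
    "<p>Name: Thai Palace</p><p>Phone: 555-9876</p><p>Fax: 555-0000</p>",
    "<p>Name: Blue Bayou</p><p>Phone: (800) 555-4321</p>",
]
queries = contention_points((forward_rule, backward_rule), pool)
for page in queries:
    print("Please label:", page)
```

Both rules agree on the regular page, so it is not shown to the user. They disagree on the page with an extra fax line and on the page with an area code, which are exactly the pages where at least one rule must be wrong and a label is most useful.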
To overcome this lack of characteristics for extraction, we infuse the extraction process with outside knowledge, which we call reference sets. A reference set is a collection of entities and their associated attributes. For example, a reference set of cars would include all known car makes, models, trim options and years. A reference set could come from a set of pages on the Web, a database, or a knowledge base and ontology from the Semantic Web.
Exploiting these reference sets is a two-step process. First, we take the record we are using for extraction, called the post, and we match it to a member of the reference set. This yields a set of attributes we can look for in the post. The second step extracts the items from the post that are most similar to the attributes from the matching reference set member.
As an example, consider the posts from the website www.BiddingForTravel.com, shown in the table Example Posts. These posts include information a user typed in regarding a hotel. They contain useful attributes such as the hotel name and the area of that hotel. Once we extract these attributes, we can query the data set on them and derive useful conclusions. Now, suppose we have a reference set such as that shown in the table Reference Set of Hotels. This reference set contains the hotel name and the hotel area. By exploiting methods of record linkage, we can find the record in this reference set that best matches each post, which in turn yields the attributes we can look for in the post. On the interactive version of this page, clicking a post in the Example Posts table highlights the best matching member of the reference set in the Reference Set of Hotels table and shows the extracted results in the table Extracted Information.
Example Posts

| Post from www.BiddingForTravel.com |
|---|
| $25 winning bid at holiday inn sel. univ. ctr. |
| 4* Hyatt DT 8/18 $40 1 nite |
| Hol. Inn Greentree, $40 2/1 |

Reference Set of Hotels

| Hotel Name | Hotel Area |
|---|---|
| Holiday Inn | Greentree |
| Holiday Inn Select | University Center |
| Hyatt Regency | Downtown |

Extracted Information

| Extracted Hotel Name | Extracted Hotel Area |
|---|---|
By matching a post to a member of the reference set, we have included clues that overcome the ungrammatical and unstructured nature of the posts. For more details on the algorithms, see either the IJCAI-2005 paper or Matthew Michelson's Master's Thesis.
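The two-step process can be sketched on the hotel example as follows. This is a minimal illustration, not the record linkage or extraction algorithm from the papers: the similarity test (a post token counts as an abbreviation of a reference token if it starts with the same letter and its letters appear in order) and all function names are assumptions made for this sketch.

```python
import re

REFERENCE_SET = [
    {"name": "Holiday Inn", "area": "Greentree"},
    {"name": "Holiday Inn Select", "area": "University Center"},
    {"name": "Hyatt Regency", "area": "Downtown"},
]


def normalize(token):
    """Lowercase a token and drop everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def abbreviates(short, full):
    """Assumed similarity test: same first letter, letters in order."""
    if not short or not full or short[0] != full[0]:
        return False
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def attribute_tokens(record, attribute):
    return [normalize(t) for t in record[attribute].split()]


def coverage(post, record):
    """Fraction of the record's tokens that some post token abbreviates."""
    post_tokens = [normalize(t) for t in post.split()]
    ref_tokens = [t for attr in record for t in attribute_tokens(record, attr)]
    hits = sum(any(abbreviates(p, r) for p in post_tokens) for r in ref_tokens)
    return hits / len(ref_tokens)


def best_match(post, reference_set):
    """Step 1: pick the reference record that best covers the post."""
    return max(reference_set, key=lambda record: coverage(post, record))


def extract(post, reference_set):
    """Step 2: pull out the post tokens that resemble each attribute."""
    record = best_match(post, reference_set)
    result = {}
    for attr in record:
        targets = attribute_tokens(record, attr)
        found = [raw.strip(",") for raw in post.split()
                 if any(abbreviates(normalize(raw), t) for t in targets)]
        result[attr] = " ".join(found)
    return result


post = "$25 winning bid at holiday inn sel. univ. ctr."
print(best_match(post, REFERENCE_SET))
print(extract(post, REFERENCE_SET))
```

For the first example post, this sketch selects the Holiday Inn Select record and extracts "holiday inn sel." as the name and "univ. ctr." as the area. Note that `best_match` always returns some record, even for a post that matches nothing, so a practical version would need a minimum-score threshold.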