With the expansion of the Web, computer users have gained access to a large variety of comprehensive information repositories. However, the Web is based on a browsing paradigm that makes it difficult to retrieve and integrate data from multiple sources. The most recent generation of information agents addresses this problem by enabling information from pre-specified sets of Web sites to be accessed via database-like queries. For each such Web site, an agent generally relies on a wrapper that extracts the information from a collection of similar-looking Web pages.
Each wrapper consists of a set of extraction rules and the code required to apply those rules. Some systems depend on humans to write the necessary rules by hand, which is undesirable for several reasons: writing extraction rules is tedious, time-consuming, and requires a high level of expertise. These difficulties are multiplied when an application domain involves a large number of sources, or when the format of the source documents changes over time.
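To make this concrete, here is a minimal sketch of what a single hand-written extraction rule might look like. The page fragment, the `Price:` landmark, and the rule itself are all hypothetical illustrations, not rules from the actual system; they merely show why hand-coding such rules is brittle when the page format changes.

```python
import re

def extract_price(page):
    """Hypothetical hand-written extraction rule: skip to the 'Price:'
    landmark, then capture the boldfaced text that follows it.
    If the site ever drops the <b> tag, this rule silently breaks."""
    match = re.search(r"Price:\s*<b>([^<]+)</b>", page)
    return match.group(1) if match else None

page = "<html><body>Item: Widget<br>Price: <b>$9.99</b></body></html>"
print(extract_price(page))  # → $9.99
```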
Wrapper induction algorithms solve these problems by learning the extraction rules from user-labeled examples of the data to be extracted. We created a new machine learning method for wrapper induction that enables unsophisticated users to painlessly turn Web pages into relational information sources. Based on just a few labeled examples, the system learns highly expressive, hierarchical extraction rules. An intuitive overview of how the system works is given in Hierarchical wrapper induction for semistructured information sources (JAAMAS). The 30 test domains from the JAAMAS paper can be obtained from the RISE repository.
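The core idea of learning extraction rules from labeled examples can be sketched in a few lines. The toy inducer below finds the longest left and right delimiters shared by all labeled occurrences of the target field; this is a deliberate simplification for illustration, not the hierarchical rule learner described in the JAAMAS paper, and the restaurant-listing pages are made up.

```python
def common_prefix(strings):
    """Longest common prefix of a list of strings."""
    s1, s2 = min(strings), max(strings)
    i = 0
    while i < len(s1) and s1[i] == s2[i]:
        i += 1
    return s1[:i]

def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]

def induce_rule(examples):
    """Toy wrapper induction: from (page, target) pairs, learn a
    (left_delimiter, right_delimiter) pair that brackets the target."""
    prefixes, suffixes = [], []
    for page, target in examples:
        start = page.index(target)
        prefixes.append(page[:start])
        suffixes.append(page[start + len(target):])
    return common_suffix(prefixes), common_prefix(suffixes)

def apply_rule(rule, page):
    """Extract the text between the learned delimiters."""
    left, right = rule
    start = page.index(left) + len(left)
    return page[start:page.index(right, start)]

examples = [
    ("<tr><td>Joe's</td><td>(310) 555-1111</td></tr>", "(310) 555-1111"),
    ("<tr><td>Al's</td><td>(213) 555-2222</td></tr>", "(213) 555-2222"),
]
rule = induce_rule(examples)
print(apply_rule(rule, "<tr><td>Bo's</td><td>(818) 555-3333</td></tr>"))
```

Given the two labeled pages, the learned rule extracts the phone number from a previously unseen page of the same format.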
After deploying the system, we quickly found that it is unrealistic to assume that a user is willing, and has the skills, to browse a large number of documents in order to identify a set of informative training examples. On difficult extraction tasks, where the Web pages contain various sorts of exceptions, the lack of informative examples leads to low extraction accuracy. To fix this problem, we created co-testing, which, given a few labeled examples and a pool of unlabeled ones, identifies the most informative unlabeled examples and asks the user to label them (see Selective sampling with redundant views for an overview of co-testing, and Selective sampling with naive co-testing: Preliminary results for a detailed discussion of co-testing applied to wrapper induction).
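The query-selection step of co-testing can be sketched as follows. One learner is trained per redundant view, and the unlabeled examples on which the learners disagree (the contention points) are proposed as queries. The two numeric features, the 1-nearest-neighbor base learners, and the data are all hypothetical stand-ins chosen to keep the sketch self-contained; in the wrapper-induction setting the two views are the forward and backward extraction rules.

```python
def train_1nn(labeled, view):
    """Train a 1-nearest-neighbor classifier that sees only one view
    (one feature) of each example. labeled = [(features, label), ...]."""
    def classify(x):
        nearest = min(labeled, key=lambda ex: abs(ex[0][view] - x[view]))
        return nearest[1]
    return classify

def cotesting_query(labeled, unlabeled):
    """Naive co-testing: train one learner per view, then return the
    indices of the contention points -- unlabeled examples on which the
    two views disagree. These are the most informative queries to show
    the user for labeling."""
    h1 = train_1nn(labeled, view=0)
    h2 = train_1nn(labeled, view=1)
    return [i for i, x in enumerate(unlabeled) if h1(x) != h2(x)]

# Each example: ((feature_in_view_1, feature_in_view_2), label)
labeled = [((0.1, 0.2), "neg"), ((0.9, 0.8), "pos")]
unlabeled = [(0.15, 0.25), (0.2, 0.9), (0.85, 0.75)]
print(cotesting_query(labeled, unlabeled))  # → [1]
```

Only the second unlabeled example is a contention point: the view-1 learner calls it negative while the view-2 learner calls it positive, so it is the one worth asking the user about.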