Publications

Accurately and reliably extracting data from the web: A machine learning approach

Abstract

A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.

Date
September 11, 2025
Authors
Craig A Knoblock, Kristina Lerman, Steven Minton, Ion Muslea
Journal
Intelligent exploration of the web
Pages
275-287
Publisher
Physica-Verlag HD