Accurately and reliably extracting data from the web: A machine learning approach

Abstract

A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.

Date: November 12, 2025
Authors: Craig A Knoblock, Kristina Lerman, Steven Minton, Ion Muslea
Journal: Intelligent exploration of the web
Pages: 275-287
Publisher: Physica-Verlag HD

Information Sciences Institute