A machine learning approach to accurately and reliably extracting data from the web

Abstract

A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.
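
As a rough illustration of the pipeline the abstract describes (extract fields from a page, emit XML, then check that the wrapper still returns plausible data), here is a minimal sketch. It is not the authors' system: the page fragment, the landmark rules, and the format patterns are all hypothetical, and the rules are written by hand here, whereas the paper's tools learn them from examples.

```python
import re
from xml.etree import ElementTree as ET

# Hypothetical page fragment, invented for illustration (not from the paper).
PAGE = "<html><b>Name:</b> Cafe Roma <br><b>Phone:</b> 555-0100 <br></html>"

# Hypothetical hand-written rules: each field is the text found between a
# start landmark and an end landmark. The paper learns such rules instead.
RULES = {
    "name": ("<b>Name:</b>", "<br>"),
    "phone": ("<b>Phone:</b>", "<br>"),
}

# Hypothetical format patterns used as a simple stand-in for wrapper verification.
PATTERNS = {"name": r"[A-Za-z ]+", "phone": r"\d{3}-\d{4}"}


def extract(page, rules):
    """Return a dict of field -> extracted text, or None if a landmark is missing."""
    record = {}
    for field, (start, end) in rules.items():
        i = page.find(start)
        if i < 0:
            record[field] = None
            continue
        i += len(start)
        j = page.find(end, i)
        record[field] = page[i:j].strip() if j >= 0 else None
    return record


def to_xml(record):
    """Serialize the extracted record as a small XML document."""
    root = ET.Element("record")
    for field, value in record.items():
        ET.SubElement(root, field).text = value or ""
    return ET.tostring(root, encoding="unicode")


def verify(record, patterns):
    """Check that every field was found and matches its expected format."""
    return all(
        value is not None and re.fullmatch(patterns[field], value) is not None
        for field, value in record.items()
    )


record = extract(PAGE, RULES)
print(to_xml(record))
print(verify(record, PATTERNS))
```

If the site changes its layout, for example by renaming the phone label, `extract` returns `None` for that field and `verify` fails. That failure is the kind of signal a system could use to trigger re-learning of the rules, which this sketch does not attempt.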

Date: October 14, 2025
Authors: C Knoblock, Kristina Lerman, Steven Minton, Ion Muslea
Journal: Proceedings of the IJCAI Workshop on Adaptive Text Extraction and Mining


Information Sciences Institute
