Publications

Automatic data extraction from lists and tables in web sources

Abstract

We describe a technique for extracting data from lists and tables and grouping it by rows and columns. The process is fully automatic and relies only on very general assumptions about the structure of the list. We have developed a suite of unsupervised learning algorithms that induce the structure of lists by exploiting regularities both in the format of the pages and in the data they contain. The tools used include AutoClass, for automatic classification of data, and grammar induction for regular languages. The approach was tested on 14 Web sources providing diverse data types; for 10 of these sources it correctly found the lists and partitioned the data into rows and columns.
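To make the idea of exploiting format regularities concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it uses neither AutoClass nor grammar induction, and the function names `separator_tokens` and `split_into_columns`, along with the sample rows, are hypothetical. It assumes the list rows have already been isolated as strings and that columns are delimited by standalone punctuation tokens that occur equally often in every row.

```python
from collections import Counter


def separator_tokens(rows):
    """Return punctuation-only tokens that occur equally often in every row.

    Assumption (not from the paper): a token that appears the same nonzero
    number of times in each row is treated as a column separator.
    """
    if not rows:
        return []
    counts = [
        Counter(tok for tok in row.split()
                if all(not ch.isalnum() for ch in tok))
        for row in rows
    ]
    shared = set(counts[0])
    for c in counts[1:]:
        shared &= set(c)
    return sorted(tok for tok in shared
                  if len({c[tok] for c in counts}) == 1)


def split_into_columns(rows):
    """Split each row into cells at the inferred separator tokens."""
    seps = set(separator_tokens(rows))
    table = []
    for row in rows:
        cells, current = [], []
        for tok in row.split():
            if tok in seps:
                # A separator closes the current cell.
                cells.append(" ".join(current))
                current = []
            else:
                current.append(tok)
        cells.append(" ".join(current))
        table.append(cells)
    return table


# Hypothetical rows, as they might look after stripping markup from a page.
rows = [
    "Ann Lee | 310-555-0101 | Los Angeles",
    "Bob Ray | 212-555-0199 | New York",
    "Cy Dunn | 617-555-0142 | Boston",
]
table = split_into_columns(rows)
```

The sketch covers only the formatting side of the problem. The approach described in the abstract also draws on regularities in the data itself, which this example does not attempt.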

Date
2001
Authors
Kristina Lerman, Craig Knoblock, Steven Minton
Journal
IJCAI-2001 Workshop on Adaptive Text Extraction and Mining
Volume
98