Software: Karma

Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs.

Github | Project Page

MapFinder: Harvesting maps on the Web

Maps are among the most valuable documents for gathering geospatial information about a region. We use a Content-Based Image Retrieval (CBIR) technique to build MapFinder, an accurate and scalable system that can discover standalone images, as well as images embedded within documents on the Web, that are maps. The implementation provided here can extract WaterFilling features from images and classify a given image as a map or non-map. We also provide the data we collected for our experiments.

Download Code | Download Data (1.5 GB) | MapFinder Project Paper
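The classification step can be illustrated with a toy sketch: extract a small feature vector from an image's edge map and label it by nearest neighbor against labeled exemplars. Note the features below (edge density and longest horizontal run, chosen because maps tend to contain long thin line segments) are simplified stand-ins for the actual WaterFilling features, and the exemplars are invented for illustration.

```python
from math import dist  # Euclidean distance between feature vectors (Python 3.8+)

def edge_features(edge_map):
    """Features from a binary edge map: density and longest horizontal run."""
    h, w = len(edge_map), len(edge_map[0])
    density = sum(sum(row) for row in edge_map) / (h * w)
    # Maps tend to contain long thin line segments (roads, borders), so the
    # longest horizontal run of edge pixels is a crude map-vs-photo signal.
    longest = 0
    for row in edge_map:
        run = 0
        for px in row:
            run = run + 1 if px else 0
            longest = max(longest, run)
    return (density, longest / w)

def classify(edge_map, exemplars):
    """1-nearest-neighbor over (feature_vector, label) exemplars."""
    feats = edge_features(edge_map)
    return min(exemplars, key=lambda ex: dist(feats, ex[0]))[1]

# Tiny labeled exemplars: long lines (map-like) vs. scattered edges.
map_like   = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
photo_like = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
exemplars = [(edge_features(map_like), "map"),
             (edge_features(photo_like), "non-map")]
```

A new edge map is then classified by whichever exemplar's features lie closest; the real system works the same way at a high level, but over WaterFilling features computed from full-size edge maps.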

ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on the Web

The project presents two implementations for performing information extraction from unstructured, ungrammatical text on the Web, such as classified ads, auction listings, and forum posting titles. The ARX system takes an automatic approach to exploiting reference sets for this extraction, while the Phoebus system takes a machine learning approach.

Download | ARX Project Paper | Phoebus Project Paper
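The core idea of reference-set extraction can be sketched in a few lines: match a noisy post against a set of clean reference records and use the best-scoring record to interpret the post. The Jaccard token-overlap score and the car-ad reference set below are illustrative assumptions; the actual systems use richer string similarity measures, and Phoebus learns the matching model.

```python
# A minimal sketch in the spirit of ARX/Phoebus: align an ungrammatical post
# with a reference set of clean (make, model) records.

def tokens(text):
    return set(text.lower().split())

def extract(post, reference_set):
    """Return the reference record with the highest token overlap (Jaccard)."""
    post_toks = tokens(post)
    def score(record):
        rec_toks = tokens(" ".join(record))
        return len(post_toks & rec_toks) / len(post_toks | rec_toks)
    return max(reference_set, key=score)

reference_set = [
    ("Honda", "Civic"),
    ("Honda", "Accord"),
    ("Toyota", "Camry"),
]
# Even with the misspelled make, the shared "civic" token picks the right record.
best = extract("93 hnda civic 5spd runs great", reference_set)  # ("Honda", "Civic")
```

Once the best reference record is found, its clean attribute values stand in for the noisy text, which is what makes reference sets useful for extraction from ungrammatical sources.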

BSL: A system for learning blocking schemes

Record linkage is the problem of determining the matches between two data sources. As data sources grow larger, this task becomes difficult and expensive. To aid in this process, blocking efficiently generates candidate matches, which can then be examined in detail to determine whether or not they are true matches. Blocking is thus a preprocessing step that makes record linkage more scalable. The BSL system presented here learns blocking in the supervised setting of record linkage: given some training matches, it can discover rules (a blocking scheme) to efficiently generate candidate matches between the sets.

Github | Project Paper
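What a learned blocking scheme does at run time can be sketched as follows: each predicate maps a record to a key, and only record pairs that agree on some key become candidate matches. BSL's contribution is learning which predicates to combine from training matches; the two predicates below (first three letters of the last name, exact zip code) are illustrative assumptions, not learned rules.

```python
from collections import defaultdict

def block(records_a, records_b, key_fns):
    """Return candidate pairs from two sources that agree on any key."""
    candidates = set()
    for key_fn in key_fns:
        # Index one source by key, then probe it with the other source,
        # so we never compare the full cross product of records.
        index = defaultdict(list)
        for rec in records_b:
            index[key_fn(rec)].append(rec)
        for rec in records_a:
            for match in index[key_fn(rec)]:
                candidates.add((rec, match))
    return candidates

# Records as (last_name, zip) tuples; hypothetical blocking predicates.
key_fns = [
    lambda r: r[0][:3].lower(),  # first 3 letters of last name
    lambda r: r[1],              # exact zip code
]
a = [("Smith", "90210")]
b = [("Smyth", "90210"), ("Jones", "10001")]
candidates = block(a, b, key_fns)
```

Here the zip-code predicate recovers the likely match ("Smith"/"Smyth") despite the name misspelling, while "Jones" is never considered: the candidate set is smaller than the full cross product, which is the point of blocking.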

EIDOS: Efficiently Inducing Definitions for Online Sources

The EIDOS system addresses the problem of learning definitions for online sources, such as Web services. Given a new source, EIDOS invokes it and induces a definition that describes the source in terms of known sources, so that the new source can be automatically integrated and queried.

Github | Project Paper

Wrapper maintenance

Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically reinduce the wrapper. The data sets used for the experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over a period of a year.

Data Set
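The reinduction idea can be sketched with a deliberately simple wrapper model: a (prefix, suffix) pair of landmarks around the target field. When the old landmarks vanish after a layout change, we locate a known example value on the new page and read off fresh landmarks around it. This is a toy assumption-laden version; the actual approach uses learned patterns describing the field's content rather than a literal known value.

```python
def extract(page, wrapper):
    """Extract the text between the wrapper's prefix and suffix landmarks."""
    prefix, suffix = wrapper
    start = page.find(prefix)
    if start == -1:
        return None  # wrapper broken: its landmarks no longer appear
    start += len(prefix)
    end = page.find(suffix, start)
    return page[start:end] if end != -1 else None

def reinduce(page, known_value, context=8):
    """Derive new landmarks from the characters around a known value."""
    pos = page.find(known_value)
    if pos == -1:
        return None
    return (page[max(0, pos - context):pos],
            page[pos + len(known_value):pos + len(known_value) + context])

# Hypothetical pages: the site switches from bold labels to a table layout.
old_page = "<b>Price:</b> $19.99 <i>"
new_page = "<td>Price</td><td>$19.99</td>"
wrapper = ("</b> ", " <i>")          # works on old_page, fails on new_page
new_wrapper = reinduce(new_page, "$19.99")
```

After the layout change, `extract(new_page, wrapper)` returns None, signaling that the wrapper must be reinduced; `new_wrapper` then extracts the price from the new layout. The `context` width is a tunable assumption here: it must be wide enough to make the landmarks unambiguous on the page.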