Downloads
Software: Karma
Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs.
MapFinder: Harvesting maps on the Web
Maps are one of the most valuable documents for gathering geospatial information about a region. We use a Content Based Image Retrieval (CBIR) technique to built an accurate and scalable system, MapFinder, that can discover standalone images as well as images embedded within documents on the Web that are maps. The implementation provided here has the capabilities of extracting WaterFilling features from images, and classifying a given image as a map or nonmap. We also provide the data collected by us for our experiments.
ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on Web
The project presents two implementations for performing information extraction from unstructured, ungrammatical text on the Web such as classified ads, auction listings, and forum posting titles. The ARX system is an automatic approach to exploiting reference sets for this extraction. The Phoebus system presents a machine learning approach exploiting reference sets.
BSL: A system for learning blocking schemes
Record linkage is the problem of determining the matches between two data sources. However, as data sources become larger and larger, this task becomes difficult and expensive. To aid in this process, blocking is the efficient generation of candidate matches which can then be examined in detail later to determine whether or not they are true matches. So, blocking is a preprocessing step to make record linkage a more scalable process.The BSL system presented here does this in the supervised setting of record linkage. This means that given some training matches, it can discover rules (a blocking scheme) to efficiently generate candidate matches between the sets.
EIDOS: Efficiently Inducing Definitions for Online Sources
Record linkage is the problem of determining the matches between two data sources. However, as data sources become larger and larger, this task becomes difficult and expensive. To aid in this process, blocking is the efficient generation of candidate matches which can then be examined in detail later to determine whether or not they are true matches. So, blocking is a preprocessing step to make record linkage a more scalable process.The BSL system presented here does this in the supervised setting of record linkage. This means that given some training matches, it can discover rules (a blocking scheme) to efficiently generate candidate matches between the sets.