Software: Karma

Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs.

Github | Project Page

MapFinder: Harvesting maps on the Web

Maps are among the most valuable documents for gathering geospatial information about a region. We use a Content-Based Image Retrieval (CBIR) technique to build MapFinder, an accurate and scalable system that can discover standalone images, as well as images embedded within documents on the Web, that are maps. The implementation provided here can extract WaterFilling features from images and classify a given image as a map or non-map. We also provide the data we collected for our experiments.

Download Code | Download Data (1.5 GB) | MapFinder Project Paper
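The classification step can be illustrated with a toy sketch: extract a small feature vector from an image's edge map and label it by nearest neighbor against labeled exemplars. Note the features below (edge density and longest horizontal run, chosen because maps tend to contain long thin line segments) are simplified stand-ins for the actual WaterFilling features, and the exemplars are invented for illustration.

```python
from math import dist  # Euclidean distance between feature vectors (Python 3.8+)

def edge_features(edge_map):
    """Features from a binary edge map: density and longest horizontal run."""
    h, w = len(edge_map), len(edge_map[0])
    density = sum(sum(row) for row in edge_map) / (h * w)
    # Maps tend to contain long thin line segments (roads, borders), so the
    # longest horizontal run of edge pixels is a crude map-vs-photo signal.
    longest = 0
    for row in edge_map:
        run = 0
        for px in row:
            run = run + 1 if px else 0
            longest = max(longest, run)
    return (density, longest / w)

def classify(edge_map, exemplars):
    """1-nearest-neighbor over (feature_vector, label) exemplars."""
    feats = edge_features(edge_map)
    return min(exemplars, key=lambda ex: dist(feats, ex[0]))[1]

# Tiny labeled exemplars: long lines (map-like) vs. scattered edges.
map_like   = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
photo_like = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
exemplars = [(edge_features(map_like), "map"),
             (edge_features(photo_like), "non-map")]
```

A new edge map is then classified by whichever exemplar's features lie closest; the real system works the same way at a high level, but over WaterFilling features computed from full-size edge maps.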

ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on the Web

The project presents two implementations for performing information extraction from unstructured, ungrammatical text on the Web, such as classified ads, auction listings, and forum posting titles. The ARX system takes an automatic approach to exploiting reference sets for this extraction, while the Phoebus system takes a machine learning approach.

Download | ARX Project Paper | Phoebus Project Paper
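The core idea of reference-set extraction can be sketched in a few lines: match a noisy post against a set of clean reference records and use the best-scoring record to interpret the post. The Jaccard token-overlap score and the car-ad reference set below are illustrative assumptions; the actual systems use richer string similarity measures, and Phoebus learns the matching model.

```python
# A minimal sketch in the spirit of ARX/Phoebus: align an ungrammatical post
# with a reference set of clean (make, model) records.

def tokens(text):
    return set(text.lower().split())

def extract(post, reference_set):
    """Return the reference record with the highest token overlap (Jaccard)."""
    post_toks = tokens(post)
    def score(record):
        rec_toks = tokens(" ".join(record))
        return len(post_toks & rec_toks) / len(post_toks | rec_toks)
    return max(reference_set, key=score)

reference_set = [
    ("Honda", "Civic"),
    ("Honda", "Accord"),
    ("Toyota", "Camry"),
]
# Even with the misspelled make, the shared "civic" token picks the right record.
best = extract("93 hnda civic 5spd runs great", reference_set)  # ("Honda", "Civic")
```

Once the best reference record is found, its clean attribute values stand in for the noisy text, which is what makes reference sets useful for extraction from ungrammatical sources.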

BSL: A system for learning blocking schemes

Record linkage is the problem of determining the matches between two data sources. As data sources grow larger, this task becomes difficult and expensive. To aid in this process, blocking efficiently generates candidate matches, which can then be examined in detail to determine whether or not they are true matches. Blocking is thus a preprocessing step that makes record linkage more scalable. The BSL system presented here learns blocking in the supervised setting of record linkage: given some training matches, it can discover rules (a blocking scheme) to efficiently generate candidate matches between the sets.

Github | Project Paper
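What a learned blocking scheme does at run time can be sketched as follows: each predicate maps a record to a key, and only record pairs that agree on some key become candidate matches. BSL's contribution is learning which predicates to combine from training matches; the two predicates below (first three letters of the last name, exact zip code) are illustrative assumptions, not learned rules.

```python
from collections import defaultdict

def block(records_a, records_b, key_fns):
    """Return candidate pairs from two sources that agree on any key."""
    candidates = set()
    for key_fn in key_fns:
        # Index one source by key, then probe it with the other source,
        # so we never compare the full cross product of records.
        index = defaultdict(list)
        for rec in records_b:
            index[key_fn(rec)].append(rec)
        for rec in records_a:
            for match in index[key_fn(rec)]:
                candidates.add((rec, match))
    return candidates

# Records as (last_name, zip) tuples; hypothetical blocking predicates.
key_fns = [
    lambda r: r[0][:3].lower(),  # first 3 letters of last name
    lambda r: r[1],              # exact zip code
]
a = [("Smith", "90210")]
b = [("Smyth", "90210"), ("Jones", "10001")]
candidates = block(a, b, key_fns)
```

Here the zip-code predicate recovers the likely match ("Smith"/"Smyth") despite the name misspelling, while "Jones" is never considered: the candidate set is smaller than the full cross product, which is the point of blocking.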

EIDOS: Efficiently Inducing Definitions for Online Sources

The EIDOS system addresses the problem of learning definitions for online sources, such as Web services. Given a new source, EIDOS invokes it and induces a definition that describes the source in terms of known sources, so that the new source can be automatically integrated and queried.

Github | Project Paper

Wrapper maintenance

Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically reinduce the wrapper. The data sets used for the experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over a period of a year.

Data Set
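The reinduction idea can be sketched with a deliberately simple wrapper model: a (prefix, suffix) pair of landmarks around the target field. When the old landmarks vanish after a layout change, we locate a known example value on the new page and read off fresh landmarks around it. This is a toy assumption-laden version; the actual approach uses learned patterns describing the field's content rather than a literal known value.

```python
def extract(page, wrapper):
    """Extract the text between the wrapper's prefix and suffix landmarks."""
    prefix, suffix = wrapper
    start = page.find(prefix)
    if start == -1:
        return None  # wrapper broken: its landmarks no longer appear
    start += len(prefix)
    end = page.find(suffix, start)
    return page[start:end] if end != -1 else None

def reinduce(page, known_value, context=8):
    """Derive new landmarks from the characters around a known value."""
    pos = page.find(known_value)
    if pos == -1:
        return None
    return (page[max(0, pos - context):pos],
            page[pos + len(known_value):pos + len(known_value) + context])

# Hypothetical pages: the site switches from bold labels to a table layout.
old_page = "<b>Price:</b> $19.99 <i>"
new_page = "<td>Price</td><td>$19.99</td>"
wrapper = ("</b> ", " <i>")          # works on old_page, fails on new_page
new_wrapper = reinduce(new_page, "$19.99")
```

After the layout change, `extract(new_page, wrapper)` returns None, signaling that the wrapper must be reinduced; `new_wrapper` then extracts the price from the new layout. The `context` width is a tunable assumption here: it must be wide enough to make the landmarks unambiguous on the page.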