Exploiting secondary sources for unsupervised record linkage

Abstract

XML, Web services, and the Semantic Web have opened the door to new and exciting information integration applications. Information sources on the Web are controlled by different organizations or people, use different text formats, and contain varying inconsistencies. Therefore, any system that integrates information from different data sources must identify the entities these sources have in common. Data from many online sources does not contain enough information to link records accurately using state-of-the-art record linkage systems. These systems inherently rely on learning, which most of the time requires a user in the loop, to link records accurately across datasets. In this paper, we describe a novel approach that exploits additional data sources to design an unsupervised record linkage method. Our evaluation on real-world data sets shows that the performance of unsupervised learning in a record linkage system is on par with that of traditional supervised learning methods.

Date: August 30, 2004
Authors: Martin Michalowski, Snehal Thakkar, Craig A. Knoblock
Journal: 2004 VLDB Workshop on Information Integration on the Web

Information Sciences Institute
