Publications

Mining the Heterogeneous Transformations between Data Sources to Aid Record Linkage.

Abstract

Heterogeneous transformations are translations between strings that are not characterized by a single function. For example, nicknames, abbreviations, and synonyms are heterogeneous transformations, while edit distances are not. Such transformations are useful for information retrieval, information extraction, and text understanding. They are especially useful in record linkage, where the problem is to determine whether two records refer to the same entity by examining the similarities between their fields. However, heterogeneous transformations are usually created manually, with no assurance that they will be useful. This paper presents a data mining approach that discovers heterogeneous transformations between two data sets without labeled training data; the discovered transformations can then be used to aid record linkage. In addition to simple transformations, our algorithm finds combinatorial transformations, such as synonyms and abbreviations together. Our experiments demonstrate that our approach can discover many types of specialized transformations, and we show that exploiting these transformations improves record linkage accuracy. Our approach makes discovering and exploiting heterogeneous transformations more scalable and robust by reducing their dependence on domain knowledge and human effort.
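As a rough illustration of why such transformations help record linkage, the sketch below compares two fields token by token, first consulting a small lookup table of transformations and then falling back to a character-level similarity. It is not the mining algorithm described in the paper: the table entries, the function names (tokens_match, field_similarity), and the 0.85 threshold are all hypothetical, chosen only to show how a nickname pair that edit-style similarity misses can still be matched.

```python
from difflib import SequenceMatcher

# Hypothetical transformation table, for illustration only.
# In the paper's setting such pairs would be mined from the data sources.
TRANSFORMATIONS = {
    ("bob", "robert"),     # nickname
    ("st", "street"),      # abbreviation
    ("auto", "automobile"),
}


def tokens_match(a, b, table, threshold=0.85):
    """Return True if two tokens are equal, linked by a known
    transformation, or similar enough at the character level."""
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    if (a, b) in table or (b, a) in table:
        return True
    # Fallback: character-level similarity (assumed threshold).
    return SequenceMatcher(None, a, b).ratio() >= threshold


def field_similarity(x, y, table=frozenset()):
    """Fraction of tokens in x that match some token in y,
    normalized by the longer field's token count."""
    xs, ys = x.split(), y.split()
    if not xs or not ys:
        return 0.0
    matched = sum(1 for t in xs if any(tokens_match(t, u, table) for u in ys))
    return matched / max(len(xs), len(ys))


with_table = field_similarity("Bob Smith", "Robert Smith", TRANSFORMATIONS)
without_table = field_similarity("Bob Smith", "Robert Smith")
```

In this sketch, the "Bob"/"Robert" pair is only recognized when the table is supplied, so the field score is higher with the table than without it.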

Date
September 22, 2025
Authors
Matthew Michelson, Craig A. Knoblock
Conference
IC-AI
Pages
422-428