Publications

An Overview of Entity Resolution: From Classical Approaches to Large Language Models

Abstract

Entity Resolution (ER) is the problem of determining which records in one or more datasets refer to the same underlying real-world entity. Despite its half-century history, ER is still not completely solved and remains critical across numerous domains: healthcare systems must link patient records across institutions, e-commerce platforms deduplicate product catalogs, knowledge bases extract and consolidate entities from diverse sources, and fraud detection systems identify related accounts and transactions. This overview is designed for practitioners and advanced data scientists who must solve ER problems on real datasets. Rather than surveying all ER algorithms, a task more appropriate for researchers developing new methods, we synthesize the key concepts, design decisions, and practical approaches that have proven effective over decades. We describe the universal workflow common to nearly all ER systems: blocking for computational efficiency, feature-based matching, and evaluation methodology. We explain how this workflow extends from tabular data to semi-structured and graph-based data. We briefly discuss recent research trends including scalability techniques, domain-specific approaches, collective methods, and privacy-preserving solutions. Throughout, we emphasize the practical aspects of ER and ground the discussion with concrete examples from real applications.

Date
2017
Authors
Mayank Kejriwal
Source
Mayank Kejriwal