The task of object identification arises when integrating information from multiple websites. The same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text matching. Examples of the object identification problem are shown in Figure 1.
Figure 1. Different representations of restaurants across different websites
In the first example, the restaurant referred to as "Art's Deli" on one website may appear as "Art's Delicatessen" on another. Because of this problem, object instances cannot be compared using simple equality; they must be judged by text similarity in order to determine whether they refer to the same object. When two objects are determined to be the same, a mapping is created between them.
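To make the contrast with exact matching concrete, the following is a minimal sketch assuming token-level Jaccard similarity as the measure; Active Atlas itself learns richer, domain-tailored similarity measures rather than this fixed one.

```python
# A minimal sketch, assuming token-level Jaccard similarity; Active Atlas
# itself learns richer, domain-tailored similarity measures.

def jaccard_similarity(a: str, b: str) -> float:
    """Compare two names by the overlap of their word tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Exact equality fails even though both names denote the same restaurant.
print("Art's Deli" == "Art's Delicatessen")                    # False
print(jaccard_similarity("Art's Deli", "Art's Delicatessen"))  # ~0.33
```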
When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using an exact text match. We have developed a record linkage system called Active Atlas, which compares the objects' shared attributes in order to identify matching objects. Certain attributes are more important than others for deciding whether a mapping should exist between two objects. Previous methods of object identification have required the manual construction of object identification rules, or mapping rules, for determining the mappings between objects. This manual process is time consuming and error-prone. In our approach, Active Atlas learns to tailor mapping rules to a specific application domain through limited user input. Within a single site, entities (e.g., people, places, countries, companies) are usually named in a consistent fashion. However, across sites, the same entities may be referred to by different names.
 Figure 2. Active Atlas Architecture |
Our approach is to select a primary source for an entity's name and then provide a mapping from that source to each of the other sources where a different naming scheme is used. One way to do this is to create a mapping table, which specifies for each entry in one data source what the equivalent entity is called in another data source. Alternatively, if the mapping is computable, it can be represented by a mapping function, which is a program that converts one form into another form. |
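The two representations can be illustrated with a short sketch. The table entries and the phone-number format below are hypothetical examples, not data from the actual sources; a mapping table suits arbitrary correspondences, while a mapping function suits systematic, computable conversions.

```python
# Illustrative only: the table entries and the phone-number format are
# hypothetical, not data from the actual sources.

# A mapping table lists, for each entry in one data source, the
# equivalent entity in another; it is the right choice when the
# correspondence is arbitrary.
mapping_table = {
    "Art's Delicatessen": "Art's Deli",
    "California Pizza Kitchen": "CPK",
}

# A mapping function converts one form into another programmatically;
# it is preferable when the conversion is computable,
# e.g. "(213) 555-0123" -> "213-555-0123".
def normalize_phone(phone: str) -> str:
    digits = "".join(ch for ch in phone if ch.isdigit())
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(mapping_table["Art's Delicatessen"])   # Art's Deli
print(normalize_phone("(213) 555-0123"))     # 213-555-0123
```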
We have developed a semi-automated method for building mapping tables and functions by analyzing the underlying data in advance. The method attempts to pair each entity in one source with a corresponding entity (or entities, in some cases) in another source. The basic idea is to use information retrieval techniques to provide an initial mapping, and then to apply machine learning techniques to improve the mapping. The initial mapping matches entities from two sources based on their textual similarity. In the subsequent learning phase, the system learns two types of rules to help improve and verify the initial mapping. Transformation rules identify textual transformations, such as acronyms, abbreviations, and phrase orderings, that are common in the domain. For instance, the system can learn that "Rep" is a commonly used abbreviation for "Republic", that one source commonly employs acronyms, or that one source represents person names as "LastName, FirstName" while the other uses "FirstName LastName". The system also learns mapping rules, which are used when entities can be compared along multiple attributes.
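A sketch of what such learned transformations might look like follows; the dictionary representation and the example names are illustrative assumptions, not Active Atlas's internal rule format.

```python
# A sketch of the three kinds of transformation rules named above.

ABBREVIATIONS = {"rep": "republic", "deli": "delicatessen"}

def expand_abbreviations(name: str) -> str:
    """Abbreviation rule: 'Rep' is commonly used for 'Republic'."""
    return " ".join(ABBREVIATIONS.get(t, t) for t in name.lower().split())

def acronym(name: str) -> str:
    """Acronym rule: 'California Pizza Kitchen' -> 'cpk'."""
    return "".join(t[0] for t in name.lower().split())

def reorder_person_name(name: str) -> str:
    """Ordering rule: 'LastName, FirstName' -> 'firstname lastname'."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}".lower()
    return name.lower()

print(expand_abbreviations("Art's Deli"))      # art's delicatessen
print(acronym("California Pizza Kitchen"))     # cpk
print(reorder_person_name("Smith, Jane"))      # jane smith
```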
We are prototyping an active learning method for learning transformation rules and mapping rules. In our approach, a human is asked to verify some of the pairs in the initial mapping, i.e., to indicate whether the pairs are correctly or incorrectly matched. Then the system attempts to learn new rules, and selects additional pairs for the human to verify. The system selects the pairs that will be most valuable for confirming/disconfirming the hypotheses explored by the learning algorithm. The goal is to obtain a mapping for which the system is highly confident, while minimizing the time the human must spend. |
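The loop below sketches this selection strategy under simple assumptions: matches are scored by a classifier that returns a confidence in [0, 1], and the pair whose confidence is closest to 0.5 is the most informative one to show the user. The function and parameter names are hypothetical; the real system ties its selection criterion to the hypotheses of its learning algorithm.

```python
# A sketch of the active learning loop, assuming an uncertainty-sampling
# strategy. `train`, `predict_confidence`, and `label_by_user` are
# hypothetical callables supplied by the surrounding system.

def active_learning_loop(candidate_pairs, train, predict_confidence,
                         label_by_user, budget=20):
    labeled = {}                       # pair -> True/False from the user
    for _ in range(budget):
        model = train(labeled)
        unlabeled = [p for p in candidate_pairs if p not in labeled]
        if not unlabeled:
            break
        # A confidence near 0.5 means the current hypotheses are least
        # certain, so that pair is the most valuable one to verify.
        pair = min(unlabeled,
                   key=lambda p: abs(predict_confidence(model, p) - 0.5))
        labeled[pair] = label_by_user(pair)
    return train(labeled), labeled
```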
In addition to Active Atlas (a standalone system developed in Java), we have taken the underlying technologies developed for Active Atlas and written a consolidation operator for an execution platform for information agents that we have developed. This allows record linkage between two data sources to be performed in a streaming, dataflow-style execution.
A difficult problem encountered when performing record linkage is the degree of certainty with which matches are proposed and rejected. A record linkage system such as Active Atlas is only as good as the labeled data it has received, and its classification of matches is therefore limited in accuracy. In our research on record linkage, we have found that there exists a "grey" area in the classification of potential matches. A record linkage system is able to classify obvious matches and non-matches. However, a class of potential matches is also present, and this class must be dealt with in a different and more principled manner.
This class of potential matches falls into the "grey" area of classification. These potential matches cannot be classified with full confidence as matches, yet they possess confidence scores high enough to be considered potentially matched. This creates the need for a secondary source to help resolve the discrepancy. A secondary source would provide the system with additional information that it could use to help classify the match. The following example illustrates the need for secondary sources.
Let us consider the restaurant domain. Record linkage is performed on two different data sources, each comprised of records referring to particular restaurants. The system returns all matches; however, one returned match has a lower than expected confidence score. Looking at the record, the system should determine that the phone number field is the source of the inconsistency. Breaking the field into the area code and the telephone exchange, it should determine that the telephone numbers match even though the area codes are not the same. With the availability of a secondary source containing the history of telephone area codes, it could determine that the phone numbers in question are in fact the same, since one area code is the successor of the other.
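The scenario can be sketched as follows. The successor table is a hypothetical stand-in for the secondary source; in practice its entries would come from an actual directory of area-code splits (the 213/323 pair is used here because 323 was split off from 213 in Los Angeles).

```python
# Hypothetical secondary source: old area code -> successor area code.
AREA_CODE_SUCCESSORS = {"213": "323"}

def phones_match(phone_a: str, phone_b: str) -> bool:
    """Compare ten-digit numbers, tolerating a known area-code split."""
    area_a, rest_a = phone_a[:3], phone_a[3:]
    area_b, rest_b = phone_b[:3], phone_b[3:]
    if rest_a != rest_b:          # exchange and line number must agree
        return False
    if area_a == area_b:
        return True
    # Consult the secondary source: codes match if one succeeded the other.
    return (AREA_CODE_SUCCESSORS.get(area_a) == area_b or
            AREA_CODE_SUCCESSORS.get(area_b) == area_a)

print(phones_match("2135550123", "3235550123"))   # True: 213 split off 323
```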
 Figure 3. Axiom model available to the mediator. |
Our approach involves combining the mediator with a modified version of Active Atlas to improve the performance of both systems. The mediator system utilizes domain descriptions to transform user queries into a datalog program that queries various data sources and processes the data from these sources to answer the query. This data is then passed on to the record linkage system and used as additional information in the consolidation process. This approach differs from traditional mediators in two ways: (1) we utilize a record linkage system to consolidate data from various data sources, and (2) our approach automatically improves the performance of the object consolidation system by utilizing information from the secondary sources. |
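The overall dataflow might be sketched as below. Every name here is a hypothetical stand-in, not the real system's interfaces: the mediator step gathers tuples produced by the datalog program, and consolidation consults the secondary data only for matches that fall into the grey area.

```python
# An illustrative sketch of the combined architecture. `query`, `match`,
# and `resolve_with_secondary` are hypothetical callables.

def mediator(query, sources):
    """Stand-in for the datalog program: gather candidate tuples."""
    return [row for src in sources for row in src if query(row)]

def consolidate(tuples, match, resolve_with_secondary, high=0.9, low=0.5):
    """Pair up tuples; defer grey-area pairs to the secondary source."""
    mappings = []
    for i, a in enumerate(tuples):
        for b in tuples[i + 1:]:
            score = match(a, b)
            if score >= high:
                mappings.append((a, b))      # confident match
            elif score >= low and resolve_with_secondary(a, b):
                mappings.append((a, b))      # resolved via secondary data
    return mappings
```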