[Example | Research | Semantic Labeling | Inducing Definitions | Related Projects ]

Source Modeling:

Synopsis

In this project, we develop techniques for discovering services and automatically generating models of the functionality they provide. These models can then be used to index the services (in search engines) or to compose them using an information mediator (such as Prometheus).


Introduction

New services appear on the Internet all the time. In this project we are interested in the problem of enabling systems to take advantage of the new services without the need for reprogramming them. In order to for a system to make use of the new service it needs to understand what functionality the service provides. So in this project we develop techniques for automatically modeling this functionality by probing the services with reasonable inputs. Once the system understands the functionality being provided it can incorporate the service into a new or existing workflow.

In the diagram below, an information mediator is providing uniform access to a set of heterogeneous flight services. When the user asks for the cheapest flight between Milan (MXP) and Pittsburgh (PIT), the mediator inspects the definitions (models) it has for different services. It discovers that both United and Lufthansa offer flights between these destinations and calls the relevant operations of each.

A mediator providing uniform access to a set of known, but heterogeneous services
Figure 1: A mediator providing uniform access to heterogeneous services

The limitations of such a system are demonstrated by the addition of a new service from Alitalia. Because the mediator doesn't have a definition or model for the new service, it cannot make use of the service even though the information it provides would be relevant to the current query. Thus to integrate new services automatically into a mediator we need to discover a semantic model for the service, but where should such a model come from?

One answer to this problem is to rely on standardisation. If all service providers make use of the same schemas and operations to describe their services then the mediator can simply reuse the semantic definitions to assess which sources are relevant to the query.

Unfortunately, the assumption that all providers will adhere to a common standard is not always reasonable. Another approach is to try to learn a model for each new service automatically. A mixture of schema matching (Rahm & Bernstein 2001) and service classification (Heß & Kushmerick 2003) techniques can be used to hypothesize what functionality the new service might be providing.

In this project we follow this second approach, taking the idea one step further by actively querying sources to see if the output agrees with that defined by our model of them. The example below shows how we go about doing this:

Example: Learning a New Source Definition

In this example, we want to learn a definition for a newly discovered service called RateFinder. Figure 2 demonstrates the scenario. The mediated schema (on the left of the diagram) contains two semantic data-types: currency and rate. Examples of these types are {USD, EUR, AUD} and {1936.2, 1.3058, 0.53177} respectfully. The schema also contains a relation called exchange, which relates two currency values to a rate value.

A diagram containing two sources and a mediator as described in the text
Figure 2: A simple problem

In addition to the mediated schema, we have a definition for a source called LatestRates, which is known and available to the system. (The $-symbol indicates that the attribute is an input.) :

LatestRates($country1, $country2, ratio) :- exchange(country1, country2, ratio).

With this background knowledge, the problem is to discover a definition for the new source: RateFinder($fromCountry,$toCountry,val). The steps followed to learn a definition for this new source are as follows:

  • Step 1: Use metadata to classify input types and invoke the source
  • Guided by meta-data similarity (such as the edit distance between attribute labels), the system can hypothesize that the input fields of this new service are both of type currency. To test this hypothesis, it invokes the new source with examples of this semantic type, producing tuples: {<EUR,USD,1.3080>, <USD,EUR,0.7645>, ....} The fact that tuples are returned for these inputs means the classification is probably correct.

  • Step 2: Use the instance data produced to classify the output types
  • Taking into account both the meta-data and the newly generated instances of the output type, the system can classify the output attribute to be of type rate.

  • Step 3: Generate plausible source definitions
  • With the type signature complete, the system now needs to find out what relationship exists between the input and output attributes. This relationship can be written as a source definition. Based on the signature we have (at least) two possible definitions. The candidates are labeled def_1 and def_2 and both involve the exchange predicate:

    def_1($fromCountry, $toCountry, val):- exchange(fromCountry, toCountry, val).
    def_2($fromCountry, $toCountry, val):- exchange(toCountry, fromCountry, val).

  • Step 4: Reformulate candidates in terms of the other sources
  • To test which, if any, of these two source definitions is correct, we need to generate tuples which adhere to each definition and compare them with the tuples produced by the source itself. The system can generate tuples for each definition by reformulating the definitions in terms of the other known services, in this case LatestRates:

    def_1($fromCountry, $toCountry, val) :- LatestRates(fromCountry, toCountry, val).
    def_2($fromCountry, $toCountry, val) :- LatestRates(toCountry, fromCountry, val).

  • Step 5: Invoke services and compare output
  • Invoking the two reformulated definitions using the same inputs as before produces the tuples in the table below:

    inputRateFinderdef_1def_2
    < EUR, USD >1.30801.30770.7647
    < USD, EUR >0.76450.76471.3077
    < EUR, AUD >1.68671.68980.5918

    The tuples produced by def_1 are “very similar” although not identical to those produced by RateFinder. Once the system has seen a “sufficient” number of tuples, it will conclude that the best definition for the new service is indeed def_1.




Research Areas

The research performed in this project can be divided into two main areas. The first is to recognize the semantic types present in the type signature of the new source (semantic labeling). The second is to induce a definition for the source given the type signature. In the next sections we describe these two areas of research.

Semantic Labeling

Our approach to classifying the input and output data-types relies on background knowledge captured in a Domain Model (DM). The Domain Model contains a hierarchy of semantic types, e.g., Currency and Rate, as well as a set of domain predicates. Our system is able to populate the DM automatically by querying known sources (i.e., sources with known definitions) and collecting data from them. Thus a light-weight DM must be created manually initially, but it is semi-automatically populated using proprietary Web wrapper technology, and, more importantly, can be extended by automatically incorporating new sources as described below.

In previous work (Lerman and Minton 2000; Lerman, Minton et al. 2003 , we have developed a domain-independent pattern language that allows us to represent the structure of data as a sequence of tokens or token types. These can be specific tokens, such as 'USC', as well as general token types. The general types have regular expression-like recognizers, which simply identify the syntactic category to which the token's characters belong: 2DIGIT, NUMBER, CAPS, etc. The symbolic representation of data content by patterns of tokens and token types is concise and powerful. These patterns can be efficiently learned from examples of a data field. In our previous research, we found that learned patterns allowed us to locate examples of the field on Web pages with considerable accuracy in a wide variety of information domains (book sellers, phone books, used cars, …) (Lerman, Minton et al. 2003; Lerman, Gazen et al. 2004). Our system uses learned patterns to recognize data instances and label them by their semantic category - names, addresses, phone numbers, car makes and models, currency, flight status and weather information, stock prices, etc. - from a variety of sources, such as HTML pages, Web services, spreadsheets, text files, emails and other types of documents. We are also planning to include any available metadata (such as mark-up, or labels in HTML or WSDL files) to improve semantic classification of data.

One of the benefits of our approach is that the DM is flexible and extensible. A new source, for example, directory services for a German company, can be added to it and populated by querying the source. The system will then use examples to learn country-specific phone numbers and will recognize future examples of "German phone numbers."

Inducing Source Definitions

In this research area, we are interested in techniques for learning the relationship between the attributes of a sevice. In other words, we are trying to learn the function that the service applies to its inputs in order to produce its outputs, (steps 3 to 5 of the example above). We restrict the problem to that of dealing with those operations of a service which only produce information and do not have any affects which change the “state of the world”. For example, we are primarily interested in modeling operations which provide information such as the cost of a flight, but not those which perform a transaction, such as to buy the ticket.

As shown in the example above, the semantics of information producing actions can be modeled using DATALOG source definitions. We follow the Local-as-View (LAV) (Levy et al. 1995) approach to modeling operations in which each source is described as a view over the mediated schema. The problem of discovering the functionality of a new service can thus be seen as the problem of discovering the LAV source description for each of the operations it provides.

We employ Inductive Logic Programming (ILP) search techniques to generate feasible definitions for the target source. We then reformulated these definitions in terms of the known services. Finally we invoke both the new and the known services and compare outputs. A more detailed description of our approach can be found in (Carman and Knoblock 2005), and a simple demo of the system is available here.

The major difficulties and areas of research in this work, deal with techniques for:

  • selecting valid inputs to use in invoking different services
  • approximating the set of tuple that would be returned by certain definitions
  • deciding when two similar tuples refer to the same entity (aka record linkage)



  • Related Projects

    In this section we outline some other research we are doing which relates strongly to source modeling.

    Wrapper Maintenance

    Background

    Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. A wrapper for the yellow pages source can take a query for a Mexican restaurant near Marina del Rey, CA, for example, retrieve the Web page containing results of the query and extract the restaurant’s name, its address and the phone number. Below is a fragment of a Web page returned by a yellow pages source for the example query and the data tuple extracted by the wrapper.

    The wrappers used by our group are landmark-based, i.e., they rely on the layout of the Web page in order to extract data from it (Muslea et al. 1999, Kushmerick et al. 1997). The wrappers are created semi-automatically in the following way:

    1. The user labels examples of data to extract on a several pages,
    2. These examples, along with the Web pages, are provided to the wrapper induction algorithm,
    3. which learns extraction rules for the data.
    The advantages and disadvantages of such wrappers are outlined below.
    Pros:
    • data extraction is fast and efficient
    • domain independent
    • extraction rules can be learned from a small number of examples
    Cons:
    • wrappers are prone to failure when layout of the Web page changes
    • few validation tools to verify the wrapper is extracting data correctly (Kushmerick 1999)
    • no wrapper maintenance tools
    The one in red is the major problem for information integration applications. Even if a Web site changes its layout once a year, if you have on the order of a thousand wrappers (comparison shopping applications, for example), then you have to fix several wrappers every day.

    Research Goals

    Our research goal is to create a data representation scheme and to develop efficient algorithms that allow us to learn the structure of common data types.We can use this information for wrapper maintenance applications. The first application, wrapper verification, detects when the wrapper is not extracting data correctly from a Web site, possibly because the layout of the site has changed in a way that breaks the extraction rules learned by wrapper induction system. If the wrapper is not working correctly because the Web site has changed, next task, wrapper reinduction, is automatically revise the wrapper and learn new extraction rules. Although we focus on Web applications, the learning technique can be used to learn about data fields in general. We believe that it is a step towards interoperability among information agent, where agents can exchange data without the need for a pre-specified protocol.

    Data Representation and Learning

    Our objective is to learn the structure of common data fields, such as street addresses. Here are three examples of a street address:

    220 Lincoln Boulevard
    523 S Robertson Boulevard
    4676 Admiralty Way

    One way to represent this information is as a sequence of characters, such as a regular expression (Goan et al. 1996). This description is very powerful and descriptive; however, many common data types don’t have a well defined grammar, or it may be computationally expensive to learn it. Another way to represent information is as a collection of global features, such as the number of words and the density of numeric characters (Kushmerick 1999). These descriptions can be learned very quickly and efficiently, but we believe that representation level is too coarse for some information management tasks.

    We propose an intermediate token-level representation that balances the descriptive power and specificity of the character-level representation with the compactness and computational efficiency of the global representation. Tokens are words that contain different types of characters. We use the token's character types to assign it to one or more general syntactic types: numeric, alphabetic, alphanumeric, punctuation, etc. These types form a hierarchy (see figure below) that allows for multi-level generalization. Thus, the token “apple” belongs to the specific type representing the word “apple”, as well as to general types of lowercased words, alphabetic words, and alphanumeric strings. This representation is flexible and may be extended to include new types and domain specific information.

    We claim that the structure of many common data types can be described by a sequence of specific and general tokens, or a pattern. For complex data fields it is sufficient to use only the starting and ending patterns as the description of the data field. The collection of starting and ending patterns is data prototype. Thus, the data prototype for the street address examples above might be:

    starting patterns:
    _NUM _CAPS
    ending patterns:
    _CAPS Blvd
    _CAPS _CAPS

    where _NUM represents the general category that represents numeric strings, and _CAPS is a general category representing all capitalized strings.

    Because we do not have a source of explicit negative examples for each data field, we chose to frame the learning problem as learning from positive examples alone. The learning problem is to find the patterns that describe many of the positive examples of data but are highly unlikely to describe a random sequence of tokens. In this statement, all possible randomly generated sequences of tokens are treated as implicit negative examples.
    Stated this way, the learning problem is to find statistically significant patterns, I.e. those that describe a significant number of positive examples and are unlikely to have been generated by chance. We use hypothesis testing at 5% significance level to determine whether a sequence of tokens is significant. DataPro, which is described in Lerman & Minton 2000, is a greedy algorithm that finds significant patterns in a set of token sequences.

    Applications to Wrapper Maintenance

    As described above, Web wrappers use the layout of HTML pages in order to extract data, so if the layout of the page changes, the wrapper may stop extracting correctly. The wrapper verification task is to automatically determine whether the wrapper is correctly extracting data from an information source (Kushmerick 1999). The verification task can be stated in the following way. Given a set of old examples that are known to be correct, and a set of test examples extracted by the wrapper from the new pages. Do the patterns learned from the old examples describe the same number of the new examples? If yes, the wrapper is extracting correctly, otherwise, the source may have changed. Once we have determined that the wrapper is not working correctly, the next task is to rebuild it automatically. This task involves two steps:
    Wrapper reinduction

    Complex pages, like those that contain tables and lists, present a special challenge to the wrapper reinduction algorithm. This is because with the current wrapper induction method, the first and last elements of the list have to be unambiguously identified, in addition to at least three consecutive list elements. We are developing automatic methods to handle complex pages.



    References

    Rahm, E., and P. A. Bernstein, "A Survey of Approaches to Automatic Schema Matching," VLDB Journal 10, 4 (Dec. 2001), pp. 334-350

    A. Heß and N. Kushmerick. Automatically attaching semantic metadata to web services. In Proc. Workshop Information Integration on the Web, 2003. Int. Joint Conf. Artificial Intelligence.

    Answering Queries Using Views. Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, Divesh Srivastava. In proceedings of PODS, 1995.