In this project, we develop techniques for discovering services
and automatically generating models of the functionality they provide.
These models can then be used to index the services (in search engines)
or to compose them using an information mediator (such as
Prometheus).
New services appear on the Internet all the time. In this project we are interested in the problem of enabling systems to take advantage of the new services without the need for reprogramming them. In order to for a system to make use of the new service it needs to understand what functionality the service provides. So in this project we develop techniques for automatically modeling this functionality by probing the services with reasonable inputs. Once the system understands the functionality being provided it can incorporate the service into a new or existing workflow.
In the diagram below, an information mediator is providing uniform access to a set of heterogeneous flight services. When the user asks for the cheapest flight between Milan (MXP) and Pittsburgh (PIT), the mediator inspects the definitions (models) it has for different services. It discovers that both United and Lufthansa offer flights between these destinations and calls the relevant operations of each.
![]() |
| Figure 1: A mediator providing uniform access to heterogeneous services |
The limitations of such a system are demonstrated by the addition of a new service from Alitalia. Because the mediator doesn't have a definition or model for the new service, it cannot make use of the service even though the information it provides would be relevant to the current query. Thus to integrate new services automatically into a mediator we need to discover a semantic model for the service, but where should such a model come from?
One answer to this problem is to rely on standardisation. If all service providers make use of the same schemas and operations to describe their services then the mediator can simply reuse the semantic definitions to assess which sources are relevant to the query.
Unfortunately, the assumption that all providers will adhere to a common standard is not always reasonable. Another approach is to try to learn a model for each new service automatically. A mixture of schema matching (Rahm & Bernstein 2001) and service classification (Heß & Kushmerick 2003) techniques can be used to hypothesize what functionality the new service might be providing.
In this project we follow this second approach, taking the
idea one step further by actively querying sources to see if
the output agrees with that defined by our model of them.
The example below shows how we go about doing this:
Example: Learning a New Source DefinitionIn this example, we want to learn a definition for a newly discovered service called RateFinder. Figure 2 demonstrates the scenario. The mediated schema (on the left of the diagram) contains two semantic data-types: currency and rate. Examples of these types are {USD, EUR, AUD} and {1936.2, 1.3058, 0.53177} respectfully. The schema also contains a relation called exchange, which relates two currency values to a rate value.
In addition to the mediated schema, we have a definition for a source called LatestRates, which is known and available to the system. (The $-symbol indicates that the attribute is an input.) : With this background knowledge, the problem is to discover a definition for the new source: RateFinder($fromCountry,$toCountry,val). The steps followed to learn a definition for this new source are as follows:
Guided by meta-data similarity (such as the edit distance between attribute labels), the system can hypothesize that the input fields of this new service are both of type currency. To test this hypothesis, it invokes the new source with examples of this semantic type, producing tuples: {<EUR,USD,1.3080>, <USD,EUR,0.7645>, ....} The fact that tuples are returned for these inputs means the classification is probably correct. Taking into account both the meta-data and the newly generated instances of the output type, the system can classify the output attribute to be of type rate. With the type signature complete, the system now needs to find out what relationship exists between the input and output attributes. This relationship can be written as a source definition. Based on the signature we have (at least) two possible definitions. The candidates are labeled def_1 and def_2 and both involve the exchange predicate:
To test which, if any, of these two source definitions is correct, we need to generate tuples which adhere to each definition and compare them with the tuples produced by the source itself. The system can generate tuples for each definition by reformulating the definitions in terms of the other known services, in this case LatestRates:
Invoking the two reformulated definitions using the same inputs as before produces the tuples in the table below:
The tuples produced by def_1 are “very similar” although not identical to those produced by RateFinder. Once the system has seen a “sufficient” number of tuples, it will conclude that the best definition for the new service is indeed def_1. |
Our approach to classifying the input and output data-types relies on background knowledge captured in a Domain Model (DM). The Domain Model contains a hierarchy of semantic types, e.g., Currency and Rate, as well as a set of domain predicates. Our system is able to populate the DM automatically by querying known sources (i.e., sources with known definitions) and collecting data from them. Thus a light-weight DM must be created manually initially, but it is semi-automatically populated using proprietary Web wrapper technology, and, more importantly, can be extended by automatically incorporating new sources as described below.
In previous work (Lerman and Minton 2000; Lerman, Minton et al. 2003 , we have developed a domain-independent pattern language that allows us to represent the structure of data as a sequence of tokens or token types. These can be specific tokens, such as 'USC', as well as general token types. The general types have regular expression-like recognizers, which simply identify the syntactic category to which the token's characters belong: 2DIGIT, NUMBER, CAPS, etc. The symbolic representation of data content by patterns of tokens and token types is concise and powerful. These patterns can be efficiently learned from examples of a data field. In our previous research, we found that learned patterns allowed us to locate examples of the field on Web pages with considerable accuracy in a wide variety of information domains (book sellers, phone books, used cars, …) (Lerman, Minton et al. 2003; Lerman, Gazen et al. 2004). Our system uses learned patterns to recognize data instances and label them by their semantic category - names, addresses, phone numbers, car makes and models, currency, flight status and weather information, stock prices, etc. - from a variety of sources, such as HTML pages, Web services, spreadsheets, text files, emails and other types of documents. We are also planning to include any available metadata (such as mark-up, or labels in HTML or WSDL files) to improve semantic classification of data.
One of the benefits of our approach is that the DM is flexible and extensible. A new source, for example, directory services for a German company, can be added to it and populated by querying the source. The system will then use examples to learn country-specific phone numbers and will recognize future examples of "German phone numbers."
In this research area, we are interested in techniques for learning the relationship between the attributes of a sevice. In other words, we are trying to learn the function that the service applies to its inputs in order to produce its outputs, (steps 3 to 5 of the example above). We restrict the problem to that of dealing with those operations of a service which only produce information and do not have any affects which change the “state of the world”. For example, we are primarily interested in modeling operations which provide information such as the cost of a flight, but not those which perform a transaction, such as to buy the ticket.
As shown in the example above, the semantics of information
producing actions can be modeled using DATALOG source
definitions. We follow the Local-as-View (LAV) (Levy
et al. 1995) approach to modeling operations in which each source is described as a view
over the mediated schema. The problem of discovering the
functionality of a new service can thus be seen as the problem
of discovering the LAV source description for each of
the operations it provides.
We employ Inductive Logic Programming (ILP) search techniques to generate feasible
definitions for the target source. We then reformulated these definitions in terms
of the known services. Finally we invoke both the new and the known services and
compare outputs. A more detailed description of our approach can be found in (Carman
and Knoblock 2005), and a simple demo of the system is available here.
The major difficulties and areas of research in this work, deal with techniques for:
In this section we outline some other research we are doing which relates
strongly to source modeling.
Wrappers facilitate access to Web-based information sources by providing
a uniform querying and data extraction capability. A wrapper for the yellow
pages source can take a query for a Mexican restaurant near Marina del
Rey, CA, for example, retrieve the Web page containing results of the query
and extract the restaurant’s name, its address and the phone number. Below
is a fragment of a Web page returned by a yellow pages source for the example
query and the data tuple extracted by the wrapper.
The wrappers used by our group are landmark-based, i.e., they
rely on the layout of the Web page in order to extract data from it (Muslea
et al. 1999, Kushmerick et al. 1997). The wrappers are created
semi-automatically in the following way:
Related Projects
Wrapper Maintenance
Background
The advantages and disadvantages of such wrappers are outlined below.
Pros:
|
Cons:
|
220 Lincoln Boulevard
523 S Robertson Boulevard
4676 Admiralty Way
One way to represent this information is as a sequence of characters, such as a regular expression (Goan et al. 1996). This description is very powerful and descriptive; however, many common data types don’t have a well defined grammar, or it may be computationally expensive to learn it. Another way to represent information is as a collection of global features, such as the number of words and the density of numeric characters (Kushmerick 1999). These descriptions can be learned very quickly and efficiently, but we believe that representation level is too coarse for some information management tasks.
We propose an intermediate token-level representation that balances the descriptive power and specificity of the character-level representation with the compactness and computational efficiency of the global representation. Tokens are words that contain different types of characters. We use the token's character types to assign it to one or more general syntactic types: numeric, alphabetic, alphanumeric, punctuation, etc. These types form a hierarchy (see figure below) that allows for multi-level generalization. Thus, the token “apple” belongs to the specific type representing the word “apple”, as well as to general types of lowercased words, alphabetic words, and alphanumeric strings. This representation is flexible and may be extended to include new types and domain specific information.

We claim that the structure of many common data types can be described by a sequence of specific and general tokens, or a pattern. For complex data fields it is sufficient to use only the starting and ending patterns as the description of the data field. The collection of starting and ending patterns is data prototype. Thus, the data prototype for the street address examples above might be:
starting patterns:
_NUM _CAPS
ending patterns:
_CAPS Blvd
_CAPS _CAPS
where _NUM represents the general category that represents numeric strings, and _CAPS is a general category representing all capitalized strings.
Because we do not have a source of explicit negative examples for each
data field, we chose to frame the learning problem as learning from positive
examples alone. The learning problem is to find the patterns that describe
many of the positive examples of data but are highly unlikely to describe
a random sequence of tokens. In this statement, all possible randomly generated
sequences of tokens are treated as implicit negative examples.
Stated this way, the learning problem is to find statistically significant
patterns, I.e. those that describe a significant number of positive examples
and
are unlikely to have been generated by chance. We use hypothesis testing
at 5% significance level to determine whether a sequence of tokens is significant.
DataPro, which is described in Lerman & Minton 2000, is a greedy algorithm
that finds significant patterns in a set of token sequences.

Complex pages, like those that contain tables and lists, present a special challenge to the wrapper reinduction algorithm. This is because with the current wrapper induction method, the first and last elements of the list have to be unambiguously identified, in addition to at least three consecutive list elements. We are developing automatic methods to handle complex pages.
Rahm, E., and P. A. Bernstein, "A Survey of Approaches to Automatic Schema Matching," VLDB Journal 10, 4 (Dec. 2001), pp. 334-350
A. Heß and N. Kushmerick. Automatically attaching semantic metadata to web services. In Proc. Workshop Information Integration on the Web, 2003. Int. Joint Conf. Artificial Intelligence.
Answering Queries Using Views. Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, Divesh Srivastava. In proceedings of PODS, 1995.