Quarterly Report

 

Composing Web Sources with .NET

January 1, 2002 -- March 31, 2002

 

Personnel

Craig Knoblock

Cyrus Shahabi

Jose Luis Ambite

Snehal Thakkar

 Ching-chien Chen

Bo Han

University of Southern California

Information Sciences Institute

4676 Admiralty Way

Los Angeles, CA 90066

 

knoblock@isi.edu, shahabi@usc.edu, ambite@isi.edu,

thakkar@isi.edu, chingchc@usc.edu, bhan@isi.edu
1.   Introduction

A key component of global awareness is the ability to access and integrate a wide variety of information sources and web services.  The .NET platform facilitates rapid development and deployment of web services that act as information providers.  However, integrating heterogeneous information sources and web services remains a challenge, in part because information sources often use different data formats and access methods and differ in accuracy.  In addition, not all information sources expose their data through web services.  Geospatial data integration is further hampered by the conflation problem, which arises because sources differ in accuracy and use different projections.  For example, most online geocoders that provide geographic coordinates for a given address are often off by half a block or more and provide no bound on the accuracy of their results.  In this report we describe one approach to these problems that makes it possible to rapidly integrate different information sources using web services or wrapper building techniques.  Furthermore, we describe an application that demonstrates the tools by integrating Microsoft’s Terraservice with various white page directories on the web.

Integrating satellite images, maps, and other geospatial data with information from web sources presents three major challenges.  The first challenge is to extract structured data from semi-structured sources, for example, accessing white page websites and extracting names, addresses, and telephone numbers.  The second challenge is to find accurate geographic coordinates for the addresses extracted from the white page sources.  While several geocoders are available online, most of them are off by as much as half a block and do not provide any bound on their inaccuracy.  Finally, the extracted information must be superimposed on the satellite imagery or maps using the geographic coordinates, and conflation-related issues must be addressed.

The remainder of this report is organized as follows.  Section 2 discusses the tools we have developed in previous work.  Section 3 describes the Building Finder application in detail and our approach to the challenges above.  Finally, Section 4 concludes the report with the application's open issues and our next steps.

2.   Previous Work

In previous projects we have developed an information agent architecture termed Ariadne to enable users to easily create their own specialized Web-based information agents [Knoblock et al. 2001].  The research focus was on technology for rapidly constructing information agents to extract, query, and integrate data from web sites as well as databases.  The resulting system includes tools for constructing wrappers that access and extract information from web sites, and query planning technology for dynamically and efficiently answering queries using these sources.             

An essential component of Ariadne is a wrapper around each individual data source, which provides uniform access to the source so that it can be queried as if it were a relational database.  We have developed machine learning techniques to build wrappers around websites with minimal user assistance.  Using these wrapper building tools, we have built wrappers around popular web sources in several categories, such as the CNN website for news and Yahoo Weather for weather-related information.

We have also developed Theseus, an efficient execution system for information agents [Barish et al. 2000].  More specifically, Theseus empowers users by (a) simplifying the process of specifying agents that can manage and monitor dynamic data sources, such as web sites, and (b) providing a flexible, optimized platform for agent execution.  Theseus allows users to specify a query execution plan to build a new information agent by using existing information agents.

3.   Integrating Web Sources with Web Services

We have developed a demonstration application, called Building Finder, which allows a user to input an area of interest and provides satellite imagery of the area with the houses and streets superimposed on the image.  In the process, the Building Finder application addresses the three major challenges of integrating information from Microsoft Terraservice with various white page directories on the web.  Section 3.1 describes our approach to extracting structured address information from the white page directories.  Section 3.2 describes the integration of satellite imagery with information from the white page directories.

3.1   Extraction of Data from Online Sources

The Building Finder application uses white pages data from Switchboard (http://www.switchboard.com).  The Switchboard white pages site was wrapped using our wrapper building tools.  Once a wrapper is built, the system can treat the Switchboard white pages website as a database, i.e., the system can send the wrapper queries such as “show all people in the city of El Segundo whose last name starts with the letter A”.  The wrapper returns an XML document that contains the results of the query and can be parsed using any XML parser.  Figure 1 shows an example output of the Switchboard white pages wrapper.
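To illustrate what "parsed using any XML parser" might look like in practice, the following minimal Python sketch parses a wrapper result into records.  The report does not give the actual schema of the Switchboard wrapper output, so the element names (results, person, name, address, city, state, phone) and the sample entries are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper output: the real wrapper schema is not given in the
# report, so every element name and value below is an assumption.
SAMPLE_RESULT = """\
<results>
  <person>
    <name>Jane Adams</name>
    <address>123 Main St</address>
    <city>El Segundo</city>
    <state>CA</state>
    <phone>(310) 555-0100</phone>
  </person>
  <person>
    <name>John Alvarez</name>
    <address>456 Grand Ave</address>
    <city>El Segundo</city>
    <state>CA</state>
    <phone>(310) 555-0101</phone>
  </person>
</results>
"""

def parse_people(xml_text):
    """Turn the wrapper's XML into a list of dicts, one per person."""
    root = ET.fromstring(xml_text)
    people = []
    for person in root.findall("person"):
        # Each child element becomes one field of the record.
        people.append({child.tag: (child.text or "").strip() for child in person})
    return people

people = parse_people(SAMPLE_RESULT)
for p in people:
    print(p["name"], "|", p["address"], "|", p["phone"])
```

Because each record becomes a plain dictionary, later steps such as geocoding can consume the results without knowing anything about the website they came from.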

 

Figure 1 Whitepage Data Extraction

3.2   Integrating Imagery and Addresses

The Building Finder application integrates information from several sources: the Microsoft Terraservice and Landmark web services, a white pages website, and street information from U.S. Census Tigerline files stored in a local database.  XML web services are web applications that support standard protocols such as SOAP and XML [Vasudevan 2001].  Terraservice provides access to a large repository of satellite images and topographic maps using SOAP and XML [Barclay et al. 1999].  The Landmark service allows easy access to the USGS gazetteer, which provides important points such as hospitals, schools, and churches in the United States.  The issue, however, is how to integrate the imagery obtained from Terraservice with the white page data and address the problem of conflation.  The Building Finder application uses the Theseus execution system to integrate information from Terraservice and the white pages website.
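As a rough illustration of what a SOAP request to an imagery web service looks like on the wire, the sketch below builds a SOAP 1.1 envelope in Python.  Only the SOAP envelope namespace is standard; the service namespace, the operation name GetImage, and its parameters are hypothetical and do not reflect the actual Terraservice interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/imagery"              # hypothetical service namespace

def build_soap_request(operation, params):
    """Wrap an operation call and its parameters in a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation element and its children live in the service's namespace.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# "GetImage" and its parameters are made up for this example.
request_xml = build_soap_request("GetImage", {"Lat": 33.92, "Lon": -118.41, "Scale": 1})
print(request_xml)
```

In a real client this envelope would be sent in an HTTP POST to the service endpoint, and the response would be another envelope whose body carries the result.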

Figure 2 shows the architecture of the Building Finder application.  Using Theseus, a user can define an integration plan that queries the streets from a database containing street network information.  The result of the query is a set of tuples consisting of street name, city, state, and zip code, which is used to query the Switchboard white pages agent for the addresses on those streets.  The results from the Switchboard white pages agent are then passed to the geocoder agent, which in turn provides the latitude and longitude for each address.  In parallel, Theseus obtains a satellite image from Terraservice for the given area of interest.  Finally, the latitude and longitude points representing the addresses, together with information about the addresses, are superimposed on the satellite imagery.  Figure 3 shows a screenshot of the Building Finder application.
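The shape of this integration plan can be sketched in a few lines of Python.  This is not Theseus and not its plan language: every agent below is a hypothetical stub returning canned data, and the final step uses a simple linear mapping from latitude and longitude to pixel coordinates, which ignores map projection and conflation.

```python
from concurrent.futures import ThreadPoolExecutor

# All agents below are hypothetical stand-ins that return canned data.
def query_streets(area):
    return [("Main St", "El Segundo", "CA", "90245")]

def query_white_pages(street):
    name, city, state, zip_code = street
    return [{"name": "Jane Adams", "address": f"123 {name}, {city}, {state} {zip_code}"}]

def geocode(record):
    return {**record, "lat": 33.9195, "lon": -118.4160}

def fetch_image(area):
    # Stand-in for the imagery request: only size and bounding box are modeled.
    return {"width": 400, "height": 400, "bbox": area}

def to_pixel(lat, lon, image):
    """Linearly map a latitude/longitude into pixel coordinates of the image."""
    south, west, north, east = image["bbox"]
    x = (lon - west) / (east - west) * image["width"]
    y = (north - lat) / (north - south) * image["height"]  # pixel rows grow southward
    return round(x), round(y)

def building_finder(area):
    with ThreadPoolExecutor() as pool:
        # The image request runs in parallel with the address lookups.
        image_future = pool.submit(fetch_image, area)
        records = [geocode(r)
                   for street in query_streets(area)
                   for r in query_white_pages(street)]
        image = image_future.result()
    # Superimpose: pair each record with its pixel position on the image.
    return [(r["name"], to_pixel(r["lat"], r["lon"], image)) for r in records]

area = (33.90, -118.43, 33.94, -118.39)  # south, west, north, east (illustrative)
overlay = building_finder(area)
print(overlay)
```

The sketch keeps the street, white pages, and geocoder steps sequential because each consumes the previous step's output, while the image request has no such dependency and can proceed concurrently.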

 

Figure 2 Application Architecture

Figure 3 Screenshot of the System in Action
 

4.   Next Steps

While the Building Finder application successfully integrates information from different web sources and web services, it has some limitations.  First, the geocoder used by the Building Finder application is not accurate enough to pinpoint buildings in a satellite image at 1 meter per pixel resolution.  We are building a more accurate geocoder by integrating property tax information from several property tax sites with the U.S. Census Bureau’s Tigerline files.  Second, we are working on improving the accuracy of the Tigerline data using conflation techniques.  The data for the conflation algorithms is also being extracted from online data sources.  Third, we are working on a general approach to converting wrappers to web services.  We plan to make those web services available in UDDI as green pages.
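To make the first issue concrete, the sketch below estimates how far, in pixels, a geocoded point lands from the true location at a given image resolution.  It uses the standard haversine formula with a mean Earth radius; the two coordinates are illustrative values chosen for the example, not measurements from our geocoder.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def pixel_error(geocoded, actual, meters_per_pixel=1.0):
    """How many pixels the geocoded point lands from the actual location."""
    return haversine_m(*geocoded, *actual) / meters_per_pixel

# Illustrative points only: the second is offset by 0.0005 degrees of latitude.
geocoded = (33.9195, -118.4160)
actual = (33.9200, -118.4160)
print(f"{pixel_error(geocoded, actual):.1f} pixels at 1 m/pixel")
```

At 1 meter per pixel, every meter of geocoding error is a full pixel of displacement, so an offset of a few tens of meters is enough to place a marker on the wrong building.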

 

References

 

[Barclay et al. 1999] Barclay, Tom, Gray, Jim and Slutz, Don 1999. Microsoft TerraServer: A Spatial Data Warehouse, Microsoft Corporation.

               

[Barish et al. 2000] Barish, Greg, DiPasquo, Daniel, Knoblock, Craig A. and Minton, Steven 2000. A dataflow approach to agent-based information management. In Proceedings of the 2000 International Conference on Artificial Intelligence, Las Vegas, NV.

               

[Knoblock et al. 2001] Knoblock, Craig, Minton, Steven, Ambite, Jose-Luis, Ashish, Naveen, Muslea, Ion, Philpot, Andrew and Tejada, Sheila 2001. The ARIADNE Approach to Web-Based Information Integration. International Journal on Intelligent Cooperative Information Systems (IJCIS) 10(1-2): 145-169.

               

[Vasudevan 2001] Vasudevan, Venu 2001. A Web Services Primer. http://www.xml.com/pub/a/2001/04/04/webservices/

               

 

This research is sponsored by an un-restricted cash gift from: