Composing Web Sources with .NET
Quarterly Report
January 1, 2002 -- March 31, 2002
Personnel
Craig Knoblock
Cyrus Shahabi
Jose Luis Ambite
Snehal Thakkar
Ching-chien Chen
Bo Han
University of Southern California
Information Sciences Institute
4676 Admiralty Way
Los Angeles, CA 90066
A key component of global
awareness is the ability to access and integrate a wide variety of
information sources and web services.
The .NET platform facilitates rapid development and deployment of the
web services that act as information providers. However, integrating heterogeneous information sources and web
services still remains a challenge, in part because information sources often
use different data formats, different access methods and have different
accuracy. In addition, not all information
sources expose their data through web services. Geospatial data integration is also
hampered by conflation problems, which arise because information sources
differ in accuracy and use different map projections. For example,
most online geocoders that provide geographic coordinates for a
given address are often off by half a block or more and do not
provide any bound on the accuracy of the information. In this report we describe one approach to
solving these problems and to rapidly integrating different
information sources using web services or wrapper building techniques. Furthermore, we describe an application
that demonstrates the use of the tools by integrating Microsoft’s
Terraservice with various white page directories on the web.
Integrating satellite images, maps, and other geospatial data with information from web
sources presents three major challenges.
The first challenge is to extract structured data from semi-structured
sources, for example, accessing white page websites and extracting name,
address and telephone numbers from the white pages. The second challenge is to find very accurate geographic
coordinates for the addresses extracted from the white page data
sources. While there are several
geocoders available online, most of them are off by as much as half a block and do not provide any bounds on
their error. Finally, the extracted
information must be superimposed on the satellite imagery or maps using the
geographic coordinates, and conflation-related issues must be addressed.
The remainder of this report is organized
as follows. Section 2 discusses the
tools we have developed in previous work. Section 3 describes the Building
Finder application in detail and describes our approach to solve the
above-mentioned challenges. Finally, Section 4 concludes the report by
recapping the key technologies utilized by the Building Finder application.
2. Previous Work
In previous projects we have developed an information agent
architecture termed Ariadne to enable users to easily create their own
specialized Web-based information agents [Knoblock et al. 2001]. The
research focus was on technology for rapidly constructing information agents
to extract, query, and integrate data from web sites as well as
databases. The resulting system
includes tools for constructing wrappers that access and extract information
from web sites, and query planning technology for dynamically and efficiently
answering queries using these sources.
An essential component of Ariadne is a
wrapper around each individual data source, which provides uniform access to
the information sources so that they can be queried as if they were
relational databases. We have
developed machine learning techniques to build wrappers around websites with
relatively minimal user assistance.
We have developed several wrappers around popular web sources of
different categories, such as the CNN website for news information and Yahoo
Weather for weather related information, using the wrapper building tools.
We have also developed Theseus, an efficient
execution system for information agents [Barish et al. 2000].
More specifically, Theseus empowers users by (a) simplifying the
process of specifying agents that can manage and monitor dynamic data
sources, such as web sites, and (b) providing a flexible, optimized platform
for agent execution. Theseus allows
users to specify a query execution plan to build a new information agent by
using existing information agents.
3. Integrating Web Sources with Web Services
We have developed a
demonstration application, called Building Finder, which allows a user to
input an area of interest and provides satellite imagery of the area with the
houses and streets superimposed on the satellite image. In the process, the
Building Finder application addresses the three major challenges of integrating
information from Microsoft Terraservice with various white page directories on
the web. In Section 3.1, we describe
our approach to extracting structured address information from the white page
directories. Section 3.2 describes
how satellite imagery is integrated with information from the
white page directories.
3.1 Extraction of Data from Online Sources
The Building Finder application
utilizes white pages data from Switchboard (http://www.switchboard.com).
The Switchboard white
pages site was wrapped using our wrapper building tools. Once a wrapper is
built, the system can treat the Switchboard white pages website as a
database, i.e., the system can send queries to the wrapper such as “show all
people in the city of El Segundo whose last name starts with the letter A”.
The wrapper returns an XML document that contains the results of the query
and can be parsed using any XML parser. Figure 1 shows an example
output for the Switchboard white pages wrapper.
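
As a rough illustration of how a client might consume such wrapper output, the sketch below parses a small XML document into relational-style rows and applies the example query from the text. The element names (results, person, name, address, city, state, phone) and the sample records are assumptions made for illustration; they are not the actual schema produced by the Switchboard wrapper.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper output. The element names and records are assumed
# for illustration and do not reproduce the real wrapper schema.
SAMPLE_XML = """
<results>
  <person>
    <name>Jane Adams</name>
    <address>100 Main St</address>
    <city>El Segundo</city>
    <state>CA</state>
    <phone>(310) 555-0100</phone>
  </person>
  <person>
    <name>John Baker</name>
    <address>200 Grand Ave</address>
    <city>El Segundo</city>
    <state>CA</state>
    <phone>(310) 555-0101</phone>
  </person>
</results>
"""


def parse_white_pages(xml_text):
    """Turn each <person> element into a dict keyed by child tag name."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for person in root.findall("person"):
        rows.append({child.tag: (child.text or "").strip() for child in person})
    return rows


def last_name_starts_with(rows, city, letter):
    """Select rows for a city whose last name begins with the given letter."""
    selected = []
    for row in rows:
        last_name = row["name"].split()[-1]
        if row["city"] == city and last_name.startswith(letter):
            selected.append(row)
    return selected


rows = parse_white_pages(SAMPLE_XML)
matches = last_name_starts_with(rows, "El Segundo", "A")
```

Here the filtering is done on the client side only to keep the sketch self-contained; in the system described above, the query itself is sent to the wrapper.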

3.2 Integrating Imagery and Addresses
The Building Finder application integrates
information from several web sources, including the Microsoft Terraservice and
Landmark web services, a white pages website, and street information
from U.S. Census Tigerline files stored in a local database. XML web services are web applications
that support standard protocols such as SOAP and XML [Vasudevan 2001]. Terraservice provides access to a large
repository of satellite images and topographic maps using the SOAP and XML
protocols [Barclay et al. 1999]. The
Landmark service allows easy access to the USGS gazetteer, which provides
important points such as hospitals, schools, and churches in the United
States. The issue, however, is how to
integrate the imagery obtained from Terraservice with the white page data
and address the issue of conflation.
The Building Finder application uses the Theseus execution system to
integrate information from the Terraservice and white pages websites.
Figure 2 shows the architecture for the
Building Finder application. Using
Theseus, a user can define an integration plan that queries the streets from
a database containing street network information. The result of the query is a set of tuples consisting of street
name, city, state and zip code, which is used to query the Switchboard white
pages agent to find the addresses on those streets. The results from the Switchboard white pages
agent are then passed to the geocoder agent, which in turn provides the
latitude and longitude for the addresses.
In parallel, Theseus also obtains a satellite image from Terraservice
for the given area of interest.
Finally, the latitude and longitude points representing different
addresses and information about the addresses are superimposed on the
satellite imagery. Figure 3 shows a
screenshot of the Building Finder application.
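
The data flow of the integration plan described above can be sketched in a few lines of Python. This is not Theseus plan syntax, and every function here (query_streets, query_white_pages, geocode, fetch_image, to_pixel) is a hypothetical stand-in that returns fixed sample data instead of calling a real agent or web service. The overlay step assumes a simple linear mapping from latitude and longitude to pixels within the image bounding box, which ignores the projection and conflation issues discussed in this report.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the agents in the plan. Each returns fixed
# sample data so the sketch runs without network access.


def query_streets(area):
    """Street tuples (street, city, state, zip) for the area of interest."""
    return [{"street": "Main St", "city": "El Segundo", "state": "CA", "zip": "90245"}]


def query_white_pages(street):
    """White pages entries whose address lies on the given street."""
    return [{"name": "Jane Adams", "address": "100 " + street["street"],
             "city": street["city"]}]


def geocode(entry):
    """Attach an assumed latitude and longitude to a white pages entry."""
    return dict(entry, lat=33.92, lon=-118.416)


def fetch_image(area):
    """Describe an image covering area = (south, west, north, east)."""
    return {"width": 400, "height": 400, "bounds": area}


def to_pixel(lat, lon, image):
    """Linearly interpolate a coordinate into image pixel space."""
    south, west, north, east = image["bounds"]
    x = (lon - west) / (east - west) * image["width"]
    y = (north - lat) / (north - south) * image["height"]  # y grows downward
    return round(x), round(y)


def building_finder(area):
    with ThreadPoolExecutor() as pool:
        # The image request runs in parallel with the address pipeline.
        image_future = pool.submit(fetch_image, area)
        points = []
        for street in query_streets(area):
            for entry in query_white_pages(street):
                points.append(geocode(entry))
        image = image_future.result()
    # Superimpose: pair each name with its pixel position on the image.
    return [(p["name"], to_pixel(p["lat"], p["lon"], image)) for p in points]


AREA = (33.90, -118.43, 33.94, -118.39)  # assumed bounding box
overlay = building_finder(AREA)
```

The sketch only mirrors the order of the steps: streets feed the white pages query, whose results feed the geocoder, while the image is fetched concurrently and joined at the end.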


4. Next Steps
While the Building Finder
application successfully integrates information
from different web sources and web services, it has some limitations as well. First, the geocoder used for the Building
Finder application is not accurate enough to pinpoint buildings in a
satellite image at a resolution of 1 meter per pixel. We are building a more accurate geocoder by integrating
property tax information from several property tax sites with the U.S. Census
Bureau’s Tigerline files. Second, we
are also working on improving the accuracy of the Tigerline data using
conflation techniques. The data for
the conflation algorithms is also being extracted from online data
sources. Third, we are working on a
general approach to converting wrappers to web services. We plan to make those web services
available in UDDI as green pages.
References
[Barclay et al. 1999] Barclay, Tom, Gray, Jim and Slutz, Don 1999.
Microsoft TerraServer: A Spatial Data Warehouse, Microsoft Corporation.
[Barish et al. 2000] Barish, Greg, DiPasquo, Daniel, Knoblock,
Craig A. and Minton, Steven 2000. A dataflow approach to agent-based
information management. In Proceedings
of the 2000 International Conference on Artificial Intelligence, Las
Vegas, NV.
[Knoblock et al. 2001] Knoblock, Craig, Minton, Steven, Ambite, Jose-Luis,
Ashish, Naveen, Muslea, Ion, Philpot, Andrew and Tejada, Sheila 2001. The
ARIADNE Approach to Web-Based Information Integration. International Journal on Intelligent Cooperative Information Systems
(IJCIS) 10(1-2): 145-169.
[Vasudevan 2001] Vasudevan, Venu 2001. A Web Services Primer. http://www.xml.com/pub/a/2001/04/04/webservices/