
The EDC Project
Project Overview
Abstract
The massive amount of statistical and text data available from Federal agencies has created a set of daunting challenges for both the research and analysis communities. These challenges include heterogeneity, size, distribution, and control of terminology. At the Digital Government Research Center (DGRC) we are investigating solutions to three key problems, namely, (1) ontological mappings for terminology standardization; (2) data integration across databases with high-speed query processing; and (3) interfaces for query input and presentation of results. This collaboration between researchers from Columbia University and the Information Sciences Institute (ISI) of the University of Southern California employs technology developed at both locations, in particular ISI's SENSUS ontology, ISI's SIMS multi-database access planner, Columbia's LKB automated dictionary and terminology analysis system, and others. The pilot application targets gasoline data from BLS, EIA, Census, and other agencies.
Somewhat more detail
The DGRC Energy Data Collection (EDC) Project was started in the National Science Foundation's Digital Government program in 1999. It is developing solutions to three key problems in accessing large distributed data collections: (1) sophisticated planning of access to multiple distributed and heterogeneous databases; (2) the use of a large ontology as metadata for terminology standardization and user guidance; (3) flexible multilingual user interfaces for query input, ontology browsing, and result presentation.
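To make the second point concrete, here is a minimal sketch of how an ontology can act as metadata for terminology standardization: each agency's own term is mapped to a shared concept, and a query is posed against the concept rather than against any one agency's vocabulary. This is an illustration only, not the structure of SENSUS or the behavior of SIMS. The mapping table, the concept names, the series labels, and the function names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical term-to-concept table; the entries are illustrative only
# and are not taken from any agency's actual glossary.
TERM_TO_CONCEPT = {
    ("EIA", "motor gasoline"): "gasoline",
    ("BLS", "gasoline, unleaded regular"): "gasoline",
    ("Census", "gasoline station sales"): "gasoline-retail-sales",
}


@dataclass(frozen=True)
class Series:
    source: str  # agency that publishes the series
    term: str    # the agency's own label for what is measured


def normalize(term: str) -> str:
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def concept_of(series: Series):
    """Return the shared concept for a series, or None if unmapped."""
    return TERM_TO_CONCEPT.get((series.source, normalize(series.term)))


def find_series(concept: str, catalog):
    """Return every catalog series whose term maps to the concept."""
    return [s for s in catalog if concept_of(s) == concept]


# A made-up catalog of series drawn from three sources.
CATALOG = [
    Series("EIA", "Motor Gasoline"),
    Series("BLS", "Gasoline,  unleaded regular"),
    Series("Census", "Gasoline station sales"),
]
```

Calling `find_series("gasoline", CATALOG)` returns the EIA and BLS entries even though the two agencies label them differently. A real ontology would also carry definitions and relations between concepts, which this flat table leaves out.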
The EDC team has worked regularly with representatives of major Federal and State statistics agencies, and with other organizations and individuals, to collect and disseminate statistical data. These representatives come primarily from the Census Bureau, the Bureau of Labor Statistics (BLS), the Energy Information Administration (EIA) of the Department of Energy (DoE), and the California Energy Commission (CEC). Other agencies we have met with include the National Center for Health Statistics (NCHS) and the Los Angeles County Administration.
The Energy Information Administration provides extensive monthly energy data to the public on its Internet site, http://www.eia.doe.gov. This site is heavily browsed, receiving hundreds of thousands of hits a month, even though most of the information is available only as standard web (HTML) pages or as prepared PDF documents, and monthly data can be obtained in this manner only for the last few years. Current facilities thus provide only limited access to this very rich data source. Some portion of the data is also accessible by querying, but two serious problems hamper the current query system. First, it gives no visibility to the many definitions and footnotes that explain the complex nature of the data and the changes that occur in series over time. Lack of awareness of such explanatory information often makes incomparable figures appear to be comparable. Second, the difficulty of defining queries makes the querying system useful only to expert users.
The EDC Project is addressing both problems. Techniques are being developed and implemented to make the complexities of data series either transparent or more visible to users, depending on whether the system can handle them independently. Novel query facilities and other data analysis and presentation capabilities are also being developed for the more common potential user of EIA data: the non-expert. Our aim is to build a system that, though still a research prototype, will benefit various segments of the public, as the high demand for monthly energy data suggests it can. Besides the large number of users browsing EIA's Web site, hardcopy publications further disseminate EIA's monthly energy data. Such paper publications include the Monthly Energy Review and the International Petroleum Statistics Report.
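The first problem, explanatory notes that stay hidden from the user, can be illustrated with a small sketch in which a data series carries its definitional changes along with its values, so that any comparison spanning a change returns the relevant note together with the number. This is one possible approach under stated assumptions, not the project's implementation. The class names, the `compare` function, and all values and notes are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SeriesBreak:
    period: str  # first period under the new definition, as "YYYY-MM"
    note: str    # footnote text explaining what changed


@dataclass
class DataSeries:
    name: str
    values: Dict[str, float]  # period ("YYYY-MM") -> reported value
    breaks: List[SeriesBreak] = field(default_factory=list)


def compare(series: DataSeries, start: str, end: str) -> Tuple[float, List[str]]:
    """Return the change from start to end plus any notes that apply.

    A note applies when its break falls after the earlier period and on
    or before the later one, meaning the two figures were produced under
    different definitions. "YYYY-MM" strings sort chronologically, so
    plain string comparison is enough here.
    """
    lo, hi = sorted((start, end))
    notes = [b.note for b in series.breaks if lo < b.period <= hi]
    return series.values[end] - series.values[start], notes


# Made-up series with one definitional change partway through.
example = DataSeries(
    name="hypothetical monthly price index",
    values={"1999-01": 100.0, "1999-06": 110.0, "2000-01": 120.0},
    breaks=[SeriesBreak("1999-09", "Survey coverage changed (invented note).")],
)
```

With this structure, an interface can show the notes next to a result whenever the list is non-empty, rather than leaving the user to find the footnote separately.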
To support homogeneous access to multiple databases, the EDC project focuses on several principal research directions: database management and access planning; construction of an overarching ontology in which terms are defined and standardized; systems that automatically extract domain terms and definitional information from glossaries and other text; and user-friendly interfaces. In more detail: