The EDC Project


Project Overview

Abstract

The massive amount of statistical and text data available from Federal agencies has created a set of daunting challenges for both the research and analysis communities. These challenges include heterogeneity, size, distribution, and control of terminology. At the Digital Government Research Center (DGRC) we are investigating solutions to three key problems: (1) ontological mappings for terminology standardization; (2) data integration across databases with high-speed query processing; and (3) interfaces for query input and presentation of results. This collaboration between researchers from Columbia University and the Information Sciences Institute (ISI) of the University of Southern California employs technology developed at both locations, in particular ISI's SENSUS ontology, ISI's SIMS multi-database access planner, Columbia's LKB automated dictionary and terminology analysis system, and others. The pilot application targets gasoline data from BLS, EIA, Census, and other agencies.


Somewhat more detail

The DGRC Energy Data Collection (EDC) Project was started in the National Science Foundation's Digital Government program in 1999. It is developing solutions to three key problems in accessing large distributed data collections: (1) sophisticated planning of access to multiple distributed and heterogeneous databases; (2) the use of a large ontology as metadata for terminology standardization and user guidance; (3) flexible multilingual user interfaces for query input, ontology browsing, and result presentation.

The EDC team has worked on a regular basis with representatives of major Federal and State statistics agencies, and with other organizations and individuals, to collect and disseminate statistical data. These representatives come primarily from the Census Bureau, the Bureau of Labor Statistics (BLS), the Energy Information Administration (EIA) of the Department of Energy (DoE), and the California Energy Commission (CEC). Other agencies we have met with include the National Center for Health Statistics (NCHS) and the Los Angeles County Administration.

The Energy Information Administration provides extensive monthly energy data to the public on its Internet site http://www.eia.doe.gov. This site is heavily browsed, receiving hundreds of thousands of hits a month, even though most of the information is available only as downloads of standard web (HTML) pages or as prepared PDF documents. Monthly data can be obtained in this manner only for the last few years. Current facilities thus provide only limited access to this very rich data source. Some portion of the data is also accessible by querying, but two serious problems hamper the current query system. First, it does not make visible the many definitions and footnotes that explain the complex nature of the data and the changes that occur in series over time. Lack of awareness of such explanatory information often makes incomparable figures appear to be comparable. Second, the difficulty of defining queries makes the querying system useful only to expert users.

The EDC Project is addressing both problems. Techniques are being developed and implemented to make the complexities of data series either transparent or more visible to users, depending on whether the system can handle them independently. Novel query facilities and other data analysis and presentation capabilities are also being developed for the more common potential user of EIA data: the non-expert. Our aim is to build a system that, though still a research prototype, will benefit various segments of the public, as the high demand for monthly energy data suggests. In addition to the large number of people browsing EIA's Web site, hardcopy publications further disseminate EIA's monthly energy data. Such paper publications include the Monthly Energy Review and the International Petroleum Statistics Report.

To support homogeneous access to multiple databases, the EDC project focuses on several principal research directions: database management and access planning; construction of an overarching ontology in which terms are defined and standardized; systems that automatically extract domain terms and definitional information from glossaries and other text; and user-friendly interfaces. In more detail:

  • Information Integration and Database Access. Our research concentrates on effective methods to identify and describe the contents of databases so that useful information can be accurately and efficiently located even when precise answers are unavailable. At USC/ISI we have deployed the SIMS database access planner that uses domain models to reason about effective data access. Gathering data from various sources, including EIA, Census, and Bureau of Labor Statistics, we have wrapped over 50,000 data tables in over 120 databases for testing information integration, using wrapping tools developed at ISI in other projects; see example data sources. Early research at Columbia University focused on the aggregation of data. Current research at Columbia focuses on fast access to and display of large quantities of data.
  • Ontology Construction. We have extended USC/ISI's large formal SENSUS ontology, viewable via the SENSUS browser DINO. SENSUS currently contains over 90,000 terms linked together into a subsumption (isa) network, with additional links for part-of, pertains-to, and so on. To use it in database access planning, we incorporated domain models (for example, of measures and of gasolines), and we have extended existing and developed new automated concept-to-ontology alignment algorithms.
  • Extraction of Terms and Definitional Information from Text. This work, performed at Columbia University, involves the construction of various systems that automatically identify suitable information, extract it, and format it for incorporation into the ontology. One experiment involves the extraction of approximately 6000 terms across agencies (EIA, Census SICS and NAICS codes, EPA) and the automatic handling of acronyms, using Columbia's LEXING system with the Acro-cat acronym expander.
  • User Interface Development. We are implementing several user interfaces that afford the user various ways of forming queries, including browsing the ontology, creating queries from cascaded menus, and entering natural language (English; plans call for Spanish and possibly Chinese to be added). The interface can handle integrated querying and presentation of results. An early prototype of the cascaded-menu query formation interface has been built.
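
As a rough illustration of the subsumption (isa) network mentioned under Ontology Construction, the following Python sketch stores typed links between concepts and answers the question "does concept A subsume concept B?" by walking isa links upward. It is only a toy under stated assumptions: the Ontology class, its method names, and all concept names are hypothetical and invented here for illustration. It does not reflect how SENSUS is actually represented or accessed.

```python
from collections import defaultdict


class Ontology:
    """Toy concept network with typed links (isa, part-of, pertains-to, ...).

    Hypothetical structure for illustration; not the SENSUS data model.
    """

    def __init__(self):
        # links[relation][concept] -> set of directly linked target concepts
        self.links = defaultdict(lambda: defaultdict(set))

    def add_link(self, source, relation, target):
        """Record that `source` is linked to `target` by `relation`."""
        self.links[relation][source].add(target)

    def ancestors(self, concept, relation="isa"):
        """Return every concept reachable from `concept` by following `relation`."""
        seen = set()
        stack = [concept]
        while stack:
            node = stack.pop()
            for target in self.links[relation].get(node, ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    def subsumes(self, general, specific):
        """True if `general` equals `specific` or is one of its isa-ancestors."""
        return general == specific or general in self.ancestors(specific)


# Concept names below are made up for this example.
onto = Ontology()
onto.add_link("premium-unleaded-gasoline", "isa", "unleaded-gasoline")
onto.add_link("unleaded-gasoline", "isa", "gasoline")
onto.add_link("gasoline", "isa", "petroleum-product")
onto.add_link("petroleum-product", "isa", "energy-commodity")
onto.add_link("octane-rating", "pertains-to", "gasoline")

print(onto.subsumes("petroleum-product", "premium-unleaded-gasoline"))
print(onto.subsumes("unleaded-gasoline", "gasoline"))
```

In a structure of this kind, a query phrased in terms of a general concept could be expanded to the more specific terms used by individual agency databases, which is one way an ontology can serve as metadata for terminology standardization.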