After
Sept. 11, authorities said information-stove-piping by intelligence
agencies was one of the biggest stumbling blocks in the fight against
terrorism. Now, two leading researchers discuss different approaches
to merging government files, and cracking open their secrets.
By:
Jamie Callan
Language Technologies Institute, Carnegie Mellon University
W. Bruce Croft
Computer Science Department, University of Massachusetts, Amherst
Eduard Hovy
Digital Government Research Center
Information Sciences Institute, University of Southern California
Introduction
The
terrorist events of September 11 reminded everyone of the need for
accurate and timely government intelligence. Some of the information
that might prevent disasters is secret, and therefore inaccessible.
But in many cases, the information is present somewhere, and freely
available. The problem is getting hold of it in a usable form.
Unfortunately,
while present-day government in almost all its branches has collected,
analyzed and stored information, most of it is non-uniform. Information
is all over the place, in hundreds of different formats and systems
and versions. You don't know where to find it, how to access it,
or how to convert it to a format you can work with once you actually
have it.
One
of the principal problems facing those trying to standardize non-homogeneous
data sets is variation in terminology. For example, what one agency
calls salary, another might call income, and a third calls wages,
while using salary to mean something else entirely. Even identically named figures can differ: one
agency might calculate monthly average prices of unleaded gasoline
in California by measuring wholesale rates each month, while another
measures prices at selected pumps weekly and averages them. The
results will differ, but both will be called "average monthly gasoline
prices in California".
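The gasoline case can be made concrete with a small sketch. The readings, the `Series` record, and the `comparable` check below are all hypothetical, invented only to illustrate how two figures published under one label can diverge, and how recording the derivation method as metadata would expose the difference.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Series:
    label: str    # the name the agency publishes under
    method: str   # metadata: how the figure was actually derived
    value: float  # dollars per gallon


LABEL = "average monthly gasoline prices in California"

# Hypothetical readings in dollars per gallon, invented for illustration.
wholesale_readings = [1.10]                      # one wholesale rate for the month
weekly_pump_readings = [1.42, 1.45, 1.39, 1.48]  # selected pumps, sampled weekly

agency_a = Series(LABEL, "monthly wholesale rate", mean(wholesale_readings))
agency_b = Series(LABEL, "mean of weekly pump prices", mean(weekly_pump_readings))


def comparable(x: Series, y: Series) -> bool:
    """Treat two series as mergeable only if both label and method agree."""
    return x.label == y.label and x.method == y.method


if __name__ == "__main__":
    print(agency_a.label == agency_b.label)  # same published name
    print(agency_a.value, agency_b.value)    # different numbers
    print(comparable(agency_a, agency_b))    # method metadata flags the mismatch
```

Matching on the label alone would silently merge the two series; the extra `method` field is one simple way a system could avoid that.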
Clearly,
this state of affairs causes confusion not only for Government workers,
but also for journalists, congressional staffers, students, the
general public, and intelligence officers. All would benefit from
government information systems that locate, retrieve, and integrate
desired information quickly, handling transparently the details
of which databases contain the information or in what format it
is presented. No system should expect its patrons to trust its results
unquestioningly, so these information systems should also make it
easy to examine the relationships among documents and/or databases
with similar content if desired.
The
basis for any new system is metadata, that is, data that describes
data or collections of data. The Dewey Decimal system, the Library
of Congress Subject Headings, Medical Subject Headings (MeSH), and
many other controlled vocabularies (sometimes called ontologies)
are all familiar forms of metadata. Each document is catalogued
by a small number of terms from the controlled vocabulary, as is
each information request, and matching them is very simple.
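A minimal sketch of this kind of matching follows. The vocabulary, the catalogue entries, and the function names are hypothetical, chosen only for illustration; real systems such as library catalogues are far richer. The idea is simply that documents and requests are both reduced to sets of controlled terms, so matching becomes set intersection.

```python
# Hypothetical controlled vocabulary and catalogue, invented for illustration.
VOCABULARY = {"energy prices", "labor statistics", "public health", "transportation"}

CATALOGUE = {
    "doc-001": {"energy prices", "transportation"},
    "doc-002": {"labor statistics"},
    "doc-003": {"public health", "labor statistics"},
}


def validate(terms: set[str]) -> set[str]:
    """Reject any term that is not part of the controlled vocabulary."""
    unknown = terms - VOCABULARY
    if unknown:
        raise ValueError(f"terms not in controlled vocabulary: {sorted(unknown)}")
    return terms


def match(request_terms: set[str]) -> list[str]:
    """Return ids of documents sharing a term with the request, best overlap first."""
    request = validate(request_terms)
    scored = [(len(terms & request), doc_id) for doc_id, terms in CATALOGUE.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for overlap, doc_id in ranked if overlap > 0]


if __name__ == "__main__":
    print(match({"labor statistics", "public health"}))
```

The simplicity has a cost: a request phrased with a term outside the vocabulary (say, "salary") is rejected by `validate` rather than matched, so the person asking must already know the vocabulary.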
But
practical experience has shown that integrating vast and disparate
term sets and data definitions to create new forms of metadata is
fraught with difficulty. The U.S. Government has funded several
metadata initiatives, including the Government Information Locator
Service (GILS) and the Advanced Search Facility (ASF) (http://www.gils.net/,
http://asf.gils.net/). These projects
perform exemplary work in establishing a structure of cooperation
and standards between agencies, including structural information
(formats, encodings, links). However, they do not focus on the actual
creation of metadata, nor do they define the algorithms needed to
generate it.
Experience
with traditional forms of metadata has shown that it is expensive
and time-consuming to produce, that people (e.g., authors) often
resist creating it when there is no immediate or direct benefit,
and that information-seekers often find it difficult to relate their
requests to pre-specified ontologies or controlled vocabularies.
Generating a common ontology for a domain also tends to be controversial.
New standards for communicating metadata, such as XML, do nothing
to address the underlying issue of where it originates. Controlled
vocabularies and relatively static ontologies are not solid foundations
for information systems that must cover a wide range of subjects,
support rapid integration of new information, be easy for the general
population to use, and be maintainable at moderate expense.
Large-scale use of metadata requires new answers to fundamental
questions.
Recently,
the Digital Government program of the National Science Foundation
has funded a number of projects to address the challenge of integrating
large, heterogeneous, widely distributed and disparate Government
data collections. In this paper, we describe two complementary approaches:
large-ontology-based data access planning using small, semi-automatically
acquired domain models, and dynamic metadata creation from
language models.