Digital Government Research Center (DGRC)

Meeting with Census, EIA, and BLS

Focus on Ontologies Metadata and Definition Analysis — March 31, 2000

Judith L. Klavans, Columbia University

 

Agenda:

10-11:30 Presentation of Project progress; Judith Klavans

11:30-2:00 Working lunch, discussion and feedback

Attendees and Discussants: John Marshall, Renee Miller, Andrew Hait, Samuel Highsmith, Mark Wallace, Dan Gillman, Fred Conrad, Cathy Dippo

Presentation Summary: Slides from Judith’s presentation can be found at www.cs.columbia.edu/digigov. The slides from this meeting were updated and presented at the Interface 2000 meeting; point to "Publications and Presentations" and then to "Interface 2000". A live demo of the working lexical knowledge base creation system was also given. This working system can be found through the Columbia Digital Government Web site by clicking on "Projects" and then on the Lexical Knowledge Base (LKB) tools. Since the focus of the meeting in Washington was on ontologies, we reviewed the details of input and output. However, an update was also given on the initial state of the user interface as well as the information integration components, both of which can be found in the slides.

Discussion:

The notes that follow were taken by Judith during the meeting. They are not intended as full coverage of everything discussed, given the difficulty of simultaneously taking notes and participating in the discussion. All attendees participated actively, although only some are cited by name below. The notes have been vetted by all attendees for correctness.

1. Viewing the source data

Judith showed the source data, with the colorful interface that she and her student, Brian Whitman, designed specifically for display of different views on the lexical knowledge base. Initial discussion from the group focused on ways to view this data. There are two potential client groups for the ontology. The first is composed of analysts, who will use the ontology as a way to understand the organizational logic of the knowledge. The second potential group is composed of more casual, less expert users who might be interested in a display of a limited set of ontology data. Dan Gillman commented that it is difficult to evaluate the way we have looked at the data without seeing the new user interface. This led us into a discussion of the UI. Leading into the next meeting in the summer or fall on the UI, Judith asked about what people wanted in general from the UI. Cathy would like to consider the utility of including relevant industry-based statistics rather than restricting research to product-based statistics. This addition would lead to increased complexity but would also add significant amounts of data and metadata.

2. UI desiderata

The experience of the people who have already built UIs for these kinds of applications is that users want a wall-less interface, i.e., they don’t want to know where the data comes from. They are only interested in what the UI contains that can answer their questions. Also, they prefer fewer levels, in one or two screens, rather than a long branch of a tree. Since this meeting is on source data and not the UI, we separated the two topics for the bulk of the remaining discussion.

3. Metadata and Terms

Metadata drive systems, so users need to know what a term is. The important metadata will help divide the landscape on user interface, will present a problem space to people, and will provide criteria to select to navigate through it. Some items are important, and then more sophisticated users will need more detail, say a second level.

Andy added that on the CD-ROM application, the user is given a description of the data. Right now in that list there is a related data files topic. In their world, they point to the 1997 economic geographic census file. They would like to see the lexical knowledge base running behind the scenes, so a user could then go out and find something related from a different agency. An application for a particular industry or product could be running behind the scenes. A user could ask: is it close enough to my interest to look?

Jack asked how this will help people identify whether a particular term is broad or narrow. A term may mean different things in different contexts. For example, a user might decide that the difference between a 3-week month and a 4-week month doesn’t matter, or that current vs. constant dollars doesn’t matter. This needs to be pulled together in an organized fashion.

Mark suggested that we ought to be able to point out the differences in related sources. Metadata repositories are populated by hand now, but we could do this automatically with text pulled from different places. E.g., Census is building corporate metadata: forms, collection metadata, design, processing, dissemination. There are different classes of metadata needed to derive applications. We need a central storehouse, a formal catalogue. Right now, someone has to keypunch and edit manually from different files.

Dan commented that 8 or 9 agencies are building metadata repositories.

Cathy pointed out that a wizard could be used to alert users to terminologies and definitions, but that care would be needed over the level of detail. For example, even pulling out these terms is complex. The goal would be for a system to "read" tech paper 63 and then pull out all relevant terms. Let’s say tech paper 63 gives documentation on a survey, e.g. the Current Population Survey (CPS, unemployment); there are also smaller related documents for all surveys. The wizard would take that documentation and put it into a metadata repository. In the data is the fact that CPS is measured on the 12th of the month, with collection the week of the 19th. This would become part of the derived metadata. Then a user reading information from EIA and BLS could look at these basic facts.
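As a minimal sketch of what such derived metadata might look like once stored, the Python below models a small repository of facts. The class names, field names, and the assignment of detail levels are all hypothetical, chosen only for illustration; the two CPS facts are the ones mentioned in the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedFact:
    """One fact a wizard might pull from survey documentation (hypothetical schema)."""
    survey: str
    attribute: str
    value: str
    source_document: str
    detail_level: int  # assumption: 1 = basic fact, 2 = detail for expert users


class MetadataRepository:
    """Toy in-memory stand-in for a central metadata storehouse."""

    def __init__(self):
        self._facts = []

    def add(self, fact):
        self._facts.append(fact)

    def facts_for(self, survey, max_level=1):
        # Return only facts at or below the requested level of detail.
        return [
            f for f in self._facts
            if f.survey == survey and f.detail_level <= max_level
        ]


repo = MetadataRepository()
# The detail levels below are assigned arbitrarily for the example.
repo.add(DerivedFact("CPS", "reference day", "12th of the month", "tech paper 63", 1))
repo.add(DerivedFact("CPS", "collection period", "week of the 19th", "tech paper 63", 2))

for fact in repo.facts_for("CPS", max_level=2):
    print(f"{fact.survey}: {fact.attribute} = {fact.value} ({fact.source_document})")
```

Filtering by level is one possible way to address the concern about level of detail: a casual user would query with the default level, and an analyst would ask for more.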

Renee raised questions on reconciling different definitions. If EPA has something on aviation gasoline, we need to link it with other agency data. The integration process could even be aggregation from one place. The ontology could do this, in part. Another option is that data could be partially fused based on the meaning. The question is: what appears in the final product? How do the different steps feed into one another? Further analysis will hopefully point out more. What needs to be done is to get a vision of the linking and ontology. For example, we mention aviation gasoline, say in NAICS 324110, in the definition of that industry. Then we want to go to the EIA definition of aviation gasoline to see how it relates. We envisioned a search tool where someone asks about a topic such as gasoline. The interface shows terms with linkages. This would be built into our interface.
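The linking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the definition texts are invented placeholders (not quoted from any agency), the "EIA glossary" and "BLS glossary" source labels are hypothetical, and plain substring matching stands in for whatever lexical analysis the real system would use.

```python
from collections import defaultdict

# Placeholder definitions; the wording is illustrative only.
DEFINITIONS = [
    {"agency": "Census", "source": "NAICS 324110",
     "text": "Illustrative industry definition that mentions aviation gasoline."},
    {"agency": "EIA", "source": "EIA glossary",
     "text": "Illustrative definition of aviation gasoline as an aircraft fuel."},
    {"agency": "BLS", "source": "BLS glossary",
     "text": "Illustrative definition of a price index, unrelated to fuels."},
]


def build_term_index(definitions, terms):
    """Map each term to the (agency, source) pairs whose definition mentions it."""
    index = defaultdict(list)
    for d in definitions:
        lowered = d["text"].lower()
        for term in terms:
            if term.lower() in lowered:
                index[term].append((d["agency"], d["source"]))
    return dict(index)


def linked_sources(index, term):
    """Return every source linked to a term, or an empty list if none."""
    return index.get(term, [])


index = build_term_index(DEFINITIONS, ["aviation gasoline", "price index"])
for agency, source in linked_sources(index, "aviation gasoline"):
    print(f"aviation gasoline -> {agency}: {source}")
```

A search interface could use such an index to show, for a query term, which agencies define or mention it, leaving the user to compare the definitions.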

4. Uses of Metadata

We have a wide variety of documents with all kinds of metadata. We want to classify the metadata, e.g., description of survey, universe, variables, if this helps in parsing that information. Fred and Gary asked how the back end would or could be viewed. We have, for example, huge tables; we need to know how to help people navigate and manipulate them. Is it best to use lexical analysis or the ontology?

Jack and Andrew brought up layers of an interface. The user queries by type of interface. For example, SENSUS is for the designer/analyst so that interface would be internal only to the system.

5. Data Exchange

On exchanging information, we discussed an XML labeling paradigm for data exchange and for DTD construction.
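To make the XML labeling idea concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree. The element and attribute names (term, name, definition, source, agency) and the DTD are hypothetical; no schema was fixed at the meeting. Note that ElementTree does not validate against a DTD, so the DTD string is shown only to illustrate what a constructed DTD might declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for a labeled term record (illustration only, not validated here).
TERM_DTD = """<!ELEMENT term (name, definition, source)>
<!ATTLIST term agency CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT definition (#PCDATA)>
<!ELEMENT source (#PCDATA)>"""


def term_to_xml(agency, name, definition, source):
    """Serialize one term record as a labeled XML string."""
    term = ET.Element("term", attrib={"agency": agency})
    ET.SubElement(term, "name").text = name
    ET.SubElement(term, "definition").text = definition
    ET.SubElement(term, "source").text = source
    return ET.tostring(term, encoding="unicode")


def xml_to_term(xml_text):
    """Parse a labeled XML string back into a plain dictionary."""
    term = ET.fromstring(xml_text)
    return {
        "agency": term.get("agency"),
        "name": term.findtext("name"),
        "definition": term.findtext("definition"),
        "source": term.findtext("source"),
    }


# Placeholder definition text, not an actual agency definition.
record = term_to_xml("EIA", "aviation gasoline",
                     "Illustrative definition text.", "EIA glossary")
print(record)
print(xml_to_term(record))
```

Because every field carries an explicit label, a receiving agency could read the record without knowing the sender's internal file layout.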

 

As part of this assemblage of ontology and metadata, one thing to remember is that there is a hierarchy to that metadata. We need to find and display the hierarchical information.

The conclusion was that some people, such as analysts, care a lot about hierarchy, ontology, and metadata. But others simply don’t care about the metadata. At what point do we have an obligation to tell them more? Such questions should be explored in focus groups and by contact with users in a usability lab to determine usefulness issues. In sum, the interface with its data could be great, but users need to be able to do what they want to do. This will be the topic of the next meeting.