WebScripter Architecture

WebScripter consists of two main software components (as shown in Figure C-1): the Report Authoring Environment for defining reports, and the Report Instantiation component that fills in reports with instance data and posts them back to the Web.

 


Figure C-1

The sub-components of the Report Instantiation component are shown in the upper-right corner of the figure. The input to this component is the Report Representation, a data structure that records the skeleton of a report and the DAML manipulation scripts to be used to fill in the skeleton with data.

Report representations can contain four kinds of scripts: generation scripts, which generate columns of data by querying the Web or the knowledge base; extraction scripts, which extract pieces of information from retrieved pages and place the results in other cells; filtering scripts, which eliminate rows of data; and join scripts, which cross-reference information from multiple reports to fuse data.
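To make the data structure concrete, the following is a minimal Java sketch of a report representation that holds a column skeleton and the four kinds of scripts. The class, method and field names are hypothetical, and the script body is a plain string standing in for a DAML manipulation script, whose actual format is not specified here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a Report Representation: a column skeleton plus scripts. */
class ReportRepresentation {
    /** The four script kinds named in the text. */
    enum ScriptKind { GENERATION, EXTRACTION, FILTERING, JOIN }

    /** One script attached to a column; the string body is a stand-in for a DAML manipulation script. */
    static final class Script {
        final ScriptKind kind;
        final String column;
        final String body;

        Script(ScriptKind kind, String column, String body) {
            this.kind = kind;
            this.column = column;
            this.body = body;
        }
    }

    private final List<String> columns = new ArrayList<>();
    private final List<Script> scripts = new ArrayList<>();

    /** Adds a column to the report skeleton. */
    ReportRepresentation addColumn(String name) {
        columns.add(name);
        return this;
    }

    /** Attaches a script to an existing column; unknown columns are rejected. */
    ReportRepresentation addScript(ScriptKind kind, String column, String body) {
        if (!columns.contains(column)) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        scripts.add(new Script(kind, column, body));
        return this;
    }

    /** Returns a copy of the column names in skeleton order. */
    List<String> columns() {
        return new ArrayList<>(columns);
    }

    /** Returns the scripts of one kind, in the order they were added. */
    List<Script> scriptsOfKind(ScriptKind kind) {
        List<Script> result = new ArrayList<>();
        for (Script s : scripts) {
            if (s.kind == kind) {
                result.add(s);
            }
        }
        return result;
    }
}
```

Keeping the scripts in insertion order and grouping them by kind on demand is only one possible design; it lets an interpreter run all generation scripts before the other kinds.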

The Report Interpreter uses these scripts to fill the report with data. It uses the generation scripts to query the knowledge base for DAML instances that match a specification, and populates the corresponding column. It then applies the extraction, filtering and join scripts to populate the rest of the cells with data. The Report Interpreter emits its output in an XML format, which the HTML generator converts into a DAML-enhanced HTML document that gets posted on the Web. By default, the generator produces a simple HTML table. Users can supply an XSL style sheet to customize the appearance of the report.
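The order of operations described above can be sketched in Java as follows. This is an illustration under simplifying assumptions, not the interpreter itself: rows are plain string maps, scripts are ordinary Java functions rather than DAML manipulation scripts, the intermediate XML format is skipped, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch of the interpreter steps: generate, extract, filter, join, render. */
class ReportInterpreterSketch {
    /** Generation step: one row per generated value in the seed column. */
    static List<Map<String, String>> generate(String column, List<String> values) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String value : values) {
            Map<String, String> row = new LinkedHashMap<>();
            row.put(column, value);
            rows.add(row);
        }
        return rows;
    }

    /** Extraction step: derive a new cell from an already filled cell. */
    static void extract(List<Map<String, String>> rows, String from, String to,
                        Function<String, String> script) {
        for (Map<String, String> row : rows) {
            row.put(to, script.apply(row.get(from)));
        }
    }

    /** Filtering step: drop the rows that fail the predicate. */
    static void filter(List<Map<String, String>> rows, Predicate<Map<String, String>> keep) {
        rows.removeIf(keep.negate());
    }

    /** Join step: fill a cell by looking up a key column in another report. */
    static void join(List<Map<String, String>> rows, String keyColumn, String to,
                     Map<String, String> otherReport) {
        for (Map<String, String> row : rows) {
            row.put(to, otherReport.getOrDefault(row.get(keyColumn), ""));
        }
    }

    /** Default rendering: a simple HTML table with one header row. */
    static String toHtmlTable(List<String> columns, List<Map<String, String>> rows) {
        StringBuilder sb = new StringBuilder("<table>\n<tr>");
        for (String c : columns) {
            sb.append("<th>").append(escape(c)).append("</th>");
        }
        sb.append("</tr>\n");
        for (Map<String, String> row : rows) {
            sb.append("<tr>");
            for (String c : columns) {
                sb.append("<td>").append(escape(row.getOrDefault(c, ""))).append("</td>");
            }
            sb.append("</tr>\n");
        }
        return sb.append("</table>").toString();
    }

    /** Escapes the characters that would otherwise break the HTML markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

A caller would invoke generate once for the seed column, then apply the other steps in turn before rendering the table.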

The sub-components of the Report Authoring Environment are shown in the lower-left part of the figure. The objective of the authoring environment is to enable an ordinary user to easily produce the possibly intricate network of scripts that defines a report. The goal is to hide the scripts from users and let them specify a report by showing the system the pieces of information that go into it. To do so, users interact primarily with the Report Worksheet and the Web Browser.

The Report Worksheet presents users with a spreadsheet in which they can specify a report. In a typical interaction, users first specify how to generate the rows of a report (e.g., by entering a URL and asking the system to populate the column with instances of the same type). When the user specifies the generator for a column, the Report Builder queries the knowledge base for matching instances and populates the column. The user can then work sideways (typically to the right) to specify how to compute the values of other cells from the values of cells that are already filled in.

To make this accessible to end users, the user will be able to click on a cell to open the document from which its value was extracted, and then copy and paste (or drag) another part of that document into another cell. For example, if the user double-clicks on the cell for USC, the USC home page opens in the browser, and the user can copy and paste the address into another cell. The Report Builder will infer the DAML manipulation script that computes the pasted value from the DAML instances defined in the original document.

To make this work, we will implement a Proxy Server that performs two jobs. First, it keeps track of page navigation so that the Report Builder can infer scripts that span multiple pages. Second, it uses Dynamic HTML to enhance the presentation of DAML-enabled Web pages, highlighting the portions of a page that have underlying DAML markup and enabling the copy and paste (or drag) operations to transfer the relevant DAML markup to the pasted location. We will use programming-by-demonstration techniques from our previous work to generalize examples into scripts. When the examples are ambiguous, the Report Builder will engage the user in a Clarification Dialogue.
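As a rough illustration of how a pasted value could be traced back to DAML markup, the Java sketch below treats a DAML instance as a map from property names to values and looks for the properties whose value matches the pasted text. This is a deliberately naive stand-in for the programming-by-demonstration techniques mentioned above; the class, the property names and the values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: map a pasted value back to the instance properties that could have produced it. */
class PasteInferenceSketch {
    /** Returns, in sorted order, the properties of the instance whose value equals the pasted text. */
    static List<String> candidateProperties(Map<String, String> instance, String pasted) {
        String wanted = pasted.trim();
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, String> entry : instance.entrySet()) {
            if (entry.getValue().equals(wanted)) {
                candidates.add(entry.getKey());
            }
        }
        Collections.sort(candidates);
        return candidates;
    }

    /** More than one candidate means the example is ambiguous and the user should be asked. */
    static boolean needsClarification(List<String> candidates) {
        return candidates.size() > 1;
    }
}
```

When exactly one property matches, the inferred script is simply "read that property of the row's instance"; when several match, the sketch signals that a clarification dialogue would be needed.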

The Knowledge Base supports the reasoning requirements of the Report Builder. The Report Builder needs to compute the instances, super-types and sub-types of a category, and to answer other similarly simple requests. We hope that the DAML community will provide an appropriate Knowledge Base, but we are prepared to use PowerLoom or a relational database as a fallback. We also need a Web crawler that combs the Web for DAML-enabled pages and loads the corresponding instances and ontologies into the knowledge base; we expect the DAML community to provide one as well.
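The three requests named above (instances, super-types and sub-types of a category) can be sketched with a small in-memory structure in Java. This is only an illustration of the query interface the Report Builder would need; it is not PowerLoom or any DAML community component, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory knowledge base answering instance, super-type and sub-type queries. */
class InMemoryKnowledgeBase {
    private final Map<String, Set<String>> directSupers = new HashMap<>();
    private final Map<String, Set<String>> directInstances = new HashMap<>();

    /** Records that `sub` is a direct sub-type of `sup`. */
    void addSubtype(String sub, String sup) {
        directSupers.computeIfAbsent(sub, k -> new HashSet<>()).add(sup);
    }

    /** Records that `instance` is a direct instance of `category`. */
    void addInstance(String category, String instance) {
        directInstances.computeIfAbsent(category, k -> new HashSet<>()).add(instance);
    }

    /** All super-types of a category, following sub-type links transitively. */
    Set<String> superTypes(String category) {
        Set<String> result = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>(directSupers.getOrDefault(category, Set.of()));
        while (!todo.isEmpty()) {
            String current = todo.pop();
            if (result.add(current)) { // visit each category once, so cycles terminate
                todo.addAll(directSupers.getOrDefault(current, Set.of()));
            }
        }
        return result;
    }

    /** All categories that have the given category among their super-types. */
    Set<String> subTypes(String category) {
        Set<String> result = new TreeSet<>();
        for (String candidate : directSupers.keySet()) {
            if (superTypes(candidate).contains(category)) {
                result.add(candidate);
            }
        }
        return result;
    }

    /** Instances of the category itself and of all its sub-types. */
    Set<String> instances(String category) {
        Set<String> result = new TreeSet<>(directInstances.getOrDefault(category, Set.of()));
        for (String sub : subTypes(category)) {
            result.addAll(directInstances.getOrDefault(sub, Set.of()));
        }
        return result;
    }
}
```

The sub-type query recomputes super-types for every category, which is fine for a sketch but would call for an index or a real reasoner at Web scale.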

The Ontology Triangulation Reasoner is our component for bridging ontologies. Given an entity in one ontology, the Triangulation Reasoner will identify closely related entities in other ontologies. It will do so by exploiting redundancy. The basic idea is the following: categories A and B are likely to be closely related if the reasoner can find a category C that contains most of the instances of A, most of the instances of B, and not much else. We plan to use variations on this idea to enable the Report Builder to generalize categories (e.g., if the user puts instances of W3C:University in one column, they may also want instances of UN:University, a suggestion computed from the number of instances declared to be both and the prevalence of these classifications in university indices).
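The basic triangulation test can be written down directly in Java. The sketch below interprets "most" and "not much else" with a single coverage threshold, which is an assumption made for illustration; the actual reasoner may weigh the evidence differently. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the triangulation test over sets of instance identifiers. */
class TriangulationSketch {
    /** Fraction of the elements of `part` that also belong to `whole`; 0 for an empty `part`. */
    static double coverage(Set<String> part, Set<String> whole) {
        if (part.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String x : part) {
            if (whole.contains(x)) {
                hits++;
            }
        }
        return (double) hits / part.size();
    }

    /** True if C contains most of A, most of B, and not much else. */
    static boolean bridges(Set<String> a, Set<String> b, Set<String> c, double threshold) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return coverage(a, c) >= threshold      // C contains most of A
            && coverage(b, c) >= threshold      // C contains most of B
            && coverage(c, union) >= threshold; // most of C lies inside A or B
    }

    /** Names of the candidate categories that bridge A and B, in sorted order. */
    static List<String> bridgingCategories(Set<String> a, Set<String> b,
                                           Map<String, Set<String>> candidates, double threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : candidates.entrySet()) {
            if (bridges(a, b, entry.getValue(), threshold)) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

If at least one bridging category is found, A and B would be treated as closely related; the third condition is what rules out very broad categories that trivially contain everything.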

The software deliverables of our project are all the shaded components of Figure C-1. The software will be delivered as Java JAR files. Provided that the owners of the components we use but did not build give permission, we will repackage those components in our software distribution and provide an “installer” program to make deploying the complete system as easy as possible.