Demos & Downloads
Abstract Meaning Representation (AMR)
The AMR Bank is a set of English sentences paired with simple, readable semantic representations. We aim for it to spur new research in natural language understanding, generation, and translation. Using Ulf Hermjakob's powerful AMR Editor, an annotator can translate a sentence into its meaning in about 10 minutes.
Mashup Construction with Karma
The basic issues involved in the mashup creation process are data retrieval, source modeling, data cleaning, data integration, and data visualization. With Karma, all of these issues are addressed in one seamless interactive process, and the user solves each issue indirectly by providing only examples. In this demo, we create a mashup that gathers data from different kinds of sources, such as an Excel spreadsheet, the Google News website, a CSV file, and a database, in an emergency management scenario.
This demo generates plans to automatically integrate data across sources. We can use various techniques to extract data from a wide variety of sources. However, different sources often have different schemas, access methods, and coverage.
To exploit the road network in a raster map, the first step is to extract the pixels that constitute the roads and then vectorize those road pixels. Identifying the colors that represent roads in raster maps, which is required to extract the road pixels, is difficult.
We implement an approach that minimizes the user input required to identify the road colors representing the road network in a raster map.
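The blurb does not spell out the approach, but the core idea of color-based road-pixel extraction can be illustrated with a minimal sketch: the user supplies a single example road pixel, and all pixels within a color distance of it are marked as road. The function name, the Euclidean-RGB metric, and the threshold are assumptions for illustration, not ISI's actual method.

```python
import numpy as np

def extract_road_pixels(raster, seed_rgb, threshold=30.0):
    """Return a boolean mask of pixels whose color lies within
    `threshold` (Euclidean RGB distance) of a user-supplied seed
    road color. `raster` is an (H, W, 3) uint8 array."""
    diff = raster.astype(np.float64) - np.asarray(seed_rgb, dtype=np.float64)
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist <= threshold

# Toy 2x2 "map": the left column is road-colored (grayish), the right is not.
tiny = np.array([[[200, 200, 200], [0, 80, 0]],
                 [[198, 202, 199], [10, 90, 5]]], dtype=np.uint8)
mask = extract_road_pixels(tiny, seed_rgb=(200, 200, 200))
# mask marks only the two left pixels as road.
```

A real system would follow this with morphological cleanup and vectorization of the surviving pixels.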
Building Mashups by Example
Creating a Mashup, a web application that integrates data from multiple web sources to provide a unique service, involves solving several problems, such as extracting data from multiple web sources, cleaning it, and combining it.
Existing work relies on a widget paradigm where users address those problems during a Mashup building process by selecting, customizing, and connecting widgets together. While these systems claim that their users do not have to write a single line of code, merely abstracting programming methods into widgets has several disadvantages.
Interactive Data Integration
In many scenarios, such as emergency response or ad hoc collaboration, it is critical to reduce the overhead in integrating data. Here, the goal is often to rapidly integrate "enough" data to answer a specific question.
Ideally, one could perform the entire process interactively under one unified interface: defining extractors and wrappers for sources, creating a mediated schema, and adding schema mappings, all while seeing how these impact the integrated view of the data, and refining the design accordingly.
Carmel is a finite-state transducer package written by Jonathan Graehl at USC/ISI. Carmel includes code for handling finite-state acceptors and transducers, weighted transitions, empty transitions on input and output, composition, k-most likely input/output strings, and both Bayesian (Gibbs sampling) and EM (forward-backward) training.
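Among the operations listed, composition is the one that glues transducer pipelines together. As a rough sketch of what that operation computes (for the epsilon-free case only, and ignoring Carmel's actual file format and implementation), the toy representation below pairs an arc of the first machine with an arc of the second whenever the first machine's output symbol matches the second's input symbol, multiplying the weights:

```python
from itertools import product

def compose(t1, t2):
    """Compose two epsilon-free weighted transducers. Each transducer
    maps state -> list of (input, output, weight, next_state) arcs;
    the composed machine runs on (state1, state2) pairs."""
    result = {}
    for (q1, arcs1), (q2, arcs2) in product(t1.items(), t2.items()):
        result[(q1, q2)] = [(a, c, w1 * w2, (n1, n2))
                            for a, b, w1, n1 in arcs1
                            for b2, c, w2, n2 in arcs2
                            if b == b2]
    return result

# t1 maps "a" -> "b" (weight 0.5); t2 maps "b" -> "c" (weight 0.4);
# their composition maps "a" -> "c" with weight 0.5 * 0.4.
t1 = {0: [("a", "b", 0.5, 1)], 1: []}
t2 = {0: [("b", "c", 0.4, 1)], 1: []}
t3 = compose(t1, t2)
```

Handling epsilon transitions correctly (which Carmel does) requires an epsilon-filter construction that this sketch omits.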
Tiburon is a tree transducer package written by Jonathan May at USC/ISI. Tiburon is designed to handle weighted regular tree grammars, context-free grammars, and both tree-to-tree and tree-to-string transducers, and can perform composition, intersection, application, determinization, inside/outside training, pruning, extraction of the k most likely trees, Viterbi derivations, and other useful operations.
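To make the Viterbi operation concrete: for a weighted regular tree grammar, the score of the best derivation from a nonterminal is the max over its rules of the rule weight times the product of the best scores of the child nonterminals. The toy grammar encoding below is an assumption for illustration, not Tiburon's format.

```python
from math import prod

# Toy weighted regular tree grammar: each nonterminal has rules of
# the form ((node_label, child_nonterminals), weight).
rules = {
    "S": [(("f", ["A", "B"]), 1.0)],
    "A": [(("a", []), 0.6), (("a2", []), 0.4)],
    "B": [(("b", []), 0.9)],
}

def viterbi(nt):
    """Score of the best derivation from nonterminal `nt`."""
    return max(
        w * prod(viterbi(c) for c in children)
        for (label, children), w in rules[nt]
    )

best_score = viterbi("S")  # best tree is f(a, b), scoring 1.0 * 0.6 * 0.9
```

A production implementation memoizes the recursion and also returns the derivation itself, not just its score.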
ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on the Web
The project presents two implementations for performing information extraction from unstructured, ungrammatical text on the Web, such as classified ads, auction listings, and forum post titles. The ARX system is an automatic approach that exploits reference sets for this extraction. The Phoebus system takes a machine-learning approach that exploits reference sets.
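The key idea behind reference sets is that a clean list of known entities anchors extraction from noisy text: a post title is matched against the reference set, and the best-matching entry supplies structured attributes. A minimal sketch using token-overlap (Jaccard) similarity follows; the similarity measure, threshold, and function names are assumptions, not the ARX or Phoebus algorithms.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def best_reference(post, reference_set, min_sim=0.3):
    """Return the reference-set entry most similar to a noisy post
    title, or None if nothing clears the similarity threshold."""
    best = max(reference_set, key=lambda ref: jaccard(post, ref))
    return best if jaccard(post, best) >= min_sim else None

refs = ["Honda Civic", "Honda Accord", "Toyota Camry"]
match = best_reference("93 civic honda runs great obo", refs)
# match == "Honda Civic"
```

The real systems use more robust string metrics and, in Phoebus's case, learned models over such similarity features.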
BSL: A system for learning blocking schemes
Record linkage is the problem of determining the matches between two data sources. As data sources grow larger and larger, this task becomes difficult and expensive. To aid the process, blocking efficiently generates candidate matches, which can then be examined in detail to determine whether or not they are true matches. Blocking is thus a preprocessing step that makes record linkage more scalable.
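The effect of a blocking scheme can be sketched in a few lines: index one source by a blocking key, then emit candidate pairs only within matching blocks rather than the full cross product. The key function below (first three letters of a name field) is a hypothetical example, not a scheme BSL learned.

```python
from collections import defaultdict

def block(records_a, records_b, key):
    """Yield candidate pairs that share a blocking key, avoiding the
    full |A| x |B| cross product of record comparisons."""
    index = defaultdict(list)
    for r in records_b:
        index[key(r)].append(r)
    for a in records_a:
        for b in index.get(key(a), []):
            yield a, b

# Hypothetical blocking key: first three letters of the name, lowercased.
key = lambda r: r["name"][:3].lower()
a = [{"name": "Smith, John"}, {"name": "Jones, Mary"}]
b = [{"name": "Smith, J."}, {"name": "Smyth, Jon"}, {"name": "Brown, Ann"}]
candidates = list(block(a, b, key))
# 1 candidate pair survives out of the 6 pairs in the cross product.
```

BSL's contribution is learning which combination of such keys keeps the true matches while pruning the most non-matches; a single hand-picked key like this one can miss matches (here, "Smyth, Jon" is pruned).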
EIDOS: Efficiently Inducing Definitions for Online Sources
The Internet is full of information sources providing various types of data, from weather forecasts to travel deals. These sources can be accessed via web forms, Web Services, or RSS feeds. In order to make automated use of these sources, one needs to first model them semantically. Writing semantic descriptions for web sources is both tedious and error-prone.
This anonymized data set consists of the voting records for 3553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the ID of the voter and the timestamp of the vote. In addition, data about the friendship links of voters was collected from Digg.
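A typical first step with such a data set is tallying votes per story. The column layout below is hypothetical (the actual file format is not specified in this description), but the pattern carries over to whatever delimited form the records take:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical layout: one row per vote with a story id, an
# anonymized voter id, and a unix timestamp.
sample = StringIO(
    "story_id,voter_id,ts\n"
    "1,10,1240000000\n"
    "1,11,1240000060\n"
    "2,10,1240000120\n"
)
votes = list(csv.DictReader(sample))
votes_per_story = Counter(v["story_id"] for v in votes)
# votes_per_story counts 2 votes for story 1 and 1 vote for story 2.
```

Joining the voter IDs against the friendship links then lets one study how votes propagate through the social network.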