University of Southern California


Craig A. Knoblock
 
 
  
 Software Downloads  
 
 MapFinder: Harvesting maps on the Web 
  Maps are one of the most valuable documents for gathering geospatial information about a region. We use a Content-Based Image Retrieval (CBIR) technique to build MapFinder, an accurate and scalable system that can discover maps among standalone images as well as images embedded within documents on the Web. The implementation provided here can extract WaterFilling features from images and classify a given image as a map or non-map. We also provide the data we collected for our experiments. A minimal sketch of this kind of classification pipeline appears below the download links.  
   
 Description | Download Code | Download Data (1.5 GB) | MapFinder Project Paper  
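
 The sketch below only illustrates the general CBIR-style pipeline behind such a classifier: compute a feature vector for each image, then label a query image by its nearest labeled neighbors. It is not the MapFinder implementation; the gradient-histogram feature (a simple stand-in for WaterFilling features), the file names, and the function names are all illustrative assumptions.

# Minimal CBIR-style map/non-map classification sketch (not the MapFinder API).
# The placeholder feature below stands in for the WaterFilling features used by
# the real system; file names in the usage example are hypothetical.
import numpy as np
from PIL import Image

def extract_features(path, bins=32):
    """Placeholder feature: histogram of gradient magnitudes (not WaterFilling)."""
    img = np.asarray(Image.open(path).convert("L"), dtype=float)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0, 255), density=True)
    return hist

def classify(query_path, labeled_examples, k=3):
    """Label a query image 'map' or 'non-map' by its k nearest labeled neighbors."""
    q = extract_features(query_path)
    dists = sorted(
        (np.linalg.norm(q - extract_features(p)), label)
        for p, label in labeled_examples
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

if __name__ == "__main__":
    # Hypothetical labeled examples and query image.
    training = [("map1.png", "map"), ("map2.png", "map"), ("photo1.jpg", "non-map")]
    print(classify("unknown.png", training))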
 
 ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on the Web 
  This project provides two implementations for performing information extraction from unstructured, ungrammatical text on the Web, such as classified ads, auction listings, and forum posting titles. The ARX system is an automatic approach that exploits reference sets for this extraction. The Phoebus system takes a machine learning approach that exploits reference sets. A toy illustration of reference-set-based extraction appears below the download links.  
   
 Description | Download | ARX Project Paper | Phoebus Project Paper  
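
 The toy sketch below illustrates the basic idea of exploiting a reference set: tokens from an ungrammatical post are matched against known attribute values to recover structured fields. The reference set, similarity measure, and threshold are illustrative assumptions and do not reflect the actual ARX or Phoebus algorithms.

# Toy reference-set-based extraction: match post tokens against known attribute
# values. The reference records, similarity measure, and threshold are invented
# for illustration; ARX and Phoebus go well beyond this simple matching.
from difflib import SequenceMatcher

REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Toyota", "model": "Corolla"},
    {"make": "Ford", "model": "Mustang"},
]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def extract(post, threshold=0.8):
    """Return the reference-set attribute values that best match tokens in the post."""
    tokens = post.split()
    best_per_attr = {}
    for record in REFERENCE_SET:
        for attr, value in record.items():
            score = max(similarity(value, tok) for tok in tokens)
            if score >= threshold and score > best_per_attr.get(attr, ("", 0))[1]:
                best_per_attr[attr] = (value, score)
    return {attr: val for attr, (val, _) in best_per_attr.items()}

# Example: a misspelled, ungrammatical classified-ad title.
print(extract("02 hnda civic ex low miles clean title"))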
 
 BSL: A system for learning blocking schemes 
  Record linkage is the problem of determining which records in two data sources match. As data sources grow larger, this task becomes difficult and expensive. Blocking helps by efficiently generating candidate matches, which can then be examined in detail to determine whether or not they are true matches. Blocking is thus a preprocessing step that makes record linkage more scalable, and BSL learns such blocking schemes automatically. A minimal illustration of blocking appears below the download links.  
   
 Description | Download | Project paper  
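
 The sketch below illustrates blocking itself: records that share a blocking key become candidate matches, and all other pairs are never compared. BSL's contribution is learning which blocking predicates to combine; the fixed key used here (zip code plus first letter of the last name) is only an illustrative assumption.

# Minimal blocking sketch for record linkage: only record pairs that share a
# blocking key are generated as candidate matches. The blocking key below is a
# hand-picked example; BSL learns which such predicates to use.
from collections import defaultdict

def blocking_key(record):
    return (record["zip"], record["last"][0].lower())

def candidate_pairs(source_a, source_b):
    """Generate candidate matches between two sources using the blocking key."""
    index = defaultdict(list)
    for rec in source_b:
        index[blocking_key(rec)].append(rec)
    for rec in source_a:
        for other in index[blocking_key(rec)]:
            yield rec, other

# Hypothetical example records from two sources.
source_a = [{"last": "Smith", "zip": "90292"}, {"last": "Jones", "zip": "10001"}]
source_b = [{"last": "Smyth", "zip": "90292"}, {"last": "Brown", "zip": "10001"}]
for a, b in candidate_pairs(source_a, source_b):
    print(a["last"], "<->", b["last"])   # only Smith <-> Smyth is compared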
 
 EIDOS: Efficiently Inducing Definitions for Online Sources 
  The Internet is full of information sources providing various types of data, from weather forecasts to travel deals. These sources can be accessed via web forms, Web Services, or RSS feeds. To make automated use of these sources, one first needs to model them semantically. Writing semantic descriptions for web sources by hand is both tedious and error-prone, so EIDOS induces these definitions automatically. A toy example of such a source description appears below the download links.  
   
 Description | Download | Project Paper  
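
 The toy sketch below illustrates what a semantic description of a source can look like: the behavior of a new, unmodeled source is expressed as a composition of already-modeled sources. The two "known" sources and the induced rule are invented for illustration and are not part of the EIDOS release.

# Toy illustration of a semantic source description: a new source is defined in
# terms of known, already-modeled sources. The sources and data below are
# hypothetical stand-ins.
def zipcode_to_coords(zipcode):
    """Known source: geocode(zip) -> (lat, lon)."""
    return {"90292": (33.98, -118.45), "10001": (40.75, -73.99)}[zipcode]

def coords_to_forecast(lat, lon):
    """Known source: weather(lat, lon) -> forecast."""
    return f"Sunny near ({lat}, {lon})"

# Induced definition of the new source, written Datalog-style:
#   newSource(zip, forecast) :- geocode(zip, lat, lon), weather(lat, lon, forecast)
def new_source_definition(zipcode):
    lat, lon = zipcode_to_coords(zipcode)
    return coords_to_forecast(lat, lon)

print(new_source_definition("90292"))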
 
 