University of Southern California
Information Integration Research Group

Downloads
 MapFinder: Harvesting maps on the Web 
 Maps are among the most valuable documents for gathering geospatial information about a region. We use a Content-Based Image Retrieval (CBIR) technique to build MapFinder, an accurate and scalable system that can discover standalone images, as well as images embedded within documents on the Web, that are maps. The implementation provided here can extract WaterFilling features from images and classify a given image as a map or a non-map. We also provide the data we collected for our experiments; a brief sketch of the classification step follows the links below.
   
 Description | Download Code | Download Data (1.5 GB) | MapFinder Project Paper  
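A minimal sketch of the classification step, under stated assumptions: a simple edge-intensity histogram stands in for the WaterFilling features (the real extractor ships with the download above), and the classifier choice, file paths, and training labels are all placeholders.

    # Hypothetical map/non-map classification pipeline. An edge-intensity
    # histogram stands in for the WaterFilling features; train_paths,
    # train_labels, and "candidate.png" are placeholders.
    import numpy as np
    from PIL import Image, ImageFilter
    from sklearn.neighbors import KNeighborsClassifier

    def features(path, bins=16):
        # Gray-scale the image, find edges, and histogram the edge intensities.
        img = Image.open(path).convert("L").resize((256, 256))
        edges = np.asarray(img.filter(ImageFilter.FIND_EDGES), dtype=float)
        hist, _ = np.histogram(edges, bins=bins, range=(0, 255), density=True)
        return hist

    X = np.array([features(p) for p in train_paths])   # labels: 1 = map, 0 = non-map
    clf = KNeighborsClassifier(n_neighbors=5).fit(X, train_labels)
    print(clf.predict([features("candidate.png")]))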
 
 ARX and Phoebus: Information Extraction from Unstructured and Ungrammatical Text on the Web 
  The project provides two implementations for performing information extraction from unstructured, ungrammatical text on the Web, such as classified ads, auction listings, and forum posting titles. The ARX system is an automatic approach that exploits reference sets for this extraction; the Phoebus system takes a machine-learning approach to exploiting reference sets. A brief sketch of the reference-set idea follows the links below.
   
 Description | Download | ARX Project Paper | Phoebus Project Paper  
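As a rough illustration of the reference-set idea (not the actual ARX or Phoebus algorithms), the sketch below aligns an ungrammatical post title to the closest record in a tiny, made-up reference set using token-level string similarity.

    # Match a noisy post title against a reference set of (make, model)
    # records. The reference set and the similarity measure are placeholders;
    # ARX and Phoebus use far more sophisticated matching.
    from difflib import SequenceMatcher

    reference_set = [("Honda", "Civic"), ("Honda", "Accord"), ("Ford", "Focus")]

    def extract(title):
        tokens = title.lower().split()
        best, best_score = None, 0.0
        for make, model in reference_set:
            # Score a record by its best token-level alignment to the title.
            score = sum(
                max(SequenceMatcher(None, t, field.lower()).ratio() for t in tokens)
                for field in (make, model)
            )
            if score > best_score:
                best, best_score = (make, model), score
        return best

    print(extract("02 hnda civic low miles, clean title"))  # ('Honda', 'Civic')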
 
 BSL: A system for learning blocking schemes 
 Record linkage is the problem of determining the matches between two data sources. However, as data sources grow larger, this task becomes difficult and expensive. To aid in this process, blocking efficiently generates candidate matches that can later be examined in detail to determine whether or not they are true matches. Blocking is thus a preprocessing step that makes record linkage more scalable; a sketch of the blocking step follows the links below.
   
 Description | Download | Project paper  
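The sketch below shows the blocking step itself with an assumed, hand-picked blocking key; BSL's contribution is learning such schemes (conjunctions of blocking predicates) from data rather than relying on a hand-picked key.

    # Index records by a cheap blocking key so that only records sharing a
    # key become candidate matches. The key used here (first letter of the
    # last name plus zip code) is an illustrative placeholder.
    from collections import defaultdict
    from itertools import combinations

    def block_key(rec):
        return (rec["last"][0].lower(), rec["zip"])

    def candidate_pairs(records):
        blocks = defaultdict(list)
        for rec in records:
            blocks[block_key(rec)].append(rec)
        for group in blocks.values():          # only within-block pairs go on
            yield from combinations(group, 2)  # to detailed matching

    records = [
        {"last": "Smith", "zip": "90210"},
        {"last": "Smyth", "zip": "90210"},
        {"last": "Jones", "zip": "10001"},
    ]
    print(list(candidate_pairs(records)))  # only the Smith/Smyth pair survives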
 
 EIDOS: Efficiently Inducing Definitions for Online Sources 
 The Internet is full of information sources providing various types of data, from weather forecasts to travel deals. These sources can be accessed via web forms, Web Services, or RSS feeds. In order to make automated use of these sources, one must first model them semantically, but writing semantic descriptions for web sources by hand is both tedious and error-prone.
   
 Description | Download | Project Paper  
 
 
 Digg 2009 
 This anonymized data set consists of the voting records for 3553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the id of the voter and the timestamp of the vote. In addition, data about the friendship links of voters was collected from Digg.
 Download Digg 2009 data set  
 
 Twitter 2010 
 This data set contains information about URLs that were tweeted over a 3-week period in the Fall of 2010. In addition to the tweets, we also collected the followee links of tweeting users, allowing us to reconstruct the follower graph of active (tweeting) users; a sketch of the reconstruction follows the link below.
 Download Twitter 2010 data set  
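Reconstructing the follower graph amounts to reversing each followee edge; the edge-list layout below is an assumption made for illustration, not the data set's actual schema.

    # If user u lists v as a followee (u follows v), then u is a follower
    # of v, so the follower graph is the followee graph with edges reversed.
    from collections import defaultdict

    followee_edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]

    followers = defaultdict(set)
    for user, followee in followee_edges:
        followers[followee].add(user)  # reverse each edge

    print(sorted(followers["carol"]))  # ['alice', 'bob']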
 
 Flickr personal taxonomies 
 This anonymized data set contains the personal taxonomies constructed by 7,000+ Flickr users to organize their photos, as well as the tags they associated with the photos. Personal taxonomies are shallow hierarchies (trees) of collections and their constituent sets (also known as photo albums); a sketch of this structure follows the link below.
 Download Flickr data set  
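A minimal sketch of that structure, with field names that are assumptions rather than the data set's actual schema: a shallow tree in which a collection contains photo sets and each set carries its tags.

    # Hypothetical in-memory representation of a personal taxonomy.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class PhotoSet:
        title: str
        tags: List[str] = field(default_factory=list)

    @dataclass
    class Collection:
        title: str
        sets: List[PhotoSet] = field(default_factory=list)

    travel = Collection("Travel", sets=[
        PhotoSet("Italy 2008", tags=["rome", "architecture"]),
        PhotoSet("Japan 2009", tags=["tokyo", "food"]),
    ])
    print([s.title for s in travel.sets])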
 
 Wrapper maintenance 
 Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically re-induce the wrapper. The data sets used for the experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over the period of a year.
 Data set  
 
 
 Centrality 
 Social network analysis methods examine the topology of a network in order to identify its structure, for example, which nodes are important. Centrality, however, depends on both the network topology (social links) and the dynamical processes (flows) taking place on the network, which determine how ideas, pathogens, or influence spread along social links. Click the link below for Matlab code that calculates random walk-based centrality (PageRank) and epidemic diffusion-based centrality (given by Bonacich's Alpha-Centrality); a numpy sketch of both follows.
 Matlab code to calculate PageRank and Alpha-Centrality.  
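The download itself is Matlab; as a language-neutral illustration, the numpy sketch below computes the same two quantities: PageRank by power iteration, and Bonacich's Alpha-Centrality in its closed form c = (I - alpha * A^T)^(-1) s, with s a vector of ones and alpha below the reciprocal of the largest eigenvalue of A. The adjacency matrix and parameter values are made up, and edge-direction conventions vary across formulations.

    # PageRank and Alpha-Centrality on a toy directed graph.
    import numpy as np

    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)  # A[i, j] = 1 if i links to j

    def pagerank(A, d=0.85, iters=100):
        n = A.shape[0]
        out = A.sum(axis=1, keepdims=True)
        P = A / np.where(out == 0, 1, out)   # row-stochastic transition matrix
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - d) / n + d * P.T @ r    # random walk with teleportation
        return r

    def alpha_centrality(A, alpha=0.1, s=None):
        # Solves c = alpha * A^T c + s, i.e. c = (I - alpha * A^T)^{-1} s;
        # requires alpha < 1 / lambda_max(A) for the series to converge.
        n = A.shape[0]
        s = np.ones(n) if s is None else s
        return np.linalg.solve(np.eye(n) - alpha * A.T, s)

    print(pagerank(A))
    print(alpha_centrality(A))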
 
 
 Software: PSA 
  PSA: Single-Pass On-Line Learning -- Learning from Unlimited Training Examples  
 PSA is a step-size adjustment method for gradient-based algorithms. We showed that PSA computes a close approximation of the theoretically optimal step size in linear time with respect to the dimension of the parameter space; a generic illustration of step-size adjustment follows the links below.
 Project page | More software and data downloads  
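PSA's actual update rule is specified in the paper; purely to illustrate the general idea of per-parameter step-size adjustment in gradient descent, the sketch below uses a sign-agreement rule (in the spirit of Rprop, not PSA's rule) whose per-update cost is linear in the dimension.

    # Generic adaptive-step-size gradient descent: each parameter keeps its
    # own step size, grown when successive gradients agree in sign and
    # shrunk when they disagree. Not PSA's update; an O(d) illustration only.
    import numpy as np

    def adaptive_sgd(grad, w0, eta0=0.1, up=1.05, down=0.7, steps=200):
        w = np.asarray(w0, dtype=float)
        eta = np.full_like(w, eta0)
        prev_g = np.zeros_like(w)
        for _ in range(steps):
            g = grad(w)
            agree = np.sign(g) == np.sign(prev_g)
            eta *= np.where(agree, up, down)   # O(d) per-parameter adjustment
            w -= eta * g
            prev_g = g
        return w

    # Minimize f(w) = ||w - 3||^2 / 2; converges near [3. 3.].
    print(adaptive_sgd(lambda w: w - 3.0, w0=[0.0, 10.0]))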
 
 Web Service: AIIAGMT 
  AIIAGMT: Gene Mention Tagger for Biological Text Mining.  
  AIIAGMT is one of the world's most accurate gene mention taggers.  
 Web service | Mirror site at USC/ISI | Project Home  
 
 Web Service: FastSNP 
 FastSNP: SNP Prioritization by Functional Analysis -- Assessing Risk of Biomarkers 
  FastSNP is one of the world's most widely used SNP prioritization tools. Biomarkers identified using FastSNP have been translated into clinical use.
 Web service | More Biological Web services  
 
 