Kristina Lerman


Data Sets

Digg 2009
This anonymized data set consists of the voting records for 3553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the ID of the voter and the time stamp of the vote. In addition, data about the friendship links of voters was collected from Digg.
Download Digg 2009 data set

Twitter 2010
This data set contains information about URLs that were tweeted over a three-week period in the fall of 2010. In addition to tweets, we also collected the followee links of tweeting users, allowing us to reconstruct the follower graph of active (tweeting) users.
Download Twitter 2010 data set

Flickr personal taxonomies
This anonymized data set contains personal taxonomies constructed by 7,000+ Flickr users to organize their photos, as well as the tags they associated with the photos. Personal taxonomies are shallow hierarchies (trees) containing collections and their constituent sets (aka photo-albums) and collections.
Download Flickr data set

Wrapper maintenance
Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically reinduce the wrapper. The data sets used for experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over a period of a year.
Data set


Social network analysis methods examine the topology of a network in order to identify its structure, for example, who the important nodes are. Centrality, however, depends on both the network topology (the social links) and the dynamical process (the flow) taking place on the network, which determines how ideas, pathogens, or influence spread along social links. Click the link below to see Matlab code for calculating random walk-based centrality (PageRank) and epidemic diffusion-based centrality (given by Bonacich's Alpha-Centrality).
More: Matlab code to calculate PageRank and Alpha-Centrality
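
For readers without Matlab, the two centrality measures can be sketched in a few lines of Python. This is an illustrative sketch only, not the Matlab code linked above: the function names, the damping value of 0.85, the choice alpha = 0.1, and the three-node example graph are all assumptions made for the example. PageRank is computed by power iteration on a random walk with uniform teleportation, and Alpha-Centrality by iterating x = alpha * A^T x + e with e set to a vector of ones, which converges only when alpha is smaller than the reciprocal of the largest eigenvalue of the adjacency matrix.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Random walk-based centrality. adj[i][j] > 0 means an edge i -> j."""
    n = len(adj)
    out_deg = [sum(row) for row in adj]
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if out_deg[i] == 0)
        new = [(1.0 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            if out_deg[i] == 0:
                continue
            for j in range(n):
                if adj[i][j]:
                    # The walker at i follows each out-link in proportion to its weight.
                    new[j] += damping * rank[i] * adj[i][j] / out_deg[i]
        rank = new
    return rank


def alpha_centrality(adj, alpha=0.1, iters=100):
    """Diffusion-based centrality: fixed point of x = alpha * A^T x + e.

    Assumes e is a vector of ones and that alpha is below
    1 / (largest eigenvalue of adj); otherwise the iteration diverges.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        # Node j receives alpha times the score of every node i that links to it.
        x = [1.0 + alpha * sum(adj[i][j] * x[i] for i in range(n))
             for j in range(n)]
    return x


# Hypothetical example graph with edges 0->1, 0->2, 1->2, 2->0.
example_adj = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]

if __name__ == "__main__":
    print("PageRank:", pagerank(example_adj))
    print("Alpha-Centrality:", alpha_centrality(example_adj))
```

In this toy graph both measures rank node 2 highest, because it is the only node with two incoming links. On larger networks the two rankings can differ, since one models a conserved random walk and the other a non-conserved, epidemic-like spread.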

Content Map Equation: community detection in heterogeneous networks
This code finds communities in networks in which nodes have attributes. The approach, described in this paper, finds the best compression of a random walk on the network while also taking node attributes into account.
Download: ContentMapEquation on Github

LA-CTR: limited attention collaborative topic regression for social recommendation
This is a C implementation of the limited attention collaborative topic regression (LA-CTR) model for recommendations, which is fully described in Kang and Lerman, 2013. The original CTR code has been modified to implement the LA-CTR model. Please cite Kang and Lerman (2013) LA-CTR: A Limited Attention Collaborative Topic Regression for Social Media, in Proc. of AAAI.
Download: LA_CTR