Information-theoretic Ideas in Machine Learning
Saturday July 9
(Room: Sutton North)
Greg Ver Steeg and Aram Galstyan
Information Sciences Institute
University of Southern California
Slides from the tutorial are available below. Some of the social network slides from a previous tutorial were corrupted during the import; a clean copy is available here.
The objective of this tutorial is to provide a gentle introduction to basic information-theoretic concepts and to demonstrate how those concepts can be applied in the context of machine learning. Information theory was originally developed to describe engineered communication systems. Applying these ideas in new contexts introduces several challenges. We will discuss some of the main problems and potential solutions: picking the right measures, estimating information quantities from limited data, and interpreting results.
We will consider basic and ubiquitous quantities like mutual information (which is nevertheless fraught with pitfalls in estimation and interpretation). We will also explore the more exciting possibility of using information-theoretic ideas as a principled theoretical foundation for machine learning. In this vein we will consider different ways of decomposing information and notable ideas such as InfoMax, ICA, and the information bottleneck.
The emergence of information theory as a scientific discipline is commonly attributed to a landmark 1948 paper by Claude Shannon, in which he laid down the basic principles of data transmission through a noisy communication channel. In particular, Shannon's theory tells us that the amount of information we can send through a noisy channel is governed by a quantity called "mutual information". The mutual information between two random variables (e.g., transmitted and received messages) measures the average reduction in the uncertainty about one variable once we know the value of the other.

This concept is illustrated by the Venn diagram below: the yellow and light blue areas denote the uncertainty in variables X and Y, respectively, quantified by the corresponding entropies H(X) and H(Y). The mutual information then corresponds to the area of their intersection. The noisy channel is a powerful framework that has found numerous applications in speech recognition, machine translation, text summarization, and more.
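To make the Venn-diagram picture concrete, the intersection area is given by the identity I(X;Y) = H(X) + H(Y) - H(X,Y), which can be computed directly from any small joint distribution. The distribution below is an invented example for illustration, not data from the tutorial:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y) for two correlated binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)  # marginal distribution of X
p_y = p_xy.sum(axis=0)  # marginal distribution of Y

# I(X;Y) = H(X) + H(Y) - H(X,Y): the "intersection" in the Venn diagram.
mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
# Here H(X) = H(Y) = 1 bit, and I(X;Y) is about 0.278 bits.
```

If X and Y were independent (p_xy the outer product of its marginals), the same computation would return zero, matching the intuition that knowing one variable then tells us nothing about the other.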
What does this have to do with influence, human speech, or social media? This abstract framework is remarkably flexible. What if the input is some statement made or tweeted by Alice? Then the "noisy channel" consists of (e.g.) sound waves, the ear drum, and the brain of Bob. Now Bob "outputs" some statement, and we can ask about the information capacity of the link between Alice and Bob.
More generally, in recent years information-theoretic concepts have been used successfully to characterize processes in dynamic social networks and social media. For instance, Ghosh et al. used an information-theoretic approach to classify user activity on Twitter. In particular, they traced the user activity associated with a particular URL and identified two features: time-interval entropy and user entropy. Using just these two features, they were able to categorize content based on the collective user response it generates. Ver Steeg and Galstyan proposed to use predictability as a measure of influence between two social media users [1,2]. In particular, they introduced content transfer, an information-theoretic measure with a predictive interpretation that directly quantifies the strength of the effect of one user's generated content on another's in a completely model-free way. Their experiments with Twitter data showed that content transfer is able to capture non-trivial, predictive relationships even for pairs of users not linked in the follower or mention graph.
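A time-interval entropy feature of the kind described above can be sketched in a few lines. The function name, the equal-width binning scheme, and the bin count here are illustrative assumptions, not the exact scheme used by Ghosh et al.:

```python
import numpy as np

def interval_entropy(timestamps, n_bins=10):
    """Entropy (in bits) of the distribution of inter-event times.

    Inter-event intervals are discretized into equal-width bins; this
    binning is an illustrative choice, not the published method.
    """
    intervals = np.diff(np.sort(timestamps))
    counts = np.histogram(intervals, bins=n_bins)[0]
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins before taking the log
    return -np.sum(p * np.log2(p))

# Perfectly regular activity: every interval identical, entropy is 0 bits.
regular = interval_entropy([0, 1, 2, 3, 4])

# Irregular activity: four distinct intervals, entropy is 2 bits.
bursty = interval_entropy([0, 1, 3, 6, 10])
```

Low interval entropy flags mechanical, bot-like posting rhythms, while high entropy suggests the irregular timing typical of organic human response, which is what makes the feature useful for categorizing content by the collective reaction it provokes.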
Scope of the tutorial
We will begin with a survey of topics such as random variables, entropy, mutual information, and conditional mutual information, focusing on developing a deeper intuition for what these quantities represent. After demonstrating common pitfalls, we will present practical, state-of-the-art methods for estimating entropic measures from limited data samples. We will discuss various well-known approaches to information-theoretic learning, including InfoMax, ICA, sparse coding, the information bottleneck, and CorEx. Finally, we will show how these tools can be fruitfully applied to real-world machine learning problems in complex systems like social media, finance, psychometrics, biology, and more. Possible examples include discovering meaningful relationships from social signals using transfer entropy [1, 2], using entropic measures to classify temporal activity patterns of users in social media, characterizing randomness in social interactions on Twitter, and information-theoretic methods for community detection in social networks.
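One of the pitfalls alluded to above is that the naive "plug-in" entropy estimate, computed from empirical frequencies, is systematically biased downward for small samples. A minimal sketch of the plug-in estimator alongside the classical Miller-Madow bias correction follows; this is one standard correction among many, and not necessarily the specific estimator covered in the tutorial:

```python
import numpy as np
from collections import Counter

def plugin_entropy(samples):
    """Naive plug-in (maximum-likelihood) entropy estimate, in bits."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow correction (K-1)/(2N) nats,
    converted to bits, where K is the number of observed symbols and N
    the sample size."""
    k = len(set(samples))
    n = len(samples)
    return plugin_entropy(samples) + (k - 1) / (2 * n * np.log(2))

# For a fair coin observed four times, the plug-in estimate happens to be
# exact (1 bit); the correction adds a small positive offset.
h_plugin = plugin_entropy([0, 0, 1, 1])
h_mm = miller_madow_entropy([0, 0, 1, 1])
```

Because the correction only counts observed symbols, it still underestimates entropy when many symbols have not yet been seen, which is why more sophisticated estimators are needed for the severely undersampled regimes common in real data.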
The following are a few recommended publications.