Information-theoretic Ideas in Machine Learning
IJCAI’16 Tutorial

Saturday July 9
(Room: Sutton North)

Greg Ver Steeg and Aram Galstyan
Information Sciences Institute
University of Southern California



The objective of this tutorial is to provide a gentle introduction to basic information-theoretic concepts and to demonstrate how those concepts can be applied in the context of machine learning. Information theory was originally developed to describe engineered communication systems. Applying these ideas in new contexts introduces several challenges. We will discuss some of the main problems and potential solutions: picking the right measures, estimating information quantities from limited data, and interpreting results.

We will consider basic and ubiquitous quantities like mutual information (which is nevertheless fraught with pitfalls in estimation and interpretation). We will also explore the more exciting possibility of using information-theoretic ideas as a principled theoretical foundation for machine learning. In this vein we will consider different ways of decomposing information and notable ideas such as InfoMax, ICA, and the information bottleneck.


The emergence of information theory as a scientific discipline is commonly attributed to a landmark 1948 paper by Claude Shannon, in which he laid down the basic principles of data transmission through a noisy communication channel. In particular, Shannon's theory tells us that the amount of information we can send through a noisy channel is governed by a quantity called "mutual information". The mutual information between two random variables (e.g., transmitted and received messages) measures the average reduction in the uncertainty of one variable when we know the value of the other. This concept is illustrated with the Venn diagram below: the yellow and light blue areas denote the uncertainty in variables X and Y, respectively. Those uncertainties are quantified by the corresponding entropies H(X) and H(Y), and the mutual information corresponds to the area of their intersection. The noisy channel is a powerful framework that has found numerous applications in speech recognition, machine translation, text summarization, and so on.
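The Venn-diagram picture above can be made concrete with a few lines of code. The sketch below (the joint distribution is an arbitrary illustrative choice, not from any dataset) computes the entropies H(X) and H(Y) for two binary variables and recovers the mutual information as the "overlap" I(X;Y) = H(X) + H(Y) - H(X,Y):

```python
import numpy as np

# Illustrative joint distribution p(x, y) for two correlated binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_x = entropy(p_x)            # uncertainty in X (one Venn circle)
H_y = entropy(p_y)            # uncertainty in Y (the other circle)
H_xy = entropy(p_xy.ravel())  # joint uncertainty (the union)

# Mutual information: the area of the intersection of the two circles.
I_xy = H_x + H_y - H_xy
```

For this joint distribution both marginals are uniform (H(X) = H(Y) = 1 bit), and knowing one variable removes roughly a quarter of a bit of uncertainty about the other.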

What does this have to do with influence, human speech, or social media? This abstract framework is remarkably flexible. What if the input is some statement made or tweeted by Alice? Then the “noisy channel” consists of (e.g.) sound waves, the ear drum, and the brain of Bob. Now Bob “outputs” some statement, and we can ask about the information capacity of the link between Alice and Bob.

More generally, in recent years information-theoretic concepts have been used successfully to characterize processes in dynamic social networks and social media. For instance, Ghosh et al. used an information-theoretic approach to classify user activity on Twitter [4]. In particular, they traced the user activity associated with a particular URL and identified two features: time-interval entropy and user entropy. Using just these two features, they were able to categorize content based on the collective user response it generates. Ver Steeg and Galstyan proposed using predictability as a measure of influence between two social media users [1,2]. In particular, they introduced content transfer, an information-theoretic measure with a predictive interpretation that directly quantifies the strength of the effect of one user's generated content on another's in a completely model-free way. Their experiments with Twitter data showed that content transfer captures non-trivial predictive relationships even for pairs of users not linked in the follower or mention graph.
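The idea of measuring influence as predictability can be illustrated with transfer entropy, the measure that content transfer builds on: T(Y→X) = I(X_t ; Y_{t-1} | X_{t-1}), i.e., how much Y's past reduces uncertainty about X beyond X's own past. The sketch below is a minimal plug-in estimator for discrete time series (the function name and the toy data are ours, not from [1,2], and plug-in estimates are biased for short series):

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in estimate of T(Y->X) = I(X_t ; Y_{t-1} | X_{t-1}) in bits,
    for discrete time series. Biased upward for short series."""
    triples = list(zip(x[1:], x[:-1], y[:-1]))  # (x_t, x_{t-1}, y_{t-1})
    n = len(triples)
    c_xyz = Counter(triples)
    c_xz = Counter((a, b) for a, b, _ in triples)  # (x_t, x_{t-1})
    c_z = Counter(b for _, b, _ in triples)        # x_{t-1}
    c_zy = Counter((b, c) for _, b, c in triples)  # (x_{t-1}, y_{t-1})
    te = 0.0
    for (a, b, c), k in c_xyz.items():
        # p(x_t | x_{t-1}, y_{t-1}) vs. p(x_t | x_{t-1})
        te += (k / n) * np.log2(k * c_z[b] / (c_xz[(a, b)] * c_zy[(b, c)]))
    return te

# Toy example: x copies y with a one-step delay, so y's past is
# highly informative about x, while the reverse direction is not.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
x = np.roll(y, 1)  # x_t = y_{t-1}
```

Here transfer_entropy(x, y) comes out near 1 bit (y's past determines x), while transfer_entropy(y, x) stays near zero, capturing the asymmetric, directed nature of the measure.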

Scope of the tutorial

We will begin with a survey of topics such as random variables, entropy, mutual information, and conditional mutual information, focusing on developing a deeper intuition for what these quantities represent. After demonstrating common pitfalls, we will present practical, state-of-the-art methods for estimating entropic measures from limited data samples. We will discuss several well-known approaches to information-theoretic learning, including InfoMax, ICA, sparse coding, the information bottleneck, and CorEx. Finally, we will show how these tools can be fruitfully applied to real-world machine learning problems in complex systems like social media, finance, psychometrics, biology, and more. Possible examples include discovering meaningful relationships from social signals using transfer entropy [1, 2], using entropic measures to classify temporal activity patterns of users in social media [4], characterizing randomness in social interactions on Twitter [5], and information-theoretic methods for community detection in social networks [3].


The following are a few recommended publications.  


  1. Greg Ver Steeg and Aram Galstyan. Information-Theoretic Measures of Influence Based on Content Dynamics. WSDM, 2013.
  2. Greg Ver Steeg and Aram Galstyan. Information Transfer in Social Media. WWW, 2012.
  3. Martin Rosvall and Carl T. Bergstrom. An information-theoretic framework for resolving community structure in complex networks. PNAS, 104(18), 2007.
  4. Rumi Ghosh, Tawan Surachawala, and Kristina Lerman. Entropy-based classification of retweeting activity on Twitter. SNA-KDD, August 2011.
  5. Chunyan Wang and Bernardo A. Huberman. How random are online social interactions? Nature Scientific Reports, 2:633, 2012.
  6. Terry Bossomaier, Lionel Barnett, and Michael Harre. Information and phase transitions in socio-economic systems. CAS Modeling Review, 2013.
  7. Mizuki Oka and Takashi Ikegami. Exploring Default Mode and Information Flow on the Web. PLOS ONE, 2013.
  8. Simon DeDeo, Robert Hawkins, Sara Klingenstein, and Tim Hitchcock. Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems. Entropy, 2013.
  9. A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating Mutual Information. Physical Review E, 2004.
  10. Christopher Quinn, Negar Kiyavash, and Todd P. Coleman. Directed Information Graphs. arXiv:1204.2003.
  11. Y. Liu, T. Bahadori, and H. Li. Sparse-GEV: Sparse Latent Space Model for Multivariate Extreme Value Time Series Modeling. ICML, 2012.
  12. Greg Ver Steeg and Aram Galstyan. Maximally Informative Hierarchical Representations of High-Dimensional Data. AISTATS, 2015.