Greg Ver Steeg

Correlation Explanation

Correlation Explanation (CorEx) is an information-theoretic principle for learning abstract representations that are maximally informative about the data. This approach is based on a series of results demonstrating how the information in complex (high-dimensional) systems can be modularly and hierarchically decomposed. 


The first paper describes the principle of Correlation Explanation and some applications (NIPS 2014).

The next paper describes how CorEx leads to a hierarchy of representations that are maximally informative about the data (AISTATS 2015).

The "information sieve" is an alternate approach in which structure is learned incrementally (ICML 2016).

The information sieve for continuous variables is more practical and is introduced here: Sifting Common Information from Many Variables (IJCAI-17). 

The linear version of CorEx exhibits a unique "blessing of dimensionality" for recovering latent factor structure and excellent performance for estimating covariance matrices with high-dimensional, under-sampled data (code).

Some preliminary applications besides those in the papers above include topic modeling, analyzing Alzheimer's disease, gene expression, and finance.


Open-source code implementing CorEx is available on GitHub. There is a special linear version and a binary topic model version for sparse data. Information sieve code is available there for both discrete and continuous variables. A summary and comparison of the different approaches is in this paper.

Please contact me if you are interested in trying development versions that are more flexible (

Large-scale visualizations (last updated 2015)

Big 5 Personality Test (1) (2) Using the results from about 10k people answering a 50-question survey, we try to "reverse engineer" the 5 major personality traits the questions are meant to measure. CorEx automatically determines that there should be 5 groups on the first level, and the groups of questions discovered correspond perfectly to the Big 5 personality traits. We were unable to reproduce this result with other methods. (Though ICA, if given the number of clusters, gets close.)

Twenty newsgroups We take the top 10000 words as our variables and learn a hierarchical model of abstract features. Currently, text is understood through various feature engineering efforts that attempt to represent aspects of language like sentiment, style, or topics. Intuitively, all these aspects of language manifest through correlations in word usage. We want to explore the extent to which CorEx finds features corresponding to all of these aspects. When abstract features are predictive of the labels of different groups in the dataset, we indicate it with a label. 

Human Genome Diversity Project (force) The force layout uses distances according to the first representation layer. Mouse-over for details about each individual. Colors represent broad geographic regions: Africa, Europe, Asia, Middle East, Oceania, America. Note that the hierarchical representation learns features (in an unsupervised way) that are nearly perfect predictors of ethnic origin for some groups, e.g., Native Americans and Sub-Saharan Africans. The "root" node of the tree represents broad geographic splits between Africa, Eurasia, and the "East" (including China, Japan, Oceania, America).

Cyber-physical systems  Sensors from a cyber-physical system. A group of sensors identified as containing functional relationships (blue) corresponded to various memory measurements. "Source description" (red) was a text field labeling the version of the software. We were able to detect changes in the software from altered correlations among connected variables. Check out the relationships between variables that predicted version changes. The discovered latent factor is indicated with the color of the point. It turns out that "source description" was just a text field that the programmer changed after he finished a series of code edits. We were actually able to detect the version change before he documented it!

Finance (Hierarchical structure) We considered monthly returns for companies in the S&P 500. We found that the strongest signals corresponded to industries: oil, energy, banking. CorEx can be used to quantify anomalies, and the crash in 2008 shows up as the most unusual event of the decade. One of the reconstructed latent factors almost perfectly predicts changes in the S&P 500, despite the fact that it is a function of only 38 company returns!

Neuroscience: We are also working on multi-modal data including MRIs, brain networks, and blood measures for people with neural disorders like Alzheimer's Disease. 

Gene Expression: Experiments underway (some very exciting preliminary results here!). 

Phenotypes Consider "documents" to be phenotypes listed in a database of biomedical studies and "words" to be the variables. We want to find groups of correlated words for describing phenotypes, to make searching for relevant studies easier. Note that some stemming has been done.

Technical notes: Mathematically, the total correlation (or multi-information) among variables (at some layer of the hierarchy) is minimized conditioned on their parents in the next level. Design elements all have meaning. In the "bubble" plot, white circles represent the original variables or features. The size of a white circle represents the amount that variable "contributes" to an abstract feature at the next layer in the hierarchy. The size of the circle containing a group of white circles represents the total correlation or multi-information contained in that group of variables. When validation labels are available, we check whether the automatically learned features correspond with any known labels about the data. We measure this using the Adjusted Rand Index (ARI), where 1 is a perfect score and 0 corresponds to random assignment, or using precision in the case of twenty newsgroups. The number of levels and clusters can be determined automatically. Because the method prefers to find the strongest correlations, the results are very robust to missing data and noise. Better results from improving the scaling of the information-theoretic optimization are forthcoming.
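To make the two quantities in the technical notes concrete, here is a minimal sketch in plain Python. It is not the CorEx code from the repository: the function names are hypothetical, the estimators are simple plug-in (counting) estimates for discrete data, and the toy data is made up for illustration. It computes the total correlation TC(X) = sum_i H(X_i) - H(X), the same quantity conditioned on a parent variable Y, and the Adjusted Rand Index between two labelings.

```python
from collections import Counter
from math import comb, log2


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def total_correlation(rows):
    """TC(X) = sum_i H(X_i) - H(X_1, ..., X_n); rows is a list of equal-length tuples."""
    columns = list(zip(*rows))
    return sum(entropy(col) for col in columns) - entropy(rows)


def conditional_total_correlation(rows, parent):
    """TC(X | Y): total correlation within each parent state, weighted by p(y)."""
    groups = {}
    for row, y in zip(rows, parent):
        groups.setdefault(y, []).append(row)
    n = len(rows)
    return sum(len(g) / n * total_correlation(g) for g in groups.values())


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index: 1 for identical groupings, about 0 for random ones."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the groupings trivially agree
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / total_pairs
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (pairs_joint - expected) / (maximum - expected)


# Toy data (assumption, for illustration only): a binary parent Y and two
# children that each copy Y with probability 3/4, independently given Y.
# Counts are proportional to p(y) * p(x1 | y) * p(x2 | y).
counts = {
    (0, (0, 0)): 9, (0, (0, 1)): 3, (0, (1, 0)): 3, (0, (1, 1)): 1,
    (1, (1, 1)): 9, (1, (1, 0)): 3, (1, (0, 1)): 3, (1, (0, 0)): 1,
}
parent, rows = [], []
for (y, x), c in counts.items():
    parent.extend([y] * c)
    rows.extend([x] * c)

tc = total_correlation(rows)
tc_given_parent = conditional_total_correlation(rows, parent)
print(f"TC(X) = {tc:.4f} bits, TC(X|Y) = {tc_given_parent:.4f} bits")
print("ARI for a relabeled but identical grouping:",
      adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

In this toy example the children are correlated on their own, so TC(X) is positive, while conditioning on the parent removes the dependence and TC(X|Y) is zero up to floating-point error. That is the sense in which a good parent "explains" the correlation among its children. The ARI call shows that the score does not depend on how cluster labels are named.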