Greg Ver Steeg

Preliminary CorEx Visualizations

 

Correlation Explanation (CorEx) is an information-theoretic method for discovering a hierarchy of abstract representations for complex data. This representation is optimized to be maximally informative about the data. 

Papers

The first paper describing the method and some applications is here (to appear at NIPS 2014).

In the next paper, we describe how the hierarchy of representations leads to a sequence of successively tighter bounds on the information in the data.

Code

Open-source code implementing CorEx is available on GitHub.

Large-scale visualizations

Big 5 Personality Test (1) (2) Using the results from about 10k people answering a 50-question survey, we try to "reverse engineer" the 5 major personality traits the questions are meant to measure. CorEx automatically determines that there should be 5 groups on the first level, and the groups of questions it discovers correspond perfectly to the big 5 personality traits. We were unable to reproduce this result with other methods. (Though ICA, if given the number of clusters, gets close.)
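One way to quantify how well discovered groups of questions match the intended traits is the Adjusted Rand Index (ARI) described in the technical notes. The following is a minimal sketch in plain Python, not the code used for the visualizations. The function name and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two flat groupings of the same items (1 = identical, ~0 = chance)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # No pairs to compare; treat as trivially identical.

    # Contingency counts: items per (true group, predicted group) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # Degenerate case, e.g. a single group in both labelings.
    return (sum_pairs - expected) / (max_index - expected)


# Toy example: six questions, three traits; group names differ but the grouping is the same.
intended_trait = [0, 0, 1, 1, 2, 2]
discovered_group = [1, 1, 0, 0, 2, 2]
print(adjusted_rand_index(intended_trait, discovered_group))  # 1.0
```

Because ARI only compares which items are grouped together, it does not depend on how the groups are named or ordered.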

Twenty newsgroups We take the top 10000 words as our variables and learn a hierarchical model of abstract features. Currently, text is understood through various feature engineering efforts that attempt to represent aspects of language like sentiment, style, or topics. Intuitively, all these aspects of language manifest through correlations in word usage. We want to explore the extent to which CorEx finds features corresponding to all of these aspects. When abstract features are predictive of the labels of different groups in the dataset, we indicate it with a label. 
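As a rough illustration of how documents could be turned into variables, the sketch below keeps the most frequent words and records whether each one appears in each document. Ranking words by document frequency, using binary presence indicators, and splitting on whitespace are all assumptions made for this example; the text above only says that the top 10000 words are used. The function name is hypothetical.

```python
from collections import Counter


def binary_word_matrix(documents, vocab_size):
    """Return (vocab, rows), where rows[d][j] is 1 if vocab[j] occurs in document d."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: count each word at most once per document.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:vocab_size]]

    rows = []
    for tokens in tokenized:
        present = set(tokens)
        rows.append([1 if word in present else 0 for word in vocab])
    return vocab, rows


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, rows = binary_word_matrix(docs, vocab_size=3)
print(vocab)  # ['the', 'cat', 'sat']
print(rows)   # [[1, 1, 1], [1, 0, 1], [1, 1, 0]]
```

Each column of the resulting matrix is then one variable, and each row one observation.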

Human Genome Diversity Project (force) The force layout uses distances according to the first representation layer. Mouse-over for details about each individual. Colors represent broad geographic regions: Africa, Europe, Asia, Middle East, Oceania, America. Note that the hierarchical representation learns features (in an unsupervised way) that are nearly perfect predictors of ethnic origin for some groups, e.g., Native Americans and Sub-Saharan Africans. The "root" node of the tree represents broad geographic splits between Africa, Eurasia, and the "East" (including China, Japan, Oceania, America).

Cyber-physical systems  Data from the sensors of a cyber-physical system. A group of sensors identified as containing functional relationships (blue) corresponded to various memory measurements. "Source description" (red) was a text field labeling the version of the software. We were able to detect changes in the software from altered correlations among connected variables. Check out the relationships between variables that predicted version changes. The discovered latent factor is indicated by the color of each point. It turns out that "source description" was just a text field that the programmer changed after he finished a series of code edits. We were actually able to detect the version change before he documented it!

Finance  (Hierarchical structure) We considered monthly returns for companies in the S&P 500. We found that the strongest signals corresponded to industries: oil, energy, banking. CorEx can be used to quantify anomalies, and the crash in 2008 shows up as the most unusual event of the decade. One of the reconstructed latent factors almost perfectly predicts changes in the S&P500, despite the fact that it is a function of only 38 company returns!

Neuroscience: We are also working on multi-modal data including MRIs, brain networks, and blood measures for people with neural disorders like Alzheimer's Disease. 

Gene Expression: Experiments underway. 

Phenotypes  Consider "documents" to be phenotypes listed in a database of biomedical studies and "words" to be the variables. We want to find groups of correlated words for describing phenotypes, to make searching for relevant studies easier. Note that some stemming has been done.

Technical notes: Mathematically, the total correlation (or multi-information) among variables (at some layer of the hierarchy) is minimized conditioned on their parents in the next level. Design elements all have meaning. In the "bubble" plot, white circles represent the original variables or features. The size of a white circle represents the amount it "contributes" to an abstract feature at the next layer in the hierarchy. The size of the circle containing a group of white circles represents the total correlation or multi-information contained in that group of variables. When validation labels are available, we check whether the automatically learned features correspond to any known labels about the data. We measure this using the Adjusted Rand Index (ARI), where 1 is a perfect score and 0 corresponds to random assignment, or using precision in the case of twenty newsgroups. The number of levels and clusters can be determined automatically. Because the method prefers to find the strongest correlations, the results are very robust to missing data and noise. Better results from improving the scaling of the information-theoretic optimization are forthcoming.
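To make the objective concrete, the sketch below estimates total correlation, TC(X) = sum_i H(X_i) - H(X_1, ..., X_n), and its conditional version, TC(X | Y) = sum_i H(X_i | Y) - H(X | Y), from discrete samples. This is only an illustration using simple plug-in entropy estimates, which are biased for small samples; it is not the CorEx optimization itself, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in (empirical) Shannon entropy, in bits, of a sequence of hashable outcomes."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def total_correlation(rows):
    """TC(X) = sum_i H(X_i) - H(X_1, ..., X_n), where each row is one observation."""
    columns = list(zip(*rows))
    joint = [tuple(row) for row in rows]
    return sum(entropy(col) for col in columns) - entropy(joint)


def conditional_total_correlation(rows, y):
    """TC(X | Y) = sum_i H(X_i | Y) - H(X | Y), using H(A | Y) = H(A, Y) - H(Y)."""
    h_y = entropy(y)
    columns = list(zip(*rows))
    marginal_terms = sum(entropy(list(zip(col, y))) - h_y for col in columns)
    joint_with_y = [tuple(row) + (label,) for row, label in zip(rows, y)]
    return marginal_terms - (entropy(joint_with_y) - h_y)


# Toy data: three variables that are all copies of one hidden fair bit.
rows = [(0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1)]
hidden_bit = [0, 1, 0, 1]
print(total_correlation(rows))                          # 2.0 bits of shared dependence
print(conditional_total_correlation(rows, hidden_bit))  # 0.0 once the hidden bit is known
```

In the toy example, conditioning on the hidden bit removes all of the dependence among the variables, which is the sense in which a parent in the next level "explains" the correlations among its children.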

Groups: