USC Information Sciences Institute � KOJAK Graph Simplification

VAST 2008 Challenge
Mini Challenge 3:  Cell Phone Calls 

Authors and Affiliations:

Hans Chalupsky, Information Sciences Institute, University of Southern California, [email protected]

Student team: NO

Tool(s):

KOJAK is an integrated suite of link discovery tools that support

 

�         group detection (KOJAK Group Finder)

�         anomaly detection (KOJAK UNICORN)

�         pattern matching (KOJAK Pattern Finder)

�         relationship and graph simplification (KOJAK Simplifier)

 

KOJAK generally operates on data represented by labeled graphs where nodes represent typed entities such as persons, organizations, events, etc. and links represent different kinds of relationships between these entities (such graphs are also referred to as semantic graphs or attributed relational graphs). Graphs might be represented explicitly, or implicitly as views over relational data (e.g., stored in an RDBMS). Optionally, graphs can be enhanced with background ontologies and logic-based inferencing supported by the PowerLoom KR&R system.

 

KOJAK is not a visualization tool per se, but primarily a data analysis and knowledge discovery tool.� Despite the different emphasis, its components can support visual analytics tasks as demonstrated by this submission.� In particular, the KOJAK Simplifier graph simplification component supports the reduction and abstraction of large data graphs for purposes of visualization which is what we are primarily exploiting here.� KOJAK has a simple GUI and visualization component which we used during the initial experimentation phase.� Later on, we switched to Pajek to generate more sophisticated visualizations for the graphs generated by KOJAK.

 

KOJAK has been developed over the course of seven years primarily with DoD funding.� Chief architect and developer of the system is Hans Chalupsky. Large portions of code, algorithms and good ideas have been contributed by Jafar Adibi, Shou-de Lin, Eric Melz, Tom Russ and Andre Valente.� More information and papers about KOJAK can be found here.

 

Two Page Summary:   YES

    KOJAK Summary

 

 

ANSWERS:


Phone-1: What is the Catalano/Vidro social network, as reflected in the cell phone call data, at the end of the time period  

   PhoneNodes.txt

   PhoneLinks.txt

 


Phone-2:  Characterize the changes in the Catalano/Vidro social structure over the ten day period.

Detailed Answer:

 

We worked on this challenge over the course of seven days (about 50-60 hours), probably two thirds of which were spent improving and streamlining the tools, implementing some new functionality, figuring out how to use Pajek for visualization, etc.� The other third was spent actually exploring the data and trying to solve the problem.

 

Our general approach was as follows: since our main expertise is not in the area of visualization but in the area of link discovery, we were interested to see whether a link discovery system such as KOJAK could still usefully support a more interactive and exploratory visual analytics task.� This is in contrast to evaluations in the past, where KOJAK has been used primarily fully automatically, batch-processing large amounts of data with minimal user intervention.

 

In our solution process we used KOJAK tools such as its UNICORN anomaly detector and its Simplifier graph simplification component to generate leads and data products that can then be visualized with off-the-shelf network visualization components such as KOJAK's own simple visualizer, and later on the Pajek package which supports more sophisticated features.� Looking at these visualizations we formed hypotheses which we then tried to support or refute using, for example, PowerLoom to query the data in many different ways, or the KOJAK Group Finder to provide an alternative way to form the final social network.

 

We started by translating the data into a couple of different formats supported by KOJAK: one is a CSV format that can easily be loaded into a KOJAK graph object, and the other was a logic format as supported by our PowerLoom KR&R system.� Given the relatively small size of the data, it was not necessary to host the data on a relational database (also supported).

 

Our first attempt was to see whether we could use the association strength computations underlying the KOJAK Group Finder tool (based on a mutual information model - see [1]) to see whether there are any unusually strong connections between any pairs of connected nodes.� The Group Finder needs a set of seed nodes to create a group and was therefore not directly applicable at this stage (it came in useful at the end of the process).� Unfortunately, the strength measures were not very informative by themselves.

 

Our next attempt was to use the KOJAK UNICORN tool [2] to find anomalous nodes in the connection graph defined by the data.� UNICORN operates on semantic graphs and works best if there are different types of relations between nodes represented by different edge labels.� It combines these labels in all possible ways and then computes statistical correlations between a particular sequence of labels and a node.� These correlation values are then combined into feature vectors called "semantic profiles" that characterize the graph neighborhood of a node to a certain depth.� Once we have a set of semantic profiles (one for each node), we can use standard outlier detection to find nodes with abnormal profiles (see [2] for more details).� Given that this data only had "phoneCall" links, we were somewhat skeptical at this point whether UNICORN could be usefully applied here.

 

Here is the result of our first attempt to run UNICORN on the connection graph between callers, ignoring call direction and call frequency.� This also shows how KOJAK is used via a command interface that uses a Lisp-based syntax to perform operations such as loading data and running analysis tools (it is not a Lisp program, however):

 

|= (find-abnormal-nodes :graph (clv graph)

����������������������� :ignore-edge-direction? true

����������������������� :min-path-length 1 :max-path-length 4

����������������������� :top-n 20 :k 1 :strong-outliers? true)

 

((<NODE 0> 0.991667)

�(<NODE 306> 0.975)

�(<NODE 1> 0.9)

�(<NODE 5> 0.891667)

�(<NODE 309> 0.883333)

�(<NODE 23> 0.733333)

�(<NODE 14> 0.708334)

�(<NODE 2> 0.641667)

�(<NODE 3> 0.633333)

�(<NODE 8> 0.616667)

�(<NODE 107> 0.616667)

�(<NODE 19> 0.583334)

�(<NODE 41> 0.583334)

�(<NODE 200> 0.566667)

�(<NODE 13> 0.533333)

�(<NODE 52> 0.458333)

�(<NODE 145> 0.416667)

�(<NODE 360> 0.333333)

�(<NODE 21> 0.325)

�(<NODE 11> 0.3166665))

 

These are the top 20 most abnormal nodes among the 400 nodes in the graph (based on all the data).� Given that node 200 mentioned in the scenario description was part of this list made us hopeful that UNICORN was picking up on something useful in this data.� Actually, if our final analysis is correct, this list already contains most of the major players and their aliases.� This is somewhat surprising, given that this data isn't really that suited for UNICORN, but what it is picking up on here is unusual connectivity patterns (high/low path frequencies) and their combinations which seems to give us some interesting starting points.� To give UNICORN more label combinations to work with, we mapped call frequencies between parties onto symbolic high/low and later high/medium/low ranges which seemed to work well.

 

Our next step was to use the KOJAK Simplifier tool.� The Simplifier first computes a set of global abnormal nodes using UNICORN and then a set of locally abnormal nodes connected to the global seeds.� It then fills in the graph between the global and local nodes given some user-specified parameters.� The result is a simplified graph that contains important nodes with relevant connections between them.� The hope is that such a simplified graph shows important elements and structure of the data.� The following command produced the graph shown in Figure 1 which was one of the first visualizations generated during our investigation:

 

|= (simplify-graph :graph (clv graph)

������������������ :ignore-edge-direction? true

������������������ :min-path-length 1 :max-path-length 4

���������� ��������:top-global-N 10 :top-local-N 5 :k 1

������������������ :max-connect-length 2

������������������ :strong-outliers? true)

 

The parameters control how many global and local nodes are produced, and how edges are filled in in the resulting graph.� Here we add all paths up to "max-connect-length" 2 between global and local abnormal nodes.� It generally requires some experimentation with these parameters to generate a graph with the right amount of information.� Ideally, this would be done interactively with immediate feedback in a GUI, for now, we have to run multiple times and load the graph into a visualization tool to see the result.

 

Figure 1: Initial simplification experiment using KOJAK's native visualization package

 

The layout in Figure 1 was generated with KOJAK's own visualization tool that uses a simple spring-based layout model.� The graph shows some very interesting structure with nodes 200, 0 and 300 in the middle (200 was manually pulled out) and the wheels around 1, 2 and 5.� After some further experimentation with the simplification parameters and a detour into producing visualizations with Pajek instead of KOJAK, we get the network in Figure 2:

 

Figure 2: �Catalano network at Day 10

 

The double wheel patterns shown in Figures 1 and 2 made us suspect that the wheel centers are actually aliases of the same person.� This in turn triggered an investigation into how to ascertain or refute such alias hypotheses.� We added some additional functionality to find entities with likely duplicates based on their connectivity patterns (a dual functionality to the outlier computation), and all the double wheel centers turned out to be candidates. Moreover, we started to suspect that 0, 200 and 300 might all be the same person as well, however, the similarities there were not as striking.

 

Again we were looking for ways to refute such alias hypotheses.� The simplest constraint is that two suspected aliases should not call each other which is satisfied by all the candidates.� A more sophisticated constraint is that they should never be at two different places at the same time (unless phones are shared by different people).� This led us to try to use the cell tower location information and times between calls to see whether there were two calls close in time by, for example, 200 and 300, that were done in two locations too far from each other for the same person.� We represented tower location with 16 approximately 5km zones (in PowerLoom) approximately linearizing the island.� Unfortunately, the constraint wasn't satisfied even for a single phone assuming driving as the fastest way of transportation.� For example, about 10% of the calls of 309 were problematic in this respect (maybe they used a helicopter or plane).� As a side-effect of this, however, when we compared the calls of 1 and 309 using a PowerLoom query, we discovered that they occupied disjoint time ranges which in turn very much supported the initial alias hypotheses.� The same was true for 200 and 300 (with 300 exclusively calling on days 8-10).� The problem was that they didn't have any connectivity overlap.� However, using the previous stronger alias hypotheses and a little bit of alias rule modeling in PowerLoom, we could easily see that they are connected to the same people:

 

|= (retrieve all (connected-to p200 ?x) :sort-by :values)

There are 6 solutions:

� #1: ?X=p1

� #2: ?X=p137

� #3: ?X=p2

� #4: ?X=p3

� #5: ?X=p5

� #6: ?X=p97

|= (retrieve all (connected-to p300 ?x) :sort-by :values)

There are 7 solutions:

� #1: ?X=p126

� #2: ?X=p268

� #3: ?X=p306

� #4: ?X=p309

� #5: ?X=p332

� #6: ?X=p360

� #7: ?X=p397

|= (retrieve all (and (connected-to p200 ?x)

��������������������� (connected-to p300 ?x)))

No solutions.�� ����������;;; that is no overlap here

|= (assert (and (alias p1 p309)

��������������� (alias p2 p397)

��������������� (alias p3 p360)

��������������� (alias p5 p306)))

(|P|(alias p1 p309) |P|(alias p2 p397) |P|(alias p3 p360) |P|(alias p5 p306))

|= (retrieve all (and (connected-to p200 ?x)

��������������������� (connected-to p300 ?x))

���������������� :sort-by :values)

There are 4 solutions:�� ;;; now we have overlap

� #1: ?X=p1

� #2: ?X=p2

� #3: ?X=p3

� #4: ?X=p5

 

That is, after asserting the alias hypotheses, 4 out of 6 friends of 200 overlap with those of 300 and they are all with the suspected core network.� Given that and the temporal disjointness led us to assume that they are the same person.

 

We then time-sliced the network shown in Figure 2 using its nodes (and layout) and filling in more and more links as days progressed.� As expected, there is a big jump between days 7 and 8 as shown in Figures 3 and 4.� All of the suspected aliases (except 360) appear the first time on day 8.� We don't have enough background information to produce a good explanation for this, except that maybe some external event forced the inner circle of the Catalano network to change to different phones, maybe to avoid detection.� Another possible explanation is the complete takeover of the group by different players, but the alias hypothesis seems more plausible to us.� Note that the wheels in Figure 4 are due to the fact that we fill in with paths of length 2 which now connect the newly occurring aliases.

 

Figure 3: �Catalano network time slice for Day 7

 

Figure 4: �Catalano network time slice for Day 8

 

At this point we've made up our mind that 5 (and 306) is Estaban based on the frequency of calls to 200 (300) and 1 (and 309) is David Vidro based on his role of handling much of the communication.� We assume 2 and 3 and their aliases are Juan and Jorge Vidro, but there is no information that would allow us to distinguish between them, so we made a somewhat arbitrary choice.� The final question is the role of 0?� Looking again at connectivity, we see that 0 and 200 and 300 are connected to the full inner circle (1,2,3,5) and no other people are.� We suspect that 0 is another alias for Ferdinando, also merging the two preserves the high-frequency link to Estaban and gives us a better acount for all the main players, but it is only a guess.

 

Finally, assuming we've found the players of the core group, it is time to find the complete extent of the relevant Catalano network.� Now the GroupFinder comes in handy, since now we have a seed group of 4 to start with.� The GroupFinder produces the following likely additional members in decreasing order of strength:

 

145 107 276 31 48 32 34 12 170 179 27 97 137 37 69 8 38

 

There are more candidates, but we threshold the group based on the number of connections to the core group (we require at least two).� Again, without further information this decision is somewhat arbitrary.� The resulting network is shown in Figure 5.

 

Figure 5: The final Catalano/Vidro social network generated by our analysis

 

 

References

 

[1] J. Adibi, H. Chalupsky, E. Melz and A. Valente. The KOJAK Group Finder: Connecting the Dots via Integrated Knowledge-Based and Statistical Reasoning. In Proceedings of the Sixteenth Innovative Applications of Artificial Intelligence Conference (IAAI-04), 2004

[2] S. Lin and H. Chalupsky. Discovering and Explaining Abnormal Nodes in Semantic Graphs. IEEE Transactions on Knowledge and Data Engineering. To appear. [draft pre-print]