
7. Running the KOJAK Group Finder

After KOJAK has been appropriately configured for a dataset, you can run it simply by calling the following script in the KOJAK directory:

 
% ./run-kojak

This will read data from specified data files and/or databases, extend the given seed groups and write results to output files and/or the KOJAK database.

While the number of configuration options, scripts and possibilities might seem daunting, the setup provided with the Example 1 dataset should work more or less out of the box for many common cases. All that needs to be done is to generate link data files that have the appropriate link types for the new domain and to adapt the call to extend-groups in the run script (or the setting of ExpansionLinkTypes) to use these link types. In the next section we provide a detailed run-through for this example.

7.1 Example Run 1  
7.2 Other Example Runs  



7.1 Example Run 1

Below we provide annotated output from running the Group Finder end-to-end on the Example 1 dataset that can be found in the `data/example1/' directory. The default configuration file `configuration.dat' in the KOJAK directory is set up for this example dataset. Therefore, simply running

 
% ./run-kojak

from the command line will run this example. In the description below we repeat the commands from the corresponding run script `scripts/example1-run-kojak.plm' and format the output to make it read better in the manual. Commands were typed interactively at the PowerLoom |= prompt. The sequence of commands, their parameter settings and their generated output should otherwise be identical. The result is written to `example1-report.csv' in the KOJAK directory. This run should take less than a minute on a state-of-the-art PC desktop. Running on previously imported data without calling import-data-into-edb again should only take a couple of seconds.

This dataset was derived from one of IET's Y2.5 evaluation datasets. It is represented in comma-separated value (CSV) format and is very simple and generic. It should be a good starting point for preparing a new dataset and applying the KOJAK Group Finder to it.

All commands used in this script are documented in more detail in Section 9. KOJAK Commands. The script uses a Lisp-ish input syntax which is the standard input syntax used by the underlying PowerLoom system. Knowledge of PowerLoom should not be required, but see http://www.isi.edu/isd/LOOM/PowerLoom/index.html for more information if necessary. Commands take zero or more required arguments and a set of options specified as keyword/value pairs. Default values for unspecified options are often taken from the configuration file.
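The lookup order for option values described above can be pictured with a small Python sketch. It is only an illustration: the `CONFIG` dictionary, the `resolve_option` helper and the built-in default argument are hypothetical names, not part of KOJAK or PowerLoom.

```python
# Hypothetical illustration of the lookup order:
# explicit keyword value > configuration file value > built-in default.
CONFIG = {"ExpansionDepth": 1, "ExpansionLinkTypes": ["phoneCall", "telecon"]}

def resolve_option(explicit, config_key, builtin_default=None):
    """Return the explicitly passed value if given, else the configured one."""
    if explicit is not None:
        return explicit
    return CONFIG.get(config_key, builtin_default)

# An explicit keyword value wins over the configuration file.
link_types = resolve_option(["anyLink"], "ExpansionLinkTypes")
# An unspecified option falls back to the configuration file.
depth = resolve_option(None, "ExpansionDepth")
```

This mirrors why the run script below can omit most options and still behave as configured.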

The first two commands are a PowerLoom preamble and set up package and module information. All run scripts should start with these two commands.

 
|= (IN-PACKAGE "STELLA")

|= (IN-MODULE "KOJAK")

We start by initializing the KOJAK system. This will load ontology and EDB schema information using the scripts specified in the configuration. In this case we avoid loading the EDB schema information until later by setting :load-edb-schema? to FALSE (the reason for this is explained below). We also force clearing of old group hypotheses in the KOJAK DB regardless of what was specified in the configuration:

 
|= (initialize-kojak :load-edb-schema? FALSE :clear-old-hypotheses? TRUE)
[2005-APR-09 16:01:38.000 KOJAK] Loading KOJAK Ontology...
[2005-APR-09 16:01:38.000 KOJAK] Clearing out old hypotheses and configurations...
[2005-APR-09 16:01:38.000 SDBC] Connect via ODBC: DSN=KOJAK;DB=kojak_test;UID=root;

Print all current configuration settings to standard output, which is useful for logging and archiving the settings of a run:

 
|= (print-parameters)
------------------------------------------------------------
# Loaded KOJAK Configuration
ChunkSize = 0.1
ClearOldHypotheses = TRUE
Data = ("KJ:data;example1;ds1-group-seeds-all.csv" "KojakDB")
DatasetName = Example1
DatasetType = IET_2004
ExpansionDepth = 1
ExpansionLinkTypes = ("phoneCall" "telecon")
ExpansionMeasures = ("Sum-Count" "Group-MI" "Connectivity")
ExpansionMethods = ("MI" "Link-Count")
KojakDB = DSN=KOJAK;DB=kojak_test;UID=root;
KojakOutputDirectory = /tmp/
KojakRootDirectory = /home/hans/Projects/powerloom/sources/kojak2/group-finder/
LoadDataScript = KJ:scripts;load-data.plm
LoadEDBSchemaScript = KJ:scripts;load-edb-schema.plm
LoadOntologyScript = KJ:scripts;load-ontology.plm
LoadPrimaryDataScript = KJ:scripts;load-primary-data.plm
LoadPrimaryLDDataScript = KJ:scripts;load-primary-ld-data.plm
LoadPrimaryPLDataScript = KJ:scripts;load-primary-pl-data.plm
LoadSecondaryDataScript = KJ:scripts;load-secondary-data.plm
LogLevel = medium
MIWeakLinksIterations = 0
MIWeakLinksSaveRate = 0.5
ReportDirectory = 
ReportFile = ./example1-report.csv
ReportFormat = CSV-Table
ReportGroupFraction = 1.0
ReportMaxMembers = 100
ReportMemberWeights = TRUE
ReportMinMembers = 0
RunID = Run_2005-04-09 16:01:10.000
RunKojakScript = KJ:scripts;example1-run-kojak.plm
Threshold = 0.1
boostFactor = 2.0
boostMinStep = 4
------------------------------------------------------------
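If such a parameter dump is archived, it can be read back with very little code. The following Python sketch assumes only the `Name = Value` layout visible in the output above, with `#` comment lines and dashed rules; the function name `parse_parameter_dump` is hypothetical and values are kept as uninterpreted strings.

```python
def parse_parameter_dump(text):
    """Parse 'Name = Value' lines into a dict of strings, skipping comments and rules."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, '#' comments and the dashed separator lines.
        if not line or line.startswith("#") or set(line) == {"-"}:
            continue
        # Split on the first '=' only, since values such as DSN strings contain '='.
        name, sep, value = line.partition("=")
        if sep:
            params[name.strip()] = value.strip()
    return params

sample = """\
------------------------------------------------------------
# Loaded KOJAK Configuration
DatasetName = Example1
KojakDB = DSN=KOJAK;DB=kojak_test;UID=root;
ReportDirectory =
------------------------------------------------------------
"""
params = parse_parameter_dump(sample)
```

Splitting on the first `=` matters here because connection strings such as the KojakDB value contain further `=` characters.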

kojak-log can be used to print various logging information to standard output depending on the current logging level. The first argument specifies the minimum log level at which the information should be printed:

 
|= (kojak-log :low "Executing KOJAK over dataset " (get-parameter "DatasetName") "...")
[2005-APR-09 16:01:42.000 KOJAK] Executing KOJAK over dataset Example1...

|= (kojak-log :low "Results will be reported to file " (get-parameter "ReportFile" "report.iet"))
[2005-APR-09 16:01:43.000 KOJAK] Results will be reported to file ./example1-report.csv
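The minimum-level rule can be sketched in Python as follows. The ordering of the levels and the existence of a `high` level are assumptions made for illustration; only `low` and `medium` appear in this example run.

```python
# Assumed ordering of log levels; "high" is a guess, not taken from this manual.
LEVELS = {"low": 1, "medium": 2, "high": 3}

def should_print(message_level, current_level):
    """A message is printed when the current level is at least its minimum level."""
    return LEVELS[current_level] >= LEVELS[message_level]

# With LogLevel = medium, :low and :medium messages are shown.
shown_low = should_print("low", "medium")
shown_medium = should_print("medium", "medium")
shown_high = should_print("high", "medium")
```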

Before we start loading any data, we import link data into the KOJAK EDB. Processing the data directly from the EDB is the most efficient way of dealing with large datasets. After the data has been imported, it can be analyzed multiple times with different parameters, group seeds, etc. without having to re-import it. To use it in this way, comment out the import-data-into-edb command in this script after its first use. Otherwise, data will be cleared and re-imported every time this script is run. The :data-source option specifies the file from which to import the data. If it is not specified, all CSV files specified in the Data parameter will be loaded. We also clear any pre-existing EDB content by setting :clear-EDB? to TRUE. Without that, data would be added incrementally, which makes it possible to import data spread over multiple files.

 
|= (import-data-into-edb :data-source "KJ:data;example1;ds1-links.csv" :clear-EDB? true)
[2005-APR-09 16:01:45.000 KOJAK] Importing data into KOJAK edb...
[2005-APR-09 16:01:45.000 KOJAK] Loading CSV evidence file KJ:data;example1;ds1-links.csv into EDB...
[2005-APR-09 16:01:45.000 KOJAK]   Storing 10002 tuples into EDB...
[2005-APR-09 16:01:46.000 KOJAK]   Storing tuples into EDB done.
[2005-APR-09 16:01:46.000 KOJAK]   Storing 10000 tuples into EDB...
[2005-APR-09 16:01:48.000 KOJAK]   Storing tuples into EDB done.
    ..............
[2005-APR-09 16:02:45.000 KOJAK]   Storing tuples into EDB done.
[2005-APR-09 16:02:45.000 KOJAK]   Storing 10009 tuples into EDB...
[2005-APR-09 16:02:48.000 KOJAK]   Storing tuples into EDB done.
[2005-APR-09 16:02:48.000 KOJAK]   Storing 4307 tuples into EDB...
[2005-APR-09 16:02:49.000 KOJAK]   Storing tuples into EDB done.
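The difference between clearing and incremental import can be illustrated with a self-contained Python sketch. It uses an in-memory SQLite database instead of the ODBC connection used by KOJAK, and the `links` table layout, the `import_links` function and the batch size are all hypothetical.

```python
import sqlite3

def import_links(conn, rows, clear=False, batch_size=10000):
    """Insert (link_type, source, target) rows in batches; optionally clear first."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (link_type TEXT, source TEXT, target TEXT)")
    if clear:
        # Analogous in spirit to :clear-EDB? TRUE: drop previously imported data.
        conn.execute("DELETE FROM links")
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO links VALUES (?, ?, ?)", batch)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]

conn = sqlite3.connect(":memory:")
file_a = [("phoneCall", "p1", "p2"), ("telecon", "p1", "p3")]
file_b = [("phoneCall", "p2", "p4")]
count_first = import_links(conn, file_a, clear=True)
count_incremental = import_links(conn, file_b)          # adds to existing data
count_cleared = import_links(conn, file_b, clear=True)  # replaces existing data
```

Importing several files without clearing accumulates their rows, whereas clearing before each import leaves only the most recent file in the table.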

Now we are ready to load EDB schema information, which loads link types and associated information from the data currently loaded in the EDB. Had we done that before the data was imported, we might have picked up incorrect information left over from previously imported data. Schema information is by default loaded with the script specified in the configuration file.

 
|= (load-edb-schema)
[2005-APR-09 16:02:52.000 KOJAK] Loading EDB schema information from DSN=KOJAK;DB=kojak_test;UID=root;...
[2005-APR-09 16:02:53.000 PL] Redefining the logic object named edb
[2005-APR-09 16:02:53.000 PL] Processing check-types agenda...
[2005-APR-09 16:02:53.000 SDBC] Connect via ODBC: DSN=KOJAK;DB=kojak_test;UID=root;
[2005-APR-09 16:02:53.000 PL] Processing check-types agenda...

Now we load all datasets specified with the Data parameter in the configuration file. In this example, we load group seed information directly into PowerLoom and link data from the KojakDB (which we just filled in the import step above). EDB data is loaded by pointing the PowerLoom edb object to the KOJAK DB and then loading the LoadData script. This load script in fact loads very little data (only seed group information, if any was loaded into the EDB, which is not the case in our example), plus some mapping rules that will allow us to access link data on demand in the group expansion step.

 
|= (load-data)
[2005-APR-09 16:02:56.000 KOJAK] Loading data...
[2005-APR-09 16:02:56.000 KOJAK] Loading CSV evidence file KJ:data;example1;ds1-group-seeds-all.csv into PowerLoom...
[2005-APR-09 16:02:56.000 KOJAK]   Finished loading; generated 69 assertions.

[2005-APR-09 16:02:56.000 PL] Processing check-types agenda...
[2005-APR-09 16:02:56.000 SDBC] Disconnecting ODBC connection: DSN=KOJAK;DB=kojak_test;UID=root;

Now we are ready to execute phase 1 which will generate seed groups. In this example this phase is very simple, since all seed groups are explicitly specified. IET datasets allow some more sophisticated processing and inference of seed groups and additional seed members given other information.

 
|= (kojak-log :low "Running KOJAK phase 1: group seed generation...")
[2005-APR-09 16:02:58.000 KOJAK] Running KOJAK phase 1: group seed generation...

retrieve-groups finds all groups that were asserted or are inferable and generates special group objects for them containing all member information, etc. These group objects can be accessed with commands such as get-all-groups and friends and are required as inputs to various other commands.

 
|= (retrieve-groups)
[2005-APR-09 16:03:00.000 KOJAK] Found 3 known groups...
[2005-APR-09 16:03:00.000 KOJAK]   Looking for group memberships...
[2005-APR-09 16:03:00.000 KOJAK]   Done retrieving groups and members.
(|GROUP|DATASET/UID-Group-6704 |GROUP|DATASET/UID-Group-8866 
 |GROUP|DATASET/UID-Group-34988)

retrieve-events finds all events that were asserted and for which we have some participant information (none available in this example).

 
|= (retrieve-events)
[2005-APR-09 16:03:03.000 KOJAK] Found 0 events...
[2005-APR-09 16:03:03.000 KOJAK]   Looking for event participants...
[2005-APR-09 16:03:03.000 KOJAK]   Done retrieving events and participants.
()

Next we hypothesize unknown groups based on known events where some partial participant information is known. Again, this is not relevant for this example and only exercised in IET datasets so far.

 
|= (hypothesize-unknown-groups :mode :conservative)
[2005-APR-09 16:03:04.000 KOJAK] Hypothesizing unknown group seeds...
[2005-APR-09 16:03:04.000 KOJAK]   Hypothesized 0 unknown group seeds.
()

Print out various information we found for logging purposes:

 
|= (kojak-log-objects :medium "Found " (get-nof-objects :AGENT) " seed members:" 'eol
                   (get-all-objects :agent))
[2005-APR-09 16:03:06.000 KOJAK] Found 63 seed members:

|AGENT|DATASET/UID-Indvdl-26714
|AGENT|DATASET/UID-Indvdl-5458
  .............
|AGENT|DATASET/UID-Indvdl-7146
|AGENT|DATASET/UID-Indvdl-39046

|= (kojak-log-objects :medium "Found the following known threat groups:" 'eol
                   (get-all-known-groups))
[2005-APR-09 16:03:08.000 KOJAK] Found the following known threat groups:

[GROUP DATASET/UID-Group-34988
  known members: (|AGENT|DATASET/UID-Indvdl-27264 |AGENT|DATASET/UID-Indvdl-13433
                  |AGENT|DATASET/UID-Indvdl-15781 |AGENT|DATASET/UID-Indvdl-31658
                  ................
                  |AGENT|DATASET/UID-Indvdl-27990 |AGENT|DATASET/UID-Indvdl-16643
                  |AGENT|DATASET/UID-Indvdl-26958 |AGENT|DATASET/UID-Indvdl-14179)
  inferred members: ()]
[GROUP DATASET/UID-Group-8866
  known members: (|AGENT|DATASET/UID-Indvdl-4816 |AGENT|DATASET/UID-Indvdl-11759
                  |AGENT|DATASET/UID-Indvdl-7961 |AGENT|DATASET/UID-Indvdl-18750
                  ................
                  |AGENT|DATASET/UID-Indvdl-26958 |AGENT|DATASET/UID-Indvdl-26888
                  |AGENT|DATASET/UID-Indvdl-30764 |AGENT|DATASET/UID-Indvdl-1774)
  inferred members: ()]
[GROUP DATASET/UID-Group-6704
  known members: (|AGENT|DATASET/UID-Indvdl-715 |AGENT|DATASET/UID-Indvdl-16831
                  |AGENT|DATASET/UID-Indvdl-11247 |AGENT|DATASET/UID-Indvdl-30806
                  ................
                  |AGENT|DATASET/UID-Indvdl-30510 |AGENT|DATASET/UID-Indvdl-24042
                  |AGENT|DATASET/UID-Indvdl-21092 |AGENT|DATASET/UID-Indvdl-23282)
  inferred members: ()]

|= (kojak-log-objects :medium "Hypothesized the following unknown group seeds:" 'eol
                   (get-all-unknown-groups))
[2005-APR-09 16:03:09.000 KOJAK] Hypothesized the following unknown group seeds:

|= (kojak-log-objects :medium "Found the following events:" 'eol
                   (get-all-objects :event))
[2005-APR-09 16:03:11.000 KOJAK] Found the following events:

The KOJAK database is not only used to store evidence data but also to store configuration information as well as resulting groups or intermediate hypotheses. Storing this information in the database is optional and only required if it is needed by some other component or to preserve it for some later run or analysis. For example, we can save seed groups here and reload them later for some different analysis run. Originally, this mechanism was used to communicate seed information to a mutual information component written in Matlab. However, it turns out that using the database as a kind of blackboard for storing and communicating results is useful in many situations.

 
|= (kojak-log :low "Storing group seeds and configuration in KOJAK DB...")
[2005-APR-09 16:03:14.000 KOJAK] Storing group seeds and configuration in KOJAK DB...

db-save-configuration stores all current parameter settings in the configuration table of the KOJAK DB under the current `RunID'. This configuration could be loaded later with db-load-configuration (NOT YET IMPLEMENTED).

 
|= (db-save-configuration)

Save known group seeds in the hypothesis table with :source and :category meta-information. The values used in these fields can be arbitrary and the ones used here only have historic significance. :source specifies the source module and :category the kind of hypothesis we are dealing with. These can be used later to select only specific hypotheses to load. The hypothesis table schema developed "organically" and has some deficiencies that will be eliminated in future versions. For example, if one wants to store hypotheses from the same dataset but multiple analysis runs, one would have to encode the RunID manually in one of the :dataset, :source or :category fields.

 
|= (db-save-groups (get-all-known-groups)
                :dataset (get-parameter "DatasetName")
                :source "AV"
                :category "P1")

|= (db-save-groups (get-all-unknown-groups)
                :dataset (get-parameter "DatasetName")
                :source "AV"
                :category "T1")

Now we are ready to run phase 2, which forms the core of the KOJAK group detection process. In this step we extend seed groups using all available link data. We first grow a graph from the group seeds to all individuals reachable via the available link data. This graph extension can be done to a certain depth such as 2 or 3. The default depth is 1 since, depending on fan-out, growing to greater depth might bring in very large amounts of data. Then we take the individuals found in the graph extension step and compute the connection strength between them and the seed group using a variety of methods. A core method used is mutual information (MI), but we also have other methods available such as link counting or connectivity to group seeds. We can use multiple methods and measures and combine them using "bagging", which usually yields better performance than using a single method or measure. Given these connection strength measures, we can then compute a ranked list of extended members where those most strongly connected to the seed group are listed at the top.
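The first two steps, graph extension to a fixed depth and a simple link-count strength, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not KOJAK's implementation: links are treated as undirected pairs, link types are ignored, and the function names `expand` and `link_counts` are hypothetical.

```python
from collections import defaultdict

def expand(links, seeds, depth=1):
    """Return all non-seed individuals reachable from the seeds within `depth` hops."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    frontier, reached = set(seeds), set(seeds)
    for _ in range(depth):
        # Move one hop outward, ignoring individuals that were already reached.
        frontier = {n for node in frontier for n in neighbors[node]} - reached
        reached |= frontier
    return reached - set(seeds)

def link_counts(links, seeds, candidates):
    """Count links between each candidate and any seed member."""
    counts = dict.fromkeys(candidates, 0)
    for a, b in links:
        if a in seeds and b in counts:
            counts[b] += 1
        elif b in seeds and a in counts:
            counts[a] += 1
    return counts

links = [("s1", "x"), ("s2", "x"), ("s1", "y"), ("y", "z"), ("s1", "s2")]
seeds = {"s1", "s2"}
candidates = expand(links, seeds, depth=1)
# Rank candidates by descending link count, breaking ties by name.
ranked = sorted(link_counts(links, seeds, candidates).items(),
                key=lambda item: (-item[1], item[0]))
```

In this toy graph, depth 1 reaches `x` and `y`, while depth 2 would also bring in `z`, which illustrates how quickly deeper expansion can enlarge the candidate set.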

 
|= (kojak-log :low "Running KOJAK phase 2: group extension...")
[2005-APR-09 16:03:23.000 KOJAK] Running KOJAK phase 2: group extension...

extend-groups takes a set of seed groups and a large number of optional parameters controlling the group expansion process. Most of these parameters take defaults from the configuration file, but we explicitly specify them here for documentation purposes.

:mode specifies whether groups will be expanded one at a time (:individual) or collectively (:collective). Collective expansion brings in more data, which can improve quality but also increases run time, since much more data might be loaded at once. Boosting is not yet supported in :collective mode.

:link-types specifies the link types that should be considered to expand the graph and compute connection strength. These link types must match the ones in the link data. Using different link types particularly benefits the MI computation, since it can differentiate links of different types. A special anyLink type is also supported for EDB data, which treats all links as the same.

:methods specifies which strength computation methods should be used. Currently supported are link counting, MI and connectivity (which is computed implicitly as part of link counting). See the documentation of the ExpansionMethods parameter for more details.
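For readers unfamiliar with MI, the following Python sketch computes the textbook mutual information between two discrete sequences. It is the generic statistical definition only; how KOJAK derives its MI measures from link data is not described in this section, and the per-period activity indicators in the example are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x,y) / (p(x) * p(y)) written in terms of raw counts.
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi

# Hypothetical activity indicators over four time periods (1 = active).
seed_activity = [1, 1, 0, 0]
candidate_a = [1, 1, 0, 0]   # always active together with the seed
candidate_b = [1, 0, 1, 0]   # active independently of the seed
mi_a = mutual_information(seed_activity, candidate_a)
mi_b = mutual_information(seed_activity, candidate_b)
```

A candidate whose activity closely tracks the seed's receives a high score, while an unrelated candidate scores zero.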

Each strength computation method produces a number of associated measures such as maximum and average values, etc. See the documentation of ExpansionMeasures for a full list. Below we specify which measures we want to use for ranking members. The specified measures need to correspond to the :methods specified above. If more than one measure is specified, they will be averaged ("bagging").
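Averaging several measures can be sketched as follows. The text above only states that measures are averaged; scaling each measure to a maximum of 1 before averaging is an assumption made here so that measures on different scales contribute comparably, and `bag_measures` is a hypothetical name.

```python
def bag_measures(measures):
    """Average several {member: score} tables after scaling each to a maximum of 1."""
    members = set().union(*measures.values())
    combined = dict.fromkeys(members, 0.0)
    for scores in measures.values():
        # Guard against empty or all-zero tables to avoid division by zero.
        top = max(scores.values(), default=0.0) or 1.0
        for member in members:
            combined[member] += scores.get(member, 0.0) / top / len(measures)
    return combined

# Hypothetical scores for two candidates under two measures.
measures = {
    "sum-count": {"x": 4, "y": 2},
    "group-mi": {"x": 0.4, "y": 0.1},
}
bagged = bag_measures(measures)
```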

:top-N keeps only the top N extended members in the resulting expanded groups (which is a simple form of preliminary thresholding). There are also various parameters for boosting control (boosting is disabled in this invocation). See 6.1 Configuration File for more information on boosting and associated parameters.
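Keeping only the strongest N members of a ranked list is straightforward; a minimal Python sketch with hypothetical weights and a hypothetical `top_n` helper looks like this:

```python
import heapq

def top_n(weights, n):
    """Return the n highest-weighted (member, weight) pairs, strongest first."""
    return heapq.nlargest(n, weights.items(), key=lambda item: item[1])

weights = {"x": 0.9, "y": 0.4, "z": 0.7}
kept = top_n(weights, 2)
```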

 
|= (extend-groups (get-all-groups)
                  :mode :individual
                  :link-types '("phoneCall" "telecon")
                  :methods '(:connectivity :link-count :group-mi)
                  :result-measures '(:sum-count :group-mi :connectivity)
                  ;:top-N 500
                  ;:boost-min-seeds 20 ;; no boosting by default
                  :boost-factor 1.75
                  :boost-min-step 3)
[2005-APR-09 16:03:36.000 KOJAK] Extending groups with options: (:MODE :INDIVIDUAL :LINK-TYPES ("phoneCall" "telecon") :METHODS (:CONNECTIVITY :LINK-COUNT :GROUP-MI) :RESULT-MEASURES (:SUM-COUNT :GROUP-MI :CONNECTIVITY) :BOOST-FACTOR 1.75 :BOOST-MIN-STEP 3)
[2005-APR-09 16:03:36.000 KOJAK] Extending group UID-Group-34988 with 22 seeds...
[2005-APR-09 16:03:36.000 SDBC] Connect via ODBC: DSN=KOJAK;DB=kojak_test;UID=root;
[2005-APR-09 16:03:36.000 KOJAK]   Retrieved a total of 317 phoneCall links from 22 seeds to 231 entities...
[2005-APR-09 16:03:38.000 KOJAK]   Retrieved a total of 2660 telecon links from 22 seeds to 2144 entities...
[2005-APR-09 16:03:40.000 KOJAK] Extending group UID-Group-8866 with 24 seeds...
[2005-APR-09 16:03:40.000 KOJAK]   Retrieved a total of 304 phoneCall links from 24 seeds to 226 entities...
[2005-APR-09 16:03:42.000 KOJAK]   Retrieved a total of 2165 telecon links from 24 seeds to 1902 entities...
[2005-APR-09 16:03:43.000 KOJAK] Extending group UID-Group-6704 with 20 seeds...
[2005-APR-09 16:03:44.000 KOJAK]   Retrieved a total of 297 phoneCall links from 20 seeds to 228 entities...
[2005-APR-09 16:03:45.000 KOJAK]   Retrieved a total of 2059 telecon links from 20 seeds to 1709 entities...

Next we merge expanded unknown group hypotheses with each other or with known groups based on similarity. If we extend seeds for unknown groups, we might wind up with a group that is already known, or multiple unknown groups might really be the same group. Merging takes care of this duplicate elimination. In this example we do not have any unknown groups, which makes this a no-op.

 
|= (kojak-log :low "Merging unknown groups...")
[2005-APR-09 16:03:50.000 KOJAK] Merging unknown groups...

If we merge groups, thresholding them first is useful to get more accurate similarity measures. Thresholding is still a somewhat ad-hoc process, since there are no general indicators of where best to cut the list. Specialized thresholding functions can be written that support thresholding for a particular domain (done for the IET datasets). There are also a few configuration parameters that allow some basic thresholding control (see the manual).

 
|= (threshold-groups (get-all-groups))

|= (merge-groups :similarity-epsilon 0.03
              :merge-by :weight
              :remove-merge-origins? TRUE)
[2005-APR-09 16:03:57.000 KOJAK]   0 unknown groups were merged with other groups.
()
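The idea of similarity-based merging can be sketched in Python as follows. This is an illustration only: it uses Jaccard similarity over member sets and a hypothetical `min_similarity` cutoff, neither of which is claimed to correspond to KOJAK's similarity measure or to the :similarity-epsilon and :merge-by options.

```python
def jaccard(a, b):
    """Jaccard similarity of two member sets (1.0 means identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def merge_similar(known, unknown, min_similarity):
    """Fold each unknown group into the most similar known group if similar enough."""
    merged, remaining = {k: set(v) for k, v in known.items()}, {}
    for name, members in unknown.items():
        best = max(merged, key=lambda k: jaccard(merged[k], members), default=None)
        if best is not None and jaccard(merged[best], members) >= min_similarity:
            merged[best] |= set(members)
        else:
            remaining[name] = set(members)
    return merged, remaining

# Hypothetical groups: U1 is nearly identical to G1, U2 is unrelated.
known = {"G1": {"a", "b", "c"}}
unknown = {"U1": {"a", "b", "c", "d"}, "U2": {"x", "y"}}
merged, remaining = merge_similar(known, unknown, min_similarity=0.7)
```

With no unknown groups, as in this example run, such a procedure leaves the known groups unchanged.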

Finally, we can output the resulting groups to the database and/or a file in multiple formats.

 
|= (kojak-log :low "Running KOJAK phase 3: storing and reporting results...")
[2005-APR-09 16:03:59.000 KOJAK] Running KOJAK phase 3: storing and reporting results...

As before, saving to the KOJAK DB is optional and could be commented out:

 
|= (db-save-groups (get-all-groups)
                :dataset (get-parameter "DatasetName")
                :source "AV"
                ;; for final:
                :category "F")
[2005-APR-09 16:04:02.000 SDBC] Renewing ODBC connection: DSN=KOJAK;DB=kojak_test;UID=root;
[2005-APR-09 16:04:02.000 SDBC] Connect via ODBC: DSN=KOJAK;DB=kojak_test;UID=root;
[2005-APR-09 16:04:02.000 SDBC] Disconnecting ODBC connection: DSN=KOJAK;DB=kojak_test;UID=root;

report-groups will do thresholding again. However, since we already did that above and no groups were merged, thresholding will be a no-op here. The output format can be specified via ReportFormat in the configuration; otherwise it will be inferred from the file extension. If ReportMemberWeights in the configuration is set to true (or specified here via the :report-member-weights? option), members will be output with member weight annotations. For this example, group reports will be written to `example1-report.csv' in the KOJAK directory.

 
|= (report-groups (get-all-groups))
[2005-APR-09 16:04:07.000 KOJAK] Group UID-Group-34988 has 100 potential members; reporting the top 100
[2005-APR-09 16:04:07.000 KOJAK] Group UID-Group-8866 has 100 potential members; reporting the top 100
[2005-APR-09 16:04:07.000 KOJAK] Group UID-Group-6704 has 100 potential members; reporting the top 100
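The effect of a member cap and optional weight annotations on a report can be sketched in Python. The row layout below (group, member, weight) is an assumption made for illustration and is not the actual CSV-Table format written by report-groups; the group and member names are placeholders.

```python
import csv
import io

def write_report(groups, max_members=100, member_weights=True):
    """Write one row per reported member and return the CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for group, weights in groups.items():
        # Strongest members first; ties are broken by member name.
        ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
        for member, weight in ranked[:max_members]:
            row = [group, member] + ([f"{weight:.3f}"] if member_weights else [])
            writer.writerow(row)
    return out.getvalue()

groups = {"Group-A": {"member-1": 0.9, "member-2": 0.5, "member-3": 0.7}}
report = write_report(groups, max_members=2)
```

In the run above, each group has 100 potential members and the cap set by ReportMaxMembers is also 100, so nothing is cut off at this stage.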



7.2 Other Example Runs

There are two more preconfigured examples that are variations of Example 1. The only difference between Example 1 and Example 2 is that Example 2 uses a slightly different ontology for groups and members (defined in `kbs/example2-ontology.plm'), which is mapped onto the Group Finder's internal ontology via the synonym specifications provided in `kbs/example2-seed-constraints.plm'. This provides a simple example of how a different data ontology can be mapped so that it can be properly understood by the KOJAK Group Finder. Example 2 can be run like this:

 
% ./run-kojak -c config/example2.dat

Results will be written to `example2-report.csv' in the KOJAK directory.

The difference between Example 1 and Example 3 is that here we load all link data into PowerLoom instead of querying it from the database. This is only recommended for small datasets. Example 3 can be run like this:

 
% ./run-kojak -c config/example3.dat

Results will be written to `example3-report.csv' in the KOJAK directory.



This document was generated by Hans Chalupsky on October, 30 2007 using texi2html