8. Advanced Configuration

Data describes or represents some aspects of the world. Each such aspect can be represented or conceptualized in infinitely many ways, and each representation can in turn be physically realized in many ways on a storage medium such as a computer file or a database. This openness presents a major challenge for applying and deploying a generic link discovery system such as the KOJAK Group Finder.

To do its work KOJAK conceptualizes the world as a graph where nodes represent entities such as groups (e.g., human organizations) and their members (e.g., people), and where links represent different kinds of relationships between them, e.g., that an entity is a member of a certain group or that entity A paid money to entity B. When KOJAK is applied to a new type of dataset, the data needs to be appropriately mapped onto this internal conceptualization to generate meaningful results.

KOJAK uses the classes (or concepts) and relations defined in its generic groups ontology (see file `kbs/generic-groups-ontology.plm' and also 10. Group Finder Ontology) as its internal standard vocabulary to represent different kinds of groups, group members, membership relations, events, etc. This ontology also defines a set of abstract interface relations such as, for example, GROUPS/linkCount that allow the mutual information component to access link statistics for different types of links. Another example is the GROUPS/nameString relation which is used by the report generator to substitute name strings for entity IDs if such names are available.

In the simplest KOJAK configuration, where data comes from CSV files, seed groups and members are specified using this standard vocabulary. Links such as phoneCall or telecon use a standard representation format and are defined and mapped automatically by the KOJAK CSV file importer, so that link statistics can be accessed via KOJAK's GROUPS/linkCount interface relation. In this case no special mapping is required, since the mapping is in effect done by the person who maps their data onto KOJAK's CSV input syntax.

There are basically three cases where this simple, automatic mapping is not sufficient:

1. Somebody wants to use a different vocabulary for things like groups and memberships, for example, the term ThreatGroup instead of the canonical Group or KnownGroup. This very simple case is illustrated by the Example 2 configuration and run script (see section 7.2 Other Example Runs).
2. One wants to exploit additional domain knowledge or constraints to logically infer additional group members (besides those explicitly given). For example, one might want to add a rule that says that if two people participated in a certain type of link, say robbedBankTogether, then they must be in the same group. Then, if group membership of one person is known, group membership of the other person can be inferred from it. Rules of this type are used in the mapping for the IET synthetic datasets; see, for example, the file `kbs/iet-y3-seed-constraints.plm' and its rules inferring threat groups and membership by participation in certain types of events.
3. Data comes from some existing relational database (with some arbitrary schema). Such a database could be very large, and it might be cost prohibitive to export and translate all relevant data into KOJAK's CSV file format first. Instead, data should be accessed and aggregated on the database directly and the results mapped onto KOJAK relations to achieve maximal scalability. This also has the advantage of always running against the most current version of the database instead of a potentially outdated translation.

In practice, any combination of the above cases might arise. In the following we describe a set of mechanisms that can be used to configure KOJAK for such situations and to appropriately map data onto the internal representation that KOJAK needs. This will be done more or less in a tutorial style that uses the mapping for the synthetic Ali Baba dataset as an example.

8.1 PowerLoom

8.2 File and Module Structure

8.3 The Ali Baba Configuration

8.1 PowerLoom

The core mechanisms used to define mappings such as the one for the Ali Baba dataset are based on the PowerLoom knowledge representation and reasoning system. KOJAK is built on top of PowerLoom, so all of PowerLoom's functionality is available at any time. We will try to keep the description below self-contained, but a basic knowledge of PowerLoom and its capabilities will greatly facilitate understanding (see http://www.isi.edu/isd/LOOM/PowerLoom/ for more information about PowerLoom). For quick reference, an important subset of PowerLoom commands is documented in 9.1 Important PowerLoom Commands.

PowerLoom is a logic-based knowledge representation and reasoning (KR&R) system that provides a language to define classes (called concepts), relations, instances, logic rules and mappings between PowerLoom relations and external database tables. PowerLoom also has a logical inference engine and query language that allows a user (or KOJAK) to query relationships that are not explicitly represented but logically follow based on some of the known rules.

PowerLoom uses the Knowledge Interchange Format (or KIF) as its basic representation language. KIF is a version of predicate logic that uses a Lisp-based uniform expression syntax. For example, the rule that all men are mortal could be represented in KIF like this:

(forall (?x) (=> (man ?x) (mortal ?x)))

KIF variables are symbols that start with a question mark. Note the Lisp-style prefix syntax where each expression is enclosed in parentheses and starts with the operator followed by a list of arguments. Besides the parentheses, no punctuation is required. For example, compare the following two expressions:

man(?x) => mortal(?x)        ;; infix notation
(=> (man ?x) (mortal ?x))    ;; KIF

Similar to the representation language, all PowerLoom commands need to be written in a Lisp-ish input syntax as well. Commands take zero or more required arguments and sometimes a set of options specified as keyword/value pairs. For example, we could use PowerLoom's assert command to tell it about the rule shown above:

(assert (forall (?x) (=> (man ?x) (mortal ?x))))

If we then tell it that Socrates is a man via

(assert (man socrates))

we can then ask whether Socrates is mortal like this

(ask (mortal socrates))

to which PowerLoom would return TRUE as the result.
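Note that PowerLoom rejects references to undeclared predicates, so in a fresh session man and mortal have to be declared before the rule can be asserted. The following is a minimal end-to-end sketch, assuming a case-insensitive module such as PL-USER. Declaring both predicates as concepts is an assumption made for illustration; the declarations are not part of the example above.

```lisp
;; Declare the two predicates as concepts (assumed declarations).
(defconcept man)
(defconcept mortal)

;; The rule: every man is mortal.
(assert (forall (?x) (=> (man ?x) (mortal ?x))))

;; A fact about one individual.
(assert (man socrates))

;; Query whether mortality follows from the rule and the fact.
(ask (mortal socrates))
```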

Most of KOJAK's configuration files are PowerLoom files which contain a set of PowerLoom definitions and related commands. Such files can be loaded and interpreted at runtime to dynamically change the internal knowledge base and KOJAK's behavior. If KOJAK is run in interactive mode, such commands can also be typed in interactively; more examples of this are shown in some of the configuration sections below.

8.2 File and Module Structure

KOJAK divides its advanced configuration information into a set of PowerLoom knowledge base (KB) files and load scripts. The reasons for this separation are (1) to separate functionally different aspects of the configuration into separate files, and (2) that this information is organized into separate PowerLoom modules and currently each KB file can only be loaded into one single module. Each configuration file (e.g., `config/ali-baba.dat') points at its own set of KB files which are loaded during startup and initialization of KOJAK. When they are loaded the necessary definitions and mappings are established which can be completely different for two different configurations. There are the following types of KB files:

Ontology files define the vocabulary of classes and relations used by the Group Finder. Ontology files are somewhat analogous to a database schema definition. For example, `kbs/generic-groups-ontology.plm' is the central ontology file used to define the groups vocabulary of KOJAK. This file always needs to be loaded for proper operation. The file `kbs/ali-baba-ontology.plm' defines a set of additional classes and relations specific to the Ali Baba dataset. For example, it defines the relation participatedInTerrorism to represent and reason with a particular kind of information available in this dataset.

Seed constraint files define mappings between a dataset ontology and the generic groups ontology as well as rules that allow KOJAK to infer additional threat groups or seed members. For example, `kbs/ali-baba-seed-constraints.plm' defines the mappings between the Ali Baba ontology and the generic groups ontology as well as a rule that infers additional seed members based on joint participation in a terrorism event.

EDB schema files are a special kind of ontology file specifically dealing with the mapping of a relational evidence database (EDB) to the PowerLoom relations used by KOJAK. For example, `kbs/kojak-edb-schema.plm' is the file defining the mapping for KOJAK's own internal evidence database (the one into which data from CSV files gets imported) and `kbs/ali-baba-edb-schema.plm' defines the mapping for the Ali Baba EDB.

Data files define or load actual data instances such as certain groups, members, etc. We also put into these files the actual mapping rules between KOJAK's abstract interface relations such as GROUPS/linkCount and particular EDB relations such as EDB-link_count, since they are in effect data import rules (conceivably, these could also go into the associated EDB schema file). For example, `scripts/load-data.plm' is the generic data file to load data from the internal KOJAK EDB, and `scripts/ali-baba-load-data.plm' is the data file for the Ali Baba EDB.

Load files provide an extra level of indirection, e.g., to load multiple ontology files when the Group Finder ontology is loaded. For example, `scripts/ali-baba-load-ontology.plm' loads three different files. A load file could also contain actual definitions instead of just loading some other file as is the case for `scripts/ali-baba-load-data.plm'. The load file mechanism is not absolutely necessary and might be replaced in the future by simply specifying multiple EDB or ontology files instead of specifying single load scripts as currently done via configuration parameters such as LoadOntologyScript, etc. Load files can use physical pathnames to load other files, but all files shipping with KOJAK use logical pathname syntax to load other files to make them platform and programming language independent. For more information on logical pathname syntax see 6.1 Configuration File.

These various files need to be loaded in the correct order. For example, a relation such as participatedInTerrorism needs to be defined before it can be used in an assertion, rule or query. For this reason KOJAK first loads ontology files via the value of LoadOntologyScript, then EDB schema files via the value of LoadEDBSchemaScript and, finally, data via the value of LoadDataScript.
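As an illustration of this ordering, an ontology load script for a new configuration might look roughly like the sketch below. The file names with the my-edb- prefix are hypothetical. The choice of three files mirrors the statement that the Ali Baba ontology load script loads three files, but which three is an assumption, as is the IN-MODULE line; copy the declaration from the corresponding Ali Baba script rather than relying on this one.

```lisp
;; Hypothetical load script, e.g. scripts/my-edb-load-ontology.plm.
;; Assumed module declaration; mirror the one in the script you copied.
(IN-MODULE "LD-ONTOLOGY")

;; The generic vocabulary comes first, since everything else refers to it.
(LOAD "KJ:kbs;generic-groups-ontology.plm")

;; Dataset-specific classes and relations (hypothetical file).
(LOAD "KJ:kbs;my-edb-ontology.plm")

;; Mappings and rules that use terms from both ontologies (hypothetical file).
(LOAD "KJ:kbs;my-edb-seed-constraints.plm")
```

Each loaded file carries its own IN-MODULE declaration, so the definitions end up in the module that file names, not in the module of the load script.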

8.2.1 Module Structure

As mentioned above, PowerLoom KB files are loaded into modules. Modules are separate name and assertion spaces that can inherit from each other. Each PowerLoom KB file starts with an IN-MODULE declaration which specifies into which module the information should be loaded. The PowerLoom module system is very flexible and powerful but can also be confusing. When generating a configuration for a new dataset it is therefore best to use an existing configuration and simply mirror its IN-MODULE declarations.

Using appropriate module structure allows us to avoid name clashes (e.g., between user terms and the built-in groups ontology) or to separate non-volatile information such as schema definitions from more volatile information such as data assertions. For example, we could rerun certain commands on a different data set by clearing out the data module but leaving all ontology modules intact. The main modules used by KOJAK are described below:

KOJAK: This is a namespace that contains all KOJAK commands such as, for example, extend-groups. Run scripts such as `scripts/ali-baba-run-kojak.plm' use this as their module so commands can be written without the KOJAK/ module prefix. Also, if KOJAK is started up in interactive mode it will initially be in the KOJAK module.

LD-ONTOLOGY: This is the top-level ontology module of KOJAK. All ontology files should be loaded into this module. It inherits the PowerLoom kernel module to make all built-in PowerLoom definitions and commands available as well as the KOJAK module which contains all KOJAK commands.

LD: This is the module used to contain mappings and seed constraints. It inherits LD-ONTOLOGY. Seed constraint files such as `kbs/ali-baba-seed-constraints.plm' should use this module.

GROUPS: This is a name space for the objects defined in KOJAK's generic groups ontology. The objects themselves are defined in LD-ONTOLOGY but their names come from the GROUPS name space to prevent clashes with user-defined concepts and relations. This is the reason why these terms always need to be prefixed as with, for example, GROUPS/linkCount.

EDB: This module is used to define database objects and schema mappings. It inherits LD-ONTOLOGY as well as commands from the PowerLoom RDBMS module. EDB schema files such as `kbs/ali-baba-edb-schema.plm' should use this module.

DATASET: This is the lowest level module intended to host data objects and assertions. It inherits all of the above. Data files such as `scripts/ali-baba-load-data.plm' should go into this module.
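Putting the file types and modules together, the first form in each KB file of a new configuration would look roughly as follows. This is only a sketch: the file names are hypothetical, and the exact way the shipped files spell the module name (for example, with or without a full module path) is an assumption, so mirror the declarations from the Ali Baba files.

```lisp
;; kbs/my-edb-ontology.plm (hypothetical): ontology definitions
(IN-MODULE "LD-ONTOLOGY")

;; kbs/my-edb-seed-constraints.plm (hypothetical): mappings and seed constraints
(IN-MODULE "LD")

;; kbs/my-edb-edb-schema.plm (hypothetical): database objects and schema mappings
(IN-MODULE "EDB")

;; scripts/my-edb-load-data.plm (hypothetical): data objects and assertions
(IN-MODULE "DATASET")
```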

8.3 The Ali Baba Configuration

In the following we describe each and every aspect of the configuration for the synthetic Ali Baba dataset developed by SAIC. The Ali Baba configuration ships with this KOJAK release and consists of the following files:

config/ali-baba.dat                   ...top-level config file
kbs/ali-baba-ontology.plm             ...ontology file
kbs/ali-baba-seed-constraints.plm     ...seed constraint file
kbs/ali-baba-edb-schema.plm           ...EDB schema file
scripts/ali-baba-load-ontology.plm    ...ontology load script
scripts/ali-baba-load-edb-schema.plm  ...EDB schema load script
scripts/ali-baba-load-data.plm        ...data file
scripts/ali-baba-run-kojak.plm        ...run script

To create a new configuration it is best to first create appropriately named copies for each of these eight files (e.g., use a different prefix such as `my-edb-' instead of `ali-baba-') and then modify each file to fit the actual database and dataset that you are trying to map to. By starting with copies of an existing configuration each KB file will already have an appropriate IN-MODULE declaration.

Below we describe each and every file in this configuration in detail. It will be helpful to have printouts or electronic copies of these files handy when working through this section.

8.3.1 `ali-baba.dat'

8.3.2 `ali-baba-ontology.plm'

8.3.3 `ali-baba-seed-constraints.plm'

8.3.4 `ali-baba-edb-schema.plm'

8.3.5 `ali-baba-load-ontology.plm'

8.3.6 `ali-baba-load-edb-schema.plm'

8.3.7 `ali-baba-load-data.plm'

8.3.8 `ali-baba-run-kojak.plm'

8.3.1 `ali-baba.dat'

This is the top-level configuration file to run KOJAK on the Ali Baba dataset. This file is what you will pass to KOJAK via the -c command line argument to configure it for this dataset, for example:

run-kojak -c config/ali-baba.dat

The set of available KOJAK configuration parameters is described in 6. Configuration. The parameters that will almost certainly need to be adapted for your dataset are the following:

Data +=jdbc:mysql://blackcat:3306/ali_baba_v41?user=me
DatasetName =AliBaba
LoadOntologyScript =KJ:scripts;ali-baba-load-ontology.plm
LoadEDBSchemaScript =KJ:scripts;ali-baba-load-edb-schema.plm
LoadDataScript =KJ:scripts;ali-baba-load-data.plm
RunKojakScript =KJ:scripts;ali-baba-run-kojak.plm
ExpansionLinkTypes +=phoneCall
ExpansionLinkTypes +=affiliatedWith
ExpansionLinkTypes +=communication
ReportFile =./ali-baba-report.csv

You might want to add an additional Data entry to load seed information for your dataset, for example, from a CSV file. In the Ali Baba configuration seed information is loaded directly from the database in the `scripts/ali-baba-load-data.plm' script.

The various script variables need to point to the load and run scripts described below. ExpansionLinkTypes need to correspond to the link types defined and mapped in the ontology and EDB schema files. Alternatively, you can list those directly as an argument to expand-groups in the KOJAK run script.

You might also want to adapt various analysis parameters for your dataset, e.g., to control boosting and/or thresholding. Usually that doesn't need to be done right away but will be done over the course of multiple analysis runs.

8.3.2 `ali-baba-ontology.plm'

This file defines the Ali Baba ontology. An ontology defines a set of classes or types (in PowerLoom called "concepts") and a set of functions and relations. Under this view an ontology is roughly analogous to a database schema (however, ontologies usually also define hierarchical relationships, logic rules and constraints that further define or constrain the meaning of the defined terms).

The ontology file should define all terminology needed in addition to what is already available in the generic groups ontology. Additionally, it might define name variants or synonyms to some of the generic terms (which is what is done here).

The first part of the file defines a small person and group hierarchy that mostly mirrors the structure of `kbs/generic-groups-ontology.plm' but uses the terms ThreatGroup and memberAgents, since those are the names used in the Ali Baba EDB. Using these definitions is not strictly necessary, since we could assert imported data directly in terms of the generic groups ontology. Classes of entities such as ThreatGroup can be defined with the PowerLoom DEFCONCEPT command (note that since the LD-ONTOLOGY module is case sensitive, PowerLoom commands have to be written in all upper-case):

(DEFCONCEPT Person)
(DEFCONCEPT Group)
(DEFCONCEPT ThreatGroup (Group))
(DEFCONCEPT NonThreatGroup (Group))
(DEFCONCEPT Event)
(DEFRELATION memberAgents (?group (?agent Person))
  :documentation "Asserts that ?agent is a `Person' and a member of ?group.
Note that the argument order of `memberAgents' links in the Ali Baba EDB
is actually reversed from the one used in the EAGLE Y2 and Y3 EDB schema!")

The classes ThreatGroup and NonThreatGroup are defined as subclasses of Group which is supplied as the superconcept in the DEFCONCEPT command.

memberAgents is a binary relation between a group and a Person which mirrors KOJAK's GROUPS/groupMember relation. The second argument of the DEFRELATION command is a list of variables defining the arguments and arity of the relation. If a variable is grouped with a concept name as done for ?agent the concept is interpreted to be the argument type for that position (somewhat similar to a column type in a database table).

Once these classes (or concepts) and relations have been defined, they can be used to introduce instances and assert relationships between them. Let us play with these definitions a little bit to get a better idea how they are used by KOJAK. To do so we'll start KOJAK in interactive mode via the -i command-line argument which instead of running an analysis end-to-end will bring up a command loop where we can execute KOJAK and PowerLoom commands:

% run-kojak -i -c config/ali-baba.dat
Running C++ version of KOJAK Group Finder...
Initializing STELLA...
Initializing PowerLoom...
Initializing KOJAK...
[2005-OCT-06 17:53:19.000 PL] Processing check-types agenda...
+-------------------------------------------------------------+
|          Welcome to the KOJAK Group Finder v2.2.0           |
| Copyright (C) USC Information Sciences Institute, 1996-2005 |
|   This software comes with ABSOLUTELY NO WARRANTY and is    |
|    licensed for NON-COMMERCIAL EVALUATION PURPOSES ONLY!    |
|                    All rights reserved.                     |
+-------------------------------------------------------------+
|=

Once we see the |= command prompt we can type commands and see their results. Let us start by examining and changing the current module with PowerLoom's cc command. Without an argument, the command will simply return the current module we are in which initially is the KOJAK module. Given a module name as an argument, it will switch the current module to that module:

|= (cc)
|MDL|/KOJAK
|= (cc LD-ONTOLOGY)
|MDL|/PL-KERNEL-KB/PL-USER/LD-ONTOLOGY

Now we are in the LD-ONTOLOGY module and ready to experiment with some ontology definitions. Note that this module is case sensitive (as opposed to the case-insensitive KOJAK module), which means all KOJAK and PowerLoom commands have to be spelled in all upper-case. Let us start by running a PowerLoom retrieve query to see whether KOJAK knows about any ThreatGroup instances at this point. retrieve takes a logic sentence as an argument where at least one argument is a variable (a question mark symbol) and tries to find bindings for the variable(s) based on what is currently asserted in the knowledge base. An optional argument given before the sentence specifies how many answers we are looking for. By default only one answer is retrieved; the ALL argument indicates that we are looking for all answers:

|= (RETRIEVE ALL (ThreatGroup ?g))
ERROR: Undeclared predicate or function reference: `ThreatGroup'.
   Error occurred while parsing the proposition:
   (KAPPA (?g) (ThreatGroup ?g))

We get an error message, since at this point the ThreatGroup concept isn't yet defined. To define it we manually load the ontology file using the load command (in normal operation that's done automatically when initialize-kojak is called) and then run the query again.

|= (LOAD "KJ:kbs;ali-baba-ontology.plm")
|= (RETRIEVE ALL (ThreatGroup ?g))
No solutions.

This time we didn't get an error, but we also got no solutions, since the KB is still empty. Let's add some content with PowerLoom's assert command. We first switch to the DATASET module, since that is where we usually want to store any data:
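The commands for this step can be sketched as follows. They are reconstructed from the query results shown next (one ThreatGroup named RAF with two members), so the exact form and order of the assertions are assumptions, and the values PowerLoom echoes back are omitted.

```lisp
;; Switch to the module that holds data assertions.
(CC DATASET)

;; Introduce one threat group and two of its members.
(ASSERT (ThreatGroup RAF))
(ASSERT (memberAgents RAF Ulrike_Meinhof))
(ASSERT (memberAgents RAF Andreas_Baader))
```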

Now we can ask some questions again:

|= (RETRIEVE ALL (ThreatGroup ?g))
[2005-OCT-06 18:00:34.000 PL] Processing check-types agenda...
There is 1 solution:
  #1: ?g=RAF
|= (RETRIEVE ALL (memberAgents RAF ?m))
There are 2 solutions:
  #1: ?m=Ulrike_Meinhof
  #2: ?m=Andreas_Baader
|= (RETRIEVE ALL (memberAgents ?g ?m))
There are 2 solutions:
  #1: ?g=RAF, ?m=Andreas_Baader
  #2: ?g=RAF, ?m=Ulrike_Meinhof

We can also illustrate some very simple logical inference based on these definitions. For example, retrieving all instances of Group returns RAF, since ThreatGroup is a subconcept of Group:

|= (RETRIEVE ALL (Group ?g))
There is 1 solution:
  #1: ?g=RAF
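The count argument of retrieve can also be omitted or given as a number instead of ALL. A small sketch, assuming the same session and assertions as in the queries above; which member is returned first is not guaranteed, so no output is shown.

```lisp
;; No count: retrieve a single answer (the default).
(RETRIEVE (memberAgents RAF ?m))

;; Explicit count: retrieve at most one answer.
(RETRIEVE 1 (memberAgents RAF ?m))

;; ALL: retrieve every answer.
(RETRIEVE ALL (memberAgents RAF ?m))
```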

The following query retrieves all people. Note that we never explicitly asserted anybody to be a Person, but the system inferred Person-hood for the second argument of memberAgents assertions due to the type constraint given in the definition of the relation. This is different from the usual programming language semantics, where a type constraint has to be satisfied before, say, a function can be applied to a particular argument. In PowerLoom we infer that the argument must be of that type if we are given the assertion of the relation. This is similar to a situation where somebody tells you that X is Y's brother and you infer (given X is a person) that Y must be a person, even if nobody has ever told you that explicitly.

|= (RETRIEVE ALL (Person ?x))
There are 2 solutions:
  #1: ?x=Andreas_Baader
  #2: ?x=Ulrike_Meinhof

This assertion of groups and their members is the basic mechanism via which KOJAK represents seed information. We haven't established the linkage yet between the terminology we used here and the generic groups ontology used by KOJAK, but once that is done (see next section) it can use queries just like the ones we used above to access this information.

Following these concept definitions is the definition of a nameString relation which can be used to associate a name with a particular entity. We do use this relation in `scripts/ali-baba-load-data.plm' to assert names for threat group entities. However, these name strings are not yet used for anything. Conceivably, we could change the definition of the GROUPS/nameString rule at the end of the file to use those name strings. Just for illustration, here is how we can assert and retrieve such name strings:

|= (ASSERT (nameString RAF "Red Army Fraction"))
|P|(nameString RAF "Red Army Fraction")
|= (ASSERT (nameString RAF "Rote Armee Fraktion"))
|P|(nameString RAF "Rote Armee Fraktion")
|= (RETRIEVE ALL (nameString RAF ?name))
There are 2 solutions:
  #1: ?name="Red Army Fraction"
  #2: ?name="Rote Armee Fraktion"
|= QUIT
Really exit? (yes or no) yes

Next, the file defines participatedInTerrorism which we will use to materialize terrorism links from the Ali Baba EDB. Our interpretation of such links is that they encode that two people committed an act of terrorism together (this interpretation might be wrong - unfortunately, the documentation of the Ali Baba dataset is somewhat lacking so we don't know for sure). Assuming our interpretation is correct, we will use such links to infer additional seed members for groups which will be illustrated in the next section.

The next three relations affiliatedWith, communication and phoneCall correspond to the three expansion link types that will be used by the mutual information component to extend groups. These relations will actually never be materialized but simply serve as names to link up the mutual information component's GROUPS/linkCount function to three different EDB relations. We could have done without actually defining these relations, but they allow us to document here what types of links we are using for group extension. The use and semantics of these relations will be described in more detail later.

Finally, what kind of information is exploitable to infer additional seed members or what link types are usable for group expansion via the mutual information component will of course be different for every dataset. What we used here only makes sense for the Ali Baba data and can only serve as an illustration. For each new dataset some manual analysis has to be done to determine what information is available and what portion of it might be usable by the KOJAK Group Finder (also see 5.1.2 Link Data for some more information on what types of data the Group Finder can exploit).

8.3.3 `ali-baba-seed-constraints.plm'

The seed constraints file defines two things: (1) the mapping between terms in the Ali Baba group ontology and the generic group ontology used by KOJAK, and (2) constraint rules (really just one in this case) for inferring additional seed members. There is no specific requirement that the term mapping be defined in this file; we could have just as well added it to the ontology file. However, the mapping might be more complex and involve rules which would interact with the seed constraint rules. For this reason, we usually put these two pieces together in this file so it is easier to see their connection and interaction. For example, look at `kbs/iet-y3-seed-constraints.plm' to see such a more complex mapping.

Let us experiment again with KOJAK in interactive mode to see how the seed constraints file is used. We start by loading the two ontology files but not yet the constraints file:

% run-kojak -i -c config/ali-baba.dat
Running C++ version of KOJAK Group Finder...
Initializing STELLA...
Initializing PowerLoom...
Initializing KOJAK...
[2005-OCT-07 13:58:47.000 PL] Processing check-types agenda...
+-------------------------------------------------------------+
|          Welcome to the KOJAK Group Finder v2.2.0           |
| Copyright (C) USC Information Sciences Institute, 1996-2005 |
|   This software comes with ABSOLUTELY NO WARRANTY and is    |
|    licensed for NON-COMMERCIAL EVALUATION PURPOSES ONLY!    |
|                    All rights reserved.                     |
+-------------------------------------------------------------+
|= (LOAD "KJ:kbs;generic-groups-ontology.plm")
|= (LOAD "KJ:kbs;ali-baba-ontology.plm")

Next, we assert a ThreatGroup instance again and see whether we can retrieve the instance using KOJAK's generic group syntax:
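A sketch of this step, assuming the instance is asserted in the DATASET module as in the previous section and that the generic term being queried is GROUPS/KnownGroup (the term named in the explanation that follows). PowerLoom's responses are omitted.

```lisp
;; Switch to the data module (assumed, as in the previous section).
(CC DATASET)

;; Assert a group using the Ali Baba vocabulary.
(ASSERT (ThreatGroup RAF))

;; Ask for it using KOJAK's generic vocabulary.
(RETRIEVE ALL (GROUPS/KnownGroup ?g))
```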

There were no solutions. The reason is that the linkage between the Ali Baba term ThreatGroup and KOJAK's GROUPS/KnownGroup hasn't been established yet. The first section of `kbs/ali-baba-seed-constraints.plm' establishes this connection by declaring synonym relationships between KOJAK terms and Ali Baba terms. This is the simplest form of mapping where we simply declare equivalence between terms. More complicated mappings can be established by using rules. For example, suppose we have a memberAgents relation that uses arguments in the reverse order of GROUPS/groupMember. In that case we could use the following rule to map between them:

(ASSERT (<=> (memberAgents ?a ?g) (GROUPS/groupMember ?g ?a)))

We could have used rules like that instead of all the synonym mappings. However, the synonym mechanism is more efficient, since it doesn't require an extra inference step to carry out the mapping.

Now let us load the mapping and see how it changes things. We have to clear the data module first and reassert the information, since the synonym relation will not affect assertions that have already been made. After that the query that failed above does return the result we want:
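A sketch of these steps, under the assumptions that the data module is emptied with PowerLoom's clear-module command and that the current module is still DATASET. The order of loading and clearing is also an assumption, and PowerLoom's responses are not shown.

```lisp
;; Load the synonym mappings and seed constraint rules.
(LOAD "KJ:kbs;ali-baba-seed-constraints.plm")

;; Remove the assertions made before the mapping existed.
(CLEAR-MODULE "DATASET")

;; Reassert the seed group, then repeat the query in generic terms.
(ASSERT (ThreatGroup RAF))
(RETRIEVE ALL (GROUPS/KnownGroup ?g))
```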

The second part of the constraints file contains the following rule:

(ASSERT (=> (AND (ThreatGroup ?g)
                 (memberAgents ?g ?p1)
                 (EXISTS (?e)
                   (participatedInTerrorism ?e ?p1 ?p2)))
            (memberAgents ?g ?p2)))

What it models is the following: If there is a threat group ?g with a member ?p1 and there is some terrorism event ?e where some other person ?p2 participated with ?p1, then that other person must also be a member of the threat group ?g. Let us see this rule in action. We start by asserting some initial seed members again and see that we can only retrieve what we asserted:

|= (ASSERT (AND (memberAgents RAF Andreas_Baader) (memberAgents RAF Ulrike_Meinhof)))
(|P|(groupMember RAF Andreas_Baader) |P|(groupMember RAF Ulrike_Meinhof))
|= (RETRIEVE ALL (GROUPS/groupMember ?g ?m))
[2005-OCT-07 14:01:27.000 PL] Processing check-types agenda...
There are 2 solutions:
  #1: ?g=RAF, ?m=Ulrike_Meinhof
  #2: ?g=RAF, ?m=Andreas_Baader

Now we assert that some new person performed a terrorist act together with a known member of the RAF. When we ask for membership again KOJAK now finds the new member through logical inference, even though membership was not explicitly asserted:

|= (ASSERT (participatedInTerrorism US_Army_Corp_Bombing Andreas_Baader Gudrun_Ensslin))
|P|(participatedInTerrorism US_Army_Corp_Bombing Andreas_Baader Gudrun_Ensslin)
|= (RETRIEVE ALL (GROUPS/groupMember ?g ?m))
[2005-OCT-07 14:02:17.000 PL] Processing check-types agenda...
There are 3 solutions:
  #1: ?g=RAF, ?m=Ulrike_Meinhof
  #2: ?g=RAF, ?m=Andreas_Baader
  #3: ?g=RAF, ?m=Gudrun_Ensslin

We can use PowerLoom's explanation mechanism to illustrate how the rule was actually used here:

|= (SET-FEATURE JUSTIFICATIONS)
|l|(:JUSTIFICATIONS :EMIT-THINKING-DOTS :JUST-IN-TIME-INFERENCE)
|= (ASK (GROUPS/groupMember RAF Gudrun_Ensslin))
TRUE
|= (WHY)
1 (groupMember RAF Gudrun_Ensslin)
    follows by Modus Ponens
    with substitution {?g/RAF, ?p2/Gudrun_Ensslin, ?e/US_Army_Corp_Bombing, ?p1/Andreas_Baader}
    since 1.1 ! (FORALL (?g ?p2)
                   (<= (groupMember ?g ?p2)
                       (EXISTS (?e ?p1)
                          (AND (ThreatGroup ?g)
                               (groupMember ?g ?p1)
                               (participatedInTerrorism ?e ?p1 ?p2)))))
    and   1.2 ! (ThreatGroup RAF)
    and   1.3 ! (participatedInTerrorism US_Army_Corp_Bombing Andreas_Baader Gudrun_Ensslin)
    and   1.4 ! (groupMember RAF Andreas_Baader)

Here we only have one constraint rule but there is no limit to the complexity of such a model. Any of the generic group terms can have rules associated with it to allow them to be inferred from other available information. For example, we might have a rule that states that an organization that received financial support from some known threat group is itself a threat group. How complex this needs to be will again strongly depend on the particular dataset and the available information. One important caveat is that such rules should be fairly reliable (but not necessarily 100%), since KOJAK currently treats such logically inferred information just as if it had been asserted explicitly (except that inferred seed members are annotated as such in the hypothesis tables of the KOJAK database).
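
To make the effect of the constraint rule concrete, here is a minimal sketch in Python (not PowerLoom) that applies the same rule by naive forward chaining until no new members appear. PowerLoom itself answers such queries by backward chaining at query time; the function name infer_members and its data layout are assumptions made for this illustration only. Like KOJAK, the sketch keeps asserted and inferred members apart.

```python
def infer_members(threat_groups, members, participations):
    """Apply the membership rule until a fixed point is reached.

    threat_groups:  set of group names known to be threat groups.
    members:        set of explicitly asserted (group, person) pairs.
    participations: set of (event, p1, p2) triples.
    Returns (known, inferred) sets of (group, person) pairs.
    """
    known = set(members)
    all_members = set(members)
    changed = True
    while changed:
        changed = False
        for _event, p1, p2 in participations:
            for group in threat_groups:
                # Rule: ThreatGroup(g), groupMember(g, p1) and
                # participatedInTerrorism(e, p1, p2) imply groupMember(g, p2).
                if (group, p1) in all_members and (group, p2) not in all_members:
                    all_members.add((group, p2))
                    changed = True
    return known, all_members - known


known, inferred = infer_members(
    threat_groups={"RAF"},
    members={("RAF", "Andreas_Baader"), ("RAF", "Ulrike_Meinhof")},
    participations={("US_Army_Corp_Bombing", "Andreas_Baader", "Gudrun_Ensslin")},
)
print(sorted(inferred))  # [('RAF', 'Gudrun_Ensslin')]
```

The loop repeats because a newly inferred member can in turn trigger the rule for further participants.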

KOJAK uses queries like the ones shown above to retrieve seed groups and their members. In fact, that's exactly what KOJAK's retrieve-groups command does, for example:

|= (RETRIEVE-GROUPS)
[2005-OCT-07 14:56:46.000 KOJAK] Found 1 known groups...
[2005-OCT-07 14:56:46.000 KOJAK] Looking for group memberships...
[2005-OCT-07 14:56:47.000 KOJAK] Done retrieving groups and members.
(|GROUP|RAF)

|= (KOJAK-LOG-OBJECTS :LOW "Found these groups:" (GET-ALL-GROUPS))
[2005-OCT-07 14:59:11.000 KOJAK] Found these groups:
    [GROUP RAF
       known members: (|AGENT|Ulrike_Meinhof |AGENT|Andreas_Baader)
       inferred members: (|AGENT|Gudrun_Ensslin)]

Given appropriate mappings, KOJAK can retrieve this seed information even if it was asserted using a different vocabulary of classes and relations. In general, it is a good idea to use the built-in generic vocabulary wherever possible to simplify things and avoid errors; however, as the above shows, it is possible to use custom terms if so desired.

8.3.4 `ali-baba-edb-schema.plm'

The EDB schema file defines the interface between KOJAK's internal logic-based representation (based on PowerLoom) and an external relational evidence database. Once the EDB schema file is in place the following operations are possible:

Query external database tables fully transparently via PowerLoom's retrieve command just like any other PowerLoom relation
Selectively import and assert information from the external database in KOJAK's internal KB
Compute link statistics directly on the database and use them to perform group extension via KOJAK's mutual information component

The EDB schema file is usually the most complex part of every configuration. Creating it will require intimate knowledge of the content as well as the schema of the external EDB. Fortunately, the schema mapping only needs to be done for relevant tables of the database but not necessarily all of them.

PowerLoom commands to define database objects as well as the necessary supporting concepts and relations are all defined in the RDBMS namespace. This namespace is inherited in the EDB module and therefore these commands usually don't have to be explicitly qualified. Below, however, we do always describe them with their namespace prefix to be clear where they are coming from.

8.3.4.1 Database Instances

8.3.4.2 Table Mappings

8.3.4.3 Materializing Type IDs

8.3.4.4 Defining Link Count Relations

8.3.4.1 Database Instances

In KOJAK an external database is represented via a database instance which is a logic instance of type RDBMS/SQL-Database. All physical connection information is associated with this database instance and can be changed to link the instance to a different physical database. In this sense a database instance is very similar to an ODBC or JDBC data source. Mapping definitions such as table projections or queries are always associated with the logical database instance only to hide the physical connection details. These details would be different, for example, between the C++ and Java versions of KOJAK even if the same external database is used.

Database instances can be defined via the RDBMS/defdb command. The first command in `ali-baba-edb-schema.plm' defines the edb database instance, which is the only one used (so far) by KOJAK. We don't assign any connection information there, since that will be done programmatically by KOJAK using the information from the configuration file. Alternatively, it could be explicitly asserted in a KOJAK run script via KOJAK's db-assert-connection-info command.

8.3.4.2 Table Mappings

The next section in the schema file defines a fairly extensive mapping between tables in the Ali Baba evidence database and corresponding PowerLoom relations. Mapping between a relational database table and a PowerLoom relation is fairly straightforward, since a database table is simply a set of tuples. Each such table could therefore be represented in PowerLoom with a corresponding relation that has as many arguments as the database table has columns. Usually, however, we will map a database table onto one or more PowerLoom relations that each correspond to a particular projection of the database table. The reasons for this are the following:

Only some columns of a database table might be needed in KOJAK; by restricting the mapping to the essential columns we avoid communication of unnecessary information which improves speed and scalability.
Relational databases use a different modeling style where tables might often be very wide with many columns, while PowerLoom models are usually "skinny", with relations that rarely have more than 3 or 4 arguments. Since columns are referred to explicitly in SQL queries, wide tables do not cause a problem there; however, using PowerLoom relations with many arguments is tedious, since all argument variables always need to be listed. Since a column in a database table often corresponds to a single "semantic aspect" of some object (usually represented by a key), the better mapping is to represent each such key/column combination as a separate PowerLoom relation.
PowerLoom uses predicate logic as the underlying representation mechanism which does not have a natural way of supporting NULL values. Rows containing a NULL value can currently not be returned by a PowerLoom query, since the NULL value is equivalent to an unbound variable (some of PowerLoom's partial inference mechanisms can be used to work around this problem, but those are beyond the scope of this document). Therefore, mapping a database table onto a PowerLoom relation where for some row one or more columns could be NULL would prevent one from being able to retrieve the non-NULL columns of such a row. By mapping the table onto multiple projections this problem can be avoided.

The main command to define the mapping between a database table and a PowerLoom relation is RDBMS/deftable. It is very similar to a standard defrelation command but takes two extra arguments: (1) a logical database instance on which this table resides (e.g., edb) and (2) the name of the database table to which this PowerLoom relation corresponds. The argument list then specifies which table columns map onto which relation arguments and what if any type coercions should be performed. For example, let us look at the following table definition from the Ali Baba schema file:

(DEFTABLE EDB-AliBabaName edb "AB_V41" (ID (NAME STRING)))

This defines a new binary PowerLoom relation EDB-AliBabaName (just as if we had used defrelation) and associates it with the table "AB_V41" on the database identified by edb. The table "AB_V41" is a wide table with 22 columns, some of which are shown below (the full table description is listed in the EDB schema file):

mysql> describe ab_v41;
+--------------+--------------+------+-----+---------+
| Field        | Type         | Null | Key | Default |
+--------------+--------------+------+-----+---------+
| ID           | int(11)      |      | PRI | 0       |
| ALIAS        | varchar(200) | YES  |     | NULL    |
| NAME         | varchar(200) | YES  |     | NULL    |
| POB          | varchar(200) | YES  |     | NULL    |
| DOB          | varchar(200) | YES  |     | NULL    |
.................................................
| ACTIVITY     | varchar(200) | YES  |     | NULL    |
+--------------+--------------+------+-----+---------+
22 rows in set (0.00 sec)

The purpose of EDB-AliBabaName is to map an entity ID onto its name string defined in this table. The specifics of this projection are defined by the argument list of EDB-AliBabaName. Each argument name has to correspond to the name of some table column (arguments can be specified in the standard question-mark syntax if needed, e.g., to reference them in axioms in the definition body). If the argument is specified without a type (as with the ID argument above), argument values brought in during a query will be coerced to standard logic objects. If the argument is followed by a type (as with the NAME argument above), column values will be coerced to this PowerLoom type. Logic objects come with a lot of support machinery to handle namespaces, modules, indexing, equality, etc., which makes them somewhat more heavyweight. For cases where this machinery isn't required, it is better to coerce such columns into simpler types such as STRING or INTEGER, since they occupy significantly less storage.

Let us now show how all this works with an example. We start KOJAK in interactive mode again and define the database instance and table mapping described above. For simplicity, we add the connection information right into the database definition:

% run-kojak java -i -c config/ali-baba.dat
Running Java version of KOJAK Group Finder...
Initializing STELLA...
Initializing PowerLoom...
Initializing KOJAK...
[2005-OCT-10 15:31:59.000 PL] Processing check-types agenda...
+-------------------------------------------------------------+
| Welcome to the KOJAK Group Finder v2.2.0                    |
| Copyright (C) USC Information Sciences Institute, 1996-2005 |
| This software comes with ABSOLUTELY NO WARRANTY and is      |
| licensed for NON-COMMERCIAL EVALUATION PURPOSES ONLY!       |
| All rights reserved.                                        |
+-------------------------------------------------------------+

|= (cc edb)
|MDL|/PL-KERNEL-KB/PL-USER/LD-ONTOLOGY/EVIDENCE-DATABASE-SCHEMA

|= (DEFDB edb
     :jdbc-connection-string "jdbc:mysql://blackcat.isi.edu:3306/ali_baba_v41?user=me&password=???"
     :SQL-database TRUE)
|i|edb

|= (DEFTABLE EDB-AliBabaName edb "AB_V41" (ID (NAME STRING)))
|r|EDB-AliBabaName

Now we are ready to ask some queries. Note that at this point we have nothing loaded into the KOJAK knowledge base, but these queries succeed, since they look up results directly in the external database. We start by asking for the name of the person with ID 404. Since in PowerLoom "404" would be interpreted as an integer, we enclose it in vertical bars to tell the system that this is a logic constant whose name is "404":

|= (RETRIEVE ALL (EDB-AliBabaName |404| ?name))
[2005-OCT-10 15:36:03.000 SDBC] Connect via JDBC: jdbc:mysql://blackcat.isi.edu:3306/ali_baba_v41?user=me&password=???
There is 1 solution:
  #1: ?name="Suraqah"

The log message shows that the connection to the database wasn't attempted until the first query was asked. The PowerLoom query above was translated into the following SQL query, and its results were then fed back to the PowerLoom inference engine:

select NAME from AB_V41 where ID='404';

What this translation exactly looks like depends on the particular binding pattern. For example, the following asks the "reverse" query from the name to the corresponding ID. Since names are not unique, we get multiple solutions here. Also note that because the ID column gets coerced into PowerLoom logic objects, they are printed with vertical bars around them to distinguish them from regular integers:

|= (RETRIEVE ALL (EDB-AliBabaName ?id "Suraqah"))
There are 5 solutions:
  #1: ?id=|404|
  #2: ?id=|1616|
  #3: ?id=|1293|
  #4: ?id=|1865|
  #5: ?id=|4366|

Next, we leave both arguments unbound. We restrict the number of answers sought to 10, since otherwise this would bring in all 6000 or so rows of this table (in general, one should always be very careful when composing queries against a database, since result sets could be very large):

|= (RETRIEVE 10 (EDB-AliBabaName ?id ?name))
There are 10 solutions so far:
  #1: ?id=|402|, ?name="Abidin"
  #2: ?id=|403|, ?name="Sa'eed"
  #3: ?id=|404|, ?name="Suraqah"
  #4: ?id=|405|, ?name="Fikri"
  #5: ?id=|406|, ?name="Mu'izz"
  #6: ?id=|407|, ?name="Mulhim"
  #7: ?id=|408|, ?name="Arfan"
  #8: ?id=|409|, ?name="Usama"
  #9: ?id=|410|, ?name="Ali"
  #10: ?id=|411|, ?name="Mubarak"

We can also ask TRUE/FALSE questions:

|= (ASK (EDB-AliBabaName |402| "Abidin"))
TRUE
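
The dependence of the generated SQL on the binding pattern can be sketched as follows. This is a simplified Python illustration, not the actual PowerLoom translator: the function name translate, the use of DB-API '?' placeholders, and the "select 1" form for fully bound queries are all assumptions. Unbound arguments become output columns and bound arguments become equality constraints.

```python
def translate(table, columns, bindings):
    """Build a SELECT statement for one table projection.

    columns:  column names in relation-argument order.
    bindings: one entry per column; None means the argument is unbound.
    Returns (sql, params) using DB-API '?' placeholders.
    """
    if len(columns) != len(bindings):
        raise ValueError("one binding per column is required")
    unbound = [c for c, b in zip(columns, bindings) if b is None]
    bound = [(c, b) for c, b in zip(columns, bindings) if b is not None]
    # With every argument bound (an ASK-style query) we only test existence.
    select_list = ", ".join(unbound) if unbound else "1"
    sql = f"select {select_list} from {table}"
    if bound:
        sql += " where " + " and ".join(f"{c}=?" for c, _ in bound)
    return sql, [b for _, b in bound]


print(translate("AB_V41", ["ID", "NAME"], ["404", None]))
print(translate("AB_V41", ["ID", "NAME"], [None, "Suraqah"]))
print(translate("AB_V41", ["ID", "NAME"], [None, None]))
```

The last call shows why fully unbound queries deserve care: without a where clause the statement scans the entire table.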

To illustrate the NULL-value problem outlined above, let us define a new table relation that maps an entity ID onto its name and alias columns at the same time:

|= (DEFTABLE EDB-AliBabaNameAlias edb "AB_V41" (ID (NAME STRING) (ALIAS STRING)))
|r|EDB-AliBabaNameAlias

For entities that actually have both a name and an alias this does what is expected:

|= (RETRIEVE ALL (EDB-AliBabaNameAlias |2772| ?name ?alias))
There is 1 solution:
  #1: ?name="Gimmel Faruk", ?alias="Scar Face"

However, for entities that only have a name this now causes a problem: the NULL value in the alias column prevents us from retrieving the row (?alias would remain unbound), so we could never retrieve the name of such an entity using EDB-AliBabaNameAlias:

|= (RETRIEVE (EDB-AliBabaNameAlias |404| ?name ?alias))
No solutions.
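
The same effect can be reproduced outside of PowerLoom with a small, self-contained sketch using Python's sqlite3 module. The in-memory table is a three-column stand-in for "AB_V41", and the helper project is a hypothetical function that drops any row containing a NULL in a requested column, which is only an approximation of what happens when a NULL cannot bind a query variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AB_V41 (ID INTEGER PRIMARY KEY, NAME TEXT, ALIAS TEXT)")
conn.executemany(
    "INSERT INTO AB_V41 VALUES (?, ?, ?)",
    [(2772, "Gimmel Faruk", "Scar Face"), (404, "Suraqah", None)],
)


def project(entity_id, columns):
    """Return the requested columns, or None if any of them is NULL.

    A row with a NULL in a requested column is dropped, mimicking the
    fact that a NULL cannot bind a query variable.
    """
    row = conn.execute(
        f"SELECT {', '.join(columns)} FROM AB_V41 WHERE ID=?", (entity_id,)
    ).fetchone()
    if row is None or any(value is None for value in row):
        return None
    return row


print(project(2772, ["NAME", "ALIAS"]))  # both columns are present
print(project(404, ["NAME", "ALIAS"]))   # dropped, because ALIAS is NULL
print(project(404, ["NAME"]))            # the narrower projection succeeds
```

Splitting the wide mapping into one projection per key/column pair is what makes the third call succeed.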

The next 16 deftable definitions in `kbs/ali-baba-edb-schema.plm' provide a fairly exhaustive mapping of the Ali Baba schema whose table structure follows the Year-2 EAGLE EDB schema. For the Ali Baba data only EDB-LinkOfType, EDB-Organization and EDB-AliBabaName are actually used to import data. Additionally, the various type ID tables are used to import type ID information. One important "wrinkle" of this mapping is that all type IDs (e.g., the first argument of EDB-EntityType) are explicitly coerced to integers to avoid conflicts with actual people entities whose ID might be the same.

Note that table relations such as EDB-AliBabaName can only be used to query a database. They can also be asserted via the PowerLoom assert command; however, such assertions do not update the database (future versions of PowerLoom might provide this feature). Also, database table queries do not materialize any assertions on the PowerLoom end (apart from caching). If that is desired, PowerLoom's assert-from-query command can be used (more on that below).

8.3.4.3 Materializing Type IDs

What is described in this section is very specific to the "typeless" nature of the EAGLE EDB schema which is unlikely to occur often in real-life databases. The concepts and mechanisms described here might still be useful, but if you are not dealing with such a type of schema you could safely skip this section.

The Year-2 EAGLE EDB schema is a very flexible and extensible schema where a single entity table holds entities of many different types, and, similarly, a single link table holds entity-entity links of many different types. This allows the introduction of new entity or link types without having to change the database schema. Instead of having separate tables for each entity and link type, each entry in the EDB ENTITY and LINK tables is associated with a type ID. Type IDs are mapped onto descriptive type names via tables such as ENTITYTYPE and LINKTYPE. Many of these descriptive names originate in an EDB ontology developed by Cycorp, but the Ali Baba database also uses many new and undocumented types.

Because of this "typeless" nature of the EAGLE EDB schema, each query will usually have to specify entity and link type IDs to restrict the types of entities and links that should be considered. This can convolute query expressions significantly, since it usually involves an extra constraint for the type ID plus an extra join if one doesn't want to hardcode integer type IDs but actually wants to key in on their descriptive name. For example, the following query would retrieve all organizations in the entity table:

select e.ENTITYID from ENTITY e, ENTITYTYPE et where e.ENTITYTYPEID=et.ENTITYTYPEID and et.CYCCOLLECTION='Organization';

Similarly, using the PowerLoom mapping we would have to phrase this query as follows:

(RETRIEVE ALL (AND (EDB-EntityType ?typeid Organization) (EDB-EntityOfType ?id ?typeid)))

To avoid the extra join and database roundtrip as well as hard-coding integer type IDs in our queries, we define a set of PowerLoom functions that can map an entity type such as Organization onto its type ID by simply looking it up in a local table. This allows us to rephrase the query above like this:

(RETRIEVE ALL (EDB-EntityOfType ?id (EDB-EntityTypeID Organization)))

This looks up the type ID on the PowerLoom side and translates into the following much simpler SQL query:

select ENTITYID from ENTITY where ENTITYTYPEID=1071;

The type ID tables represented by the four PowerLoom functions EDB-EntityTypeID, EDB-LinkTypeID, EDB-EntityAttributeTypeID and EDB-LinkAttributeTypeID are populated using PowerLoom's assert-from-query command. For example:

(ASSERT-FROM-QUERY
  (RETRIEVE ALL (?pred ?id) (EDB-EntityType ?id ?pred))
  :relation EDB-EntityTypeID)

assert-from-query takes a retrieve command as its argument and then creates an assertion for each solution (or set of variable bindings) the query generates. There are different ways for creating these assertions (see the documentation of assert-from-query for more detail). One way used here is to provide a :relation argument which will assert the given relation for each set of bindings retrieved by the query. For example, one of the 122 assertions created by the above is the following:

(ASSERT (EDB-EntityTypeID Organization 1071))

Note that in the query above we supply the set of query variables (similar to output columns in a select statement) to get the appropriate argument order when bindings are supplied to EDB-EntityTypeID.

assert-from-query is very useful to selectively materialize portions of the external DB to avoid having to query for the same information over and over again. This mechanism will be used again when we discuss `scripts/ali-baba-load-data.plm'.
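
The idea of materializing a lookup table once and then issuing join-free queries can be sketched in Python with sqlite3. This is an analogy under stated assumptions, not KOJAK code: the dictionary entity_type_id plays the role of the EDB-EntityTypeID function, the helper entities_of_type is hypothetical, and the sample rows are made up except for the Organization/1071 pairing taken from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ENTITYTYPE (ENTITYTYPEID INTEGER, CYCCOLLECTION TEXT);
    CREATE TABLE ENTITY (ENTITYID INTEGER, ENTITYTYPEID INTEGER);
    INSERT INTO ENTITYTYPE VALUES (1071, 'Organization'), (1074, 'Person');
    INSERT INTO ENTITY VALUES (1289, 1071), (1290, 1071), (38, 1074);
""")

# One roundtrip up front: materialize the name -> type ID table locally.
entity_type_id = {
    name: type_id
    for type_id, name in conn.execute(
        "SELECT ENTITYTYPEID, CYCCOLLECTION FROM ENTITYTYPE"
    )
}


def entities_of_type(type_name):
    """Look up the type ID locally, then run the join-free query."""
    rows = conn.execute(
        "SELECT ENTITYID FROM ENTITY WHERE ENTITYTYPEID=? ORDER BY ENTITYID",
        (entity_type_id[type_name],),
    )
    return [entity_id for (entity_id,) in rows]


print(entity_type_id["Organization"])
print(entities_of_type("Organization"))
```

Every later call to entities_of_type avoids both the join against ENTITYTYPE and a hard-coded integer type ID.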

8.3.4.4 Defining Link Count Relations

Probably the most important but also challenging part of the Ali Baba EDB schema file is the section that defines the three link count relations EDB-phoneCallCount, EDB-communicationCount and EDB-affiliatedWithCount. These relations are used by KOJAK's mutual information (MI) component to perform group extension and compute connection strengths between individuals. The challenge here is not so much with the KOJAK or PowerLoom interface, but with the formulation of appropriate SQL queries which can become somewhat complex.

The MI component starts with a set of group seed entities and in its first step builds an extended graph around them. To do this it looks at each seed entity and finds all other entities connected to it by a link of one of the expansion link types listed in the configuration or directly supplied to the expand-groups command (which see). In addition to looking for entities connected via such links, it also retrieves link counts, that is, how many links of a particular type lead to some new entity X. For example, it might retrieve that from seed entity 38 there are two communication links to entity 1155. These link counts are an important input to the MI computation.

The MI component uses GROUPS/linkCount as the central interface relation to access links and link counts from group seeds. This relation takes a set of seeds and an expansion link type as arguments and returns the retrieved entities and associated counts. For example, let us look at the following query:

|= (RETRIEVE ALL (GROUPS/linkCount communication (SETOF |342| |38|) ?p1 ?p2 ?c))
There are 9 solutions:
  #1: ?p1=|38|, ?p2=|2|, ?c=1
  #2: ?p1=|38|, ?p2=|22|, ?c=1
  #3: ?p1=|38|, ?p2=|264|, ?c=1
  #4: ?p1=|38|, ?p2=|639|, ?c=1
  #5: ?p1=|38|, ?p2=|1155|, ?c=2
  #6: ?p1=|342|, ?p2=|2|, ?c=1
  #7: ?p1=|342|, ?p2=|214|, ?c=1
  #8: ?p1=|342|, ?p2=|247|, ?c=1
  #9: ?p1=|342|, ?p2=|277|, ?c=1

In this query we supplied communication as the link type we are interested in and a set of two seed entities represented by the setof term. ?p1 will be bound to each seed member in the set and ?p2 to the various entities ?p1 is connected to via a communication link. The last argument reports how many such links are between the two entities.

For the Ali Baba dataset, the actual retrieval of these links and counts is done by the EDB-communicationCount relation which is defined in `kbs/ali-baba-edb-schema.plm'. For example, we could have used this relation directly to ask the query:

|= (RETRIEVE ALL (EDB-communicationCount (SETOF |342| |38|) ?p1 ?p2 ?c))
There are 9 solutions:
  #1: ?p1=|38|, ?p2=|2|, ?c=1
  #2: ?p1=|38|, ?p2=|22|, ?c=1
  #3: ?p1=|38|, ?p2=|264|, ?c=1
  #4: ?p1=|38|, ?p2=|639|, ?c=1
  #5: ?p1=|38|, ?p2=|1155|, ?c=2
  #6: ?p1=|342|, ?p2=|2|, ?c=1
  #7: ?p1=|342|, ?p2=|214|, ?c=1
  #8: ?p1=|342|, ?p2=|247|, ?c=1
  #9: ?p1=|342|, ?p2=|277|, ?c=1

GROUPS/linkCount and EDB-communicationCount are connected via the following rule which is defined in `scripts/ali-baba-load-data.plm' (since it is a data import rule but it could have been defined in the EDB schema file as well):

(ASSERT (=> (AND (BOUND-VARIABLES ?seeds)
                 (= ?ltype communication)
                 (EDB-communicationCount ?seeds ?party1 ?party2 ?count))
            (GROUPS/linkCount ?ltype ?seeds ?party1 ?party2 ?count)))

So, whenever the MI component asks a GROUPS/linkCount query where ?ltype is bound to communication, PowerLoom simply backchains into EDB-communicationCount, which then does all the work. The bound-variables clause is there for safety and makes sure that the query is never asked with ?seeds unbound (which could only happen if asked manually). The reason for this safeguard is that such a completely unconstrained query might bring in the whole database, which could of course be very large and take a very long time.
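
The dispatch-plus-safeguard pattern expressed by this rule can be sketched in Python as follows. All names here (link_count, handlers, fake_communication_count) are hypothetical stand-ins invented for this illustration; they are not part of KOJAK or PowerLoom, and the tiny in-memory table replaces the real database lookup.

```python
def link_count(link_type, seeds, handlers):
    """Dispatch a link count request to the handler for link_type.

    handlers maps a link type name to a function that takes the seed
    set and returns (party1, party2, count) tuples.
    """
    if not seeds:
        # Counterpart of the BOUND-VARIABLES clause: refuse to run an
        # unconstrained query that could pull in the whole database.
        raise ValueError("seeds must be bound to a non-empty set")
    if link_type not in handlers:
        return []  # no rule applies, so there are no solutions
    return handlers[link_type](frozenset(seeds))


def fake_communication_count(seeds):
    """Stand-in for EDB-communicationCount backed by a tiny in-memory table."""
    table = [("38", "1155", 2), ("342", "214", 1), ("7", "9", 4)]
    return [row for row in table if row[0] in seeds]


handlers = {"communication": fake_communication_count}
print(link_count("communication", {"342", "38"}, handlers))
```

Adding another link type then only requires registering one more handler, which mirrors adding one more rule per link type.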

The other two link types phoneCall and affiliatedWith are connected to corresponding EDB relations with similar rules. Now it should become clear why these relations defined in kbs/ali-baba-ontology.plm only serve as names but are never actually asserted or queried.

Having explained how GROUPS/linkCount is hooked up to EDB relations that can retrieve such links and associated counts, all that remains to be done is to show how an EDB relation such as EDB-communicationCount can be defined. Such link count relations are defined via parametric SQL queries specified via PowerLoom's RDBMS/defquery command (which see). For example, EDB-communicationCount is defined like this:

(DEFQUERY EDB-communicationCount (?seeds ?party1 ?party2 (?count INTEGER))
  :query-pattern (RDBMS/SQL-QUERY edb
    "SELECT ' ', P1, P2, COUNT(P2) FROM...<complicated SQL query>..."))

The RDBMS/defquery command defines a PowerLoom relation with the given name and arguments and links it to an SQL query whose arity (number of output columns) needs to match the arity of the relation. SQL output columns will be assigned by position to relation arguments, therefore, variable names such as ?seeds do not have to match a column name as with the RDBMS/deftable command. Similar to RDBMS/deftable, relation arguments can be associated with a type (e.g., as done for ?count) to force type coercion.

The value of the :query-pattern argument has to be of the following form:

(RDBMS/SQL-QUERY <db-instance> "<sql query>")

The first element in this list is RDBMS/SQL-QUERY which is the name of a PowerLoom query specialist to handle arbitrary parametric SQL queries. RDBMS/defquery is very general and supports other queries such as joins or non-SQL queries as well, but for KOJAK SQL queries are all we need. <db-instance> needs to be a database instance which will usually be edb.

Writing Parametric SQL Queries

For the parametric SQL query there are two types of complexity that we need to handle: (1) how to pass in bound arguments that will result in appropriate constraints, and (2) writing an appropriate SQL query that can compute the desired link counts. Before we explain how count queries need to be constructed, let us explain the parametric query mechanism with a simple example. Below we define a relation myEntityOfType which has essentially the same functionality as EDB-EntityOfType which was defined via RDBMS/deftable earlier in the EDB schema file:

(DEFQUERY myEntityOfType (?id (?typeID INTEGER))
  :query-pattern (RDBMS/SQL-QUERY edb
    "SELECT ENTITYID, ENTITYTYPEID FROM ENTITY WHERE ENTITYID='?id' AND ENTITYTYPEID IN '?typeID'"))

The SQL query has two output columns to match the relation variables ?id and ?typeID. The rest of the SQL query is fairly straightforward. What makes the query parametric is the mechanism by which relation variables are bound to column constraints. For example, the constraint

ENTITYID='?id'

would be translated into

ENTITYID='38'

if ?id is bound to 38. The constraint

ENTITYTYPEID IN '?typeID'

would be translated into

ENTITYTYPEID IN ('1074')

if ?typeID is bound to 1074. This allows us to answer a query such as this:

|= (ASK (myEntityOfType |38| 1074))
TRUE

If the relation variable in such a column constraint is unbound, the whole constraint is eliminated in the corresponding SQL translation (i.e., replaced by TRUE), which allows us to run queries like the following:

|= (RETRIEVE ALL (myEntityOfType |38| ?type))
There is 1 solution:
  #1: ?type=1074

|= (RETRIEVE 5 (myEntityOfType ?id 1071))
There are 5 solutions so far:
  #1: ?id=|1289|
  #2: ?id=|1290|
  #3: ?id=|1291|
  #4: ?id=|1292|
  #5: ?id=|1293|

Finally, binding a relation variable to a set of terms will translate into an appropriate IN constraint (even if the SQL query was specified with an = constraint for that column):

|= (RETRIEVE ALL (myEntityOfType (SETOF |38| |1296|) ?type))
  #1: ?type=1074
  #2: ?type=1071

This last PowerLoom query was translated into the following SQL query:

SELECT ENTITYID, ENTITYTYPEID FROM ENTITY WHERE ENTITYID IN ('38', '1296') AND TRUE

The parametric constraints shown above use a simplified, restricted syntax that has to follow one of the following patterns (<ws> means one or more whitespace characters):

<ws><column>=<?var><ws>
<ws><column>='<?var>'
<ws><column>="<?var>"
<ws><column>=`<?var>`
<ws><column> {in|IN} <?var><ws>
<ws><column> {in|IN} '<?var>'
<ws><column> {in|IN} "<?var>"
<ws><column> {in|IN} `<?var>`

Note that the amount (or lack) of whitespace in each pattern is very specific and has to be strictly followed for the constraint to work. If quotes are used, the substituted value will be appropriately escaped for the chosen quote character. A more general parameter substitution syntax is also available to support the construction of more complex queries, but it is beyond the scope of this document.
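
The substitution behavior described above (bound value, unbound variable, set of values) can be approximated with a short Python sketch. This is not PowerLoom's implementation: the regular expression only approximates the listed patterns, escaping by doubling the quote character is an assumption, and the names CONSTRAINT, render and substitute are invented for this illustration.

```python
import re

# <ws><column>, then "=" or " in "/" IN ", an optional quote character,
# the variable, and the same quote character again.
CONSTRAINT = re.compile(
    r'''(?<=\s)(?P<col>[\w.]+)(?P<op>=| (?:in|IN) )(?P<q>['"`]?)(?P<var>\?\w+)(?P=q)'''
)


def render(value, quote):
    """Quote one value, doubling embedded quote characters (assumed escaping)."""
    text = str(value)
    if quote:
        text = text.replace(quote, quote * 2)
    return f"{quote}{text}{quote}"


def substitute(sql, bindings):
    """Replace each parametric constraint according to its binding."""

    def replace(match):
        value = bindings.get(match["var"])
        if value is None:
            return "TRUE"  # unbound variable: the constraint is eliminated
        quote = match["q"]
        is_set = isinstance(value, (list, tuple, set, frozenset))
        if is_set or match["op"] != "=":
            values = list(value) if is_set else [value]
            rendered = ", ".join(render(v, quote) for v in values)
            return f"{match['col']} IN ({rendered})"
        return f"{match['col']}={render(value, quote)}"

    return CONSTRAINT.sub(replace, sql)


QUERY = ("SELECT ENTITYID, ENTITYTYPEID FROM ENTITY "
         "WHERE ENTITYID='?id' AND ENTITYTYPEID IN '?typeID'")

print(substitute(QUERY, {"?id": "38", "?typeID": 1074}))
print(substitute(QUERY, {"?id": ("38", "1296")}))
```

The second call leaves ?typeID unbound and binds ?id to a sequence, so it yields an IN constraint for ENTITYID followed by AND TRUE, matching the translated query shown earlier.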

Writing Link Count Queries

With this machinery in hand we can now go on to write link count queries for relations such as EDB-communicationCount. Let us again look at the overall structure of this relation:

(DEFQUERY EDB-communicationCount (?seeds ?party1 ?party2 (?count INTEGER))
  :query-pattern (RDBMS/SQL-QUERY edb
    "SELECT ' ', P1, P2, COUNT(P2) FROM...<complicated SQL query>..."))

It needs to have four arguments as shown above (which could be given different names). The first is an "input" argument which will be bound to a set of seeds when the MI component asks the query; the remaining arguments are "output" arguments which will be generated by the embedded parametric SQL query.

The output columns of the SQL query should look as follows (again names could be chosen differently):

SELECT ' ', P1, P2, COUNT(P2) ...

The first value ' ' is only needed to match up with the ?seeds variable position. Since that variable will be bound at query time, we don't have to generate any actual output here. Make sure you use ' ' (a string containing a single space) and not '' (the empty string), since the latter is interpreted as a NULL value on some database systems (e.g., Oracle), which would cause the query to fail in PowerLoom.

The next output column called P1 above will be bound to each of the seed members supplied in ?seeds. The third output column P2 needs to generate all the entities (or nodes) that each seed member is connected to via the particular link type handled by this relation (communication in our example). Finally, the last column needs to generate the number of links of the particular type for each P1,P2 node pair.

The query should generate exactly one row for each node pair which should contain the total number of links. Multiple entries as in

......
' ', 342, 214, 1
......
' ', 342, 214, 2
......

will not be aggregated by KOJAK (the second entry will simply be ignored). Duplicate entries that simply permute the arguments should be avoided but will be ignored if they occur, for example:

......
' ', 342, 214, 2
......
' ', 214, 342, 2
......

Such duplicates cannot always be avoided by the query, since there might be links between two seed members.

Let us emphasize again that KOJAK performs best with high frequency transaction-type links such as, for example, communication events (see 5.1.2 Link Data for more discussion on that). If such data is available, it is important to compute accurate link counts for best performance. That is, it is much better for KOJAK to know that there were 23 phone calls and 17 financial transactions between two entities as opposed to just knowing that there were one or more but without knowing how many. The following example shows how such link counts can be computed for 'Communications' links in the Ali Baba data. How this is done exactly will differ for each different evidence database schema and each link type. The parametric SQL query for EDB-communicationCount looks as follows:

SELECT ' ', P1, P2, COUNT(P2)
FROM (SELECT ENTITYID_ARG1 P1, ENTITYID_ARG2 P2, l.LINKID
      FROM LINK l, LINKTYPE lt, ENTITY e, ENTITYTYPE et
      WHERE l.LINKTYPEID=lt.LINKTYPEID
        AND lt.CYCCOLLECTION='Communications'
        AND l.ENTITYID_ARG2=e.ENTITYID
        AND e.ENTITYTYPEID=et.ENTITYTYPEID
        AND et.CYCCOLLECTION='Person'
        AND l.ENTITYID_ARG1 IN '?seeds'
        AND l.ENTITYID_ARG1='?party1'    -- optional
        AND l.ENTITYID_ARG2='?party2'    -- optional
      UNION
      SELECT l.ENTITYID_ARG2 P1, l.ENTITYID_ARG1 P2, l.LINKID
      FROM LINK l, LINKTYPE lt, ENTITY e, ENTITYTYPE et
      WHERE l.LINKTYPEID=lt.LINKTYPEID
        AND lt.CYCCOLLECTION='Communications'
        AND l.ENTITYID_ARG2=e.ENTITYID
        AND e.ENTITYTYPEID=et.ENTITYTYPEID
        AND et.CYCCOLLECTION='Person'
        AND ENTITYID_ARG2 IN '?seeds'
        AND l.ENTITYID_ARG1='?party2'    -- optional
        AND l.ENTITYID_ARG2='?party1')   -- optional
     SUBQUERY
GROUP BY P1, P2

The clauses marked as optional don't really need to be there, since the ?party1 and ?party2 variables will not be bound in queries generated by the MI component. They could be bound in manual queries, though, which is why they are added here. The query is structured as follows: the inner query named SUBQUERY retrieves all links in the LINK table whose link type is Communications and whose arguments are of type Person. Due to the fragmented nature of this schema this takes a somewhat complex join. The ?seeds argument constrains the arguments of each link. Since a ?seeds argument could be the first or second argument of such a link, we have to ask the query twice and union the results (note that an OR would not do the same thing here, since it would not guarantee to have only seed members for column P1). For each link we also output its LINKID to make sure that we get separate rows for each Communications link between two people. In the outer query we then group the result of the subquery first by seed members and then by second entity which allows us to compute accurate link counts for the final output column.
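
The union-then-group structure can be tried out on a toy dataset with Python's sqlite3 module. This is a deliberately reduced sketch: the LINK table has only three columns, the link type and entity type joins are omitted, the leading ' ' column is dropped, and seeds are passed as ordinary DB-API parameters rather than through PowerLoom's parametric syntax. The function name communication_counts and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LINK (LINKID INTEGER PRIMARY KEY,
                       ENTITYID_ARG1 INTEGER, ENTITYID_ARG2 INTEGER);
    INSERT INTO LINK VALUES (1, 38, 1155), (2, 1155, 38), (3, 342, 214),
                            (4, 500, 600);
""")


def communication_counts(seeds):
    """Return (seed, other, count) rows, one per connected node pair."""
    seeds = sorted(seeds)
    marks = ", ".join("?" for _ in seeds)
    sql = f"""
        SELECT P1, P2, COUNT(P2)
        FROM (SELECT ENTITYID_ARG1 P1, ENTITYID_ARG2 P2, LINKID
              FROM LINK WHERE ENTITYID_ARG1 IN ({marks})
              UNION
              SELECT ENTITYID_ARG2 P1, ENTITYID_ARG1 P2, LINKID
              FROM LINK WHERE ENTITYID_ARG2 IN ({marks})) SUBQUERY
        GROUP BY P1, P2
        ORDER BY P1, P2
    """
    # The seed list is bound twice, once for each half of the UNION.
    return conn.execute(sql, seeds + seeds).fetchall()


print(communication_counts({38, 342}))  # [(38, 1155, 2), (342, 214, 1)]
```

Links 1 and 2 connect entities 38 and 1155 in opposite directions. Because LINKID is part of the subquery output, UNION keeps them as two distinct rows, and the outer GROUP BY reports a count of 2 for that pair with the seed in the first column.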

The other two link count relations EDB-phoneCallCount and EDB-affiliatedWithCount are defined similarly. EDB-phoneCallCount finds pairs of entities that are linked by having called the same phone number (which is more indirect than having called each other; however, direct phone call information doesn't seem to be available in this dataset). EDB-affiliatedWithCount uses the most complex SQL query, since it finds pairs of entities that are either directly affiliatedWith each other or affiliated via an intermediary Event or Organization. Conceivably, these two different kinds of affiliations could be considered to be separate link types and handled by separate relations.

The way KOJAK computes and uses these link count statistics is one of the main reasons for its scalability. Instead of having to look at each individual link of a particular type between two entities, all of them can be abstracted into a single link count tuple. KOJAK also only computes these link counts in a very focused way starting from group seeds instead of having to do that for the whole database. Finally, by off-loading this data aggregation to the database, much less data has to be moved and the power of the relational database system can be exploited to perform these computations over large datasets. Nevertheless, depending on the nature and connectivity of the data, such queries can be time consuming and put significant stress on the database server.

8.3.5 `ali-baba-load-ontology.plm'

This file loads the Ali Baba ontology, the Ali Baba/groups ontology mapping and constraint rules. The value of LoadOntologyScript in `config/ali-baba.dat' should point to this file. It will be loaded when the KOJAK commands initialize-kojak or load-kojak-ontology (which see) are called in the run script. load-kojak-ontology allows the explicit specification of a script which would override what is specified in LoadOntologyScript.

PowerLoom KB files are loaded using PowerLoom's load command. The script is loaded into the KOJAK module which is case-insensitive. This means that commands can be spelled in upper or lower case. If the load command were to be used in a case-sensitive module such as DATASET, it would have to be spelled in all upper-case letters.

load takes the name of the file to be loaded as its argument. In this load script all files are specified via logical pathnames, which ensures their proper translation regardless of which version of KOJAK you are using and which OS it is run on (for more information on logical pathname syntax see 6.1 Configuration File). You can also use physical pathnames appropriate for the OS KOJAK is run on. IMPORTANT: if you supply a physical Windows pathname you will need to double the \ character, since it is also the escape character for strings. For example:

(load "C:\\kojak\\my-db-ontology.plm")
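If such load commands are generated by a script rather than typed by hand, the doubling of backslashes can be automated. The following is a minimal sketch in Python; the helper name powerloom_load_command is hypothetical and not part of KOJAK or PowerLoom, and it handles only the backslash case described above.

```python
def powerloom_load_command(physical_path):
    """Return a load command string for a physical pathname.

    Every backslash is doubled because it is the escape character
    inside strings. Other characters are passed through unchanged.
    """
    escaped = physical_path.replace("\\", "\\\\")
    return '(load "' + escaped + '")'

# A raw Python string keeps the single backslashes of the Windows path.
command = powerloom_load_command(r"C:\kojak\my-db-ontology.plm")
print(command)  # (load "C:\\kojak\\my-db-ontology.plm")
```

Pathnames that use forward slashes pass through this helper unchanged.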

The first file loaded is `kbs/generic-groups-ontology.plm' which defines KOJAK's generic groups ontology. This file always needs to be loaded as the first file in every configuration (future versions of KOJAK might do that automatically).

The other two files load (1) the Ali Baba ontology, and (2) the mappings between the Ali Baba ontology and KOJAK's generic groups ontology as well as a seed constraint rule. Note that the Ali Baba ontology file needs to be loaded first, since the objects defined in it are used in the seed constraints file.

8.3.6 `ali-baba-load-edb-schema.plm'

This file loads the Ali Baba EDB schema and mapping. The value of LoadEDBSchemaScript in `config/ali-baba.dat' should point to this file. It will be loaded when the KOJAK commands initialize-kojak or load-EDB-schema (which see) are called in the run script. load-EDB-schema allows the explicit specification of a script which would override what is specified in LoadEDBSchemaScript.

Since only one file `kbs/ali-baba-edb-schema.plm' is loaded by this script, LoadEDBSchemaScript could be pointed to that file directly as opposed to loading it via this script. However, it is conceivable that one might want to spread an EDB schema definition over multiple files in which case the load script needs to be used.

8.3.7 `ali-baba-load-data.plm'

This file loads any necessary data (e.g., seed information) and related information (e.g., names) into KOJAK. It also usually contains the data import rules that link up the GROUPS/linkCount relation used by KOJAK's MI component to individual parametric query relations defined in the EDB schema file. This file does not define any more mapping information but instead uses the various mappings defined in previous files.

The first command in `scripts/ali-baba-load-data.plm' retrieves a set of seed group entities based on the names given in the setof term. The surrounding assert-from-query command then creates a ThreatGroup assertion for each of these entities and also assigns a nameString for them (the latter is not actually used). For example, running this command interactively generates the following assertions (see the EDB schema file section for more information on assert-from-query):

|= (ASSERT-FROM-QUERY
     (RETRIEVE all (?pred ?o ?name)
       (AND (= ?pred |ThreatGroup|)
            (MEMBER-OF ?name (SETOF "Al Qaeda" "Needabaath" "Pavdayeen" "Ali Baba"))
            (BOUND-VARIABLES ?name)
            (EDB-Organization ?o ?name)))
     :pattern (KAPPA (?pred ?o ?name)
                (AND (HOLDS ?pred ?o)
                     (nameString ?o ?name))))
(|P|(nameString 1290 "Al Qaeda")
 |P|(KnownGroup 1290)
 |P|(nameString 1294 "Ali Baba")
 |P|(KnownGroup 1294)
 |P|(nameString 1291 "Needabaath")
 |P|(KnownGroup 1291)
 |P|(nameString 1292 "Pavdayeen")
 |P|(KnownGroup 1292))

The EDB-Organization relation defined in the EDB schema file is used to import this information from the Ali Baba database. Note that due to the prior synonym assertion for ThreatGroup this actually results in GROUPS/KnownGroup assertions.

The next command asserts known members for the seed groups imported above. Note that links of type memberAgents in the Ali Baba LINK table actually use the reverse argument order of the memberAgents relation defined in the EAGLE ontology (which is the version we are using). Therefore the group argument ?g and member argument ?m need to be flipped before we create the assertion. The EDB schema relation EDB-LinkOfType is used to access membership links directly on the Ali Baba database. This is what the command would produce when run interactively; again, due to the prior synonym assertions this actually generates GROUPS/groupMember links:

|= (ASSERT-FROM-QUERY
     (RETRIEVE ALL (?pred ?g ?m)
       (EXISTS ?id
         (AND (ThreatGroup ?g)
              (BOUND-VARIABLES ?g)
              (= ?pred memberAgents)
              (EDB-LinkOfType ?id ?m ?g (EDB-LinkTypeID ?pred)))))
     :relation HOLDS)
(|P|(groupMember 1292 732)
 |P|(groupMember 1292 782)
 |P|(groupMember 1292 536)
 |P|(groupMember 1292 882)
 ......112 assertions suppressed......
 |P|(groupMember 1290 1229)
 |P|(groupMember 1290 1230)
 |P|(groupMember 1290 1231)
 |P|(groupMember 1290 1232))

This command could have been slightly simplified as follows:

(ASSERT-FROM-QUERY
  (RETRIEVE ALL (?g ?m)
    (EXISTS ?id
      (AND (ThreatGroup ?g)
           (BOUND-VARIABLES ?g)
           (EDB-LinkOfType ?id ?m ?g (EDB-LinkTypeID memberAgents)))))
  :relation memberAgents)

Note the use of the bound-variables clause to make sure the group argument ?g is bound before EDB-LinkOfType is queried. PowerLoom's query optimizer might move clauses around which could result in the situation where EDB-LinkOfType is queried first which would logically be the same but potentially retrieve a lot of unnecessary intermediate information (all members of any groups not just the four threat groups we care about). Unfortunately, the current version of the query optimizer does not know about potential result set sizes on an external database which might result in incorrect clause order "optimizations".
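The cost difference between the two clause orders can be illustrated with a small simulation. The following is a minimal sketch in Python and does not use PowerLoom; the class ToyLinkTable, the two helper functions and the generated data are hypothetical stand-ins for an external link table that counts how many rows it has to hand back.

```python
class ToyLinkTable:
    """Stand-in for an external link table; counts the rows it returns."""

    def __init__(self, rows):
        self.rows = list(rows)   # (member, group) pairs
        self.rows_returned = 0

    def query(self, group=None):
        """Return rows, optionally restricted to one group (a bound ?g)."""
        hits = [r for r in self.rows if group is None or r[1] == group]
        self.rows_returned += len(hits)
        return hits

def members_group_bound(table, seed_groups):
    """Query the table once per seed group, with the group already bound."""
    found = []
    for g in sorted(seed_groups):
        found.extend(table.query(group=g))
    return found

def members_group_unbound(table, seed_groups):
    """Fetch every membership link first and filter by seed group afterwards."""
    return [r for r in table.query() if r[1] in seed_groups]

# Toy data: 10 groups with 100 members each; only two groups are seeds.
rows = [(f"m{i}", f"g{i % 10}") for i in range(1000)]
seeds = {"g1", "g2"}

bound_table, unbound_table = ToyLinkTable(rows), ToyLinkTable(rows)
bound = members_group_bound(bound_table, seeds)
unbound = members_group_unbound(unbound_table, seeds)

print(sorted(bound) == sorted(unbound))   # True: same logical result
print(bound_table.rows_returned)          # 200
print(unbound_table.rows_returned)        # 1000
```

Both orders return the same members, but the unbound variant transfers every membership row before filtering.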

Alternatively, group seed information could have been supplied in a CSV file, in which case these import commands should be commented out or deleted, since they would most probably conflict with the CSV data.

The next command imports links of type Terrorism and asserts them via the participatedInTerrorism relation defined in the ontology. Remember that `kbs/ali-baba-seed-constraints.plm' defines a rule that can exploit these links to infer additional group seed members.

The next section defines three data import rules that map GROUPS/linkCount used by the MI component onto the appropriate EDB link count relation defined in the EDB schema file. The rules dispatch based on the link type argument, which is one of the three expansion link type names phoneCall, communication and affiliatedWith (see the EDB schema section for more information on this).
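The dispatch performed by these three rules can be pictured as a lookup keyed on the link type name. The following is a minimal sketch in Python rather than PowerLoom rule syntax; make_link_count and the three toy_* functions are hypothetical, and the tuples they return are made up instead of being retrieved from a database.

```python
def make_link_count(count_relations):
    """Build a linkCount-style lookup that dispatches on the link type name."""
    def link_count(link_type, seeds):
        if link_type not in count_relations:
            raise KeyError(f"no count relation registered for {link_type!r}")
        return count_relations[link_type](seeds)
    return link_count

# Hypothetical stand-ins for the three EDB link count relations. Each one
# returns made-up (seed, other, count) tuples.
def toy_phone_call_count(seeds):
    return [(s, "X", 1) for s in seeds]

def toy_communication_count(seeds):
    return [(s, "Y", 2) for s in seeds]

def toy_affiliated_with_count(seeds):
    return [(s, "Z", 3) for s in seeds]

link_count = make_link_count({
    "phoneCall": toy_phone_call_count,
    "communication": toy_communication_count,
    "affiliatedWith": toy_affiliated_with_count,
})

result = link_count("communication", [1290])
print(result)  # [(1290, 'Y', 2)]
```

The caller only ever asks for link_count with a link type name, which mirrors how the MI component uses a single interface relation for all link types.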

The last rule does not have anything to do with loading data but simply supports more readable output generation. For lack of a better place it was added here. By default, entities such as groups and their members are reported via their IDs in KOJAK's report file. For the Ali Baba data this makes for somewhat unreadable output, since all IDs are simply numeric keys. To allow one to substitute names for IDs if desired, KOJAK uses the following mechanism: for each entity it is about to report, it tries to look up a name via GROUPS/nameString. If no name is found, it outputs the entity ID; otherwise, it outputs the name instead. Such names can be simply asserted or defined via a rule as done here. The rule first looks up the name of an organization or person with the help of the EDB relations EDB-Organization and EDB-AliBabaName. If a name is found, it appends the entity ID to the end of the name to ensure the result is unique. Concatenation is done via PowerLoom's string-concatenate function.
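The name substitution logic amounts to a lookup with a fallback. The following is a minimal sketch in Python and not the PowerLoom rule itself; the function report_label, the dictionary of names and the "-" separator between name and ID are assumptions made for illustration (the text above only states that the ID is appended to the name).

```python
def report_label(entity_id, lookup_name):
    """Return the label to print for an entity in a report.

    lookup_name -- callable returning a name string or None
    """
    name = lookup_name(entity_id)
    if name is None:
        return str(entity_id)        # no name known: fall back to the ID
    return f"{name}-{entity_id}"     # append the ID to keep labels unique

# Toy name table using two of the seed groups shown earlier.
names = {1290: "Al Qaeda", 1294: "Ali Baba"}

print(report_label(1290, names.get))  # Al Qaeda-1290
print(report_label(732, names.get))   # 732
```

Appending the ID matters when two different entities share the same name, since their labels would otherwise be indistinguishable in the report.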

8.3.8 `ali-baba-run-kojak.plm'

Finally, the run file determines the sequence of KOJAK commands executed to analyze a dataset such as Ali Baba. This file primarily controls what command parameters are taken from the configuration, whether (intermediary) results are deposited in the hypothesis tables of the KOJAK database, and where and how reports should be generated.

The Ali Baba run file is highly generic so we do not explain it in detail here. Refer to 7. Running the KOJAK Group Finder for more information on how to run KOJAK and how the run file controls this process.

This document was generated by Hans Chalupsky on October, 30 2007 using texi2html

8. Advanced Configuration

8.1 PowerLoom

8.2 File and Module Structure

8.2.1 Module Structure

8.3 The Ali Baba Configuration

8.3.1 `ali-baba.dat'

8.3.2 `ali-baba-ontology.plm'

8.3.3 `ali-baba-seed-constraints.plm'

8.3.4 `ali-baba-edb-schema.plm'

8.3.4.1 Database Instances

8.3.4.2 Table Mappings

8.3.4.3 Materializing Type IDs

8.3.4.4 Defining Link Count Relations

8.3.5 `ali-baba-load-ontology.plm'

8.3.6 `ali-baba-load-edb-schema.plm'

8.3.7 `ali-baba-load-data.plm'

8.3.8 `ali-baba-run-kojak.plm'
