Data describes or represents some aspects of the world. Each such aspect can be represented or conceptualized in infinitely many ways, and each representation can in turn be physically realized in many ways on a storage medium such as a computer file or a database. This openness presents a major challenge for applying and deploying a generic link discovery system such as the KOJAK Group Finder.
To do its work, KOJAK conceptualizes the world as a graph where nodes represent entities such as groups (e.g., human organizations) and their members (e.g., people), and where links represent different kinds of relationships between them, e.g., that an entity is a member of a certain group or that entity A paid money to entity B. When KOJAK is applied to a new type of dataset, the data needs to be appropriately mapped onto this internal conceptualization to generate meaningful results.
KOJAK uses the classes (or concepts) and relations defined in its generic groups ontology (see file `kbs/generic-groups-ontology.plm' and also 10. Group Finder Ontology) as its internal standard vocabulary to represent different kinds of groups, group members, membership relations, events, etc. This ontology also defines a set of abstract interface relations such as, for example, GROUPS/linkCount, which allows the mutual information component to access link statistics for different types of links. Another example is the GROUPS/nameString relation, which is used by the report generator to substitute name strings for entity IDs if such names are available.
In the simplest KOJAK configuration, where data comes from CSV files, seed groups and members are specified using this standard vocabulary, and links such as phoneCall or telecon use a standard representation format and are defined and mapped automatically by the KOJAK CSV file importer so that link statistics can be accessed via KOJAK's GROUPS/linkCount interface relation. In this case no special mapping is required, since the mapping is done in effect by the person who maps their data onto KOJAK's CSV input syntax.
There are basically three cases where this simple, automatic mapping is not sufficient:
The data uses a different vocabulary, for example, ThreatGroup instead of the canonical Group or KnownGroup. This very simple case is illustrated by the Example 2 configuration and run script (see section 7.2 Other Example Runs).
If two people are connected by a link such as robbedBankTogether, then they must be in the same group. Therefore, if group membership of one person is known, group membership of the other person could be inferred from it. Rules of this type are used in the mapping for the IET synthetic datasets; for example, see the file `kbs/iet-y3-seed-constraints.plm' and the rules inferring threat groups and membership by participation in certain types of events.
In practice, any combination of the above cases might arise. In the following we describe a set of mechanisms that can be used to configure KOJAK for such situations and to appropriately map data onto the internal representation that KOJAK needs. This will be done more or less in a tutorial style that uses the mapping for the synthetic Ali Baba dataset as an example.
8.1 PowerLoom
8.2 File and Module Structure
8.3 The Ali Baba Configuration
8.1 PowerLoom
The core mechanisms used to define mappings such as the one for the Ali Baba dataset are based on the PowerLoom knowledge representation and reasoning system. KOJAK is built on top of PowerLoom, so all of PowerLoom's functionality is available at any time. We will try to keep the description below self-contained, but a basic knowledge of PowerLoom and its capabilities will greatly facilitate understanding (see http://www.isi.edu/isd/LOOM/PowerLoom/ for more information about PowerLoom). For quick reference, an important subset of PowerLoom commands is documented in 9.1 Important PowerLoom Commands.
PowerLoom is a logic-based knowledge representation and reasoning (KR&R) system that provides a language to define classes (called concepts), relations, instances, logic rules and mappings between PowerLoom relations and external database tables. PowerLoom also has a logical inference engine and query language that allow a user (or KOJAK) to query relationships that are not explicitly represented but follow logically from the known facts and rules.
PowerLoom uses the Knowledge Interchange Format (or KIF) as its basic representation language. KIF is a version of predicate logic that uses a Lisp-based uniform expression syntax. For example, the rule that all men are mortal could be represented in KIF like this:
    (forall (?x) (=> (man ?x) (mortal ?x)))
KIF variables are symbols that start with a question mark. Note the Lisp-style prefix syntax where each expression is enclosed in parentheses and starts with the operator followed by a list of arguments. Besides the parentheses, no punctuation is required. For example, compare the following two expressions:
    man(?x) => mortal(?x)        ;; infix notation
    (=> (man ?x) (mortal ?x))    ;; KIF
Similar to the representation language, all PowerLoom commands need to be written in a Lisp-ish input syntax as well. Commands take zero or more required arguments and sometimes a set of options specified as keyword/value pairs. For example, we could use PowerLoom's assert command to tell it about the rule shown above:
    (assert (forall (?x) (=> (man ?x) (mortal ?x))))
If we then tell it that Socrates is a man via
    (assert (man socrates))
we can then ask whether Socrates is mortal like this
    (ask (mortal socrates))
to which PowerLoom would return TRUE as the result.
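Putting these pieces together, a complete session might look like the sketch below. The two defconcept declarations are an addition that the text above does not show; they are assumed to be necessary because PowerLoom rejects references to predicates that have not been declared. The sketch also assumes a case-insensitive module, so the commands can be written in lower case.

```lisp
;; Declare the two predicates before using them in rules or facts.
(defconcept man)
(defconcept mortal)

;; Rule: every man is mortal.
(assert (forall (?x) (=> (man ?x) (mortal ?x))))

;; Fact: Socrates is a man (the constant is created on first use).
(assert (man socrates))

;; Query: does it follow that Socrates is mortal?
(ask (mortal socrates))
```

The answer to the final query is not asserted anywhere; it has to be derived by applying the rule to the fact.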
Most of KOJAK's configuration files are PowerLoom files which contain a set of PowerLoom definitions and related commands. Such files can be loaded and interpreted at runtime to dynamically change the internal knowledge base and KOJAK's behavior. If KOJAK is run in interactive mode, such commands can also be typed in interactively; more examples of that are shown in some of the configuration sections below.
8.2 File and Module Structure
KOJAK divides its advanced configuration information into a set of PowerLoom knowledge base (KB) files and load scripts. There are two reasons for this separation: (1) it keeps functionally different aspects of the configuration in separate files, and (2) the information is organized into separate PowerLoom modules, and currently each KB file can only be loaded into a single module. Each configuration file (e.g., `config/ali-baba.dat') points at its own set of KB files, which are loaded during startup and initialization of KOJAK. Loading them establishes the necessary definitions and mappings, which can be completely different for two different configurations. There are the following types of KB files:
Ontology files define the vocabulary of classes and relations used by the Group Finder. Ontology files are somewhat analogous to a database schema definition. For example, `kbs/generic-groups-ontology.plm' is the central ontology file used to define the groups vocabulary of KOJAK. This file always needs to be loaded for proper operation. The file `kbs/ali-baba-ontology.plm' defines a set of additional classes and relations specific to the Ali Baba dataset. For example, it defines the relation participatedInTerrorism to represent and reason with a particular kind of information available in this dataset.
Seed constraint files define mappings between a dataset ontology and the generic groups ontology as well as rules that allow KOJAK to infer additional threat groups or seed members. For example, `kbs/ali-baba-seed-constraints.plm' defines the mappings between the Ali Baba ontology and the generic groups ontology as well as a rule that infers additional seed members based on joint participation in a terrorism event.
EDB schema files are a special kind of ontology file specifically dealing with the mapping of a relational evidence database (EDB) to the PowerLoom relations used by KOJAK. For example, `kbs/kojak-edb-schema.plm' is the file defining the mapping for KOJAK's own internal evidence database (the one into which data from CSV files gets imported) and `kbs/ali-baba-edb-schema.plm' defines the mapping for the Ali Baba EDB.
Data files define or load actual data instances such as certain groups, members, etc. We also put into these files the actual mapping rules between KOJAK's abstract interface relations such as GROUPS/linkCount and particular EDB relations such as, for example, EDB-link_count, since they are in effect data import rules (conceivably, these could also go into the associated EDB schema file). For example, `scripts/load-data.plm' is the generic data file to load data from the internal KOJAK EDB, and `scripts/ali-baba-load-data.plm' is the data file for the Ali Baba EDB.
Load files provide an extra level of indirection, e.g., to load multiple ontology files when the Group Finder ontology is loaded. For example, `scripts/ali-baba-load-ontology.plm' loads three different files. A load file could also contain actual definitions instead of just loading some other file, as is the case for `scripts/ali-baba-load-data.plm'. The load file mechanism is not absolutely necessary and might be replaced in the future by simply specifying multiple EDB or ontology files instead of specifying single load scripts as currently done via configuration parameters such as LoadOntologyScript, etc. Load files can use physical pathnames to load other files, but all files shipping with KOJAK use logical pathname syntax to load other files to make them platform and programming language independent. For more information on logical pathname syntax see 6.1 Configuration File.
These various files need to be loaded in the correct order. For example, a relation such as participatedInTerrorism needs to be defined before it can be used in an assertion, rule or query. For this reason KOJAK first loads ontology files via the value of LoadOntologyScript, then EDB schema files via the value of LoadEDBSchemaScript and, finally, data via the value of LoadDataScript.
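An ontology load script for a new dataset might look roughly like the following sketch. The file names with the `my-edb-' prefix are hypothetical, and the choice and order of files is an assumption based on the define-before-use rule described above; the shipped `scripts/ali-baba-load-ontology.plm' shows the actual arrangement.

```lisp
;; Hypothetical ontology load script: scripts/my-edb-load-ontology.plm
;; Begin with the same IN-MODULE declaration as the shipped Ali Baba
;; load script (not reproduced here).

;; Generic groups vocabulary; always required.
(LOAD "KJ:kbs;generic-groups-ontology.plm")

;; Dataset-specific concepts and relations, which may refer to generic terms.
(LOAD "KJ:kbs;my-edb-ontology.plm")

;; Mappings and seed constraint rules, which use the terms defined above.
(LOAD "KJ:kbs;my-edb-seed-constraints.plm")
```

The LOAD commands use the logical pathname syntax with the KJ: prefix, as the shipped files do, so the script does not depend on where KOJAK is installed.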
8.2.1 Module Structure
As mentioned above, PowerLoom KB files are loaded into modules. Modules are separate name and assertion spaces that can inherit from each other. Each PowerLoom KB file starts with an IN-MODULE declaration which specifies into which module the information should be loaded. The PowerLoom module system is very flexible and powerful but can also be confusing. When generating a configuration for a new dataset it is therefore best to use an existing configuration and simply mirror its IN-MODULE declarations.
Using an appropriate module structure allows us to avoid name clashes (e.g., between user terms and the built-in groups ontology) or to separate non-volatile information such as schema definitions from more volatile information such as data assertions. For example, we could rerun certain commands on a different dataset by clearing out the data module but leaving all ontology modules intact. The main modules used by KOJAK are described below:
KOJAK: This is a namespace that contains all KOJAK commands such as, for example, extend-groups. Run scripts such as `scripts/ali-baba-run-kojak.plm' use this as their module so commands can be written without the KOJAK/ module prefix. Also, if KOJAK is started up in interactive mode it will initially be in the KOJAK module.
LD-ONTOLOGY: This is the top-level ontology module of KOJAK. All ontology files should be loaded into this module. It inherits the PowerLoom kernel module to make all built-in PowerLoom definitions and commands available as well as the KOJAK module which contains all KOJAK commands.
LD: This is the module used to contain mappings and seed constraints. It inherits LD-ONTOLOGY. Seed constraint files such as `kbs/ali-baba-seed-constraints.plm' should use this module.
GROUPS: This is a namespace for the objects defined in KOJAK's generic groups ontology. The objects themselves are defined in LD-ONTOLOGY but their names come from the GROUPS namespace to prevent clashes with user-defined concepts and relations. This is the reason why these terms always need to be prefixed, as in, for example, GROUPS/linkCount.
EDB: This module is used to define database objects and schema mappings. It inherits LD-ONTOLOGY as well as commands from the PowerLoom RDBMS module. EDB schema files such as `kbs/ali-baba-edb-schema.plm' should use this module.
DATASET: This is the lowest level module intended to host data objects and assertions. It inherits all of the above. Data files such as `scripts/ali-baba-load-data.plm' should go into this module.
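For a new configuration, the module assignments above translate into the following opening declarations, one per file. The file names with the `my-edb-' prefix are hypothetical, and the sketch is derived only from the module descriptions in this section; the IN-MODULE declarations in the shipped Ali Baba files remain the authoritative reference.

```lisp
;; Each declaration below is the first form of a separate file.

;; kbs/my-edb-ontology.plm (ontology file)
(IN-MODULE "LD-ONTOLOGY")

;; kbs/my-edb-seed-constraints.plm (mappings and seed constraints)
(IN-MODULE "LD")

;; kbs/my-edb-edb-schema.plm (database objects and schema mappings)
(IN-MODULE "EDB")

;; scripts/my-edb-load-data.plm (data objects and assertions)
(IN-MODULE "DATASET")
```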
8.3 The Ali Baba Configuration
In the following we describe each and every aspect of the configuration for the synthetic Ali Baba dataset developed by SAIC. The Ali Baba configuration ships with this KOJAK release and consists of the following files:
    config/ali-baba.dat                    ...top-level config file
    kbs/ali-baba-ontology.plm              ...ontology file
    kbs/ali-baba-seed-constraints.plm      ...seed constraint file
    kbs/ali-baba-edb-schema.plm            ...EDB schema file
    scripts/ali-baba-load-ontology.plm     ...ontology load script
    scripts/ali-baba-load-edb-schema.plm   ...EDB schema load script
    scripts/ali-baba-load-data.plm         ...data file
    scripts/ali-baba-run-kojak.plm         ...run script
To create a new configuration it is best to first create appropriately named copies of each of these eight files (e.g., use a different prefix such as `my-edb-' instead of `ali-baba-') and then modify each file to fit the actual database and dataset that you are trying to map to. By starting with copies of an existing configuration each KB file will already have an appropriate IN-MODULE declaration.
Below we describe each and every file in this configuration in detail. It will be helpful to have printouts or electronic copies of these files handy when working through this section.
The file `config/ali-baba.dat' is the top-level configuration file to run KOJAK on the Ali Baba dataset. This file is what you will pass to KOJAK via the -c command line argument to configure it for this dataset, for example:
    run-kojak -c config/ali-baba.dat
The set of available KOJAK configuration parameters is described in 6. Configuration. The parameters that will almost certainly need to be adapted for your dataset are the following:
    Data +=jdbc:mysql://blackcat:3306/ali_baba_v41?user=me
    DatasetName =AliBaba
    LoadOntologyScript =KJ:scripts;ali-baba-load-ontology.plm
    LoadEDBSchemaScript =KJ:scripts;ali-baba-load-edb-schema.plm
    LoadDataScript =KJ:scripts;ali-baba-load-data.plm
    RunKojakScript =KJ:scripts;ali-baba-run-kojak.plm
    ExpansionLinkTypes +=phoneCall
    ExpansionLinkTypes +=affiliatedWith
    ExpansionLinkTypes +=communication
    ReportFile =./ali-baba-report.csv
You might want to add an additional Data entry to load seed information for your dataset, for example, from a CSV file. In the Ali Baba configuration seed information is loaded directly from the database in the `scripts/ali-baba-load-data.plm' script.
The various script variables need to point to the load and run scripts described below. The ExpansionLinkTypes entries need to correspond to the link types defined and mapped in the ontology and EDB schema files. Alternatively, you can list those directly as an argument to expand-groups in the KOJAK run script.
You might also want to adapt various analysis parameters for your dataset, e.g., to control boosting and/or thresholding. Usually that doesn't need to be done right away but will be done over the course of multiple analysis runs.
The file `kbs/ali-baba-ontology.plm' defines the Ali Baba ontology. An ontology defines a set of classes or types (in PowerLoom called "concepts") and a set of functions and relations. Under this view an ontology is roughly analogous to a database schema; however, ontologies usually also define hierarchical relationships, logic rules and constraints that further define or constrain the meaning of the defined terms.
The ontology file should define all terminology needed in addition to what is already available in the generic groups ontology. Additionally, it might define name variants or synonyms to some of the generic terms (which is what is done here).
The first part of the file defines a small person and group hierarchy that mostly mirrors the structure of `kbs/generic-groups-ontology.plm' but uses the terms ThreatGroup and memberAgents, since those are the names used in the Ali Baba EDB. Using these definitions is not strictly necessary, since we could assert imported data directly in terms of the generic groups ontology. Classes of entities such as ThreatGroup can be defined with the PowerLoom DEFCONCEPT command (note that since the LD-ONTOLOGY module is case sensitive, PowerLoom commands have to be written in all upper-case):
    (DEFCONCEPT Person)
    (DEFCONCEPT Group)
    (DEFCONCEPT ThreatGroup (Group))
    (DEFCONCEPT NonThreatGroup (Group))
    (DEFCONCEPT Event)

    (DEFRELATION memberAgents (?group (?agent Person))
      :documentation "Asserts that ?agent is a `Person' and a member of ?group.
    Note that the argument order of `memberAgents' links in the Ali Baba EDB is
    actually reversed from the one used in the EAGLE Y2 and Y3 EDB schema!")
The classes ThreatGroup and NonThreatGroup are defined as subclasses of Group, which is supplied as the superconcept in the DEFCONCEPT command. memberAgents is a binary relation between a group and a Person which mirrors KOJAK's GROUPS/groupMember relation. The second argument of the DEFRELATION command is a list of variables defining the arguments and arity of the relation. If a variable is grouped with a concept name, as done for ?agent, the concept is interpreted to be the argument type for that position (somewhat similar to a column type in a database table).
Once these classes (or concepts) and relations have been defined, they can be used to introduce instances and assert relationships between them. Let us play with these definitions a little bit to get a better idea how they are used by KOJAK. To do so we'll start KOJAK in interactive mode via the -i command-line argument, which instead of running an analysis end-to-end will bring up a command loop where we can execute KOJAK and PowerLoom commands:
    % run-kojak -i -c config/ali-baba.dat
    Running C++ version of KOJAK Group Finder...
    Initializing STELLA...
    Initializing PowerLoom...
    Initializing KOJAK...
    [2005-OCT-06 17:53:19.000 PL] Processing check-types agenda...
    +-------------------------------------------------------------+
    |          Welcome to the KOJAK Group Finder v2.2.0           |
    | Copyright (C) USC Information Sciences Institute, 1996-2005 |
    |   This software comes with ABSOLUTELY NO WARRANTY and is    |
    |    licensed for NON-COMMERCIAL EVALUATION PURPOSES ONLY!    |
    |                    All rights reserved.                     |
    +-------------------------------------------------------------+
    |=
Once we see the |= command prompt we can type commands and see their results. Let us start by examining and changing the current module with PowerLoom's cc command. Without an argument, the command will simply return the current module we are in, which initially is the KOJAK module. Given a module name as an argument, it will switch the current module to that module:
    |= (cc)
    |MDL|/KOJAK
    |= (cc LD-ONTOLOGY)
    |MDL|/PL-KERNEL-KB/PL-USER/LD-ONTOLOGY
Now we are in the LD-ONTOLOGY module and ready to experiment with some ontology definitions. Note that this module is case sensitive (as opposed to the case insensitive KOJAK module), which means all KOJAK and PowerLoom commands have to be spelled in all upper-case. Let us start by running a PowerLoom retrieve query to see whether KOJAK knows about any ThreatGroups at this point. The retrieve command takes a logic sentence as an argument where at least one argument is a variable (a question mark symbol) and tries to find bindings for the variable(s) based on what is currently asserted in the knowledge base. An optional argument in front of the sentence specifies how many answers we are looking for. By default only one answer is retrieved; the ALL argument indicates that we are looking for all answers:
    |= (RETRIEVE ALL (ThreatGroup ?g))
    ERROR: Undeclared predicate or function reference: `ThreatGroup'.
       Error occurred while parsing the proposition: (KAPPA (?g) (ThreatGroup ?g))
We get an error message, since at this point the ThreatGroup concept isn't yet defined. To define it we manually load the ontology file using the load command (in normal operation that's done automatically when initialize-kojak is called) and then run the query again.
    |= (LOAD "KJ:kbs;ali-baba-ontology.plm")
    |= (RETRIEVE ALL (ThreatGroup ?g))
    No solutions.
This time we didn't get an error but also not any solutions, since the KB is still empty. Let's add some content with PowerLoom's assert command. We first switch to the DATASET module, since that is where we usually want to store any data:
    |= (CC DATASET)
    |MDL|/PL-KERNEL-KB/PL-USER/LD-ONTOLOGY/EVIDENCE-DATABASE-SCHEMA/EDB-PRIMARY-DATA-PL/EDB-PRIMARY-DATA/DATASET
    |= (ASSERT (ThreatGroup RAF))
    |P|(ThreatGroup RAF)
    |= (ASSERT (memberAgents RAF Ulrike_Meinhof))
    |P|(memberAgents RAF Ulrike_Meinhof)
    |= (ASSERT (memberAgents RAF Andreas_Baader))
    |P|(memberAgents RAF Andreas_Baader)
Now we can ask some questions again:
    |= (RETRIEVE ALL (ThreatGroup ?g))
    [2005-OCT-06 18:00:34.000 PL] Processing check-types agenda...
    There is 1 solution:
      #1: ?g=RAF
    |= (RETRIEVE ALL (memberAgents RAF ?m))
    There are 2 solutions:
      #1: ?m=Ulrike_Meinhof
      #2: ?m=Andreas_Baader
    |= (RETRIEVE ALL (memberAgents ?g ?m))
    There are 2 solutions:
      #1: ?g=RAF, ?m=Andreas_Baader
      #2: ?g=RAF, ?m=Ulrike_Meinhof
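Two small variations on these queries can be useful when exploring a larger knowledge base. The sketch below assumes the same session state as above (the three assertions made in the DATASET module); the specific queries are illustrations and are not taken from the shipped scripts.

```lisp
;; Ask for at most one binding instead of all of them
;; (a number can be given in place of ALL).
(RETRIEVE 1 (memberAgents RAF ?m))

;; Check a single fully instantiated fact; ASK answers with a truth value
;; instead of variable bindings.
(ASK (memberAgents RAF Ulrike_Meinhof))
```

Since the membership of Ulrike_Meinhof was asserted explicitly, the ASK query should come back TRUE.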
We can also illustrate some very simple logical inference based on these definitions. For example, we can retrieve all Groups, since ThreatGroup is a subconcept of Group:
    |= (RETRIEVE ALL (Group ?g))
    There is 1 solution:
      #1: ?g=RAF
The following query retrieves all people. Note that we never explicitly asserted anybody to be a Person, but the system inferred Person-hood for the second argument of memberAgents assertions due to the type constraint given in the definition of the relation. This is different from the usual programming language semantics, where a type constraint has to be satisfied before, say, a function can be applied to a particular argument. In PowerLoom we infer that the argument must be of that type if we are given the assertion of the relation. This is similar to being told that X is Y's brother and inferring (given that X is a person) that Y must be a person, even if nobody has ever told you that explicitly.
    |= (RETRIEVE ALL (Person ?x))
    There are 2 solutions:
      #1: ?x=Andreas_Baader
      #2: ?x=Ulrike_Meinhof
This assertion of groups and their members is the basic mechanism via which KOJAK represents seed information. We haven't yet established the linkage between the terminology we used here and the generic groups ontology used by KOJAK, but once that is done (see next section) KOJAK can use queries just like the ones above to access this information.
Following these concept definitions is the definition of a nameString relation, which can be used to assert a name for a particular entity. We do use this relation in `scripts/ali-baba-load-data.plm' to assert names for threat group entities; however, these name strings are not yet used for anything. Conceivably, we could change the definition of the GROUPS/nameString rule at the end of the file to use those name strings. Just for illustration, here is how we can assert and retrieve such name strings:
    |= (ASSERT (nameString RAF "Red Army Fraction"))
    |P|(nameString RAF "Red Army Fraction")
    |= (ASSERT (nameString RAF "Rote Armee Fraktion"))
    |P|(nameString RAF "Rote Armee Fraktion")
    |= (RETRIEVE ALL (nameString RAF ?name))
    There are 2 solutions:
      #1: ?name="Red Army Fraction"
      #2: ?name="Rote Armee Fraktion"
    |= QUIT
    Really exit? (yes or no) yes
Next, the file defines participatedInTerrorism, which we will use to materialize terrorism links from the Ali Baba EDB. Our interpretation of such links is that they encode that two people committed an act of terrorism together (this interpretation might be wrong; unfortunately, the documentation of the Ali Baba dataset is somewhat lacking, so we don't know for sure). Assuming our interpretation is correct, we will use such links to infer additional seed members for groups, which will be illustrated in the next section.
The next three relations affiliatedWith, communication and phoneCall correspond to the three expansion link types that will be used by the mutual information component to extend groups. These relations will actually never be materialized but simply serve as names to link up the mutual information component's GROUPS/linkCount function to three different EDB relations. We could have done without actually defining these relations, but they allow us to document here what types of links we are using for group extension. The use and semantics of these relations will be described in more detail later.
Finally, what kind of information is exploitable to infer additional seed members or what link types are usable for group expansion via the mutual information component will of course be different for every dataset. What we used here only makes sense for the Ali Baba data and can only serve as an illustration. For each new dataset some manual analysis has to be done to determine what information is available and what portion of it might be usable by the KOJAK Group Finder (also see 5.1.2 Link Data for some more information on what types of data the Group Finder can exploit).
The seed constraints file defines two things: (1) the mapping between terms in the Ali Baba group ontology and the generic group ontology used by KOJAK, and (2) constraint rules (really just one in this case) for inferring additional seed members. There is no specific requirement that the term mapping be defined in this file; we could just as well have added it to the ontology file. However, the mapping might be more complex and involve rules which would interact with the seed constraint rules. For this reason, we usually put these two pieces together in this file so it is easier to see their connection and interaction. For example, look at `kbs/iet-y3-seed-constraints.plm' to see such a more complex mapping.
Let us experiment again with KOJAK in interactive mode to see how the seed constraints file is used. We start by loading the two ontology files but not yet the constraints file:
    % run-kojak -i -c config/ali-baba.dat
    Running C++ version of KOJAK Group Finder...
    Initializing STELLA...
    Initializing PowerLoom...
    Initializing KOJAK...
    [2005-OCT-07 13:58:47.000 PL] Processing check-types agenda...
    +-------------------------------------------------------------+
    |          Welcome to the KOJAK Group Finder v2.2.0           |
    | Copyright (C) USC Information Sciences Institute, 1996-2005 |
    |   This software comes with ABSOLUTELY NO WARRANTY and is    |
    |    licensed for NON-COMMERCIAL EVALUATION PURPOSES ONLY!    |
    |                    All rights reserved.                     |
    +-------------------------------------------------------------+
    |= (LOAD "KJ:kbs;generic-groups-ontology.plm")
    |= (LOAD "KJ:kbs;ali-baba-ontology.plm")
Next, we assert a ThreatGroup instance again and see whether we can retrieve the instance using KOJAK's generic group syntax:
    |= (CC DATASET)
    |MDL|/PL-KERNEL-KB/PL-USER/LD-ONTOLOGY/EVIDENCE-DATABASE-SCHEMA/EDB-PRIMARY-DATA-PL/EDB-PRIMARY-DATA/DATASET
    |= (ASSERT (ThreatGroup RAF))
    |P|(ThreatGroup RAF)
    |= (RETRIEVE ALL (GROUPS/KnownGroup ?g))
    No solutions.
There were no solutions. The reason is that the linkage between the Ali Baba term ThreatGroup and KOJAK's GROUPS/KnownGroup hasn't been established yet. The first section of `kbs/ali-baba-seed-constraints.plm' establishes this connection by declaring synonym relationships between KOJAK terms and Ali Baba terms. This is the simplest form of mapping where we simply declare equivalence between terms. More complicated mappings can be established by using rules. For example, suppose we have a memberAgents relation that uses arguments in the reverse order of GROUPS/groupMember. In that case we could use the following rule to map between them:
    (ASSERT (<=> (memberAgents ?a ?g)
                 (GROUPS/groupMember ?g ?a)))
We could have used rules like that instead of all the synonym mappings; however, the synonym mechanism is more efficient, since it doesn't require an extra inference step to carry out the mapping.
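As an illustration of the rule-based alternative, the term mappings for the Ali Baba vocabulary could be written as implications like the ones sketched below. This is not how the shipped `kbs/ali-baba-seed-constraints.plm' does it (it declares synonyms); the rules are one-directional and are only meant to show the shape of such a mapping.

```lisp
;; Rule-based alternative to a synonym declaration (illustration only):
;; every Ali Baba ThreatGroup is also a generic known group.
(ASSERT (=> (ThreatGroup ?g)
            (GROUPS/KnownGroup ?g)))

;; Every memberAgents assertion also counts as a generic membership.
;; Both relations take the group first and the member second.
(ASSERT (=> (memberAgents ?g ?a)
            (GROUPS/groupMember ?g ?a)))
```

With rules like these, a query in terms of the generic vocabulary has to be answered by inference from the Ali Baba assertions, which is the extra step the synonym mechanism avoids.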
Now let us load the mapping and see how it changes things. We have to clear the data module first and reassert the information, since the synonym relation will not affect assertions that have already been made. After that the query that failed above does return the result we want:
    |= (LOAD "KJ:kbs;ali-baba-seed-constraints.plm")
    |= (CLEAR-MODULE DATASET)
    |= (ASSERT (ThreatGroup RAF))
    |P|(ThreatGroup RAF)
    |= (RETRIEVE ALL (GROUPS/KnownGroup ?g))
    There is 1 solution:
      #1: ?g=RAF
The second part of the constraints file contains the following rule:
    (ASSERT (=> (AND (ThreatGroup ?g)
                     (memberAgents ?g ?p1)
                     (EXISTS (?e)
                       (participatedInTerrorism ?e ?p1 ?p2)))
                (memberAgents ?g ?p2)))
What it models is the following: If there is a threat group ?g with a member ?p1 and there is some terrorism event ?e where some other person ?p2 participated with ?p1, then that other person must also be a member of the threat group ?g. Let us see this rule in action. We start by asserting some initial seed members again and see that we can only retrieve what we asserted:
    |= (ASSERT (AND (memberAgents RAF Andreas_Baader)
                    (memberAgents RAF Ulrike_Meinhof)))
    (|P|(groupMember RAF Andreas_Baader) |P|(groupMember RAF Ulrike_Meinhof))
    |= (RETRIEVE ALL (GROUPS/groupMember ?g ?m))
    [2005-OCT-07 14:01:27.000 PL] Processing check-types agenda...
    There are 2 solutions:
      #1: ?g=RAF, ?m=Ulrike_Meinhof
      #2: ?g=RAF, ?m=Andreas_Baader
Now we assert that some new person performed a terrorist act together with a known member of the RAF. When we ask for membership again KOJAK now finds the new member through logical inference, even though membership was not explicitly asserted:
    |= (ASSERT (participatedInTerrorism
                 US_Army_Corp_Bombing Andreas_Baader Gudrun_Ensslin))
    |P|(participatedInTerrorism US_Army_Corp_Bombing Andreas_Baader Gudrun_Ensslin)
    |= (RETRIEVE ALL (GROUPS/groupMember ?g ?m))
    [2005-OCT-07 14:02:17.000 PL] Processing check-types agenda...
    There are 3 solutions:
      #1: ?g=RAF, ?m=Ulrike_Meinhof
      #2: ?g=RAF, ?m=Andreas_Baader
      #3: ?g=RAF, ?m=Gudrun_Ensslin
We can use PowerLoom's explanation mechanism to illustrate how the rule was actually used here:
    |= (SET-FEATURE JUSTIFICATIONS)
    |l|(:JUSTIFICATIONS :EMIT-THINKING-DOTS :JUST-IN-TIME-INFERENCE)
    |= (ASK (GROUPS/groupMember RAF Gudrun_Ensslin))
    TRUE
    |= (WHY)
    1 (groupMember RAF Gudrun_Ensslin)
        follows by Modus Ponens
        with substitution {?g/RAF, ?p2/Gudrun_Ensslin, ?e/US_Army_Corp_Bombing, ?p1/Andreas_Baader}
        since 1.1 ! (FORALL (?g ?p2)
                       (<= (groupMember ?g ?p2)
                           (EXISTS (?e ?p1)
                              (AND (ThreatGroup ?g)
                                   (groupMember ?g ?p1)
                                   (participatedInTerrorism ?e ?p1 ?p2)))))
        and   1.2 ! (ThreatGroup RAF)
        and   1.3 ! (participatedInTerrorism US_Army_Corp_Bombing Andreas_Baader Gudrun_Ensslin)
        and   1.4 ! (groupMember RAF Andreas_Baader)
Here we only have one constraint rule but there is no limit to the complexity of such a model. Any of the generic group terms can have rules associated with it to allow them to be inferred from other available information. For example, we might have a rule that states that an organization that received financial support from some known threat group is itself a threat group. How complex this needs to be will again strongly depend on the particular dataset and the available information. One important caveat is that such rules should be fairly reliable (but not necessarily 100%), since KOJAK currently treats such logically inferred information just as if it had been asserted explicitly (except that inferred seed members are annotated as such in the hypothesis tables of the KOJAK database).
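The financial support example could be written along the following lines. The relation receivedFundsFrom is hypothetical and does not exist in the Ali Baba ontology or in the generic groups ontology; the sketch assumes that its definition goes into the dataset ontology file and the rule into the seed constraints file.

```lisp
;; Hypothetical relation (ontology file, LD-ONTOLOGY module).
;; Both arguments are typed as Group, so asserting the relation also
;; lets PowerLoom infer Group-hood for recipient and donor.
(DEFRELATION receivedFundsFrom ((?recipient Group) (?donor Group))
  :documentation "Asserts that ?recipient received financial support
from ?donor.")

;; Seed constraint (seed constraints file, LD module): an organization
;; funded by a known threat group is itself treated as a threat group.
(ASSERT (=> (AND (ThreatGroup ?donor)
                 (receivedFundsFrom ?org ?donor))
            (ThreatGroup ?org)))
```

Because inferred threat groups are treated like asserted ones, a rule of this kind can cascade through chains of funding links, so its reliability deserves particular scrutiny before it is added to a configuration.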
KOJAK uses queries like the ones shown above to retrieve seed groups and their members. In fact, that's exactly what KOJAK's retrieve-groups command does, for example:
    |= (RETRIEVE-GROUPS)
    [2005-OCT-07 14:56:46.000 KOJAK] Found 1 known groups...
    [2005-OCT-07 14:56:46.000 KOJAK] Looking for group memberships...
    [2005-OCT-07 14:56:47.000 KOJAK] Done retrieving groups and members.
    (|GROUP|RAF)

    |= (KOJAK-LOG-OBJECTS :LOW "Found these groups:" (GET-ALL-GROUPS))
    [2005-OCT-07 14:59:11.000 KOJAK] Found these groups:
       [GROUP RAF
          known members: (|AGENT|Ulrike_Meinhof |AGENT|Andreas_Baader)
          inferred members: (|AGENT|Gudrun_Ensslin)]
Given appropriate mappings, KOJAK can retrieve this seed information even if it was asserted using a different vocabulary of classes and relations. In general, it is a good idea to use the built-in generic vocabulary wherever possible to simplify things and avoid errors; however, as the above shows, it is possible to use custom terms if so desired.
The EDB schema file defines the interface between KOJAK's internal logic-based representation (based on PowerLoom) and an external relational evidence database. Once the EDB schema file is in place the following operations are possible:
retrieve
command just like any other PowerLoom relation
The EDB schema file is usually the most complex part of every configuration. Creating it will require intimate knowledge of the content as well as the schema of the external EDB. Fortunately, the schema mapping only needs to be done for relevant tables of the database but not necessarily all of them.
PowerLoom commands to define database objects as well as the necessary supporting concepts and relations are all defined in the RDBMS namespace. This namespace is inherited in the EDB module and therefore these commands usually don't have to be explicitly qualified. Below, however, we always describe them with their namespace prefix to make clear where they come from.
8.3.4.1 Database Instances
8.3.4.2 Table Mappings
8.3.4.3 Materializing Type IDs
8.3.4.4 Defining Link Count Relations
8.3.4.1 Database Instances
In KOJAK an external database is represented via a database instance which is a logic instance of type RDBMS/SQL-Database. All physical connection information is associated with this database instance and can be changed to link the instance to a different physical database. In this sense a database instance is very similar to an ODBC or JDBC data source. Mapping definitions such as table projections or queries are always associated with the logical database instance only, to hide the physical connection details. These details would be different, for example, between the C++ and Java versions of KOJAK even if the same external database is used.
Database instances can be defined via the RDBMS/defdb command. The first command in `ali-baba-edb-schema.plm' defines the edb database instance which is the only one used (so far) by KOJAK. We don't assign any connection information, since that will be done programmatically by KOJAK using the information from the configuration file. Otherwise, it could be explicitly asserted in a KOJAK run script via KOJAK's db-assert-connection-info command.
8.3.4.2 Table Mappings
The next section in the schema file defines a fairly extensive mapping between tables in the Ali Baba evidence database and corresponding PowerLoom relations. Mapping between a relational database table and a PowerLoom relation is fairly straightforward, since a database table is simply a set of tuples. Each such table could therefore be represented in PowerLoom with a corresponding relation that has as many arguments as the database table has columns. Usually, however, we will map a database table onto one or more PowerLoom relations that each correspond to a particular projection of the database table. The reasons for this are the following:
The main command to define the mapping between a database table and a PowerLoom relation is RDBMS/deftable. It is very similar to a standard defrelation command but takes two extra arguments: (1) a logical database instance on which this table resides (e.g., edb) and (2) the name of the database table to which this PowerLoom relation corresponds. The argument list then specifies which table columns map onto which relation arguments and what, if any, type coercions should be performed. For example, let us look at the following table definition from the Ali Baba schema file:
    (DEFTABLE EDB-AliBabaName edb "AB_V41" (ID (NAME STRING)))
This defines a new binary PowerLoom relation EDB-AliBabaName (just as if we had used defrelation) and associates it with the table "AB_V41" on the database identified by edb. The table "AB_V41" is a wide table with 22 columns, some of which are shown below (the full table description is listed in the EDB schema file):
    mysql> describe ab_v41;
    +--------------+--------------+------+-----+---------+
    | Field        | Type         | Null | Key | Default |
    +--------------+--------------+------+-----+---------+
    | ID           | int(11)      |      | PRI | 0       |
    | ALIAS        | varchar(200) | YES  |     | NULL    |
    | NAME         | varchar(200) | YES  |     | NULL    |
    | POB          | varchar(200) | YES  |     | NULL    |
    | DOB          | varchar(200) | YES  |     | NULL    |
    .................................................
    | ACTIVITY     | varchar(200) | YES  |     | NULL    |
    +--------------+--------------+------+-----+---------+
    22 rows in set (0.00 sec)
The purpose of EDB-AliBabaName is to map an entity ID onto its name string defined in this table. The specifics of this projection are defined by the argument list of EDB-AliBabaName. Each argument name has to correspond to the name of some table column (arguments can be specified in the standard question-mark syntax if needed, e.g., to reference them in axioms in the definition body). If the argument is specified without a type (as with the ID argument above), argument values brought in during a query will be coerced to standard logic objects. If the argument is followed by a type (as with the NAME argument above), column values will be coerced to this PowerLoom type. Logic objects have a lot of support machinery to handle name spaces, modules, indexing, equality, etc., which makes them somewhat more heavyweight. For cases where this machinery isn't required, it is better to coerce such columns into simpler types such as STRING or INTEGER, since they occupy significantly less storage.
Let us now show how all this works with an example. We start KOJAK in interactive mode again and define the database instance and table mapping described above. For simplicity, we add the connection information right into the database definition:
    % run-kojak java -i -c config/ali-baba.dat
    Running Java version of KOJAK Group Finder...
    Initializing STELLA...
    Initializing PowerLoom...
    Initializing KOJAK...
    [2005-OCT-10 15:31:59.000 PL] Processing check-types agenda...
    +-------------------------------------------------------------+
    | Welcome to the KOJAK Group Finder v2.2.0                    |
    | Copyright (C) USC Information Sciences Institute, 1996-2005 |
    | This software comes with ABSOLUTELY NO WARRANTY and is      |
    | licensed for NON-COMMERCIAL EVALUATION PURPOSES ONLY!       |
    | All rights reserved.                                        |
    +-------------------------------------------------------------+

    |= (cc edb)
    |MDL|/PL-KERNEL-KB/PL-USER/LD-ONTOLOGY/EVIDENCE-DATABASE-SCHEMA

    |= (DEFDB edb
         :jdbc-connection-string "jdbc:mysql://blackcat.isi.edu:3306/ali_baba_v41?user=me&password=???"
         :SQL-database TRUE)
    |i|edb

    |= (DEFTABLE EDB-AliBabaName edb "AB_V41" (ID (NAME STRING)))
    |r|EDB-AliBabaName
Now we are ready to ask some queries. Note that at this point we have nothing loaded into the KOJAK knowledge base, but these queries succeed, since they look up results directly in the external database. We start by asking for the name of the person with ID 404. Since in PowerLoom "404" would be interpreted as an integer, we enclose it in vertical bars to tell the system that this is a logic constant whose name is "404":
    |= (RETRIEVE ALL (EDB-AliBabaName |404| ?name))
    [2005-OCT-10 15:36:03.000 SDBC] Connect via JDBC: jdbc:mysql://blackcat.isi.edu:3306/ali_baba_v41?user=me&password=???
    There is 1 solution:
      #1: ?name="Suraqah"
The log message shows that the connection to the database wasn't attempted until the first query was asked. The PowerLoom query above was translated into the following SQL query and its results were then fed back to the PowerLoom inference engine:
    select NAME from AB_V41 where ID='404';
What this translation exactly looks like depends on the particular binding pattern. For example, the following asks the "reverse" query from the name to the corresponding ID. Since names are not unique, we get multiple solutions here. Also note that because the ID column gets coerced into PowerLoom logic objects, they are printed with vertical bars around them to distinguish them from regular integers:
    |= (RETRIEVE ALL (EDB-AliBabaName ?id "Suraqah"))
    There are 5 solutions:
      #1: ?id=|404|
      #2: ?id=|1616|
      #3: ?id=|1293|
      #4: ?id=|1865|
      #5: ?id=|4366|
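How the binding pattern shapes the generated SQL can be pictured with a small Python sketch. It is only an approximation written for this discussion: the helper names quote and translate_projection are hypothetical, only the first translation below is confirmed by the session above, and the treatment of a fully bound query is an assumption.

```python
def quote(value):
    """Render a value as a single-quoted SQL literal, doubling embedded quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def translate_projection(table, columns, bindings):
    """Build a select statement for one binding pattern of a table relation.

    columns  -- column names in relation-argument order
    bindings -- dict mapping the columns of bound arguments to their values
    """
    unbound = [col for col in columns if col not in bindings]
    # Assumption: if every argument is bound (an ASK-style lookup) we select
    # all columns and merely check whether a row comes back.
    selected = unbound or list(columns)
    sql = f"select {', '.join(selected)} from {table}"
    constraints = [f"{col}={quote(bindings[col])}" for col in columns if col in bindings]
    if constraints:
        sql += " where " + " and ".join(constraints)
    return sql + ";"


# Bound ID, unbound NAME: the query documented above.
print(translate_projection("AB_V41", ["ID", "NAME"], {"ID": "404"}))
# Bound NAME, unbound ID: the "reverse" query.
print(translate_projection("AB_V41", ["ID", "NAME"], {"NAME": "Suraqah"}))
# Nothing bound: no where clause at all, hence potentially the whole table.
print(translate_projection("AB_V41", ["ID", "NAME"], {}))
```

Bound arguments become constraints and unbound arguments become output columns, which is why a query without any bound argument can return every row of the table.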
Next, we leave both arguments unbound. We restrict the number of answers sought to 10, since otherwise this would bring in all 6000 or so rows of this table (in general, one should always be very careful when composing queries against a database, since result sets could be very large):
    |= (RETRIEVE 10 (EDB-AliBabaName ?id ?name))
    There are 10 solutions so far:
      #1: ?id=|402|, ?name="Abidin"
      #2: ?id=|403|, ?name="Sa'eed"
      #3: ?id=|404|, ?name="Suraqah"
      #4: ?id=|405|, ?name="Fikri"
      #5: ?id=|406|, ?name="Mu'izz"
      #6: ?id=|407|, ?name="Mulhim"
      #7: ?id=|408|, ?name="Arfan"
      #8: ?id=|409|, ?name="Usama"
      #9: ?id=|410|, ?name="Ali"
      #10: ?id=|411|, ?name="Mubarak"
We can also ask TRUE/FALSE questions:
    |= (ASK (EDB-AliBabaName |402| "Abidin"))
    TRUE
To illustrate the NULL-value problem outlined above, let us define a new table relation that maps an entity ID onto its name and alias columns at the same time:
    |= (DEFTABLE EDB-AliBabaNameAlias edb "AB_V41" (ID (NAME STRING) (ALIAS STRING)))
    |r|EDB-AliBabaNameAlias
For entities that actually have both a name and an alias this does what is expected:
    |= (RETRIEVE ALL (EDB-AliBabaNameAlias |2772| ?name ?alias))
    There is 1 solution:
      #1: ?name="Gimmel Faruk", ?alias="Scar Face"
However, for entities that only have a name this now causes a problem, since the NULL value in the alias column prevents us from retrieving the row (?alias would remain unbound), so we never could retrieve the name of such an entity using EDB-AliBabaNameAlias:
    |= (RETRIEVE (EDB-AliBabaNameAlias |404| ?name ?alias))
    No solutions.
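The following Python sketch reproduces this behavior against an in-memory SQLite table. It is a simulation under stated assumptions, not PowerLoom's actual mechanism: the three-column table is a cut-down stand-in for "AB_V41", and the helper solutions is a hypothetical name that simply discards rows in which an output variable would stay unbound.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AB_V41 (ID INTEGER PRIMARY KEY, NAME TEXT, ALIAS TEXT)")
conn.executemany(
    "INSERT INTO AB_V41 VALUES (?, ?, ?)",
    [(404, "Suraqah", None), (2772, "Gimmel Faruk", "Scar Face")],
)


def solutions(columns, entity_id):
    """Return the rows of a projection that bind every output variable."""
    # Column names come from the fixed lists below, never from user input.
    sql = f"SELECT {', '.join(columns)} FROM AB_V41 WHERE ID = ?"
    rows = conn.execute(sql, (entity_id,)).fetchall()
    # A NULL column would leave its variable unbound, so the row is dropped.
    return [row for row in rows if all(value is not None for value in row)]


print(solutions(["NAME", "ALIAS"], 2772))  # name and alias both present
print(solutions(["NAME", "ALIAS"], 404))   # NULL alias: no solution
print(solutions(["NAME"], 404))            # narrower projection still works
```

The last call shows why a table is usually mapped onto several narrow projections: the name of entity 404 remains retrievable as long as the relation does not also require a value for the alias column.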
The next 16 deftable definitions in `kbs/ali-baba-edb-schema.plm' provide a fairly exhaustive mapping of the Ali Baba schema whose table structure follows the Year-2 EAGLE EDB schema. For the Ali Baba data only EDB-LinkOfType, EDB-Organization and EDB-AliBabaName are actually used to import data. Additionally, the various type ID tables are used to import type ID information. One important "wrinkle" of this mapping is that all type IDs (e.g., the first argument of EDB-EntityType) are explicitly coerced to integers to avoid conflicts with actual people entities whose ID might be the same.
Note that table relations such as EDB-AliBabaName can only be used to query a database. They can also be asserted via the PowerLoom assert command; however, such assertions do not update the database (future versions of PowerLoom might provide this feature). Also, database table queries do not materialize any assertions on the PowerLoom end (apart from caching). If that is desired, PowerLoom's assert-from-query command can be used (more on that below).
8.3.4.3 Materializing Type IDs
What is described in this section is very specific to the "typeless" nature of the EAGLE EDB schema which is unlikely to occur often in real-life databases. The concepts and mechanisms described here might still be useful, but if you are not dealing with such a type of schema you could safely skip this section.
The Year-2 EAGLE EDB schema is a very flexible and extensible schema where a single entity table holds entities of many different types, and, similarly, a single link table holds entity-entity links of many different types. This allows the introduction of new entity or link types without having to change the database schema. Instead of having separate tables for each entity and link type, each entry in the EDB ENTITY and LINK tables is associated with a type ID. Type IDs are mapped onto descriptive type names via tables such as ENTITYTYPE and LINKTYPE. Many of these descriptive names originate in an EDB ontology developed by Cycorp, but the Ali Baba database also uses many new and undocumented types.
Because of this "typeless" nature of the EAGLE EDB schema, each query will usually have to specify entity and link type IDs to restrict the types of entities and links that should be considered. This can convolute query expressions significantly, since it usually involves an extra constraint for the type ID plus an extra join if one doesn't want to hardcode integer type IDs but actually wants to key in on their descriptive name. For example, the following query would retrieve all organizations in the entity table:
    select e.ENTITYID
    from ENTITY e, ENTITYTYPE et
    where e.ENTITYTYPEID=et.ENTITYTYPEID
      and et.CYCCOLLECTION='Organization';
Similarly, using the PowerLoom mapping we would have to phrase this query as follows:
    (RETRIEVE ALL (AND (EDB-EntityType ?typeid Organization)
                       (EDB-EntityOfType ?id ?typeid)))
To avoid the extra join and database roundtrip as well as hard-coding integer type IDs in our queries, we define a set of PowerLoom functions that can map an entity type such as Organization onto its type ID by simply looking it up in a local table. This allows us to rephrase the query above like this:
    (RETRIEVE ALL (EDB-EntityOfType ?id (EDB-EntityTypeID Organization)))
This looks up the type ID on the PowerLoom side and translates into the following much simpler SQL query:
    select ENTITYID from ENTITY where ENTITYTYPEID=1071;
The type ID tables represented by the four PowerLoom functions EDB-EntityTypeID, EDB-LinkTypeID, EDB-EntityAttributeTypeID and EDB-LinkAttributeTypeID are populated using PowerLoom's assert-from-query command. For example:
    (ASSERT-FROM-QUERY
      (RETRIEVE ALL (?pred ?id) (EDB-EntityType ?id ?pred))
      :relation EDB-EntityTypeID)
assert-from-query takes a retrieve command as its argument and then creates an assertion for each solution (or set of variable bindings) the query generates. There are different ways of creating these assertions (see the documentation of assert-from-query for more detail). The way used here is to provide a :relation argument which will assert the given relation for each set of bindings retrieved by the query. For example, one of the 122 assertions created by the above is the following:
    (ASSERT (EDB-EntityTypeID Organization 1071))
Note that in the query above we supply the set of query variables (similar to output columns in a select statement) to get the appropriate argument order when bindings are supplied to EDB-EntityTypeID.
assert-from-query is very useful to selectively materialize portions of the external DB to avoid having to query for the same information over and over again. This mechanism will be used again when we discuss `scripts/ali-baba-load-data.plm'.
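The idea of materializing the type ID table once and then querying with a literal ID can be sketched in Python with an in-memory SQLite database. This is an illustration only: the rows are toy data (the pairing of 1074 with 'Person' in particular is an assumption), and entity_type_id and entities_of_type are hypothetical names standing in for EDB-EntityTypeID and a query over EDB-EntityOfType.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ENTITYTYPE (ENTITYTYPEID INTEGER PRIMARY KEY, CYCCOLLECTION TEXT);
    CREATE TABLE ENTITY (ENTITYID INTEGER PRIMARY KEY, ENTITYTYPEID INTEGER);
    INSERT INTO ENTITYTYPE VALUES (1071, 'Organization');
    INSERT INTO ENTITYTYPE VALUES (1074, 'Person');
    INSERT INTO ENTITY VALUES (38, 1074);
    INSERT INTO ENTITY VALUES (1289, 1071);
    INSERT INTO ENTITY VALUES (1290, 1071);
""")

# One-time materialization: one local entry per row of the type table,
# comparable to one assertion per solution of the retrieve query.
entity_type_id = {
    name: type_id
    for type_id, name in conn.execute("SELECT ENTITYTYPEID, CYCCOLLECTION FROM ENTITYTYPE")
}


def entities_of_type(type_name):
    """Look up the type ID locally, then run the simple single-table query."""
    type_id = entity_type_id[type_name]  # no join, no extra roundtrip
    rows = conn.execute(
        "SELECT ENTITYID FROM ENTITY WHERE ENTITYTYPEID = ? ORDER BY ENTITYID",
        (type_id,),
    )
    return [row[0] for row in rows]


print(entity_type_id["Organization"])
print(entities_of_type("Organization"))
```

The lookup table is filled by a single query when the data is loaded; every later type-restricted query then touches only the ENTITY table.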
8.3.4.4 Defining Link Count Relations
Probably the most important but also most challenging part of the Ali Baba EDB schema file is the section that defines the three link count relations EDB-phoneCallCount, EDB-communicationCount and EDB-affiliatedWithCount. These relations are used by KOJAK's mutual information (MI) component to perform group extension and compute connection strengths between individuals. The challenge here is not so much with the KOJAK or PowerLoom interface, but with the formulation of appropriate SQL queries which can become somewhat complex.
The MI component starts with a set of group seed entities and in its first step builds an extended graph around them. To do this it looks at each seed entity and finds all other entities connected to it by a link of one of the expansion link types listed in the configuration or directly supplied to the expand-groups command (which see). In addition to looking for entities connected via such links, it also retrieves link counts, that is, how many links of a particular type lead to some new entity X. For example, it might retrieve that from seed entity 38 there are two communication links to entity 1155. These link counts are an important input to the MI computation.
The MI component uses GROUPS/linkCount as the central interface relation to access links and link counts from group seeds. This relation takes a set of seeds and an expansion link type as arguments and returns the retrieved entities and associated counts. For example, let us look at the following query:
    |= (RETRIEVE ALL (GROUPS/linkCount communication (SETOF |342| |38|) ?p1 ?p2 ?c))
    There are 9 solutions:
      #1: ?p1=|38|, ?p2=|2|, ?c=1
      #2: ?p1=|38|, ?p2=|22|, ?c=1
      #3: ?p1=|38|, ?p2=|264|, ?c=1
      #4: ?p1=|38|, ?p2=|639|, ?c=1
      #5: ?p1=|38|, ?p2=|1155|, ?c=2
      #6: ?p1=|342|, ?p2=|2|, ?c=1
      #7: ?p1=|342|, ?p2=|214|, ?c=1
      #8: ?p1=|342|, ?p2=|247|, ?c=1
      #9: ?p1=|342|, ?p2=|277|, ?c=1
In this query we supplied communication as the link type we are interested in and a set of two seed entities represented by the setof term. ?p1 will be bound to each seed member in the set and ?p2 to the various entities ?p1 is connected to via a communication link. The last argument reports how many such links there are between the two entities.
For the Ali Baba dataset, the actual retrieval of these links and counts is done by the EDB-communicationCount relation which is defined in `kbs/ali-baba-edb-schema.plm'. For example, we could have used this relation directly to ask the query:
    |= (RETRIEVE ALL (EDB-communicationCount (SETOF |342| |38|) ?p1 ?p2 ?c))
    There are 9 solutions:
      #1: ?p1=|38|, ?p2=|2|, ?c=1
      #2: ?p1=|38|, ?p2=|22|, ?c=1
      #3: ?p1=|38|, ?p2=|264|, ?c=1
      #4: ?p1=|38|, ?p2=|639|, ?c=1
      #5: ?p1=|38|, ?p2=|1155|, ?c=2
      #6: ?p1=|342|, ?p2=|2|, ?c=1
      #7: ?p1=|342|, ?p2=|214|, ?c=1
      #8: ?p1=|342|, ?p2=|247|, ?c=1
      #9: ?p1=|342|, ?p2=|277|, ?c=1
GROUPS/linkCount and EDB-communicationCount are connected via the following rule which is defined in `scripts/ali-baba-load-data.plm' (since it is a data import rule, although it could have been defined in the EDB schema file as well):
    (ASSERT (=> (AND (BOUND-VARIABLES ?seeds)
                     (= ?ltype communication)
                     (EDB-communicationCount ?seeds ?party1 ?party2 ?count))
                (GROUPS/linkCount ?ltype ?seeds ?party1 ?party2 ?count)))
So, whenever the MI component asks a GROUPS/linkCount query where ?ltype is bound to communication, PowerLoom simply backchains into EDB-communicationCount which then does all the work. The bound-variables clause is there for safety and makes sure that the query is never asked with ?seeds unbound (which could only happen if asked manually). The reason for this safeguard is that such a completely unconstrained query might bring in the whole database, which could be very large and take a very long time.
The other two link types phoneCall and affiliatedWith are connected to corresponding EDB relations with similar rules. Now it should become clear why these relations defined in `kbs/ali-baba-ontology.plm' only serve as names but are never actually asserted or queried.
Having explained how GROUPS/linkCount is hooked up to EDB relations that can retrieve such links and associated counts, all that remains to be done is to show how an EDB relation such as EDB-communicationCount can be defined. Such link count relations are defined via parametric SQL queries specified via PowerLoom's RDBMS/defquery command (which see). For example, EDB-communicationCount is defined like this:
    (DEFQUERY EDB-communicationCount (?seeds ?party1 ?party2 (?count INTEGER))
      :query-pattern (RDBMS/SQL-QUERY edb
        "SELECT ' ', P1, P2, COUNT(P2) FROM...<complicated SQL query>..."))
The RDBMS/defquery command defines a PowerLoom relation with the given name and arguments and links it to an SQL query whose arity (number of output columns) needs to match the arity of the relation. SQL output columns will be assigned by position to relation arguments; therefore, variable names such as ?seeds do not have to match a column name as with the RDBMS/deftable command. Similar to RDBMS/deftable, relation arguments can be associated with a type (e.g., as done for ?count) to force type coercion.
The value of the :query-pattern argument has to be of the following form:
    (RDBMS/SQL-QUERY <db-instance> "<sql query>")
The first element in this list is RDBMS/SQL-QUERY which is the name of a PowerLoom query specialist to handle arbitrary parametric SQL queries. RDBMS/defquery is very general and supports other queries such as joins or non-SQL queries as well, but for KOJAK SQL queries are all we need. <db-instance> needs to be a database instance which will usually be edb.
Writing Parametric SQL Queries
For the parametric SQL query there are two types of complexity that we need to handle: (1) how to pass in bound arguments that will result in appropriate constraints, and (2) writing an appropriate SQL query that can compute the desired link counts. Before we explain how count queries need to be constructed, let us explain the parametric query mechanism with a simple example. Below we define a relation myEntityOfType which has essentially the same functionality as EDB-EntityOfType which was defined via RDBMS/deftable earlier in the EDB schema file:
    (DEFQUERY myEntityOfType (?id (?typeID INTEGER))
      :query-pattern (RDBMS/SQL-QUERY edb
        "SELECT ENTITYID, ENTITYTYPEID
         FROM ENTITY
         WHERE ENTITYID='?id'
           AND ENTITYTYPEID IN '?typeID'"))
The SQL query has two output columns to match the relation variables ?id and ?typeID. The rest of the SQL query is fairly straightforward. What makes the query parametric is the mechanism by which relation variables are bound to column constraints. For example, the constraint
    ENTITYID='?id'
would be translated into
    ENTITYID='38'
if ?id is bound to 38. The constraint
    ENTITYTYPEID IN '?typeID'
would be translated into
    ENTITYTYPEID IN ('1074')
if ?typeID is bound to 1074. This allows us to answer a query such as this:
    |= (ASK (myEntityOfType |38| 1074))
    TRUE
If the relation variable in such a column constraint is unbound, the whole constraint is eliminated in the corresponding SQL translation (i.e., replaced by TRUE), which allows us to run queries like the following:
    |= (RETRIEVE ALL (myEntityOfType |38| ?type))
    There is 1 solution:
      #1: ?type=1074

    |= (RETRIEVE 5 (myEntityOfType ?id 1071))
    There are 5 solutions so far:
      #1: ?id=|1289|
      #2: ?id=|1290|
      #3: ?id=|1291|
      #4: ?id=|1292|
      #5: ?id=|1293|
Finally, binding a relation variable to a set of terms will translate into an appropriate IN constraint (even if the SQL query was specified with an = constraint for that column):
    |= (RETRIEVE ALL (myEntityOfType (SETOF |38| |1296|) ?type))
      #1: ?type=1074
      #2: ?type=1071
This last PowerLoom query was translated into the following SQL query:
    SELECT ENTITYID, ENTITYTYPEID
    FROM ENTITY
    WHERE ENTITYID IN ('38', '1296')
      AND TRUE
The parametric constraints shown above use a simplified, restricted syntax that has to follow one of the following patterns (<ws> means one or more whitespace characters):
    <ws><column>=<?var><ws>
    <ws><column>='<?var>'
    <ws><column>="<?var>"
    <ws><column>=`<?var>`
    <ws><column> {in|IN} <?var><ws>
    <ws><column> {in|IN} '<?var>'
    <ws><column> {in|IN} "<?var>"
    <ws><column> {in|IN} `<?var>`
Note that the amount (or lack) of whitespace in each pattern is very specific and has to be strictly followed for the constraint to work. If quotes are used, the substituted value will be appropriately escaped for the chosen quote character. A more general parameter substitution syntax is also available to support the construction of more complex queries, but it is beyond the scope of this document.
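The three substitution cases described above (bound to a single value, bound to a set, unbound) can be summarized in a short Python sketch. This is not PowerLoom's implementation: instantiate and sql_literal are hypothetical names, only the single-quote forms of the patterns are handled, and a Python list stands in for a setof term so that the output order is predictable.

```python
import re

# Matches <column>='<?var>' and <column> {in|IN} '<?var>' (single quotes only).
_CONSTRAINT = re.compile(r"(?P<col>[\w.]+)(?P<op>=|\s+(?:in|IN)\s+)'(?P<var>\?\w+)'")


def sql_literal(value):
    """Quote a value with single quotes, doubling any embedded quote."""
    return "'" + str(value).replace("'", "''") + "'"


def instantiate(template, bindings):
    """Replace every parametric constraint according to the variable bindings."""

    def substitute(match):
        col, op, var = match.group("col", "op", "var")
        if var not in bindings:
            return "TRUE"  # unbound: the constraint is eliminated
        value = bindings[var]
        if isinstance(value, (list, tuple)):
            # Bound to a set of terms: always an IN constraint.
            return f"{col} IN ({', '.join(sql_literal(v) for v in value)})"
        if op == "=":
            return f"{col}={sql_literal(value)}"
        return f"{col} IN ({sql_literal(value)})"

    return _CONSTRAINT.sub(substitute, template)


template = ("SELECT ENTITYID, ENTITYTYPEID FROM ENTITY "
            "WHERE ENTITYID='?id' AND ENTITYTYPEID IN '?typeID'")
print(instantiate(template, {"?id": 38, "?typeID": 1074}))
print(instantiate(template, {"?id": [38, 1296]}))
```

The second call yields the same WHERE clause as the translated query shown earlier, ENTITYID IN ('38', '1296') AND TRUE. A regular expression is a convenient stand-in here precisely because the real syntax is so rigid about whitespace and quoting.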
Writing Link Count Queries
With this machinery in hand we can now go on to write link count queries for relations such as EDB-communicationCount. Let us again look at the overall structure of this relation:
    (DEFQUERY EDB-communicationCount (?seeds ?party1 ?party2 (?count INTEGER))
      :query-pattern (RDBMS/SQL-QUERY edb
        "SELECT ' ', P1, P2, COUNT(P2) FROM...<complicated SQL query>..."))
It needs to have four arguments as shown above (which could be given different names). The first is an "input" argument which will be bound to a set of seeds when the MI component asks the query; the remaining arguments are "output" arguments which will be generated by the embedded parametric SQL query.
The output columns of the SQL query should look as follows (again names could be chosen differently):
    SELECT ' ', P1, P2, COUNT(P2) ...
The first value ' ' is only needed to match up with the ?seeds variable position. Since that variable will be bound at query time we don't have to generate any actual output here. Make sure you use ' ' (a single space) and not '' (the empty string), since the latter is interpreted as a NULL value on some database systems (e.g., Oracle) which would cause the query to fail in PowerLoom.
The next output column, called P1 above, will be bound to each of the seed members supplied in ?seeds. The third output column P2 needs to generate all the entities (or nodes) that each seed member is connected to via the particular link type handled by this relation (communication in our example). Finally, the last column needs to generate the number of links of the particular type for each P1,P2 node pair.
The query should generate exactly one row for each node pair which should contain the total number of links. Multiple entries as in
    ......
    ' ', 342, 214, 1
    ......
    ' ', 342, 214, 2
    ......
will not be aggregated by KOJAK (the second entry will simply be ignored). Duplicate entries that simply permute the arguments should be avoided but will be ignored if they occur, for example:
    ......
    ' ', 342, 214, 2
    ......
    ' ', 214, 342, 2
    ......
Such duplicates cannot always be avoided by the query, since there might be links between two seed members.
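The row handling described above can be mimicked with a few lines of Python. The sketch is only a model of the stated behavior (later rows for the same node pair are ignored, whether repeated or permuted) and says nothing about how KOJAK implements it; collect_link_counts is a hypothetical name.

```python
def collect_link_counts(rows):
    """Keep the first row seen for each unordered node pair.

    rows -- iterable of (seed_slot, p1, p2, count) tuples as produced by a
            link count query; seed_slot is the ' ' placeholder column.
    """
    first_seen = {}
    for _seed_slot, p1, p2, count in rows:
        pair = frozenset((p1, p2))  # 342/214 and 214/342 are the same pair
        if pair not in first_seen:
            first_seen[pair] = (p1, p2, count)
        # Later rows for the same pair are ignored, never added up.
    return list(first_seen.values())


rows = [(" ", 342, 214, 1), (" ", 342, 214, 2), (" ", 214, 342, 2)]
print(collect_link_counts(rows))
```

Only the first row survives with its count of 1, which is why the query itself has to deliver one fully aggregated row per node pair.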
Let us emphasize again that KOJAK performs best with high frequency transaction-type links such as, for example, communication events (see 5.1.2 Link Data for more discussion on that). If such data is available, it is important to compute accurate link counts for best performance. That is, it is much better for KOJAK to know that there were 23 phone calls and 17 financial transactions between two entities as opposed to just knowing that there were one or more but without knowing how many. The following example shows how such link counts can be computed for 'Communications' links in the Ali Baba data. How this is done exactly will differ for each different evidence database schema and each link type. The parametric SQL query for EDB-communicationCount looks as follows:
    SELECT ' ', P1, P2, COUNT(P2)
    FROM (SELECT ENTITYID_ARG1 P1, ENTITYID_ARG2 P2, l.LINKID
          FROM LINK l, LINKTYPE lt, ENTITY e, ENTITYTYPE et
          WHERE l.LINKTYPEID=lt.LINKTYPEID
            AND lt.CYCCOLLECTION='Communications'
            AND l.ENTITYID_ARG2=e.ENTITYID
            AND e.ENTITYTYPEID=et.ENTITYTYPEID
            AND et.CYCCOLLECTION='Person'
            AND l.ENTITYID_ARG1 IN '?seeds'
            AND l.ENTITYID_ARG1='?party1'    -- optional
            AND l.ENTITYID_ARG2='?party2'    -- optional
          UNION
          SELECT l.ENTITYID_ARG2 P1, l.ENTITYID_ARG1 P2, l.LINKID
          FROM LINK l, LINKTYPE lt, ENTITY e, ENTITYTYPE et
          WHERE l.LINKTYPEID=lt.LINKTYPEID
            AND lt.CYCCOLLECTION='Communications'
            AND l.ENTITYID_ARG2=e.ENTITYID
            AND e.ENTITYTYPEID=et.ENTITYTYPEID
            AND et.CYCCOLLECTION='Person'
            AND ENTITYID_ARG2 IN '?seeds'
            AND l.ENTITYID_ARG1='?party2'    -- optional
            AND l.ENTITYID_ARG2='?party1')   -- optional
         SUBQUERY
    GROUP BY P1, P2
The clauses marked as optional don't really need to be there, since the ?party1 and ?party2 variables will not be bound in queries generated by the MI component. They could be bound in manual queries, though, which is why they are added here. The query is structured as follows: the inner query named SUBQUERY retrieves all links in the LINK table whose link type is Communications and whose arguments are of type Person. Due to the fragmented nature of this schema this takes a somewhat complex join. The ?seeds argument constrains the arguments of each link. Since a ?seeds argument could be the first or second argument of such a link, we have to ask the query twice and union the results (note that an OR would not do the same thing here, since it would not guarantee that column P1 contains only seed members). For each link we also output its LINKID to make sure that we get separate rows for each Communications link between two people. In the outer query we then group the result of the subquery first by seed members and then by second entity which allows us to compute accurate link counts for the final output column.
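The interplay of UNION, LINKID and GROUP BY is easy to try out on a toy table. The Python sketch below uses an in-memory SQLite database with a simplified LINK table (columns ARG1 and ARG2, no type joins) and invented rows; link_counts is a hypothetical helper, and the keep_link_id switch exists only to show what goes wrong when LINKID is left out of the subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LINK (LINKID INTEGER PRIMARY KEY, ARG1 INTEGER, ARG2 INTEGER)")
conn.executemany(
    "INSERT INTO LINK VALUES (?, ?, ?)",
    # Two links between 38 and 1155, one in each direction.
    [(1, 38, 1155), (2, 1155, 38), (3, 38, 22), (4, 342, 214)],
)


def link_counts(seeds, keep_link_id=True):
    """Count links per (seed, neighbor) pair, seeds always in column P1."""
    marks = ", ".join("?" for _ in seeds)
    link_id = ", LINKID" if keep_link_id else ""
    sql = f"""
        SELECT ' ', P1, P2, COUNT(P2)
        FROM (SELECT ARG1 P1, ARG2 P2{link_id} FROM LINK WHERE ARG1 IN ({marks})
              UNION
              SELECT ARG2 P1, ARG1 P2{link_id} FROM LINK WHERE ARG2 IN ({marks})) SUBQUERY
        GROUP BY P1, P2
        ORDER BY P1, P2
    """
    return conn.execute(sql, list(seeds) + list(seeds)).fetchall()


print(link_counts([38, 342]))                      # 38/1155 is counted twice
print(link_counts([38, 342], keep_link_id=False))  # UNION merges the two links
```

With LINKID in the subquery the pair 38/1155 receives a count of 2; without it, UNION removes the second row as a duplicate and the count drops to 1.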
The other two link count relations EDB-phoneCallCount and EDB-affiliatedWithCount are defined similarly. EDB-phoneCallCount finds pairs of entities that are linked by having called the same phone number (which is more indirect than having called each other; however, direct phone call information doesn't seem to be available in this dataset). EDB-affiliatedWithCount uses the most complex SQL query, since it finds pairs of entities that are either directly affiliatedWith each other or affiliated via an intermediary Event or Organization. Conceivably, these two different kinds of affiliations could be considered to be separate link types and handled by separate relations.
The way KOJAK computes and uses these link count statistics is one of the main reasons for its scalability. Instead of having to look at each individual link of a particular type between two entities, all of them can be abstracted into a single link count tuple. KOJAK also only computes these link counts in a very focused way starting from group seeds instead of having to do that for the whole database. Finally, by off-loading this data aggregation to the database, much less data has to be moved and the power of the relational database system can be exploited to perform these computations over large datasets. Nevertheless, depending on the nature and connectivity of the data, such queries can be time consuming and put significant stress on the database server.
This file loads the Ali Baba ontology, the Ali Baba/groups ontology mapping and constraint rules. The value of LoadOntologyScript in `config/ali-baba.dat' should point to this file. It will be loaded when the KOJAK commands initialize-kojak or load-kojak-ontology (which see) are called in the run script. load-kojak-ontology allows the explicit specification of a script which would override what is specified in LoadOntologyScript.
PowerLoom KB files are loaded using PowerLoom's load command. The script is loaded into the KOJAK module which is case-insensitive. This means that commands can be spelled in upper or lower case. If the load command were to be used in a case-sensitive module such as DATASET, it would have to be spelled in all upper-case letters.
load takes the name of the file to be loaded as its argument. In this load script all files are specified via logical pathnames which will ensure their proper translation regardless of which version of KOJAK you are using and which OS it is run on (for more information on logical pathname syntax see 6.1 Configuration File). You can also use physical pathnames appropriate for the OS KOJAK is run on. IMPORTANT: if you supply a physical Windows pathname you will need to double the \ character, since it is also the escape character for strings. For example:
(load "C:\\kojak\\my-db-ontology.plm")
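If load scripts are generated by a tool rather than written by hand, the backslash doubling can be automated. The following Python sketch is an assumption-laden illustration: the helper name `load_command` is hypothetical and not part of KOJAK or PowerLoom, and it handles only the backslash case described above.

```python
def load_command(physical_path: str) -> str:
    """Build a load command string for a physical pathname.

    Hypothetical helper: doubles each backslash so that it survives
    string escape processing, as required for Windows pathnames.
    """
    escaped = physical_path.replace("\\", "\\\\")
    return f'(load "{escaped}")'


# A raw string keeps the single backslashes of the Windows path intact.
print(load_command(r"C:\kojak\my-db-ontology.plm"))
# prints: (load "C:\\kojak\\my-db-ontology.plm")
```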
The first file loaded is `kbs/generic-groups-ontology.plm' which defines KOJAK's generic groups ontology. This file always needs to be loaded as the first file in every configuration (future versions of KOJAK might do that automatically).
The other two files load (1) the Ali Baba ontology, and (2) the mappings between the Ali Baba ontology and KOJAK's generic groups ontology as well as a seed constraint rule. Note that the Ali Baba ontology file needs to be loaded first, since the objects defined in it are used in the seed constraints file.
This file loads the Ali Baba EDB schema and mapping. The value of LoadEDBSchemaScript in `config/ali-baba.dat' should point to this file. It will be loaded when the KOJAK commands initialize-kojak or load-EDB-schema (which see) are called in the run script. load-EDB-schema allows the explicit specification of a script which would override what is specified in LoadEDBSchemaScript.
Since only one file `kbs/ali-baba-edb-schema.plm' is loaded by this script, LoadEDBSchemaScript could be pointed to that file directly as opposed to loading it via this script. However, it is conceivable that one might want to spread an EDB schema definition over multiple files in which case the load script needs to be used.
This file loads any necessary data (e.g., seed information) and related information (e.g., names) into KOJAK. It also usually contains the data import rules that link up the GROUPS/linkCount relation used by KOJAK's MI component to individual parametric query relations defined in the EDB schema file. This file does not define any more mapping information but instead uses the various mappings defined in previous files.
The first command in `scripts/ali-baba-load-data.plm' retrieves a set of seed group entities based on the names given in the setof term. The surrounding assert-from-query command then creates a ThreatGroup assertion for each of these entities and also assigns a nameString for them (the latter is not actually used). For example, running this command interactively generates the following assertions (see the EDB schema file section for more information on assert-from-query):
|= (ASSERT-FROM-QUERY
     (RETRIEVE all (?pred ?o ?name)
       (AND (= ?pred |ThreatGroup|)
            (MEMBER-OF ?name (SETOF "Al Qaeda" "Needabaath" "Pavdayeen" "Ali Baba"))
            (BOUND-VARIABLES ?name)
            (EDB-Organization ?o ?name)))
     :pattern (KAPPA (?pred ?o ?name)
                (AND (HOLDS ?pred ?o)
                     (nameString ?o ?name))))
(|P|(nameString 1290 "Al Qaeda")
 |P|(KnownGroup 1290)
 |P|(nameString 1294 "Ali Baba")
 |P|(KnownGroup 1294)
 |P|(nameString 1291 "Needabaath")
 |P|(KnownGroup 1291)
 |P|(nameString 1292 "Pavdayeen")
 |P|(KnownGroup 1292))
The EDB-Organization relation defined in the EDB schema file is used to import this information from the Ali Baba database. Note that due to the prior synonym assertion for ThreatGroup this actually results in GROUPS/KnownGroup assertions.
The next command asserts known members for the seed groups imported above. Note that links of type memberAgents in the Ali Baba LINK table actually use the reverse argument order of the memberAgents relation defined in the EAGLE ontology (which is the version we are using). Therefore the group argument ?g and member argument ?m need to be flipped before we create the assertion.
The EDB schema relation EDB-LinkOfType is used to access membership links directly on the Ali Baba database. This is what the command would produce when run interactively; again, due to the prior synonym assertions this actually generates GROUPS/groupMember links:
|= (ASSERT-FROM-QUERY
     (RETRIEVE ALL (?pred ?g ?m)
       (EXISTS ?id
         (AND (ThreatGroup ?g)
              (BOUND-VARIABLES ?g)
              (= ?pred memberAgents)
              (EDB-LinkOfType ?id ?m ?g (EDB-LinkTypeID ?pred)))))
     :relation HOLDS)
(|P|(groupMember 1292 732)
 |P|(groupMember 1292 782)
 |P|(groupMember 1292 536)
 |P|(groupMember 1292 882)
 ......112 assertions suppressed......
 |P|(groupMember 1290 1229)
 |P|(groupMember 1290 1230)
 |P|(groupMember 1290 1231)
 |P|(groupMember 1290 1232))
This command could have been slightly simplified as follows:
(ASSERT-FROM-QUERY
  (RETRIEVE ALL (?g ?m)
    (EXISTS ?id
      (AND (ThreatGroup ?g)
           (BOUND-VARIABLES ?g)
           (EDB-LinkOfType ?id ?m ?g (EDB-LinkTypeID memberAgents)))))
  :relation memberAgents)
Note the use of the bound-variables clause to make sure the group argument ?g is bound before EDB-LinkOfType is queried. PowerLoom's query optimizer might move clauses around, which could result in the situation where EDB-LinkOfType is queried first. This would logically be the same but could retrieve a lot of unnecessary intermediate information (all members of all groups, not just the four threat groups we care about). Unfortunately, the current version of the query optimizer does not know about potential result set sizes on an external database, which might result in incorrect clause order "optimizations".
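The cost difference between the two clause orders can be illustrated with a small Python sketch. This is not PowerLoom code and does not model its optimizer; the data, the function names `links_first` and `seeds_first`, and the row counts are all made up for illustration. Both strategies produce the same answers, but they differ in how many rows have to be fetched from the external database.

```python
# Hypothetical membership links as (link_id, member, group) rows held by the
# "database". 50 groups with 10 members each; values are invented.
LINKS = [(i, 1000 + i, 1290 + (i % 50)) for i in range(500)]
SEED_GROUPS = {1290, 1291, 1292, 1294}


def links_first(links, seeds):
    """Query the link relation unconstrained, then filter by seed group."""
    fetched = list(links)  # every row crosses the database boundary
    result = [(g, m) for (_id, m, g) in fetched if g in seeds]
    return result, len(fetched)


def seeds_first(links, seeds):
    """Bind the group first, then fetch only the links of that group."""
    result, fetched = [], 0
    for group in sorted(seeds):
        # Stands in for one parameterized database query per bound group.
        rows = [row for row in links if row[2] == group]
        fetched += len(rows)
        result.extend((group, member) for (_id, member, _g) in rows)
    return result, fetched


unbound_result, unbound_rows = links_first(LINKS, SEED_GROUPS)
bound_result, bound_rows = seeds_first(LINKS, SEED_GROUPS)

print(sorted(unbound_result) == sorted(bound_result))  # True
print(unbound_rows, bound_rows)  # 500 40
```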
Alternatively, group seed information could have been supplied in a CSV file, in which case these import commands should be commented out or deleted, since they would most probably conflict with it.
The next command imports links of type Terrorism and asserts them via the participatedInTerrorism relation defined in the ontology. Remember that `kbs/ali-baba-seed-constraints.plm' defines a rule that can exploit these links to infer additional group seed members.
The next section defines three data import rules that map GROUPS/linkCount used by the MI component onto the appropriate EDB link count relation defined in the EDB schema file. The rules dispatch based on the link type argument which is one of the three expansion link type names phoneCall, communication and affiliatedWith (see the EDB schema section for more information on this).
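The dispatch idea behind these rules can be sketched in Python as one lookup per link type. This is only an analogy: the actual rules are PowerLoom rules that are not reproduced here, and the dictionary `EDB_COUNTS`, the function `link_count`, and all counts below are invented for illustration.

```python
# Hypothetical per-type count sources standing in for the EDB link count
# relations; entity IDs and counts are made up.
EDB_COUNTS = {
    "phoneCall": {(732, 536): 3},
    "communication": {(732, 536): 5, (732, 882): 1},
    "affiliatedWith": {(782, 882): 2},
}


def link_count(link_type, a, b):
    """Dispatch on the link type, like one import rule per expansion link type."""
    if link_type not in EDB_COUNTS:
        raise ValueError(f"no import rule for link type {link_type!r}")
    # Returning 0 for an absent pair is a simplification of "no tuple found".
    return EDB_COUNTS[link_type].get((a, b), 0)


print(link_count("phoneCall", 732, 536))      # 3
print(link_count("communication", 732, 536))  # 5
```

A link type without a matching rule yields nothing to dispatch to, which the sketch surfaces as an error.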
The last rule does not have anything to do with loading data but simply supports more readable output generation. For lack of a better place it was added here. By default, entities such as groups and their members are reported via their IDs in KOJAK's report file. For the Ali Baba data this makes for somewhat unreadable output, since all IDs are simply numeric keys. To allow one to substitute names for IDs if desired, KOJAK uses the following mechanism: For each entity it is about to report, it tries to look up a name via GROUPS/nameString. If no name can be found, it outputs the entity ID; otherwise, it outputs the name instead. Such names can be simply asserted or defined via a rule as done here. The rule first looks up the name of an organization or person with the help of the EDB relations EDB-Organization and EDB-AliBabaName. If a name is found it concatenates the entity ID to the end to ensure it is unique. Concatenation is done via PowerLoom's string-concatenate function.
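The lookup-with-fallback behavior can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the PowerLoom rule itself: the two dictionaries stand in for the EDB-Organization and EDB-AliBabaName lookups, the person entry is a placeholder, and the "-" separator between name and ID is an assumption, since the exact output format is not specified here.

```python
# Hypothetical name tables standing in for the two EDB name relations.
ORGANIZATION_NAMES = {1290: "Al Qaeda", 1294: "Ali Baba"}
PERSON_NAMES = {732: "Person A"}  # placeholder name, not real data


def report_label(entity_id):
    """Return a readable, unique label for an entity, or its ID as a fallback."""
    name = ORGANIZATION_NAMES.get(entity_id) or PERSON_NAMES.get(entity_id)
    if name is None:
        return str(entity_id)  # no name found: report the raw ID
    # Append the ID so that two entities sharing a name stay distinguishable.
    return f"{name}-{entity_id}"


print(report_label(1290))  # Al Qaeda-1290
print(report_label(732))   # Person A-732
print(report_label(9999))  # 9999
```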
Finally, the run file determines the sequence of KOJAK commands executed to analyze a dataset such as Ali Baba. This file primarily controls what command parameters are taken from the configuration, whether (intermediary) results are deposited in the hypothesis tables of the KOJAK database, and where and how reports should be generated.
The Ali Baba run file is highly generic so we do not explain it in detail here. Refer to 7. Running the KOJAK Group Finder for more information on how to run KOJAK and how the run file controls this process.