Developing a data analysis tool such as the Group Finder that can be easily and successfully applied to arbitrary real-world datasets is a very challenging task, given that data can come in so many different formats and with so many statistical properties. To address this challenge (at least partially), the Group Finder has a very flexible and highly configurable interface. One might almost say that the Group Finder is configurable to a fault, since there are so many different ways of mapping data, configuring it and running it. However, the generic and example configurations shipping with the distribution should cover a wide variety of situations and make easy departure points when applying the Group Finder to some new dataset.
The Group Finder uses the following configuration mechanisms:
6.1 Configuration File
6.2 Command-Line Options
By default, the Group Finder looks for the file `configuration.dat' in the KOJAK software's installation directory to load its configuration information. Alternatively, the -c command-line option can be used to specify a different configuration file. It is an error if no configuration file can be found and none was specified on the command line.
A set of configuration files is available in the `config' directory. By default the Group Finder uses the `config/example1.dat' configuration, which runs the Group Finder on one of the example datasets that ship with the release. This file and its associated load and run scripts should be a good departure point for generating a configuration for a new dataset.
Configuration information is supplied in a Java-style property file with lines of the following form:

    <parameterName> =<parameterValue>
    <parameterName> +=<parameterValue>
Empty lines or lines starting with a # character will be ignored. Whitespace leading up to the = character will be ignored; whitespace following it will become part of the parameter value. The += syntax allows the specification of list-valued parameters, for example, to assign a set of input files to the Data parameter. Both parameter names and values are treated case-sensitively.
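The parsing rules above can be illustrated with a minimal Python sketch. This is not the Group Finder's own loader; the function name `parse_config` is hypothetical, and two behaviors are assumptions not stated in the text: leading whitespace before `#` is tolerated, and lines without `=` are skipped.

```python
def parse_config(text):
    """Parse '<name> =<value>' and '<name> +=<value>' lines into a dict."""
    params = {}
    for line in text.splitlines():
        # Empty lines and comment lines are ignored
        # (assumption: leading whitespace before '#' is tolerated).
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # assumption: lines without '=' are skipped
        if name.endswith("+"):
            # '+=' adds a value to a list-valued parameter.
            key = name[:-1].strip()
            current = params.get(key, [])
            if not isinstance(current, list):
                current = [current]
            current.append(value)
            params[key] = current
        else:
            # '=' assigns; whitespace before '=' is dropped,
            # whitespace after it stays part of the value.
            params[name.strip()] = value
    return params
```

Names and values are compared exactly as written, which matches the case-sensitive treatment described above.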
The configuration file can contain arbitrary user-defined parameters which can be accessed in run and load scripts via the functions get-parameter, set-parameter, and add-parameter-value. Many built-in parameters control the behavior of the Group Finder; they are explained further below.
Various built-in parameters take filenames as arguments. These filenames can be absolute or relative physical pathnames (using syntax appropriate for the underlying operating system), or they can be logical pathnames that are platform neutral and that KOJAK translates automatically into the appropriate physical form. Logical pathname syntax is derived from Common Lisp. Each logical pathname starts with a logical host (terminated by a colon), followed by a set of directory components (separated by ;) and ending in a filename with an optional extension. KOJAK logical pathnames need to be of the following form:
    KJ:{<dirname>;}*[<basename>][.<extension>]

For example, the logical pathname

    KJ:scripts;ali-baba-load-ontology.plm

is translated into the physical pathname

    C:\Program Files\kojak-2.2.0\scripts\ali-baba-load-ontology.plm

if KojakRootDirectory (see below) is defined as `C:\Program Files\kojak-2.2.0\'. This means the logical host KJ is simply replaced by the value of KojakRootDirectory and the logical pathname separator ; is substituted with the pathname separator appropriate for the underlying OS.
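The translation just described can be sketched in a few lines of Python. The function `translate_logical_pathname` is a hypothetical illustration, not part of KOJAK, and it assumes that pathnames without the KJ logical host are passed through unchanged.

```python
def translate_logical_pathname(logical, root, sep):
    """Translate a KJ: logical pathname into a physical pathname.

    root stands for the value of KojakRootDirectory and sep for the
    pathname separator of the underlying OS.
    """
    host, colon, rest = logical.partition(":")
    if host != "KJ" or not colon:
        return logical  # assumption: non-KJ pathnames are left untouched
    if not root.endswith(sep):
        root += sep
    # Replace the logical host with the root directory and ';' with sep.
    return root + rest.replace(";", sep)


physical = translate_logical_pathname(
    "KJ:scripts;ali-baba-load-ontology.plm",
    "C:\\Program Files\\kojak-2.2.0\\",
    "\\",
)
```

With the root directory from the example above, `physical` holds the Windows pathname of the script.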
6.1.1 General Options
6.1.2 Input Specifications
6.1.3 Script Specifications
6.1.4 Analysis Specifications
6.1.5 Output Specifications
run-kojak scripts always change to the KOJAK directory, relative pathnames are fine unless the KOJAK executables are invoked manually somehow). Logical pathnames starting with the KJ logical host will use this directory as their root.

Data specifications which see). If it is a DSN all necessary connection information must be specified in `.odbc.ini' (see section 3.4.2.1 .odbc.ini File). For example:
    KojakDB =KOJAK

    KojakDB =DSN=KOJAK;DB=kojak;SERVER=blackcat;UID=scott;
    KojakDB =DSN=KOJAK;DB=kojak;
    KojakDB =DSN=KOJAK;USER=KOJAK;DB=blackcat.isi.edu/oracledb;

    KojakDB =jdbc:mysql://blackcat:3306/kojak?user=scott&password=secret
    KojakDB =jdbc:oracle:thin:KOJAK/[email protected]:1521:oracledb
KojakDB can be set or overridden via the command-line options -k and --kojak-db.

none/low/medium/high. This can also be specified via the command-line options -l and --log-level.

.odbc.ini, it will not be inherited from the [Default] data source.

.odbc.ini, it will not be inherited from the [Default] data source.
Data can be used to specify data sources to be analyzed. Sources can either be files in various formats or evidence databases accessible via ODBC or JDBC. Currently supported file formats are comma-separated value files, IET report-format files or PowerLoom files (see section 5. Input Data). Appropriate file extensions must be used to indicate the type of file, for example, data.csv, data.iet or data.plm. Databases can be specified as a DSN or as an ODBC or JDBC connection string (see KojakDB for more details). This parameter can also be specified by pointing to another parameter such as KojakDB, whose value will then be used. Data can have multiple values specified via the += syntax. For example:
    Data +=KJ:data;example1;ds1-group-seeds-all.csv
    Data +=KojakDB
The value(s) of the Data parameter will be used by default by the load-data command to load data (unless overridden by actual parameters). If a database source is specified, the script specified by LoadDataScript will be used to load data from that database.
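As a rough illustration of how a Data value might be classified, the following Python sketch distinguishes file sources by extension from database sources and resolves a value that points to another parameter. The function `classify_data_source` and its return shape are hypothetical; treating every value without a known extension as a database specification is an assumption.

```python
# File extensions named in the text, mapped to the formats they indicate.
FILE_FORMATS = {
    ".csv": "comma-separated values",
    ".iet": "IET report format",
    ".plm": "PowerLoom",
}


def classify_data_source(spec, params):
    """Return (kind, file_format, resolved_spec) for one Data value."""
    # A value naming another parameter (e.g. KojakDB) is replaced by
    # that parameter's value.
    if spec in params:
        spec = params[spec]
    for extension, file_format in FILE_FORMATS.items():
        if spec.endswith(extension):
            return ("file", file_format, spec)
    # Assumption: anything else is a DSN or an ODBC/JDBC connection string.
    return ("database", None, spec)
```

For example, with `params = {"KojakDB": "KOJAK"}`, the value `KojakDB` resolves to the database source `KOJAK`, while `KJ:data;example1;ds1-group-seeds-all.csv` is classified as a CSV file.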
DisabledData is used by the Kojak interface code to record data specifications which are not currently used in the analysis, but which are made available for selection in the interface. It uses the same format as Data.

Data but uses the value of LoadPrimaryDataScript to load database data sources. These data sources are not loaded by default by load-data but can be specified as follows:

    (load-data :data-source "PrimaryData")

-pd or --primary-data options.
Data but uses the value of LoadSecondaryDataScript to load database data sources. These data sources are not loaded by default by load-data but can be specified as follows:

    (load-data :data-source "SecondaryData")

-sd or --secondary-data options.

-n or --dataset-name command-line options. Dataset names are used as metadata for results deposited in the hypothesis tables of the KOJAK DB.

    DatasetType =IET_2004

-t or --dataset-type options.
The Group Finder uses a variety of scripts to load data, ontologies, schema information, execute commands, etc. The following parameters allow customization of these scripts.
initialize-kojak and load-kojak-ontology to load ontology information into PowerLoom.

initialize-kojak and load-edb-schema to load evidence database schema information into PowerLoom.

load-data to load RDBMS data sources specified in the Data configuration parameter.
The following scripts are primarily relevant to IET data sources and somewhat obsolete. They are the default scripts used when an RDBMS data source is loaded from a PrimaryData, SecondaryData, PrimaryLDData or PrimaryPLData specification.

main function of the KOJAK Group Finder is run in non-interactive mode, or when the run-kojak command is executed. This can also be specified via the -s or --run-script options.
DatasetName when the configuration gets saved to the KOJAK database. If none is supplied, a new one is generated automatically. This is useful to track different configuration settings across multiple analysis runs. It can also be specified via the -r or --run-id options.

RunID. It can also be specified via the -rp or --run-prefix options.

--depth option.
expand-groups command when expanding seed groups and computing mutual information between entities. These should correspond to the relevant link types loaded and accessed from the evidence database (for example, telecon, phoneCall, etc.). If no link types are specified the default is anyLink. This is a multi-valued parameter.

DisabledExpansionLinkTypes is used by the Kojak interface code to record link types which are not currently used in the analysis, but which are made available for selection in the interface. It uses the same format as ExpansionLinkTypes.
expand-groups. These are used to compute connection strength measures between entities and seed groups. This is a multi-valued parameter whose values are treated case-insensitively. Legal values are the following:

MI: computes the mutual information value between an entity and a seed group member based on the complete set of links between two entities.
Group-MI: views the whole seed group as a single individual and computes the MI between this "group individual" and an entity based on all the links between the entity and the "group individual".
MI-or-Group-MI: computes both MI and Group-MI values.
Link-Count: computes link counts between an entity and seed group members.
Connectivity: computes the number of seed group members connected to an entity.

Connectivity, Link-Count and Group-MI form the default.

NOTE: this parameter will eventually go away, since which expansion methods are required can be inferred from the ExpansionMeasures described below.
expand-groups. Each strength computation method generates a variety of results such as maximums, averages, etc. The set of expansion measures selects a subset of those result measures and averages them to generate a single result measure (this averaging is called "bagging"). This is a multi-valued parameter that can take values from the following set (values are treated case-insensitively):

max-MI: computes the maximum MI between an entity and any of the seed group members as the connection strength to the seed group.
max-MI-correlation: computes max-MI and normalizes it as a correlation value.
sum-MI: sums up the MIs between an entity and each seed group member.
group-MI: computes the MI between an entity and the set of seed group members when viewed as a single super individual (combining all their links).
group-MI-correlation: computes group-MI and normalizes it as a correlation value.
MI-or-group-MI: computes the maximum of the max-MI and group-MI measures.
MI-or-group-MI-correlation: computes MI-or-group-MI and normalizes it as a correlation value.
max-count: computes the maximum link count between an entity and any of the seed group members as the connection strength to the seed group.
sum-count: sums up the link counts between an entity and each seed group member.
connectivity: computes the number of seed members an entity is connected to.

Each of the above measures will be normalized to a value between 0 and 1 to allow proper averaging. The best performing measures in experiments performed so far are sum-count, group-MI and connectivity, which form the default set if no measures are specified.
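The "bagging" step can be illustrated with a small Python sketch. It is not the Group Finder's implementation: the function `bag_measures` is hypothetical, and because the text does not say how values are normalized to the range 0 to 1, the sketch assumes each measure is divided by its maximum over all candidate entities.

```python
def bag_measures(scores):
    """Average several normalized measures into one strength per entity.

    scores maps a measure name to a dict of {entity: raw value}.
    """
    normalized = {}
    for measure, values in scores.items():
        top = max(values.values(), default=0.0)
        # Assumption: normalize by dividing by the measure's maximum.
        normalized[measure] = {
            entity: (value / top if top > 0 else 0.0)
            for entity, value in values.items()
        }
    entities = set().union(*(v.keys() for v in normalized.values()))
    # Entities missing from a measure contribute 0 for that measure.
    return {
        entity: sum(n.get(entity, 0.0) for n in normalized.values()) / len(normalized)
        for entity in entities
    }


bagged = bag_measures({
    "sum-count": {"a": 4, "b": 2},
    "connectivity": {"a": 2, "b": 2},
})
```

Here entity `a` reaches the maximum on both measures, while `b` reaches it only on `connectivity`, so `a` ends up with the higher bagged strength.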
Boosting can be useful in cases where we have very few seed members for a group. Instead of running the group expansion only once, we take a small number of the top new members, add them to the seed group (i.e., "boost" the number of seeds) and run again. This is useful if the top members have very high likelihood of being correct. The parameters below control how and how often boosting is done. All of these values can be overridden when calling expand-groups.

BoostMaxCycles or BoostMinSeeds has a value.

BoostMinSeeds):

BoostMaxCycles are undefined boosting is disabled.

BoostFactor tells us otherwise.
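The general boosting idea (expand, promote a few top new members to seeds, expand again) can be sketched in Python as follows. Everything here is hypothetical: `boosted_expansion`, its `cycles` and `top_k` arguments, and the `expand` callback are illustrative stand-ins and do not reproduce the exact semantics of BoostMaxCycles, BoostMinSeeds or BoostFactor.

```python
def boosted_expansion(seeds, expand, cycles=2, top_k=3):
    """Repeat group expansion, promoting top new members to seeds.

    expand(seeds) is assumed to return (entity, weight) pairs sorted
    in descending order of weight.
    """
    seeds = list(seeds)
    ranked = expand(seeds)
    for _ in range(cycles):
        # Take a small number of the top-ranked members not yet seeds.
        new_members = [entity for entity, _weight in ranked if entity not in seeds][:top_k]
        if not new_members:
            break  # nothing left to promote
        seeds.extend(new_members)
        ranked = expand(seeds)
    return seeds, ranked
```

Promoting only a few members per cycle reflects the caveat above: boosting helps only when the top members are very likely to be correct.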
ReportFormat (which see) is specified, the file extension determines which format to use. The report file can also be specified via the -o or --report-file options. Note that results can also be output to the KOJAK database by using the db-save-groups command.

ReportFiles, which is useful for multiple runs via scripts.

report-groups. Supported formats are IET-Report (equivalent to IET-Y3-Report), IET-Y2-Report and CSV-Table. If not specified, the format will be guessed from the extension of ReportFile. A .csv extension selects the CSV-Table format; everything else is interpreted as IET-Report.
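The format-guessing rule can be written down as a short Python sketch. The function name `resolve_report_format` is hypothetical; the rule itself (an explicit ReportFormat wins, a .csv extension selects CSV-Table, anything else falls back to IET-Report) follows the description above.

```python
def resolve_report_format(report_file, report_format=None):
    """Pick the report format for a given ReportFile value."""
    if report_format:
        return report_format  # an explicit ReportFormat takes precedence
    # Otherwise guess from the file extension.
    return "CSV-Table" if report_file.endswith(".csv") else "IET-Report"
```

For example, `resolve_report_format("results.csv")` yields `CSV-Table`, while `resolve_report_format("results.txt")` yields `IET-Report`.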
LastRunCutoff determines how this last run should be cut. Legal values are Inclusive, which keeps the whole run; Exclusive, which cuts at the end of the previous run; Hard, which cuts right at the computed threshold point; or Closest, which cuts at the closest endpoint of the run. The default is Inclusive. If the last run is also the first run, it will always be reported in full unless the value is Hard; if we fall on the last member of a run, the run will be reported in full even if the cutoff value is Exclusive.

GROUPS/nameString relation and printed instead of the IDs or names identifying them in evidence. This is useful if groups and members are identified via unique but unreadable IDs in evidence and more readable reports need to be generated. This will only work if appropriate GROUPS/nameString assertions are available in the evidence, or if there are mapping rules that can retrieve them from an EDB.
The following parameters provide some basic control for thresholding of extended groups. Reported groups are sorted in descending order of membership weight and thresholding controls where this list is cut. Automatic thresholding is still not very well supported by the Group Finder, since there are no clear transitions indicating where true members stop and non-members start. Besides the very basic controls provided below, specialized functions can be written that provide thresholding appropriate for a particular domain (see the threshold-groups command).

-gf or --report-group-fraction options.
What data is analyzed by KOJAK and how it analyzes it is controlled by a large number of parameters given in the `configuration.dat' configuration file in the KOJAK directory. See the previous section for documentation of the various available configuration parameters. Many of these parameters can also be specified and overridden from the Unix command line. Here is the full list of available command-line options with a brief description of their functionality:
    run-kojak [{-c|--configuration-file} FILE]
              [-i|--interactive]
              [{-l|--log-level} LEVEL]
              [{-d|--data} DATASPEC]*
              [{-pd|--primary-data} DATASPEC]*
              [{-sd|--secondary-data} DATASPEC]*
              [{-k|--kojak-db} DBSPEC]
              [{-n|--dataset-name} NAME]
              [{-t|--dataset-type} TYPE]
              [{-r|--run-id} ID]
              [{-rp|--run-prefix} PREFIX]
              [{-s|--run-script} FILE]
              [--depth N]
              [{-o|--report-file} FILE]
              [{-gf|--report-group-fraction} FRACTION]
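run-kojak command will run the Group Finder end-to-end.

{-l|--log-level} LEVEL: sets the LogLevel parameter to LEVEL.
{-d|--data} DATASPEC: sets the Data parameter to DATASPEC. This option can be supplied multiple times.
{-pd|--primary-data} DATASPEC: sets the PrimaryData parameter to DATASPEC. This option can be supplied multiple times.
{-sd|--secondary-data} DATASPEC: sets the SecondaryData parameter to DATASPEC. This option can be supplied multiple times.
{-k|--kojak-db} DBSPEC: sets the KojakDB parameter to DBSPEC.
{-n|--dataset-name} NAME: sets the DatasetName parameter to NAME.
{-t|--dataset-type} TYPE: sets the DatasetType parameter to TYPE.
{-r|--run-id} ID: sets the RunID parameter to ID.
{-rp|--run-prefix} PREFIX: sets the RunPrefix parameter to PREFIX.
{-s|--run-script} FILE: sets the RunKojakScript parameter to FILE.
--depth N: sets the ExpansionDepth parameter to N.
{-o|--report-file} FILE: sets the ReportFile parameter to FILE.
{-gf|--report-group-fraction} FRACTION: sets the ReportGroupFraction parameter to FRACTION.

= and += syntax are supported.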
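The relationship between command-line options and configuration parameters can be summarized in a Python sketch. This is not how run-kojak itself parses its arguments: `apply_overrides` is hypothetical, option spellings and parameter names are taken from this section, and it is an assumption that the first occurrence of a repeatable option replaces the configured list and later occurrences add to it.

```python
# Option spellings and the parameters they set, as listed in this section.
OPTION_TO_PARAMETER = {
    ("-l", "--log-level"): "LogLevel",
    ("-d", "--data"): "Data",
    ("-pd", "--primary-data"): "PrimaryData",
    ("-sd", "--secondary-data"): "SecondaryData",
    ("-k", "--kojak-db"): "KojakDB",
    ("-n", "--dataset-name"): "DatasetName",
    ("-t", "--dataset-type"): "DatasetType",
    ("-r", "--run-id"): "RunID",
    ("-rp", "--run-prefix"): "RunPrefix",
    ("-s", "--run-script"): "RunKojakScript",
    ("--depth",): "ExpansionDepth",
    ("-o", "--report-file"): "ReportFile",
    ("-gf", "--report-group-fraction"): "ReportGroupFraction",
}
# Options that may be supplied multiple times.
MULTI_VALUED = {"Data", "PrimaryData", "SecondaryData"}


def apply_overrides(params, argv):
    """Return a copy of params with command-line overrides applied."""
    lookup = {opt: name for opts, name in OPTION_TO_PARAMETER.items() for opt in opts}
    result = dict(params)
    seen = set()
    args = iter(argv)
    for opt in args:
        name = lookup.get(opt)
        if name is None:
            continue  # arguments not mapped to a parameter are skipped here
        value = next(args, None)
        if value is None:
            raise ValueError(f"option {opt} requires a value")
        if name in MULTI_VALUED:
            # Assumption: the first occurrence replaces the configured list.
            if name not in seen:
                result[name] = []
            result[name].append(value)
        else:
            result[name] = value
        seen.add(name)
    return result
```

Values are kept as strings, mirroring how they would appear in the configuration file.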