Developing a data analysis tool such as the Group Finder that can be applied easily and successfully to arbitrary real-world datasets is very challenging, since data comes in many different formats and with widely varying statistical properties. To address this challenge (at least partially), the Group Finder has a very flexible and highly configurable interface. One might almost say that the Group Finder is configurable to a fault, since there are so many different ways of mapping data, configuring the tool and running it. However, the generic and example configurations that ship with the distribution should cover a wide variety of situations and provide convenient starting points when applying the Group Finder to a new dataset.
The Group Finder uses the following configuration mechanisms:
6.1 Configuration File
6.2 Command-Line Options
6.1 Configuration File
By default, the Group Finder looks for the file
`configuration.dat' in the KOJAK software's installation
directory to load its configuration information. Alternatively, the
-c command-line option can be used to specify a different
configuration file. It is an error if no configuration file can be
found and none was specified on the command line.
There is a set of configuration files available in the `config' directory. By default the Group Finder uses the `config/example1.dat' configuration, which runs the Group Finder on one of the example datasets that ship with the release. This file and its associated load and run scripts should be a good starting point for generating a configuration for a new dataset.
Configuration information is supplied in a Java-style property file with lines of the following form:
<parameterName> =<parameterValue>
<parameterName> +=<parameterValue>
Empty lines or lines starting with a # character will be
ignored. Whitespace leading up to the = character will be
ignored; whitespace following it becomes part of the parameter
value. The += syntax allows the specification of list-valued
parameters, for example, to assign a set of input files to the
Data parameter. Both parameter names and values are
case-sensitive.
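The line format described above can be illustrated with a small sketch. This is not the Group Finder's own parser; the function name parse_config is hypothetical, and it assumes that a plain = replaces any earlier values of a parameter while += appends to them, and that whitespace-only lines count as empty.

```python
def parse_config(text):
    """Parse property lines of the form 'name =value' or 'name +=value'.

    Returns a dict mapping each parameter name to a list of values.
    """
    params = {}
    for line in text.splitlines():
        # Empty lines and lines starting with '#' are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a parameter line: {line!r}")
        append = name.endswith("+")
        if append:
            name = name[:-1]
        # Whitespace leading up to '=' is ignored; the value is kept verbatim,
        # including any whitespace that directly follows '='.
        name = name.strip()
        if append:
            params.setdefault(name, []).append(value)
        else:
            # Assumption: '=' replaces values assigned earlier in the file.
            params[name] = [value]
    return params
```

Because only the first = on a line separates name from value, connection strings such as DSN=KOJAK;DB=kojak; survive intact as values.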
The configuration file can contain arbitrary user-defined parameters,
which can be accessed in run and load scripts via the functions
get-parameter, set-parameter, and
add-parameter-value. A large number of built-in
parameters control the behavior of the Group Finder; they are
explained further below.
Various built-in parameters take filenames as arguments. These
filenames can be absolute or relative physical pathnames (using syntax
appropriate for the underlying operating system), or they can be
logical pathnames that are platform neutral and that can be
translated automatically by KOJAK into the appropriate physical form.
Logical pathname syntax is derived from Common Lisp. Each logical
pathname starts with a logical host (terminated by a colon) followed by
a set of directory components (separated by ;) and ending in a
filename with an optional extension. KOJAK logical pathnames need to be
of the following form:
KJ:{<dirname>;}*[<basename>][.<extension>]
For example, the logical pathname
KJ:scripts;ali-baba-load-ontology.plm
is translated into the physical pathname
C:\Program Files\kojak-2.2.0\scripts\ali-baba-load-ontology.plm
if KojakRootDirectory (see below) is defined as
`C:\Program Files\kojak-2.2.0\'. This means the logical host
KJ is simply replaced by the value of KojakRootDirectory
and the logical pathname separator ; is substituted with the
pathname separator appropriate for the underlying OS.
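The translation rule just described can be sketched in a few lines. The function translate_logical_pathname is hypothetical and only mirrors the two substitutions stated above (KJ: becomes the root directory, ; becomes the OS separator); KOJAK's actual translation is done by its own runtime and may handle more cases. The Unix root /opt/kojak in the usage note is an invented example.

```python
def translate_logical_pathname(path, root, sep):
    """Translate a KJ: logical pathname into a physical pathname.

    path: logical or physical pathname
    root: value of KojakRootDirectory
    sep:  pathname separator of the underlying OS
    """
    host = "KJ:"
    if not path.startswith(host):
        # Not a KJ logical pathname: treat it as a physical pathname.
        return path
    if not root.endswith(sep):
        root += sep
    # Replace the logical host by the root and ';' by the OS separator.
    return root + path[len(host):].replace(";", sep)
```

For instance, translate_logical_pathname("KJ:data;example1;ds1-group-seeds-all.csv", "/opt/kojak", "/") yields /opt/kojak/data/example1/ds1-group-seeds-all.csv.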
6.1.1 General Options
6.1.2 Input Specifications
6.1.3 Script Specifications
6.1.4 Analysis Specifications
6.1.5 Output Specifications
6.1.1 General Options
KojakRootDirectory: the root directory of the KOJAK installation (since
the run-kojak scripts always change to the KOJAK directory, relative
pathnames are fine unless the KOJAK executables are invoked manually
somehow). Logical pathnames starting with the KJ logical host will use
this directory as their root.
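KojakDB: the KOJAK database, specified as a DSN or as an ODBC or JDBC
connection string (see also the Data specifications, which see). If it
is a DSN, all necessary connection information must be specified in
`.odbc.ini' (see section 3.4.2.1 .odbc.ini File). For example: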
KojakDB =KOJAK

KojakDB =DSN=KOJAK;DB=kojak;SERVER=blackcat;UID=scott;
KojakDB =DSN=KOJAK;DB=kojak;
KojakDB =DSN=KOJAK;USER=KOJAK;DB=blackcat.isi.edu/oracledb;

KojakDB =jdbc:mysql://blackcat:3306/kojak?user=scott&password=secret
KojakDB =jdbc:oracle:thin:KOJAK/secret@blackcat.isi.edu:1521:oracledb
KojakDB can also be set or overridden via the command-line options
-k and --kojak-db.
LogLevel: one of none/low/medium/high. This can also be
specified via the command-line options -l and --log-level.
.odbc.ini, it will not be inherited from the [Default] data
source.
.odbc.ini, it will not be inherited from the [Default] data
source.
6.1.2 Input Specifications
Data can be used to specify data sources to be analyzed. Sources
can either be files in various formats or evidence databases accessible
via ODBC or JDBC. Currently supported file formats are comma-separated
value files, IET report-format files or PowerLoom files (see section 5. Input Data). Appropriate file extensions must be used to indicate the type
of file, for example, data.csv, data.iet or
data.plm. Databases can be specified as a DSN or ODBC or JDBC
connection string (see KojakDB for more details). This parameter
can also be specified by pointing to another parameter such as
KojakDB whose value will then be used. Data can have
multiple values specified via the += syntax. For example:
Data +=KJ:data;example1;ds1-group-seeds-all.csv
Data +=KojakDB
The value(s) of the Data parameter will be used by default by the
load-data command to load data (unless overridden by actual
parameters). If a database source is specified, the script specified
by LoadDataScript will be used to load data from that database.
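How a Data value might be classified can be sketched as follows. The function classify_data_source and the returned labels are hypothetical; the sketch assumes that a value equal to the name of another parameter stands for that parameter's value, that the three file extensions are matched case-sensitively, and that anything else is a database specification.

```python
# File extensions named in the text, mapped to illustrative type labels.
FILE_TYPES = {".csv": "csv", ".iet": "iet-report", ".plm": "powerloom"}


def classify_data_source(spec, params):
    """Return (kind, resolved_spec) for one value of the Data parameter.

    params maps parameter names (e.g. "KojakDB") to their string values.
    """
    # A spec naming another parameter is replaced by that parameter's value.
    if spec in params:
        spec = params[spec]
    for extension, kind in FILE_TYPES.items():
        if spec.endswith(extension):
            return kind, spec
    # Assumption: everything else is a DSN or an ODBC/JDBC connection string.
    return "database", spec
```

With the two Data values from the example above and KojakDB set to KOJAK, the first value is classified as a CSV file and the second resolves to the database KOJAK.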
DisabledData is used by the Kojak interface code to record data
specifications which are not currently used in the analysis, but which
are made available for selection in the interface. It uses the same
format as Data.
PrimaryData: same as Data but uses the value of LoadPrimaryDataScript
to load database data sources. These data sources are not loaded by
default by load-data but can be specified as follows:
(load-data :data-source "PrimaryData")
PrimaryData can also be specified via the -pd or
--primary-data options.
SecondaryData: same as Data but uses the value of LoadSecondaryDataScript
to load database data sources. These data sources are not loaded by
default by load-data but can be specified as follows:
(load-data :data-source "SecondaryData")
SecondaryData can also be specified via the -sd or
--secondary-data options.
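DatasetName: can also be specified via the -n or
--dataset-name command-line options. Dataset names are used as
metadata for results deposited in the hypothesis tables of the KOJAK DB.
DatasetType: for example,
DatasetType =IET_2004
This can also be specified via the -t or --dataset-type options.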
6.1.3 Script Specifications
The Group Finder uses a variety of scripts to load data, ontologies, schema information, execute commands, etc. The following parameters allow customization of these scripts.
initialize-kojak and
load-kojak-ontology to load ontology information into PowerLoom.
initialize-kojak and
load-edb-schema to load evidence database schema information into
PowerLoom.
LoadDataScript: the script used by load-data to load RDBMS data sources
specified in the Data configuration parameter.
The following scripts are primarily relevant to IET data sources and
somewhat obsolete. They are the default scripts used when an RDBMS data
source is loaded from a PrimaryData, SecondaryData,
PrimaryLDData or PrimaryPLData specification.
RunKojakScript: the script that is executed when the
main function of the KOJAK Group Finder is run in
non-interactive mode, or when the run-kojak command is executed.
This can also be specified via the -s or --run-script options.
6.1.4 Analysis Specifications
DatasetName when
the configuration gets saved to the KOJAK database. If none is supplied
a new one is generated automatically. This is useful to track different
configuration settings across multiple analysis runs. It can also be
specified via the -r or --run-id options.
RunID.
It can also be specified via the -rp or --run-prefix options.
--depth option.
ExpansionLinkTypes: the link types used by the expand-groups
command when expanding seed groups and computing mutual information
between entities. These should correspond to the relevant link types
loaded and accessed from the evidence database (for example, telecon
or phoneCall). If no link types are specified, the default is
anyLink. This is a multi-valued parameter.
DisabledExpansionLinkTypes is used by the Kojak interface code
to record link types which are not currently used in the analysis, but which
are made available for selection in the interface. It uses the same
format as ExpansionLinkTypes.
expand-groups. These are used to compute connection strength
measures between entities and seed groups. This is a multi-valued
parameter whose values are treated case-insensitively. Legal values are
the following:
MI: computes the mutual information value between an entity
and a seed group member based on the complete set of links between two
entities.
Group-MI: views the whole seed group as a single individual
and computes the MI between this "group individual" and an entity based
on all the links between the entity and the "group individual".
MI-or-Group-MI: computes both MI and Group-MI
values.
Link-Count: computes link counts between an entity and
seed group members.
Connectivity: computes the number of seed group members
connected to an entity.
Connectivity, Link-Count and Group-MI form
the default.
NOTE: this parameter will eventually go away, since the required
expansion methods can be inferred from the ExpansionMeasures
described below.
ExpansionMeasures: the result measures used by
expand-groups. Each strength computation method generates a
variety of results such as maximums, averages, etc. The set of
expansion measures selects a subset of those result measures and
averages them to generate a single result measure (this averaging is
called "bagging"). This is a multi-valued parameter that can take
values from the following set (values are treated case-insensitively):
max-MI: computes the maximum MI between an entity and any of
the seed group members as the connection strength to the seed group.
max-MI-correlation: computes max-MI and normalizes
it as a correlation value.
sum-MI: sums up the MIs between an entity and each seed
group member.
group-MI: computes the MI between an entity and the set of
seed group members when viewed as a single super individual (combining
all their links).
group-MI-correlation: computes group-MI and normalizes
it as a correlation value.
MI-or-group-MI: computes the maximum of the max-MI
and group-MI measures.
MI-or-group-MI-correlation: computes MI-or-group-MI
and normalizes it as a correlation value.
max-count: computes the maximum link count between an
entity and any of the seed group members as the connection strength to
the seed group.
sum-count: sums up the link counts between an entity and
each seed group member.
connectivity: computes the number of seed members an entity
is connected to.
Each of the above measures will be normalized to a value between 0 and 1
to allow proper averaging. The best-performing measures in the experiments
performed so far are sum-count, group-MI and
connectivity, which form the default set if no measures are
specified.
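The averaging ("bagging") of normalized measures can be illustrated with a short sketch. The source does not say how each measure is normalized to the range 0 to 1, so the sketch assumes non-negative raw values divided by the largest value observed for that measure; the function name bag_measures and the input layout are hypothetical.

```python
def bag_measures(scores, measures):
    """Average the selected measures into a single value per entity.

    scores:   {measure name: {entity: raw non-negative value}}
    measures: names of the measures to combine
    """
    # Assumption: normalize each measure by its largest observed value.
    peaks = {m: max(scores[m].values(), default=0.0) for m in measures}
    entities = set()
    for m in measures:
        entities.update(scores[m])
    bagged = {}
    for entity in entities:
        total = 0.0
        for m in measures:
            raw = scores[m].get(entity, 0.0)
            total += raw / peaks[m] if peaks[m] > 0 else 0.0
        # Bagging: plain average over the selected, normalized measures.
        bagged[entity] = total / len(measures)
    return bagged
```

An entity that ranks highest on one measure but low on another ends up with a middling combined value, which is the intended smoothing effect of averaging several measures.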
Boosting can be useful in cases where we have very few seed members for
a group. Instead of running the group expansion only once, we take a
small number of the top new members, add them to the seed group (i.e.,
"boost" the number of seeds) and run again. This is useful if the top
members have a very high likelihood of being correct. The parameters
below control how and how often boosting is done. All of these values
can be overridden when calling expand-groups.
BoostMaxCycles
or BoostMinSeeds has a value.
BoostMinSeeds):
BoostMaxCycles are undefined boosting is disabled.
BoostFactor tells us otherwise.
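The boosting loop described above can be sketched generically. This is an illustration only: expand is a hypothetical callable standing in for a group expansion run, and the arguments cycles and top_k are illustrative names that do not claim to reproduce the exact semantics of BoostMaxCycles, BoostMinSeeds or BoostFactor.

```python
def expand_with_boosting(seeds, expand, cycles=1, top_k=2):
    """Run group expansion with a simple form of boosting.

    seeds:  initial seed members
    expand: callable taking the seed list and returning (entity, weight)
            pairs sorted by descending weight (hypothetical interface)
    cycles: number of boosting cycles before the final run
    top_k:  number of top new members promoted to seeds per cycle
    """
    seeds = list(seeds)
    for _ in range(cycles):
        ranked = expand(seeds)
        # Promote the top new members to seeds ("boost" the seed group).
        promoted = [entity for entity, _w in ranked if entity not in seeds][:top_k]
        if not promoted:
            break
        seeds.extend(promoted)
    # Final expansion run with the boosted seed group.
    return seeds, expand(seeds)
```

Keeping top_k small reflects the point made above that boosting only helps when the promoted members are very likely to be correct.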
6.1.5 Output Specifications
ReportFile: the file to which the group report is written. If no
ReportFormat (which see) is specified, the file extension
determines which format to use. The report file can also be specified
via the -o or --report-file options. Note that results
can also be output to the KOJAK database by using the
db-save-groups command.
ReportFiles which is useful for
multiple runs via scripts.
report-groups.
Supported formats are IET-Report (equivalent to
IET-Y3-Report), IET-Y2-Report and CSV-Table. If
not specified, the format will be guessed from the extension of
ReportFile. A .csv extension means use the
CSV-Table format, everything else is interpreted as
IET-Report.
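The format-guessing rule stated above amounts to the following sketch. The function name guess_report_format is hypothetical, and the sketch assumes the .csv extension is matched case-sensitively.

```python
def guess_report_format(report_file, report_format=None):
    """Return the report format to use for report_file."""
    # An explicitly specified ReportFormat always wins.
    if report_format:
        return report_format
    # Otherwise guess from the extension: .csv means CSV-Table,
    # everything else is interpreted as IET-Report.
    return "CSV-Table" if report_file.endswith(".csv") else "IET-Report"
```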
LastRunCutoff determines how this last run should be cut. Legal
values are Inclusive which keeps the whole run, Exclusive
which cuts at the end of the previous run, Hard which cuts right
at the computed threshold point or Closest which cuts at the
closest endpoint of the run. The default is Inclusive. If the
last run is also the first run, it will always be reported in full
unless the value is Hard; if we fall on the last member of a run
the run will be reported in full even if the cutoff value is
Exclusive.
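The four cutoff modes can be made concrete with a sketch. It rests on assumptions that are not confirmed by the text: a "run" is taken to be a maximal stretch of consecutive members with equal weight, the threshold is given as a member count n, and a tie under Closest keeps the whole run. The function name cut_index is hypothetical.

```python
def cut_index(weights, n, mode="Inclusive"):
    """Return how many leading members to report.

    weights: membership weights sorted in descending order
    n:       number of members selected by the threshold (1 <= n <= len(weights))
    mode:    Inclusive, Exclusive, Hard or Closest
    """
    if mode not in ("Inclusive", "Exclusive", "Hard", "Closest"):
        raise ValueError(f"unknown cutoff mode: {mode}")
    if not 1 <= n <= len(weights):
        raise ValueError("n out of range")
    last = n - 1
    # Assumption: the last run is the stretch of equal weights around `last`;
    # it occupies weights[start:end].
    start = last
    while start > 0 and weights[start - 1] == weights[last]:
        start -= 1
    end = last + 1
    while end < len(weights) and weights[end] == weights[last]:
        end += 1
    if mode == "Hard":
        return n
    # The first run, or a threshold on the last member of a run,
    # is always reported in full.
    if start == 0 or n == end:
        return end
    if mode == "Inclusive":
        return end
    if mode == "Exclusive":
        return start
    # Closest: cut at the nearer endpoint (assumption: ties keep the run).
    return start if (n - start) < (end - n) else end
```

For weights 0.9, 0.8, 0.8, 0.8, 0.8, 0.5 and n = 3, the sketch reports 5 members for Inclusive, 1 for Exclusive and 3 for Hard.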
GROUPS/nameString relation and printed instead of the IDs or names
identifying them in evidence. This is useful if groups and members are
identified via unique but unreadable IDs in evidence and more readable
reports need to be generated. This will only work if appropriate
GROUPS/nameString assertions are available in the evidence, or
if there are mapping rules that can retrieve them from an EDB.
The following parameters provide some basic control for thresholding of
extended groups. Reported groups are sorted in descending order of
membership weight and thresholding controls where this list is cut.
Automatic thresholding is still not very well supported by the Group
Finder, since there are no clear transitions indicating where true
members stop and non-members start. Besides the very basic controls
provided below, specialized functions can be written that provide
thresholding appropriate for a particular domain (see the
threshold-groups command).
-gf or --report-group-fraction
options.
6.2 Command-Line Options
What data is analyzed by KOJAK and how it analyzes it is controlled by a large number of parameters given in the `configuration.dat' configuration file in the KOJAK directory. See the previous section for documentation of the various available configuration parameters. Many of these parameters can also be specified and overridden from the Unix command line. Here is the full list of available command-line options with a brief description of their functionality:
run-kojak [{-c|--configuration-file} FILE] [-i|--interactive]
          [{-l|--log-level} LEVEL] [{-d|--data} DATASPEC]*
          [{-pd|--primary-data} DATASPEC]*
          [{-sd|--secondary-data} DATASPEC]*
          [{-k|--kojak-db} DBSPEC] [{-n|--dataset-name} NAME]
          [{-t|--dataset-type} TYPE] [{-r|--run-id} ID]
          [{-rp|--run-prefix} PREFIX]
          [{-s|--run-script} FILE] [--depth N] [{-o|--report-file} FILE]
          [{-gf|--report-group-fraction} FRACTION]
run-kojak command will run the Group Finder
end-to-end.
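{-l|--log-level} LEVEL: sets the LogLevel parameter to LEVEL.
{-d|--data} DATASPEC: sets the Data parameter to DATASPEC. This option can be
supplied multiple times.
{-pd|--primary-data} DATASPEC: sets the PrimaryData parameter to DATASPEC. This option
can be supplied multiple times.
{-sd|--secondary-data} DATASPEC: sets the SecondaryData parameter to DATASPEC. This option
can be supplied multiple times.
{-k|--kojak-db} DBSPEC: sets the KojakDB parameter to DBSPEC.
{-n|--dataset-name} NAME: sets the DatasetName parameter to NAME.
{-t|--dataset-type} TYPE: sets the DatasetType parameter to TYPE.
{-r|--run-id} ID: sets the RunID parameter to ID.
{-rp|--run-prefix} PREFIX: sets the RunPrefix parameter to PREFIX.
{-s|--run-script} FILE: sets the RunKojakScript parameter to FILE.
--depth N: sets the ExpansionDepth parameter to N.
{-o|--report-file} FILE: sets the ReportFile parameter to FILE.
{-gf|--report-group-fraction} FRACTION: sets the ReportGroupFraction parameter to FRACTION.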
= and += syntax are supported.
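The mapping from command-line options to configuration parameters can be summarized in a sketch built on Python's argparse module. This is not how run-kojak itself is implemented; the function command_line_overrides is hypothetical, it uses -rp for --run-prefix as stated in the description of that parameter, and it leaves out -c and -i because they do not set a configuration parameter.

```python
import argparse

# (option strings, parameter name, may be supplied multiple times)
OPTIONS = [
    (("-l", "--log-level"), "LogLevel", False),
    (("-d", "--data"), "Data", True),
    (("-pd", "--primary-data"), "PrimaryData", True),
    (("-sd", "--secondary-data"), "SecondaryData", True),
    (("-k", "--kojak-db"), "KojakDB", False),
    (("-n", "--dataset-name"), "DatasetName", False),
    (("-t", "--dataset-type"), "DatasetType", False),
    (("-r", "--run-id"), "RunID", False),
    (("-rp", "--run-prefix"), "RunPrefix", False),
    (("-s", "--run-script"), "RunKojakScript", False),
    (("--depth",), "ExpansionDepth", False),
    (("-o", "--report-file"), "ReportFile", False),
    (("-gf", "--report-group-fraction"), "ReportGroupFraction", False),
]


def command_line_overrides(argv):
    """Return {parameter name: value} for the options present in argv."""
    parser = argparse.ArgumentParser(prog="run-kojak", allow_abbrev=False)
    for flags, name, multiple in OPTIONS:
        parser.add_argument(*flags, dest=name,
                            action="append" if multiple else "store")
    args = parser.parse_args(argv)
    # Only options that were actually given override the configuration file.
    return {name: value for name, value in vars(args).items()
            if value is not None}
```

The returned dictionary could then be laid over the values read from the configuration file, so that anything given on the command line takes precedence.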