6. Configuration

Developing a data analysis tool such as the Group Finder that can be applied easily and successfully to arbitrary real-world datasets is challenging, because data comes in many different formats and with widely varying statistical properties. To address this challenge (at least partially), the Group Finder has a flexible and highly configurable interface. One might almost say that it is configurable to a fault, since there are so many different ways of mapping data, configuring the tool and running it. However, the generic and example configurations that ship with the distribution should cover a wide variety of situations and provide convenient starting points when applying the Group Finder to a new dataset.

The Group Finder uses the following configuration mechanisms:

A configuration file allows the specification of a large number of parameters such as datasets, connection information, analysis parameters, etc. Many of the available KOJAK commands take these parameters as inputs or as defaults.
A set of command line flags that can be used to specify and override some of the configuration parameters specified in the configuration file. This is useful for multiple runs over the same data with slightly different parameter settings, etc.
A run script that can execute an arbitrary sequence of KOJAK commands for loading/importing data, ontology and schema information, generating group seeds, extending them, thresholding, reporting, etc.
A variety of load scripts that control how and what data is loaded, which ontology and schema information is loaded, etc.
Ontology and knowledge base files that can leverage the full power of the PowerLoom knowledge representation and reasoning system to represent specifics of a particular domain, map to a particular external database schema, etc.

6.1 Configuration File

By default, the Group Finder looks for the file `configuration.dat' in the KOJAK software's installation directory to load its configuration information. Alternatively, the -c command-line option can be used to specify a different configuration file. It is an error if no configuration file can be found and none was specified on the command line.

A set of configuration files is available in the `config' directory. By default the Group Finder uses the `config/example1.dat' configuration, which runs the Group Finder on one of the example datasets that ship with the release. This file and its associated load and run scripts should be a good starting point for generating a configuration for a new dataset.

Configuration information is supplied in a Java-style property file with lines of the following form:

<parameter> =<value>
<parameter> +=<value>

Empty lines and lines starting with a # character are ignored. Whitespace leading up to the = character is ignored, but whitespace following it becomes part of the parameter value. The += syntax allows the specification of list-valued parameters, for example, to assign a set of input files to the Data parameter. Both parameter names and values are case-sensitive.
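The parsing rules above can be illustrated with a small Python sketch. This is not the parser used by KOJAK. The function name parse_config and the choice to store every parameter as a list of values are assumptions made for illustration, as is treating whitespace-only lines as empty.

```python
def parse_config(text):
    """Parse property-style configuration text into {name: [values]}.

    Illustrative only: storing every parameter as a list is an assumption.
    """
    params = {}
    for line in text.splitlines():
        # Skip empty lines (assumed to include whitespace-only lines)
        # and comment lines starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        eq = line.find("=")
        if eq < 0:
            raise ValueError(f"not a parameter line: {line!r}")
        # '+=' appends to a list-valued parameter, '=' replaces the value.
        append = eq > 0 and line[eq - 1] == "+"
        # Whitespace before '=' (or '+=') is ignored; names stay case-sensitive.
        name = line[: eq - 1 if append else eq].strip()
        # Everything after '=' is kept verbatim, including leading whitespace.
        value = line[eq + 1:]
        if append:
            params.setdefault(name, []).append(value)
        else:
            params[name] = [value]
    return params


example = "# comment\n\nKojakDB =KOJAK\nData +=first.csv\nData +=KojakDB\nLogLevel = low\n"
config = parse_config(example)
print(config["Data"])      # two values collected via +=
print(config["LogLevel"])  # note the leading space kept in the value
```

Note how the value of LogLevel keeps its leading space, which is why the examples in this chapter write the value directly after the = character.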

The configuration file can contain arbitrary user-defined parameters which can be accessed in run and load scripts via the functions get-parameter, set-parameter, and add-parameter-value. There are a multitude of built-in parameters that control the behavior of the Group Finder which are explained further below.

Various built-in parameters take filenames as arguments. These filenames can be absolute or relative physical pathnames (using syntax appropriate for the underlying operating system), or they can be logical pathnames that are platform neutral and that KOJAK translates automatically into the appropriate physical form. Logical pathname syntax is derived from Common Lisp. Each logical pathname starts with a logical host (terminated by a colon) followed by a set of directory components (separated by ;) and ending in a filename with an optional extension. KOJAK logical pathnames need to be of the following form:

KJ:{<dirname>;}*[<basename>][.<extension>]
For example,

KJ:scripts;ali-baba-load-ontology.plm
which would translate into

C:\Program Files\kojak-2.2.0\scripts\ali-baba-load-ontology.plm
if the value of KojakRootDirectory (see below) is defined as `C:\Program Files\kojak-2.2.0\'. This means the logical host KJ is simply replaced by the value of KojakRootDirectory and the logical pathname separator ; is substituted with the pathname separator appropriate for the underlying OS.
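The substitution described above can be sketched in a few lines of Python. The function name translate_logical_pathname is hypothetical, and passing the separator explicitly (instead of querying the operating system) is a simplification so that the example behaves the same on every platform. The Unix root `/opt/kojak/' used below is likewise only an assumed example value.

```python
def translate_logical_pathname(path, root, separator):
    """Translate a KJ: logical pathname into a physical pathname.

    root stands for the value of KojakRootDirectory (with its final
    directory delimiter); separator is the OS pathname separator.
    """
    prefix = "KJ:"
    if not path.startswith(prefix):
        # Assumption: anything else is already a physical pathname.
        return path
    # Replace the logical host with the root and ';' with the OS separator.
    return root + path[len(prefix):].replace(";", separator)


print(translate_logical_pathname(
    "KJ:scripts;ali-baba-load-ontology.plm",
    "C:\\Program Files\\kojak-2.2.0\\",
    "\\"))
print(translate_logical_pathname(
    "KJ:data;example1;ds1-group-seeds-all.csv", "/opt/kojak/", "/"))
```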

6.1.1 General Options

Parameter: KojakRootDirectory: This should be an absolute physical path pointing to the KOJAK directory (including a final directory delimiter). A relative path such as `./' also works but requires that the Group Finder be run from the KOJAK directory. Since the run-kojak scripts always change to the KOJAK directory, relative pathnames are fine unless the KOJAK executables are invoked manually. Logical pathnames starting with the KJ logical host will use this directory as their root.

Parameter: KojakDB

The internal database used by KOJAK to store hypotheses, intermediate results, data and configuration information. Must be either a simple ODBC DSN or an ODBC or JDBC connection string (similar to Data specifications; see Data). If it is a DSN, all necessary connection information must be specified in `.odbc.ini' (see section 3.4.2.1 .odbc.ini File). For example:

KojakDB =KOJAK

If it is an ODBC connection string, everything necessary can be specified, but partial connection strings are also possible in which case missing information is filled in from `.odbc.ini'. For example:

KojakDB =DSN=KOJAK;DB=kojak;SERVER=blackcat;UID=scott;
KojakDB =DSN=KOJAK;DB=kojak;
KojakDB =DSN=KOJAK;USER=KOJAK;DB=blackcat.isi.edu/oracledb;

If it is a JDBC connection string, all connection information must be specified. JDBC connection strings also vary across database systems and JDBC drivers. Here are two examples for the MySQL and Oracle drivers currently supported by KOJAK:

KojakDB =jdbc:mysql://blackcat:3306/kojak?user=scott&password=secret
KojakDB =jdbc:oracle:thin:KOJAK/<password>@<host>:1521:oracledb

Note that for Oracle databases the KOJAK schema name is the USER name, while for MySQL it is the name of the database (DB). KojakDB can be set or overridden via the command line options -k and --kojak-db.

Parameter: LogLevel: Controls the amount of log output generated during an analysis run. Needs to be one of none/low/medium/high. This can also be specified via the command line options -l and --log-level.

Parameter: DBUser: Default user name to use for connection information if a database is supplied simply as a DSN rather than as a connection string. If this is not provided, user information must be specified for the particular DSN in .odbc.ini; it will not be inherited from the [Default] data source.

Parameter: DBPassword: Default password to use for connection information if a database is supplied simply as a DSN rather than as a connection string. If this is not provided, password information must be specified for the particular DSN in .odbc.ini; it will not be inherited from the [Default] data source.

6.1.2 Input Specifications

Parameter: Data

Data can be used to specify data sources to be analyzed. Sources can either be files in various formats or evidence databases accessible via ODBC or JDBC. Currently supported file formats are comma-separated value files, IET report-format files or PowerLoom files (see section 5. Input Data). Appropriate file extensions must be used to indicate the type of file, for example, data.csv, data.iet or data.plm. Databases can be specified as a DSN or ODBC or JDBC connection string (see KojakDB for more details). This parameter can also be specified by pointing to another parameter such as KojakDB whose value will then be used. Data can have multiple values specified via the += syntax. For example:

Data +=KJ:data;example1;ds1-group-seeds-all.csv
Data +=KojakDB

Data sources can also be specified via the -d or --data command line options which can be given multiple times.

The value(s) of the Data parameter will be used by default by the load-data command to load data (unless overridden by actual parameters). If a database source is specified, the script specified by LoadDataScript will be used to load data from that database.
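As a rough illustration of the kinds of Data values described above, the following Python sketch sorts a specification into one of the documented categories. It is not KOJAK's actual dispatch logic. The function name, the category labels and, in particular, the order of the checks (parameter reference first, then JDBC prefix, then file extension, then the presence of = for ODBC connection strings) are assumptions.

```python
# Documented file extensions mapped to hypothetical category labels.
FILE_TYPES = {
    ".csv": "csv-file",
    ".iet": "iet-report-file",
    ".plm": "powerloom-file",
}


def classify_data_spec(spec, params):
    """Classify a Data value; params maps parameter names to single values."""
    # A value that names another parameter (such as KojakDB) stands for
    # that parameter's value; only one level is resolved in this sketch.
    spec = params.get(spec, spec)
    if spec.startswith("jdbc:"):
        return "jdbc-connection-string"
    for extension, kind in FILE_TYPES.items():
        if spec.endswith(extension):
            return kind
    # Assumed heuristic: ODBC connection strings contain KEY=VALUE pairs.
    if "=" in spec:
        return "odbc-connection-string"
    return "odbc-dsn"


params = {"KojakDB": "KOJAK"}
for value in ["KJ:data;example1;ds1-group-seeds-all.csv", "KojakDB",
              "DSN=KOJAK;DB=kojak;"]:
    print(value, "->", classify_data_spec(value, params))
```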

Parameter: DisabledData: Used by the KOJAK interface code to record data specifications that are not currently used in the analysis but that are made available for selection in the interface. It uses the same format as Data.

Parameter: PrimaryData

Similar to Data but uses the value of LoadPrimaryDataScript to load database data sources. These data sources are not loaded by default by load-data but can be specified as follows:

(load-data :data-source "PrimaryData")

Primary data can also be specified via the -pd or --primary-data options.

Parameter: SecondaryData

Similar to Data but uses the value of LoadSecondaryDataScript to load database data sources. These data sources are not loaded by default by load-data but can be specified as follows:

(load-data :data-source "SecondaryData")

Secondary data can also be specified via the -sd or --secondary-data options.

Parameter: DatasetName: Serves as a single logical name for the supplied input data and all results generated from it. It can also be specified via the -n or --dataset-name command-line options. Dataset names are used as metadata for results deposited in the hypothesis tables of the KOJAK DB.

Parameter: DatasetType

Indicates what kind of data is being analyzed. It is mainly used to flag simulated data generated by IET, since some special assumptions apply there. If it is not given, it is guessed from some of the Data specifications. For example:

DatasetType =IET_2004

It can also be specified via the -t or --dataset-type options.

6.1.3 Script Specifications

The Group Finder uses a variety of scripts to load data, ontologies, schema information, execute commands, etc. The following parameters allow customization of these scripts.

Parameter: LoadOntologyScript: The default script used by initialize-kojak and load-kojak-ontology to load ontology information into PowerLoom.

Parameter: LoadEDBSchemaScript: The default script used by initialize-kojak and load-edb-schema to load evidence database schema information into PowerLoom.

Parameter: LoadDataScript: The default script used by load-data to load RDBMS data sources specified in the Data configuration parameter.

The following scripts are primarily relevant to IET data sources and somewhat obsolete. They are the default scripts used when an RDBMS data source is loaded from a PrimaryData, SecondaryData, PrimaryLDData or PrimaryPLData specification.

Parameter: LoadPrimaryDataScript

Parameter: LoadSecondaryDataScript

Parameter: LoadPrimaryLDDataScript

Parameter: LoadPrimaryPLDataScript

Parameter: RunKojakScript: The top-level KOJAK run script. Commands in this script are executed when the main function of the KOJAK Group Finder is run in non-interactive mode, or when the run-kojak command is executed. This can also be specified via the -s or --run-script options.

6.1.4 Analysis Specifications

Parameter: RunID: Marks a particular configuration used to analyze DatasetName when the configuration gets saved to the KOJAK database. If none is supplied a new one is generated automatically. This is useful to track different configuration settings across multiple analysis runs. It can also be specified via the -r or --run-id options.

Parameter: RunPrefix: Used as a prefix when KOJAK generates an automatic RunID. It can also be specified via the -rp or --run-prefix options.

Parameter: ClearOldHypotheses: If true, hypotheses generated in previous runs will be deleted from the KOJAK database before any new ones are generated.

Parameter: ExpansionDepth: Depth of graph expansion from seed groups to get extended groups. IMPORTANT: due to graph fanout this value has a big impact on run time. Use values larger than 1 only for cases with very small seed groups (e.g., 1-5 members) and increase it only one step at a time if needed. It can also be specified via the --depth option.

Parameter: ExpansionLinkTypes: The default link types that should be considered by the expand-groups command when expanding seed groups and computing mutual information between entities. These should correspond to the relevant link types loaded and accessed from the evidence database (for example, telecon, phoneCall, etc.). If no link types are specified the default is anyLink. This is a multi-valued parameter.

Parameter: DisabledExpansionLinkTypes: Used by the KOJAK interface code to record link types that are not currently used in the analysis but that are made available for selection in the interface. It uses the same format as ExpansionLinkTypes.

Parameter: ExpansionMethods

The default strength computation methods considered by expand-groups. These are used to compute connection strength measures between entities and seed groups. This is a multi-valued parameter whose values are treated case-insensitively. Legal values are the following:

MI: computes the mutual information value between an entity and a seed group member based on the complete set of links between two entities.
Group-MI: views the whole seed group as a single individual and computes the MI between this "group individual" and an entity based on all the links between the entity and the "group individual".
MI-or-Group-MI: computes both MI and Group-MI values.
Link-Count: Computes link counts between an entity and seed group members.
Connectivity: Computes the number of seed group members connected to an entity.

The set consisting of Connectivity, Link-Count and Group-MI forms the default.

NOTE: this parameter will eventually go away, since which expansion methods are required can be inferred from the ExpansionMeasures described below.

Parameter: ExpansionMeasures

The default strength computation measures that should be used by expand-groups. Each strength computation method generates a variety of results such as maximums, averages, etc. The set of expansion measures selects a subset of those result measures and averages them to generate a single result measure (this averaging is called "bagging"). This is a multi-valued parameter that can take values from the following set (values are treated case-insensitively):

max-MI: computes the maximum MI between an entity and any of the seed group members as the connection strength to the seed group.
max-MI-correlation: computes max-MI and normalizes it as a correlation value.
sum-MI: sums up the MIs between an entity and each seed group member.
group-MI: computes the MI between an entity and the set of seed group members when viewed as a single super individual (combining all their links).
group-MI-correlation: computes group-MI and normalizes it as a correlation value.
MI-or-group-MI: computes the maximum of the max-MI and group-MI measures.
MI-or-group-MI-correlation: computes MI-or-group-MI and normalizes it as a correlation value.
max-count: computes the maximum link count between an entity and any of the seed group members as the connection strength to the seed group.
sum-count: sums up the link counts between an entity and each seed group member.
connectivity: computes the number of seed members an entity is connected to.

Each of the above measures will be normalized to a value between 0 and 1 to allow proper averaging. The best performing measures in experiments performed so far are sum-count, group-MI and connectivity which form the default set if no measures are specified.
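The normalize-then-average step ("bagging") described above can be sketched as follows. The manual does not say how each measure is normalized to the range 0 to 1, so dividing by the largest value of that measure is purely an assumption here, as are the function name bag_measures and the entity names in the usage example.

```python
def bag_measures(raw_scores):
    """Combine several measures into one score per entity.

    raw_scores maps a measure name to {entity: raw value}. Returns
    {entity: average of the normalized measure values}.
    """
    normalized = {}
    for measure, scores in raw_scores.items():
        top = max(scores.values(), default=0.0)
        # Assumed normalization: divide by the largest value of the measure.
        normalized[measure] = {
            entity: (value / top if top > 0 else 0.0)
            for entity, value in scores.items()
        }
    entities = set().union(*normalized.values())
    # Entities missing from a measure contribute 0 for that measure.
    return {
        entity: sum(scores.get(entity, 0.0) for scores in normalized.values())
        / len(normalized)
        for entity in entities
    }


raw = {
    "sum-count": {"entity-a": 10, "entity-b": 5},
    "connectivity": {"entity-a": 4, "entity-b": 1},
}
print(bag_measures(raw))
```

Because every measure is rescaled before averaging, a measure with large raw values (such as a link count) does not dominate one with small values (such as connectivity).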

Boosting can be useful in cases where we have very few seed members for a group. Instead of running the group expansion only once, we take a small number of the top new members, add them to the seed group (i.e., "boost" the number of seeds) and run again. This is useful if the top members have very high likelihood of being correct. The parameters below control how and how often boosting is done. All of these values can be overridden when calling expand-groups.

Parameter: BoostEnabled: A boolean parameter that determines whether boosting is performed or not. If this is not explicitly specified, then (for backward compatibility) boosting will be done if either BoostMaxCycles or BoostMinSeeds has a value.

Parameter: BoostMaxCycles: Perform at most this many boosting cycles even if the targets have not been reached (leave this undefined if boosting is controlled by BoostMinSeeds).

Parameter: BoostMinSeeds: Boost until seed groups have at least that many members; if both this and BoostMaxCycles are undefined boosting is disabled.

Parameter: BoostFactor: In each boost iteration, increase the number of seed members by this factor.

Parameter: BoostMinStep: In each boost iteration, increase the number of seed members at least by that many members, even if BoostFactor tells us otherwise.
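To show how the boosting parameters might interact, here is a small Python sketch that computes the seed group size after each boost cycle. It is one plausible reading of the descriptions above, not the Group Finder's implementation: rounding the BoostFactor product up, and requiring a step of at least one member, are assumptions, and the function name boost_schedule is hypothetical.

```python
import math


def boost_schedule(n_seeds, factor, min_step, min_seeds=None, max_cycles=None):
    """Return the seed group size after each boosting cycle.

    factor, min_step, min_seeds and max_cycles stand for BoostFactor,
    BoostMinStep, BoostMinSeeds and BoostMaxCycles.
    """
    if min_seeds is None and max_cycles is None:
        return []  # boosting is disabled if neither limit is defined
    if min_step < 1:
        raise ValueError("min_step must be at least 1 in this sketch")
    sizes = []
    while ((min_seeds is None or n_seeds < min_seeds)
           and (max_cycles is None or len(sizes) < max_cycles)):
        # Grow by the factor, but by at least min_step members.
        n_seeds = max(math.ceil(n_seeds * factor), n_seeds + min_step)
        sizes.append(n_seeds)
    return sizes


print(boost_schedule(2, 1.5, 2, min_seeds=10))
print(boost_schedule(2, 1.5, 2, max_cycles=2))
```

With two initial seeds, a factor of 1.5 and a minimum step of 2, the first cycle is driven by the minimum step (2 + 2 = 4) because the factor alone would only reach 3.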

6.1.5 Output Specifications

Parameter: ReportFile: File to which Group Finder results should be reported. If no ReportFormat (which see) is specified, the file extension determines which format to use. The report file can also be specified via the -o or --report-file options. Note that results can also be output to the KOJAK database by using the db-save-groups command.

Parameter: ReportDirectory: Directory path to prepend to ReportFile. This is useful for multiple runs via scripts.

Parameter: ReportFormat: Format to use for report files generated by report-groups. Supported formats are IET-Report (equivalent to IET-Y3-Report), IET-Y2-Report and CSV-Table. If not specified, the format is guessed from the extension of ReportFile: a .csv extension selects the CSV-Table format, and everything else is interpreted as IET-Report.
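The format selection rule is simple enough to state as code. The following Python sketch mirrors the rule as described; the function name is hypothetical, and matching the extension case-sensitively is an assumption.

```python
import os


def guess_report_format(report_file, report_format=None):
    """Pick the report format from ReportFormat or the ReportFile extension."""
    if report_format:
        # An explicitly specified ReportFormat always wins.
        return report_format
    extension = os.path.splitext(report_file)[1]
    # Assumption: the extension is compared case-sensitively.
    return "CSV-Table" if extension == ".csv" else "IET-Report"


print(guess_report_format("groups.csv"))
print(guess_report_format("groups.txt"))
print(guess_report_format("groups.csv", "IET-Y2-Report"))
```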

Parameter: ReportMemberWeights: If true, membership weights will be reported for each group member.

Parameter: WeightDecimals: The number of digits after the decimal point that should be used to print a weight in fixed-point format to the output file (defaults to 6).

Parameter: LastRunCutoff: The computed number of to-be-reported members will usually wind up somewhere within a run of members with equal membership weights. Cutting right there would arbitrarily include and exclude members that are really indistinguishable by their membership weight. LastRunCutoff determines how this last run should be cut. Legal values are Inclusive which keeps the whole run, Exclusive which cuts at the end of the previous run, Hard which cuts right at the computed threshold point or Closest which cuts at the closest endpoint of the run. The default is Inclusive. If the last run is also the first run, it will always be reported in full unless the value is Hard; if we fall on the last member of a run the run will be reported in full even if the cutoff value is Exclusive.
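The four LastRunCutoff settings can be compared with a short Python sketch. It follows the description above but is not taken from the Group Finder sources: the function name cut_last_run is hypothetical, the computed number of members is passed in as k, and resolving a tie under Closest toward the shorter list is an assumption.

```python
def cut_last_run(weights, k, mode="Inclusive"):
    """Return how many members to report.

    weights are membership weights sorted in descending order; k is the
    computed number of to-be-reported members.
    """
    if not weights or k <= 0:
        return 0
    k = min(k, len(weights))
    last = weights[k - 1]
    # Find the run of equal weights that contains position k - 1.
    start = k - 1
    while start > 0 and weights[start - 1] == last:
        start -= 1
    end = k
    while end < len(weights) and weights[end] == last:
        end += 1
    if mode == "Hard":
        return k
    # A first run, or a cut that falls on the last member of a run,
    # is always reported in full.
    if start == 0 or k == end:
        return end
    if mode == "Inclusive":
        return end
    if mode == "Exclusive":
        return start
    if mode == "Closest":
        # Assumption: ties go to the earlier endpoint.
        return start if (k - start) <= (end - k) else end
    raise ValueError(f"unknown LastRunCutoff value: {mode}")


weights = [0.9, 0.8, 0.5, 0.5, 0.5, 0.5, 0.2]
for mode in ["Inclusive", "Exclusive", "Hard", "Closest"]:
    print(mode, cut_last_run(weights, 3, mode))
```

In this example the computed cutoff of 3 lands on the first member of a run of four equal weights, so the modes differ in whether that run is kept, dropped or split.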

Parameter: ReportNames: If true, group and member print names will be looked up via the GROUPS/nameString relation and printed instead of the IDs or names identifying them in evidence. This is useful if groups and members are identified via unique but unreadable IDs in evidence and more readable reports need to be generated. This will only work if appropriate GROUPS/nameString assertions are available in the evidence, or if there are mapping rules that can retrieve them from an EDB.

The following parameters provide some basic control for thresholding of extended groups. Reported groups are sorted in descending order of membership weight and thresholding controls where this list is cut. Automatic thresholding is still not very well supported by the Group Finder, since there are no clear transitions indicating where true members stop and non-members start. Besides the very basic controls provided below, specialized functions can be written that provide thresholding appropriate for a particular domain (see the threshold-groups command).

Parameter: ReportGroupFraction: Fraction of individuals in the extended graph that should be reported in the result. Includes seed members; 1.0 reports the whole group. This can also be specified via the -gf or --report-group-fraction options.

Parameter: ReportMinMembers: Minimum number of group members to report (including seeds).

Parameter: ReportMaxMembers: Maximum number of group members to report (including seeds).
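How these three thresholding parameters combine is not spelled out here, so the following Python sketch shows only one plausible combination: take the fraction of the extended group (rounded up), raise it to the minimum, then cap it at the maximum and at the group size. The function name, the rounding and the precedence of the maximum over the minimum are all assumptions.

```python
import math


def members_to_report(total, fraction=1.0, min_members=None, max_members=None):
    """Compute how many members (including seeds) to report.

    fraction, min_members and max_members stand for ReportGroupFraction,
    ReportMinMembers and ReportMaxMembers; total is the extended group size.
    """
    count = math.ceil(fraction * total)  # assumed rounding: round up
    if min_members is not None:
        count = max(count, min_members)
    if max_members is not None:
        count = min(count, max_members)  # assumed: the maximum wins
    # Never report more members than the group actually has.
    return min(count, total)


print(members_to_report(40, fraction=0.25))
print(members_to_report(40, fraction=0.1, min_members=8))
print(members_to_report(40, fraction=1.0, max_members=15))
```

The resulting count is the kind of computed threshold point that LastRunCutoff then adjusts to the boundaries of a run of equal weights.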

6.2 Command-Line Options

What data is analyzed by KOJAK and how it analyzes it is controlled by a large number of parameters given in the `configuration.dat' configuration file in the KOJAK directory. See the previous section for documentation of the various available configuration parameters. Many of these parameters can also be specified and overridden from the Unix command line. Here is the full list of available command-line options with a brief description of their functionality:

run-kojak [{-c|--configuration-file} FILE] [-i|--interactive] [{-l|--log-level} LEVEL] [{-d|--data} DATASPEC]* [{-pd|--primary-data} DATASPEC]* [{-sd|--secondary-data} DATASPEC]* [{-k|--kojak-db} DBSPEC] [{-n|--dataset-name} NAME] [{-t|--dataset-type} TYPE] [{-r|--run-id} ID] [{-rp|--run-prefix} PREFIX] [{-s|--run-script} FILE] [--depth N] [{-o|--report-file} FILE] [{-gf|--report-group-fraction} FRACTION]

`--configuration-file FILE'
`-c FILE': Specifies FILE to be used as the configuration file. By default, `configuration.dat' in the KOJAK directory is used. If there is no valid configuration file specified or available, the Group Finder terminates immediately.
`--interactive'
`-i': Run the Group Finder interactively by bringing up the PowerLoom command loop. At that point any PowerLoom or KOJAK command can be executed. Calling the run-kojak command will run the Group Finder end-to-end.
`--log-level LEVEL'
`-l LEVEL': Sets the LogLevel parameter to LEVEL.
`--data DATASPEC'
`-d DATASPEC': Sets the Data parameter to DATASPEC. This option can be supplied multiple times.
`--primary-data DATASPEC'
`-pd DATASPEC': Sets the PrimaryData parameter to DATASPEC. This option can be supplied multiple times.
`--secondary-data DATASPEC'
`-sd DATASPEC': Sets the SecondaryData parameter to DATASPEC. This option can be supplied multiple times.
`--kojak-db DBSPEC'
`-k DBSPEC': Sets the KojakDB parameter to DBSPEC.
`--dataset-name NAME'
`-n NAME': Sets the DatasetName parameter to NAME.
`--dataset-type TYPE'
`-t TYPE': Sets the DatasetType parameter to TYPE.
`--run-id ID'
`-r ID': Sets the RunID parameter to ID.
`--run-prefix PREFIX'
`-rp PREFIX': Sets the RunPrefix parameter to PREFIX.
`--run-script FILE'
`-s FILE': Sets the RunKojakScript parameter to FILE.
`--depth N': Sets the ExpansionDepth parameter to N.
`--report-file FILE'
`-o FILE': Sets the ReportFile parameter to FILE.
`--report-group-fraction FRACTION'
`-gf FRACTION': Sets the ReportGroupFraction parameter to FRACTION.
`--define PAR=VAL, --define PAR+=VAL'
`-D PAR=VAL, -D PAR+=VAL': Can be used to set an arbitrary configuration parameter PAR to value VAL. Both = and += syntax are supported.

This document was generated by Hans Chalupsky on October 30, 2007 using texi2html