5. Input Data

The Group Finder needs two types of input data: group seeds, which identify the starting points for group expansion, and link data, which is analyzed to expand the groups.

The data can also be subdivided into access classes based on ease of availability or cost of access. Primary data can be accessed easily and cheaply, while secondary data is harder or more expensive to access.

The data can also be presented in one of several formats.

These options are described in the following sections.

5.1 Types of Data

5.2 Data Access Classes

5.3 Data Formats

5.1 Types of Data

Note that a fundamental assumption underlies all of the data: different entities have unique names. One consequence is that if data is transferred from a system that uses catch-all categories such as unknown or other to refer to several unnamed entities, the data must be prepared by generating unique ids for all such unnamed entities. This can be accomplished by adding a numeric suffix to each occurrence of such a category so as to create unique individual names.
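
The suffixing step can be sketched in a few lines of Python. This is only an illustration of the preparation described above, not part of the Group Finder: the function name, the set of catch-all labels and the `-N' suffix style are assumptions, and rows are taken to be lists of the form [relation, arg1, arg2, ...].

```python
# Hypothetical catch-all labels; adjust to the conventions of the source system.
CATCH_ALL = {"unknown", "other"}

def uniquify_catch_alls(rows, catch_all=CATCH_ALL):
    """Replace every catch-all argument with a fresh, unique name.

    Each row is [relation, arg1, arg2, ...]. Every occurrence of a
    catch-all label is treated as a distinct unnamed entity.
    """
    rows = [list(row) for row in rows]
    # Names already in use, so that generated names never collide with them.
    taken = {arg for row in rows for arg in row[1:]}
    counters = {}
    result = []
    for relation, *args in rows:
        new_args = []
        for arg in args:
            key = arg.lower()
            if key in catch_all:
                # Advance the counter until the candidate name is unused.
                while True:
                    counters[key] = counters.get(key, 0) + 1
                    candidate = f"{arg}-{counters[key]}"
                    if candidate not in taken:
                        break
                taken.add(candidate)
                new_args.append(candidate)
            else:
                new_args.append(arg)
        result.append([relation, *new_args])
    return result
```

Note that this treats every occurrence of a catch-all label as a different entity. If the source system offers any way to tell that two occurrences refer to the same entity, that information should be used instead.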

5.1.1 Group Seeds

5.1.2 Link Data

5.1.1 Group Seeds

Group seed data describes one or more known or unknown groups, with one or more known or hypothesized members for each seed. As mentioned previously, the Group Finder performs seeded group extension and therefore must have at least one seed group to start with. The rationale behind this seeded approach is (1) that it allows for a focused search through a potentially huge data space, and (2) that the seeds, if chosen properly, will automatically lead to the detection of groups of interest (e.g., threat groups). This is very important because, in general, every large dataset will contain large numbers of groups that are benign and uninteresting. Since KOJAK's extended groups are formed around seeds, they have a much higher likelihood of being of the same kind as the seed group. The approach has also paid off with respect to scale: it has allowed us to successfully analyze datasets with close to 10,000,000 links. That said, there are situations where seed information is not available, and we are currently developing methods to generate such seeds automatically if necessary.

Group seed data is generally fairly small. In the simplest case it involves the specification of one or more groups with one or more known members for each of them. For example, the following specifies a seed group in comma-separated-value (CSV) syntax (which is described in more detail below):

KnownGroup, SeedGroup1
groupMember, SeedGroup1, MemberA
groupMember, SeedGroup1, MemberB
groupMember, SeedGroup1, MemberC
groupMember, SeedGroup1, MemberD

The first line specifies SeedGroup1 as a group of type KnownGroup, and the following lines specify 4 known (or suspected) members of the group. The order of these lines is insignificant. The terms KnownGroup and groupMember are reserved words interpreted by the Group Finder. It is possible to use different terms (i.e., a different "ontology"), but then appropriate mapping specifications are needed.
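
When seeds come from another program, seed rows of this shape can be generated rather than typed by hand. The following Python sketch assumes nothing beyond the reserved words named in this manual (KnownGroup, UnknownGroup and groupMember) and the comma-escaping rule of the CSV format; the function names are hypothetical.

```python
GROUP_TYPES = ("KnownGroup", "UnknownGroup")

def escape(name):
    """Escape commas with a backslash, as the CSV syntax requires."""
    return name.replace(",", "\\,")

def seed_lines(group, members, group_type="KnownGroup"):
    """Return the CSV lines that declare one seed group and its members."""
    if group_type not in GROUP_TYPES:
        raise ValueError(f"unsupported group type: {group_type!r}")
    # One type specification for the group itself ...
    lines = [f"{group_type}, {escape(group)}"]
    # ... followed by one membership assertion per known or suspected member.
    lines += [f"groupMember, {escape(group)}, {escape(m)}" for m in members]
    return lines

if __name__ == "__main__":
    print("\n".join(seed_lines("SeedGroup1", ["MemberA", "MemberB"])))
```

Writing the returned lines to a file with a `.csv' extension yields a seed file in the format shown above.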

The Group Finder distinguishes between KnownGroups and UnknownGroups. Known groups are assumed to be groups with known identity (e.g., the RAF), and therefore such known groups will never be merged with each other (even though their members might overlap). Unknown groups are groups whose identity we do not know. An unknown group might turn out to be (part of) a known group or another unknown group, so the Group Finder will attempt to merge unknown groups with each other or with some known group if they are similar enough. Note that knowing a group's identity does not mean that we know its members, so extension of known groups is still a very useful thing to do.

Depending on the domain, it is possible to model some aspects of it and infer additional seed groups or members from other data or link information. This is one of the functions of the logic module, which uses a domain ontology and rules to do that. These rules, however, are highly dependent on a particular application domain (such as IET's simulated data generated within the EAGLE program) and need to be handcrafted for each such domain. For example, the file `kbs/iet-y3-seed-constraints.plm' encodes such a set of rules and constraints for IET's simulated data. The released version of the Group Finder is generic and domain independent, however, and there is no set of generic, domain-independent seed generation rules. For a typical generic application, it is assumed that seed information will be given explicitly as described above. If a domain is more complex and has additional data that can be exploited, this domain information could be modeled similarly to `kbs/iet-y3-seed-constraints.plm' to allow the automatic derivation of additional seed information.

5.1.2 Link Data

Link data describes known connections of different types between the entities (individuals) in a dataset. These entities should overlap with the ones provided in the seed groups, but, in general, the data will also contain many other, irrelevant entities. The Group Finder performs its main work on this link data to find the few (hopefully relevant) individuals that are strongly connected to the seeds. It does so by statistically analyzing and comparing the links from seed individuals to others in the data. For this analysis to be significant, there needs to be a sufficient amount of link data; the more, the better. A good example is communication data such as emails or phone calls, where each email or phone call can be viewed as a separate link between two entities. There might therefore be hundreds or thousands of links between a pair of entities for each link type. Financial transactions or other transactional data are also good candidates. Here is an example fragment of a link database in CSV syntax:

...
phoneCall,In9999,In15132
phoneCall,In9999,In24521
phoneCall,In9999,In27306
telecon,In10005,In11120,In15968,In19067,In2284,In32055,In6166
telecon,In10009,In18419,In28735,In33962,In35311,In36690,In4777,In5866
telecon,In10013,In1325,In24781,In33373,In38164,In38830,In4194,In8312
telecon,In10013,In1325,In24781,In33373,In38164,In38830,In4194,In8312
...

The first element in a row specifies the type of link such as phoneCall or telecon. These link types are not predefined and can be chosen arbitrarily. However, they should represent distinctions that are meaningful. Depending on how the Group Finder was configured, they might have to correspond to relations defined in the underlying ontology.

The remaining arguments are the set of individuals that participated in the particular link, transaction or event. For example, all participants of a telecon can be listed and are then assumed to be pairwise connected by a link of this type. Rows might be duplicated (such as the last one) indicating multiple events or links with the same participants. Again, the order of these rows is insignificant.
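
To make the pairwise reading concrete, the following Python sketch expands each row into its participant pairs and counts how often each pair is linked per link type. It illustrates the interpretation described above and is not the Group Finder's internal representation. The function name is hypothetical, and treating pairs as unordered (and ignoring a participant repeated within one row) are assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations

def pairwise_link_counts(rows):
    """Count links per (link type, entity, entity) for event-style rows.

    Each row is [link_type, participant1, participant2, ...]. All
    participants of a row are taken to be pairwise connected, and a
    duplicated row counts as an additional link for each of its pairs.
    """
    counts = Counter()
    for link_type, *participants in rows:
        # Sorting makes (a, b) and (b, a) the same unordered pair.
        for a, b in combinations(sorted(set(participants)), 2):
            counts[(link_type, a, b)] += 1
    return counts

if __name__ == "__main__":
    rows = [
        ["phoneCall", "In9999", "In15132"],
        ["telecon", "In10013", "In1325", "In24781"],
        ["telecon", "In10013", "In1325", "In24781"],
    ]
    for key, n in sorted(pairwise_link_counts(rows).items()):
        print(key, n)
```

A telecon with n participants contributes n(n-1)/2 pairs, so large events add many links at once.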

The Group Finder can function with data that contains just a single link type, but the mutual information model takes different link types into account, and the more such link-type-differentiated information is available, the better.

Example datasets derived from one of IET's simulated datasets are shipped in the `data/example*' subdirectories. They are provided in a comma-separated value (CSV) format that might be the easiest to imitate when applying the Group Finder to new domains. The data files have some more comments describing their structure. They contain about 10,000 entities and 250,000 links connecting them.

The Group Finder can read input data in a variety of formats described in the following sections.

5.2 Data Access Classes

Data can be differentiated based on a general notion of cost of access. If this is not a concern for a particular application environment, undifferentiated data may be used.

5.2.1 Primary Data

5.2.2 Secondary Data

5.2.3 Undifferentiated Data

5.2.1 Primary Data

Primary data are easy and cheap to access. They can be used freely for analysis and permit extremely general queries. In particular, link data can be queried without specifying any of the participants.

5.2.2 Secondary Data

Secondary data are more difficult or expensive to access. The Group Finder code takes this into consideration and tries to minimize the amount of secondary data used in the analysis. It is also assumed that querying of such data is more limited, in that fully open-ended queries are not permitted. For example, rather than retrieving all link information, a query might be required to specify at least one of the participants in the link.
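
The difference in query generality between the two access classes can be illustrated with a small Python sketch. The class below is purely hypothetical and does not correspond to any Group Finder interface; it only models the restriction described above, namely that a query against secondary data must name at least one participant, and it counts queries as a stand-in for access cost.

```python
class LinkStore:
    """Hypothetical in-memory link store with an access class."""

    def __init__(self, rows, access_class="primary"):
        if access_class not in ("primary", "secondary"):
            raise ValueError(f"unknown access class: {access_class!r}")
        self.rows = [tuple(row) for row in rows]
        self.access_class = access_class
        self.queries = 0  # number of queries answered so far

    def query(self, link_type=None, participant=None):
        """Return the rows that match the given link type and participant."""
        # Secondary data does not permit fully open-ended queries.
        if self.access_class == "secondary" and participant is None:
            raise ValueError("secondary data: a participant must be specified")
        self.queries += 1
        return [
            row for row in self.rows
            if (link_type is None or row[0] == link_type)
            and (participant is None or participant in row[1:])
        ]
```

With such a store, an analysis would first narrow down candidate entities using primary data and only then issue participant-specific queries against the secondary store.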

5.2.3 Undifferentiated Data

Undifferentiated data are data that do not have differential access costs. The entire dataset is available for all analysis, and no special processing is required.

5.3 Data Formats

The Group Finder supports several data formats, including comma-separated value files, a special IET data format, PowerLoom format and relational databases such as MySQL. Of these, the comma-separated value format is the most easily used and adapted.

5.3.1 Comma-Separated Value Format

5.3.2 IET Format

5.3.3 PowerLoom Format

5.3.4 Relational Databases

5.3.1 Comma-Separated Value Format

The comma-separated value (or CSV) format is probably the easiest to use for applying the Group Finder to a new domain. Each CSV file should have a `.csv' extension (or file type) so the Group Finder will use the appropriate data input method. CSV files have the following syntax:

Empty lines and lines starting with a # character will be ignored.
All other content lines should have the following syntax:
<relation>, <arg1> {, <arg>}*
Whitespace following a comma up to the next non-whitespace character will be ignored. Relations and arguments can contain arbitrary characters, but commas need to be escaped with a \ character.

A content line with one argument will be interpreted as a type specification, for example:

KnownGroup, Group1

The relations groupMember and memberAgents will be interpreted as a binary membership assertion between a group and a group member. For example:

groupMember, Group1, Member1

All other relations will be interpreted as event-style links. For example,

telecon, pA, pB, pC

will create a new link (or event) object of type telecon and attach pA, pB and pC as the link arguments (or event participants). The exact internal representation will differ depending on whether this link is loaded into PowerLoom or into the KOJAK EDB. However, the treatment and interpretation will be the same.
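
The syntax and interpretation rules of this section can be summarized in a short Python sketch. It is not the Group Finder's own reader, and the function names and the returned tuples are hypothetical. The sketch follows the rules literally: only whitespace after a comma is dropped, and a backslash is treated as an escape only when it directly precedes a comma. Treating whitespace-only lines as empty and rejecting membership rows that are not binary are assumptions.

```python
MEMBERSHIP_RELATIONS = ("groupMember", "memberAgents")

def split_fields(line):
    """Split a content line on unescaped commas."""
    fields, current = [], []
    i, skipping = 0, False  # skipping: inside whitespace that follows a comma
    while i < len(line):
        ch = line[i]
        if ch == "\\" and line[i + 1 : i + 2] == ",":
            current.append(",")  # escaped comma becomes part of the field
            skipping = False
            i += 2
            continue
        if ch == ",":
            fields.append("".join(current))
            current, skipping = [], True
        elif skipping and ch.isspace():
            pass  # whitespace directly after a comma is ignored
        else:
            current.append(ch)
            skipping = False
        i += 1
    fields.append("".join(current))
    return fields

def parse_line(line):
    """Return (kind, relation, args) for a content line, or None if ignored."""
    line = line.rstrip("\r\n")
    if not line.strip() or line.startswith("#"):
        return None
    relation, *args = split_fields(line)
    if not args:
        raise ValueError(f"at least one argument is required: {line!r}")
    if relation in MEMBERSHIP_RELATIONS:
        if len(args) != 2:
            raise ValueError(f"membership assertions are binary: {line!r}")
        return ("member", relation, args)
    if len(args) == 1:
        return ("type", relation, args)
    return ("link", relation, args)
```

For example, parse_line("telecon, pA, pB, pC") yields a link of type telecon with the three participants, while parse_line("KnownGroup, Group1") yields a type specification.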

5.3.2 IET Format

The IET format was developed by Information Extraction and Transport, Inc. to represent synthetic data. It is a Lisp-style format that uses CycL as the representation language. For example:

(isa Gr-16511 ThreatGroup)
(exploitsVulnerabilities Gr-16511 Mo-18843)
(memberAgents Gr-16511 In-17687)
(memberAgents Gr-16511 In-11314)

This format is not intended to be a primary input format. It is supported only so that the Group Finder can be applied to synthetic data generated by the IET simulators (there are two different variants, for Y2 and Y3 datasets, and both are supported). IET format files need to use the file extension .iet in order to be properly recognized and handled by the Group Finder.
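
Since the input method is chosen by file extension, it can be useful to check file names before starting a run. The following Python sketch lists the three extensions named in this manual (.csv, .iet and .plm); the function name and the returned labels are hypothetical, and how the Group Finder itself dispatches on extensions (for example, whether the match is case sensitive) is not specified here.

```python
from pathlib import Path

# File extensions named in this manual and the format each one selects.
FORMATS = {".csv": "CSV", ".iet": "IET", ".plm": "PowerLoom"}

def input_format(filename):
    """Return the format label implied by the file extension."""
    suffix = Path(filename).suffix
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unrecognized extension {suffix!r}: {filename}") from None
```

A name such as `links.txt' is rejected by this check, which mirrors the requirement that data files carry the extension of their format.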

5.3.3 PowerLoom Format

PowerLoom is the representation and reasoning engine underlying the KOJAK Group Finder. Consequently, all legal PowerLoom input can appear in PowerLoom data files. PowerLoom uses KIF as its input language which is a Lisp-style syntax for predicate logic. For example:

(ASSERT (KnownGroup UID-Group-34988))
(ASSERT (groupMember UID-Group-34988 UID-Indvdl-10641))
(ASSERT (groupMember UID-Group-34988 UID-Indvdl-14179))
(ASSERT (groupMember UID-Group-34988 UID-Indvdl-20265))
(ASSERT (groupMember UID-Group-34988 UID-Indvdl-2206))

PowerLoom files need to use the file extension .plm in order to be properly recognized and handled by the Group Finder. More information about the PowerLoom system and its input syntax can be found at http://www.isi.edu/isd/LOOM/PowerLoom/.
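
Seed information held in CSV-style rows can be rendered as assertions of the shape shown above. The Python sketch below covers only type specifications and groupMember assertions, because those are the two forms illustrated in this section; the function name is hypothetical, and the sketch assumes that all names are already legal PowerLoom symbols (it performs no quoting or escaping).

```python
def to_powerloom(relation, args):
    """Render a seed row as a PowerLoom ASSERT statement.

    Supports type specifications (one argument) and groupMember
    assertions (two arguments), the two forms shown in this section.
    """
    args = list(args)
    is_type_spec = len(args) == 1 and relation != "groupMember"
    is_membership = len(args) == 2 and relation == "groupMember"
    if not (is_type_spec or is_membership):
        raise ValueError(f"unsupported seed row: {relation}, {args}")
    return f"(ASSERT ({relation} {' '.join(args)}))"

if __name__ == "__main__":
    seed_rows = [
        ("KnownGroup", ["UID-Group-34988"]),
        ("groupMember", ["UID-Group-34988", "UID-Indvdl-10641"]),
    ]
    for relation, args in seed_rows:
        print(to_powerloom(relation, args))
```

Event-style links are deliberately left out, since their PowerLoom representation is not shown here.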

5.3.4 Relational Databases

Finally, data can come directly from a relational database such as MySQL or Oracle. No specific schema is assumed. For each schema, however, an appropriate mapping specification needs to exist that maps the database schema onto the Group Finder ontology. The Group Finder ships with various predefined mappings that support IET Y2, IET Y3 as well as KOJAK evidence databases (EDBs). CSV files can be imported into the KOJAK EDB, which is useful for very large link databases or to support multiple analysis runs. Example mapping specifications can be found in the `kbs' directory, for example, `kbs/ali-baba-edb-schema.plm', `kbs/iet-y2-edb-schema.plm', `kbs/iet-y3-edb-schema.plm' and `kbs/kojak-edb-schema.plm'. Writing such mapping specifications is somewhat complex and requires some knowledge of the underlying PowerLoom system. It is described in more detail in section 8, Advanced Configuration.

This document was generated by Hans Chalupsky on October, 30 2007 using texi2html