Mining Sequential Database by Layered Phases

 

A recognized group of large temporal databases consist of a collection of sequential action-perception data. Some of these databases have some unique characteristics, which ask for a radically different approach for mining useful knowledge. The huge size and uncertainty embedded in these collections make it difficult and impractical to mine the whole database at once.

This research proposes a domain independent novel technique to mine temporal interactive databases through layered phases. I introduce a novel method to mine sequential data by layered phases for prediction and forecasting. The main goal is to find useful knowledge and discover hidden patterns to predict an upcoming sequence in both short term and long term.

The concept of phase, rhythm, mode, regime and hierarchy plays a fundamental role in human cognition as well as many real world application domains. A structure, model, organization or behavior could be projected as a group of smaller entities related to each other. The concept of mining by layered phase binds with two directions: granulation and organization. In a general setting, granulation involves a decomposition of whole into parts, while organization involves an integration of parts into whole. In more specific terms, granulation relates to partitioning a class of points, objects, states etc into granules and phases, with a granule or a phase being a clump of points, states, functions or entities drawn together by locality, similarity, functionality or other features.

We apply such a point of view to mine interactive sequential databases. To construct model of a pool of sequence of interactions between an agent and environment we project the model to a layer of simpler models interacting with each other. The whole model will be build gradually in a top-down fashion and in each level we may drill down and make a model with more details.

This technique consists of tow major steps: Sequence Segmentation and Model Construction. Segmentation consists of following steps: 1) perception selection, in which system selects a set of percepts for signature extraction 2) signature extraction; to extract the signature of a sub-sequence and 3) phase recognition to find major component of the model. Model Construction consist of the following modules 1) Rolling up, in which we make super class of smaller entities to represent a new level of abstraction 2) Drilling down to drill down in each component and to make a more accurate model with more details.

Mining by layered phases is investigated in the context of four different domains: Simulated data which we used for test and tuning the algorithms; Critical Care, a large databases of patients in critical care condition; RoboCup a dynamic simulation of soccer in which each agent react to a dynamic environment and World Wide Web to provide dynamic web construction and for user classification. The Initial result with simulated data shows a significant increase in terms of prediction accuracy of next observation. Meanwhile we have shown the complexity of such technique is low.

I show data mining by layered phases has several advantages in terms of accuracy in predictions, interpretability, optimization and speed. Moreover we compare the current technique to conventional techniques to show the complexity of mining process is less than conventional methods in which the problem considers as one entity.
Introduction

1.1 Problem Statement

With the current rate of widespread use of information systems, which have recently featured explosive growth in their sizes, several groups in industry and educational sector are faced with a problem of making use of the stored data. These databases have a wide variety, from a pool of genes, medical databases, biological information, stock market analysis, customer information, Internet data, Network monitoring information, financial forecast, and multi agent's behavior analysis. Currently there is a great interest in building information miner suites to find hidden patterns, recognize similarity among individuals, learn policies from interactions and detect new phenomenon in such domains. A major group of these domains consists of interactive sequences of action-perception pairs. To name a few: medical diagnosis including physician treatment and patient response to the given treatment; Robot navigation including robot sensory request and environment response and Network behavior monitoring, including packet transmission, switch behavior and network response.

These databases are rather huge collections of uncertain temporal data presenting complex features. They share some important and unique characteristics, which ask for a radically different approach for knowledge discovery and data mining. These four radically different characteristics are: 1) system gets the data gradually and the time for decision making is very crucial, 2) only a relatively low number of key observation is monitored, or a huge set of features is available 3) data sampling period might be irregular, and 4) system behaves reactively. Mining huge collection of sequential temporal database has been a recurrent issue in the past few years, but all efforts to address this issue has been facing difficulties in term of interpretation, speed, model construction and generality.

1.2 Solution approach

This proposal is a novel approach to mine temporal interactive databases by layered phases to make a multi-layer abstract model and to predict an upcoming sequence in a short term and long term run. In this approach at the beginning we try to make an abstract model of a pool of sequences including a series of perception-action pair. We extract some features from sequences to recognize and build such abstract model. In next step we drill down and make a model of internal data inside each part of the model. This process continues until users feedback or when it gets lower than a signal to noise ratio.

A data mining approach result could be more accurate, understandable and interpretable if it uses of such information as a tool in knowledge discovery process. Adding the infrastructure information of a given domain to knowledge discovery process and to use of such know how in data mining techniques helps us to provide a domain independent algorithm for similar systems which share the same features.

The notion of hierarchy and organization plays an important role in this approach. The concepts of granulation and organization play fundamental roles in human cognition as well as in the nature and in a large group of real application domains. In a general setting, granulation involves a decomposition of whole into parts and organization involves an integration of parts into whole. Even it could remind us the old familiar divide and conquer technique but we rather to look at this process as mining by layered phases. A model structure could be considered as a set of smaller model at one time and a group of such structure may make a new bigger entity for a bigger structure. In more specific terms, information granulation relates to partitioning a class of points, objects, states etc into granules, with a granule being a clump of points drawn together by locality, similarity, or functionality. An observed sequence of a system might be considered as a collection of a certain behaviors (rather than a big collection of states inside each behavior), and it might be good enough for reasoning. These systems like many other systems in the nature have the concept of organization as the main feature in their structure.

Data mining by layered phases has several advantages in terms of accuracy, interpretability, optimization and speed. The complexity of mining process is considerably less than conventional methods in which the whole system considers as one entity. However for a given database after making a rough sketch of abstract model we may drill down to lower level for further investigation.

There are several recognized systems shows same behavior with similar structure and characteristics in medicine, astronomy, Internet, marketing, and many other fields as multi agent theory and its applications. In this project we investigate on the application of mining by layered phases in three different domains in which I have spend a handful experience on each one. The first application is a fairly huge trauma care patient database hospitalized in a special Critical Care unit in Intensive Car Unit (ICU). The current database has been collected during past 15 years form 2 major hospital in LA County. The second application is simualted RoboCup domain. We would like to mine a pool of log files of agent's player in a simulated soccer game. The database is collected form RoboCup 97 and 98 from ISIS team and a few other participants in these competitions.

1.3 Brief description of approach

I now outline the primary issues to be addresses in developing a miner system uses of layered phases as the main core of this system. Figure 1 shows an overview of general approach. The main input to our model is a database of sequential perception-action data, statistical data about the domain and relevant domain knowledge. The output is the system capability to predict the out come of a given new sequence and to make a model of the whole database. The main idea is to build a high level model from a given database at first step and leave the details for the next step.

The application of mining by layered phases for short term and long term prediction is driven by the important insight that learning, model construction and mining with the presence of hidden state have shared crucial feature: they both begin learning, constructing and mining without knowing the final granularity of the model state space. The more fine prediction, suggestion and advice the more error might be observed.

The main goal is to build model observing a database of sequence . The conventional methods could not be applied as the error for huge and complex domains increase fast and the result of prediction is not reliable. In a departure of conventional techniques we would like to make a hierarchical . Rather than making from in one shot we make it gradually. In a general point of view could be any model as long as it could be represent with in a layered phase manner.

The algorithm presented here is closely tied to formal state model, the miner has a partially observable non-Markov decision process to represent its internal state distinctions, and it uses a variation on a standard technique, Baum-Welch, to learn the parameters of the model.

In a departure from standard techniques, the algorithm adds new distinctions to the model by making a top-down layer hierarchy from phases. The novel aspect of this algorithm is the technique it uses to determine how a phase should extract from a sequence and how a model will be made by rolling up and drilling down.

 

 

 

Research Subjects

  • Mining By Layered Phases
    • Signature Extraction
    • Pattern Representation
    • Phase Recognition
    • Case-based Pattern recognition

  • Self-Simialrity in Data Mining
    • Self-Similar Layered Hidden Markov Model
    • Self-Simialr Assocaition Rules
    • Exploting Slef-Simialrity in Mreditive Modeing, Study of Network Data

  • Recursive Learning

Application

  • Critical Care Database : a fairly huge trauma care patient database hospitalized in a special Critical Care unit in Intensive Car Unit (ICU). The current database has been collected during past 15 years form 2 major hospital in LA County

  • Network Data: collected during 18 weeks, from October 16 1994 to February 12 1995,on Cabletron corporate network. There are 16849 entries, representing measurements roughly every 10 minutes for 18 weeks. This network has a router with 16 ports connected to 16 links. The packet traffic of each port is investigated independently. There are 16 ports Pn on the router that connect to 16 links, which in turn connect to 16 Ethernet subnets. There are three independent variables: Bandwidth: the percentage of bandwidth utilization of a port during a 10 minute period. Packet Rate: the rate at which packets are moving through a port per minute. Collision Rate: the number of packets during a 10 minute period that have been sent through a port over the link but have collided with other packets.