A Metapattern-Based Automated Discovery Loop for Integrated Data Mining

Information Sciences Institute and Computer Science Department at

University of Southern California

  • Project Summary

This research is developing a metapattern-based discovery loop for integrated data mining. Metapatterns (also known as metaqueries) are second-order, declarative expressions that specify the types of patterns to be discovered and help humans focus on the more fruitful search directions. The discovery loop is a search engine that integrates deduction, induction, external guidance from humans, and internal guidance from inter-component dependencies. Given a completely new database, the system first generates an initial set of the most general metapatterns from the meta-information of the database, and then executes these metapatterns against the database to discover actual patterns. Based on the results, new metapatterns are generated dynamically by adding more constraints to the more plausible metapatterns. In this iterative process, human discoverers can analyze, create, select, and execute metapatterns, or instruct the system to pursue metapatterns on its own. This ability not only makes data mining more efficient and productive (more expert users can use the system as inspiration for better metapatterns, and less expert users can learn how to perform data mining in a particular domain by observation), but also provides a new method for unsupervised learning of probabilistic, relation-based patterns.
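As an illustration of what executing a metapattern against a database can look like, the following minimal Python sketch instantiates a single transitivity metapattern, P(X,Y) and Q(Y,Z) => R(X,Z), over a toy set of binary relations. It is not the DataCrystal implementation: the relation names, the rows, the thresholds, and the functions `evaluate_transitivity` and `discover` are all hypothetical, and the sketch covers only the execution step, not the generation of refined metapatterns.

```python
from itertools import product

# Toy database: each binary relation is a set of (x, y) tuples.
# All relation names and rows are made up for illustration.
db = {
    "citizen": {("john", "usa"), ("mei", "china"), ("ana", "brazil")},
    "official_language": {
        ("usa", "english"), ("china", "mandarin"), ("brazil", "portuguese"),
    },
    "speaks": {("john", "english"), ("mei", "mandarin"), ("ana", "spanish")},
}


def evaluate_transitivity(p, q, r):
    """Instantiate P(X,Y) and Q(Y,Z) => R(X,Z); return (support, confidence)."""
    # Join P and Q on the shared variable Y.
    joined = {(x, z) for (x, y1) in db[p] for (y2, z) in db[q] if y1 == y2}
    if not joined:
        return 0, 0.0
    # Confidence is the fraction of joined pairs that also appear in R.
    hits = joined & db[r]
    return len(joined), len(hits) / len(joined)


def discover(min_support=2, min_confidence=0.5):
    """Try every assignment of distinct relations to P, Q, R; keep plausible ones."""
    found = []
    for p, q, r in product(db, repeat=3):
        if len({p, q, r}) < 3:
            continue
        support, confidence = evaluate_transitivity(p, q, r)
        if support >= min_support and confidence >= min_confidence:
            found.append((p, q, r, support, confidence))
    return found


patterns = discover()
for p, q, r, support, confidence in patterns:
    print(f"{p}(X,Y) and {q}(Y,Z) => {r}(X,Z)  "
          f"support={support} confidence={confidence:.2f}")
```

In a fuller loop, patterns that pass the thresholds would be the ones selected for refinement with additional constraints, either by the system or by a human discoverer.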

The major effort in the last year of this project has focused on the second research thrust: mining sequential data based on "Layered Phases". This research proposes a novel, domain-independent technique for mining temporal databases through hierarchical modeling. The main goal is to find useful knowledge and discover hidden patterns in order to predict an upcoming sequence in both the short term and the long term. This work comprises four major activities: 1) mining through layered phases; 2) mining through self-similar layered phases, a special form of layered structure; 3) recursive learning, which applies the idea of self-similarity to machine learning; and 4) case-based pattern recognition. Also, as this project has been investigated in different domains, the following research activities have been initiated and developed during the past years: 1. Application of Data Mining and Knowledge Discovery in Health Care 2. Mining Network Databases 3. Mining Agents' Behavior

  • Keywords

Data Mining, Knowledge Discovery from Databases, Machine Learning, Automatic Modeling of Legacy Databases, Meta-Pattern Guided Discovery, Integrated Data Mining Systems.

 

  • Goals, Objectives, and Targeted Activities

From June 1, 1997 through May 31, 1998, we have been focusing on applying the Metapattern technology to automatically analyzing legacy relational databases and discovering useful metadata -- characterizations of the database's intended semantics.  The main problem faced when trying to incorporate legacy databases into modern knowledge-based systems is the difficulty of obtaining such metadata. The problem is particularly acute when the data is noisy and access to human domain experts is limited -- both frequent occurrences.  The AutoModel tool developed under the methodology of DataCrystal uses advanced data mining techniques to analyze database contents.  By comparing data and attribute names in different tables, in a multi-step process, it identifies potential key and foreign key attributes and ultimately proposes an entity-relationship (ER) model for the data.  Expert knowledge about the database and the domain can be used to control the process.
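The key and foreign-key identification step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not AutoModel's actual algorithm: the table layout, the sample rows, the 0.9 inclusion threshold, and the helper names `column_values`, `candidate_keys`, and `candidate_foreign_keys` are all hypothetical. A column is treated as a candidate key if its values are non-null and unique, and as a candidate foreign key if most of its values appear in a candidate key column of another table; the threshold below 1.0 is one simple way to tolerate noisy data.

```python
# Toy tables, invented for illustration only.
tables = {
    "depot": {
        "columns": ["depot_id", "city"],
        "rows": [(1, "Norfolk"), (2, "San Diego"), (3, "Norfolk")],
    },
    "shipment": {
        "columns": ["ship_id", "depot_id", "weight"],
        "rows": [(10, 1, 5.0), (11, 1, 7.5), (12, 3, 5.0), (13, 2, 9.0)],
    },
}


def column_values(table, col):
    """Return the list of values stored in column `col` of `table`."""
    i = table["columns"].index(col)
    return [row[i] for row in table["rows"]]


def candidate_keys(table):
    """Columns whose values are all non-null and pairwise distinct."""
    keys = []
    for col in table["columns"]:
        values = column_values(table, col)
        if None not in values and len(set(values)) == len(values):
            keys.append(col)
    return keys


def candidate_foreign_keys(tables, min_inclusion=0.9):
    """Find columns whose values mostly occur in another table's candidate key.

    Returns tuples (source column, referenced key, inclusion ratio, names match).
    """
    found = []
    for t_name, t in tables.items():
        for u_name, u in tables.items():
            if t_name == u_name:
                continue
            for key in candidate_keys(u):
                key_values = set(column_values(u, key))
                for col in t["columns"]:
                    values = [v for v in column_values(t, col) if v is not None]
                    if not values:
                        continue
                    inclusion = sum(v in key_values for v in values) / len(values)
                    if inclusion >= min_inclusion:
                        found.append((f"{t_name}.{col}", f"{u_name}.{key}",
                                      inclusion, col == key))
    return found


foreign_keys = candidate_foreign_keys(tables)
for source, target, inclusion, same_name in foreign_keys:
    print(f"{source} -> {target}  inclusion={inclusion:.2f} same_name={same_name}")
```

The candidate keys and foreign keys found this way would be the raw material for proposing entities and relationships in an ER model; matching attribute names are reported only as supporting evidence here.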

Our goals for the next year are to complete the design and implementation of generating other types of metapatterns in addition to the transitivity type, and to incorporate all the results we have obtained so far into the control panel so that users can examine, select, and execute metapatterns interactively. We will also evaluate our system on several large legacy databases that we have access to. These are mainly logistics databases for military applications.

  • Indication of Success

The specific component of DataCrystal constructed this year has been tested on several logistics databases with up to 100 tables, up to 60 columns per table, and up to 1.5 million rows. The results are comparable to those obtained by purely human analysis. However, AutoModel speeds up this previously labor-intensive process by an order of magnitude, reducing it from weeks to hours.

This result has been well received by a research community for military applications, and we were requested to submit a short summary for the Significant Event Report for the Secretary of Defense.

So far, we have accomplished most of the items in our statement of work for the second year of this award. The two exceptions are that we are now focusing on logistics databases instead of chemistry databases, and that, due to the application's needs, we have not yet addressed the generation of non-transitivity metapatterns.

  • Project Impact

The DataCrystal project has supported two PhD students in the Data Mining area, and also enabled a four-day short course on Data Mining at the UCLA Extension Program. The course was a big success. There were 25 people in the class, and they came from everywhere: 4 from Brazil, 1 from Sweden, and the others from US organizations such as Allstate, Fair Isaac, Los Alamos National Laboratory, IBM, Digital, JPL, OOCL, etc. Two of them are CEOs of consulting companies. The three instructors are highly regarded in Data Mining. Rakesh Agrawal and Jiawei Han are chairmen of past KDD conferences and excellent researchers in databases. People in the class stayed until the last minute, and they were all very excited and happy!

In addition, as part of this project, we have organized several tutorials on different aspects of data mining and knowledge discovery during the past years. Between 40 and 150 people participated in each workshop or tutorial.

The DataCrystal technology is also licensed by GKIS, http://www.gkis.com, a Houston-based company for enterprise modeling, information integration, and technology education. They plan to extend DataCrystal and use it in applications related to the United Nations information technology group. (See the news report at DataCrystal's home page: http://www.isi.edu/dcrystal)

  • Area Background

Data Mining is the process of discovering valuable knowledge from very large data sets. Its research spans databases, statistics, machine learning, and adaptive control, and its applications include scientific research, system design, business management, and many other related areas.

  • Contact Information

    Wei-Min Shen
    USC Information Sciences Institute
    4676 Admiralty Way
    Marina del Rey, CA 90292
    Phone: (310) 448-8710
    Fax : (310) 822-0751
    Email: shen@isi.edu