A Metapattern-Based Automated Discovery Loop for Integrated Data Mining
Information Sciences Institute and Computer Science Department at
University of Southern California
This research is developing a metapattern-based discovery
loop for integrated data mining. Metapatterns (also known as metaqueries)
are second-order, declarative expressions that specify the types
of patterns to be discovered and assist humans in focusing on more fruitful
search directions. The discovery loop is a search engine that integrates
deduction, induction, and external guidance from humans, as well as
internal guidance of inter-component dependencies. Given a completely
new database, the system first generates an initial set of the most
general metapatterns based on the meta-information of the database,
and then executes these metapatterns against the database to discover
actual patterns. Based on the results, new metapatterns are dynamically
generated by adding more constraints to the more plausible metapatterns.
In this iterative process, human discoverers can analyze, create, select,
and execute metapatterns, or instruct the system to pursue metapatterns
on its own. This ability not only makes the process of data mining
more efficient and productive (the more expert users can use the system
for inspiration of better metapatterns, and the less expert users can
learn how to perform data mining in a particular domain by observation),
but also provides a new method for unsupervised learning of probabilistic,
relation-based patterns.
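As an illustration of how a metapattern is executed against a database, the following is a minimal sketch, not the DataCrystal implementation. It assumes the transitivity metapattern has the form P(X,Y) and Q(Y,Z) implies R(X,Z), that each relation is a set of pairs, and that a pattern is kept when its confidence reaches a threshold. The function names, the `min_confidence` parameter, and the sample data are all hypothetical.

```python
from itertools import product


def join(p, q):
    """Compose two binary relations: {(x, z) | (x, y) in p and (y, z) in q}."""
    by_first = {}
    for y, z in q:
        by_first.setdefault(y, set()).add(z)
    return {(x, z) for x, y in p for z in by_first.get(y, ())}


def instantiate_transitivity(relations, min_confidence=0.8):
    """Bind P, Q, R to every triple of relations and keep plausible patterns.

    Confidence is the fraction of tuples derived from P and Q that also
    appear in R (an assumed scoring rule, chosen for illustration).
    """
    patterns = []
    for (pn, p), (qn, q), (rn, r) in product(relations.items(), repeat=3):
        body = join(p, q)
        if not body:
            continue  # the left-hand side never fires, so nothing to score
        confidence = len(body & r) / len(body)
        if confidence >= min_confidence:
            patterns.append((pn, qn, rn, confidence))
    return patterns


# Hypothetical toy database with two binary relations.
relations = {
    "parent": {("ann", "bob"), ("bob", "cat")},
    "grandparent": {("ann", "cat")},
}
found = instantiate_transitivity(relations)
print(found)
```

On this toy data the only binding that survives is parent, parent, grandparent. In a loop of the kind described above, such surviving patterns would be the ones refined further by adding constraints.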
The major effort in the last year of this project has
focused on the second research thrust: mining sequential data
based on "Layered Phases". This research proposes a novel, domain-independent
technique for mining temporal databases through hierarchical modeling.
The main goal is to find useful knowledge and discover hidden patterns
in order to predict upcoming sequences in both the short term and the long term. This
work comprises four major activities: 1) mining through layered
phases; 2) mining through self-similar layered phases, a special
form of layered structure; 3) recursive learning, which applies the idea
of self-similarity to machine learning; and 4) case-based pattern recognition.
In addition, as the project has been investigated in different domains,
the following research activities have been initiated and developed
during the past years: 1) application of data mining and knowledge discovery
in health care; 2) mining network databases; 3) mining agents' behavior.
Data Mining, Knowledge discovery from databases,
Machine Learning, Automatic Modeling of Legacy Databases, Meta-Pattern
Guided Discovery, Integrated Data Mining Systems.
Goals, Objectives, and Targeted Activities
From June 1, 1997 through May 31, 1998, we have been
focusing on applying the Metapattern technology to automatically analyze
legacy relational databases and discover useful metadata -- characterizations
of the database's intended semantics. The main problem faced when
trying to incorporate legacy databases into modern knowledge-based systems
is the difficulty of obtaining such metadata. The problem is particularly
acute when the data is noisy and access to human domain experts is limited
-- both frequent occurrences. The AutoModel tool developed under
the methodology of DataCrystal uses advanced data mining techniques
to analyze database contents. By comparing data and attribute
names in different tables, in a multi-step process, it identifies potential
key and foreign key attributes and ultimately proposes an entity-relationship
(ER) model for the data. Expert knowledge about the database and
the domain can be used to control the process.
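The key and foreign key identification step can be sketched as follows. This is a simplified illustration under stated assumptions, not AutoModel's actual algorithm: it treats a column as a candidate key when its non-null values are unique, and as a foreign key candidate when nearly all of its values appear in a candidate key of another table. The thresholds `min_unique` and `min_coverage`, which stand in for tolerance to noisy data, and the sample tables are hypothetical. Comparison of attribute names is omitted.

```python
def candidate_keys(table, min_unique=1.0):
    """Return columns whose non-null values are (almost) all distinct."""
    keys = []
    for col, values in table.items():
        present = [v for v in values if v is not None]
        if present and len(set(present)) / len(present) >= min_unique:
            keys.append(col)
    return keys


def foreign_key_candidates(tables, min_coverage=0.95):
    """Find columns whose values are mostly contained in another table's key."""
    found = []
    for t_name, table in tables.items():
        for key in candidate_keys(table):
            key_values = set(table[key])
            for o_name, other in tables.items():
                if o_name == t_name:
                    continue  # only compare columns across different tables
                for col, values in other.items():
                    present = [v for v in values if v is not None]
                    if not present:
                        continue
                    hits = sum(v in key_values for v in present)
                    coverage = hits / len(present)
                    if coverage >= min_coverage:
                        found.append((o_name, col, t_name, key, coverage))
    return found


# Hypothetical two-table database, stored as column name -> list of values.
tables = {
    "depot": {"depot_id": [1, 2, 3], "city": ["LA", "SD", "LA"]},
    "shipment": {"ship_id": [10, 11, 12, 13], "depot_id": [1, 1, 3, 2]},
}
links = foreign_key_candidates(tables)
print(links)
```

Each resulting tuple reads as (referencing table, referencing column, referenced table, referenced key, coverage). Links of this kind are the raw material from which an ER model could be proposed, with expert knowledge used to accept or reject individual candidates.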
Our goals for the next year are to complete the design
and implementation of generating other types of metapatterns in addition
to the transitivity type, and to incorporate all the results we have
obtained so far into the control panel so that users can examine, select,
and execute metapatterns interactively. We will also evaluate our system
on several large legacy databases that we have access to; these are
mainly logistics databases for military applications.
The specific component of DataCrystal constructed this
year has been tested on several logistics databases of sizes ranging
up to 100 tables with up to 60 columns each, and up to 1.5 million rows.
The results generated are comparable to those obtained by purely human
analysis. However, AutoModel speeds up this previously labor-intensive
process by an order of magnitude -- reducing it from a matter of weeks
to a matter of hours.
This result has been well received by a research community
for military applications, and we were requested to submit a short summary
for the Significant Event Report for the Secretary of Defense.
So far, we have accomplished most of the items in our
statement of work for the second year of this award. There are two
exceptions: instead of using chemistry databases, we now focus on
logistics databases, and due to the application's needs, we have not addressed
the issue of generating non-transitivity metapatterns.
The DataCrystal project has supported two
PhD graduate students in the Data Mining area, and also enabled a four-day
short course on Data Mining at the UCLA Extension Program. The course
was a big success. The 25 people in the class came
from many places: 4 from Brazil, 1 from Sweden, and the others from US
organizations such as Allstate, Fair Isaac, Los Alamos National Laboratory, IBM, Digital,
JPL, and OOCL. Two of them are CEOs of consulting companies.
The three instructors are highly regarded in Data Mining. Rakesh
Agrawal and Jiawei Han are chairmen of past KDD conferences and excellent
researchers in databases. People in the class stayed until the last
minute and were very enthusiastic about the course.
In addition, as part of this project, we have
organized several tutorials on different aspects of data mining and knowledge
discovery during the past years. Between 40 and 150 people participated in
each workshop/tutorial.
The DataCrystal technology is also licensed by GKIS,
http://www.gkis.com, a Houston-based
company for enterprise modeling, information integration, and technology
education. They plan to extend DataCrystal and use it in applications
related to the United Nations information technology group. (See the news report
at DataCrystal's home page: http://www.isi.edu/dcrystal)
Data Mining is a process of discovering valuable knowledge
from very large data sets. Its research spans databases, statistics,
machine learning, and adaptive control, and its applications include
scientific research, system design, business management, and many other
related areas.
Wei-Min Shen
USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Phone: (310) 448-8710
Fax: (310) 822-0751
Email: shen@isi.edu