CSCI 548
Information Integration on the Web
Fall 2006
Instructor: Craig Knoblock
Meeting Time: Tuesdays 2 - 4:50pm
Location: GFS 101
Office Hours:
Tuesdays 5pm (immediately after class in the lobby of GFS)
Wednesdays 3-4pm (ISI 922)
Teaching Assistant: Matthew Michelson
TA Office Hours:
Mondays 10:30am-12:00pm (SAL 200C)
This course focuses on the foundations and techniques of information
extraction and integration. The topic has attracted a great deal of interest and research over the
last few years, and the course covers the research and tools that address
its technical problems. Topics include structured data querying
(XQuery), view integration techniques, machine learning techniques for turning web sites
into structured data sources, high-performance query execution systems based on
streaming dataflow, constraint-based integration systems, and approaches to resolving
naming inconsistencies across sites. We will also cover applications of information
integration to geospatial data.
The class will be run as a lecture course with lots of student participation and hands-on
experience. As an integral part of the course, each student will develop and build an
integrated Web application using the research and tools covered in the class.
Prerequisites:
CSCI561 -- Introduction to AI
CSCI585 -- Database Systems
Recommended Courses:
CSCI571 -- Web Technologies
CSCI573 -- Advanced AI
Grading:
Homework -- 24%
Course project -- 26%
Quizzes -- 20%
Final Exam -- 30%
Books: There is no required textbook. We will read technical papers on each topic.
Lab: SAL 200C (there is a $175 lab fee for this course)
Course Syllabus and Schedule
- Lecture 1 (August 22)
- Topic: Introduction
- Introduction
- Overview of the course
- Course project
- XQuery (Dan Goldberg)
- Required Readings
- What is XQuery?
Per Bothner.
- XQuery: A Guided Tour
Chapter 1 of XQuery from the Experts: A Guide to the W3C XML Query Language
Howard Katz (ed), Addison-Wesley, San Francisco, CA, 2004
- Lecture 2 (August 29)
- Topic: Wrapper Learning
- Wrapper Learning and Maintenance
- Easy Builder Tool (Matthew Michelson)
- Homework
- Assignment 1 Due (August 30 at midnight)
- XQuery
- Required Readings
- A Hierarchical Approach to Wrapper Induction,
Ion Muslea, Steve Minton, and Craig A. Knoblock.
In Proceedings of the 3rd International Conference on Autonomous Agents 1999, Seattle, WA, 1999.
- Wrapper Induction for Information Extraction,
Kushmerick, N., Weld, D., & Doorenbos, B.
In Proceedings of IJCAI, 1997.
- Recommended Readings:
- Hierarchical Wrapper Induction for Semistructured Information Sources,
I. Muslea, S. Minton, C. Knoblock
Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.
- Wrapper induction: Efficiency and expressiveness,
Nicholas Kushmerick
Artificial Intelligence J. (special issue on Intelligent Internet Systems), (1-2):15-68, 2000.
- Wrapper maintenance: A machine learning approach,
Kristina Lerman, Steven Minton, and Craig A. Knoblock
Journal of Artificial Intelligence Research, 2003
- Lecture 3 (September 5)
- Topic: Information Extraction
- Information Extraction Systems
- CRFs and MALLET (Matthew Michelson)
- Homework
- Assignment 2 Due (September 6 at midnight)
- Wrapper Building
- Required Reading:
- Semantic annotation of unstructured and ungrammatical text,
Matthew Michelson and Craig A. Knoblock.
In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-2005), Edinburgh, Scotland, 2005.
- Interactive Information Extraction with Constrained Conditional Random Fields,
Trausti Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum.
In Proceedings of The Nineteenth National Conference on Artificial Intelligence (AAAI 2004), San Jose, CA, 2004.
- Adaptive Information Extraction from Text by Rule Induction and Generalisation,
Fabio Ciravegna,
In Proceedings of 17th International Joint Conference on
Artificial Intelligence (IJCAI 2001), Seattle, August 2001.
- Lecture 4 (September 12)
- Topic: Data Integration
- Data Integration
- Prometheus mediator (Snehal Thakkar)
- Homework
- Assignment 3 Due (September 13 at midnight)
- Information Extraction
- Lecture 5 (September 19)
- Topic: Wrapper Generation
- Automatic Wrapper Generation (Prof. Kristina Lerman)
- Automatic wrapper tools (Rattapoom Tuchinda)
- Homework
- Assignment 4 Due (September 20 at midnight)
- Data Integration
- Required Readings:
- Using the Structure of Web Sites for Automatic Segmentation of Tables
Kristina Lerman, Lise Getoor, Steven Minton and Craig Knoblock,
In Proc. of SIGMOD-2004, Paris, France, 2004.
- RoadRunner: Towards Automatic Data Extraction from Large Web Sites
V. Crescenzi, G. Mecca, P. Merialdo,
The VLDB Journal, 109-118, 2001.
- Overview of AutoFeed: An Unsupervised Learning System for Generating Webfeeds,
B. Cenk Gazen and Steven Minton,
In Proceedings of AAAI, 2006
- Lecture 6 (September 26)
- Topic: Semantic Web
- Semantic Web Overview
- Example Application of the Semantic Web (Dr. Jose Luis Ambite)
- Semantic Web Tools (Dipsy Kapoor/Matt Michelson)
- Homework
- Assignment 5 Due (September 27 at midnight)
- Automatic Wrapper Generation
- Lecture 7 (October 3)
- Topic: Dataflow Execution
- Streaming Dataflow Execution
- Theseus Agent Execution System (Dr. Greg Barish)
- Homework
- Assignment 6 Due (October 4 at midnight)
- Semantic Web
- Lecture 8 (October 10)
- Topic: Schema Mapping / Source Modeling
- Schema Mapping
- Source Modeling (Dr. Mark Carman)
- Homework
- Assignment 7 Due (October 11 at midnight)
- Streaming dataflow execution
- Required Readings:
- iMAP: Discovering Complex Semantic Matches between Database Schemas,
R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos.
In Proceedings of SIGMOD-04, 2004.
- Automatically Labeling the Inputs and Outputs of Web Services,
Kristina Lerman, Anon Plangprasopchok and Craig Knoblock.
In Proceedings of AAAI-2006, Boston, MA, 2006.
- Learning semantic descriptions of web information sources,
Mark James Carman and Craig A. Knoblock.
In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence
(IJCAI), 2007.
- Lecture 9 (October 17)
- Topic: Constraint Integration
- Constraint Integration
- Heracles Tutorial (Dr. Jose Luis Ambite)
- Homework
- Assignment 8 Due (October 18 at midnight)
- Source Modeling
- Required Readings:
- Heracles II: Conditional Constraint Networks for Interleaved Planning and Information Gathering,
Jose Luis Ambite, Craig A. Knoblock, Maria Muslea, and Steve Minton
IEEE Intelligent Systems, 2005.
- Smart Clients: Constraint Satisfaction as a Paradigm for Scaleable Intelligent Information Systems
Marc Torrens, Boi Faltings and Pearl Pu.
Constraints,
Special issue on Constraints and Agents, pages 49-69, International Journal of Constraints, Volume 7, Issue 1. Kluwer Academic Publishers,
January 2002.
- Lecture 10 (October 24)
- Topic: Record Linkage
- Blocking (Matt Michelson)
- Matching
- Required Readings:
- Learning Blocking Schemes for Record Linkage,
Matthew Michelson and Craig A. Knoblock,
In Proceedings of the 21st National Conference on Artificial
Intelligence (AAAI-2006), Boston, MA, 2006.
- Learning domain-independent string transformation weights for high accuracy object identification,
Sheila Tejada, Craig A. Knoblock, and Steven Minton,
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, Canada, 2002.
- Adaptive Duplicate Detection Using Learnable String Similarity Measures,
Mikhail Bilenko and Raymond J. Mooney.
In Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-2003), Washington,
DC, August 2003.
- Lecture 11 (October 31)
- Topic: Optimizing Streaming Dataflow
- Optimizing Streaming Dataflow
- Speculative Execution (Dr. Greg Barish)
- Required Readings:
- Speculative Execution for Information Gathering Plans,
Greg Barish and Craig A. Knoblock,
Proceedings of the Sixth International Conference on AI Planning
and Scheduling (AIPS-2002), Toulouse, France. April 2002.
- Learning value predictors for the speculative execution of information gathering plans,
Greg Barish and Craig A. Knoblock.
In Proceedings of the 18th International Joint Conference on
Artificial Intelligence (IJCAI-2003), Acapulco, Mexico, 2003.
- Eddies: Continuously Adaptive Query Processing,
Ron Avnur and Joseph M. Hellerstein,
In Proceedings of the 2000 ACM SIGMOD International Conference
on Management of Data, Dallas, TX, May 2000.
- Lecture 12 (November 7)
- Topic: Geospatial Data Integration
- Geospatial Data Integration
- Aligning Geospatial Sources (Dr. Jason Chen)
- Required Readings:
- Utilizing road network data for automatic identification of road
intersections from high resolution color orthoimagery,
Ching-Chien Chen, Cyrus Shahabi, and Craig A. Knoblock.
In Proceedings of the 2nd Workshop on Spatio-Temporal Database
Management - STDBM'04, 2004.
- Automatically and accurately conflating orthoimagery and street maps,
Ching-Chien Chen, Craig A. Knoblock, Cyrus Shahabi, Snehal Thakkar, and Yao-Yi Chiang.
In Proceedings of the 12th ACM International Symposium on
Advances in Geographic Information Systems (ACM-GIS'04), 2004.
- Exploiting online sources to accurately geocode addresses,
Rahul Bakshi, Craig A. Knoblock, and Snehal Thakkar.
In Proceedings of the 12th ACM International Symposium on
Advances in Geographic Information Systems (ACM-GIS'04), 2004.
- Query processing in a geographic mediation system,
Mehdi Essid, Omar Boucelma, François-Marie Colonna, and Yassine Lassoued.
In Proceedings of the 12th ACM International Symposium on
Advances in Geographic Information Systems (ACM-GIS'04), 2004.
- Lecture 13 (November 14)
- Topic: Constraint Reasoning
- Constraint Satisfaction Methods (Professor Berthe Choueiry)
- Constraint satisfaction for geospatial reasoning (Martin Michalowski)
- Project Presentations (November 21)
- Lecture 14 (November 28)
- Topic: Intellectual Property / Course Review
- Intellectual Property
- Course Review
- Final Exam (Thursday, December 7, 2-4pm)
Last Updated December 6th 2006