CSCI 548
Information Integration on the Web
Spring 2009
Instructors: Craig Knoblock and Kristina Lerman
Meeting Time: Monday and Wednesday 3:30 - 4:50pm
Location: THH 212
Office Hours:
Monday and Wednesday 4:50pm (immediately after class in the lobby of GFS)
Or by appointment (ISI 922 or by phone)
Teaching Assistant:
TA Office Hours:
This course focuses on the foundations and techniques of Information
Extraction and Integration. The topic has attracted a great deal of interest and
research over the last few years, and the course covers the research and tools that
address its technical problems. Topics include structured data querying
(XQuery), view integration techniques, machine learning techniques for turning web sites
into structured data sources, high-performance query execution systems based on
streaming dataflow, constraint-based integration systems, and approaches to resolving
naming inconsistencies across sites. We will also cover the application of information
integration to geospatial data.
The class will be run as a lecture course with lots of student participation and hands-on
experience. As an integral part of the course, each student will develop and build an
integrated Web application using the research and tools covered in the class.
Prerequisites:
CSCI561 -- Introduction to AI
CSCI585 -- Database Systems
Recommended Courses:
CSCI571 -- Web Technologies
CSCI573 -- Advanced AI
Grading:
Homework -- 20%
Course project -- 30%
Quizzes -- 20%
Final Exam -- 30%
Books: There is no required textbook. We will read technical papers on each topic.
Lab: There is no lab for this course. Students should contact the instructor if they do not have access to a computer where they can install their own software.
Course Syllabus and Schedule
- Lecture 1 (January 12)
- Topic: Introduction
- Introduction
- Overview of the course
- Course project
- Required Readings
- What is XQuery?
Per Bothner.
- XQuery: A Guided Tour
Chapter 1 of XQuery from the Experts: A Guide to the W3C XML Query Language
Howard Katz (ed), Addison-Wesley, San Francisco, CA, 2004
- Lecture 2 & 3 (January 14 & 21)
- Topic: Wrapper Learning
- Wrapper Learning and Maintenance
- Wrapper Learning Tools (Martin Michalowski)
- Homework
- Assignment 1 Due (January 23 at midnight)
- Required Readings
- A Hierarchical Approach to Wrapper Induction,
Ion Muslea, Steve Minton, and Craig A. Knoblock.
In Proceedings of the 3rd International Conference on Autonomous Agents 1999, Seattle, WA, 1999.
- Wrapper Induction for Information Extraction,
Kushmerick, N., Weld, D., & Doorenbos, B.
In Proceedings of IJCAI, 1997.
- Recommended Readings:
- Hierarchical Wrapper Induction for Semistructured Information Sources,
I. Muslea, S. Minton, C. Knoblock
Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.
- Wrapper induction: Efficiency and expressiveness,
Nicholas Kushmerick
Artificial Intelligence J. (special issue on Intelligent Internet Systems), (1-2):15-68, 2000.
- Wrapper maintenance: A machine learning approach,
Kristina Lerman, Steven Minton, and Craig A. Knoblock
Journal of Artificial Intelligence Research, 2003
- Active Learning with Multiple Views,
Ion Muslea,
Journal of Artificial Intelligence Research 27, pp. 202-233
- Lecture 4 & 5 (January 26 & 28)
- Topic: Information Extraction
- Information Extraction Systems
- IE Tools
- Homework
- Assignment 2 Due (January 30 at midnight)
- Required Reading (Lecture 4):
- Required Reading (Lecture 5):
- Optional Reading:
- Lecture 6 & 7 (February 2 & 4)
- Topic: Data Integration
- Data Integration
- Data Integration Tools
- Homework
- Assignment 3 Due (February 7 at midnight)
- Required Readings:
- Recommended Readings:
- Lecture 8 & 9 (February 9 & 11)
- Topic: Wrapper Generation
- Automatic Wrapper Generation (Prof. Kristina Lerman)
- Automatic wrapper tools
- Homework
- Assignment 4 Due (February 14 at midnight)
- Required Readings (Lecture 8):
- Required Readings (Lecture 9):
- Using the Structure of Web Sites for Automatic Segmentation of Tables
Kristina Lerman, Lise Getoor, Steven Minton and Craig Knoblock,
In Proc. of SIGMOD-2004, Paris, France, 2004.
- Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison
Etzioni, O., Cafarella, M., Downey, D., Popescu, A., Shaked, T., Soderland, S., Weld, D., and Yates, A.
In Proceedings of AAAI, 2004.
- Recommended Readings:
- Lecture 10 (February 18)
- Topic: Mashups
- Building Mashups
- Mapping/Modeling Tools
- Lecture 11 (February 23)
- Topic: Schema Matching
- Homework
- Assignment 5 Due (February 21 at midnight)
- Lecture 12 (February 25)
- Topic: Source Modeling
- Homework
- Assignment 6 Due (February 28 at midnight)
- Lecture 13 & 14 (March 2 & 4)
- Topic: Record Linkage
- Homework
- Assignment 7 Due (March 7 at midnight)
- Required Readings:
- Learning Blocking Schemes for Record Linkage,
Matthew Michelson and Craig A. Knoblock,
In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-2006),
Boston, MA, 2006.
- Adaptive Duplicate Detection Using Learnable String Similarity Measures,
Mikhail Bilenko and Raymond J. Mooney.
In Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-2003), pp. 39-48, Washington, DC,
August 2003.
- EntityBases: Compiling, Organizing and Querying Massive Entity Repositories,
Craig A. Knoblock, José Luis Ambite, Kavita Ganesan, Maria Muslea, Steven Minton,
Greg Barish, Evan Gamble, Claude Nanjo, Kane See, Cyrus Shahabi, and
Ching-Chien Chen.
International Conference on Artificial Intelligence, 2007.
- Optional Readings:
- Lecture 15 & 16 (March 9 & 11)
- Topic: Semantic Web
- Semantic Web
- Semantic Web Tools
- Homework
- Assignment 8 Due (March 14 at midnight)
- Required Readings:
- Recommended Readings:
- Lecture 17 & 18 (March 23 & 25)
- Topic: Dataflow Execution / Optimization
- Dataflow Execution
- Dataflow Optimization
- Required Readings (Lecture 17):
- Recommended Readings (Lecture 18):
- Optional Readings:
- Slides
- Lecture 19 & 20 (March 30 & April 1)
- Topic: Social Web
- Social Tagging
- Folksonomies
- Required Readings (Lecture 19):
- Recommended Readings (Lecture 20):
- Harvesting Social Knowledge from Folksonomies,
Wu, H., Zubair, M. and Maly, K.
In Proc. of Conf. on Hypertext and Hypermedia, 2006.
- Learning value predictors for the speculative execution
of information gathering plans,
Plangprasopchok, A. and Lerman, K.
In Proc. of International World Wide Web Conference, 2009.
- Optional Readings:
- Slides
- Lecture 21 (April 6)
- Topic: Geospatial Data Integration
- Geospatial Data Integration
- Required Readings:
- Automatically conflating road vector data with orthoimagery,
Ching-Chien Chen, Craig A. Knoblock, and Cyrus Shahabi.
Geoinformatica, 10(4):495-530, December 2006.
- Automatically and accurately conflating orthoimagery and street maps,
Ching-Chien Chen, Craig A. Knoblock, Cyrus Shahabi, Snehal Thakkar, and Yao-Yi Chiang.
In Proceedings of the 12th ACM International Symposium on Advances in
Geographic Information Systems (ACM-GIS'04), 2004.
- Lecture 22 (April 8)
- Topic: Map and Imagery Fusion
- Required Readings:
- Accurate and Automatic Extraction of Road Intersections From Raster Maps,
Yao-Yi Chiang, Ching-Chien Chen, Craig A. Knoblock, and Cyrus Shahabi.
Geoinformatica, 13(2):121-157, 2008.
- Integrated text and line-art extraction from a topographic map,
Luyang Li, George Nagy, Ashok Samal, Sharad Seth, Yihong Xu.
International Journal on Document Analysis and Recognition,
Volume 2, Number 4 / June, 2000
- Lecture 22 (April 13)
- Topic: Constraint Integration
- Required Readings:
- Heracles II: Conditional Constraint Networks for Interleaved Planning and
Information Gathering,
Jose Luis Ambite, Craig A. Knoblock, Maria Muslea, and Steve Minton.
IEEE Intelligent Systems, 2005
- Smart Clients: Constraint Satisfaction as a Paradigm for Scaleable Intelligent
Information Systems,
Marc Torrens, Boi Faltings and Pearl Pu.
Constraints, Special issue on Constraints and Agents, pages 49-69,
International Journal of Constraints, Volume 7, Issue 1.
Kluwer Academic Publishers. January 2002.
- Optional Readings:
- Constraint Programming: In Pursuit of the Holy Grail,
Barták, R.
In Proceedings of the Week of Doctoral Students (WDS99), Part IV,
MatFyzPress, Prague, June 1999, pp. 555-564.
- A constraint satisfaction approach to geospatial reasoning,
Martin Michalowski and Craig A. Knoblock.
In Proceedings of The Twentieth National Conference on Artificial
Intelligence (AAAI-05), Pittsburgh, PA, July 2005, pp. 423-429.
- Slides
- Lecture 23 (April 15)
- Topic: Intellectual Property
- Intellectual Property and Web
- Lecture 24 (April 20)
- Topic: Geocoding
- Required Readings:
- Slides
- Project Presentations 1 (April 22)
- Project Presentations 2 (April 27)
- Project Presentations 3 (April 29)
- Final Exam (Thursday, May 7, 2-4pm)
Last Updated June 30th 2009