CSCI 548
Information Integration on the Web
Spring 2008
Instructor: Craig Knoblock
Meeting Time: Tuesdays 3:30 - 6:20pm
Location: THH 212
Office Hours:
Tuesdays 6:20pm (immediately after class in the lobby of GFS)
Or by appointment (ISI 922 or by phone)
Wednesdays 3-4pm (ISI 922)
Teaching Assistants:
- Anon Plangprasopchok
- Rattapoom Tuchinda (Pipe)
TA Office Hours:
Mondays and Wednesdays 1:00-2:00pm in SAL 219 (Anon)
Fridays 1:00-3:00pm in SAL 235 (Pipe)
This course will focus on the basic foundations and techniques in Information
Extraction and Integration. There has been a great deal of interest and research over the
last few years on this topic and the course will cover the research and tools for addressing
the technical problems. The topics covered will include structured data querying
(XQuery), view integration techniques, machine learning techniques for turning web sites
into structured data sources, high-performance query execution systems based on
streaming dataflow, constraint-based integration systems, and approaches to resolving
naming inconsistencies across sites. We will also cover applications of information
integration to geospatial data.
The class will be run as a lecture course with lots of student participation and hands-on
experience. As an integral part of the course each student will develop and build an
integrated Web application using the research and tools covered in the class.
Prerequisites:
CSCI561 -- Introduction to AI
CSCI585 -- Database Systems
Recommended Courses:
CSCI571 -- Web Technologies
CSCI573 -- Advanced AI
Grading:
Homework -- 20%
Course project -- 30%
Quizzes -- 20%
Final Exam -- 30%
Books: There is no required textbook. We will read technical papers on each topic.
Lab: There is no lab for this course. Students should contact the instructor if they do not have access to a computer where they can install their own software.
Course Syllabus and Schedule
- Lecture 1 (January 15)
- Topic: Introduction
- Introduction
- Overview of the course
- Course project
- XQuery (Dan Goldberg)
- Required Readings
- What is XQuery?
Per Bothner.
- XQuery: A Guided Tour
Chapter 1 of XQuery from the Experts: A Guide to the W3C XML Query Language
Howard Katz (ed), Addison-Wesley, San Francisco, CA, 2004
- Lecture 2 (January 22)
- Topic: Information Extraction
- Information Extraction Systems
- IE Tools (Matthew Michelson)
- Homework
- Assignment 1 Due (January 25 at midnight)
- Required Reading:
- Information Extraction: Distilling Structured Data from Unstructured Text,
Andrew McCallum.
ACM Queue, Volume 3, Number 9, November 2005.
- Semantic annotation of unstructured and ungrammatical text,
Matthew Michelson and Craig A. Knoblock.
In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-2005), Edinburgh, Scotland, 2005.
- Accurate Information Extraction from Research Papers using Conditional Random Fields,
Fuchun Peng and Andrew McCallum.
In Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004.
- Adaptive Information Extraction from Text by Rule Induction and Generalisation,
Fabio Ciravegna.
In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, August 2001.
- Recommended Readings:
- Lecture 3 (January 29)
- Topic: Wrapper Generation
- Automatic Wrapper Generation (Prof. Kristina Lerman)
- Automatic wrapper tools (Rattapoom Tuchinda)
- Homework
- Assignment 2 Due (February 1 at midnight)
- Data Integration
- Required Readings:
- Using the Structure of Web Sites for Automatic Segmentation of Tables
Kristina Lerman, Lise Getoor, Steven Minton and Craig Knoblock,
In Proc. of SIGMOD-2004, Paris, France, 2004.
- RoadRunner: Towards Automatic Data Extraction from Large Web Sites
V. Crescenzi, G. Mecca, P. Merialdo,
The VLDB Journal, 109-118, 2001.
- Overview of AutoFeed: An Unsupervised Learning System for Generating Webfeeds,
B. Cenk Gazen and Steven Minton,
In Proceedings of AAAI, 2006
- Recommended Readings:
- Lecture 4 (February 5)
- Topic: Wrapper Learning
- Wrapper Learning and Maintenance
- Wrapper Learning Tools (Martin Michalowski)
- Homework
- Assignment 3 Due (February 8 at midnight)
- Required Readings
- Wrapper Maintenance,
Kristina Lerman and Craig A. Knoblock.
To appear in the Encyclopedia of Database Systems. Springer, Leipzig, Germany, 2008.
- A Hierarchical Approach to Wrapper Induction,
Ion Muslea, Steve Minton, and Craig A. Knoblock.
In Proceedings of the 3rd International Conference on Autonomous Agents 1999, Seattle, WA, 1999.
- Wrapper Induction for Information Extraction,
Kushmerick, N., Weld, D., & Doorenbos, B.
In Proceedings of IJCAI, 1997.
- Recommended Readings:
- Hierarchical Wrapper Induction for Semistructured Information Sources,
I. Muslea, S. Minton, C. Knoblock.
Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.
- Wrapper induction: Efficiency and expressiveness,
Nicholas Kushmerick.
Artificial Intelligence J. (special issue on Intelligent Internet Systems), (1-2):15-68, 2000.
- Wrapper maintenance: A machine learning approach,
Kristina Lerman, Steven Minton, and Craig A. Knoblock.
Journal of Artificial Intelligence Research, 2003.
- Lecture 5 (February 12)
- Topic: Data Integration
- Data Integration
- Data Integration Tools (Prof. Jose Luis Ambite)
- Homework
- Assignment 4 Due (February 15 at midnight)
- Required Readings:
- Recommended Readings:
- Lecture 6 (February 19)
- Topic: Schema Mapping / Source Modeling
- Schema Mapping/Source Modeling
- Mapping/Modeling Tools (Anon Plangprasopchok)
- Homework
- Assignment 5 Due (February 22 at midnight)
- Required Readings:
- iMAP: Discovering Complex Semantic Matches between Database Schemas,
R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos.
In Proceedings of SIGMOD-04, 2004.
- Automatically Labeling the Inputs and Outputs of Web Services,
Kristina Lerman, Anon Plangprasopchok and Craig Knoblock.
In Proceedings of AAAI-2006, Boston, MA, 2006.
- Learning Semantic Definitions of Online Information Sources,
Mark James Carman and Craig A. Knoblock.
Journal of Artificial Intelligence Research (JAIR), Volume 30, pages 1-50, 2007.
- Lecture 7 (February 26)
- Topic: Semantic Web
- Semantic Web (Prof. Jose Luis Ambite)
- Semantic Web Tools (Anon Plangprasopchok)
- Homework
- Assignment 6 Due (February 29 at midnight)
- Required Readings:
- Recommended Readings:
- Lecture 8 (March 4)
- Topic: Record Linkage
- Record Linkage
- Blocking / Matching Tools (Matt Michelson)
- Homework
- Assignment 7 Due (March 7 at midnight)
- Required Readings:
- Learning Blocking Schemes for Record Linkage,
Matthew Michelson and Craig A. Knoblock.
In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-2006), Boston, MA, 2006.
- Adaptive Duplicate Detection Using Learnable String Similarity Measures,
Mikhail Bilenko and Raymond J. Mooney.
In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 39-48, Washington, DC, August 2003.
- EntityBases: Compiling, Organizing and Querying Massive Entity Repositories,
Craig A. Knoblock, José Luis Ambite, Kavita Ganesan, Maria Muslea, Steven Minton, Greg Barish, Evan Gamble, Claude Nanjo, Kane See, Cyrus Shahabi, Ching-Chien Chen.
International Conference on Artificial Intelligence, 2007.
- Optional Readings:
- Slides
- Lecture 9 (March 11)
- Topic: Dataflow Execution
- Streaming Dataflow Execution
- Optimizing Streaming Dataflow (Dr. Greg Barish)
- Homework
- Assignment 8 Due (March 14 at midnight)
- Required Readings:
- An efficient and expressive language for information gathering on the web,
Greg Barish and Craig A. Knoblock.
In Proceedings of the AIPS-2002 Workshop on Is there life after operator sequencing? -- Exploring real world planning, pages 5-12, Toulouse, France, 2002.
- Integrating Network-Bound XML Data,
Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld.
IEEE Data Engineering Bulletin, June 2001.
- Speculative Execution for Information Gathering Plans,
Greg Barish and Craig A. Knoblock.
In Proceedings of the Sixth International Conference on AI Planning and Scheduling (AIPS-2002), Toulouse, France, April 2002.
- Learning value predictors for the speculative execution of information gathering plans,
Greg Barish and Craig A. Knoblock.
In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-2003), Acapulco, Mexico, 2003.
- Optional Readings:
- Slides
- Lecture 10 (March 25)
- Topic: Constraint Integration / Social Networking
- Constraint Integration
- Social Networking (Professor Kristina Lerman)
- Required Readings:
- Heracles II: Conditional Constraint Networks for Interleaved Planning and Information Gathering,
Jose Luis Ambite, Craig A. Knoblock, Maria Muslea, and Steve Minton.
IEEE Intelligent Systems, 2005.
- Smart Clients: Constraint Satisfaction as a Paradigm for Scaleable Intelligent Information Systems,
Marc Torrens, Boi Faltings and Pearl Pu.
Constraints, Special issue on Constraints and Agents, pages 49-69, International Journal of Constraints, Volume 7, Issue 1. Kluwer Academic Publishers, January 2002.
- A constraint satisfaction approach to geospatial reasoning,
Martin Michalowski and Craig A. Knoblock.
In Proceedings of The Twentieth National Conference on Artificial Intelligence (AAAI-05), Pittsburgh, PA, July 2005, pp. 423-429.
- Optional Readings:
- Lecture 11 (April 1)
- Topics: Mashup Construction / Geospatial Reasoning
- Mashup Construction
- Constraint satisfaction for geospatial reasoning (Martin Michalowski)
- Required Readings:
- Video Demonstrations:
- Optional Readings:
- Slides
- Lecture 12 (April 8)
- Topic: Geospatial Data Integration
- Integrating Text and Imagery
- Map Search and Extraction (Yao-Yi Chiang)
- Required Readings:
- Automatically and accurately conflating orthoimagery and street maps,
Ching-Chien Chen, Craig A. Knoblock, Cyrus Shahabi, Snehal Thakkar, and Yao-Yi Chiang.
In Proceedings of the 12th ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS'04), 2004.
- Automatic extraction of road intersections from raster maps,
Yao-Yi Chiang, Craig A. Knoblock, and Ching-Chien Chen.
In The 13th ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS'05), Bremen, Germany, November 2005.
- Integrated text and line-art extraction from a topographic map,
Luyang Li, George Nagy, Ashok Samal, Sharad Seth, Yihong Xu.
International Journal on Document Analysis and Recognition, Volume 2, Number 4, June 2000.
- Slides
- Lecture 13 (April 15)
- Topic: Intellectual Property / Geospatial Data Integration
- Intellectual Property
- Vector, Map, and Image Registration and Conflation (Dr. Jason Chen)
- Required Readings:
- Slides
Project Presentations 1 (April 22)
Project Presentations 2 (April 29)
Final Exam (Tuesday, May 13, 2-4pm)
Last Updated August 14th 2008