CSCI 599: Information Integration on the Web
Spring 2002
Instructor: Craig Knoblock (knoblock@isi.edu)
Office Hours:
- Thursday 12:45 - 1:45 (PHE 416)
- Thursday 4:50 - 5:20 (THH 112)
Teaching Assistant: Ion Muslea (muslea@isi.edu)
Office Hours:
- Tuesday 2:50 - 4:50 (PHE 416 / SAL 200C)
Meeting Time:
- Thursday 2:00-4:50
Location:
- THH 112
There is an abundance of data available on the Internet and there are
many opportunities to combine this information to build new
applications and tools. However, there are many obstacles to
exploiting the available data. In many cases the information is not
available in a structured representation, it is complicated to
navigate to the required information, the format of the information
changes over time, and the terminology used to describe the data
varies from one data source to the next.
This course will focus on the basic foundations and techniques in
Information Integration as it applies to the Web. There has been a
great deal of interest and research over the last few years on this
topic, and the course will cover the research and tools for
addressing the technical problems. The topics covered will include
structured data representation such as XML, view integration
techniques, machine learning techniques for turning web sites into
structured data sources, high-performance query execution systems
based on dataflow, constraint-based integration systems, and
approaches to resolving naming inconsistencies across sites.
The class will be run as a lecture course with lots of student
participation and hands-on experience. As an integral part of the
course, each student will develop and build their own integrated Web
application using the research and tools covered in the class.
- Prerequisites: CS561a -- Introduction to AI
- Grading:
- Homework: 30%
- Class participation: 20%
- Course project: 50%
- Books: There is no required textbook. We will be reading technical papers on each topic.
Course Syllabus and Schedule
- January 10
- Introduction
- Overview of the course
- Application demonstrations
- Course requirements
- Course projects
- January 17
- Information Integration
- .Net Framework (Snehal Thakkar, PhD Student)
- January 24
- Wrapper Learning (Ion Muslea, PhD Student)
- Agent Builder (Snehal Thakkar, PhD Student)
- .Net Homework Due
- January 31
- Wrapper Maintenance
- Agent Builder (Ion Muslea, PhD Student)
- Wrapper Homework 1 Due
- February 7
- Query Plan Execution
- Dataflow architectures
- Adaptive execution strategies
- Theseus Agent Execution System (Greg Barish, PhD Student)
- Wrapper Homework 2 Due
- February 14
- View Integration (Jose Luis Ambite, Research Scientist)
- Speculative Execution in Theseus (Greg Barish, PhD Student)
- Theseus Homework 1 Due
- February 21
- The Semantic Web (Prof. Yolanda Gil)
- XML Query Processing
- Written Proposals Due
- February 28
- Heracles Constraint Integration Engine
- Theseus Homework 2 Due
- March 7
- Object Alignment
- Record Linkage
- IR Techniques
- Machine learning
- March 14
- March 21
- Query Reformulation
- Project Status Report Due
- March 28
- April 4
- Case Studies of Integration Systems:
- Ariadne
- Havasu/Emerac
- Information Manifold/InfoMaster
- Tsimmis/Hermes
- Telegraph/Tukwila
- Niagara
- April 11
- Geospatial Data Integration
- April 18
- April 25
- April 30
- Project Presentations 2-4pm (Final Exam Slot)
Course Materials: Notes, Slides, Papers, and Homeworks
- Lecture 1:
- Papers
- Modeling Web Sources for Information Integration
C. Knoblock, S. Minton, J.L. Ambite, N. Ashish, P. Modi, I. Muslea, A. Philpot, S. Tejada.
In Proceedings of the 15th National Conference on Artificial Intelligence, 1998.
- The TheaterLoc Application
G. Barish, C. Knoblock, Y.-S. Chen, S. Minton, A. Philpot, C. Shahabi.
In Proceedings of the 12th Annual Conference on Innovative Applications of Artificial Intelligence
(IAAI-2000), Austin, TX, 2000.
- Slides
- Lecture 2:
- Papers
- Slides
- Homework-1 (was due on Jan. 31, 2002)
- Lecture 3:
- Papers
- Slides
- Homework-2 (was due on Feb. 7, 2002)
- Lecture 4:
- Papers
- Slides
- Homework-3 (was due on Feb. 14, 2002)
- Lecture 5:
- Lecture 6:
- Papers
- Slides
- Homework-4 (due Feb. 28, 2002)
- Lecture 7:
- Papers
- Slides
- Homework-5 (due March 7, 2002)
- Lecture 8:
- Papers
- Slides
- Homework-6 (due March 21, 2002)
- Lecture 9:
- Lecture 10:
- Papers
- Slides
- Homework-7 (due March 28, 2002)
- Lecture 11:
- Papers
- Extraction Patterns for Information Extraction Tasks: A Survey
Ion Muslea
The AAAI-1999 workshop on Machine Learning for Information Extraction
- Information Gathering Plans with Sensing Actions
Naveen Ashish, Craig A. Knoblock, and Alon Levy
European Conference on Planning, ECP-97, Toulouse, France.
- The TSIMMIS Project: Integration of Heterogeneous Information Sources
S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland,
Y. Papakonstantinou, J. Ullman, and J. Widom.
Proceedings of IPSJ Conference, pp. 7-18, Tokyo, Japan, October
- Slides
- Homework-7 (due March 28, 2002)
- Lecture 12:
- Papers
- Automatic Data Extraction from Lists and Tables in Web Sources
Kristina Lerman, Craig A. Knoblock and Steven Minton
Automatic Text Extraction and Mining workshop (ATEM-01),
IJCAI-01, Seattle, WA, August 2001.
- Semi-automatic wrapper generation for Internet information sources
Naveen Ashish and Craig A. Knoblock
Proceedings of the Second IFCIS International Conference on
Cooperative Information Systems, Kiawah Island, SC, 1997.
- Learning to understand information on the internet
M. Perkowitz, R. Doorenbos, O. Etzioni and D. Weld
Journal of Intelligent Information Systems, 8(2), pp. 133-153, 1997.
- RoadRunner: Towards Automatic Data Extraction from Large Web Sites
V. Crescenzi, G. Mecca, P. Merialdo
The VLDB Journal, 109-118, 2001.
- Slides
- Lecture 13: