CSCI 548
Information Integration on the Web
Spring 2005
Instructor: Craig Knoblock
Meeting Time: Tuesdays 2 - 4:50pm
Location: THH 208
Office Hours:
Tuesdays 5-6pm (PHE 416)
Thursdays 3-4pm (ISI 922)
Teaching Assistant: Martin Michalowski
TA Office Hours:
Mondays 1-2:30pm (SAL 200C)
Teaching Assistant: Anshuman Chakravartty
TA Office Hours:
Tuesdays 11am-12:30pm (SAL 200C)
Wednesdays 1-2:30pm (SAL 200C)
Thursdays 10-11:30am (SAL 200C)
Fridays 2-3:30pm (SAL 200C)
Grader: Junaid Chaudhry
This course focuses on the foundations and techniques of information extraction and integration. The topic has attracted a great deal of interest and research over the last few years, and the course covers the research and tools that address its technical problems. Topics include structured data querying (XQuery), view integration techniques, machine learning techniques for turning web sites into structured data sources, high-performance query execution systems based on streaming dataflow, constraint-based integration systems, and approaches to resolving naming inconsistencies across sites. We will also cover applications of information integration to geospatial and biological data.
The class will be run as a lecture course with lots of student participation and hands-on experience. As an integral part of the course, each student will develop and build their own integrated Web application or related research project using the research and tools covered in the class.
Prerequisites:
CSCI561 -- Introduction to AI
CSCI585 -- Database Systems
Recommended Courses:
CSCI571 -- Web Technologies
CSCI573 -- Advanced AI
Grading:
Homework -- 24%
Course project -- 35%
Proposal 5%
Demo 5%
Presentation 5%
Paper 10%
Applied techniques learned in class 5%
Innovation and Creativity 5%
Quizzes -- 11%
Final Exam -- 30%
Books: There is no required textbook. We will read technical papers on each topic.
Lab: SAL 200C (there is a $175 lab fee for this course)
Course Syllabus and Schedule
- Lecture 1 (January 11)
- Topic: Introduction
- Introduction
- Overview of the course
- XQuery
- Required Readings
- What is XQuery?
Per Bothner.
- XQuery: A Guided Tour
Chapter 1 of XQuery from the Experts: A Guide to the W3C XML Query Language
Howard Katz (ed), Addison-Wesley, San Francisco, CA, 2004
- Lecture 2 (January 18)
- Topic: Wrapper Learning
- Wrapper Learning and Maintenance
- Agent Builder (Martin Michalowski)
- Homework
- Assignment 1 Due (January 19 at midnight)
- XQuery
- Required Readings
- Accurately and reliably extracting data from the web: A machine learning approach,
Craig A. Knoblock, Kristina Lerman, Steven Minton, and Ion Muslea.
In Piotr S. Szczepaniak, Javier Segovia, Janusz Kacprzyk, and Lotfi A. Zadeh, editors, Intelligent Exploration of the Web, pages 275--287. Springer-Verlag, Berkeley, CA, 2003.
- Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web
Chun-Nan Hsu and Ming-Tzung Dung
Information Systems Journal, 23(8):521-538, 1998.
- Wrapper Induction for Information Extraction.
Kushmerick, N., Weld, D., & Doorenbos, B. (1997), IJCAI-97
- Lecture 3 (January 25)
- Topic: Wrapper Generation
- Automatic Wrapper Generation
- Advanced Agent Builder (Martin Michalowski)
- Homework
- Assignment 2 Due (January 26 at midnight)
- Wrapper Building
- Required Readings
- Automatic Data Extraction from Lists and Tables in Web Sources
Kristina Lerman, Craig A. Knoblock and Steven Minton,
Automatic Text Extraction and Mining workshop (ATEM-01), IJCAI-01, Seattle, WA, August 2001.
- RoadRunner: Towards Automatic Data Extraction from Large Web Sites
V. Crescenzi, G. Mecca, P. Merialdo,
The VLDB Journal, 109-118, 2001.
- Using Grammatical Inference to Automate Information Extraction from the Web,
Theodore W. Hong and Keith L. Clark,
Lecture Notes in Computer Science, 2168, 2001
- Lecture 4 (February 1)
- Topic: Dataflow Execution
- Streaming Dataflow Execution
- Theseus Agent Execution System (Dr. Greg Barish)
- Homework
- Assignment 3 Due (Feb 2 at midnight)
- Advanced Wrapper Building
- Lecture 5 (February 8)
- Topic: Data Integration
- Data Integration
- Prometheus mediator (Snehal Thakkar)
- Homework
- Assignment 4 Due (Feb 9 at midnight)
- Execution Plans
- Lecture 6 (February 15)
- Topics: Semantic Web and Optimizing Data Integration
- The Semantic Web (Prof. Yolanda Gil)
- Optimizing Integration Plans (Dr. Jose Luis Ambite)
- Homework
- Assignment 5 Due (Feb 16 at midnight)
- Mediator Integration
- Lecture 7 (February 22)
- Topic: Constraint Integration
- Constraint Integration
- Heracles Constraint Integration System (Dr. Jose Luis Ambite)
- Homework
- Assignment 6 Due (Feb 23 at midnight)
- Data integration using servlets
- Required Readings:
- Heracles II: Conditional Constraint Networks for Interleaved Planning and Information Gathering,
Jose Luis Ambite, Craig A. Knoblock, Maria Muslea, and Steve Minton
IEEE Intelligent Systems, 2005
- Smart Clients: Constraint Satisfaction as a Paradigm for Scaleable Intelligent Information Systems
Marc Torrens, Boi Faltings and Pearl Pu.
Constraints, Special issue on Constraints and Agents, pages 49-69,
International Journal of Constraints, Volume 7, Issue 1.
Kluwer Academic Publishers, January 2002.
- Lecture 8 (March 1)
- Topic: Record Linkage
- Record Linkage
- Apollo record linkage (Martin Michalowski)
- Homework
- Assignment 7 Due (March 2 at midnight)
- Constraint Integration in Heracles
- Project Proposals Due (March 1 at 1:50pm)
- Required Readings:
- Learning domain-independent string transformation weights for high accuracy object identification,
Sheila Tejada, Craig A. Knoblock, and Steven Minton,
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, Canada, 2002.
- Adaptive Duplicate Detection Using Learnable String Similarity Measures.
Mikhail Bilenko and Raymond J. Mooney.
In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp.39-48, Washington, DC, August 2003.
- Lecture 9 (March 8)
- Topics: Ontology Matching and Source Modeling
- Ontology Matching
- Automatic Source Modeling
- Homework
- Assignment 8 Due (March 9 at midnight)
- Record Linkage
- Required Readings:
- Semantic annotation of unstructured and ungrammatical text
Matthew Michelson and Craig A. Knoblock, Proceedings of the Dagstuhl workshop on Machine Learning for the Semantic Web, February 2005.
- Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets
Burr Settles, Proceedings of the COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland, 2004
- (LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts
Fabio Ciravegna, Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, Seattle, WA, 2001
- Spring Break (March 15)
- Lecture 10 (March 22)
- Topics: Information Extraction and Web Mining
- Information Extraction
- Mining the Web (Rattapoom Tuchinda)
- Lecture 11 (March 29)
- Topic: Optimizing Streaming Dataflow
- Optimizing Streaming Dataflow
- Speculative Execution (Dr. Greg Barish)
- Required Readings:
- Speculative Execution for Information Gathering Plans,
Greg Barish and Craig A. Knoblock,
Proceedings of the Sixth International Conference on AI Planning and Scheduling (AIPS-2002), Toulouse, France. April 2002.
- Learning value predictors for the speculative execution of information gathering plans,
Greg Barish and Craig A. Knoblock.
In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-2003), Acapulco, Mexico, 2003.
- Eddies: Continuously Adaptive Query Processing,
Ron Avnur and Joseph M. Hellerstein,
In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, May 2000.
- Lecture 12 (April 5)
- Topics: Optimizing Data Integration
- Interleaving Planning and Sensing
- Compiling Integration Plans
- Lecture 13 (April 12)
- Topic: Geospatial Data Integration
- Geospatial information mediators
- Geospatial data fusion
- Geospatial reasoning (Martin Michalowski)
- Required Readings:
- Utilizing road network data for automatic identification of road intersections from high resolution color orthoimagery,
Ching-Chien Chen, Cyrus Shahabi, and Craig A. Knoblock.
In Proceedings of the 2nd Workshop on Spatio-Temporal Database Management - STDBM'04, 2004.
- Automatically and accurately conflating orthoimagery and street maps,
Ching-Chien Chen, Craig A. Knoblock, Cyrus Shahabi, Snehal Thakkar, and Yao-Yi Chiang.
In Proceedings of the 12th ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS'04), 2004.
- Exploiting online sources to accurately geocode addresses,
Rahul Bakshi, Craig A. Knoblock, and Snehal Thakkar.
In Proceedings of the 12th ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS'04), 2004.
- Query processing in a geographic mediation system,
Mehdi Essid, Omar Boucelma, François-Marie Colonna, and Yassine Lassoued.
In Proceedings of the 12th ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS'04), 2004.
- Project Presentations (April 19)
- Project Presentations (April 26)
- Final Exam (Tuesday, May 3, 2-4pm)
Last Updated July 20th 2005