CSCI 548
Information Integration on the Web
Spring 2010
Instructors: Craig Knoblock and Kristina Lerman
Meeting Time: Monday and Wednesday 3:30 - 4:50pm
Location: ZHS 352
Office Hours:
Monday and Wednesday 4:50pm (immediately after class in the lobby of ZHS)
Or by appointment (ISI 922 or by phone)
Teaching Assistant:
TA Office Hours:
This course focuses on the foundations and techniques of information
extraction and integration. The topic has attracted a great deal of interest and research over the
last few years, and the course covers the research and tools that address
its technical problems. Topics include structured data querying
(XQuery), view integration techniques, machine learning techniques for turning web sites
into structured data sources, high-performance query execution systems based on
streaming dataflow, constraint-based integration systems, and approaches to resolving
naming inconsistencies across sites. We will also cover applications of information
integration to geospatial data.
The class will be run as a lecture course with lots of student participation and hands-on
experience. As an integral part of the course, each student will develop and build an
integrated Web application using the research and tools covered in the class.
Prerequisites:
CSCI561 -- Introduction to AI
CSCI585 -- Database Systems
Recommended Courses:
CSCI571 -- Web Technologies
CSCI573 -- Advanced AI
Grading:
Homework -- 20%
Course project -- 30%
Quizzes -- 20%
Final Exam -- 30%
Books: There is no required textbook. We will read technical papers on each topic.
Lab: There is no lab for this course. Students should contact the
instructor if they do not have access to a computer where they can
install their own software.
Course Syllabus and Schedule
- Lecture 1 (January 11)
- Topic: Introduction
- Introduction
- Overview of the course
- Course project
- Required Readings:
- What is XQuery?
Per Bothner.
- XQuery: A Guided Tour
Chapter 1 of XQuery from the Experts: A Guide to the W3C XML Query Language
Howard Katz (ed), Addison-Wesley, San Francisco, CA, 2004
- Lecture 2 & 3 (January 13 & 20)
- Topic: Wrapper Learning
- Wrapper Learning, Maintenance and Generation
- Homework
- Assignment 1 Due (January 22 at midnight)
- Assignment 2 Due (January 29 at midnight)
- Required Readings:
- A Hierarchical Approach to Wrapper Induction,
Ion Muslea, Steve Minton, and Craig A. Knoblock.
In Proceedings of the 3rd International Conference on Autonomous Agents, Seattle, WA, 1999.
- Wrapper maintenance: A machine learning approach,
Kristina Lerman, Steven Minton, and Craig A. Knoblock
Journal of Artificial Intelligence Research, 2003
- RoadRunner: Towards Automatic Data Extraction from Large Web Sites
V. Crescenzi, G. Mecca, P. Merialdo
The VLDB Journal, 109-118, 2001
- Overview of AutoFeed: An Unsupervised Learning System for Generating Webfeeds
B. Cenk Gazen and Steven Minton
Proceedings of AAAI, 2006
- Recommended Readings:
- Hierarchical Wrapper Induction for Semistructured Information Sources,
I. Muslea, S. Minton, C. Knoblock
Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.
- Active Learning with Multiple Views,
Ion Muslea,
Journal of Artificial Intelligence Research 27, pp. 202-233
- Wrapper Maintenance
K. Lerman and C. Knoblock
Encyclopedia of Database Systems, 2009
- Automatic information extraction from large websites
Crescenzi, V. and Mecca, G.
J. ACM 51, 5 (Sep. 2004), 731-779. (Longer version of the RoadRunner paper)
- Using the Structure of Web Sites for Automatic Segmentation of Tables
Kristina Lerman, Lise Getoor, Steven Minton and Craig Knoblock
In Proc. of SIGMOD, 2004, Paris, France.
- Lecture 4 & 5 (January 25 & 27)
- Topic: Data Integration
- Data Integration
- Data Integration Tools
- Homework
- Assignment 3 Due (February 5 at midnight)
- Required Readings:
- Recommended Readings:
- Lecture 6 (February 1)
- Topic: Schema Matching
- Homework
- Assignment 4 Due (February 12 at midnight)
- Lecture 7 (February 3)
- Lecture 8 & 9 (February 8 & 10)
- Topic: Semantic Web
- Semantic Web
- Semantic Web Tools
- Homework
- Assignment 5 Due (February 19 at midnight)
- Required Readings:
- Recommended Readings:
- Lecture 10 (February 17)
- Topic: Mashups
- Building Mashups
- Mapping/Modeling Tools
- Lecture 11 & 12 (February 22 & 24)
- Topic: Information Extraction
- Information Extraction Systems
- IE Tools
- Homework
- Assignment 6 Due (March 5 at midnight)
- Required Reading (Lecture 11):
- Required Reading (Lecture 12):
- Optional Reading:
- Lecture 12 & 13 (March 1 & 3)
- Topic: Record Linkage
- Homework
- Assignment 7 Due (March 12 at midnight)
- Required Readings:
- Learning Blocking Schemes for Record Linkage,
Matthew Michelson and Craig A. Knoblock,
In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-2006),
Boston, MA, 2006.
- Adaptive Duplicate Detection Using Learnable String Similarity Measures,
Mikhail Bilenko and Raymond J. Mooney.
In Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-2003), pp. 39-48, Washington, DC,
August 2003.
- EntityBases: Compiling, Organizing and Querying Massive Entity Repositories
Craig A. Knoblock, José Luis Ambite, Kavita Ganesan, Maria Muslea, Steven Minton, Greg Barish, Evan Gamble, Claude Nanjo, Kane See, Cyrus Shahabi, Ching-Chien Chen
In Proceedings of International Conference on Artificial Intelligence, 2007
- Optional Readings:
- Lecture 14 & 15 (March 8 & March 10)
- Topic: Social Web
- Social Tagging
- Social Networks and Privacy
- Homework
- Assignment 8 Due (March 19 at midnight)
- Required Readings (Lecture 14):
- Recommended Readings (Lecture 14):
- Harvesting Social Knowledge from Folksonomies,
Wu, H., Zubair, M. and Maly, K.
In Proc. of Conf. on Hypertext and Hypermedia, 2006.
- HT06, tagging paper, taxonomy, Flickr, academic article, to read,
Marlow, C., Naaman, M., boyd, d. and Davis, M.
In Proc. of Conf. on Hypertext and Hypermedia, 2006.
- Required Readings (Lecture 15):
- Slides
- Lecture 16 & 17 (March 22 & March 24)
- Topic: Sentiment Analysis
- Sentiment Analysis
- Folksonomies
- Required Readings (Lecture 16):
- Recommended Readings (Lecture 16):
- Required Readings (Lecture 17):
- Recommended Readings (Lecture 17):
- Slides
- Lecture 18 (March 29)
- Topic: Geocoding
- Required Readings:
- Lecture 19 (March 31)
- Topic: Geospatial Source Discovery
- Required Readings:
- Geospatial Search Service
Prochazka, D., Motycka, A.
Information Society. Ljubljana, Slovenia, 2008
- Identifying Maps on the World Wide Web
Michelson, M.; Goel, A.; and Knoblock, C. A.
In Proceedings of the 5th International Conference on GIScience, 2008
- A Data Integration Approach to Dynamically Fusing Geospatial Sources
Thakkar, S.
Ph.D. Thesis, Department of Computer Science, University of Southern California, 2007
- Lecture 20 (April 5)
- Topic: Geospatial Data Fusion
- Required Readings:
- Slides
- Lecture 21 (April 7)
- Topic: Geotagging
- Required Readings:
- STEWARD: Architecture of a spatio-textual search engine
M. D. Lieberman, H. Samet, J. Sankaranarayanan, and J. Sperling
In Proceedings of the 15th ACM International Symposium on Geographic Information Systems (ACM GIS'07), pages 186-193, Seattle, WA, November 2007
- Web-a-where: Geotagging Web Content
Amitay E., Har'El N., Sivan R., Soffer A.
ACM SIGIR, 2004
- Lecture 22 (April 12)
- Topic: Map Processing
- Required Readings:
- Accurate and Automatic Extraction of Road Intersections From Raster Maps
Y.-Y. Chiang, C. A. Knoblock, C. Shahabi, and C.-C. Chen
Geoinformatica 13, 2 (2008), 121-157
- Integrated text and line-art extraction from a topographic map
L. Li, G. Nagy, A. Samal, S. C. Seth, and Y. Xu.
International Journal of Document Analysis and Recognition, 2(4):177-185, 2000
- Slides
- Lecture 23 (April 14)
- Topic: Intellectual Property
- Intellectual Property and Web
- Project Presentations 1 (April 19)
- Project Presentations 2 (April 21)
- Lecture 24 (April 28)
- Topic: Geospatial Reasoning
- Final Exam (Thursday, May 7, 2-4pm)
Last Updated August 17th 2010