Time: Tu/Th, 11:00am–12:20pm
Location: KAP 147
Instructors: Prof. David Chiang (chiang@isi.edu)
Prof. Liang Huang (lhuang@isi.edu)
Teaching Assistants: Shu Cai (shucai@isi.edu)
Ashish Vaswani (avaswani@isi.edu)

Course Description

Computers process massive quantities of information every day in the form of human language, yet machine understanding of human language remains one of the great challenges of computer science. How can advances in computing technology enable more intelligent processing of all this language data? Will computers ever be able to use this data to learn language like humans do? This course provides a systematic introduction to statistical models of human language, with particular attention to the structures of human language that inform them and the structured learning and inference algorithms that drive them. This is a lecture course, not a seminar course, but aims to cover both fundamental and cutting-edge research issues.

We will also teach Python in the first two lectures, which you will use in the programming assignments.

Audience This graduate course is intended for:

Prerequisites Students are expected to be proficient in programming, basic algorithms and data structures (e.g., dynamic programming, graph traversal and shortest paths, hashtables and priority queues), discrete math, and basic probability theory. Although the formal prerequisite for the course is currently CSCI 561, students who meet the preceding requirements should email the instructors to waive the prerequisite. Undergraduate students are welcome as well. If you are not sure if you have enough background for this course, you are welcome to sit in on the first few lectures. There will be a Quiz 0 in the third lecture to test if you qualify for this course.

Related Courses This course is part of USC's curriculum in natural language processing. There is a sister course, Natural Language Processing (CSCI 544), offered in the Spring semester. You can take these two in either order.


Requirements and policies

CourseworkStudents will experiment with existing NLP software toolkits and write their own programs. Grades will be based on:

All assignments and the project will be due at the beginning of class (11am) on the due date. Late assignments will be accepted with a 30% penalty up to one week late. No exceptions can be made except for a grave reason. Submissions will be made using Moodle at https://pedtek.isi.edu/moodle.

Students are expected to submit only their own work for homework assignments. They may discuss the assignments with one another but may not collaborate with or copy from one another.

Statement on Academic Integrity USC seeks to maintain an optimal learning environment. General principles of academic honesty include the concept of respect for the intellectual property of others, the expectation that individual work will be submitted unless otherwise allowed by an instructor, and the obligations both to protect one's own academic work from misuse by others as well as to avoid using another's work as one's own. All students are expected to understand and abide by these principles. Scampus, the Student Guidebook, contains the Student Conduct Code in Section 11.00, while the recommended sanctions are located in Appendix A: http://www.usc.edu/dept/publications/SCAMPUS/gov. Students will be referred to the Office of Student Judicial Affairs and Community Standards for further review, should there be any suspicion of academic dishonesty. The Review process can be found at: http://www.usc.edu/student-affairs/SJACS.

Syllabus (currently being updated for 2011)

Date Topic Instructor Slides/Readings Assignments
Tue, Aug 23 What is natural language processing for? What is the empirical approach to NLP? Rudiments of linguistic theory. Introduction to the Python programming language with application to text processing. Chiang/Huang
Thu, Aug 25 Chiang HW0: nothing to turn in.
Unit One: Sequence Models (10 lectures)
Tue, Aug 30 Basic automata theory. Finite-state acceptors and transducers. Chiang Quiz 0
Thu, Sep 1 HW1 out: FSTs
Tue, Sep 6 Probability theory and estimation. Weighted FSTs. The noisy-channel model.
Thu, Sep 8
Add/drop deadline
Tue, Sep 13 Language modeling and using language models to resolve ambiguities across a wide range of applications.
Training and testing data.
The sparse data problem; smoothing with held-out data.
Huang J&M chapter 4 [slides] HW1 due
Thu, Sep 15 HW2 out: LMs
Tue, Sep 20
Thu, Sep 22 String transformations. A simple framework for stochastically modeling many types of string transformations, such as: tagging word sequences with parts of speech, cleaning up misspelled word sequences, automatically marking-up names, organizations, and locations in raw text, etc. Estimating parameter values from annotated data. Viterbi algorithm. Huang [slides]
Tue, Sep 27 HW2 due
Thu, Sep 29 Quiz 1
HW3 out: Tagging/Transliteration
Unit Two: Tree Models (7 lectures)
Tue, Oct 4 Context-Free Grammars (CFGs) Chiang
Thu, Oct 6 Probabilistic Context-Free Grammars (PCFGs)
CKY algorithm
Tue, Oct 11 Binarization; Unary Rules HW3 due
Thu, Oct 13 Parsing as Deduction and Intersection
Forests and Hypergraphs
Dynamic Programming on Hypergraphs
From Viterbi to CKY
From Dijkstra to Knuth 77
HW4 out: Parsing
Tue, Oct 18
Thu, Oct 20
Tue, Oct 25 Synchronous Grammars and Machine Translation Quiz 2
Unit Three: Learning Language (8 lectures)
Thu, Oct 27 Unsupervised Learning Huang [slides] HW4 due
Project guidelines out
Tue, Nov 1 EM Algorithm: slow version HW5 out: EM
Thu, Nov 3 EM: fast version (Forward-Backward)
Tue, Nov 8 Theory behind EM: converging to local maximum Initial project proposal due
Thu, Nov 10
Withdraw deadline
Discriminative Learning: The Perceptron Chiang
Tue, Nov 15 Conditional Random Fields (CRFs) HW5 due
Thu, Nov 17 Max-margin methods Final project proposal due
HW6 out: Discriminative Learning
Tue, Nov 22 More on Discriminative Training Quiz 3
Thu, Nov 24 Thanksgiving – no class
Tue, Nov 29 Project interim presentations You
Thu, Dec 1 Project interim presentations You HW6 due
Wed, Dec 14 Final projects due

Suggested Readings

[LY90] The estimation of stochastic context-free grammars using the inside-outside algorithm.
K. Lari and S. J. Young.
Computer Speech and Language, 1990: 4:35-56.
[SSP95] Principles and Implementation of Deductive Parsing.
Stuart Shieber, Yves Schabes, and Fernando Pereira.
Journal of Logic Programming, 1995: 24 (1-2): 3-36.
[H08] Tutorial on Semirings and Advanced Dynamic Programming in NLP (slides).
Liang Huang
Tutorial given at COLING 2008 and NAACL 2009.
[KS09] Tutorial on Writing Systems, Transliteration, and Decipherment (slides).
Kevin Knight and Richard Sproat
Tutorial given at NAACL 2009.
[CG98] An Empirical Study of Smoothing Techniques for Language Modeling
Stanley F. Chen and Joshua Goodman
Technical Report TR 10-98, Computer Science Group, Harvard University.
[DLR77] Maximum Likelihood from Incomplete Data via the EM Algorithm
A.P. Dempster, N.M. Laird, and D.B. Rubin
Journal of the Royal Statistical Society, Vol. 39, No. 1, 1977
[C97] The EM Algorithm
Michael Collins
[D02] The Expectation Maximization Algorithm
Frank Dellaert
Technical Report, 2002
[B06] The Expectation Maximization Algorithm -- A short Tutorial.
Sean Borman

Some useful Python modules