CS 562: Empirical Methods in Natural Language Processing -- Fall 2009

Time: Tu/Th, 11:00am-12:20pm
Location: GFS 223
Instructors: David Chiang (chiang@isi.edu)
Liang Huang (lhuang@isi.edu)
Teaching Assistant: Oana Nicolov (oana@isi.edu)

Course Description

This graduate course covers the basics of statistical methods for processing human language, intended for:

  1. students who want to understand current natural-language processing (NLP) research,
  2. students interested in tools for building NLP applications,
  3. machine-learning students looking for large-scale application domains, and
  4. students seeking experience with probabilistic methods that can be applied to a range of AI problems.
Prerequisite: CS561 or permission of instructor.

Textbooks

Class Notes (these will be handed to you in the first class).
Optional text: Jurafsky and Martin, Speech and Language Processing.

Requirements and policies

Students will experiment with existing NLP software toolkits and write their own programs. There will be six programming assignments and a research project; there will be no exams. Grades will be based on:

Students are expected to submit only their own work for homework assignments. They may discuss the assignments with one another but may not collaborate with or copy from one another.

All assignments and the project will be due at the beginning of class (11am) on the due date. Late assignments will be accepted with a 20% penalty at the beginning of the following class; with a 40% penalty up to a week after the due date. No exceptions can be made except for a grave reason.

Syllabus (Subject to change!)

Date Topic Instructor Slides/Readings Assignments
Tue, Aug 25 Introduction: real-world applications of NLP
Chiang & Huang
slides -
Thu, Aug 27 Basic linguistic theory. Words, parts-of-speech, ambiguity, morphology, phrase structure, word senses, speech. Text corpora and processing tools.
Chiang
Notes: p1-6
SLP: Ch1
Assignment 0: nothing to turn in.
Unit One: Sequence models
Tue, Sep 1 Basic automata theory. Finite-state acceptors and intersection. NFA determinization and DFA minimization. Regular Expressions.
Huang
Notes: p8-13
SLP: Ch2
slides (Tu-Th)
-
Thu, Sep 3 Finite-state transducers and composition. Applications in morphology, text-to-sound conversion, word alignment. Notes: p14-19
SLP: Ch3
Assignment 1 out
Tue, Sep 8 Basic probability theory. Conditional probability, Bayes' rule, estimating parameter values from data, priors, building generative stochastic models. Notes: p20-27
slides (Tu-Th)
-
Thu, Sep 10 The noisy-channel framework. Probabilistic finite-state acceptors and transducers. Notes: p28-34 Assignment 1 due
Tue, Sep 15 Language modeling and using language models to resolve ambiguities across a wide range of applications.
Chiang
Notes: p36-53
SLP: Ch4

Adv. Smoothing: [CG98]

-
Thu, Sep 17 -
Tue, Sep 22 Training and testing data. The sparse data problem. Smoothing with held-out data. Assignment 2 out
Thu, Sep 24 -
Tue, Sep 29 String transformations. A simple framework for stochastically modeling many types of string transformations, such as: tagging word sequences with parts of speech, cleaning up misspelled word sequences, automatically marking-up names, organizations, and locations in raw text, etc. Estimating parameter values from annotated data. Various Dynamic Programming Algorithms for sequences.
Huang
Notes: p54-65
SLP: Ch5
[slides]

Semirings and DP: [H08]
Writing Systems: [KS09]

Assignment 2 due
Thu, Oct 1 Assignment 3 out
[data]
Tue, Oct 6 -
Thu, Oct 8 Problems involving incomplete data. Hidden parameters.
Chiang
Notes: p66-90
SLP: Ch6
[DLR77][C97][D02][B06]
Assignment 3 due
Tue, Oct 13 The Expectation-Maximization algorithm -
Thu, Oct 15 Assignment 4 out
Tue, Oct 20 The Forward-Backward algorithm -
Thu, Oct 22 -
Unit Two: Tree models
Tue, Oct 27 Limitations of finite-state machines. Natural language syntax. Context-free grammars.
Huang
SLP: Ch12.
[slides (Tu-Th)]
Project Guidelines out.
Thu, Oct 29 A Toy Grammar for English;
Probabilistic CFGs;
Parsing: the CKY algorithm.
SLP: Ch13-14. -
Tue, Nov 3 CKY Parsing Examples; Ambiguities in Chart.
Chomsky Normal Form and Binarization.
SLP: Ch14.
Assignment 4 due
Thu, Nov 5 Parser Evaluation Metrics.
Parsing as Deduction, Intersection, and Generation.
Packed Forests and Knuth 1977 Algorithm.
- Assignment 5 out
Tue, Nov 10 Statistical (Treebank) Parsing.
The Inside-Outside algorithm.
Tree Substitution Grammars.
- Initial project proposal due
Thu, Nov 12 Applications in Machine Translation:
  • Synchronous Grammars; Translation as Parsing;
  • Tree Transducers and Tree-to-String Translation
- -
Tue, Nov 17 - Assignment 5 due
Unit Three: Discriminative training
Thu, Nov 19 Discriminative training: Perceptron
Chiang
- Final project proposal due. Data preparation and baseline finished.
Tue, Nov 24 Maximum-entropy models. Conditional random fields. Stochastic gradient ascent. - Assignment 6 out
Thu, Nov 26 Thanksgiving - NO CLASS
Tue, Dec 1 Max-margin methods
Chiang
- -
Thu, Dec 3 Research project presentations
Yourself
- Assignment 6 due
Wed, Dec 16 - - - Research project due by e-mail

Extra Readings (not required, but recommended)

[H08] Tutorial on Semirings and Advanced Dynamic Programming in NLP (slides).
Liang Huang
Tutorial given at COLING 2008 and NAACL 2009.
[KS09] Tutorial on Writing Systems, Transliteration, and Decipherment (slides).
Kevin Knight and Richard Sproat
Tutorial given at NAACL 2009.
[CG98] An Empirical Study of Smoothing Techniques for Language Modeling
Stanley F. Chen and Joshua Goodman
Technical Report TR 10-98, Computer Science Group, Harvard University.
[DLR77] Maximum Likelihood from Incomplete Data via the EM Algorithm
A.P. Dempster, N.M. Laird, and D.B. Rubin
Journal of the Royal Statistical Society, Vol. 39, No. 1, 1977
[C97] The EM Algorithm
Michael Collins
[D02] The Expectation Maximization Algorithm
Frank Dellaert
Technical Report, 2002
[B06] The Expectation Maximization Algorithm. A short Tutorial.
Sean Borman