Computer Science 562 - Empirical Methods in Natural Language Processing, Fall 2008

Instructors: Prof. David Chiang and Prof. Kevin Knight

Teaching Assistant: Ashish Vaswani

knight@isi.edu, chiang@isi.edu, vaswani@usc.edu

Class Meeting Time:

Tues & Thu 11am-12:20pm 

Class Location:

WPH B30

Prerequisite: CS561 or permission of instructor

Course Description

This graduate course covers the basics of statistical methods for processing human language, intended for:

(1) students who want to understand current natural-language processing (NLP) research,
(2) students interested in tools for building NLP applications,
(3) machine-learning students looking for large-scale application domains, and
(4) students seeking experience with probabilistic methods that can be applied to a range of AI problems.

Students will experiment with existing NLP software toolkits and write their own programs. Grades will be based on six programming assignments (72% total, 12% each) and a final project (28%); there will be no midterm or final exam. Students may not collaborate on assignments; doing so will be considered cheating. The late penalty is 30% off for up to a week late, with no credit thereafter. Optional text: Jurafsky & Martin, Speech and Language Processing.
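As an informal illustration of the grading arithmetic above, here is a minimal Python sketch. It rests on assumptions not stated in the syllabus: that "30% off" means the earned score on a late assignment is multiplied by 0.7, that "a week" means 7 days, and that scores are expressed as fractions between 0 and 1. The function names are hypothetical, and the instructors' actual computation may differ.

```python
ASSIGNMENT_WEIGHT = 12.0  # percent per assignment; six assignments give 72%
PROJECT_WEIGHT = 28.0     # percent for the final project
NUM_ASSIGNMENTS = 6


def late_factor(days_late):
    """Fraction of earned credit kept (assumed reading of the late policy)."""
    if days_late <= 0:
        return 1.0   # on time: full credit
    if days_late <= 7:
        return 0.7   # up to a week late: 30% off (assumed multiplicative)
    return 0.0       # more than a week late: no credit


def course_total(assignment_scores, project_score, days_late=None):
    """Return the course total in percent.

    assignment_scores: six fractions in [0, 1]
    project_score: a fraction in [0, 1]
    days_late: optional list of six lateness values in days
    """
    if len(assignment_scores) != NUM_ASSIGNMENTS:
        raise ValueError("expected six assignment scores")
    if days_late is None:
        days_late = [0] * NUM_ASSIGNMENTS
    if len(days_late) != NUM_ASSIGNMENTS:
        raise ValueError("expected six lateness values")
    assignments = sum(
        ASSIGNMENT_WEIGHT * score * late_factor(late)
        for score, late in zip(assignment_scores, days_late)
    )
    return assignments + PROJECT_WEIGHT * project_score
```

Under these assumptions, a student with perfect scores who turns in one assignment three days late would lose 3.6 points of the course total (30% of that assignment's 12%).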

 

Office hours: TBA.

Course software:

· Carmel finite-state string toolkit (http://www.isi.edu/licensed-sw/carmel/)

· Tiburon tree automata toolkit (http://www.isi.edu/licensed-sw/tiburon/)

 

Syllabus

Aug 26 (Knight)

Example state-of-the-art natural language application: Machine Translation.

Aug 28 (Chiang)

Basic linguistic theory. Words, parts-of-speech, ambiguity, morphology, phrase structure, word senses, speech. Text corpora and processing tools.

Programming Assignment 0 (no credit) out Aug 28, nothing to turn in.

Sept 2, 4 (Chiang)

Basic automata theory. Finite-state acceptors and intersection. Finite-state transducers and composition. Applications in morphology and text-to-sound conversion. Context-free grammars and parsing.

Programming Assignment 1 out Sept 4, due beginning of class Sept 11. 

Topic: Finite-state acceptors for natural language.

Sept 9, 11 (Chiang)

Basic probability theory. Conditional probability, Bayes' rule, estimating parameter values from data, building generative stochastic models, the noisy-channel framework. Probabilistic finite-state acceptors and transducers.

Sept 16, 18, 23, 25 (Knight)

Language modeling. Estimating the frequency of English strings. Using language models to resolve ambiguities across a wide range of applications. Training and testing data. The sparse data problem. Smoothing with held-out data.

Programming Assignment 2 out Sept 23, due beginning of class Sept 30. 

Topic: Weighted finite-state acceptors for language modeling.

Sept 30; Oct 2, 7, 9 (Knight)

String transformations. A simple framework for stochastically modeling many types of string transformations, such as tagging word sequences with parts of speech, cleaning up misspelled word sequences, and automatically marking up names, organizations, and locations in raw text. Estimating parameter values from annotated data.

Programming Assignment 3 out Oct 7, due beginning of class Oct 14. 

Topic: Weighted finite-state transducers for string transformation.

Oct 14, 16, 21, 23 (Knight)

Hidden parameters. Problems involving incomplete data. The EM algorithm and the forward-backward algorithm.

Programming Assignment 4 out Oct 21, due beginning of class Oct 30. 

Topic: Unsupervised learning of natural language structure.

Oct 28, 30; Nov 4 (Chiang)

Syntactic structures, context-free grammars, regular tree grammars, syntax-based language models, inside-outside algorithm.

Final Project overview information given out Oct 30.

Nov 6, 11, 13, 18 (Chiang)

Tree transformations. Applications in machine translation.

Programming Assignment 5 out Nov 6, due beginning of class Nov 18. 

Topic: Modeling syntactic transformations.

Nov 20, 25; Dec 2, 4 (Chiang) (Nov 27 is a holiday)

Maximum entropy, discriminative training, conditional random fields, and other learning methods.

Programming Assignment 6 out Nov 25, due beginning of class Dec 4. 

Topic: Supervised training.

 

Initial project proposal due beginning of class Nov 11.

Final project scope settled Nov 18.

Final project write-ups due on or before Dec 15 by email.