Computer Science 562 - Empirical Methods in Natural Language Processing, Fall 2007

 

Instructors: Prof. David Chiang and Prof. Kevin Knight

Teaching Assistant: Steve DeNeefe

knight@isi.edu, chiang@isi.edu, sdeneefe@isi.edu

Class Meeting Time: 

Tue & Thu, 11:00am-12:20pm

Class Location: 

WPH B30 

 

This graduate course covers the basics of statistical methods for processing human language. It is intended for:

(1) students who want to understand current natural-language processing (NLP) research,
(2) students interested in tools for building NLP applications,
(3) machine-learning students looking for large-scale application domains, and
(4) students seeking experience with probabilistic methods that can be applied to a range of AI problems.

 

Students will experiment with existing NLP software toolkits and write their own programs. Grades will be based on seven programming assignments (10% each, 70% total) and a final project (30%). There will be no midterm or final exam.

 

Office hours: TBA. 

 

Course software: 

Syllabus

Aug 28 (Knight)

Example state-of-the-art natural language application: Machine Translation.

Aug 30 (Chiang)

Basic linguistic theory. Words, parts-of-speech, ambiguity, morphology, phrase structure, word senses, speech. Text corpora and processing tools. 

Programming Assignment 0 (no credit) out Aug 30, nothing to turn in.

Sept 4, 6 (Chiang) 

Basic automata theory. Finite-state acceptors and intersection. Finite-state transducers and composition. Applications in morphology and text-to-sound conversion. Context-free grammars and parsing.  

Programming Assignment 1 out Sept 6, due beginning of class Sept 13 (sample solutions: part 2, part 3, part 4). 

Topic: Finite-state acceptors for natural language. 

Sept 11, 13 (Chiang) 

Basic probability theory. Conditional probability, Bayes' rule, estimating parameter values from data, building generative stochastic models, the noisy-channel framework. Probabilistic finite-state acceptors and transducers.

Sept 18, 20, 25, 27 (Knight) 

Language modeling. Estimating the frequency of English strings. Using language models to resolve ambiguities across a wide range of applications. Training and testing data. The sparse data problem. Smoothing with held-out data.  

Programming Assignment 2 out Sept 20, due beginning of class Sept 27 (get files here).  

Topic: Weighted finite-state acceptors for language modeling. 
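As a rough illustration of smoothing with held-out data, the sketch below mixes a bigram estimate with an add-one unigram estimate and picks the mixing weight that scores best on held-out text. It is not the assignment code: the function names, the candidate weights, and the toy vocabulary handling are all assumptions made for this example.

```python
import math
from collections import Counter


def train_counts(tokens):
    """Count unigrams, bigram histories, and bigrams in a token list."""
    unigrams = Counter(tokens)
    histories = Counter(tokens[:-1])          # tokens that have a successor
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, histories, bigrams


def interpolated_prob(prev, word, unigrams, histories, bigrams, lam, vocab_size):
    """P(word | prev) as a mix of a bigram MLE and an add-one unigram estimate.

    Assumes every training word is in the vocabulary, so the add-one unigram
    distribution sums to one over vocab_size word types.
    """
    total = sum(unigrams.values())
    p_uni = (unigrams[word] + 1) / (total + vocab_size)
    if histories[prev] == 0:
        return p_uni                          # unseen history: unigram only
    p_bi = bigrams[(prev, word)] / histories[prev]
    return lam * p_bi + (1 - lam) * p_uni


def log_likelihood(tokens, unigrams, histories, bigrams, lam, vocab_size):
    """Log-probability of a token sequence under the interpolated model."""
    return sum(
        math.log(interpolated_prob(p, w, unigrams, histories, bigrams, lam, vocab_size))
        for p, w in zip(tokens, tokens[1:])
    )


def pick_lambda(held_out, unigrams, histories, bigrams, vocab_size,
                candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Choose the mixing weight that maximizes held-out log-likelihood."""
    return max(
        candidates,
        key=lambda lam: log_likelihood(held_out, unigrams, histories, bigrams,
                                       lam, vocab_size),
    )
```

The weight is tuned on held-out text rather than on the training text because the training text always favors the unsmoothed bigram estimate.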

Oct 2, 4, 9, 11 (Knight) 

String transformations. A simple framework for stochastically modeling many types of string transformations, such as tagging word sequences with parts of speech, cleaning up misspelled word sequences, and automatically marking up names, organizations, and locations in raw text. Estimating parameter values from annotated data.

Programming Assignment 3 out Oct 4, due beginning of class Oct 11 (get files here).  

Topic: Weighted finite-state transducers for string transformation. 

Oct 16, 18, 23, 25 (Chiang/DeNeefe) 

Hidden parameters. Problems involving incomplete data, such as elementary cryptanalysis, transliteration, machine translation, and deciphering ancient scripts. The EM algorithm and the forward-backward algorithm.

Programming Assignment 4 out Oct 18, due beginning of class Oct 25.  

Topic: Unsupervised learning of natural language structure. 
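For orientation, here is a minimal sketch of the forward pass, the half of the forward-backward algorithm that sums over all hidden state sequences to get the probability of an observed string. The function name and the dictionary-based model layout are assumptions made for this example, not the course's code, and a practical version would work in log space or rescale to avoid underflow on long inputs.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under a hidden Markov model.

    start_p[s]     : probability of starting in state s
    trans_p[r][s]  : probability of moving from state r to state s
    emit_p[s][o]   : probability that state s emits symbol o
    """
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())
```

The backward pass is the mirror image, and EM combines the two to compute expected counts for re-estimating the transition and emission tables.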

Oct 30;  Nov 1, 6 (Knight)

Syntactic structures, context-free grammars, regular tree grammars, syntax-based language models, inside-outside algorithm. 

Programming Assignment 5 out Nov 1, due beginning of class Nov 8.  

Topic: Decipherment and forward-backward (get files here, see helpful paper here).

Nov 8, 13, 15, 20 (Chiang/DeNeefe) 

Tree transformations.  Applications in machine translation.

Programming Assignment 6 out Nov 15, due beginning of class Nov 29 (extended from Nov 27).  

Topic: Modeling syntactic transformations (get files here and updated Tiburon jar here).

Nov 22 -- Holiday 

Nov 27, 29;  Dec 4, 6 (Chiang)

Maximum entropy, discriminative training, conditional random fields, and other learning methods.

Programming Assignment 7 out Nov 29, due beginning of class Dec 6.  

Topic: Supervised training using Conditional Random Fields (get files here). 

 

Initial project proposal due beginning of class Nov 13 (get guideline document here).

Final project scope settled Nov 20.

Final project write-ups due on or before Dec 17 by email.