Computer Science 562 - Empirical Methods in Natural Language Processing, Fall 2004

Instructors: Profs. Kevin Knight & Daniel Marcu

Class Meeting Time: Tues & Thurs 2:00pm – 3:20pm

Class Location: VHE217

Prerequisite: CS561

Description

This graduate course covers the basics of statistical methods for processing human language. It is intended for:

(1) students who want to understand current natural-language processing (NLP) research,
(2) students interested in tools for building NLP applications,
(3) machine-learning students looking for large-scale application domains, and
(4) students seeking experience with probabilistic methods that can be applied to a range of AI problems.

Students will experiment with existing NLP software toolkits and write their own programs. Grades will be based on five programming assignments (70% total, 14% each) and a final project (30%); there will be no midterm or final exam.

Office hours: TBA.

Syllabus

Aug 24

Overview.

Aug 26

Basic linguistic theory. Words, parts-of-speech, ambiguity, morphology, phrase structure, word senses, speech. Text corpora and processing tools.

Programming Assignment 0 (no credit) out Aug 26, nothing to turn in.

Aug 31, Sept 2

Basic automata theory. Finite-state acceptors and intersection. Finite-state transducers and composition. Applications in morphology and text-to-sound conversion. Context-free grammars and parsing.

Programming Assignment 1 out Sept 2, due beginning of class Sept 9. 

Topic: Finite-state acceptors for natural language.

Sept 7, 9

Basic probability theory. Conditional probability, Bayes' rule, estimating parameter values from data, building generative stochastic models, the noisy-channel framework. Probabilistic finite-state acceptors and transducers.

Sept 14, 16, 21, 23

Language modeling. Estimating the frequency of English strings. Using language models to resolve ambiguities across a wide range of applications. Training and testing data. The sparse data problem. Smoothing with held-out data.

Programming Assignment 2 out Sept 16, due beginning of class Sept 28. 

Topic: Weighted finite-state acceptors for language modeling.

Sept 28, 30

Guest lectures, TBA.

Oct 5, 7, 12, 14

String transformations. A simple framework for stochastically modeling many types of string transformations, such as tagging word sequences with parts of speech, cleaning up misspelled word sequences, and automatically marking up names, organizations, and locations in raw text. Estimating parameter values from annotated data.

Programming Assignment 3 out Oct 7, due beginning of class Oct 19. 

Topic: Weighted finite-state transducers for string transformation.

Oct 19, 21, 26, 28

Hidden parameters. Problems involving incomplete data, such as elementary cryptanalysis, transliteration, machine translation, NL interfaces, and deciphering ancient scripts. The EM algorithm.

Programming Assignment 4 out Oct 21, due beginning of class Nov 2. 

Topic: Unsupervised learning of natural language structure.

Nov 2, 4, 9, 11

Syntactic structures. Context-free grammars, parsing, lexicalized grammars, syntax-based language models, the inside-outside algorithm.

Programming Assignment 5 out Nov 9, due beginning of class Nov 18. 

Topic: Parsing.

Initial project proposal due beginning of class Nov 4.

Final project proposal due Nov 11.

Project presentations in class Nov 30 and Dec 2. Final project write-ups due Dec 12 by email.

Nov 16, 18, 23

Tree automata and applications.

Nov 30, Dec 2

Project presentations by students in class.

Final project write-ups due Dec 12 by email.