Instructors: Prof. Kevin Knight and Prof. Daniel Marcu
Teaching Assistant: Jonathan May
Class Meeting Time:
Tues & Thurs 11am-12:20pm
This graduate course covers the basics of statistical methods for processing human language, intended for:
(1) students who want to understand current natural-language processing (NLP) research,
(2) students interested in tools for building NLP applications,
(3) machine-learning students looking for large-scale application domains, and
(4) students seeking experience with probabilistic methods that can be applied to a range of AI problems.
Students will experiment with existing NLP software toolkits and write their own programs. Grades will be based on six programming assignments (12% each, 72% total) and a final project (28%); there will be no midterm or final exam.
Office hours: TBA.
Software: Tiburon tree automata toolkit (http://www.isi.edu/publications/licensed-sw/tiburon/)
Aug 22
Sample NLP application: overview of machine translation, an example state-of-the-art natural language system.
Basic linguistic theory. Words, parts-of-speech, ambiguity, morphology, phrase structure, word senses, speech. Text corpora and processing tools.
Programming Assignment 0 (no credit) out Aug 24; nothing to turn in.
Aug 29, 31
Basic automata theory. Finite-state acceptors and intersection. Finite-state transducers and composition. Applications in morphology and text-to-sound conversion. Context-free grammars and parsing.
Programming Assignment 1 out Aug 31, due beginning of class Sept 7.
Topic: Finite-state acceptors for natural language.
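Finite-state acceptors of the kind covered in this unit can be sketched in a few lines of Python. The transition table and toy language below are invented for illustration, not taken from the course materials.

```python
# Minimal sketch of a finite-state acceptor (FSA): a transition table,
# a start state, a set of final states, and a membership test.

def accepts(transitions, start, finals, symbols):
    """Return True if the symbol sequence drives the FSA from the
    start state into a final state."""
    state = start
    for sym in symbols:
        if (state, sym) not in transitions:
            return False  # no transition: the string is rejected
        state = transitions[(state, sym)]
    return state in finals

# Toy acceptor for the regular language: "the" ("big")* ("dog" | "cat")
transitions = {
    (0, "the"): 1,
    (1, "big"): 1,
    (1, "dog"): 2,
    (1, "cat"): 2,
}

print(accepts(transitions, 0, {2}, "the big big dog".split()))  # True
print(accepts(transitions, 0, {2}, "dog the".split()))          # False
```

Because transitions are just dictionary entries, operations such as intersection amount to running two such tables in lockstep over product states.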
Sept 5, 7
Basic probability theory. Conditional probability, Bayes rule, estimating parameter values from data, building generative stochastic models, the noisy-channel framework. Probabilistic finite-state acceptors and transducers.
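As a sketch of the noisy-channel framework described above: decoding chooses the source string e maximizing P(e) · P(f | e), combining a prior (language model) with a channel (corruption) model. The tiny probability tables below are invented toy numbers.

```python
# Sketch of the noisy-channel decision rule: given an observed
# (corrupted) string f, pick the source string e maximizing
# P(e) * P(f | e).

def noisy_channel_decode(observed, prior, channel):
    """argmax over e of P(e) * P(observed | e)."""
    return max(prior, key=lambda e: prior[e] * channel.get((observed, e), 0.0))

prior = {"the": 0.6, "thee": 0.001}     # language model P(e), toy values
channel = {("teh", "the"): 0.1,          # typo model P(f | e), toy values
           ("teh", "thee"): 0.05}

print(noisy_channel_decode("teh", prior, channel))  # "the"
```

The prior strongly favors "the", so even a somewhat higher channel probability for "thee" cannot win; this is exactly the Bayes-rule tradeoff the unit develops.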
Sept 12, 14, 19, 21
Language modeling. Estimating the frequency of English strings. Using language models to resolve ambiguities across a wide range of applications. Training and testing data. The sparse data problem. Smoothing with held-out data.
Programming Assignment 2 out Sept 14, due beginning of class Sept 21.
Topic: Weighted finite-state acceptors for language modeling.
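A minimal sketch of this unit's ideas, assuming a toy two-sentence corpus: a bigram language model with add-one (Laplace) smoothing, so that bigrams unseen in training still receive nonzero probability.

```python
from collections import Counter

# Sketch of a bigram language model with add-one (Laplace) smoothing.
# The corpus is a toy example; real models use much larger data and
# better smoothing (e.g., held-out estimation, as covered in class).

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(toks[:-1])          # counts of bigram contexts
        bigrams.update(zip(toks, toks[1:]))  # counts of adjacent pairs
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return unigrams, bigrams, len(vocab)

def prob(w_prev, w, unigrams, bigrams, v):
    # add-one smoothing: unseen bigrams get a small nonzero probability
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + v)

unigrams, bigrams, v = train_bigram(["the dog barks", "the cat sleeps"])
p_seen = prob("the", "dog", unigrams, bigrams, v)      # seen in training
p_unseen = prob("dog", "sleeps", unigrams, bigrams, v)  # never seen
print(p_seen > p_unseen)  # True: seen bigrams score higher
```

The same conditional probabilities can be compiled into arc weights of a weighted finite-state acceptor, which is the representation the programming assignment uses.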
Sept 26, 28; Oct 3, 5
String transformations. A simple framework for stochastically modeling many types of string transformations, such as tagging word sequences with parts of speech, cleaning up misspelled word sequences, and automatically marking up names, organizations, and locations in raw text. Estimating parameter values from annotated data.
Programming Assignment 3 out Sept 28, due beginning of class Oct 5.
Topic: Weighted finite-state transducers for string transformation.
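One standard way to decode the stochastic string-transformation models described above is Viterbi search over a hidden Markov model, e.g. for part-of-speech tagging. The tagset and probability tables below are invented toy values.

```python
# Sketch of Viterbi decoding for an HMM part-of-speech tagger: find
# the most probable tag sequence for a word sequence.

def viterbi(words, tags, trans, emit, init):
    # best[t] = (probability, path) of the best tag sequence ending in t
    best = {t: (init[t] * emit[t].get(words[0], 1e-8), [t]) for t in tags}
    for w in words[1:]:
        nxt = {}
        for t in tags:
            p, path = max(
                ((best[s][0] * trans[s][t] * emit[t].get(w, 1e-8), best[s][1])
                 for s in tags),
                key=lambda x: x[0])
            nxt[t] = (p, path + [t])
        best = nxt
    return max(best.values(), key=lambda x: x[0])[1]

tags = ["DT", "NN", "VB"]
init = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans = {"DT": {"DT": 0.01, "NN": 0.9, "VB": 0.09},
         "NN": {"DT": 0.1, "NN": 0.2, "VB": 0.7},
         "VB": {"DT": 0.4, "NN": 0.3, "VB": 0.3}}
emit = {"DT": {"the": 0.9}, "NN": {"dog": 0.5}, "VB": {"barks": 0.5}}

print(viterbi("the dog barks".split(), tags, trans, emit, init))
# ['DT', 'NN', 'VB']
```

The same model can be written as a weighted finite-state transducer from tag sequences to word sequences, with Viterbi decoding corresponding to the best path through the composed machine.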
Oct 10, 12, 17, 19
Hidden parameters. Problems involving incomplete data, such as elementary cryptanalysis, transliteration, machine translation, natural-language interfaces, and deciphering ancient scripts. The EM algorithm.
Programming Assignment 4 out Oct 12, due beginning of class Oct 19.
Topic: Unsupervised learning of natural language structure.
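A classic small instance of the EM algorithm on incomplete data: estimating the biases of two coins when we never observe which coin produced each flip sequence. The data and starting guesses below are invented toy values.

```python
# Sketch of EM for a two-coin mixture: the coin identity for each
# sequence is the hidden variable.

def em_two_coins(sequences, theta_a, theta_b, iters=20):
    for _ in range(iters):
        # E-step: fractionally attribute heads/tails to each coin
        stats = {"A": [0.0, 0.0], "B": [0.0, 0.0]}  # [heads, tails]
        for heads, tails in sequences:
            like_a = theta_a ** heads * (1 - theta_a) ** tails
            like_b = theta_b ** heads * (1 - theta_b) ** tails
            w_a = like_a / (like_a + like_b)  # posterior P(coin A | seq)
            stats["A"][0] += w_a * heads
            stats["A"][1] += w_a * tails
            stats["B"][0] += (1 - w_a) * heads
            stats["B"][1] += (1 - w_a) * tails
        # M-step: re-estimate biases from expected counts
        theta_a = stats["A"][0] / sum(stats["A"])
        theta_b = stats["B"][0] / sum(stats["B"])
    return theta_a, theta_b

# 10 flips per sequence, summarized as (heads, tails)
data = [(9, 1), (8, 2), (2, 8), (1, 9), (9, 1)]
theta_a, theta_b = em_two_coins(data, 0.6, 0.4)
print(round(theta_a, 2), round(theta_b, 2))  # one coin near 0.87, the other near 0.15
```

The E-step/M-step alternation here is the same pattern the course applies to translation and decipherment models, just with far fewer hidden variables.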
Oct 24, 26, 31
Syntactic structures, context-free grammars, parsing, lexicalized grammars, regular tree grammars, syntax-based language models, the inside-outside algorithm.
Programming Assignment 5 out Oct 26, due beginning of class Nov 2.
Topic: Modeling syntactic structure of English.
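Context-free parsing as covered in this unit can be illustrated with CKY recognition over a grammar in Chomsky normal form; the toy grammar below is invented.

```python
# Sketch of CKY recognition for a context-free grammar in Chomsky
# normal form (lexical rules A -> w, binary rules A -> B C).

def cky_recognize(words, lexical, binary, start="S"):
    n = len(words)
    # chart[i][j] = set of nonterminals that can derive words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = {a for a, terminal in lexical if terminal == w}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # try every split point
                for a, (b, c) in binary:
                    if b in chart[i][k] and c in chart[k][j]:
                        chart[i][j].add(a)
    return start in chart[0][n]

lexical = [("DT", "the"), ("NN", "dog"), ("VB", "barks")]
binary = [("NP", ("DT", "NN")), ("S", ("NP", "VB"))]

print(cky_recognize("the dog barks".split(), lexical, binary))  # True
```

Replacing the sets of nonterminals with inside probabilities per nonterminal turns this recognizer into the inside pass of the inside-outside algorithm.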
Nov 2, 7, 9, 14
Tree transformations and applications.
Programming Assignment 6 out Nov 9, due beginning of class Nov 16.
Topic: Modeling syntactic structure.
Initial project proposal due beginning of class Nov 9.
Final project scope settled Nov 16.
Final project write-ups due Dec 12 by email.
Nov 16, 21
Nov 28, 30
Current research in natural language processing.