CSCI 599, Spring 2014

Applications of Natural Language Processing:

Machine Translation


Meeting time: TTh 11:00-12:20, KAP 163
Office hours: immediately following each lecture

Instructors

Audience: This graduate course is intended for PhD students (or undergraduate or master's students who plan to continue to a PhD program) who want a foundation for understanding current research in machine translation.


Prerequisites: CSCI 562/662 or permission of instructor. Students should have familiarity with statistical natural language processing and be comfortable with medium-sized programming projects.

Goals: This is an introduction to the field of machine translation (systems that translate speech or text from one human language to another), with a focus on statistical approaches. Three major paradigms will be covered: word-based translation, phrase-based translation, and syntax-based translation. Students will gain hands-on experience with building translation systems and working with real-world data, and they will learn how to formulate and investigate research questions in machine translation.

Textbook: Philipp Koehn, Statistical Machine Translation [Publisher] [Amazon]

Requirements

Course overview (subject to change)

Date Topic Instructor Assignments
Jan 14 No class Required:
  • Koehn, ch. 1 and 2
Read and complete selected exercises:
  • Knight, "A statistical MT tutorial workbook," 1999. [PDF] [RTF]

Part One: Word-based alignment and translation

Jan 16
IBM Models 1–5. Knight Required:
  • Koehn, ch. 4
Background:
  • Koehn, ch. 3
  • CSCI 562 notes on EM
Supplemental:
  • Brown et al., "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics 19(2). [PDF]
  • Knight, "Decoding complexity in word-replacement translation models," Computational Linguistics 25(4) [PDF]
Jan 21 IBM Models 1–5. Required:
  • Vogel, "HMM-Based Word Alignment in Statistical Translation," Proc. COLING, 1996. [PDF]
Jan 23 IBM Models 1–5. Assignment 1 out.
Jan 28 n-gram language models. Absolute discounting and Kneser-Ney smoothing.
Required:
  • Koehn, ch. 7
Supplemental:
  • Chen and Goodman, "An empirical study of smoothing techniques for language modeling," Technical Report 10-98, Harvard University. [PDF]
Jan 30
Add/drop period ends
n-gram language models continued. Very large language models.
Assignment 1 due.
Feb 4 MT evaluation. BLEU. TBA Koehn, ch. 8

Part Two: Phrase-based translation and discriminative training

Feb 6 No class Koehn, ch. 5
Marcu and Wong, "A phrase-based, joint probability model for statistical machine translation." In Proc. EMNLP, 2002. [PDF]
Feb 11 Phrase-based MT. Why do we need phrases? Relationship to EBMT. Phrase extraction. Estimating phrase translation probabilities and the problem of overfitting. Knight Assignment 2 out.
Feb 13 Phrase reordering models. Chiang

Feb 18 Phrase-based decoding. Koehn, ch. 6
Feb 20 Phrase-based decoding continued. k-best lists. Koehn, "Pharaoh: a beam search decoder for phrase-based statistical machine translation models." In Proc. AMTA, 2004. [PDF]
Feb 25 Maximum entropy. Minimum error-rate training. Assignment 2 due. Koehn, ch. 9

Feb 27 Perceptron, max-margin methods. Chiang, "Hope and fear for discriminative training of statistical translation models." [PDF]
Mar 4 System combination. Assignment 3 out.

Interlude: Subword translation

Mar 6 Transliteration. Integrating traditional translation rules. Knight Koehn, ch. 10
Mar 11 Integrating morphology into translation.
Mar 13 Decoding with lattices for morphology and word segmentation. Assignment 3 due.
Mar 18 Spring break

Mar 20 Spring break


Part Three: Syntax-based translation

Mar 25 Hierarchical and syntax-based MT. Why do we need syntax? Synchronous context-free grammars and TSGs.
Chiang Koehn, ch. 11
Chiang, "An introduction to synchronous grammars."
Mar 27 Extracting synchronous CFGs and TSGs from parallel data. Estimating rule probabilities and the problem of overfitting.
Assignment 4 out.
Apr 1 Extracting synchronous TSGs from tree-tree data and the problem of nonisomorphism.
Apr 3 CKY decoding. Chiang, "Hierarchical phrase-based translation."
Apr 8 CKY with an n-gram language model. Assignment 4 due.
Apr 10 More CKY decoding: Binarization. k-best lists. Decoding with lattices. Huang et al., "Binarization for Synchronous Context-Free Grammars"
Huang and Chiang, "Better k-best Parsing"
Apr 15 Source-side tree decoding. Huang et al., "Statistical Syntax-Directed Translation"
Project proposals due.
Apr 17 Syntax-based language models. Knight
Apr 22 Beyond synchronous CFGs and TSGs. Knight, "Capturing Practical Natural Language Transformations"
Apr 24 Semantics-based translation.
Apr 29 Project presentations

May 1 Project presentations

May 14 Projects due.

Course policies

Students are expected to submit only their own work for homework assignments. They may discuss assigned problems with one another but may not write solutions together or copy solutions from one another. University policies on academic integrity will be closely observed.


All assignments and the project are due at the beginning of class on the due date. Late assignments will be accepted up to one week after the due date, with a 30% penalty. Exceptions will be made only for a grave reason.


Statement for Students with Disabilities

Any student requesting academic accommodations based on a disability is required to register with Disability Services and Programs (DSP) each semester. A letter of verification for approved accommodations can be obtained from DSP. Please be sure the letter is delivered to me (or to the TA) as early in the semester as possible. DSP is located in STU 301 and is open 8:30 a.m.–5:00 p.m., Monday through Friday. The phone number for DSP is (213) 740-0776.

 

Statement on Academic Integrity

USC seeks to maintain an optimal learning environment. General principles of academic honesty include the concept of respect for the intellectual property of others, the expectation that individual work will be submitted unless otherwise allowed by an instructor, and the obligations both to protect one’s own academic work from misuse by others and to avoid using another’s work as one’s own. All students are expected to understand and abide by these principles. Scampus, the Student Guidebook, contains the Student Conduct Code in Section 11.00, while the recommended sanctions are located in Appendix A: http://www.usc.edu/dept/publications/SCAMPUS/gov/. Students will be referred to the Office of Student Judicial Affairs and Community Standards for further review, should there be any suspicion of academic dishonesty. The review process can be found at: http://www.usc.edu/student-affairs/SJACS/.