ISI 2012 Summer Internships in Natural Language Processing
University of Southern California

Summer 2012 Internships in Natural Language Processing
USC Information Sciences Institute
Natural Language Group



(View of the ISI building from the Pacific. Photo credit: Jason Riesa.)

APPLICATIONS NOW CLOSED FOR 2012

Thank you for your application. The USC/ISI Natural Language Processing Summer 2012 Internships Committee is reviewing all applications received on or before February 29, 2012. We plan to make decisions over the week of Feb 29–Mar 6, 2012, then begin notifications. If you missed this deadline, please apply next year.

---

We are looking for interested and qualified students (graduate and undergraduate) to spend the summer working with ongoing research projects at USC/ISI on natural language processing, machine learning, statistical modeling, machine translation, automata, and other areas. These are paid internships. Prior experience in natural language processing is not necessarily required: we will get you interested! Good programming skills are required for the types of projects we do.

Internships will be available for a three-month period during the summer of 2012. We arrange a time for joint tutorials at the beginning of the summer, and we have presentations at the end.

Projects

For summer research projects, we are interested in the following areas:

1. Semantics-Based Machine Translation

Translating human languages (e.g., Chinese to English) is a longstanding challenge for computer science. We are developing statistical algorithms that extract large amounts of relevant translation knowledge automatically from bilingual text. This summer we are planning to organize a team project on semantics-based translation: automatically annotating parallel text with meaning structures, designing automata to map these structures from/to syntactic trees in source/target languages, extracting probabilistic mapping rules from data, decoding, further design of semantic structures, etc. We are inspired by Warren Weaver's notion (1949): "Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication -- the real but as yet undiscovered universal language -- and then re-emerge by whatever particular route is convenient."

2. Decipherment

Code-cracking and machine translation have closely intertwined histories. The first job of proto-computers was to crack military codes, and the idea arose soon afterwards of treating foreign language as a code for English. We have several investigations planned, inspired by results we reported in 2011: (1) cracking of the Copiale cipher (see also here and here), (2) discoveries concerning the Voynich manuscript, and (3) the training of a statistical machine translation system without the need for parallel data.

3. Creative Language

In the not-too-distant future, stories, poems, songs, and advertisements will be written by machines, and by human/machine collaborations. We are starting down this path now (for example, see here and here), and there are many research avenues to pursue.

4. Translation and Documentation of Endangered Languages

In the last 50 years, natural language processing research has touched barely 1% of the world's languages. In 100 years, 90% of them will be extinct or nearly so. What can we offer to support the urgent task of documenting and analyzing the world's endangered languages? We are exploring the use of machine translation to accelerate the global language documentation effort. Project members will develop software to be used for language translation and data collection, and new models for automatic learning of families of endangered languages. These technologies are designed to enable speakers of endangered languages who have no specialized linguistic training to create large collections of translated oral literature, providing an authentic and interpretable record of their language, serving current and future generations of scholars, teachers, and learners. We will field-test these methods in Papua New Guinea, where 800 of the world's 6000 languages are spoken, focusing on three languages in the Goroka/Kainantu region: Alekano, Benabena, and Kamano. This internship includes travel for several weeks, expenses covered, to Papua New Guinea.


Research Environment

Summer internship projects are supervised by Kevin Knight, David Chiang, and Liang Huang, and interns also interact and collaborate closely with the rest of the Natural Language Group. Our group's research environment includes weekly seminars and reading groups, opportunities for teaching and advising, an active program for summer students, large quantities of linguistic resources, and a 2000-processor supercomputing cluster completely dedicated to natural language research at USC/ISI.

USC/ISI is an academic research institute that is part of USC's Viterbi School of Engineering; many USC/ISI scientists hold research faculty positions in the computer science department. The Natural Language Group is part of USC/ISI's Intelligent Systems Division, which carries out a wide range of artificial intelligence research.

USC/ISI is located in Marina del Rey on the Southern California coast, an excellent location convenient to beaches, restaurants, boating, bike paths, and shopping. Note: we are not located on the main campus of USC (which is near downtown LA).


Past Interns

Our summer program is well established! Past students are listed below. People really loved this place -- six of them (marked with a *) interned twice, and eight of them (marked with a ^) joined ISI eventually (as a PhD student, visiting PhD student, or research scientist).

2011: Licheng Fang (Rochester), Sravana Reddy* (Chicago), Xuchen Yao (JHU).

2010: Yoav Goldberg (Ben Gurion, Israel), Ann Irvine (Hopkins), Sravana Reddy (Chicago), Alexander "Sasha" Rush (MIT).

2009: Michael Auli (University of Edinburgh), Paramveer Dhillon (Penn), Erica Greene^ (Haverford), Adam Pauls (UC Berkeley).

2008: Amittai Axelrod (University of Washington), John DeNero (UC Berkeley), Kyle Gorman (Penn Linguistics), Catalin Tirnauca (Universitat Rovira i Virgili).

2007: Michael Bloodgood (Delaware), Jennifer Gillenwater (Rice University), Carmen Heger (Dresden), Wei Ho (Princeton).

2006: Joseph Turian (NYU), Chenhai Xi (Pitt), Victoria Fossum*^ (Michigan), Liang Huang*^ (Penn), Jason Riesa*^ (JHU), Oana-Diana Postolache^ (Saarland).

2005: Victoria Fossum (Michigan), Mark Hopkins* (UCLA), Liang Huang (Penn), Behrang Mohit (Pitt), Preslav Nakov (Berkeley), Jason Riesa (JHU), Hao Zhang (Rochester).

2004: Madhur Ambastha (Rochester), Michel Galley* (Columbia), David Kauchak (UCSD).

2003: Michel Galley (Columbia), Mark Hopkins (UCLA), Beata Klebanov (Hebrew University), Ana-Maria Popescu (University of Washington), Lara Taylor (UCSD).

2002: Chris Ackerman (USC), Emil Ettelaie (USC), Yuling Hsueh (USC), John Lee (Waterloo/MIT), Bo Pang (Cornell).

2001: Abdessamad Echihabi (USC), Hal Daume III^ (CMU), Michael Laszlo (Waterloo), Dragos Stefan Munteanu^ (Iowa), Rebecca Rees (BYU), Radu Soricut^ (Iowa).

1994-2000: Estibaliz Amorrortu, Vasileios Hatzivassiloglou (Columbia), Michael Jahr (Stanford), Larry Kite (USC), Magdalena Romera (USC), Maki Watanabe (USC).


Intern Publications

We always aim to solve interesting and novel scientific problems, and to publish the results in the best conferences. Sample papers that have come from past student internships:

"Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation" (J. Riesa, A. Irvine, D. Marcu), Proceedings of EMNLP 2011.

"Language-independent parsing with empty elements" (S. Cai, D. Chiang, Y. Goldberg), Proceedings of ACL 2011.

"Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation" (E. Greene, T. Bodrumlu, K. Knight), Proceedings of EMNLP 2010.

"Efficient optimization of an MDL-inspired objective function for unsupervised part-of-speech tagging" (A. Vaswani, A. Pauls, D. Chiang), Proceedings of ACL 2010.

"Unsupervised Syntactic Alignment with Inversion Transduction Grammars" (A. Pauls, D. Klein, D. Chiang, K. Knight), Proceedings of NAACL 2010.

"Bayesian Inference for Finite-State Transducers" (D. Chiang, J. Graehl, K. Knight, A. Pauls, S. Ravi), Proceedings of NAACL 2010.

"Binarization of Synchronous Context-Free Grammars" (L. Huang, H. Zhang, D. Gildea, K. Knight), Computational Linguistics, 2009.

"Fast Consensus Decoding over Translation Forests" (J. DeNero, D. Chiang, and K. Knight), Proceedings of ACL 2009.

"Forest Rescoring: Faster Decoding with Integrated Language Models" (L. Huang and D. Chiang), Proceedings of ACL 2007.

"Scalable Inference and Training of Context-Rich Syntactic Models" (M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang, and I. Thayer), Proceedings of ACL 2006, poster session.

"Synchronous Binarization for Machine Translation" (H. Zhang, L. Huang, D. Gildea, K. Knight), Proceedings of NAACL 2006.

"Statistical Syntax-Directed Translation with Extended Domain of Locality" (L. Huang, K. Knight, A. Joshi), Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA-06).

"Building an English-Iraqi Arabic Machine Translation System for Spoken Utterances with Limited Resources" (J. Riesa, B. Mohit, K. Knight, D. Marcu), Proceedings of Interspeech 2006.

"Text Simplification for Information Seeking Applications" (B. Beigman Klebanov, K. Knight, D. Marcu), In: On the Move to Meaningful Internet Systems, eds. R. Meersman and Z. Tari, Lecture Notes in Computer Science (3290), Springer-Verlag, 2004.

"What's in a Translation Rule?" (M. Galley, M. Hopkins, K. Knight, D. Marcu), Proceedings of NAACL 2004.

"Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences" (B. Pang, K. Knight, and D. Marcu), Proceedings of NAACL 2003.

"Using a Large Monolingual Corpus to Improve Translation Accuracy" (R. Soricut, K. Knight, and D. Marcu), Proceedings of the 6th Association for Machine Translation in the Americas Conference (AMTA-2002).

"Processing Comparable Corpora With Bilingual Suffix Trees" (D. Munteanu and D. Marcu), Proceedings of EMNLP 2002.

"A Noisy-Channel Model for Document Compression" (H. Daume III and D. Marcu), Proceedings of ACL 2002.

"An Unsupervised Approach to Recognizing Discourse Relations" (D. Marcu and A. Echihabi), Proceedings of ACL 2002.

"Fast Decoding and Optimal Decoding for Machine Translation" (U. Germann, M. Jahr, K. Knight, D. Marcu, and K. Yamada), Proceedings of ACL 2001. ACL Best Paper award.

"An Empirical Study in Multilingual Natural Language Generation: What Should a Text Planner Do?" (D. Marcu, L. Carlson, and M. Watanabe), The 1st International Conference on Natural Language Generation INLG'2000, Mitzpe Ramon, Israel, 2000.

"Experiments in Constructing a Corpus of Discourse Trees" (D. Marcu, E. Amorrortu, and M. Romera), ACL'99 Workshop on Standards and Tools for Discourse Tagging, Univ. Maryland, 1999.

"Two-Level, Many-Paths Generation" (K. Knight and V. Hatzivassiloglou), Proceedings of ACL 1995.


Frequently Asked Questions

(Please read these before contacting us...)