Internships in Natural Language Processing
USC/Information Sciences Institute
Summer 2001

We are looking for interested and qualified students (graduate and undergraduate) to spend the summer working with ongoing research projects at USC/ISI on natural language processing, machine learning, statistical modeling, automatic translation, human/computer dialog, discourse analysis, and other areas. Please click here and here for descriptions of some of these projects.

Currently we have positions open in the following areas:

1. Statistical Machine Translation

Translating human languages (e.g., Chinese to English) is a longstanding challenge for computer science. We are developing and applying statistical algorithms to this problem, extracting large amounts of relevant translation knowledge automatically from bilingual text (e.g., Hong Kong government documents). We face many interesting challenges in this quest to improve significantly on the quality of commercially available translators, and to build translation systems for "smaller" languages (e.g., Tamil and Tetun) that have not yet received significant commercial interest.

2. Statistical Summarization

When humans produce summaries of documents, they do not simply extract sentences and concatenate them. Rather, they create new sentences that are grammatical, that cohere with one another, and that capture the most salient pieces of information in the original document. Given that large collections of text/abstract pairs are available online, it is now possible to envision algorithms that are trained to mimic this process. We have already developed statistical algorithms capable of compressing sentences; these algorithms produce short sentences that are grammatical and that retain the most important pieces of information in the original sentences. Current plans call for scaling up the statistical compression techniques that we developed for sentences so that they apply to entire texts. This work will use discourse and summarization corpora to build statistical models that produce coherent abstracts.

3. Alignment and Exploitation of Biological and Natural Language Sequences

The analysis of sequences in molecular biology (e.g., DNA and proteins) is of great scientific and practical interest. The same is true of natural language sequences (e.g., newswire and bilingual text). Both fields have just witnessed an explosion in available online data. We have sequenced the human genome, but now what? We have tens of millions of words that have been translated manually from English to French, but now what? Fortunately, many sequence-analysis algorithms developed in one field can be usefully applied to the other. For example, if we imagine the New York Times as a live organism, then we can view its linear text stream as its "DNA," and this may run to billions of characters. Le Monde (the French daily) is a related organism with a similar function, but different "DNA." Algorithms now exist in biology for aligning the DNA of two organisms. By applying these kinds of algorithms to natural language text, we can automatically align stories, sentences, phrases, and words. From this aligned data, we know how to automatically construct translation systems. Likewise, algorithms developed in statistical machine translation may be profitably applied to biological sequences. USC has expertise in both of these fields; we also have a newly installed cluster computer for executing large-scale sequence computations.

4. Human/Computer Dialog for Automated Agents in Simulations

Human/computer speech dialog is a research area of increasing importance. We are working on dialog in the context of virtual-reality simulations, where automated agents interact with people and with each other. Natural language is critical for making these simulations seem real. We are also working on "chatterbot" technology to provide robust, realistic conversation capabilities for automated agents. This work is being carried out in collaboration with USC's new Institute for Creative Technologies, which is bringing together Hollywood scriptwriters, game designers, artificial intelligence scientists, and state-of-the-art virtual reality graphics/sound to build compelling simulated worlds.

5. Question Answering

A search on Google for "Natural Language Processing ISI" returns 9,340 hits. Fortunately, the first hit does point to our group web page. This may be sufficient for certain users. But others may be interested in finding answers to more focused questions: "Where is ISI located?" "Who are the people who work at ISI?" "Who joined the ISI group over the last six months?" "Why is ISI the best place for a summer internship?" "Why should I go to ISI for an internship and not to X?" Current search engines are not capable of returning short, precise answers to these questions. In the best-case scenario, the user is left with the task of finding the right snippets of information in dozens of web pages and putting them together to answer the question she is interested in. The work on Question Answering will focus on solving technical NLP challenges that would enable the development of a system capable of automatically answering questions such as those listed above.


The internships will be available for a three-month period, preferably during the summer of 2001. The starting date is negotiable.

If you are interested, please contact either Kevin Knight (knight@isi.edu) or Daniel Marcu (marcu@isi.edu). Please include a resume and let us know which project(s) you would be interested in. We plan to make decisions by February 28, 2001.

For more information about NLP research activities at USC/ISI, please click here. ISI is an academic research institute that is part of USC's School of Engineering. Click here to see a range of artificial intelligence research projects in ISI's Intelligent Systems Division. USC/ISI is located in Marina del Rey on the Southern California coast, convenient to beaches, restaurants, boating, bike paths, and shopping.