Machine Translation of Natural Languages


The Machine Translation (MT) research group at ISI is developing programs that translate Japanese, Arabic, and Spanish texts into English. These programs operate over unrestricted newspaper text. There is a growing market for such general-purpose MT systems, but the accuracy of commercially available translation leaves much to be desired. Our aim is to improve substantially on this accuracy and to make it easier to develop MT systems for new language pairs.

This project was supported by the Department of Defense from 1994 to 1999.


To this end, we are investigating the use of large-scale semantic representations and reasoning not employed in commercial MT. In order to scale up these AI techniques, we are also investigating methods for automatically gathering linguistic knowledge inductively (statistically) from large online text collections.

Our research on MT has spanned many language processing topics, including: robust syntactic parsing, large-scale grammars, semantic interpretation, large-scale conceptual models, natural language generation, statistical language models, transliteration, morphology, and web-based development environments. Within these topics we explore a range of techniques, including hand-coding, supervised learning, and unsupervised training.

Our MT systems are also deployed in a prototype translating copy machine that converts paper documents from one language to another.

Project Members

Kevin Knight, project leader

Ulf Hermjakob, senior research scientist

Bonnie Stalls, computational linguist

Yaser Al-Onaizan, graduate student research assistant

Ulrich Germann, computational linguist

Philipp Köhn, graduate student research assistant

Irene Langkilde, graduate student research assistant

Kenji Yamada, graduate student research assistant

Eduard Hovy, senior project leader


Publications

Knight, K. and K. Yamada. 1999.
A Computational Approach to Deciphering Unknown Scripts. Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing.

We propose and evaluate computational techniques for deciphering unknown scripts. We focus on the case in which an unfamiliar script encodes a known language. The decipherment of a brief document or inscription is driven by data about the spoken language. We consider which scripts are easy or hard to decipher, how much data is required, and whether the techniques are robust against language change over time.
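
The paper itself evaluates probabilistic decipherment methods; as a loose, simplified illustration of the underlying idea (score candidate decipherments with a model of the known spoken language), here is a small hill-climbing solver for a letter-substitution cipher, scored by a character bigram model. The function names, smoothing scheme, and search strategy are our own assumptions, not the paper's.

    import math
    import random
    import string
    from collections import Counter

    def bigram_logprobs(text):
        """Add-one-smoothed character bigram log-probabilities from sample text."""
        pairs = Counter(zip(text, text[1:]))
        unigrams = Counter(text[:-1])
        vocab = set(text)
        return {(a, b): math.log((pairs[(a, b)] + 1) / (unigrams[a] + len(vocab)))
                for a in vocab for b in vocab}

    def score(cipher, key, lp):
        """Log-probability of deciphering `cipher` with `key` (a permutation of a-z)."""
        plain = cipher.translate(str.maketrans(key, string.ascii_lowercase))
        return sum(lp.get(p, -20.0) for p in zip(plain, plain[1:]))

    def decipher(cipher, language_sample, iters=20000):
        """Hill-climb over substitution keys; in practice, use random restarts."""
        lp = bigram_logprobs(language_sample)
        key = list(string.ascii_lowercase)
        random.shuffle(key)
        best = score(cipher, "".join(key), lp)
        for _ in range(iters):
            i, j = random.sample(range(26), 2)     # propose swapping two letters
            key[i], key[j] = key[j], key[i]
            s = score(cipher, "".join(key), lp)
            if s > best:
                best = s                           # keep improving swaps
            else:
                key[i], key[j] = key[j], key[i]    # undo the rest
        return cipher.translate(str.maketrans("".join(key), string.ascii_lowercase))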

Germann, U. 1999.
A Deterministic Dependency Parser for Japanese. Proceedings of the MT Summit VII: MT in the Great Translation Era.

We present a rule-based, deterministic dependency parser for Japanese. It was implemented in C++, using object classes that reflect linguistic concepts and thus facilitate the transfer of linguistic intuitions into code. The parser first chunks morphemes into one-word phrases and then parses from right to left. The average parsing accuracy is 83.6%.
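
A minimal sketch of the right-to-left, deterministic control regime for a head-final language like Japanese: each chunk attaches to the nearest chunk on its right that a hand-written rule licenses. The rule inventory and data format below are invented for illustration and are far simpler than the parser's actual rules.

    def attaches(dependent, head):
        """Toy attachment rules keyed on (dependent type, head type)."""
        rules = {("noun+ga", "verb"), ("noun+wo", "verb"),
                 ("adnominal", "noun"), ("adverb", "verb"), ("verb", "verb")}
        return (dependent["pos"], head["pos"]) in rules

    def parse(chunks):
        """Right-to-left scan: link each chunk to the nearest licensed head on its right."""
        heads = [None] * len(chunks)              # heads[i] = index of chunk i's head
        for i in range(len(chunks) - 2, -1, -1):  # rightmost chunk stays the root
            for j in range(i + 1, len(chunks)):
                if attaches(chunks[i], chunks[j]):
                    heads[i] = j
                    break
            if heads[i] is None:
                heads[i] = len(chunks) - 1        # fallback: attach to the final chunk
        return heads

    # "taroo-ga hon-wo yonda": both noun phrases depend on the final verb.
    print(parse([{"pos": "noun+ga"}, {"pos": "noun+wo"}, {"pos": "verb"}]))  # [2, 2, None]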

Germann, U. 1998.
Making Semantic Interpretation Parser-Independent. Proceedings of the 4th AMTA Conference.

We present an approach to semantic interpretation of syntactically parsed Japanese sentences that is largely parser-independent. The approach relies on a standardized parse tree format that restricts the number of syntactic configurations the semantic interpretation rules have to anticipate. All parse trees are converted to this format before semantic interpretation. This setup allows us not only to apply the same set of semantic interpretation rules to the output of different parsers, but also to develop parsers and semantic interpretation rules independently.
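
A toy sketch of the normalization step: parser-specific trees are mapped into one standard format before the interpretation rules apply. The label mapping and unary-chain collapsing here are illustrative assumptions, not the paper's actual format.

    CANONICAL = {"S": "CLAUSE", "IP": "CLAUSE", "NP": "NOMINAL",
                 "PP": "POSTP", "VP": "PREDICATE"}

    def normalize(tree):
        """tree is (label, children) for nonterminals, or a token string for leaves."""
        if isinstance(tree, str):
            return tree
        label, children = tree
        kids = [normalize(c) for c in children]
        if len(kids) == 1 and not isinstance(kids[0], str):
            return kids[0]     # collapse unary chains the rules need not anticipate
        return (CANONICAL.get(label, label), kids)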

Knight, K. and Y. Al-Onaizan. 1998.
Translation with Finite-State Devices. Proceedings of the 4th AMTA Conference.

Statistical models have recently been applied to machine translation with interesting results. Algorithms for processing these models have not received wide circulation, however. By contrast, general finite-state transduction algorithms have been applied in a variety of tasks. This paper gives a finite-state reconstruction of statistical translation and demonstrates the use of standard tools to compute statistically likely translations. Ours is the first translation algorithm for "fertility/permutation" statistical models to be described in replicable detail.
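
For readers unfamiliar with the machinery, here is a minimal sketch of the core operation, epsilon-free weighted transducer composition, applied to a toy language model and channel model. The arc format and the example models are our assumptions; the paper's "fertility/permutation" models are far richer.

    def compose(t1, t2, start1, start2, final1, final2):
        """Epsilon-free composition: match t1's output labels to t2's input labels.
        Transducers are dicts: state -> [(in, out, cost, next_state), ...]."""
        arcs, stack, seen = {}, [(start1, start2)], {(start1, start2)}
        while stack:
            s1, s2 = stack.pop()
            out = []
            for a, b, c1, n1 in t1.get(s1, []):
                for b2, c, c2, n2 in t2.get(s2, []):
                    if b == b2:                    # labels meet in the middle
                        out.append((a, c, c1 + c2, (n1, n2)))
                        if (n1, n2) not in seen:
                            seen.add((n1, n2))
                            stack.append((n1, n2))
            arcs[(s1, s2)] = out
        return arcs, (start1, start2), (final1, final2)

    LM = {0: [("she", "she", 0.1, 0), ("ate", "ate", 0.2, 0)]}        # toy English acceptor
    CH = {0: [("she", "kanojo", 0.3, 0), ("ate", "tabeta", 0.4, 0)]}  # toy channel transducer
    arcs, start, final = compose(LM, CH, 0, 0, 0, 0)   # a shortest path here = best translation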

Stalls, B. and K. Knight. 1998.
Translating Names and Technical Terms in Arabic Text. COLING/ACL Workshop on Computational Approaches to Semitic Languages. Montreal, Quebec.

It is challenging to translate names and technical terms from English into Arabic. Translation is usually done phonetically: different alphabets and sound inventories force various compromises. For example, Peter Streams may come out as "bytr strymz". This process is called transliteration. We address here the reverse problem: given a foreign name or loanword in Arabic text, we want to recover the original in Roman script. For example, an input like "bytr strymz" should yield an output like Peter Streams. Arabic presents special challenges due to unwritten vowels and phonetic-context effects. We present results and examples of use in an Arabic-to-English machine translator.
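
A drastically simplified, hypothetical illustration of the forward compromise and of recovery by generate-and-test: we pretend the only effect is the loss of written vowels, whereas the paper models the full range of phonetic-context effects probabilistically.

    VOWELS = set("aeiou")

    def consonant_skeleton(name):
        """Crude forward model: drop the vowels ('Peter Streams' -> 'ptr strms')."""
        return "".join(ch for ch in name.lower()
                       if ch == " " or (ch.isalpha() and ch not in VOWELS))

    def back_transliterate(arabic_form, candidates):
        """Generate-and-test: keep the candidates whose skeleton matches the Arabic form."""
        return [c for c in candidates if consonant_skeleton(c) == arabic_form]

    print(back_transliterate("ptr strms", ["Peter Streams", "Pat Storms", "Peter Stream"]))
    # -> ['Peter Streams']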

Germann, U. 1998.
Visualization of Protocols of the Parsing and Semantic Interpretation Steps in a Machine Translation System. COLING-ACL Workshop on Content Visualization and Intermedia Representations. Montreal, Quebec.

In this paper, we describe a tool for the visualization of process protocols produced by the parsing and semantic interpretation modules in a complex machine translation system. These protocols tend to reach considerable sizes, and error tracking in them is tedious and time-consuming. We show how the data in the protocols can be made more easily accessible by extracting a procedural trace, by splitting the protocols into a collection of cross-linked hypertext files, by indexing the files, and by using simple text formatting and sorting of structural elements.
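
A rough sketch of the splitting step, under invented file-naming conventions: a large protocol is carved into navigable HTML pages. Real code would also HTML-escape the text and build the index and cross-links the paper describes.

    def split_protocol(lines, per_page=500, prefix="protocol"):
        """Carve a long protocol into HTML pages with a navigation bar on each."""
        pages = [lines[i:i + per_page] for i in range(0, len(lines), per_page)]
        for n, page in enumerate(pages):
            nav = " | ".join(f'<a href="{prefix}{m}.html">{m}</a>'
                             for m in range(len(pages)))
            body = "\n".join(page)
            with open(f"{prefix}{n}.html", "w") as f:
                f.write(f"<html><body>{nav}<hr><pre>{body}</pre></body></html>")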

Langkilde, I. and Knight, K. 1998.
The Practical Value of N-Grams in Generation. Proceedings of the International Natural Language Generation Workshop. Niagara-on-the-Lake, Ontario.

We examine the practical synergy between symbolic and statistical language processing in a generator called Nitrogen. The analysis provides insight into the kinds of linguistic decisions that bigram frequency statistics can make, and into how these statistics improve scalability. We also discuss the limits of bigram statistical knowledge, focusing on specific examples of Nitrogen's output.
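
The following toy example shows the division of labor the paper analyzes: symbolic knowledge proposes alternative realizations, and bigram statistics pick among them. The counts are invented for illustration; Nitrogen's are drawn from large corpora.

    import math

    BIGRAM = {("she", "visited"): 2000, ("she", "visit"): 30,
              ("visited", "the"): 1500, ("the", "information"): 900}

    def logscore(words, alpha=0.5):
        """Sum of smoothed bigram log-counts; higher means more fluent."""
        return sum(math.log(BIGRAM.get(pair, 0) + alpha)
                   for pair in zip(words, words[1:]))

    candidates = [["she", "visited", "the", "information"],
                  ["she", "visit", "the", "information"]]
    print(max(candidates, key=logscore))   # the bigram model prefers "visited"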

Langkilde, I. and Knight, K. 1998.
Generation that Exploits Corpus-based Statistical Knowledge. Proceedings of COLING/ACL-98. Montreal, Quebec.

We describe novel aspects of a new natural language generator called Nitrogen. This generator has a highly flexible input representation that allows a spectrum of input from syntactic to semantic depth, and shifts the burden of many linguistic decisions to the statistical post-processor. The generation algorithm is compositional, making it efficient, yet it also handles non-compositional aspects of language. Nitrogen's design makes it robust and scalable, operating with lexicons and knowledge bases of one hundred thousand entities.

Knight, K. 1997.
Automating Knowledge Acquisition for Machine Translation. AI Magazine 18(4).

This article surveys some of the recent literature in corpus-based approaches to machine translation.

Knight, K. and J. Graehl. 1997.
Machine Transliteration. Proceedings of the ACL-97. Madrid, Spain.

It is challenging to translate names and technical terms across languages with different alphabets and sound inventories. These items are commonly transliterated, i.e., replaced with approximate phonetic equivalents. For example, "computer" in English comes out as "konpyuutaa" in Japanese. Translating such items from Japanese back to English is even more challenging, and of practical interest, since transliterated items make up the bulk of text phrases not found in bilingual dictionaries. We describe and evaluate a method for performing backwards transliterations by machine. This method uses a generative model, incorporating several distinct stages in the transliteration process.
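
A minimal sketch of scoring one derivation through such a staged generative model, with tiny hand-filled probability tables standing in for the learned models; back-transliteration then amounts to searching for the English word that maximizes this joint score.

    import math

    P_WORD = {"computer": 1e-4}                                      # English word model
    P_EPRON = {("computer", "K AH M P Y UW T ER"): 0.9}              # word -> English sounds
    P_JPRON = {("K AH M P Y UW T ER", "k o n p y u u t a a"): 0.05}  # English -> Japanese sounds
    P_KANA = {("k o n p y u u t a a", "konpyuutaa"): 0.8}            # Japanese sounds -> written form

    def derivation_logprob(word, epron, jpron, kana):
        """Joint log-probability of one derivation through the four stages."""
        chain = [P_WORD.get(word, 0), P_EPRON.get((word, epron), 0),
                 P_JPRON.get((epron, jpron), 0), P_KANA.get((jpron, kana), 0)]
        return sum(math.log(p) for p in chain) if all(chain) else float("-inf")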

Yamada, K. 1996.
A Controlled Skip Parser. Proceedings of the 2nd AMTA Conference. Montreal, Quebec.

Real-world natural language sentences are long and complex, and they frequently contain unexpected grammatical constructions, noise, and ungrammaticality. This paper describes the Controlled Skip Parser, a program that parses such real-world sentences by skipping some of the words in the sentence. The new feature of this parser is that it can control its own behavior, deciding which words to skip without using domain-specific knowledge. Statistical information (N-grams), a generalized approximation of the grammar learned from past successful parses, guides the skipping. We report experiments on real newspaper articles and describe our experience with this parser in a machine translation system.
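
A small sketch of the control loop, with the grammar and the statistics abstracted behind stand-in functions (try_parse and ngram_logprob are hypothetical): if a parse fails, the n-gram score decides which single-word-skipped variant to try next.

    def controlled_skip_parse(words, try_parse, ngram_logprob, max_skips=3):
        """Greedy control: prefer the most n-gram-plausible variant; skip one word per round."""
        candidates = [words]
        for _ in range(max_skips + 1):
            best = max(candidates, key=ngram_logprob)   # statistics choose what to try
            tree = try_parse(best)
            if tree is not None:
                return tree
            candidates = [best[:i] + best[i + 1:] for i in range(len(best))]
        return None                                     # give up after max_skips skips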

Knight, K. 1996.
Learning Word Meanings by Instruction. Proceedings of the American Association for Artificial Intelligence (AAAI-96). Portland, OR.

We develop techniques for learning the meanings of unknown words in context. Working within a compositional semantics framework, we write down equations in which a sentence's meaning is some combination function of the meaning of its words. When one of the words is unknown, we ask for a paraphrase of the sentence. We then compute the meaning of the unknown word by inverting parts of the semantic combination function. This technique can be used to learn word-concept mappings, decomposed meanings, and mappings between syntactic and semantic roles. It works for all parts of speech.
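
To make the inversion idea concrete, here is a deliberately simple case in which the combination function is set union, so inverting it is set difference. Real compositional semantics composes structured meanings rather than flat sets, but the shape of the computation is the same; the lexicon and concept names are invented.

    LEXICON = {"the": set(), "dog": {"CANINE"}, "slept": {"SLEEP", "PAST"}}

    def unknown_meaning(sentence, paraphrase_meaning):
        """Invert union composition: the unknown word contributes what the known words don't."""
        known = set().union(*(LEXICON.get(w, set()) for w in sentence))
        return paraphrase_meaning - known

    # "The kiwi slept", paraphrased as "the bird slept": "kiwi" must mean BIRD.
    print(unknown_meaning(["the", "kiwi", "slept"], {"BIRD", "SLEEP", "PAST"}))  # {'BIRD'}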

Knight, K., I. Chander, M. Haines, V. Hatzivassiloglou, E.H. Hovy, M. Iida, S.K. Luk, R.A. Whitney, and K. Yamada. 1995.
Filling Knowledge Gaps in a Broad-Coverage MT System. Proceedings of the 14th IJCAI Conference. Montreal, Quebec.

Knowledge-based machine translation (KBMT) techniques yield high quality in domains with detailed semantic models, limited vocabulary, and controlled input grammar. Scaling up along these dimensions means acquiring large knowledge resources. It also means behaving reasonably when definitive knowledge is not yet available. This paper describes how we can fill various KBMT knowledge gaps, often using robust statistical techniques. We describe quantitative and qualitative results from JAPANGLOSS, a broad-coverage Japanese-English MT system.

Knight, K. and V. Hatzivassiloglou. 1995.
Two-Level, Many-Paths Generation. Proceedings of the ACL-95. Cambridge, MA.

Large-scale natural language generation requires the integration of vast amounts of knowledge: lexical, grammatical, and conceptual. A robust generator must be able to operate well even when pieces of knowledge are missing. It must also be robust against incomplete or inaccurate inputs. To attack these problems, we have built a hybrid generator, in which gaps in symbolic knowledge are filled by statistical methods. We describe algorithms and show experimental results. We also discuss how the hybrid generation model can be used to simplify current generators and enhance their portability, even when perfect knowledge is in principle obtainable.

Hatzivassiloglou, V. and K. Knight. 1995.
Unification-Based Glossing. Proceedings of the 14th IJCAI Conference. Montreal, Quebec.

We present an approach to syntax-based machine translation that combines unification-style interpretation with statistical processing. This approach enables us to translate any Japanese newspaper article into English, with quality far better than a word-for-word translation. Novel ideas include the use of feature structures to encode word lattices and the use of unification to compose and manipulate lattices. Unification also allows us to specify abstract features that delay target-language synthesis until enough source-language information is assembled. Our statistical component enables us to search efficiently among competing translations and locate those with high English fluency.
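
For readers unfamiliar with the mechanism, here is minimal feature-structure unification over plain nested dictionaries. The paper's structures additionally encode word lattices and abstract features, but the core operation looks like this.

    def unify(f, g):
        """Unify two feature structures (nested dicts / atoms); None signals a clash."""
        if not isinstance(f, dict) or not isinstance(g, dict):
            return f if f == g else None       # atoms must match exactly
        out = dict(f)
        for key, gval in g.items():
            if key in out:
                sub = unify(out[key], gval)
                if sub is None:
                    return None                # a feature clash propagates up
                out[key] = sub
            else:
                out[key] = gval
        return out

    print(unify({"cat": "NP", "agr": {"num": "sg"}},
                {"agr": {"num": "sg", "pers": 3}}))
    # -> {'cat': 'NP', 'agr': {'num': 'sg', 'pers': 3}}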

Knight, K., I. Chander, M. Haines, V. Hatzivassiloglou, E.H. Hovy, M. Iida, S.K. Luk, A. Okumura, R.A. Whitney, and K. Yamada. 1994.
Integrating Knowledge Bases and Statistics in MT. Proceedings of the 1st AMTA Conference. Columbia, MD.

We summarize recent machine translation (MT) research at the Information Sciences Institute of USC, and we describe its application to the development of a Japanese-English newspaper MT system. Our work aims at scaling up grammar-based, knowledge-based MT techniques. This scale-up involves the use of statistical methods, both in acquiring effective knowledge resources and in making reasonable linguistic choices in the face of knowledge gaps.

Knight, K. and S. Luk. 1994.
Building a Large-Scale Knowledge Base for Machine Translation. Proceedings of the American Association for Artificial Intelligence (AAAI-94). Seattle, WA.

Knowledge-based machine translation (KBMT) systems have achieved excellent results in constrained domains, but have not yet scaled up to newspaper text. The reason is that knowledge resources (lexicons, grammar rules, world models) must be painstakingly handcrafted from scratch. One of the hypotheses being tested in the PANGLOSS machine translation project is whether or not these resources can be semi-automatically acquired on a very large scale. This paper focuses on the construction of a large ontology (or knowledge base, or world model) for supporting KBMT. It contains representations for some 70,000 commonly encountered objects, processes, qualities, and relations. The ontology was constructed by merging various online dictionaries, semantic networks, and bilingual resources, through semi-automatic methods. Some of these methods (e.g., conceptual matching of semantic taxonomies) are broadly applicable to problems of importing/exporting knowledge from one KB to another. Other methods (e.g., bilingual matching) allow a knowledge engineer to build up an index to a KB in a second language, such as Spanish or Japanese.
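
As a toy illustration of conceptual matching between taxonomies, here is one simple heuristic: accept a name match only when the parents of the two nodes also agree. The project's actual merging methods combine several richer cues; this data format and rule are our assumptions.

    def match_taxonomies(parents_a, parents_b):
        """parents_*: concept name -> parent name. Accept a name match only if parents agree."""
        shared = set(parents_a) & set(parents_b)
        return [node for node in sorted(shared) if parents_a[node] == parents_b[node]]

    print(match_taxonomies({"dog": "canine", "bank": "institution"},
                           {"dog": "canine", "bank": "river-edge"}))   # -> ['dog']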

Knight, K. and I. Chander. 1994.
Automated Postediting of Documents. Proceedings of the American Association for Artificial Intelligence (AAAI-94). Seattle, WA.

Large amounts of low- to medium-quality English texts are now being produced by machine translation (MT) systems, optical character readers (OCR), and non-native speakers of English. Most of this text must be postedited by hand before it sees the light of day. Improving text quality is tedious work, but its automation has not received much research attention. Anyone who has postedited a technical report or thesis written by a non-native speaker of English knows the potential of an automated postediting system. For the case of MT-generated text, we argue for the construction of postediting modules that are portable across MT systems, as an alternative to hardcoding improvements inside any one system. As an example, we have built a complete self-contained postediting module for the task of article selection (a, an, the) for English noun phrases. This is a notoriously difficult problem for Japanese-English MT. Our system contains over 200,000 rules derived automatically from online text resources. We report on learning algorithms, accuracy, and comparisons with human performance.
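
A deliberately naive stand-in for how such rules can be derived automatically from text: count which article each head noun most often takes, and use the majority choice as the rule. The actual system's 200,000+ rules condition on much richer context than the head noun alone.

    from collections import Counter, defaultdict

    def learn_rules(training_nps):
        """training_nps: (head_noun, article) pairs, with article in {'a', 'an', 'the', ''}."""
        counts = defaultdict(Counter)
        for head, article in training_nps:
            counts[head][article] += 1
        return {head: c.most_common(1)[0][0] for head, c in counts.items()}

    rules = learn_rules([("information", ""), ("information", "the"), ("information", ""),
                         ("decision", "a"), ("decision", "the"), ("decision", "a")])
    print(rules["information"], "|", rules["decision"])   # '' | 'a'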

Okumura, A. and E.H. Hovy. 1994.
Lexicon-to-Ontology Concept Association Using a Bilingual Dictionary. Proceedings of the 1st AMTA Conference. Columbia, MD.

This paper describes a semi-automatic method for associating a Japanese lexicon with a semantic concept taxonomy called an ontology, using a Japanese-English bilingual dictionary as a "bridge". The ontology supports semantic processing in a knowledge-based machine translation system by providing a set of language-neutral symbols with semantic information. To put the ontology to use, lexical items of each language of interest must be linked to appropriate ontology items. The association of ontology items with lexical items of various languages is a process fraught with difficulty: since much of this work depends on the subjective decisions of human workers, large MT dictionaries tend to suffer from dispersion and inconsistency. The problem we focus on here is how to associate concepts in the ontology with Japanese lexical entries by automatic methods, since defining links for so many concepts manually is too difficult. We have designed three algorithms to associate a Japanese lexicon with the concepts of the ontology: the equivalent-word match, the argument match, and the example match.
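
A sketch of the first algorithm, the equivalent-word match, under invented toy data: link a Japanese word to the ontology concept that covers the most of its English dictionary glosses.

    BILINGUAL = {"inu": ["dog", "hound"]}                 # Japanese -> English glosses
    CONCEPT_WORDS = {"CANINE": {"dog", "hound", "puppy"}, # concept -> linked English words
                     "TOOL": {"hammer", "saw"}}

    def equivalent_word_match(jword):
        """Pick the concept whose linked English words best cover the glosses."""
        glosses = set(BILINGUAL.get(jword, []))
        best = max(CONCEPT_WORDS, key=lambda c: len(CONCEPT_WORDS[c] & glosses))
        return best if CONCEPT_WORDS[best] & glosses else None

    print(equivalent_word_match("inu"))   # -> 'CANINE'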


Created and maintained by Katya Shuldiner