Computer Language Translation System Romances the Rosetta Stone

July 24, 2003

University of Southern California computer scientist Franz Josef Och has developed a single system that can translate between any two languages.

Dr. Och echoed one of the most famous boasts in the history of engineering after his software scored highest among 23 Arabic- and Chinese-to-English translation systems, commercial and experimental, tested recently by the U.S. Commerce Department's National Institute of Standards and Technology.

"Give me a place to stand on, and I will move the world," said the great Greek scientist Archimedes, after providing a mathematical explanation for the lever.

"Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Och, a specialist at the USC School of Engineering's Information Sciences Institute.

Och is a standout exponent of a newer method of using computers to translate one language into another. The method has become more successful in recent years as the ability of computers to handle large bodies of information has grown, and as the volume of translations available in digital form has exploded, for example on multilingual newspaper and government web sites.

Och's method uses such matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions: gigabytes of Rosetta Stones.

"Our approach uses statistical models to find the most likely translation for a given input," Och explained.

"It is quite different from the older, symbolic approaches to machine translation used in most existing commercial systems, which try to encode the grammar and the lexicon of a foreign language in a computer program that analyzes the grammatical structure of the foreign text, and then produces English based on hard rules," he continued.

"Instead of telling the computer how to translate, we let it figure it out by itself. First, we feed the system with a parallel corpus, that is, a collection of texts in the foreign language and their translations into English.

"The computer uses this information to tune the parameters of a statistical model of the translation process. During the translation of new text, the system tries to find the English sentence that is the most likely translation of the foreign input sentence, based on these statistical models."
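As a rough illustration of what "tuning the parameters of a statistical model" from a parallel corpus can look like, here is a minimal sketch of word-level expectation-maximization in the style of the early IBM word-translation models. It is not Och's system, which this article does not describe at the code level. The three-sentence German-English corpus and the names `train_model1` and `best_translation` are assumptions made purely for illustration.

```python
from collections import defaultdict


def train_model1(corpus, iterations=20):
    """Estimate word translation probabilities t[(english, foreign)] with EM.

    corpus is a list of (foreign_tokens, english_tokens) sentence pairs.
    Illustrative sketch only: no NULL word, no smoothing, no phrases.
    """
    # Start with equal weight for every word pair that co-occurs in a sentence pair.
    t = {}
    for foreign, english in corpus:
        for f in foreign:
            for e in english:
                t[(e, f)] = 1.0

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e being generated by f
        total = defaultdict(float)  # expected count of f generating anything
        # E-step: share each English word among the foreign words in its sentence pair.
        for foreign, english in corpus:
            for e in english:
                norm = sum(t[(e, f)] for f in foreign)
                for f in foreign:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalize the expected counts into probabilities.
        for (e, f) in t:
            t[(e, f)] = count[(e, f)] / total[f]
    return t


def best_translation(t, foreign_word):
    """Return the English word with the highest probability for foreign_word."""
    candidates = {e: p for (e, f), p in t.items() if f == foreign_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy parallel corpus (foreign sentence, English translation).
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_model1(corpus)
for word in ["das", "Haus", "Buch", "ein"]:
    print(word, "->", best_translation(t, word))
```

No dictionary or grammar rule is supplied here; the pairings emerge only from which words keep appearing together across sentence pairs. A real system would add far more data, a model of fluent English, and a search procedure over whole sentences.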

This method ignores, or rather rolls over, explicit grammatical rules and dictionary lists of vocabulary in favor of letting the computer itself find matchup patterns between given Chinese or Arabic (or any other language) texts and their English translations.

As computers have improved, these abilities have grown, because the systems can now use groups of words (phrases), rather than individual words, as the basic unit of translation.

Different human translators' versions of the same text will often vary considerably. Another key improvement has therefore been to score the computer's rendering against multiple human English translations, so that it is checked against a wider range of acceptable wordings.

Not coincidentally, this also allows researchers to measure improvement in translation quantitatively, on a sensitive and useful scale.
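The article does not name the scoring system used, so the following is only a sketch of one common idea behind such scores: count how many of the machine output's word sequences (n-grams) also occur in at least one human reference translation, capping each n-gram's credit at the most times any single reference uses it. The function names and example sentences are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-word sequences in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that are supported by some reference.

    Each n-gram is credited at most as many times as it appears in the
    single reference that uses it most often (the "clipping" step).
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], c)
    matched = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
    return matched / sum(cand_counts.values())


# Hypothetical machine output and two human reference translations.
candidate = "the cat is on the mat".split()
references = [
    "the cat sits on the mat".split(),
    "there is a cat on the mat".split(),
]
print(clipped_precision(candidate, references[:1], n=1))  # one reference only
print(clipped_precision(candidate, references, n=1))      # both references
print(clipped_precision(candidate, references, n=2))      # word pairs
```

With only the first reference, the word "is" in the candidate goes unrewarded; adding the second reference credits it. This is the sense in which multiple human translations let a system be checked more freely. Scores used in practice typically combine several n-gram lengths and penalize overly short output, which this sketch omits.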

The original work along these lines dates back to the late 1980s and early 1990s and was done by Peter F. Brown and his colleagues at IBM's Watson Research Center.

Much of the improvement and expansion of the method was done in Germany, at the Aachen University of Technology (Rheinisch-Westfälische Technische Hochschule Aachen), where Och did post-doctoral work.

"One of the great advantages of the statistical approach," Och explained, "is that most of the work goes into components that are language-independent. As long as you give me enough parallel data to train the system on, you can have a new system in a matter of days, if not hours."

Och's ability to work quickly was tested in June 2003, when researchers all over the country (and in England) raced to create machine translation tools for Hindi texts in a "Surprise Language" exercise sponsored by the Defense Advanced Research Projects Agency.

Creation of the parallel texts needed by Och's system was complicated by the fact that Hindi is written in a non-Latin script, which has numerous different digital encodings instead of one or two standard ones.

Before his system could begin to translate, enormous effort was required to reconcile this diversity and give Och and other translators the volumes of Hindi and matched English text necessary.

Once this was done, however, Och was quickly able to set up and then speedily improve his translations. The quality of his Hindi system is now being evaluated against those created by other scientists at the same time.

Och's translations proved best in the 2003 NIST head-to-head benchmarking tests against 7 Arabic systems (5 research and 2 commercial off-the-shelf products) and 14 Chinese systems (9 research and 5 off-the-shelf). In the 2002 evaluations they had proved similarly superior.

The researcher discussed his methods at a NIST post-mortem workshop on the benchmarking held July 22-23 at Johns Hopkins University in Baltimore, Maryland.