Out of Africa, into Grammatical English

January 21, 2011

Most of the world's languages lack the abundant electronic text resources that computer scientists have used to create the now widely used machine translation systems that turn (for example) English into Chinese. Now Kevin Knight of ISI, who helped pioneer these earlier systems, is working as part of a multi-university team in a five-year effort to find less statistical, more semantic points of attack on 'low-density' languages, starting with some spoken in Africa.

The strategy aims not only at creating a paradigm for quickly developing translation systems for languages like Kinyarwanda and Malagasy, but also at improving the state of the art of machine translation for the high-density targets.

It may even lead toward partial realization of a long-held dream: finding a consistent path through the wild variations in natural human languages to a common core of meaning – a vision referred to in the project proposal title, "The Linguistic-Core Approach to Structured Translation and Analysis of Low-Resource Languages."

Statistical translation, according to the research summary for the MURI (Multidisciplinary University Research Initiative) project recently funded by the U.S. Army Research Office, is based on gluing together phrases found in computer searches of huge volumes of parallel texts into sentences that pass statistical tests.

But "even systems trained on large parallel document collections mistranslate simple sentences. This is not surprising: current MT systems have limited knowledge of linguistic structure and thus cannot effectively capture translation patterns.

"New advances will require deeper, more linguistically-realistic models of translation, integrating what we know about how syntax, word formation, and semantics operate across a wide range of natural language," the research summary notes.

The goal is to dramatically expand existing capabilities to process low-resource and typologically diverse languages by attempting to (in the words of the proposal):

1. automatically amplify hand-crafted syntactic and semantic knowledge to obtain comprehensive coverage for a range of tasks;
2. uncover language-specific and language-neutral semantic representations through comparative language analysis, exploiting the diversity of expression exhibited by world languages; and
3. effectively apply these structures to a variety of linguistic tasks including information extraction and summarization as well as translation.

This new method grows out of beyond-statistical, syntax-based translation refinements that Knight, an ISI fellow and project leader who is also a research associate professor in the Viterbi School Department of Computer Science, and colleagues have developed in the past six years, working mostly on Chinese-English texts. Those methods now outperform the simpler statistical versions.

The effort will attempt to incorporate into the systems the kind of information that human learners of a new language must learn – the wild variations in grammatical structure that characterize symbolic speech. "Bantu languages have around 18 noun classes," the introduction notes. "This is akin to having 18 genders, but instead distinguishing animals, man-made artifacts, etc."

Enormous as the diversity is, comparative linguistics aided by computer analysis tools has made progress finding common threads and parallels in the profusion of grammars. The new project will try to use this progress to help improve machine translation. For example, languages come in families, which display obvious similarities in vocabulary (English 'home', German 'heim', Russian 'dom', Latin 'domus', Sanskrit 'dama') and some grammatical features. The project will try to build information from these relationships into its modeling.

In addition to advancing machine translation technology, a major aim is the creation of machine translation prototypes for African languages like Kinyarwanda and Malagasy, which have been studied far less than many other languages. The project will continue for five years.

In addition to the Viterbi School's ISI (David Chiang of ISI will be working with Knight), other participants include Carnegie Mellon University's Language Technologies Institute (whose director, Jaime Carbonell, is the lead researcher on the project), the University of Texas at Austin's Linguistics Department, and the Massachusetts Institute of Technology EECS Department.