A Probabilistic Approach to Rewriting for Machine Translation and Abstracting


In this project, we aim to develop new corpus-based methods for improving the accuracy of machine translation (MT) and for rapidly developing systems for new language pairs. We also aim to extend the same corpus-based techniques and software to the problems of natural language generation and abstracting.

Commercial MT accuracy has not improved significantly over the years. By contrast, commercial speech recognition (SR) accuracy has advanced substantially. Speech scientists attribute this advance to a statistical corpus-based approach, faster computers, more data, shared toolkits, better acoustic/linguistic models, and common problem sets. The MT field is only starting to develop these things. We stand now where speech recognition stood in 1974: there are some intriguing initial results, and we see a clear but challenging path laid out before us. By automatically analyzing large collections of human translations (bilingual text corpora), existing statistical MT techniques can slightly outperform the best commercial systems, at least for resource-rich language pairs such as French-English. However, the statistical approach has not been widely adopted, because of problems such as the following (the formulation behind these techniques is sketched after the list):

  • Translation accuracy leaves much to be desired.
  • Training requires vast amounts of bilingual text.
  • Bilingual training corpora are often domain-specific.
  • There is as yet no satisfactory role for syntactic processing.
  • The algorithms require both mathematical sophistication and substantial computation.
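
For concreteness, the statistical MT techniques referred to above are conventionally organized around a source-channel ("noisy-channel") decomposition, in the style of the IBM models (Brown et al., 1993). A minimal statement of the decoding problem, using notation that is ours rather than the original text's, is:

```latex
% Source-channel decoding for statistical MT.
% f = observed foreign sentence; e ranges over English sentences.
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\,P(f \mid e)
% P(e): an English language model, responsible for fluency;
% P(f|e): a translation model, responsible for adequacy, and the
% component that must be estimated from bilingual text.
```

The list above maps onto this decomposition: the data problems concern estimating P(f|e) from bilingual corpora, while the remaining items concern the models and algorithms themselves.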

We aim to address these challenges by building full-scale MT systems and by developing and evaluating new statistical MT techniques. Our technical approach will be to broaden the scope of training materials to include monolingual foreign-language text, morphological analyzers, dictionaries, and treebanks. Monolingual text often comes in vast quantities, even for languages (like Arabic) not usually associated with computers and computerization. By bootstrapping off a relatively modest bilingual corpus, we can reach into monolingual text to get new word and phrase translations, and to reduce the domain-specificity of the linguistic knowledge we obtain. We also aim at techniques that dispense with bilingual text altogether, by applying a type of cryptanalysis to foreign-language text. We expect that broadening the scope of training materials will (1) improve accuracy in resource-rich environments, and (2) extend the rapid ramp-up benefits of statistical MT to a wide range of languages.
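
To illustrate how word translations can be induced from even a modest bilingual corpus, here is a minimal sketch of IBM Model 1 expectation-maximization training; the toy corpus, variable names, and iteration count are illustrative assumptions, not artifacts of this project:

```python
from collections import defaultdict

# Toy parallel corpus: (foreign sentence, English sentence) pairs.
corpus = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("la maison bleue".split(), "the blue house".split()),
]

# Uniform initialization of t(f|e), the word-translation probabilities.
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            # E-step: spread each foreign word's probability mass
            # over the English words that might have generated it.
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                frac = t[(f, e)] / norm
                count[(f, e)] += frac
                total[e] += frac
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

# t('maison'|'house') rises toward 1.0 across iterations.
print(round(t[("maison", "house")], 2))
```

The same expected-count machinery is what a larger system would bootstrap from, extending a seed bilingual corpus with translations mined from monolingual text.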

We will build syntactic models of English by leveraging off ongoing USC/ISI research on natural language generation (NLG). This will allow us to probabilistically produce and rank English tree structures instead of flat word sequences. We also plan research into statistical, trainable models of translation that are sensitive to these tree structures. We aim further at an unsupervised approach that extracts tree structures from bilingual text corpora automatically. Finally, we will investigate new ideas for estimating context-dependent word-translation probabilities ("word senses") and morphological models.
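
To make "probabilistically produce and rank English tree structures" concrete, here is a minimal sketch of ranking candidate trees under a toy probabilistic context-free grammar; the grammar, trees, and probabilities are invented for illustration and are not the project's models:

```python
import math

# Toy PCFG: (lhs, rhs) -> probability. Probabilities for each
# left-hand side sum to 1.
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("the", "N")): 0.6,
    ("NP", ("N",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("N", ("dog",)): 0.5,
    ("N", ("ball",)): 0.5,
    ("V", ("saw",)): 1.0,
}

def log_prob(tree):
    """Log-probability of a derivation: sum over the rules it uses.

    A tree is (label, child, ...) with bare strings as leaves.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    lp = math.log(pcfg[(label, rhs)])
    for c in children:
        if not isinstance(c, str):
            lp += log_prob(c)
    return lp

t1 = ("S", ("NP", "the", ("N", "dog")),
           ("VP", ("V", "saw"), ("NP", ("N", "ball"))))
t2 = ("S", ("NP", ("N", "dog")), ("VP", ("V", "saw")))

# Rank candidate trees by probability, best first.
for t in sorted([t1, t2], key=log_prob, reverse=True):
    print(round(log_prob(t), 3), t)
```

A tree-sensitive translation model would then score pairings of such trees with foreign strings, rather than pairings of flat word sequences.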

Statistical models of English have been used previously in MT, NLG, and SR, but they are also directly applicable to the problem of synthesizing short abstracts from long documents. While many summarization systems can extract a set of important sentences from a document, they do not typically return coherent text. In our novel formulation of the abstracting problem, we view the conversion of a long document into a short one as a statistical MT problem. Just as in MT, a statistical model of English takes responsibility for ensuring coherence of the final text output (abstract). We will expand our tree-based English model to include discourse relations between text segments; no one has yet built a formal, probabilistic model at this level, and we expect it to have many applications. We will also produce a translation model responsible for estimating how much important material has been deleted in the construction of a particular abstract. The overall goal of this work is to produce a broad-coverage, accurate, and fluent abstracting system, by training off a large collection of human abstract/extract or abstract/document pairs. Finally, because probabilistic systems are relatively easy to integrate, we propose to evaluate models that combine abstracting, MT, and information retrieval.
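
Under the formulation just described, and reusing the notation introduced after the list above (d and s are our labels, not the original text's), the decoding problem for abstracting can be sketched as:

```latex
% Abstracting as source-channel decoding, mirroring the MT equation.
% d = observed long document; s ranges over candidate short abstracts.
\hat{s} = \arg\max_{s} P(s \mid d) = \arg\max_{s} P(s)\,P(d \mid s)
% P(s): the (tree- and discourse-based) English model, which keeps
% the abstract coherent; P(d|s): the "translation" model, which
% estimates how plausibly the long document expands the candidate
% abstract, penalizing abstracts that drop important material.
```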