ReWrite
A Probabilistic Approach to Rewriting for Machine Translation and Abstracting
Description
In this project, we aim to develop new corpus-based methods for
improving the accuracy of machine translation (MT) and for rapidly
developing systems for new language pairs. We also aim to extend the
same corpus-based techniques and software to the problems of natural
language generation and abstracting.

Commercial MT accuracy has not shown significant improvement over
the years. By contrast, commercial speech recognition (SR) accuracy
has advanced substantially. Speech scientists attribute this advance
to a statistical corpus-based approach, faster computers, more data,
shared toolkits, better acoustic/linguistic models, and common problem
sets. The MT field is only beginning to develop comparable resources and infrastructure. We stand
now where speech recognition stood in 1974 -- there are some intriguing
initial results, and we see a clear but challenging path laid out before
us.

By automatically analyzing large collections of human translations
(bilingual text corpora), existing statistical MT techniques can slightly
outperform the best commercial systems, at least for resource-rich languages
like French and English. However, the statistical approach has not been
adopted widely, because of several outstanding problems.
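
The statistical techniques referred to here follow the standard noisy-channel decomposition; we state it only as background, not as a description of any particular system. Writing f for a foreign sentence and e for a candidate English translation,

    \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\,P(f \mid e)

where P(e) is a language model of English and P(f | e) is a translation model estimated from bilingual text. The decomposition matters for what follows: the language model P(e) can be trained from monolingual English text alone.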

We aim to address these challenges by building full-scale MT systems
and by developing and evaluating new statistical MT techniques. Our technical
approach will be to broaden the scope of training materials to include
monolingual foreign-language text, morphological analyzers, dictionaries,
and treebanks. Monolingual text often comes in vast quantities, even for
languages (like Arabic) not usually associated with computers and computerization.
By bootstrapping off a relatively modest bilingual corpus, we can reach
into monolingual text to get new word and phrase translations, and to reduce
the domain-specificity of the linguistic knowledge we obtain. We also aim at
techniques that dispense with bilingual text altogether, by applying a type
of cryptanalysis to foreign-language text. We expect that broadening the
scope of training materials will (1) improve accuracy in resource-rich
environments, and (2) extend the rapid ramp-up benefits of statistical
MT to a wide range of languages.
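
As a concrete, deliberately simplified illustration of the cryptanalysis idea, the sketch below uses EM (in the Baum-Welch style) to recover the key of a letter-substitution cipher from nothing but enciphered text and a character bigram model estimated from monolingual English. The function name and setup are ours, chosen for illustration; the project's actual models operate on words and phrases and use much richer statistics.

    import numpy as np

    def em_decipher(cipher, plain_corpus, iterations=50):
        # Learn P(cipher letter | plaintext letter) for a substitution cipher.
        # The bigram transition model over plaintext letters comes from
        # monolingual text and is held fixed; only emissions are re-estimated.
        plain_syms = sorted(set(plain_corpus))
        ciph_syms = sorted(set(cipher))
        P, C = len(plain_syms), len(ciph_syms)
        p_idx = {s: i for i, s in enumerate(plain_syms)}
        c_idx = {s: i for i, s in enumerate(ciph_syms)}

        # Character bigram model from the monolingual corpus (add-one smoothed).
        trans = np.ones((P, P))
        for a, b in zip(plain_corpus, plain_corpus[1:]):
            trans[p_idx[a], p_idx[b]] += 1
        trans /= trans.sum(axis=1, keepdims=True)
        start = np.full(P, 1.0 / P)

        emit = np.full((P, C), 1.0 / C)        # uniform initial channel model
        obs = np.array([c_idx[s] for s in cipher])
        T = len(obs)

        for _ in range(iterations):
            # Forward pass with per-step scaling to avoid numeric underflow.
            alpha = np.zeros((T, P)); scale = np.zeros(T)
            alpha[0] = start * emit[:, obs[0]]
            scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
                scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
            # Backward pass using the same scaling factors.
            beta = np.ones((T, P))
            for t in range(T - 2, -1, -1):
                beta[t] = (trans @ (emit[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
            # Posterior over plaintext letters at each position, then M-step.
            gamma = alpha * beta
            gamma /= gamma.sum(axis=1, keepdims=True)
            new_emit = np.zeros((P, C))
            for t in range(T):
                new_emit[:, obs[t]] += gamma[t]
            emit = new_emit / new_emit.sum(axis=1, keepdims=True)
        return emit, plain_syms, ciph_syms

Reading off the highest-probability plaintext letter for each cipher letter gives a candidate key; scaling this kind of decipherment from letters to words and phrases is the open research question behind the bilingual-text-free techniques mentioned above.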
We will build syntactic models of English by leveraging off ongoing
USC/ISI research on natural language generation (NLG). This will allow
us to probabilistically produce and rank English tree structures instead
of flat word sequences. We also plan research into statistical, trainable
models of translation that are sensitive to these tree structures. We aim
further at an unsupervised approach to extracting tree structures from
bilingual text corpora automatically. Finally, we will investigate new
ideas for estimating context-dependent word-translation probabilities
("word senses") and morphological models.
Statistical models of English have been used previously in MT, NLG,
and SR, but they are also directly applicable to the problem of
synthesizing short abstracts from long documents. While many
summarization systems can extract a set of important sentences from a
document, they do not typically return coherent text. In our novel
formulation of the abstracting problem, we view the conversion of a
long document into a short one as a statistical MT problem. Just as in
MT, a statistical model of English takes responsibility for ensuring
coherence of the final text output (abstract). We will expand our
tree-based English model to include discourse relations between text
segments; no one has yet built a formal, probabilistic model at this
level, and we expect it to have many applications. We will also
produce a translation model responsible for estimating how much
important material has been deleted in the construction of a
particular abstract. The overall goal of this work is to produce a
broad-coverage, accurate, and fluent abstracting system, by training
off a large collection of human abstract/extract or abstract/document
pairs. Finally, because probabilistic systems are relatively easy to
integrate, we propose to evaluate models that combine abstracting, MT,
and information retrieval.
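
Written in the same noisy-channel form used for translation, the decomposition described in this paragraph can be summarized, writing d for the full document and a for its abstract, as

    \hat{a} = \arg\max_{a} P(a)\,P(d \mid a)

where the English model P(a), extended with tree structure and discourse relations, is responsible for the coherence of the abstract, and the "translation" model P(d | a) accounts for how much important material was deleted in reducing the document to that abstract. The precise form of each component is left open here; determining it is part of the proposed research.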