Statistical MT systems learn translational information from parallel bilingual texts and
achieve reasonable performance for any language pair when sufficient training
data is available. Unfortunately, parallel data is a scarce resource. Publicly
available parallel corpora cover few language pairs, are relatively small, and
lack domain variety. This is a serious bottleneck for the development of statistical
translation systems.
My work addresses this lack of parallel data by exploiting comparable texts, which are not
strictly parallel, but related. Good examples are the numerous multilingual news
feeds produced by news agencies such as Agence France Presse, Xinhua News, and BBC. In my Ph.D. thesis, I develop
new algorithms for finding, within such comparable corpora, translationally
equivalent segments at various levels of granularity: documents, sentences, and
sub-sentential phrases. My research produced the first algorithm capable of
distinguishing parallel from non-parallel sentence pairs independently of any
surrounding context. This makes it possible to find parallel sentences even
within document pairs which are non-parallel. The approach can be extended to
document level, allowing me to identify document pairs which are literal
translations of each other, with higher accuracy than previously existing
approaches. To mine very noisy corpora, I also developed the first
algorithm for finding parallel phrases within (possibly) non-parallel
sentences. For example, given two documents that contain the English and
Romanian sentences below, this algorithm is capable of selecting only the
fragments that are mutual translations of each other.
English: Who withdrew money from the company shortly before the announcement?
Romanian: Iata lista persoanelor care si-au retras banii de la companie.
English gloss of the Romanian: Here is the list of people who withdrew money from the company.
All these algorithms work independently of context (and are therefore robust to
the noise in the corpus), use little initial bilingual information (a
dictionary, or a small parallel corpus), and are efficient enough to scale to
very large comparable corpora. Most importantly, they are, to date, the only
ones able to automatically acquire data that improves the end-to-end
performance of state-of-the-art statistical MT systems, and they have
demonstrated their potential to make an impact on both resource-scarce and
resource-rich language pairs.
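As a rough illustration of what context-independent filtering with only a dictionary can look like, the sketch below scores a single sentence pair by bilingual word overlap. It is not the classifier developed in the thesis; the function names, the scoring rule, and the toy Romanian-English dictionary entries are all assumptions made for this example.

```python
from typing import Dict, List, Set

# Toy Romanian-to-English dictionary. The entries are illustrative
# assumptions, not a resource used in the thesis.
TOY_DICT: Dict[str, Set[str]] = {
    "lista": {"list"},
    "persoanelor": {"people"},
    "retras": {"withdrew"},
    "banii": {"money"},
    "companie": {"company"},
}


def tokenize(sentence: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = [tok.strip(".,?!;:\"'()") for tok in sentence.lower().split()]
    return [tok for tok in tokens if tok]


def overlap_score(src: str, tgt: str, dictionary: Dict[str, Set[str]]) -> float:
    """Score one sentence pair using only the pair itself and a dictionary.

    Returns the smaller of two coverage ratios: the share of source words
    with a translation present in the target, and the share of target
    words that translate some source word.
    """
    src_tokens = tokenize(src)
    tgt_tokens = tokenize(tgt)
    if not src_tokens or not tgt_tokens:
        return 0.0

    tgt_vocab = set(tgt_tokens)
    # All target-language words reachable from the source sentence.
    translations: Set[str] = set()
    for word in src_tokens:
        translations |= dictionary.get(word, set())

    covered_src = sum(1 for w in src_tokens if dictionary.get(w, set()) & tgt_vocab)
    covered_tgt = sum(1 for w in tgt_tokens if w in translations)
    return min(covered_src / len(src_tokens), covered_tgt / len(tgt_tokens))


romanian = "Iata lista persoanelor care si-au retras banii de la companie."
gloss = "Here is the list of people who withdrew money from the company."
question = "Who withdrew money from the company shortly before the announcement?"

score_gloss = overlap_score(romanian, gloss, TOY_DICT)
score_question = overlap_score(romanian, question, TOY_DICT)
```

With this toy dictionary, the Romanian sentence scores higher against its gloss than against the only partially parallel English question. A real system would need a much larger lexicon and a threshold or trained classifier on top of such features.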
My work opens several interesting research directions. The ability to distinguish
between useful data and noise, at various levels (document, sentence, or phrase),
is potentially useful to machine learning researchers who deal with noisy
input data. The growing body of work that aims to gather and use data from
volunteers (e.g. www.openmind.org) has created a strong need for such an ability. My
experience with comparable corpora is also relevant to domain adaptation
research. I have successfully used seed information from one domain to
improve performance on a different one; further work on aligning data across
domains might help characterize the distance between them, or identify
general versus domain-specific properties. Another interesting application of
comparable data alignment methods is the discovery of paraphrases in
monolingual texts. I have done work on obtaining paraphrases from bilingual
parallel corpora, and using them to improve summarization evaluation. By
adapting my algorithms to monolingual corpora, I can potentially exploit a
richer resource of paraphrases, and make a significant impact on other NLP
applications, such as summarization, question answering, or automatic evaluation
of machine translation.
For future research, besides improving my algorithms through the application of more
knowledge sources (such as syntactic structure and named-entity information), I
also intend to work on improving the state of the art in statistical machine
translation. I am particularly interested in improving the domain adaptability
of translation engines, either by manipulating their training data, or by
improving the underlying statistical models to make them aware of domain
distinctions.