Research

10/30/06


 

Statistical machine translation (MT) systems learn translational information from parallel bilingual texts and achieve reasonable performance for any language pair when sufficient training data is available. Unfortunately, parallel data is a scarce resource. Publicly available parallel corpora cover few language pairs, are relatively small, and lack domain variety. This is a serious bottleneck for the development of statistical translation systems.

My work addresses this lack of parallel data by exploiting comparable texts, which are not strictly parallel but are related. Good examples are the numerous multilingual news feeds produced by news agencies such as Agence France Presse, Xinhua News, and the BBC. In my Ph.D. thesis, I develop new algorithms for finding, within such comparable corpora, translationally equivalent segments at various levels of granularity: documents, sentences, and sub-sentential phrases. My research produced the first algorithm capable of distinguishing parallel from non-parallel sentence pairs independently of any surrounding context. This makes it possible to find parallel sentences even within document pairs that are non-parallel. The approach can be extended to the document level, allowing me to identify document pairs that are literal translations of each other with higher accuracy than previous approaches. To make it possible to mine very noisy corpora, I was also the first to develop an algorithm for finding parallel phrases within (possibly) non-parallel sentences. For example, given two documents that contain the English and Romanian sentences in the example below, this algorithm is capable of selecting only the fragments that are mutual translations.

English: Who withdrew money from the company shortly before the announcement?
Romanian: Iata lista persoanelor care si-au retras banii de la companie.
English gloss of the Romanian: Here is the list of people who withdrew money from the company.
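
To make the idea of fragment selection concrete, here is a minimal Python sketch. It is not the algorithm developed in the thesis: it is a simplified stand-in that marks each Romanian word as translated if a toy dictionary maps it to some word in the English sentence, then keeps the longest run of translated words, tolerating short gaps. The dictionary entries, the function names, and the `max_gap` parameter are all assumptions made for illustration.

```python
import re

# Toy Romanian -> English dictionary (illustrative assumption, not a real lexicon).
TOY_DICT = {
    "lista": {"list"},
    "persoanelor": {"people", "persons"},
    "care": {"who", "which"},
    "retras": {"withdrew", "withdrawn"},
    "banii": {"money"},
    "de": {"from", "of"},
    "la": {"from", "at", "to"},
    "companie": {"company"},
}


def tokenize(sentence):
    """Lowercase and split into words, keeping hyphenated forms such as 'si-au'."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())


def translated_flags(src_tokens, tgt_tokens, dictionary):
    """For each source token, True if one of its translations occurs in the target."""
    tgt = set(tgt_tokens)
    return [bool(dictionary.get(tok, set()) & tgt) for tok in src_tokens]


def longest_translated_span(flags, max_gap=1):
    """Return the half-open span [start, end) of the longest run that starts and
    ends on a translated token and contains no more than max_gap consecutive
    untranslated tokens."""
    best = (0, 0)
    start = None
    last_true = None
    for i, flag in enumerate(flags):
        if not flag:
            continue
        if start is None or i - last_true - 1 > max_gap:
            start = i  # gap too large: begin a new candidate span here
        last_true = i
        if last_true + 1 - start > best[1] - best[0]:
            best = (start, last_true + 1)
    return best


def extract_fragment(src_sentence, tgt_sentence, dictionary, max_gap=1):
    """Return the source-side fragment most likely to be translated in the target."""
    src_tokens = tokenize(src_sentence)
    flags = translated_flags(src_tokens, tokenize(tgt_sentence), dictionary)
    start, end = longest_translated_span(flags, max_gap)
    return " ".join(src_tokens[start:end])


romanian = "Iata lista persoanelor care si-au retras banii de la companie."
english = "Who withdrew money from the company shortly before the announcement?"
print(extract_fragment(romanian, english, TOY_DICT))
```

On the sentence pair above, the sketch keeps the part of the Romanian sentence that overlaps with the English question and drops the opening words, which have no counterpart in it. A realistic system would need a much larger lexicon with translation probabilities and a more careful treatment of noise.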

All these algorithms work independently of context (and are therefore robust to the noise in the corpus), use little initial bilingual information (a dictionary or a small parallel corpus), and are efficient enough to scale to very large comparable corpora. Most importantly, to date they are the only ones able to automatically acquire data that improves the end-to-end performance of state-of-the-art statistical MT systems, and they have demonstrated their potential to make an impact for both resource-scarce and resource-rich language pairs.
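
As a rough illustration of judging a sentence pair without any surrounding context, using only a dictionary, the following Python sketch scores a pair by word-translation coverage in both directions. It is a deliberately simple heuristic under stated assumptions, not the classifier from the thesis: the toy dictionary, the function names, and the 0.6 threshold are all hypothetical.

```python
import re

# Toy Romanian -> English dictionary (illustrative assumption, not a real lexicon).
TOY_DICT = {
    "lista": {"list"},
    "persoanelor": {"people", "persons"},
    "care": {"who", "which"},
    "retras": {"withdrew", "withdrawn"},
    "banii": {"money"},
    "de": {"from", "of"},
    "la": {"from", "at", "to"},
    "companie": {"company"},
}


def tokenize(sentence):
    """Lowercase and split into words, keeping hyphenated forms such as 'si-au'."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())


def invert(dictionary):
    """Build the English -> Romanian direction from the Romanian -> English one."""
    inverse = {}
    for src, targets in dictionary.items():
        for tgt in targets:
            inverse.setdefault(tgt, set()).add(src)
    return inverse


def coverage(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens with at least one translation in the target."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if dictionary.get(tok, set()) & tgt)
    return hits / len(src_tokens)


def looks_parallel(ro_sentence, en_sentence, dictionary=TOY_DICT, threshold=0.6):
    """Accept the pair only if coverage is high in BOTH directions, so that a
    long sentence that merely contains a translated fragment is rejected."""
    ro, en = tokenize(ro_sentence), tokenize(en_sentence)
    forward = coverage(ro, en, dictionary)
    backward = coverage(en, ro, invert(dictionary))
    return min(forward, backward) >= threshold


romanian = "Iata lista persoanelor care si-au retras banii de la companie."
gloss = "Here is the list of people who withdrew money from the company."
question = "Who withdrew money from the company shortly before the announcement?"

print(looks_parallel(romanian, gloss))     # pair that is a translation
print(looks_parallel(romanian, question))  # pair that only shares a fragment
```

Requiring coverage in both directions is the design choice that matters here: the Romanian sentence and the English question share a fragment, but half of the question has no counterpart in the Romanian, so the pair falls below the threshold.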

My work opens several interesting research directions. The ability to distinguish between useful data and noise at various levels (document, sentence, or phrase) is potentially useful to machine learning researchers who deal with noisy input data. The growing body of work that aims to gather and use data from volunteers (e.g., www.openmind.org) has created a strong need for such an ability. My experience with comparable corpora is also relevant to domain adaptation research. I have successfully used seed information from one domain to improve performance on a different one; further work on aligning data across domains might help characterize the ‘distance’ between them, or identify general versus domain-specific properties. Another interesting application of comparable data alignment methods is the discovery of paraphrases in monolingual texts. I have worked on obtaining paraphrases from bilingual parallel corpora and using them to improve summarization evaluation. By adapting my algorithms to monolingual corpora, I can potentially exploit a richer source of paraphrases and make a significant impact on other NLP applications, such as summarization, question answering, and automatic evaluation of machine translation.

For future research, besides improving my algorithms through the application of more knowledge sources (such as syntactic structure and named-entity information), I also intend to work on improving the state of the art in statistical machine translation. I am particularly interested in improving the domain adaptability of translation engines, either by manipulating their training data, or by improving the underlying statistical models to make them aware of domain distinctions.

     


This site was last updated 01/27/03